From: seek-taxon-admin@ecoinformatics.org on behalf of dave thau
[thau@learningsite.com]
Sent: 24 May 2004 17:35
To: seek-taxon@ecoinformatics.org
Subject: [SEEK-Taxon] guids

Attachments: guid_design.ppt

Hello everybody,

I think discussions of GUIDs in Edinburgh went quite well.  A number of conversations I had before and after the presentations, as well as some of the demonstrations of what myGrid has been doing, have led me to reconsider the idea of using the handle system to create an initial prototype of the taxon GUID server and instead go with an initial LSID implementation.  The main reasons for going with LSIDs are:

1.  Explicit versioning
2.  Explicit metadata
3.  Interoperability with other systems
4.  Interoperability with GBIF

1.  Explicit versioning

In conversations with Bob Peet, and in numerous contexts throughout the SEEK meetings, the need for versioning taxonomic concepts became clear.  LSIDs have an explicit mechanism for versioning, which grew directly out of consultation with people in the life sciences.  An LSID looks like this:
urn:LSID:authority:context:localId:version

So, the first version of a taxon might be

urn:LSID:taxaserver.org:taxon:3432:1

This notation permits different versions of a taxonomic concept to have different GUIDs, while still making it easy to recognize that two GUIDs are versions of the same concept.  Because the version is an explicit part of the LSID, systems using LSIDs will know that these are two versions of the same thing.  Although versions could be added to the handles of a handle server, the notation for doing so would be specific to SEEK, rather than an explicit part of the handle system.  Therefore, the handle system version of versioning would not be useful to third party systems which use handles.
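
As a rough sketch of how a client might exploit the explicit version field: the parsing rules below follow only the layout shown above (not the full LSID specification), and the second LSID is a hypothetical later version made up for illustration.

```python
from typing import NamedTuple


class Lsid(NamedTuple):
    authority: str
    context: str
    local_id: str
    version: str  # empty string when the LSID carries no version


def parse_lsid(text: str) -> Lsid:
    """Split urn:LSID:authority:context:localId[:version] into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    version = parts[5] if len(parts) == 6 else ""
    return Lsid(parts[2], parts[3], parts[4], version)


def same_concept(a: str, b: str) -> bool:
    """True when two LSIDs differ at most in their version field."""
    return parse_lsid(a)[:3] == parse_lsid(b)[:3]


first = "urn:LSID:taxaserver.org:taxon:3432:1"
second = "urn:LSID:taxaserver.org:taxon:3432:2"  # hypothetical later version
print(same_concept(first, second))
```

A third party system needs no SEEK-specific knowledge to make this comparison, which is the point of having the version in the identifier itself.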

2.  Explicit metadata

LSID provides an explicit way to retrieve metadata about a service.  The standard for the metadata is now RDF - which is the backbone of OWL. 
Although there is no standard for what the metadata contains, there are standard calls for retrieving the data.  The handle system has no such standard, so any client using handles to retrieve data would have to know our specific way of retrieving metadata about the handle.  Such metadata might include the last time this record was updated, contact information for the person responsible for maintaining the record, or it could even include relationships to other LSIDs.
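
To make the kinds of metadata listed above concrete, here is a minimal sketch that writes them out as RDF in N-Triples form.  Everything specific in it is an assumption: the function name is made up, the Dublin Core terms are just one plausible vocabulary (with dcterms:creator as a stand-in for the maintainer contact), and the values in the usage line are hypothetical.  It does not use the LSID metadata call itself.

```python
DCTERMS = "http://purl.org/dc/terms/"  # Dublin Core terms, chosen here as an assumption


def metadata_triples(lsid, modified, contact, related):
    """Return N-Triples lines describing one LSID.

    Literal values are inserted as-is; a real implementation would
    escape quotes and backslashes.
    """
    subject = f"<{lsid}>"
    lines = [
        f'{subject} <{DCTERMS}modified> "{modified}" .',  # last update time
        f'{subject} <{DCTERMS}creator> "{contact}" .',    # maintainer contact
    ]
    for other in related:
        # relationship to another LSID
        lines.append(f"{subject} <{DCTERMS}relation> <{other}> .")
    return lines


for line in metadata_triples(
    "urn:LSID:taxaserver.org:taxon:3432:1",
    "2004-05-24",                              # hypothetical update date
    "maintainer@example.org",                  # hypothetical contact
    ["urn:LSID:taxaserver.org:taxon:3432:2"],  # hypothetical related LSID
):
    print(line)
```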

3.  Interoperability with other systems

Because LSIDs use WSDL to announce which services they provide, and SOAP, FTP, HTTP and other internet standards for providing information, LSIDs embed easily in third party systems which use the standards.  For example, the myGrid workflow software Taverna has a way to hook up with servers providing LSIDs and use the LSIDs in those servers as input to actions.
Apparently (I haven't tried this yet) the LSID server can provide services beyond the standard ones.  These services would be announced in the WSDL document.  If this is the case, the services could make the API to the taxon server available to all LSID clients.  This would mean that a system like myGrid, or Ptolemy (I suppose), could call the taxon server API through the LSID server.  This allows the following scenario.

A user sees a taxaserver LSID somewhere and resolves it, getting the data behind it and the services available for that LSID.  One of those services might be to get synonyms.  The user can then create a workflow actor which takes as input an LSID and outputs a list of synonyms that may then be inputs to another actor.  This ties the taxon server directly into workflow systems like Taverna and Kepler.  Now, it's true that a user could do this with the taxon server regardless of whether or not it is plugged into the LSID server.  However, having the LSID server interface directly with the taxon server means a user can go directly from an LSID to the services we support.  Nothing like this is available if we use the handle server.
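
The scenario can be sketched roughly as follows.  This is not the Taverna, Kepler or LSID client API: the service name getSynonyms, the function names and the synonym data are all hypothetical, and an in-memory table stands in for the taxon server so the sketch runs on its own.

```python
# Hypothetical in-memory stand-in for the taxon server's synonym service.
SYNONYMS = {
    "urn:LSID:taxaserver.org:taxon:3432:1": [
        "urn:LSID:taxaserver.org:taxon:5120:1",
        "urn:LSID:taxaserver.org:taxon:7788:2",
    ],
}


def resolve_services(lsid):
    """Return the services advertised for an LSID, keyed by name.

    In a real deployment this information would come from the WSDL
    document; here it is hard-coded.
    """
    return {"getSynonyms": lambda: SYNONYMS.get(lsid, [])}


def synonym_actor(lsid):
    """Workflow actor: one LSID in, a list of synonym LSIDs out."""
    services = resolve_services(lsid)
    return services["getSynonyms"]()


def count_actor(lsids):
    """A second actor that consumes the first actor's output."""
    return len(lsids)


print(count_actor(synonym_actor("urn:LSID:taxaserver.org:taxon:3432:1")))
```

The user starts from nothing but the LSID and discovers the synonym service from it, rather than needing to know the taxon server API in advance.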

4.  Interoperability with GBIF

Donald Hobern is recommending that GBIF uses LSIDs for a variety of data objects.  I don't think he has explicitly considered the handle system, but he has considered DOIs.  He feels that LSIDs are the best identifiers for interoperability and have broad enough support to use with comfort. 
Aligning SEEK's integration efforts with GBIF's and agreeing to a common standard would do a great deal for interoperability between SEEK and many other biodiversity informatics resources.


The downsides of LSIDs are that the domain information is in the identifier, that they're not simple to reassign, and that publishers are less likely to accept them.  Here are some arguments to counter these objections.

1.  Domain information in the identifier.  I recommend registering something neutral, like taxaserver.org, and registering information through that.
Eventually, if things take off, a body composed of major stakeholders can take it over.  If a functioning system is in place and being used, it will be easy to find a host.

2.  Not simple to reassign.  This isn't strictly true; the information can be hosted anywhere.  The only caveat is that an LSID assigned by taxaserver.org will forever have the taxaserver.org domain.

3.  Publishers are less likely to accept them.  Although many publishers have accepted DOIs, it's unclear that a handle would be preferable to an LSID.  Publishers already accept GenBank accession numbers, which are quite long and very different from DOIs.  If a publisher felt strongly about only accepting a handle, it would be quite straightforward to issue a handle which resolves to an LSID.  The only downside to this is that people might start putting the handle in their databases rather than the LSID.  If they do that, their database won't be immediately interoperable with systems using LSIDs.  However, because the handle resolves to an LSID, a database using the handles would simply have to resolve them to their attached LSIDs to become interoperable with LSID systems.
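
To illustrate the last point, here is a minimal sketch of converting a database that stored handles into one keyed by LSIDs.  The handle value, the record layout and the lookup table are hypothetical; a real conversion would ask a handle server instead of a dictionary.

```python
# Hypothetical handle-to-LSID table; stands in for a handle server lookup.
HANDLE_TO_LSID = {
    "hdl:1234.5/taxon-3432": "urn:LSID:taxaserver.org:taxon:3432:1",
}


def to_lsids(records, resolve=HANDLE_TO_LSID.get):
    """Return copies of the records with handles replaced by their LSIDs.

    Identifiers that do not resolve are left unchanged, and the input
    records are not modified.
    """
    converted = []
    for record in records:
        lsid = resolve(record["id"])
        converted.append({**record, "id": lsid if lsid else record["id"]})
    return converted


rows = [{"id": "hdl:1234.5/taxon-3432", "name": "example taxon"}]
print(to_lsids(rows))
```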

Although these three objections are quite valid, I feel that LSID's satisfaction of the SEEK goal of interoperability with networked resources, its usage of internet standards, and its native treatment of versioning argue in favor of using LSIDs at least for the first prototype.

I've attached the start of a document outlining what features the prototype might contain.  This is very rough, but I think it's something to start from.

Dave