GUIDs, LSIDs, and metadata

Sat Sep 10 11:20:52 CEST 2005

Not much is happening on the list side of things, so in the interest of
sparking discussion here are a few thoughts.

1. GUIDs by themselves are trivial. We are awash in them (book ISBNs,
GenBank accession numbers, etc.). Software developers generate them all
the time for things like Windows components, Firefox extensions, web
objects, etc. There are tools for making these, e.g. here's one:
AAF813DE-21E0-11DA-A940-000D93425524.
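
As a small illustration of how cheap GUIDs are to make, here is a
minimal Python sketch using the standard uuid module (Python is my
choice for illustration; any language with a UUID library would do). It
inspects the identifier above and mints a fresh one of the same kind.

```python
import uuid

# The example identifier quoted above, parsed as a UUID.
example = uuid.UUID("AAF813DE-21E0-11DA-A940-000D93425524")

# The version field records how a UUID was made. Version 1 means it was
# derived from a timestamp plus a node id (typically a network address).
print(example.version)

# Minting a new GUID needs no registry or naming authority.
fresh = uuid.uuid1()
print(str(fresh).upper())
```

Nothing in this says what the identifier refers to, which is the point
of the next item.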

2. The key is to link GUIDs to information, and for that information to
be in a predictable form. For example, DOIs are widely used GUIDs, but
when you resolve a DOI you have no idea what to expect. You might get a
PDF or HTML view of a manuscript, or just an abstract, or a page asking
for money to view a manuscript. The format of the response varies
widely.

3. Of course, GUIDs ARE vital. The DiGIR protocol's biggest weakness,
in my opinion, is that it fails to provide GUIDs. Whereas it does
provide information in a standard form (Darwin Core), the user has no
way of getting a GUID. I'd briefly toyed with an interim solution for a
project I'm working on. A DiGIR GUID would be

digir.fieldmuseum.org:80/digir/DiGIR.php:MammalsDwC2:158106

which is the address of the DiGIR provider, the Resource name, and the
specimen number (in this case, the specimen is FMNH 158106). This plan
was scuppered by the fact that more than one specimen can have the same
specimen code. For example, the Museum of Vertebrate Zoology has three
specimens with the code MVZ 148946, corresponding to the taxa
Chaetodipus baileyi baileyi, Calidris mauri, and Rana cascadae. A DiGIR
request for specimen MVZ 148946 returns three totally different
specimens!
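
The scheme and its failure can be sketched in a few lines of Python.
This is only an illustration: the MVZ provider address and resource
name below are placeholders I have made up, and I assume (as the DiGIR
response suggests) that all three records are served from a single
resource. Only the Field Museum string and the three taxon names come
from the discussion above.

```python
from collections import defaultdict

def digir_guid(provider, resource, specimen_number):
    """Join provider address, resource name, and specimen number."""
    return f"{provider}:{resource}:{specimen_number}"

# The Field Museum example given above.
fmnh = digir_guid("digir.fieldmuseum.org:80/digir/DiGIR.php",
                  "MammalsDwC2", "158106")

# Placeholder values: the real MVZ provider and resource are not
# stated in this post.
MVZ_PROVIDER = "mvz-provider.example:80/digir/DiGIR.php"
MVZ_RESOURCE = "MVZ"

# The three specimens that share the code MVZ 148946.
records = [
    ("148946", "Chaetodipus baileyi baileyi"),
    ("148946", "Calidris mauri"),
    ("148946", "Rana cascadae"),
]

# Group taxa by the GUID this scheme would assign to each record.
by_guid = defaultdict(list)
for number, taxon in records:
    by_guid[digir_guid(MVZ_PROVIDER, MVZ_RESOURCE, number)].append(taxon)

# Any GUID that maps to more than one record is not a usable identifier.
collisions = {g: taxa for g, taxa in by_guid.items() if len(taxa) > 1}
for guid, taxa in collisions.items():
    print(guid, "->", taxa)
```

A check like this over a provider's full record set would show how
often the specimen number alone fails to identify a record.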

4. I like LSIDs (despite the overhead of setting them up), but for me
the main attraction is their use of metadata in RDF. This opens up a
world of tools from the Semantic Web community, such as triple stores
(databases for RDF). One can harvest metadata and store this in a
"knowledge base." As this knowledge base grows we can uncover new
facts. For example, NCBI doesn't know that Gliricidia ehrenbergii and
Hybosema ehrenbergii are synonyms, whereas IPNI does. If these databases
output RDF we can extract this information. If you have IBM's
LaunchPad and Internet Explorer 6, or Firefox with my LSID extension,
then this link
(lsidres:urn:lsid:ipni.org.lsid.zoology.gla.ac.uk:Id:1108320-2)
displays RDF for one of IPNI's records for Gliricidia ehrenbergii
(readers without any of these tools can view the raw RDF at
http://ipni.org.lsid.zoology.gla.ac.uk/authority/metadata?lsid=urn:lsid:ipni.org.lsid.zoology.gla.ac.uk:Id:1108320-2
). This RDF has links
to LSIDs for nomenclatural synonyms for this name, and if you follow
those you encounter Hybosema ehrenbergii. Hence, armed with consistent
metadata one can make inferences about names.
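
To make the inference concrete, here is a toy "knowledge base" in plain
Python, with triples held in a set rather than a real triple store. It
is a sketch under stated assumptions: the predicate names ("name",
"nomenclaturalSynonym") and the LSID used for Hybosema ehrenbergii are
placeholders of mine, not IPNI's actual vocabulary or identifiers.

```python
# Each triple is (subject, predicate, object).
GLIRICIDIA = "urn:lsid:ipni.org.lsid.zoology.gla.ac.uk:Id:1108320-2"
HYBOSEMA = "urn:lsid:example.org:Id:PLACEHOLDER"  # hypothetical LSID

SYNONYM = "nomenclaturalSynonym"  # hypothetical predicate name

triples = {
    (GLIRICIDIA, "name", "Gliricidia ehrenbergii"),
    (HYBOSEMA, "name", "Hybosema ehrenbergii"),
    (GLIRICIDIA, SYNONYM, HYBOSEMA),
}

def linked(node):
    """Records joined to `node` by a synonym link in either direction."""
    out = {o for s, p, o in triples if s == node and p == SYNONYM}
    back = {s for s, p, o in triples if o == node and p == SYNONYM}
    return out | back

def synonyms_of(name):
    """Follow synonym links from every record carrying `name`."""
    seen = {s for s, p, o in triples if p == "name" and o == name}
    stack = list(seen)
    while stack:
        for neighbour in linked(stack.pop()) - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return sorted(o for s, p, o in triples
                  if p == "name" and s in seen and o != name)

print(synonyms_of("Hybosema ehrenbergii"))
```

A database that only knows the string "Hybosema ehrenbergii" can thus
be connected to records filed under Gliricidia ehrenbergii, provided
both sources expose their links in the same form.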

5. Another attraction of RDF is that it sidesteps the need for the huge,
bloated XML schemas that seem to bedevil the field at the moment. RDF
tends to be simple and flat, and there are a number of existing
vocabularies we can draw on (e.g., http://www.w3.org/2003/01/geo/).
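
As a rough sketch of how flat this can be, the Python below writes a
locality as two N-Triples statements using the lat and long properties
of the W3C Basic Geo vocabulary. The subject identifier and the
coordinates are made-up placeholders, not data for any real specimen.

```python
# Namespace of the W3C Basic Geo (WGS84) vocabulary.
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"

def geo_triples(subject_uri, lat, long):
    """Return N-Triples lines giving a latitude and longitude."""
    return [
        f'<{subject_uri}> <{GEO}lat> "{lat}" .',
        f'<{subject_uri}> <{GEO}long> "{long}" .',
    ]

# Hypothetical specimen identifier and placeholder coordinates.
lines = geo_triples("urn:lsid:example.org:specimen:1", 10.5, -84.0)
print("\n".join(lines))
```

Each statement stands alone, so adding another property means adding
another line rather than revising a schema.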

6. I must confess I regard taxonomic concepts as a potential black
hole. I understand the arguments in favour; I just don't buy that this
is a tractable problem. I also think it is largely going to be of
historical interest as more and more data become linked to specimens
and to things like DNA barcodes. The fact that reconciling even two
taxonomic classifications can be a major undertaking does not bode well
for this project. For some more general thoughts on this issue, see
http://shirky.com/writings/ontology_overrated.html (a taxonomic
classification is an ontology).

7. I think the first priority for assigning GUIDs is museum specimens.
For taxon names (if not concepts) assigning GUIDs is trivial, given that
most name databases have their own, internally unique ids (but not all --
those databases that use names as primary keys, or which don't expose
integer identifiers, will need to rethink their design).
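
To show why this is easy for a database that already has internal ids,
here is a minimal Python sketch that wraps such an id in the LSID form
urn:lsid:authority:namespace:object[:revision]. The function names are
mine, and the check is deliberately loose (it only counts the
colon-separated parts); it is not a full validator.

```python
def make_lsid(authority, namespace, object_id, revision=None):
    """Build an LSID from an authority, a namespace, and an internal id."""
    parts = ["urn", "lsid", authority, namespace, str(object_id)]
    if revision is not None:
        parts.append(str(revision))
    return ":".join(parts)

def parse_lsid(lsid):
    """Split an LSID into its named parts; revision is None if absent."""
    parts = lsid.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {lsid!r}")
    keys = ("authority", "namespace", "object", "revision")
    return dict(zip(keys, parts[2:] + [None]))

# Reproduces the IPNI LSID quoted in point 4 from its three parts.
lsid = make_lsid("ipni.org.lsid.zoology.gla.ac.uk", "Id", "1108320-2")
print(lsid)
print(parse_lsid(lsid))
```

A database keyed on names has nothing stable to put in the object
slot, which is why it would need redesigning first.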

Regards

Rod

Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone:    +44 141 330 4778
Fax:      +44 141 330 2792
email:    r.page at bio.gla.ac.uk
web:      http://taxonomy.zoology.gla.ac.uk/rod/rod.html
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html
