Re: GUIDs, LSIDs, and metadata

12 Sep 2005

      ...
> This thread has also raised the issue of mapping between multiple
> GUIDs. I think it is inevitable that we will have to deal with this,

I absolutely agree.  But at the same time, I think we should try (at least)
to minimize duplicate GUID issuance to the same object (and we should
certainly not encourage it!)
...
> especially as there already exist major databases containing taxonomic
> information. For example, consider the task of mapping between
> mammalian names in DiGIR providers, and those used in GenBank (a
> relatively straightforward problem).

I think it would be a mistake to try to map DiGIR specimen/observation
instances directly to GenBank sequences via taxon names (although I think
each GenBank sequence should be mapped directly to the specimen record from
which it was drawn, but that's independent of the taxonomy).  Putting aside
the question of whether gene sequence blocks should fall within the same
data domain as specimen/observation objects, in both cases the taxon name is
a secondary attribute.

Instead, owners of each record should map their objects (specimen or
sequence) to the same universal GUID for the taxon name (or, preferably, to
the same taxon concept GUID) to which the specimen/observation/sequence
instance has been assigned. That way, when someone queries on the name (or
concept), the relevant DiGIR and GenBank objects show up in the results
because they are mapped via a common taxon GUID.
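
To make this concrete, here is a minimal sketch in Python. It assumes each
record simply carries the GUID of the taxon it has been assigned to; the
GUID strings, record ids, and the build_index helper are all made up for
illustration and do not refer to any real DiGIR or GenBank interface.

```python
from collections import defaultdict

# Hypothetical shared taxon GUID (not a real LSID).
TAXON_GUID = "urn:lsid:example.org:taxon:0001"

# Each data owner maps its own records to the shared taxon GUID.
digir_specimens = [
    {"id": "digir:specimen:42", "taxon_guid": TAXON_GUID},
    {"id": "digir:specimen:43", "taxon_guid": "urn:lsid:example.org:taxon:0002"},
]
genbank_sequences = [
    {"id": "genbank:seq:A1", "taxon_guid": TAXON_GUID},
]


def build_index(*record_sets):
    """Index record ids by the taxon GUID they point to."""
    index = defaultdict(list)
    for records in record_sets:
        for record in records:
            index[record["taxon_guid"]].append(record["id"])
    return index


index = build_index(digir_specimens, genbank_sequences)
print(index[TAXON_GUID])  # ['digir:specimen:42', 'genbank:seq:A1']
```

A query on the one taxon GUID returns both the specimen and the sequence,
with no name matching between the two sources.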

Consider the alternative where the DiGIR provider created its own taxon
GUID, separate from the taxon GUID assigned for the GenBank sequence. We'd
still be left with the task of mapping those two separate GUIDs as
representing the same taxon object (be it a name or a concept).
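
A rough sketch of that alternative, again in Python and again with invented
GUIDs and record ids: the same_as table below is the extra artifact that
somebody would have to build and maintain before a query on one GUID could
find records filed under the other.

```python
# Hypothetical GUIDs minted independently by each provider for the same taxon.
DIGIR_TAXON = "urn:lsid:digir.example.org:taxon:77"
GENBANK_TAXON = "urn:lsid:genbank.example.org:taxon:9001"

records = {
    "digir:specimen:42": DIGIR_TAXON,
    "genbank:seq:A1": GENBANK_TAXON,
}

# Assertion that the two GUIDs denote the same taxon object.
same_as = {GENBANK_TAXON: DIGIR_TAXON}


def canonical(guid):
    """Follow same_as links until reaching a GUID with no further mapping."""
    seen = set()
    while guid in same_as and guid not in seen:
        seen.add(guid)
        guid = same_as[guid]
    return guid


def find_records(taxon_guid):
    """Return ids of all records whose taxon GUID resolves to the same taxon."""
    target = canonical(taxon_guid)
    return sorted(rid for rid, g in records.items() if canonical(g) == target)


# Without the mapping, a query on one GUID misses the other provider's record.
without_mapping = sorted(rid for rid, g in records.items() if g == GENBANK_TAXON)
with_mapping = find_records(GENBANK_TAXON)
print(without_mapping)  # ['genbank:seq:A1']
print(with_mapping)     # ['digir:specimen:42', 'genbank:seq:A1']
```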
...
> In some lucky cases where we have
> specimen information in GenBank we can tie the two together that way,

Agreed!  But that's a completely separate issue from how either instance is
mapped to a taxon GUID.

Maybe not completely separate.  I don't think gene sequences should be
considered in the same data domain as specimens.  They fit better with
morphological characters.  In the ideal world (and admittedly, this may be
out of reach in the immediate future), neither sequences nor morphological
characters should link directly to taxon objects (names or concepts), but
rather inherit taxonomic attributes from a specimen object to which they are
attached (regardless of whether the specimen was or was not vouchered in a
museum).  Like I said, I'm probably reaching too far on this one.
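
One way to picture that inheritance is the sketch below, which assumes a very
simple object model. The Specimen and Sequence classes, their field names,
and the GUID strings are hypothetical and only illustrate the idea that the
taxon link is stored once, on the specimen.

```python
from dataclasses import dataclass


@dataclass
class Specimen:
    specimen_id: str
    taxon_guid: str          # the only place the taxon link is stored
    vouchered: bool = False  # whether a museum voucher exists


@dataclass
class Sequence:
    accession: str
    specimen: Specimen

    @property
    def taxon_guid(self) -> str:
        """Taxon is inherited from the specimen, never stored on the sequence."""
        return self.specimen.taxon_guid


specimen = Specimen("specimen:42", "urn:lsid:example.org:taxon:0001")
sequence = Sequence("seq:A1", specimen)

# Re-identifying the specimen updates every attached sequence automatically.
specimen.taxon_guid = "urn:lsid:example.org:taxon:0002"
print(sequence.taxon_guid)  # urn:lsid:example.org:taxon:0002
```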
...
> but for other names/sequences we aren't this lucky. If our databases
> are distributed, and run by organisations with different goals and
> agendas (I doubt biodiversity rates highly in NCBI's list of things to
> do), we will have to deal with this.

Agreed.  And I think the cleanest way to deal with it is for the "public"
data domains (literature citations and taxon names & concepts) to be
established via a single mechanism for assigning GUIDs (shooting as best we
can for 1:1 GUID:object instance), and then all of the "private" data
domains (specimens, sequences, characters, microbial cultures, etc.) be
managed by their respective data owners, and the onus would be on them to
map their taxonomic & literature links to the common universal GUID system.

Aloha,
Rich

P.S. I promised I would shut up, and I apologize for breaking that promise.

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html
