This thread has also raised the issue of mapping between multiple GUIDs. I think it is inevitable that we will have to deal with this,
I absolutely agree. But at the same time, I think we should try (at least) to minimize issuing duplicate GUIDs for the same object (and we should certainly not encourage it!).
especially as there already exist major databases containing taxonomic information. For example, consider the task of mapping between mammalian names in DiGIR providers, and those used in GenBank (a relatively straightforward problem).
I think it would be a mistake to try to map DiGIR specimen/observation instances directly to GenBank sequences via taxon names (although I think each GenBank sequence should be mapped directly to the specimen record from which it was drawn, but that's independent of the taxonomy). Putting aside the question of whether gene sequence blocks should fall within the same data domain as specimen/observation objects, in both cases the taxon name is a secondary attribute.
Instead, owners of each record should map their objects (specimen or sequence) to the same universal GUID for the taxon name (or, preferably, to the same taxon concept GUID) to which the specimen/observation/sequence instance has been assigned. That way, when someone queries on the name (or concept), the relevant DiGIR and GenBank objects show up in the results because they are mapped via a common taxon GUID.
Consider the alternative where the DiGIR provider created its own taxon GUID, separate from the taxon GUID assigned for the GenBank sequence. We'd still be left with the task of mapping those two separate GUIDs as representing the same taxon object (be it a name or a concept).
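To make the contrast concrete, here is a minimal sketch of the two cases. Every identifier in it (taxon:A, digir-taxon:X, genbank-taxon:Y, the record IDs, and the same_as table) is a made-up placeholder for illustration, not a real GUID or an existing DiGIR or GenBank interface.

```python
from collections import defaultdict

# All identifiers below are hypothetical placeholders, not real GUIDs.
SHARED_TAXON = "taxon:A"

# Case 1: both owners map their records to the same taxon GUID.
shared_links = [
    ("digir:specimen-1", SHARED_TAXON),
    ("genbank:sequence-1", SHARED_TAXON),
]

# Case 2: each provider mints its own taxon GUID for the same taxon.
separate_links = [
    ("digir:specimen-1", "digir-taxon:X"),
    ("genbank:sequence-1", "genbank-taxon:Y"),
]
# The extra GUID-to-GUID mapping someone would then have to build and maintain.
same_as = {
    "digir-taxon:X": {"genbank-taxon:Y"},
    "genbank-taxon:Y": {"digir-taxon:X"},
}


def build_index(links):
    """Group record IDs under the taxon GUID each record points to."""
    index = defaultdict(list)
    for record_id, taxon_guid in links:
        index[taxon_guid].append(record_id)
    return index


def query(index, taxon_guid, equivalents=None):
    """Return record IDs for a taxon GUID, plus any GUIDs declared equivalent."""
    guids = {taxon_guid}
    if equivalents:
        guids |= equivalents.get(taxon_guid, set())
    hits = []
    for guid in guids:
        hits.extend(index.get(guid, []))
    return sorted(hits)


# Shared GUID: one lookup finds both records.
shared_hits = query(build_index(shared_links), SHARED_TAXON)

# Separate GUIDs: the GenBank record is missed unless same_as is supplied.
separate_index = build_index(separate_links)
unmapped_hits = query(separate_index, "digir-taxon:X")
mapped_hits = query(separate_index, "digir-taxon:X", same_as)
```

With a shared GUID the query needs nothing beyond the index; with separate GUIDs the same result depends entirely on the same_as table being complete.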
In some lucky cases where we have specimen information in GenBank we can tie the two together that way,
Agreed! But that's a completely separate issue from how either instance is mapped to a taxon GUID.
Maybe not completely separate. I don't think gene sequences should be considered in the same data domain as specimens. They fit better with morphological characters. In the ideal world (and admittedly, this may be out of reach in the immediate future), neither sequences nor morphological characters should link directly to taxon objects (names or concepts), but rather inherit taxonomic attributes from a specimen object to which they are attached (regardless of whether the specimen was or was not vouchered in a museum). Like I said, I'm probably reaching too far on this one.
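A rough sketch of what "inherit from the specimen" could look like as a data structure. The class and field names (Specimen, Sequence, taxon_guid, vouchered) and the GUID strings are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Specimen:
    specimen_id: str
    taxon_guid: str          # the only place the taxon link is stored
    vouchered: bool = False  # vouchered in a museum or not; irrelevant to the link


@dataclass
class Sequence:
    sequence_id: str
    specimen: Specimen       # a sequence attaches to a specimen, not to a taxon

    @property
    def taxon_guid(self) -> str:
        # Taxonomy is read through the specimen rather than stored here.
        return self.specimen.taxon_guid


specimen = Specimen("specimen-1", "taxon:A")
sequence = Sequence("sequence-1", specimen)
before = sequence.taxon_guid

# Re-identifying the specimen updates every attached sequence automatically.
specimen.taxon_guid = "taxon:B"
after = sequence.taxon_guid
```

The design point is that a re-identification happens once, on the specimen, and nothing attached to it can drift out of sync.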
but for other names/sequences we aren't this lucky. If our databases are distributed, and run by organisations with different goals and agendas (I doubt biodiversity rates highly in NCBI's list of things to do), we will have to deal with this.
Agreed. And I think the cleanest way to deal with it is for the "public" data domains (literature citations and taxon names & concepts) to be established via a single mechanism for assigning GUIDs (shooting as best we can for 1:1 GUID:object instance), and then all of the "private" data domains (specimens, sequences, characters, microbial cultures, etc.) be managed by their respective data owners, and the onus would be on them to map their taxonomic & literature links to the common universal GUID system.
Aloha, Rich
P.S. I promised I would shut up, and I apologize for breaking that promise.
Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html