Gregor Hagedorn G.Hagedorn at BBA.DE
Mon Oct 4 10:56:56 CEST 2004

Gregor wrote:
> > Do you mean those GLOPP organism-
> > interaction-data that have specimen voucher information can not be
> > published/referenced in GBIF until I figure out whether a collection
> > has digitized them (most have never digitized elsewhere!)?

Richard wrote:
> Not necessarily.  I don't think the issue is whether or not the
> collection has been digitized, but rather whether GUIDs have already
> been assigned to the vouchers you want to document in the GLOPP
> dataset. So, if your question is more along the lines of "do I need to
> check to see if GUIDs have already been issued to voucher specimens
> that I cite, before I issue new GUIDs", then my answer -- in the long
> run, at least -- would be, "well....yes!" That's sort of the
> fundamental point of the GUIDs, isn't it?  But I don't see this as
> being necessarily burdensome. For example, if your GLOPP dataset
> included unambiguous pointers to specific voucher specimens (e.g., via
> InstitutionCode+CollectionCode+CatalogNumber), then it *should* be a
> relatively quick and straightforward process to find out if GUIDs have
> already been assigned (if it's not quick & easy, then the GUID service
> would be horribly inadequate!)

I think this is the problem. No single record in our database has
this information, and to my knowledge, most if not all of the
physical specimen sheets referred to do not yet have a unique
catalogue number. Truly unique catalogue numbers (as opposed to
often non-unique batch accession numbers) have often never been
physically attached to specimens; this is now done either during
digitization or, in recent years, also during loan processing.

However, most of the printed literature cites collection name (which
may be historic, if collections are merged), taxon name, plus one of:
 - simply the information that it is a type (often expressed by
exclamation mark after collection acronym, indicating that a type has
been studied)
 - for non-types 2-4 elements out of: collector, collection date,
collector's field number and location. Having collector plus
collector's field number is relatively good (although uniqueness is up
to the collector; some assign batch numbers for collection events),
but again in my experience it is relatively rarely cited. None of the
GLOPP records taken from literature cited a field number. The other
fields are normally sufficiently unique if you go into the collection
and see what is there, but they are a terrible key for matching
against a data service - at least the location is usually comparable
by automatic string matching.
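
To illustrate the kind of matching meant here, a minimal sketch in
Python, assuming records are plain dictionaries with hypothetical
"collector" and "location" keys and an arbitrary similarity threshold
of 0.8. It only ranks candidates for a human to pick from; it does not
decide identity.

```python
from difflib import SequenceMatcher


def location_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio for two free-text locality strings."""
    def norm(s: str) -> str:
        # Lower-case, treat commas as spaces, collapse repeated whitespace.
        return " ".join(s.lower().replace(",", " ").split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def candidate_matches(citation, records, threshold=0.8):
    """Return (score, record) pairs with the same collector and a similar
    locality string, best match first. The threshold is an assumption."""
    hits = []
    for rec in records:
        if rec["collector"].lower() != citation["collector"].lower():
            continue
        score = location_similarity(citation["location"], rec["location"])
        if score >= threshold:
            hits.append((score, rec))
    return sorted(hits, key=lambda h: h[0], reverse=True)
```

Collector names and dates would need their own normalization in
practice; the sketch deliberately leaves that out.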

> If, on the other hand, the GLOPP
> dataset does not provide unambiguous pointers to specific voucher
> specimens, then the "vouchered" aspect of those specimen citations
> seems unsupported, in which case your GUIDs would need to be assigned
> to virtual/unvouchered "specimens" (analogous to observation records),
> and hence non-duplicate.

As said, if you go into the collection, it is easy to identify them.
If you know that a fragment of the collection is completely digitized
- as opposed to random digitization which only digitizes specimens
recently loaned - you can manually identify it on the computer. I
would guess every specimen identification takes at most 5 minutes and
involves one or several queries and picking from the result list.

I believe most biologists will consider the specimen citations used
in print until now to be vouchers - I do not agree with the blanket
statement that this is something else. It is usually unambiguous -
but not very good for machine processing. Contradictions?

> > when the collection starts to digitize
> > them, they would have to create for those that have already been
> > published in GLOPP a new version of the GLOPP LSID?
> I would hope that if you assigned GUIDs to GLOPP-relevant voucher
> specimens that belong to a collection that is not-yet digitized, you
> would do the courtesy of providing the manager of that collection with
> a listing of the GUIDs you created for the specific relevant
> specimens. I would further hope that, when that collection is
> eventually digitized, the manager would have the wherewithal to assign
> new GUIDs only to those specimens that did not yet have them.

The problem is: what would the collection manager do with such a
list? It would be the problem in reverse: the list of GUIDs is not
easy to connect to the physical specimens. Most collections where
specimens can be handled attach unique numbers to the specimens
during digitization (but perhaps even this is not true in some insect
collections, where handling the specimen creates the danger of
destroying it). The list delivered by GLOPP would contain information
about specimens that have no such barcode/etc. number yet.

> accommodated in any GUID system that is developed.  My main point is
> that such "redundant" GUID issuance should be minimized (i.e., never
> done intentionally), and quickly/easily identified as such whenever it
> is discovered.

Certainly not intentionally, but it should be clear that a museum
should not start to prohibit the use of laptops when a Ph.D.
candidate comes in and "digitizes" some specimens for a taxonomic
revision. If the museum system supports it, it is wise to ask to use
the museum system, but if the system is too complex and requires long
training, rather have the monograph than nothing...

> So....if/when the situation does come up that (for example) GLOPP
> assigns GUIDs to vouchers on behalf of a non-digitized collection, and
> that collection later (inadvertently) re-assigns redundant GUIDs to
> the same set of specimens; that eventual discovery of this duplication
> should be accommodated by a mechanism for "retiring" one of the IDs
> into "objective synonomy" of the other ID, and automated systems
> should be implemented in the resolver service that "auto-forward" the
> retired ID to the active ID.

I think you could rather view this as an optional deduplication
layer. Your specification explicitly contradicts at least the LSID
specification's requirement that the same identifier repeatedly
retrieves exactly the same data. If an analysis of GLOPP is based on
some data - e.g. a misidentified host plant - and cites this, it
should be recoverable. Being silently forwarded to different
information only causes confusion.
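
A minimal sketch of the difference, in Python. The identifiers, the
host names and both function names are made up for illustration and do
not come from GLOPP or from any resolver. The plain lookup always
returns what was published under an identifier; the pointer to a
related record is explicit and separate, never a silent substitution.

```python
# Hypothetical published records: two GUIDs issued independently for
# what later turns out to be the same specimen.
RECORDS = {
    "urn:lsid:example.org:glopp:42": {"host": "Triticum aestivum"},
    "urn:lsid:example.org:museum:9001": {"host": "Hordeum vulgare"},
}

# Hypothetical later finding: both records describe one physical specimen.
SAME_SPECIMEN = {
    "urn:lsid:example.org:glopp:42": "urn:lsid:example.org:museum:9001",
}


def resolve(guid):
    """Always return the data originally published under this identifier."""
    return RECORDS[guid]


def resolve_with_hint(guid):
    """Return the original data plus an explicit pointer to a related
    record, if one is known. The original data is never replaced."""
    return {"data": RECORDS[guid], "see_also": SAME_SPECIMEN.get(guid)}
```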

So I prefer a view where GUIDs refer to data objects. I still do not
see how you propose to attach them to the physical objects for those
researchers working in the collection itself. A secondary service can
then know about relationships of multiple data objects referring to
the same physical object. This service may be able to find cross-
references in the data itself, have smart methods to estimate
uniqueness based even on location strings, or may have manually
created cross-reference tables.
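
One way such a secondary service could keep its bookkeeping is
sketched below in Python, under the assumption that it only stores
"these two GUIDs refer to the same physical object" assertions and
never touches the data objects themselves. The class name and methods
are hypothetical; the grouping uses a standard union-find structure.

```python
class CrossReferenceIndex:
    """Groups GUIDs of data objects that refer to the same physical object."""

    def __init__(self):
        self._parent = {}

    def _find(self, guid):
        # Register unknown GUIDs as their own group, then walk to the
        # group representative, shortening the path on the way.
        self._parent.setdefault(guid, guid)
        while self._parent[guid] != guid:
            self._parent[guid] = self._parent[self._parent[guid]]
            guid = self._parent[guid]
        return guid

    def link(self, guid_a, guid_b):
        """Record that two data objects describe the same physical object."""
        root_a, root_b = self._find(guid_a), self._find(guid_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_object(self, guid_a, guid_b):
        return self._find(guid_a) == self._find(guid_b)

    def equivalents(self, guid):
        """Return all known GUIDs in the same group, including the given one."""
        root = self._find(guid)
        return sorted(g for g in list(self._parent) if self._find(g) == root)
```

Whether a link comes from automatic matching or from a manually
maintained table makes no difference to this structure; both would
simply call link().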

An important point is that different "deduplication" scenarios exist.
For example, in culture collections, many strains are cross-preserved
in multiple collections. So "CBS 123.88" may be "equal" to "ATCC
1234132" or "BBA 77123". Ideally we may even know the history: "BBA
77123" > "CBS 123.88" > "ATCC 1234132". However, the chance that any
of these strains (which are like "versions") has been mixed up (or
mutations occurred) is always there. Thus, if I look for duplication
of the collection event data, I want to deduplicate. If I want to
check a confusing DNA sequence, I may want to know about other
derived strains, but I absolutely need to know exactly which strain
from which collection was sequenced.
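
The two scenarios can be sketched in Python using the strain numbers
from the example above. The DERIVED_FROM table and the function names
are assumptions for illustration; the point is that the derivation
history is kept as a relation between distinct strains, so collection
event data can be deduplicated via the original isolate while a
sequence stays attached to the exact strain it came from.

```python
# Derivation history from the example: BBA 77123 > CBS 123.88 > ATCC 1234132.
# Each key was derived from its value.
DERIVED_FROM = {
    "CBS 123.88": "BBA 77123",
    "ATCC 1234132": "CBS 123.88",
}


def lineage(strain):
    """Return the chain from the given strain back to the original isolate."""
    chain = [strain]
    while chain[-1] in DERIVED_FROM:
        chain.append(DERIVED_FROM[chain[-1]])
    return chain


def same_collection_event(a, b):
    """True if both strains go back to the same original isolate."""
    return lineage(a)[-1] == lineage(b)[-1]


def related_strains(strain):
    """Other known strains sharing the original isolate, for cross-checking."""
    root = lineage(strain)[-1]
    all_strains = set(DERIVED_FROM) | set(DERIVED_FROM.values())
    return sorted(s for s in all_strains
                  if s != strain and lineage(s)[-1] == root)
```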

> For the most part, though -- I see these as "growing pains" of a GUID
> system during its first years of existence.  I would predict that two
> decades from now, if one were to do an analysis of redundant GUIDs,
> one would find the bulk of those having been issued relatively early
> on.

I agree, but I probably think it is more relevant than you seem to
think. I believe the "early days" will last the next 50 years - the
time needed until collections are fully digitized *plus* the time it
takes to make publication without citing GUIDs unacceptable.

Gregor Hagedorn (G.Hagedorn at
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203

