Globally Unique Identifier

Chuck Miller Chuck.Miller at MOBOT.ORG
Thu Sep 30 09:10:43 CEST 2004


> >which is different from the problem having duplicate CatalogNumbers you
> >discuss

> >The physical specimen does exist, but in the foreseeable future all data
> >GUIDs will be attached to data, not to the specimen.

Duplicate specimens occur because the collector collected multiple samples
of the same organism and sent them to other institutions.  The duplicate
specimens themselves probably have different CatalogNumbers in each
institution.  The specimen database records reflect the actual specimens.
Therefore, when specimen database records from multiple institutions are
combined, they contain duplicates of the same organism.  But the duplication
can be recognized only by comparing either the Collector and Collector's
number or the date and location.

One key use of GBIF-merged specimen records is to count or plot the number
of organisms in an area.  When a wide net is thrown around the globe, the
duplicate records are caught and return overstated counts.  Ideally a GUID
would identify a single unique organism record and enable duplicates to be
identified, but I can see no easy way for that to occur within LSID.
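
A minimal sketch of the duplicate detection described above, assuming merged records carry collector and collector-number fields. The field names (`collector`, `collector_number`, `institution`, `catalog_number`) and the sample values are hypothetical and not taken from any GBIF schema, and the normalization is deliberately naive:

```python
from collections import defaultdict

def gathering_key(record):
    """Build a comparison key from collector name and collector's number.

    Assumption: two records with the same normalized collector and number
    are duplicates of one gathering. Real data would need fuzzier matching.
    """
    collector = " ".join(record["collector"].upper().split())
    number = str(record["collector_number"]).strip()
    return (collector, number)

def group_duplicates(records):
    """Group merged records by gathering key."""
    groups = defaultdict(list)
    for record in records:
        groups[gathering_key(record)].append(record)
    return dict(groups)

# Hypothetical merged records, modeled on the MO / K / P example in this thread.
records = [
    {"institution": "MO", "catalog_number": "1234", "collector": "Smith", "collector_number": 10001},
    {"institution": "K", "catalog_number": "5678", "collector": "SMITH", "collector_number": "10001"},
    {"institution": "P", "catalog_number": "AABB", "collector": "smith ", "collector_number": "10001 "},
    {"institution": "MO", "catalog_number": "9999", "collector": "Jones", "collector_number": 42},
]

groups = group_duplicates(records)
raw_count = len(records)    # counts every institutional record
unique_count = len(groups)  # counts each gathering once
```

With these sample records the raw count is 4 and the grouped count is 2, which is the kind of overstatement at issue. The key is computed from the data, so it can flag duplicates after the fact but does not replace an identifier.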

Chuck Miller
CIO
Missouri Botanical Garden

-----Original Message-----
From: Gregor Hagedorn [mailto:G.Hagedorn at BBA.DE]
Sent: Thursday, September 30, 2004 6:22 AM
To: TDWG-SDD at LISTSERV.NHM.KU.EDU
Subject: Re: Globally Unique Identifier


> > What about duplicate specimens?  Although a specimen may be MO 1234,
> > K5678 and P AABB, they may in fact all be SMITH 10001 and duplicates
> > of the exact same specimen, not different specimens. Is that one
> > GUID or 3?
>
> In my view, we would assign only ONE GUID, which represents the
> actual, physical specimen.  That this one specimen has multiple
> catalog numbers assigned to it is simply additional information
> associated with that one specimen (in the same way that many specimens
> may have more than one taxonomic name applied to them, by different
> investigators at different times).

I agree on the multiple catalogue numbers, but I believe multiple database
records of specimens will still exist. I myself am not involved in
collection curation but in evaluating the information therein (specifically,
we work on organism interactions), and we have a database of now close to
200 000 fungal host-parasite records. Some express an opinion without further
citation; others express an opinion backed up by a voucher specimen that
contains all the information that would be found in collection databases.
GBIF seems to have no place for such data so far, and it would be difficult
to provide, since we usually have none of
"[InstitutionCode]+[CollectionCode]+[CatalogNumber]"
(which is different from the problem having duplicate CatalogNumbers you
discuss). Still, what kind of data is that? What kind of data is created if a
Ph.D. student digitizes the specimen records used for a taxonomic revision
in a database that is specific to that revision?

Bottom line: The physical specimen does exist, but in the foreseeable future
all data GUIDs will be attached to data, not to the specimen. The only
exception is where it is indeed possible to attach the GUID to the specimen;
then this could be cited.

But then we have descriptions, and for description concepts (characters,
structures, states, modifiers, etc.) we also need GUIDs to allow federating
descriptions that use a common terminology. We have discussed this in SDD on
and off (specifically, we are proposing to prefer semantically neutral
identifiers, and propose a simple optional mechanism called debugid/debugref
to enrich data with calculated, semantically meaningful identifiers to
facilitate debugging), but at the moment SDD is really waiting for a more
general and common solution.

So this discussion is highly relevant to descriptions as well. My main point
is: what we are really interested in with GBIF, in the end, is knowledge, not
physical possession. If we limit our thinking about the GBIF system to the
very special case of institutionalized collections (as both DwC and ABCD in
my opinion currently do), or names governed by a nomenclatural code, I
believe we may later have to rearchitect.

BTW, partly because of these differences between institutional collection
customs and knowledge publication customs, I vote against a strongly central
system.
LSID authority (lsid.gbif.net) and namespace (with no or low semantics)
should be managed by GBIF, but not the ids/versions. GBIF may provide a
service to generate them, but should accept any locally generated ID and
trust the generator to manage uniqueness.
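
A sketch of what this split of responsibility could look like, assuming the standard LSID layout `urn:lsid:<authority>:<namespace>:<object>[:<revision>]`. The namespace `specimen` and the use of a random UUID as the locally generated object ID are illustrative assumptions, not something proposed in this thread:

```python
import uuid

def make_lsid(authority, namespace, object_id=None, revision=None):
    """Compose an LSID string from its parts.

    If no object_id is given, a random UUID is generated locally, so
    uniqueness is the generator's responsibility, not the authority's.
    """
    if object_id is None:
        object_id = str(uuid.uuid4())
    parts = ["urn", "lsid", authority, namespace, object_id]
    if revision is not None:
        parts.append(str(revision))
    return ":".join(parts)

def parse_lsid(lsid):
    """Split an LSID into authority, namespace, object and optional revision."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    revision = parts[5] if len(parts) == 6 else None
    return {"authority": parts[2], "namespace": parts[3], "object": parts[4], "revision": revision}

# Authority from the message above; the namespace is a hypothetical example.
local_id = make_lsid("lsid.gbif.net", "specimen")
fixed_id = make_lsid("lsid.gbif.net", "specimen", "abc123", 2)
```

Only the authority and namespace are fixed centrally here; the object part is whatever the local generator supplies, and the parser accepts it without interpreting it.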

Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn at bba.de)
Institute for Plant Virology, Microbiology, and Biosafety Federal Research
Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203

Often wrong but never in doubt!



More information about the tdwg-content mailing list