>>
>>One key use of GBIF-merged specimen records is to count or plot the number
>>of organisms in an area. When a wide net is thrown around the globe, the
>>duplicate records are caught and return overstated counts. Ideally a GUID
>>would identify a single unique organism record and enable duplicates to be
>>identified, but I can see no easy way for that to occur within LSID.
>>
>>
>
>There is some discussion of how this might happen with LSID in
>http://efgblade.cs.umb.edu/twiki/bin/view/BDEI/AbstractEntities
>
>Please feel free to contribute to it.
Because I don't really like to use wikis, I am going to continue this
thread here. There are some useful examples of abstract and "real"
instantiations of LSIDs at:
http://www.i3c.org/wgr/ta/resources/lsid/docs/LSIDqueriesAndResponses.txt
Here is a relevant section:
IsAbstractOf Query:
LSIDs may name abstract concepts such as "rat myoglobin" as well as
concrete instantiations of those concepts such as "rat myoglobin in FASTA
format". This query is used to find all of the LSIDs naming concrete
instantiations of a particular LSID that refers to an abstract
concept. In English, the query string below literally reads: "Fill in the
blank, 1AFT is an abstract concept of _________?" Another way to think
about it is, "What concrete instantiations of the abstract concept '1AFT'
exist?" The response includes the LSIDs of the 1AFT data in four different
formats.
Query string:
<URN:LSID:pdb.org:PDB:1AFT:><URN:LSID:i3c.org:predicates:isAbstractOf:>?x
Response:
URN:LSID:pdb.org:PDB:1AFT-PDB:
URN:LSID:pdb.org:PDB:1AFT-mmCIF:
URN:LSID:pdb.org:PDB:1AFT-JPG:
URN:LSID:pdb.org:PDB:1AFT-FASTA:
I can think of lots of similar examples in our domain.
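A minimal sketch of how such a query could be answered, assuming a toy
in-memory list of (subject, predicate, object) triples. The names TRIPLES
and query are illustrative only and are not part of any I3C or IBM LSID
software; the LSIDs are the ones from the example above.

```python
# Toy triple store: each entry is (subject, predicate, object).
# This is an illustration, not an actual LSID resolver.
IS_ABSTRACT_OF = "URN:LSID:i3c.org:predicates:isAbstractOf:"
ABSTRACT = "URN:LSID:pdb.org:PDB:1AFT:"

TRIPLES = [
    (ABSTRACT, IS_ABSTRACT_OF, "URN:LSID:pdb.org:PDB:1AFT-PDB:"),
    (ABSTRACT, IS_ABSTRACT_OF, "URN:LSID:pdb.org:PDB:1AFT-mmCIF:"),
    (ABSTRACT, IS_ABSTRACT_OF, "URN:LSID:pdb.org:PDB:1AFT-JPG:"),
    (ABSTRACT, IS_ABSTRACT_OF, "URN:LSID:pdb.org:PDB:1AFT-FASTA:"),
]


def query(subject, predicate, triples=TRIPLES):
    """Return every x such that (subject, predicate, x) is stored."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# "1AFT is an abstract concept of ?x"
concrete = query(ABSTRACT, IS_ABSTRACT_OF)
for lsid in concrete:
    print(lsid)
```

In our domain the abstract LSID might stand for a specimen and the
concrete ones for its records in different formats or databases.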
Browsing http://www.i3c.org/ and
http://www-124.ibm.com/developerworks/oss/lsid/ will generally provide you
with examples similar to all of our issues. Somewhere in all those
articles there must be examples of how to recognize the identity of a
sequence that has been deposited in two different sequence databases. Lots
of good reading for the long plane trip to NZ.
Julian
Julian Humphries
DigiMorph.Org
Geological Sciences
University of Texas at Austin
Austin, TX 78712
512-471-3275
Chuck Miller wrote:
>>>which is different from the problem having duplicate CatalogNumbers you
>>>
>>>
>discuss
>
>
>
>>>The physical specimen does exist, but in the foreseeable future all data
>>>
>>>
>GUIDs will be attached to data, not to the specimen.
>
>Duplicate specimens occur because the collector collected multiple samples
>of the same organism and sent them to other institutions. The duplicate
>specimens themselves probably have different CatalogNumbers in each
>institution. The specimen database records reflect the actual specimens.
>Therefore, the specimen database records when combined from multiple
>institutions have duplicates of the same organism. But, only by looking at
>either the Collector and Collector's number or date/location can the
>duplication be recognized.
>
>One key use of GBIF-merged specimen records is to count or plot the number
>of organisms in an area. When a wide net is thrown around the globe, the
>duplicate records are caught and return overstated counts. Ideally a GUID
>would identify a single unique organism record and enable duplicates to be
>identified, but I can see no easy way for that to occur within LSID.
>
>
>
There is some discussion of how this might happen with LSID in
http://efgblade.cs.umb.edu/twiki/bin/view/BDEI/AbstractEntities
Please feel free to contribute to it.
Bob
>Chuck Miller
>CIO
>Missouri Botanical Garden
>
>-----Original Message-----
>From: Gregor Hagedorn [mailto:G.Hagedorn@BBA.DE]
>Sent: Thursday, September 30, 2004 6:22 AM
>To: TDWG-SDD(a)LISTSERV.NHM.KU.EDU
>Subject: Re: Globally Unique Identifier
>
>
>
>
>>>What about duplicate specimens? Although a specimen may be MO 1234,
>>>K5678 and P AABB, they may in fact all be SMITH 10001 and duplicates
>>>of the exact same specimen, not different specimens. Is that one
>>>GUID or 3?
>>>
>>>
>>In my view, we would assign only ONE GUID, which represents the
>>actual, physical specimen. That this one specimen has multiple
>>catalog number assigned to it is simply additional information
>>associated with that one specimen (in the same way that many specimens
>>may have more than one taxonomic name applied to it, by different
>>investigators at different times).
>>
>>
>
>I agree on the multiple catalogue numbers, but I believe still multiple
>database records of specimens will exists. Since I myself am not involved in
>collection curation, but in evaluating the information therein (specifically
>we work on organism interactions) we have a database of now close to 200 000
>fungal host parasite records. Some express opinion without further citation,
>others express opinion backed up by voucher specimen that contains all the
>information that would be found in collection databases. GBIF seems to have
>no place for such data so far - and it would be difficult to provide, since
>we usually have none of "InstitutionCode]+[CollectionCode]+[CatalogNumber"
>(which is different from the problem having duplicate CatalogNumbers you
>discuss). Still what kind of data is that? What kind of data is created if a
>PH.D. student digitizes the specimen records used for a taxonomic revision
>in a database that is specific to that revision?
>
>Bottomline: The physical specimen does exist, but in the foreseeable future
>all data GUIDs will be attached to data, not to the specimen. The exceptions
>is only where indeed it is possible to attach the GUID to the specimen, then
>this could be cited.
>
>But then we have descriptions, and for description concepts (characters,
>structures, states, modifiers, etc.) we also need GUIDs to allow federating
>descriptions that use a common terminology. We have discussed this in SDD on
>and off (specifically we are proposing to prefer semantically neutral
>identifiers, and propose a simple optional mechanism called debugid/debugref
>to enrich data with calculated, semantically meaningful identifiers to
>facilitate
>debugging) - but at the moment SDD really waits for a more general and
>common solution.
>
>So this discussion is highly relevant to descriptions as well. My main point
>is: what we are really interested in GBIF in the end is knowledge, not
>physical possession. If we limit our thinking of the GBIF system to the very
>special case of institutionalized collections (as both DwC and ABCD in my
>opinion currently do), or names governed by a nomenclatural code, I believe
>we may later have to rearchitect.
>
>BTW, partly for these differences between institutional collection- customs
>and knowledge publication customs, I vote against a strongly central system.
>LSID authority (lsid.gbif.net) and namespace (with no or low semantics)
>should be managed by GBIF, but not the ids/versions. GBIF may provide a
>service to generate them, but should accept any locally generated ID and
>trust the generator to manage uniqueness.
>
>Gregor
>----------------------------------------------------------
>Gregor Hagedorn (G.Hagedorn(a)bba.de)
>Institute for Plant Virology, Microbiology, and Biosafety Federal Research
>Center for Agriculture and Forestry (BBA)
>Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220
>14195 Berlin, Germany Fax: +49-30-8304-2203
>
>Often wrong but never in doubt!
>
>
>
--
Robert A. Morris
Professor of Computer Science
UMASS-Boston
ram(a)cs.umb.edu
http://www.cs.umb.edu/efg
http://www.cs.umb.edu/~ram
phone (+1)617 287 6466
> >which is different from the problem having duplicate CatalogNumbers you
discuss
> >The physical specimen does exist, but in the foreseeable future all data
GUIDs will be attached to data, not to the specimen.
Duplicate specimens occur because the collector collected multiple samples
of the same organism and sent them to other institutions. The duplicate
specimens themselves probably have different CatalogNumbers in each
institution. The specimen database records reflect the actual specimens.
Therefore, the specimen database records when combined from multiple
institutions have duplicates of the same organism. But, only by looking at
either the Collector and Collector's number or date/location can the
duplication be recognized.
One key use of GBIF-merged specimen records is to count or plot the number
of organisms in an area. When a wide net is thrown around the globe, the
duplicate records are caught and return overstated counts. Ideally a GUID
would identify a single unique organism record and enable duplicates to be
identified, but I can see no easy way for that to occur within LSID.
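As a rough sketch of that kind of matching, assuming each merged record
carries a collector name and a collector's number: the field names and
the fourth record are made up for illustration, and real data would
need much more careful normalization of collector names than a simple
lower-casing.

```python
from collections import defaultdict

# Hypothetical merged records; field names are assumptions, not DwC or ABCD.
records = [
    {"catalog": "MO 1234", "collector": "Smith", "collector_number": "10001"},
    {"catalog": "K5678", "collector": "SMITH", "collector_number": "10001"},
    {"catalog": "P AABB", "collector": "Smith ", "collector_number": "10001"},
    {"catalog": "MO 9999", "collector": "Jones", "collector_number": "42"},
]


def group_duplicates(recs):
    """Group catalog numbers by a normalized (collector, number) key."""
    groups = defaultdict(list)
    for rec in recs:
        key = (rec["collector"].strip().lower(), rec["collector_number"].strip())
        groups[key].append(rec["catalog"])
    return dict(groups)


groups = group_duplicates(records)
record_count = len(records)    # what a naive count over merged data reports
gathering_count = len(groups)  # distinct collector + number combinations
print(record_count, gathering_count)
```

Here the naive count is 4 records, but only 2 distinct gatherings.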
Chuck Miller
CIO
Missouri Botanical Garden
-----Original Message-----
From: Gregor Hagedorn [mailto:G.Hagedorn@BBA.DE]
Sent: Thursday, September 30, 2004 6:22 AM
To: TDWG-SDD(a)LISTSERV.NHM.KU.EDU
Subject: Re: Globally Unique Identifier
> > What about duplicate specimens? Although a specimen may be MO 1234,
> > K5678 and P AABB, they may in fact all be SMITH 10001 and duplicates
> > of the exact same specimen, not different specimens. Is that one
> > GUID or 3?
>
> In my view, we would assign only ONE GUID, which represents the
> actual, physical specimen. That this one specimen has multiple
> catalog number assigned to it is simply additional information
> associated with that one specimen (in the same way that many specimens
> may have more than one taxonomic name applied to it, by different
> investigators at different times).
I agree on the multiple catalogue numbers, but I believe multiple database
records of specimens will still exist. Since I myself am not involved in
collection curation, but in evaluating the information therein (specifically
we work on organism interactions) we have a database of now close to 200 000
fungal host parasite records. Some express opinion without further citation,
others express opinion backed up by voucher specimen that contains all the
information that would be found in collection databases. GBIF seems to have
no place for such data so far - and it would be difficult to provide, since
we usually have none of "[InstitutionCode]+[CollectionCode]+[CatalogNumber]"
(which is different from the problem having duplicate CatalogNumbers you
discuss). Still what kind of data is that? What kind of data is created if a
PH.D. student digitizes the specimen records used for a taxonomic revision
in a database that is specific to that revision?
Bottom line: The physical specimen does exist, but in the foreseeable future
all data GUIDs will be attached to data, not to the specimen. The only
exception is where it is indeed possible to attach the GUID to the specimen;
then this could be cited.
But then we have descriptions, and for description concepts (characters,
structures, states, modifiers, etc.) we also need GUIDs to allow federating
descriptions that use a common terminology. We have discussed this in SDD on
and off (specifically we are proposing to prefer semantically neutral
identifiers, and propose a simple optional mechanism called debugid/debugref
to enrich data with calculated, semantically meaningful identifiers to
facilitate
debugging) - but at the moment SDD really waits for a more general and
common solution.
So this discussion is highly relevant to descriptions as well. My main point
is: what we are really interested in GBIF in the end is knowledge, not
physical possession. If we limit our thinking of the GBIF system to the very
special case of institutionalized collections (as both DwC and ABCD in my
opinion currently do), or names governed by a nomenclatural code, I believe
we may later have to rearchitect.
BTW, partly for these differences between institutional collection- customs
and knowledge publication customs, I vote against a strongly central system.
LSID authority (lsid.gbif.net) and namespace (with no or low semantics)
should be managed by GBIF, but not the ids/versions. GBIF may provide a
service to generate them, but should accept any locally generated ID and
trust the generator to manage uniqueness.
Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn(a)bba.de)
Institute for Plant Virology, Microbiology, and Biosafety Federal Research
Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220
14195 Berlin, Germany Fax: +49-30-8304-2203
Often wrong but never in doubt!
> > In my view, we would assign only ONE GUID, which represents the
> > actual, physical specimen. That this one specimen has multiple
> > catalog number assigned to it is simply additional information
> > associated with that one specimen (in the same way that many specimens
> > may have more than one taxonomic name applied to it, by different
> > investigators at different times).
>
> I agree on the multiple catalogue numbers, but I believe still
> multiple database records of specimens will exists.
Yes, but I guess a central question, which Donald included in his PowerPoint
file, is whether the GUID is assigned to the physical object, or to the
electronic representation (data record). Most of my comments have been from
the standpoint that the GUID applies to the physical specimen. If it is the
electronic records that we wish to uniquely identify, then it seems to me
that the <objectID> component of an LSID should apply to the physical
specimen, and multiple database records should be uniquely identified using
the <version> component.
> Since I myself am
> not involved in collection curation, but in evaluating the
> information therein (specifically we work on organism interactions)
> we have a database of now close to 200 000 fungal host parasite
> records. Some express opinion without further citation, others
> express opinion backed up by voucher specimen that contains all the
> information that would be found in collection databases. GBIF seems
> to have no place for such data so far - and it would be difficult to
> provide, since we usually have none of
> "InstitutionCode]+[CollectionCode]+[CatalogNumber" (which is
> different from the problem having duplicate CatalogNumbers you
> discuss).
This is part of the reason why I think that the
"[InstitutionCode]+[CollectionCode]+[CatalogNumber]" solution is only a
temporary one. This raises another question: would the GUIDs be limited to
just vouchered specimens? Or, would they also be assigned to unvouchered
specimens (e.g., field observations of specific individual organisms, that
were not vouchered in a Museum collection). Or, would unvouchered
"biological instances" represent a different class of GUIDs?
The simple answer is to make observations a different class of object. But
in my data management world, I need to deal with everything from sight
records with little more data than a taxonomic determination; to specific
observations involving specific (uncollected) individual organisms
(sometimes with as much associated data as any vouchered specimen); to
collected organisms that were brought into a lab, examined by experts, but
not added to a permanent collection; to stereotypical museum voucher
specimens; to specimens that were added to the permanent voucher collection,
but later lost or destroyed. In my mind, there is little fundamental
difference between the two endpoints of this spectrum, and I have thus
decided to treat all such entities as the same class of object ("Biological
Instances") -- which also spans the [population-->multiple specimen-->single
specimen-->specimen part] continuum. It's not a perfectly clean solution;
but no solution is perfectly clean, and to my mind, this is the optimal
solution from a data management perspective.
> Still what kind of data is that? What kind of data is
> created if a PH.D. student digitizes the specimen records used for a
> taxonomic revision in a database that is specific to that revision?
I would say that the student would (ideally) reference existing specimen
GUIDs in his/her specific database -- not create new GUIDs (unless
referencing physical specimens -- vouchered or not -- that have not yet
received GUIDs, in which case new GUIDs would be assigned using the
appropriate procedure, whatever that ends up being).
> Bottomline: The physical specimen does exist, but in the foreseeable
> future all data GUIDs will be attached to data, not to the specimen.
> The exceptions is only where indeed it is possible to attach the GUID
> to the specimen, then this could be cited.
Good point! My concern, though, would be that we might end up in the same
state of chaos that we are now, where multiple electronic records of a
particular physical specimen are not rigorously cross-linked, and thus run
the risk of being counted as multiple/separate physical instances.
Identifying the physical object with the <objectID> component of an LSID,
and different electronic representations of the associated data as different
<versions> of the same <objectID> could represent a way to deal with both
"realities".
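To make the idea concrete, here is a minimal sketch, assuming the
urn:lsid:authority:namespace:object:version layout and assuming (this
is only my proposal, not something the LSID specification prescribes)
that the version part is used to tell apart different records of one
specimen. The "specimen" namespace and the identifiers are hypothetical.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    version: Optional[str] = None

    def __str__(self):
        parts = ["urn", "lsid", self.authority, self.namespace, self.object_id]
        if self.version:
            parts.append(self.version)
        return ":".join(parts)


def parse_lsid(text):
    """Split an LSID string into its components; the version is optional."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    return LSID(*parts[2:])


def same_specimen(a, b):
    """True when two LSIDs agree on everything except the version."""
    return parse_lsid(a)[:3] == parse_lsid(b)[:3]


# Hypothetical LSIDs: one physical specimen, two institutional records.
mo = "urn:lsid:lsid.gbif.net:specimen:smith-10001:mo-1234"
kew = "urn:lsid:lsid.gbif.net:specimen:smith-10001:k5678"
print(same_specimen(mo, kew))
```

Counting distinct specimens would then mean counting distinct
authority/namespace/object triples, ignoring the version.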
> But then we have descriptions, and for description concepts
> (characters, structures, states, modifiers, etc.) we also need GUIDs
> to allow federating descriptions that use a common terminology.
> We have discussed this in SDD on and off (specifically we are proposing
> to prefer semantically neutral identifiers, and propose a simple
> optional mechanism called debugid/debugref to enrich data with
> calculated, semantically meaningful identifiers to facilitate
> debugging) - but at the moment SDD really waits for a more general
> and common solution.
Are you talking about GUIDs for character definitions, or GUIDs for
instances of character definitions applied to individual specimens/taxa, or
both? I see character definitions as analogous to taxon concepts; names
used to represent those character definitions as analogous to taxon names;
and the application of those character definitions to specific specimens (or
taxa) as analogous to taxonomic determinations (i.e., the application of
taxon names/concepts to specimens).
> So this discussion is highly relevant to descriptions as well. My
> main point is: what we are really interested in GBIF in the end is
> knowledge, not physical possession. If we limit our thinking of the
> GBIF system to the very special case of institutionalized collections
> (as both DwC and ABCD in my opinion currently do), or names governed
> by a nomenclatural code, I believe we may later have to rearchitect.
Agreed!! (Especially considering the forum on which this discussion is
taking place.) This is part of the reason why I keep mixing taxon
names/concepts examples among specimen examples. In the back of my mind, I
was thinking SDD examples as well, though I think I failed to express that
adequately.
> BTW, partly for these differences between institutional collection-
> customs and knowledge publication customs, I vote against a strongly
> central system. LSID authority (lsid.gbif.net) and namespace (with no
> or low semantics) should be managed by GBIF, but not the
> ids/versions. GBIF may provide a service to generate them, but should
> accept any locally generated ID and trust the generator to manage
> uniqueness.
I generally agree. My leanings towards centralization revolved around GUID
generation only (not application of GUID to associated data -- which is what
I would define as the "management" part), and *perhaps* GUID resolution (but
more for "unowned" sorts of data like taxonomy, and only in the paradigm
that each GUID could be resolved by one and only one domain server, and that
the GUID would cease to have meaning if/when the branded domain server
ceased to exist).
Whether or not GUID generation happens centrally or in a distributed way
seems to me to depend on what GUID scheme is ultimately adopted.
Aloha,
Rich
Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef(a)bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html
Too many streams to choose from so I will jump in here as well.
Bob's comments on wiki vs. mail are well noted.
One really important attribute of LSID's is that they *are* intended to
be persistent enough to outlive the objects that they represent and/or
our ability to resolve them. So that, one day, we might know when
multiple documents have referenced a single object - digital or
otherwise. It is this property that is particularly useful to GBIF
where there is a requirement to identify the unique objects duplicated
within a system that is designed, from the outset, to replicate at all
costs. But as Dave V. has earlier noted - we may still be confusing our
need for LSIDs with our GUID requirement. DiGIR and BioCASE already
provide mechanisms for index maintenance, meta-data query, data
retrieval etc.
The LSID specification, noted by Thau below, states that as well as
being persistent an LSID will always resolve to the *"same set of
bytes"* [concrete] or an *empty set* [abstract] ( I have assumed that
this meant *exactly the same*, but I could be wrong). The implication
is that DiGIR, ABCD and HISPID representations of the same object all
require unique LSIDs . The specification documents provide examples to
illustrate *hierarchical* ways of registering LSIDs for a single object
in multiple formats. Versioning is in there to cover changes to an
object - which must still be registered - but a change in format could
require yet another LSID.
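A small sketch of why this follows, assuming my strict reading of "same
set of bytes": the same record content serialized in two ways produces
different bytes, so one LSID could not stand for both. The two
serializations below are simple stand-ins, not real DiGIR or ABCD
output, and the record itself is hypothetical.

```python
import hashlib
import json

# One hypothetical record ...
record = {"collector": "Smith", "collector_number": "10001"}

# ... in two stand-in formats.
as_json = json.dumps(record, sort_keys=True).encode("utf-8")
as_text = "; ".join(f"{k}={v}" for k, v in sorted(record.items())).encode("utf-8")


def fingerprint(data: bytes) -> str:
    """SHA-256 digest, used here only to compare byte sequences."""
    return hashlib.sha256(data).hexdigest()


# Different formats give different bytes for the same content.
same_bytes = fingerprint(as_json) == fingerprint(as_text)

# Re-serializing in the same format reproduces the same bytes.
again = json.dumps(record, sort_keys=True).encode("utf-8")
stable = fingerprint(as_json) == fingerprint(again)
print(same_bytes, stable)
```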
Each LSID references a static object - once one has one ... unless one
has more than one, which is also possible.
Nevertheless, there are still other ways in which LSIDs may prove
useful to biodiversity informatics. Dave V. has mentioned the possible
use of an LSID that resolves to a DiGIR query - a static object that may
be used to retrieve from a dynamic data set. And Thau has mentioned the
fact that while the object behind an LSID must not change, meta-data for
that object may be provided from many sources and in different formats.
Perhaps there is scope here for the object behind an LSID to be only
that relatively stable component of the contributor data set that will
be used to build the global index (or query), with LSID registration and
resolution used to simplify index maintenance and with the getMetaData
methods providing access to complete records or their access points.
Duplicate records registered with abstract LSIDs resolving to a parent
LSID...
In herbaria the physical objects are commonly duplicated. Supposedly
duplicate vouchers from a single collecting event - the primary data
source. Is our aim to provide access to the data gathered during that
event (with the benefit of taxonomic hind-sight) or to the objects that
result, or both? How do we identify replicate material? Will an LSID help
with resolution of ambiguous events?
greg
On Tue, 2004-09-28 at 04:04, dave thau wrote:
> Hello everyone,
>
> Sorry to come into this late - and forgive me if I'm covering trodden
> ground here, I just re-joined the list and may have missed a few posts.
>
> I've been looking at various GUID systems for SEEK - primarily the Handle
> System underlying DOI, and LSIDs. It looks like I'll be giving an
> introduction to GUIDs and focusing on LSIDs at TDWG in New Zealand in two
> weeks. Judging by the conversation here, I think I'll be keeping my
> introduction brief to allow maximal time for discussion!
>
> There are a few things about LSIDs that I want to point out. First....
> as someone has mentioned, some of my early ramblings on GUIDs can be found
> here:
>
> http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/seek/projects/taxon/docs/guid/
>
> Some of these files got munged in CVS, but I've fixed them. These files
> are over six months old, and between then and now I've become more of an
> LSID fan. So, if you do bother reading that stuff, realize it's old and
> outdated and incomplete in too many places. I'm going to write a revised
> document encompassing and expanding all of that and I'll post here when
> it's completed.
>
> Now for some miscellaneous points about LSIDs
>
> * What happens when an LSID Authority goes away?
>
> As I think Dave V. and others have pointed out, when an LSID is resolved,
> DNS is used to find the LSID authority. The LSID authority then provides
> information about how the LSID can be served up (e.g. HTTP, SOAP, FTP),
> and where to get the data behind the LSID and associated metadata. If I
> start serving up LSIDs with the authority learningsite.com and later
> decide that I'm sick of serving up LSIDs, somebody else can take over
> serving up the data and the metadata. However, I (or they) still bear the
> responsibility of running the authority which points to the data. If my
> lsids have an authority like lsid.learningsite.com
> (urn:lsid:lsid.learningsite.com:foo:bar) then someone else can take over
> the authority by taking over lsid.learningsite.com and I can still have
> www.learningsite.com, mail.learningsite.com, etc... for myself. So, with a
> little planning, it's not so hard to deal with an authority going away as
> long as the people running it are responsible.
>
> * data and metadata
>
> With LSIDs there's a big difference between the data and metadata of an
> LSID - and I think this is going to be the biggest challenge in deciding
> how to use them in our context. What's the data? What's the metadata?
> With gene sequences, the datum is the sequence, the metadata are things
> like contact information, who did the sequencing, taxonomic information
> about the thing sequenced, etc. There's an LTER site using LSIDs for
> their data sets. The LSID data is the data set itself, and the metadata
> is what you'd expect - a description of the data set, who the
> investigators were, that sort of thing. NCBI has pubmed LSIDs - they're
> not serving up the articles yet, but there's associated metadata in there..
> For these things the division between data and metadata is fairly clear.
> However, what is the data for taxa? What is the metadata?
>
> Here's another interesting thing about data and metadata in LSIDs. When
> you issue an LSID you're promising that the DATA behind that LSID never
> changes. Additionally, there's only one authority ultimately responsible
> for pointing to the data, and that never changes (although as above,
> someone else can take the authority over). However, for metadata, there
> are no such promises. Metadata can change. Furthermore, organizations
> other than the authority can provide metadata, as long as the authority
> agrees to it and adds them to a list of authorized metadata providers. I
> don't know if this is such a great idea, but it's in the specification.
>
> So, what's the data? What's the metadata? This question applies to any
> GUID system, really - the Handle System has the same issues, but less
> clearly defined. As an aside - the Handle System is very robust and the
> fee schedule is probably circumventable. However, I think LSIDs are
> better suited to the direction biodiversity informatics is taking - using
> XML-based standards and standard internet protocols to share data.
>
> * client stack versus authority server
>
> The LSID folks provide two batches of code - an authority server, for
> people who want to serve up LSIDs themselves, and an LSID Client stack
> - which can be used by organizations to provide access to their LSIDs
> and/or proxy LSIDs provided by other organizations. It may make sense for
> an organization like GBIF to build a service using the Client Stack to
> support both their own LSIDs and those served by other organizations. The
> Client Stack has a caching mechanism which supports expiration information
> from the primary authority, so the primary authority can update where the
> LSID may be resolved and metadata of that authority.
>
> In this model, GBIF, or someone else, could support both their own LSIDs,
> and the LSIDs of others. Furthermore, it could choose which authorities
> it was going to resolve, so people who wanted to be sure to get "just the
> good stuff" according to GBIF could use the GBIF service. In addition, it
> could perform the de-duplication service that several people have
> mentioned - trying to maintain a one LSID per data item mapping.
>
> * lsid namespaces and file formats
>
> I don't think the namespace part of the LSID
> (urn:lsid:authority:namespace:object:version) is intended to be
> semantically loaded except for the relevant lsid authority. There is a
> way in the metadata to state what format the data comes in. It's not the
> traditional text/javascript mime-type tag - instead the format is another
> LSID! For example the FASTA protein sequence file format is:
> urn:lsid:i3c.org:formats:fasta. Clients that understand LSIDs, like the
> Launchpad application, can be set to attach applications to LSID formats
> so that clicking on an LSID with a given format opens up an appropriate
> application.
>
> Sorry to be so scattershot - it's hard to come into the middle of a huge
> topic like this. I'm glad to see all this discussion - it's going to make
> working on my talk much easier (I think...)
>
> Dave
--
Greg Whitbread <ghw(a)anbg.gov.au>
+61-2-62509482
ANBG/CPBR/ANH
I've been watching these messages but have not yet had time to respond to
all of the valuable discussion.
One thing I would note in relation to Richard's comments below (and would
suggest makes a lot of sense) is that we could adopt a model by which each
provider is responsible for resolving the identifiers for their own data,
but that we have a central fall-back server (e.g. one based on a central
data index) to which all requests get forwarded whenever a provider finds it
cannot resolve an id that it originally issued. This could solve the
problem in a reasonably standard and simple way for cases in which specimens
and associated data do get moved.
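A minimal sketch of that fall-back model, assuming hypothetical
Provider and CentralIndex classes (nothing here corresponds to existing
GBIF, DiGIR or LSID code, and the identifier is made up): the issuing
provider is asked first, and the request is forwarded to the central
index only when the issuer no longer holds the record.

```python
class Provider:
    """A data provider that resolves only identifiers it currently holds."""

    def __init__(self, name, holdings):
        self.name = name
        self.holdings = dict(holdings)

    def resolve(self, identifier):
        return self.holdings.get(identifier)


class CentralIndex:
    """Central fall-back: remembers which provider currently holds each id."""

    def __init__(self):
        self.locations = {}

    def register(self, identifier, provider):
        self.locations[identifier] = provider

    def resolve(self, identifier):
        provider = self.locations.get(identifier)
        return provider.resolve(identifier) if provider else None


def resolve(identifier, issuer, central):
    """Ask the issuing provider first; forward to the central index on a miss."""
    found = issuer.resolve(identifier)
    return found if found is not None else central.resolve(identifier)


# A specimen issued by museum A has since moved to museum B.
guid = "urn:lsid:museum-a.example:specimen:1234"
museum_a = Provider("A", {})
museum_b = Provider("B", {guid: {"held_by": "B"}})
central = CentralIndex()
central.register(guid, museum_b)

print(resolve(guid, museum_a, central))
```

The identifier itself never changes; only the central index entry is
updated when the specimen and its data move.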
One other thing that was probably not clear in my originally sending out a
PowerPoint presentation without a covering commentary was that I think we
need a community discussion not only about the merits of centralized or
decentralized approaches but also about the most appropriate owner of any
core central infrastructure. I put GBIF in this position in my slides, but
I could equally see this as something which could be done in the name of
TDWG.
Donald
---------------------------------------------------------------
Donald Hobern (dhobern(a)gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
---------------------------------------------------------------
-----Original Message-----
From: TDWG - Structure of Descriptive Data
[mailto:TDWG-SDD@LISTSERV.NHM.KU.EDU] On Behalf Of Richard Pyle
Sent: 28 September 2004 17:32
To: TDWG-SDD(a)LISTSERV.NHM.KU.EDU
Subject: Re: BioGUIDs and the Internet Analogy
> [For me, the bottom line---which however I nowhere state below---is:
> There is /so much/ existing free infrastructure source code---e.g.
> http://www-124.ibm.com/developerworks/oss/lsid/--- and (apparently)
> funding, and (manifestly) professionally designed specifications for
> LSID concerns that I am horrified at the prospect of adopting anything
> else if LSID comes even close to being what the community needs.
I can certainly understand that perspective, and that's one of the main
reasons I am still semi-supportive of the LSID approach (i.e., existing
code). My major concern has to do with the "Authority"/"Resolver" domain
portion of the LSID, and the need (or non-need) for it to be an active,
accessible domain in order to resolve the LSID. I'm also VERY concerned
about there ever being a temptation to change an ID for an object (e.g., a
specimen given from one Museum to another) -- unless it is understood that
the non-"ObjectID" portion is really thought of as metadata of sorts, and
ObjectID itself is globally unique by itself. I'll need to read the LSID
spec in more detail, and give it some more thought, before I comment further
on this.
> Or,
> let's forget about LSID and instead of deploying what satisfies 98% of
> the needs in six months, we could roll our own and deploy what satisfies
> 80% of the needs in a few years...
If that were really the balance, then it would be a no-brainer. My concern
would be adopting (effectively committing to) a scheme that satisfies 80% of
the needs in six months, instead of being patient and picking a system that
accommodates 98% of the need a few years from now. If I had confidence that
we could implement LSIDs in a "test-drive" mode for a couple of years,
without being fully committed to them, I'd be much more comfortable. But as
someone who spends a considerable amount of time trying to undo the damage
of "legacy" solutions to data problems that were hastily conceived, I'm
trying to be cautious.
> The design goals of TCP/IP and DNS, and their implementation, intersect
> the requirements of Bio UUIDs only in a very small set, in fact, deep
> down perhaps not at all.
>
> These protocols and the associated address syntax were designed
> primarily for /routing/, not in any way designed to guarantee that a
> datum twice received has any connection between the two occurrences.
>
> IP addresses are in no way persistent.
>
> IP addresses are not globally unique, albeit in several small and varied
> ways:
I don't think anyone (in this thread) was suggesting actually *USING* TCP/IP
and DNS for BioGUIDs (at least I wasn't). Rather, I was looking to it as a
source of ground-truthed schemes for reliably managing globally distributed
information. For instance, would DNS synchronization/propagation serve as a
useful model for globally distributed, synchronized taxonomic registries? Or
would the taxonomic registry work more effectively with one or a few
centralized "masters" with which a larger set of replicates kept
synchronized? I also think, as I explained earlier, that the hierarchical
approach with centralized block ID issuance and local application might be
instructive to a bioscheme.
> In fact, if the UUIDs are meant to be semantically opaque it matters not
> one whit who or how these matters are settled.
Can you elaborate on what you mean by "semantically opaque"? I think I
understand -- but the last thing this thread needs is ambiguity about the
meaning of terms (i.e., "opaque semantics".... :-) )
> Exceptions to that are
> social, not technical. ("If you don't let me decide X, I am not going to
> use your scheme". "OK, then you won't participate in its benefits.
> That's fine with me")
...and as I said before, the real challenge in establishing universally
adopted BioGUIDs is not going to be technical; it's going to be
social/political.
> >>So, there is a hierarchy of how the "unique identifiers" are managed.
> >
> > There is
> >
> >>in fact a central authority, but it delegates to decentralized
> >
> > authorities.
>
> But this is mainly to distribute costs and speed issuance. It has
> nothing to do with the naming scheme. The number of organizations to be
> issued Bio GUIDs surely is several orders of magnitude less than those
> to be issued IPv6 addresses. So I doubt any IPv6 issuance mechanisms are
> instructive, at least in their purpose (and hence, if well implemented,
> in their implementation).
So are you saying that, because BioGUID traffic will be orders of magnitude
smaller than internet domain traffic, there does not need to be delegation
to decentralized authorities? If so, then we are in full agreement.
> > GBIF seems to me to be the principle contender.
>
> I enthusiastically agree. Also the /principal/ contender. [Sorry,
> couldn't resist. My fingers slip on that one sometimes too.]
Ouch.... :-)
> Not exactly. There is one scheme in case your application can't resolve
> it in a more nearly "local" facility. There are /lots/ of ways to find
> an IP address from a domain name. All those which comply fully with the
> DNS protocol, however, can make available two pieces of metadata: the
> TTL of the record it is offering, and the IP address of a machine at
> which you can find an authoritative record of the assignment of the dns
> name to the IP address. This protocol /might/, but you hope on
> performance grounds usually /doesn't/, lead you up as far as the root
> servers, and the "one scheme to bind them all". If there is any lesson
> here at all, it is that name resolution protocols matter, but resolution
> implementations don't. Yet another attribute on which DNS/IP and LSID
> are not distinguishable.
This seems to be a fundamental point of confusion (for me anyway). Are the
domain names embedded within LSIDs information-bearing in the sense that
they are necessarily the internet domain at which the LSID is resolved? I
guess I should read and understand Section 13.3 of the LSID spec before
commenting further.
> More often, only when the TTLs expire, there being no motivation to do
> otherwise.
O.K., there's a great analogy that may be useful if implementing a
distributed system of synchronized/mirrored biological data servers: should
they remain in synch at fixed time intervals? In real time with each data
transaction? Or, should some sort of TTL feature be incorporated in data?
Much to think about. But time for me to get some work done....
Aloha,
Rich
Bob,
Take a look at section 13.3 "Discovering an LSID Resolution Service
using DDDS/DNS".
Dave V.
Bob Morris wrote:
> Rich
>
> No reading I can make of http://www.omg.org/cgi-bin/doc?dtc/2004-05-01
> is consistent with the explanation offered below. "Authority" is not
> used at all in Section 9, "LSID Resolution Service". Instead, resolution
> is defined to be accomplished by a set of interfaces all of which take
> only a full (semantically opaque!!!! Sec. 8, p. 7!!!) LSID as argument.
> The interfaces correspond to methods offered by an LSID Resolution
> Service. Nowhere in Section 9 is there any relationship mentioned
> between resolution and the "authority identification" that is part of
> the syntax of an LSID. That is discussed in Section 8, "LSID Syntax"
> which carries the sentences (bottom of p. 7) that I have previously
> cited: "The authority identification is usually an Internet domain name.
> In this case it is recommended that it be owned by the organization that
> assigns an LSID in question".
>
> There are too many ">"s for me to understand who is claiming the stuff
> below, but whoever it is could do me a favor by telling me which part of
> the spec they are reading. (Or if they found a later document at OMG).
> Well, OK, I haven't yet read the "Accompanied Files" listed in Appendix
> A., some of which are normative and take precedence over the main
> document. Maybe the explanation comes from one of them.
>
> Richard Pyle wrote:
>
>> Many thanks for jumping in on this, Dave!
>>
>>
>>> As I think Dave V. and others have pointed out, when an LSID is
>>> resolved,
>>> DNS is used to find the LSID authority. The LSID authority then
>>> provides
>>> information about how the LSID can be served up (e.g. HTTP, SOAP, FTP),
>>> and where to get the data behind the LSID and associated metadata.
>>> If I
>>> start serving up LSIDs with the authority learningsite.com and later
>>> decide that I'm sick of serving up LSIDs, somebody else can take over
>>> serving up the data and the metadata. However, I (or they) still
>>> bear the
>>> responsibility of running the authority which points to the data. If my
>>> lsids have an authority like lsid.learningsite.com
>>> (urn:lsid:lsid.learningsite.com:foo:bar) then someone else can take over
>>> the authority by taking over lsid.learningsite.com and I can still have
>>> www.learningsite.com, mail.learningsite.com, etc... for myself. So,
>>> with a
>>> little planning, it's not so hard to deal with an authority going
>>> away as
>>> long as the people running it are responsible.
>>
>>
>
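As a minimal sketch of the syntax being argued over: the example above,
urn:lsid:lsid.learningsite.com:foo:bar, splits on colons into an authority,
a namespace and an object identifier, with an optional revision. The Python
below is illustrative only. The names LSID and parse_lsid are hypothetical
and not part of any LSID toolkit, and the code only parses; it does not
resolve anything.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Components of an LSID (hypothetical helper type for illustration)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty LSID component in {text!r}")
    # A trailing colon (empty revision) is treated as "no revision".
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


lsid = parse_lsid("urn:lsid:lsid.learningsite.com:foo:bar")
print(lsid.authority, lsid.namespace, lsid.object_id)
```

In this reading, only the authority component (lsid.learningsite.com) would
be handed to DNS; the namespace and object identifier stay opaque to it.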
I think we just need to be clear that there are, on the one hand, the BioGUIDs
themselves and how they are created/assigned and, on the other hand, the
method for someone else to find them later. On the Internet there are IP
addresses and how they are assigned (albeit potentially dynamically) and the
method for someone else to find them. Rarely do you get Internet references
given as an IP address (http://157.140.2.10) (certainly one reason is that
they can be dynamic) but rather as URLs (http://www.tdwg.org) that are then
resolved through the domain resolution process.
This approach enables the two things (number and name) to be managed
separately, yet linked together.
Don't we need the same two mechanisms for BioGUIDs? Does the LSID spec
address both parts of this? I guess I had better get over there and read
the whole thing.
Regarding:
>>IP addresses are in no way persistent.
>>IP addresses are not globally unique. [But they are unique on the
Internet]
Making analogies is tough because the details of the example can get in the
way, as in this case. IP addresses have several attributes. The point is
not whether IP addresses are analogous to BioGUIDs on the attributes of
static/changeable or universality. The point is that the IP is analogous
because it is a cryptic non-intelligible and unique number used to locate
another network device by a machine that starts out with no idea where it
is. That's the analogy - that IP and BioGUIDs are unique, nonintelligible
(ie. semantically opaque) strings of characters/numbers.
The issue is how in the world do you find the BioGUID you are trying to get
to? Do you just use the nonintelligible string? Or do we want somewhat
intelligible names as pointers to the nonintelligible?
Regarding:
>>The number of organizations to be issued Bio GUIDs surely is several
orders of magnitude less than those to be issued IPv6 addresses.
True. But, there are potentially billions of BioGUIDs to be created, which
could exceed the number of IPv6 addresses. Some institutions would be the
equivalent of a large Internet Registry managing millions of BioGUIDs.
This seems like a big issue to me. Would GBIF/TDWG or whoever issue
millions/billions of BioGUIDs directly to institution records and not
decentralize by assigning blocks of BioGUIDs to institutions?
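A rough sketch of what "assigning blocks of BioGUIDs to institutions" could
look like, assuming purely for illustration that identifiers are plain
integers. CentralIssuer and Institution are hypothetical names, not anything
GBIF or TDWG has proposed. The central body only guarantees that blocks do
not overlap; each institution applies identifiers from its own block
locally.

```python
from typing import Dict, List, Tuple


class CentralIssuer:
    """Hypothetical central body handing out non-overlapping ID blocks."""

    def __init__(self) -> None:
        self._next_free = 0
        self.ledger: List[Tuple[str, range]] = []  # who received which block

    def issue_block(self, institution: str, size: int) -> range:
        if size <= 0:
            raise ValueError("block size must be positive")
        block = range(self._next_free, self._next_free + size)
        self._next_free += size
        self.ledger.append((institution, block))
        return block


class Institution:
    """Hypothetical recipient that assigns IDs from its block to records."""

    def __init__(self, name: str, block: range) -> None:
        self.name = name
        self._unused = iter(block)
        self.assigned: Dict[int, str] = {}

    def assign(self, record: str) -> int:
        try:
            guid = next(self._unused)
        except StopIteration:
            raise RuntimeError("block exhausted; request another") from None
        self.assigned[guid] = record
        return guid


issuer = CentralIssuer()
museum = Institution("Museum A", issuer.issue_block("Museum A", 1000))
garden = Institution("Garden B", issuer.issue_block("Garden B", 500))
print(museum.assign("specimen record 1"), garden.assign("specimen record 1"))
```

The central ledger grows with the number of blocks issued, not with the
number of records, which is the point of delegating application.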
Also, when there are billions of something, efficiency becomes an issue. IP
addresses follow a binary number approach that leads to efficient
processing. Is there an efficiency issue lurking ahead for the BioGUID?
Regarding:
>>...name resolution protocols matter, but resolution implementations
don't...
This is true when you have working resolution implementations as the
Internet does. But does LSID have a working resolution implementation now?
If not, creating a new one that works is not that simple I think. That
would be a good reason for piggy-backing on the existing and working
Internet DNS/domain name resolution system as I've heard mentioned.
Chuck Miller
CIO
Missouri Botanical Garden
-----Original Message-----
From: Bob Morris [mailto:ram@CS.UMB.EDU]
Sent: Monday, September 27, 2004 10:34 PM
To: TDWG-SDD(a)LISTSERV.NHM.KU.EDU
Subject: Re: BioGUIDs and the Internet Analogy
[For me, the bottom line---which however I nowhere state below---is: There
is /so much/ existing free infrastructure source code---e.g.
http://www-124.ibm.com/developerworks/oss/lsid/--- and (apparently) funding,
and (manifestly) professionally designed specifications for LSID concerns
that I am horrified at the prospect of adopting anything else if LSID comes
even close to being what the community needs. Or, let's forget about LSID
and instead of deploying what satisfies 98% of the needs in six months, we
could roll our own and deploy what satisfies 80% of the needs in a few
years...
To my mind, the only question is this: is LSID good enough? So: first make
the requirements. Then examine existing solutions. Arguing by analogy can
lead to the "hammer" solution, usually attributed to Mark
Twain: When all you own is a hammer, every problem begins to look like a
nail. (Can someone give me an actual /reference/ to that quote????)]
-------------------------
The idea of modeling taxonomic uuids on the internet has been around for
about 10 years, and has been written about explicitly for at least 3-4 by
Hannu Saarenmaa and probably others. See his article in
http://reports.eea.eu.int/technical_report_2001_70/en/Technical%20Report%2070%20web.pdf
The Tree of Life http://www.tolweb.org is a de facto such model, though
without name authorities or persistence.
The design goals of TCP/IP and DNS, and their implementation, intersect the
requirements of Bio UUIDs only in a very small set, in fact, deep down
perhaps not at all.
These protocols and the associated address syntax were designed primarily
for /routing/, not in any way designed to guarantee that a datum twice
received has any connection between the two occurrences.
IP addresses are in no way persistent.
IP addresses are not globally unique, albeit in several small and varied
ways:
- The "private address blocks" 10.X.Y.Z, 172.16.0.0-172.31.255.255, and
192.168.X.Y may be assigned to any machine your network administrator and
mine care to, as long as neither is served on "the Internet" nor has any
duplication on "an internet".
- Every machine implementing IP, besides whatever other addresses it may
have, is always known to itself as 127.0.0.1.
- The addresses in some IP address ranges are reserved to designate /many/
machines (multicast IP).
In general the design of the IP "nomenclature"---i.e. the IP addresses---is
designed to solve routing problems, not identification problems. IP address
syntax is intentionally not "semantically opaque", contrary to a requirement
(well, a "should be") of LSIDs and well it should be. See Section 8 of
http://www.omg.org/cgi-bin/doc?dtc/04-05-01
[If you don't see why IP addressing is not semantically opaque, try this
exercise: With a single machine instruction (on most machines) how can you
determine whether or not an IP address is in the Class B private address
space (the 172 stuff above)? Sysadmins, C programmers, and their families
and employees are not eligible for this competition]
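For readers who would rather see one answer to the exercise, here is a
minimal Python sketch. It assumes the address is held as a 32-bit integer,
so the membership test reduces to a single bitwise AND followed by a
comparison. The function name in_172_private_block is hypothetical.

```python
import ipaddress

# 172.16.0.0 - 172.31.255.255 is the block 172.16.0.0/12:
# every address in it shares the same top 12 bits.
PREFIX = int(ipaddress.IPv4Address("172.16.0.0"))  # 0xAC100000
MASK = 0xFFF00000                                  # top 12 bits set


def in_172_private_block(addr: str) -> bool:
    """Return True if addr lies in 172.16.0.0 - 172.31.255.255."""
    return (int(ipaddress.IPv4Address(addr)) & MASK) == PREFIX


for candidate in ("172.16.0.0", "172.31.255.255", "172.32.0.1", "10.0.0.1"):
    print(candidate, in_172_private_block(candidate))
```

That the answer falls out of the bit pattern alone is what makes the
address syntax not semantically opaque.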
In turn, DNS protocols are designed only to aid discovery of IP addresses.
DNS addresses are also far from persistent (in fact, DNS records held
anywhere have a "Time To Live" field which must be counted down until
expiration, at which time the holder must acquire a new instance of the
assignment of an IP address to a domain name.).
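The Time To Live behaviour described above can be sketched as a small
cache. This is only an illustration under stated assumptions: TTLCache is a
hypothetical name, the clock is injectable so the example can be exercised
without waiting, and real DNS resolvers do considerably more than this.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Holds name -> value records that expire after their TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._records: Dict[str, Tuple[str, float]] = {}

    def put(self, name: str, value: str, ttl_seconds: float) -> None:
        # Store the value together with the moment it stops being valid.
        self._records[name] = (value, self._clock() + ttl_seconds)

    def get(self, name: str) -> Optional[str]:
        entry = self._records.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: the holder must go and fetch a fresh record.
            del self._records[name]
            return None
        return value


now = [0.0]                      # fake clock, in seconds
cache = TTLCache(clock=lambda: now[0])
cache.put("www.tdwg.org", "192.0.2.10", ttl_seconds=60)
print(cache.get("www.tdwg.org"))  # still within the TTL
now[0] = 61.0
print(cache.get("www.tdwg.org"))  # TTL has run out
```

The address 192.0.2.10 is a documentation placeholder, not the real address
of any host named here.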
[More below, interspersed]
Richard Pyle wrote:
>>Perhaps it would be useful to look at the issues being discussed about
>>a bio identifier/locator/GUID in comparison to the same things that
>>are needed for Internet communications.
>
>
> I've long thought that parts of the DNS system would be extremely
> useful to emulate in some aspects of bioinformatics data management
> (particularly taxonomic names; see below).
>
>
>>IP addresses have to be unique world-wide to make the Internet work.
>>The Internet Corporation for Assigned Names and Numbers (ICANN-
>
> www.icann.org)
>
>>provides that uniqueness by assigning all the IP numbers in unique
>>blocks
>
> or
>
>>ranges of numbers to "Internet Registries".
>
>
> ...exactly the way that I envision an organization like GBIF would be
> charged with the task of issuing UIDs for certain biological objects.
>
>
>>There are Regional, National and Local Internet Registries that
>>subdivide
>
> and
>
>>"license" IP addresses to ISPs, who in turn license IP addresses to
>
> organizations.
Ah, this is accurate mainly for IPv6, which is much less chaotic than IPv4
(most of the current Internet) and in turn from the nearly formless void
that was IPv2. But again, ultimate "licensees" do not get persistent IP
addresses. In the US, virtually all dialup users get a different IP every
time they connect, and most home broadband users only accidentally keep their
IP addresses, and only if they don't disconnect very long from the network.
It's well worth comparing the design goals of IPv6 as articulated in
http://www.apnic.net/docs/policy/ipv6-address-policy.html
with those of LSID as articulated in Section 8 of the Draft Final
Specification http://www.omg.org/cgi-bin/doc?dtc/04-05-01
>
> There could be a useful analog for this in bioinformatics
> (particularly in terms of individual institutions serving as regional
> registries for specimen UIDs, or IC_N Commissions serving as
> "regional" registries for taxon name UID assignment) -- but there
> doesn't necessarily have to be.
>
In fact, if the UUIDs are meant to be semantically opaque it matters not one
whit who or how these matters are settled. Exceptions to that are social,
not technical. ("If you don't let me decide X, I am not going to use your
scheme". "OK, then you won't participate in its benefits. That's fine with
me")
If you want to see another example of lack of semantic opacity, read the
ISBN standard ISO/TC 46/SC 9 N 326. Part of certain ISBNs can help you
determine an allegedly common publishing-germane attribute of the US,
Zimbabwe, Puerto Rico, Ireland, Swaziland, part of Canada, and a few other
"regions".
>
>>So, there is a hierarchy of how the "unique identifiers" are managed.
>
> There is
>
>>in fact a central authority, but it delegates to decentralized
>
> authorities.
But this is mainly to distribute costs and speed issuance. It has nothing to
do with the naming scheme. The number of organizations to be issued Bio
GUIDs surely is several orders of magnitude less than those to be issued
IPv6 addresses. So I doubt any IPv6 issuance mechanisms are instructive, at
least in their purpose (and hence, if well implemented, in their
implementation).
>
> To emulate this in bioinformatics, the "hierarchy" would be achieved
> simply by allowing block-assignment of UIDs to various players -- but
> the important point here is that only *one* organization ensures
> uniqueness (in the case of Internet, of ISPs). The data to which
> those UIDs apply would be, for the most part, the responsibility of
> the UID recipient, not the UID issuer (in my world view). Thus:
> centralized issuance; delegated application.
>
>
>>Is there an analogy for BioGUIDs to have a central body who divvies
>>out
>
> the
>
>>unique numbers (like IP addresses) to decentralized bodies or large
>
> organizations?
The International ISBN Organization http://www.isbn-international.org/
is roughly the IPv6 model.
>
> GBIF seems to me to be the principle contender.
I enthusiastically agree. Also the /principal/ contender. [Sorry, couldn't
resist. My fingers slip on that one sometimes too.]
>
>
>>Since IP addresses are hard to memorize (and so too would be a
>>BioGUID),
>
> "domain names"
>
>>are used. Starting with a domain name, you can first find the name
>>and/or
>
> IP address
>
>>of a device, called the Domain Name Server, that can locate the IP
>>address
>
> of other
>
>>computers. This is a form of indirect addressing. ICANN also manages
>>the
>
> top-level
>
>>namespace for the Internet. They decide what the valid domain
>>"extensions"
>
> are (like
>
>>.com, .uk) so that everybody, everywhere knows where to look them up.
>
> Then, the domain
>
>>name extensions are separated among the Regional, National, and Local
>
> Internet Registries
>
>>around the world. There is a scheme for where to find the IP
>>addresses
>
> for every domain
>
>>extension (e.g. .com is on the ARIN registry, .com.uk is on the ).
Not exactly. There is one scheme in case your application can't resolve it
in a more nearly "local" facility. There are /lots/ of ways to find an IP
address from a domain name. All those which comply fully with the DNS
protocol, however, can make available two pieces of metadata: the TTL of the
record it is offering, and the IP address of a machine at which you can find
an authoritative record of the assignment of the dns name to the IP address.
This protocol /might/, but you hope on performance grounds usually
/doesn't/, lead you up as far as the root servers, and the "one scheme to
bind them all". If there is any lesson here at all, it is that name
resolution protocols matter, but resolution implementations don't. Yet
another attribute on which DNS/IP and LSID are not distinguishable.
>>Then there is a layer of Domain Registrars who have been accredited by
>
> ICANN to assign
>
>>domain names for the domain extensions - e.g. tdwg.org.
>>The domain name registrars are told by the owner of the domain where
>>to
>
> find their particular
>
>>Domain Name Servers which may be many to enable redundancy - Primary,
>
> Secondary, Tertiary,
>
>>etc.
Not quite. Normally, the registrant tells the registrar who has agreed to be
the servers. The other case sometimes happens with "retailers" who are
selling individuals domain names and ISP services at the same time.
>>These redundant Domain Name Servers synchronize with each other at
>
> particular times
>
>>of day and may be located all around the world.
More often, only when the TTLs expire, there being no motivation to do
otherwise.
>>They are the main
>
> "switchboard" for a
>
>>particular organization's computer names and associated IP addresses.
>>Then the individual organization can create multiple computers for the
>
> domain name - e.g.
>
>>www.tdwg.org - and add them to the Domain Name Server listing. There
>>can
>
> be many computers
>
>>for a domain, for instance: info.tdwg.org, www2.tdwg.org,
>>myname.tdwg.org.
>
> Each of these
>
>>can be a different computer with a different IP address. The
>>redundant
>
> Domain Name Servers
>
>>all contain the list of all these names and what IP addresses they
>>are.
Not usually. The primary and secondary name servers would normally only
cache tdwg.org permanently. They might /acquire/ a record for www.tdwg.org
in response to a request, but they would not in general renew it after it
expired and maybe not even keep it that long. To do so would be hideously
unscalable. If I put 10,000 machines in my domain, my primary and secondary
would be mighty unhappy if they had to keep them all cached.
>
>
> This is analogous in many ways to how I would envision a global
> taxonomic name service. UIDs are assigned by a centralized body
> (e.g., GBIF; or by the IC_N Commissions) to individual names.
> Analogous to multiple redundant Domain Name Servers (DNS) would be
> Taxon Name Servers (TNS). Rather than administered by one
> organization (e.g., GBIF, ITIS, Species 2000, uBio,
> etc.) these TNSs would be replicated on dozens or hundreds of servers all
> over the world, and maintained as synchronized within some reasonable time
> unit. Changes to any one replicate would be automatically propagated to all
> replicates (either chaotically, or more strictly through one or a few
> defined "hubs"). Instead of Domain names as surrogates for IP addresses,
> there would be fully qualified "Basionyms" (e.g.,
> "OriginalGenusName.OriginalSpeciesName.OriginalAuthor.DescriptionYear.Page.OtherOriginalCitationDetailsAsNeeded")
> representations of the less-human-friendly GUIDs (analogues to IP
> addresses). Ideally, this system wouldn't be limited to just taxonomic
> names, but extended to all taxonomic concepts, so that the "Domain Name"
> analogue would be extended to something like:
>
> "OriginalGenusName.OriginalSpeciesName.OriginalAuthor.DescriptionYear.Page.OtherOriginalCitationDetailsAsNeeded_AppliedGenusName.AppliedSpeciesSpelling.ConceptAuthor.ConceptYear.Page.OtherConceptCitationDetailsAsNeeded"
>
>
>>The players in the Internet networking fabric all now play by these
>
> layered rules.
>
>>They all know them and follow them in order to keep the Internet
>>running.
>
> This
>
>>stuff happens out of sight to everyone but the networking people and
>>we
>
> all take it
>
>>for granted and assume it is simple. But, it's invisible not because
>>it's
>
> simple,
>
>>but rather because it's disciplined.
Agreed. Yet one more attribute where IP/DNS and LSID are not
distinguishable.
>
>
> Excellent synopsis, and (in my opinion) an excellent model to follow
> for at least taxonomic names/concepts data. Perhaps also for specimen
> data (but seems less intuitive for that.) This comes back to my
> earlier question about whether it is vital that all bioinformatics
> GUIDs be of the same scheme; or whether different schemes might be
> optimal for different classes of objects.
>
> Aloha,
> Rich
>
> Richard L. Pyle, PhD
> Natural Sciences Database Coordinator, Bishop Museum
> 1525 Bernice St., Honolulu, HI 96817
> Ph: (808)848-4115, Fax: (808)847-8252
> email: deepreef(a)bishopmuseum.org
> http://www.bishopmuseum.org/bishop/HBS/pylerichard.html