tdwg-content
Rich
No reading I can make of http://www.omg.org/cgi-bin/doc?dtc/2004-05-01
is consistent with the explanation offered below. "Authority" is not
used at all in Section 9, "LSID Resolution Service". Instead, resolution
is defined to be accomplished by a set of interfaces all of which take
only a full (semantically opaque!!!! Sec. 8, p. 7!!!) LSID as argument.
The interfaces correspond to methods offered by an LSID Resolution
Service. Nowhere in Section 9 is there any relationship mentioned
between resolution and the "authority identification" that is part of
the syntax of an LSID. That is discussed in Section 8, "LSID Syntax"
which carries the sentences (bottom of p. 7) that I have previously
cited: "The authority identification is usually an Internet domain name.
In this case it is recommended that it be owned by the organization that
assigns an LSID in question".
There are too many ">"s for me to understand who is claiming the stuff
below, but whoever it is could do me a favor by telling me which part of
the spec they are reading. (Or if they found a later document at OMG).
Well, OK, I haven't yet read the "Accompanied Files" listed in Appendix
A., some of which are normative and take precedence over the main
document. Maybe the explanation comes from one of them.
Richard Pyle wrote:
> Many thanks for jumping in on this, Dave!
>
>
>>As I think Dave V. and others have pointed out, when an LSID is resolved,
>>DNS is used to find the LSID authority. The LSID authority then provides
>>information about how the LSID can be served up (e.g. HTTP, SOAP, FTP),
>>and where to get the data behind the LSID and associated metadata. If I
>>start serving up LSIDs with the authority learningsite.com and later
>>decide that I'm sick of serving up LSIDs, somebody else can take over
>>serving up the data and the metadata. However, I (or they) still bear the
>> responsibility of running the authority which points to the data. If my
>>lsids have an authority like lsid.learningsite.com
>>(urn:lsid:lsid.learningsite.com:foo:bar) then someone else can take over
>>the authority by taking over lsid.learningsite.com and I can still have
>>www.learningsite.com, mail.learningsite.com, etc... for myself. So, with a
>>little planning, it's not so hard to deal with an authority going away as
>>long as the people running it are responsible.
>
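The first step of the resolution flow quoted above (DNS is used to find the LSID authority) can be sketched as follows. This is only an illustration under two assumptions: that the authority part of the LSID is a DNS name, and that the resolution service is advertised through an SRV record named `_lsid._tcp.<authority>`. That record name is an assumption to be checked against section 13.3 of the spec. The function names are hypothetical and no network lookup is performed.

```python
def lsid_authority(lsid: str) -> str:
    """Return the authority part of an LSID such as urn:lsid:authority:ns:obj."""
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    return parts[2].lower()


def srv_query_name(lsid: str) -> str:
    """Build the DNS name a client would query to locate the authority.

    Assumption: the service is published as an SRV record under
    _lsid._tcp.<authority>; verify against the spec before relying on it.
    """
    return f"_lsid._tcp.{lsid_authority(lsid)}"


query = srv_query_name("urn:lsid:lsid.learningsite.com:foo:bar")
```

Because only the DNS name `lsid.learningsite.com` is involved, whoever controls that name can take over the authority without touching `www.learningsite.com` or `mail.learningsite.com`.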
Hi Bob,
Those are my >>s so I'll respond. Below I'm talking about the stuff in
section 13.3 of the spec. Primarily page 27.
Dave
> Rich
>
> No reading I can make of http://www.omg.org/cgi-bin/doc?dtc/2004-05-01
> is consistent with the explanation offered below. "Authority" is not
> used at all in Section 9, "LSID Resolution Service". Instead, resolution
> is defined to be accomplished by a set of interfaces all of which take
> only a full (semantically opaque!!!! Sec. 8, p. 7!!!) LSID as argument.
> The interfaces correspond to methods offered by an LSID Resolution
> Service. Nowhere in Section 9 is there any relationship mentioned
> between resolution and the "authority identification" that is part of
> the syntax of an LSID. That is discussed in Section 8, "LSID Syntax"
> which carries the sentences (bottom of p. 7) that I have previously
> cited: "The authority identification is usually an Internet domain name.
> In this case it is recommended that it be owned by the organization that
> assigns an LSID in question".
>
> There are too many ">"s for me to understand who is claiming the stuff
> below, but whoever it is could do me a favor by telling me which part of
> the spec they are reading. (Or if they found a later document at OMG).
> Well, OK, I haven't yet read the "Accompanied Files" listed in Appendix
> A., some of which are normative and take precedence over the main
> document. Maybe the explanation comes from one of them.
>
> Richard Pyle wrote:
>
>> Many thanks for jumping in on this, Dave!
>>
>>
>>>As I think Dave V. and others have pointed out, when an LSID is resolved,
>>>DNS is used to find the LSID authority. The LSID authority then provides
>>>information about how the LSID can be served up (e.g. HTTP, SOAP, FTP),
>>>and where to get the data behind the LSID and associated metadata. If I
>>>start serving up LSIDs with the authority learningsite.com and later
>>>decide that I'm sick of serving up LSIDs, somebody else can take over
>>>serving up the data and the metadata. However, I (or they) still bear the
>>>responsibility of running the authority which points to the data. If my
>>>lsids have an authority like lsid.learningsite.com
>>>(urn:lsid:lsid.learningsite.com:foo:bar) then someone else can take over
>>>the authority by taking over lsid.learningsite.com and I can still have
>>>www.learningsite.com, mail.learningsite.com, etc... for myself. So, with a
>>>little planning, it's not so hard to deal with an authority going away as
>>>long as the people running it are responsible.
>>
>
>
[For me, the bottom line---which however I nowhere state below---is:
There is /so much/ existing free infrastructure source code---e.g.
http://www-124.ibm.com/developerworks/oss/lsid/--- and (apparently)
funding, and (manifestly) professionally designed specifications for
LSID concerns that I am horrified at the prospect of adopting anything
else if LSID comes even close to being what the community needs. Or,
let's forget about LSID and instead of deploying what satisfies 98% of
the needs in six months, we could roll our own and deploy what satisfies
80% of the needs in a few years...
To my mind, the only question is this: is LSID good enough? So: first
make the requirements. Then examine existing solutions. Arguing by
analogy can lead to the "hammer" solution, usually attributed to Mark
Twain: When all you own is a hammer, every problem begins to look like a
nail. (Can someone give me an actual /reference/ to that quote????)]
-------------------------
The idea of modeling taxonomic uuids on the internet has been around for
about 10 years, and has been written about explicitly for at least 3-4
by Hannu Saarenmaa and probably others. See his article in
http://reports.eea.eu.int/technical_report_2001_70/en/Technical%20Report%20…
The Tree of Life http://www.tolweb.org is a de facto model of this kind, though
without name authorities or persistence.
The design goals of TCP/IP and DNS, and their implementation, intersect
the requirements of Bio UUIDs only in a very small set, in fact, deep
down perhaps not at all.
These protocols and the associated address syntax were designed
primarily for /routing/, not in any way designed to guarantee that a
datum twice received has any connection between the two occurrences.
IP addresses are in no way persistent.
IP addresses are not globally unique, albeit in several small and varied
ways:
- The "private address blocks" 10.X.Y.Z, 172.16.0.0-172.31.255.255, and
192.168.X.Y may be assigned to any machine your and my network
administrators care to, as long as neither is served on "the Internet",
nor has any duplication on "an internet".
- every machine implementing IP, besides whatever other addresses it may
have, is always known to itself as 127.0.0.1;
- the addresses in some IP address ranges are reserved to designate
/many/ machines (multicast IP).
In general the IP "nomenclature"---i.e. the IP addresses---is designed
to solve routing problems, not identification problems. IP address
syntax is intentionally not "semantically opaque", contrary to a
requirement (well, a "should be") of LSIDs, and well it should be. See
Section 8 of http://www.omg.org/cgi-bin/doc?dtc/04-05-01
[If you don't see why IP addressing is not semantically opaque, try this
exercise: With a single machine instruction (on most machines) how can
you determine whether or not an IP address is in the Class B private
address space (the 172 stuff above)? Sysadmins, C programmers, and their
families and employees are not eligible for this competition]
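One possible answer to the exercise, as a sketch: the 172.16.0.0-172.31.255.255 block shares its top 12 bits, so a bitwise AND with a fixed mask isolates the prefix, which is then compared with the network number. The function and constant names are illustrative only. Python's `ipaddress` module is used just to turn dotted notation into a 32-bit integer.

```python
import ipaddress

PRIVATE_172_NETWORK = 0xAC100000  # 172.16.0.0 as a 32-bit integer
PRIVATE_172_MASK = 0xFFF00000     # top 12 bits set, i.e. a /12 prefix


def in_172_private_block(address: str) -> bool:
    """True if the IPv4 address lies in 172.16.0.0-172.31.255.255."""
    value = int(ipaddress.IPv4Address(address))
    # The address bits themselves carry the meaning: masking off the low
    # 20 bits leaves only the routing prefix to compare.
    return (value & PRIVATE_172_MASK) == PRIVATE_172_NETWORK
```

That a mask exposes meaning in the identifier is exactly what "not semantically opaque" amounts to here.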
In turn, DNS protocols are designed only to aid discovery of IP
addresses. DNS addresses are also far from persistent (in fact, DNS
records held anywhere have a "Time To Live" field which must be counted
down until expiration, at which time the holder must acquire a new
instance of the assignment of an IP address to a domain name.).
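The Time To Live behaviour described above can be illustrated with a small cache. This is a sketch of the idea only, not of any real resolver: the class name, the injectable clock, and the example address (taken from a documentation-only range) are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Holds name-to-address records only until their TTL runs out."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._records: Dict[str, Tuple[str, float]] = {}

    def put(self, name: str, address: str, ttl_seconds: float) -> None:
        # Store the address together with the moment it stops being valid.
        self._records[name] = (address, self._clock() + ttl_seconds)

    def get(self, name: str) -> Optional[str]:
        record = self._records.get(name)
        if record is None:
            return None
        address, expires_at = record
        if self._clock() >= expires_at:
            # Expired: the holder must acquire a fresh record elsewhere.
            del self._records[name]
            return None
        return address
```

Nothing in such a cache promises that the address returned after a refresh is the one returned before, which is the sense in which these records are not persistent.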
[More below, interspersed]
Richard Pyle wrote:
>>Perhaps it would be useful to look at the issues being discussed about
>>a bio identifier/locator/GUID in comparison to the same things that are
>>needed for Internet communications.
>
>
> I've long thought that parts of the DNS system would be extremely useful to
> emulate in some aspects of bioinformatics data management (particularly
> taxonomic names; see below).
>
>
>>IP addresses have to be unique world-wide to make the Internet work.
>>The Internet Corporation for Assigned Names and Numbers (ICANN-
>>www.icann.org) provides that uniqueness by assigning all the IP numbers
>>in unique blocks or ranges of numbers to "Internet Registries".
>
>
> ...exactly the way that I envision an organization like GBIF would be
> charged with the task of issuing UIDs for certain biological objects.
>
>
>>There are Regional, National and Local Internet Registries that subdivide
>>and "license" IP addresses to ISPs, who in turn license IP addresses to
>>organizations.
Ah, this is accurate mainly for IPv6, which is much less chaotic than
IPv4 (most of the current Internet) and in turn from the nearly formless
void that was IPv2. But again, ultimate "licensees" do not get
persistent IP addresses. In the US, virtually all dialup users get a
different IP every time they connect, and most home broadband users only
accidentally keep their IP addresses, and only if they don't disconnect
for very long from the network.
It's well worth comparing the design goals of IPv6 as articulated in
http://www.apnic.net/docs/policy/ipv6-address-policy.html
with those of LSID as articulated in Section 8 of the Draft Final
Specification http://www.omg.org/cgi-bin/doc?dtc/04-05-01
>
> There could be a useful analog for this in bioinformatics (particularly in
> terms of individual institutions serving as regional registries for specimen
> UIDs, or IC_N Commissions serving as "regional" registries for taxon name
> UID assignment) -- but there doesn't necessarily have to be.
>
In fact, if the UUIDs are meant to be semantically opaque it matters not
one whit who or how these matters are settled. Exceptions to that are
social, not technical. ("If you don't let me decide X, I am not going to
use your scheme". "OK, then you won't participate in its benefits.
That's fine with me")
If you want to see another example of lack of semantic opacity, read the
ISBN standard ISO/TC 46/SC 9 N 326. Part of certain ISBNs can help you
determine an allegedly common publishing-germane attribute of the US,
Zimbabwe, Puerto Rico, Ireland, Swaziland, part of Canada, and a few
other "regions".
>
>>So, there is a hierarchy of how the "unique identifiers" are managed.
>>There is in fact a central authority, but it delegates to decentralized
>>authorities.
But this is mainly to distribute costs and speed issuance. It has
nothing to do with the naming scheme. The number of organizations to be
issued Bio GUIDs surely is several orders of magnitude smaller than the
number to be issued IPv6 addresses. So I doubt any IPv6 issuance mechanisms are
instructive, at least in their purpose (and hence, if well implemented,
in their implementation).
>
> To emulate this in bioinformatics, the "hierarchy" would be achieved simply
> by allowing block-assignment of UIDs to various players -- but the important
> point here is that only *one* organization ensures uniqueness (in the case
> of Internet, of ISPs). The data to which those UIDs apply would be, for the
> most part, the responsibility of the UID recipient, not the UID issuer (in
> my world view). Thus: centralized issuance; delegated application.
>
>
>>Is there an analogy for BioGUIDs to have a central body who divvies out
>>the unique numbers (like IP addresses) to decentralized bodies or large
>>organizations?
The International ISBN Organization http://www.isbn-international.org/
is roughly the IPv6 model.
>
> GBIF seems to me to be the principle contender.
I enthusiastically agree. Also the /principal/ contender. [Sorry,
couldn't resist. My fingers slip on that one sometimes too.]
>
>
>>Since IP addresses are hard to memorize (and so too would be a BioGUID),
>>"domain names" are used. Starting with a domain name, you can first find
>>the name and/or IP address of a device, called the Domain Name Server,
>>that can locate the IP address of other computers. This is a form of
>>indirect addressing. ICANN also manages the top-level namespace for the
>>Internet. They decide what the valid domain "extensions" are (like .com,
>>.uk) so that everybody, everywhere knows where to look them up. Then, the
>>domain name extensions are separated among the Regional, National, and
>>Local Internet Registries around the world. There is a scheme for where
>>to find the IP addresses for every domain extension (e.g. .com is on the
>>ARIN registry, .com.uk is on the ).
Not exactly. There is one scheme in case your application can't resolve
it in a more nearly "local" facility. There are /lots/ of ways to find
an IP address from a domain name. All those which comply fully with the
DNS protocol, however, can make available two pieces of metadata: the
TTL of the record it is offering, and the IP address of a machine at
which you can find an authoritative record of the assignment of the dns
name to the IP address. This protocol /might/, but you hope on
performance grounds usually /doesn't/, lead you up as far as the root
servers, and the "one scheme to bind them all". If there is any lesson
here at all, it is that name resolution protocols matter, but resolution
implementations don't. Yet another attribute on which DNS/IP and LSID
are not distinguishable.
>>Then there is a layer of Domain Registrars who have been accredited by
>>ICANN to assign domain names for the domain extensions - e.g. tdwg.org.
>>The domain name registrars are told by the owner of the domain where to
>>find their particular Domain Name Servers which may be many to enable
>>redundancy - Primary, Secondary, Tertiary, etc.
Not quite. Normally, the registrant tells the registrar who has agreed
to be the servers. The other case sometimes happens with "retailers" who
are selling individuals domain names and ISP services at the same time.
>>These redundant Domain Name Servers synchronize with each other at
>>particular times of day and may be located all around the world.
More often, only when the TTLs expire, there being no motivation to do
otherwise.
>>They are the main "switchboard" for a particular organization's computer
>>names and associated IP addresses.
>>Then the individual organization can create multiple computers for the
>>domain name - e.g. www.tdwg.org - and add them to the Domain Name Server
>>listing. There can be many computers for a domain, for instance:
>>info.tdwg.org, www2.tdwg.org, myname.tdwg.org. Each of these can be a
>>different computer with a different IP address. The redundant Domain Name
>>Servers all contain the list of all these names and what IP addresses
>>they are.
Not usually. The primary and secondary name servers would normally only
cache tdwg.org permanently. They might /acquire/ a record for
www.tdwg.org in response to a request, but they would not in general
renew it after it expired and maybe not even keep it that long. To do so
would be hideously unscalable. If I put 10,000 machines in my domain, my
primary and secondary would be mighty unhappy if they had to keep them
all cached.
>
>
> This is analogous in many ways to how I would envision a global taxonomic
> name service. UIDs are assigned by a centralized body (e.g., GBIF; or by
> the IC_N Commissions) to individual names. Analogous to multiple redundant
> Domain Name Servers (DNS) would be Taxon Name Servers (TNS). Rather than
> administered by one organization (e.g., GBIF, ITIS, Species 2000, uBio,
> etc.) these TNSs would be replicated on dozens or hundreds of servers all
> over the world, and maintained as synchronized within some reasonable time
> unit. Changes to any one replicate would be automatically propagated to all
> replicates (either chaotically, or more strictly through one or a few
> defined "hubs"). Instead of Domain names as surrogates for IP addresses,
> there would be fully qualified "Basionyms" (e.g.,
> "OriginalGenusName.OriginalSpeciesName.OriginalAuthor.DescriptionYear.Page.OtherOriginalCitationDetailsAsNeeded")
> representations of the less-human-friendly GUIDs (analogues to IP
> addresses). Ideally, this system
> wouldn't be limited to just taxonomic names, but extended to all taxonomic
> concepts, so that the "Domain Name" analogue would be extended to something
> like:
>
> "OriginalGenusName.OriginalSpeciesName.OriginalAuthor.DescriptionYear.Page.OtherOriginalCitationDetailsAsNeeded_AppliedGenusName.AppliedSpeciesSpelling.ConceptAuthor.ConceptYear.Page.OtherConceptCitationDetailsAsNeeded"
>
>
>>The players in the Internet networking fabric all now play by these
>>layered rules. They all know them and follow them in order to keep the
>>Internet running. This stuff happens out of sight to everyone but the
>>networking people and we all take it for granted and assume it is simple.
>>But, it's invisible not because it's simple, but rather because it's
>>disciplined.
Agreed. Yet one more attribute where IP/DNS and LSID are not
distinguishable.
>
>
> Excellent synopsis and (in my opinion) an excellent model to follow for
> at least taxonomic names/concepts data. Perhaps also for specimen data (but
> seems less intuitive for that.) This comes back to my earlier question about
> whether it is vital that all bioinformatics GUIDs be of the same scheme; or
> whether different schemes might be optimal for different classes of objects.
>
> Aloha,
> Rich
>
> Richard L. Pyle, PhD
> Natural Sciences Database Coordinator, Bishop Museum
> 1525 Bernice St., Honolulu, HI 96817
> Ph: (808)848-4115, Fax: (808)847-8252
> email: deepreef(a)bishopmuseum.org
> http://www.bishopmuseum.org/bishop/HBS/pylerichard.html
Hi,
I found the PowerPoint document from Donald really helpful on some of the
discussed issues. Unfortunately referring to it is a little difficult
because the pages do not contain a unique identifier :)
But first some real-world experience with the 'GUID' used today in GBIF
specimen network.
The combination InstitutionCode, CollectionCode and CatalogNumber was chosen
for that. Problems experienced last year:
-the 'GUID combination' is not enforced and therefore not always used
-Some collections belong to 2 or more Institutions or to none
-If part of the collection moves to another institute, the guid combination
is changed for that part.
-The InstitutionCode should be unique, and providers were asking what to do
if the code they wanted to use was already chosen, and who decides which
institute may use an institutioncode if two institutes want to use it. There
is no body responsible for that and there are no rules: the first Institute
can claim a code, or the biggest or the most well known??
-In different science areas different InstitutionCodes within one
Organisation were in use, so which one to choose?
-This 'GUID' can only be used for specimen, not for other life science
objects.
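One of the problems listed above, the identifier changing when part of a collection moves, can be made concrete with a small sketch. The field names follow the three elements named in the text; the code values are invented placeholders, not real institution or collection codes.

```python
from typing import NamedTuple


class SpecimenKey(NamedTuple):
    """The three-part combination used as a 'GUID' for specimens."""
    institution_code: str
    collection_code: str
    catalog_number: str


# Hypothetical specimen before and after its collection moves institutes.
before_move = SpecimenKey("INST-A", "Birds", "12345")
after_move = before_move._replace(institution_code="INST-B")

# Same physical specimen, but the two keys no longer compare equal.
same_identifier = before_move == after_move
```

Any reference that stored the first key has no way to find the specimen under the second, which is the persistence problem in miniature.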
Now let's look at LSID syntax:
urn:lsid:authority:namespace:object_identifier (:revision_number)
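As a rough illustration of that syntax, the sketch below splits an LSID into its parts. It assumes the colon-separated layout shown above with an optional trailing revision. It does not attempt the full validation rules of the specification, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lsid:
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:authority:namespace:object_identifier[:revision]."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"missing urn:lsid prefix: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"empty authority, namespace or object id: {text!r}")
    # An absent or empty sixth part is treated as "no revision".
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(parts[2], parts[3], parts[4], revision)
```

For example, `parse_lsid("urn:lsid:lsid.learningsite.com:foo:bar")` yields the authority `lsid.learningsite.com`, the namespace `foo`, the object identifier `bar`, and no revision.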
About the first part; authority:
It is natural to want this to be unique. Therefore we can expect the same
problems as mentioned above, plus a lack of clarity about the difference
between issuing_authority and current_authority for the data.
The problems with authority are important for the involved authorities only,
not for the rest of the life science community. So discussions about it and
establishing an authority that takes decisions in political conflicts are a
waste of time.
We can solve it by using a unique number only and maintaining a list that
gives information about each number. It should be clear that these are only
the initial issuing authority/authorities.
About the second part; namespace:
Things like 'Specimen' or 'Experiment'. In contrast with the first part,
problems with this part are interesting for the whole lifescience community
because applications will want to use this to decide whether the data can be
used for a specific application. Standardisation of namespaces is necessary.
I think it should be divided into two parts (not currently present in LSID)
like MIME type image/jpeg etc: 'observation/abcd' or 'specimen/darwincore'
for example.
If we look at Donald's PPT we see that in model 1 (LSID assigned
centrally) the namespace is chosen centrally, and in model 2 (LSID assigned by
each provider) the provider is free to choose one. Even language variations
in naming a namespace can already cause problems, which is why I strongly
favor a central mechanism here for assigning LSIDs, unless the provider is
somehow forced to use a certain namespace class. The potential bottleneck
problem is not really an issue I think (see also DNS mechanisms). If we
choose a central mechanism the issuing authority will always be GBIF (or do we
need different authorities for different parts of Life Science?), so there is
no problem with authority in this case either.
The third part; object id: no problems there.
The last part; revision id: whether you need it depends: do you give the
physical objects a GUID or the data records? With the first choice you do
not need a revision number because the physical object will not change (or
do they with living collections?).
At first I thought that a GUID should be put on the physical object: if you are
looking for data, you are looking for data about a certain physical object,
the source of the data is not (very) important. The same data elements in
different sources about the same object should be equal, else there are
errors. Donald's PPT gives the example that someone wants to refer to an LSID
in a publication as a source. In that case you want to refer to a data
source with a certain version. Then you need to give a GUID to the data and
also you need revisions. Data is not persistent, it changes all the time.
Giving a persistent identifier to it is very difficult and not many data
systems have full revisions support. If a GUID for a 'physical object' is
chosen, a thing like a species name or author name or country should not get
a GUID. These are more a kind of attributes: most data will use one or more
species names as 'metadata'. There needs to be a central datasource for each
of these 'metadata', like a NameBank for species names (with its own ID). I
am not sure whether LSID was designed for a GUID to data or to physical
objects. The use of namespace and object id instead of databasename and
recordid seems to indicate that it was designed for physical objects, but
why then the optional revision id? Instead of a revision id you can also
assign a new GUID with every change, but then how to point to a new version
from an old version of data (if you have the GUID of the old version, how to
get the GUID of the new one).
Requirements in Donald's PPT:
-if a GUID is on a physical object, the GUID need not refer uniquely to a
single data element, it must only be unique itself. It is also not a
requirement in the LSID specification. There will be overlap between the
objects, so an object can belong to more than one ID. For instance a
researcher can have his own ID and also belong to the ID of the Institute he
is working for. The data for overlapping elements like researcher name must
be equal.
-I would restrict the identifiers to life science objects.
Issues to be resolved in Donald's PPT:
It would be beneficial to maintain the GUID in the datasource itself (at
least for the owner of the datasource), but not absolutely necessary. I see
GUID in data records as a 'tightly coupled' model (which requires some work
for existing databases). I can imagine also a 'loosely coupled' model where
provider software is modified to get the identifier from a central server
(or mirror).
Wouter Addink
> Perhaps it would be useful to look at the issues being discussed about
> a bio identifier/locator/GUID in comparison to the same things that are
> needed for Internet communications.
I've long thought that parts of the DNS system would be extremely useful to
emulate in some aspects of bioinformatics data management (particularly
taxonomic names; see below).
> IP addresses have to be unique world-wide to make the Internet work.
> The Internet Corporation for Assigned Names and Numbers (ICANN-
> www.icann.org) provides that uniqueness by assigning all the IP numbers
> in unique blocks or ranges of numbers to "Internet Registries".
...exactly the way that I envision an organization like GBIF would be
charged with the task of issuing UIDs for certain biological objects.
> There are Regional, National and Local Internet Registries that subdivide
> and "license" IP addresses to ISPs, who in turn license IP addresses to
> organizations.
There could be a useful analog for this in bioinformatics (particularly in
terms of individual institutions serving as regional registries for specimen
UIDs, or IC_N Commissions serving as "regional" registries for taxon name
UID assignment) -- but there doesn't necessarily have to be.
> So, there is a hierarchy of how the "unique identifiers" are managed.
> There is in fact a central authority, but it delegates to decentralized
> authorities.
To emulate this in bioinformatics, the "hierarchy" would be achieved simply
by allowing block-assignment of UIDs to various players -- but the important
point here is that only *one* organization ensures uniqueness (in the case
of Internet, of ISPs). The data to which those UIDs apply would be, for the
most part, the responsibility of the UID recipient, not the UID issuer (in
my world view). Thus: centralized issuance; delegated application.
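"Centralized issuance; delegated application" can be sketched as a single issuer that hands out non-overlapping blocks of numbers and keeps only a ledger of who received which block. This is a toy model under assumed names (`BlockIssuer`, the recipient labels, and the block size are all invented), not a proposal for how GBIF or anyone else would implement it.

```python
from typing import Dict, List


class BlockIssuer:
    """Hypothetical central body that guarantees uniqueness of UID blocks."""

    def __init__(self, block_size: int) -> None:
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        self._block_size = block_size
        self._next_start = 0
        self.ledger: Dict[str, List[range]] = {}

    def issue_block(self, recipient: str) -> range:
        # Each block starts where the previous one ended, so no two
        # recipients can ever hold the same number.
        block = range(self._next_start, self._next_start + self._block_size)
        self._next_start = block.stop
        self.ledger.setdefault(recipient, []).append(block)
        return block


issuer = BlockIssuer(block_size=1000)
museum_block = issuer.issue_block("museum-a")
herbarium_block = issuer.issue_block("herbarium-b")
```

The issuer never sees the data: what each number is applied to stays entirely with the recipient of the block.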
> Is there an analogy for BioGUIDs to have a central body who divvies out
> the unique numbers (like IP addresses) to decentralized bodies or large
> organizations?
GBIF seems to me to be the principle contender.
> Since IP addresses are hard to memorize (and so too would be a BioGUID),
> "domain names" are used. Starting with a domain name, you can first find
> the name and/or IP address of a device, called the Domain Name Server,
> that can locate the IP address of other computers. This is a form of
> indirect addressing. ICANN also manages the top-level namespace for the
> Internet. They decide what the valid domain "extensions" are (like .com,
> .uk) so that everybody, everywhere knows where to look them up. Then, the
> domain name extensions are separated among the Regional, National, and
> Local Internet Registries around the world. There is a scheme for where
> to find the IP addresses for every domain extension (e.g. .com is on the
> ARIN registry, .com.uk is on the ).
> Then there is a layer of Domain Registrars who have been accredited by
> ICANN to assign domain names for the domain extensions - e.g. tdwg.org.
> The domain name registrars are told by the owner of the domain where to
> find their particular Domain Name Servers which may be many to enable
> redundancy - Primary, Secondary, Tertiary, etc. These redundant Domain
> Name Servers synchronize with each other at particular times of day and
> may be located all around the world. They are the main "switchboard" for
> a particular organization's computer names and associated IP addresses.
> Then the individual organization can create multiple computers for the
> domain name - e.g. www.tdwg.org - and add them to the Domain Name Server
> listing. There can be many computers for a domain, for instance:
> info.tdwg.org, www2.tdwg.org, myname.tdwg.org. Each of these can be a
> different computer with a different IP address. The redundant Domain Name
> Servers all contain the list of all these names and what IP addresses
> they are.
This is analogous in many ways to how I would envision a global taxonomic
name service. UIDs are assigned by a centralized body (e.g., GBIF; or by
the IC_N Commissions) to individual names. Analogous to multiple redundant
Domain Name Servers (DNS) would be Taxon Name Servers (TNS). Rather than
administered by one organization (e.g., GBIF, ITIS, Species 2000, uBio,
etc.) these TSNs would be replicated on dozens or hundreds of servers all
over the world, and maintained as synchronized within some reasonable time
unit. Changes to any one replicate would be automatically propagated to all
replicates (either chaotically, or more strictly through one or a few
defined "hubs"). Instead of Domain names as surrogates for IP addresses,
there would be fully qualified "Basionyms" (e.g.,
"OriginalGenusName.OriginalSpeciesName.OriginalAuthor.DescriptionYear.Page.OtherOriginalCitationDetailsAsNeeded")
representations of the
less-human-friendly GUIDs (analogues to IP addresses). Ideally, this system
wouldn't be limited to just taxonomic names, but extended to all taxonomic
concepts, so that the "Domain Name" analogue would be extended to something
like:
"OriginalGenusName.OriginalSpeciesName.OriginalAuthor.DescriptionYear.Page.OtherOriginalCitationDetailsAsNeeded_AppliedGenusName.AppliedSpeciesSpelling.ConceptAuthor.ConceptYear.Page.OtherConceptCitationDetailsAsNeeded"
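The name-to-GUID lookup described above could be sketched roughly as follows. This is only an illustration: the class names, the choice of fields, the sample name "Aus.bus.Smith.1900.12" and the GUID value are all invented, and no such service exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Basionym:
    """Original-description details of a name (fields are illustrative)."""
    genus: str
    species: str
    author: str
    year: int
    page: str

    def qualified_name(self):
        # Dot-separated like a domain name, in the order given in the post.
        return ".".join(
            [self.genus, self.species, self.author, str(self.year), self.page]
        )


class TaxonNameServer:
    """Hypothetical TNS replica: maps qualified names to GUIDs."""

    def __init__(self):
        self._guid_by_name = {}

    def register(self, name, guid):
        self._guid_by_name[name.qualified_name()] = guid

    def guid_for(self, qualified_name):
        return self._guid_by_name[qualified_name]


# Placeholder name and GUID, not real records.
tns = TaxonNameServer()
tns.register(Basionym("Aus", "bus", "Smith", 1900, "12"), "guid-0001")
print(tns.guid_for("Aus.bus.Smith.1900.12"))
```

A real scheme would also need a rule for author strings that themselves contain dots, which this sketch ignores.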
> The players in the Internet networking fabric all now play by these
layered rules.
> They all know them and follow them in order to keep the Internet running.
This
> stuff happens out of sight to everyone but the networking people and we
all take it
> for granted and assume it is simple. But, it's invisible not because it's
simple,
> but rather because it's disciplined.
Excellent synopsis, and (in my opinion) an excellent model to follow for
at least taxonomic names/concepts data. Perhaps also for specimen data (but
seems less intuitive for that.) This comes back to my earlier question about
whether it is vital that all bioinformatics GUIDs be of the same scheme; or
whether different schemes might be optimal for different classes of objects.
Aloha,
Rich
Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef(a)bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html
Many thanks for jumping in on this, Dave!
> As I think Dave V. and others have pointed out, when an LSID is resolved,
> DNS is used to find the LSID authority. The LSID authority then provides
> information about how the LSID can be served up (e.g. HTTP, SOAP, FTP),
> and where to get the data behind the LSID and associated metadata. If I
> start serving up LSIDs with the authority learningsite.com and later
> decide that I'm sick of serving up LSIDs, somebody else can take over
> serving up the data and the metadata. However, I (or they) still bear the
> responsibility of running the authority which points to the data. If my
> lsids have an authority like lsid.learningsite.com
> (urn:lsid:lsid.learningsite.com:foo:bar) then someone else can take over
> the authority by taking over lsid.learningsite.com and I can still have
> www.learningsite.com, mail.learningsite.com, etc... for myself. So, with a
> little planning, it's not so hard to deal with an authority going away as
> long as the people running it are responsible.
O.K., that clears up things a great deal -- but also reinforces my concerns
about LSIDs for specimen data. What happens when Bishop Museum sends a
specimen (or an entire collection -- but not all of the collections) to
Smithsonian? If Bishop Museum still maintained lsid.bishopmuseum.org for its
other collections, then the specimen would presumably need a new LSID based
on lsid.Smithsonian.gov. Is there a protocol for a request coming into
lsid.bishopmuseum.org to be automatically re-routed to lsid.Smithsonian.gov
for just those specimens flagged as transferred? If so, then I would feel
(slightly) more comfortable with LSIDs if the issuing organizations would
agree to use some sort of independently unique value for <ObjectID>, so that
that portion would be preserved along with the specimen object.
Of course, there is also the problem I alluded to earlier, where a specimen
object with a GUID is fractioned and in need of new IDs - but this is a
general problem which will need to be dealt with no matter the GUID scheme
that is adopted.
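To make the re-routing question above concrete, here is a small sketch of per-object forwarding between two authorities. It is purely hypothetical: the `Authority` class, the `forwards` table and the object IDs are invented, and nothing here claims that the LSID specification provides such a mechanism.

```python
class Authority:
    """Hypothetical LSID authority that can forward individual objects."""

    def __init__(self, name):
        self.name = name
        self.records = {}    # (namespace, object_id) -> data
        self.forwards = {}   # (namespace, object_id) -> new authority name


def resolve(lsid, authorities, max_hops=5):
    """Follow per-object forwards; return (final_lsid, data)."""
    for _ in range(max_hops):
        urn, scheme, authority, namespace, object_id = lsid.split(":")[:5]
        auth = authorities[authority]
        key = (namespace, object_id)
        if key not in auth.forwards:
            return lsid, auth.records[key]
        # Only the authority changes; namespace and ObjectID are preserved.
        lsid = ":".join([urn, scheme, auth.forwards[key], namespace, object_id])
    raise RuntimeError("too many forwards")


bishop = Authority("lsid.bishopmuseum.org")
smithsonian = Authority("lsid.Smithsonian.gov")
authorities = {a.name: a for a in (bishop, smithsonian)}

# Two invented specimens; the second is flagged as transferred.
bishop.records[("specimen", "obj-1")] = "specimen still at Bishop Museum"
smithsonian.records[("specimen", "obj-2")] = "specimen now at Smithsonian"
bishop.forwards[("specimen", "obj-2")] = smithsonian.name

print(resolve("urn:lsid:lsid.bishopmuseum.org:specimen:obj-2", authorities))
```

The sketch only works because the ObjectID is kept unchanged across the move, which is the condition asked for above.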
> * data and metadata
>
> With LSIDs there's a big difference between the data and metadata of an
> LSID - and I think this is going to be the biggest challenge in deciding
> how to use them in our context. What's the data? What's the metadata?
> With gene sequences, the datum is the sequence, the metadata are things
> like contact information, who did the sequencing, taxonomic information
> about the thing sequenced, etc. There's an LTER site using LSIDs for
> their data sets. The LSID data is the data set itself, and the metadata
> is what you'd expect - a description of the data set, who the
> investigators were, that sort of thing. NCBI has pubmed LSIDs - they're
> not serving up the articles yet, but there's associated metadata in there.
> For these things the division between data and metadata is fairly clear.
> However, what is the data for taxa? What is the metadata?
VERY difficult question. I perceive taxa (names & concepts) as artificial
constructs, without unambiguously objective reality; and as such, everything
(except the UID itself) is metadata. However, original descriptions of
taxon names do have an (almost) unambiguously objective reality, as do
documented statements about taxonomic concepts to which those statements
apply. But even still, most of the attributes we might think of as data
elements for objects like an original description of a taxon name could
also be interpreted as metadata.
> Here's another interesting thing about data and metadata in LSIDs. When
> you issue an LSID you're promising that the DATA behind that LSID never
> changes.
What about typographical/transcriptional errors? Can they be corrected?
> * client stack versus authority server
>
> The LSID folks provide two batches of code - an authority server, for
> people who want to serve up LSIDs themselves, and an LSID Client stack
> - which can be used by organizations to provide access to their LSIDs
> and/or proxy LSIDs provided by other organizations. It may make sense for
> an organization like GBIF to build a service using the Client Stack to
> support both their own LSIDs and those served by other organizations. The
> Client Stack has a caching mechanism which supports expiration information
> from the primary authority, so the primary authority can update where the
> LSID may be resolved and metadata of that authority.
Does the expiration apply to the domain, or to the individual object? In
other words, can a defined set of ObjectID's within one domain's LSID pool
be re-directed, without having to redirect all calls to that LSID domain?
Thanks again for your very useful insights!
Aloha,
Rich
Hello everyone,
Sorry to come into this late - and forgive me if I'm covering trodden
ground here, I just re-joined the list and may have missed a few posts.
I've been looking at various GUID systems for SEEK - primarily the Handle
System underlying DOI, and LSIDs. It looks like I'll be giving an
introduction to GUIDs and focusing on LSIDs at TDWG in New Zealand in two
weeks. Judging by the conversation here, I think I'll be keeping my
introduction brief to allow maximal time for discussion!
There are a few things about LSIDs that I want to point out. First....
as someone has mentioned, some of my early ramblings on GUIDs can be found
here:
http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/seek/projects/taxon/docs/guid/
Some of these files got munged in CVS, but I've fixed them. These files
are over six months old, and between then and now I've become more of an
LSID fan. So, if you do bother reading that stuff, realize it's old and
outdated and incomplete in too many places. I'm going to write a revised
document encompassing and expanding all of that and I'll post here when
it's completed.
Now for some miscellaneous points about LSIDs
* What happens when an LSID Authority goes away?
As I think Dave V. and others have pointed out, when an LSID is resolved,
DNS is used to find the LSID authority. The LSID authority then provides
information about how the LSID can be served up (e.g. HTTP, SOAP, FTP),
and where to get the data behind the LSID and associated metadata. If I
start serving up LSIDs with the authority learningsite.com and later
decide that I'm sick of serving up LSIDs, somebody else can take over
serving up the data and the metadata. However, I (or they) still bear the
responsibility of running the authority which points to the data. If my
lsids have an authority like lsid.learningsite.com
(urn:lsid:lsid.learningsite.com:foo:bar) then someone else can take over
the authority by taking over lsid.learningsite.com and I can still have
www.learningsite.com, mail.learningsite.com, etc... for myself. So, with a
little planning, it's not so hard to deal with an authority going away as
long as the people running it are responsible.
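As a rough illustration of the first resolution step, the sketch below splits an LSID into its parts and builds the DNS name that would be queried for the authority. The `_lsid._tcp.` service-record prefix is my recollection of the LSID resolution convention and should be checked against the specification; the function names are invented, and no network lookup is performed.

```python
def parse_lsid(lsid):
    """Split an LSID into its colon-separated parts."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object": parts[4],
        "revision": parts[5] if len(parts) == 6 else None,
    }


def authority_srv_name(lsid):
    """DNS name to query for the authority's service record (assumed form)."""
    return "_lsid._tcp." + parse_lsid(lsid)["authority"]


print(authority_srv_name("urn:lsid:lsid.learningsite.com:foo:bar"))
```

Because only the name lsid.learningsite.com is involved, whoever controls that one DNS entry can take over the authority without touching www or mail.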
* data and metadata
With LSIDs there's a big difference between the data and metadata of an
LSID - and I think this is going to be the biggest challenge in deciding
how to use them in our context. What's the data? What's the metadata?
With gene sequences, the datum is the sequence, the metadata are things
like contact information, who did the sequencing, taxonomic information
about the thing sequenced, etc. There's an LTER site using LSIDs for
their data sets. The LSID data is the data set itself, and the metadata
is what you'd expect - a description of the data set, who the
investigators were, that sort of thing. NCBI has pubmed LSIDs - they're
not serving up the articles yet, but there's associated metadata in there.
For these things the division between data and metadata is fairly clear.
However, what is the data for taxa? What is the metadata?
Here's another interesting thing about data and metadata in LSIDs. When
you issue an LSID you're promising that the DATA behind that LSID never
changes. Additionally, there's only one authority ultimately responsible
for pointing to the data, and that never changes (although as above,
someone else can take the authority over). However, for metadata, there
are no such promises. Metadata can change. Furthermore, organizations
other than the authority can provide metadata, as long as the authority
agrees to it and adds them to a list of authorized metadata providers. I
don't know if this is such a great idea, but it's in the specification.
So, what's the data? What's the metadata? This question applies to any
GUID system, really - the Handle System has the same issues, but less
clearly defined. As an aside - the Handle System is very robust and the
fee schedule is probably circumventable. However, I think LSIDs are
better suited to the direction biodiversity informatics is taking - using
XML-based standards and standard internet protocols to share data.
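The data/metadata contract described above (data never changes, metadata may) can be illustrated with a small sketch. The `LsidRecord` class and the example.org values are invented for illustration and are not part of any LSID toolkit.

```python
class LsidRecord:
    """Data is fixed at creation; metadata stays editable."""

    def __init__(self, lsid, data, metadata=None):
        self._lsid = lsid
        self._data = bytes(data)          # private copy, never reassigned
        self.metadata = dict(metadata or {})

    @property
    def lsid(self):
        return self._lsid

    @property
    def data(self):
        return self._data                 # read-only: no setter defined


record = LsidRecord("urn:lsid:example.org:seq:1", b"ACGT", {"contact": "a@example.org"})
record.metadata["contact"] = "b@example.org"   # allowed: metadata may change

try:
    record.data = b"TTTT"                      # rejected: data is immutable
except AttributeError:
    print("data cannot be replaced under this LSID")
```

Whatever ends up on the "data" side of this split is frozen, which is why the question of what counts as data for taxa matters so much.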
* client stack versus authority server
The LSID folks provide two batches of code - an authority server, for
people who want to serve up LSIDs themselves, and an LSID Client stack
- which can be used by organizations to provide access to their LSIDs
and/or proxy LSIDs provided by other organizations. It may make sense for
an organization like GBIF to build a service using the Client Stack to
support both their own LSIDs and those served by other organizations. The
Client Stack has a caching mechanism which supports expiration information
from the primary authority, so the primary authority can update where the
LSID may be resolved and metadata of that authority.
In this model, GBIF, or someone else, could support both their own LSIDs,
and the LSIDs of others. Furthermore, it could choose which authorities
it was going to resolve, so people who wanted to be sure to get "just the
good stuff" according to GBIF could use the GBIF service. In addition, it
could perform the de-duplication service that several people have
mentioned - trying to maintain a one LSID per data item mapping.
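The caching-with-expiration idea can be sketched as below. This is not the Client Stack's actual API: `CachingProxy`, the `fetch` callable and the per-LSID expiry are assumptions made for the example, and a fake clock stands in for real time so it runs without a network.

```python
import time


class CachingProxy:
    """Hypothetical proxy that caches answers until the authority's expiry."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch      # callable: lsid -> (answer, ttl_seconds)
        self._clock = clock
        self._cache = {}         # lsid -> (answer, expires_at)
        self.upstream_calls = 0

    def resolve(self, lsid):
        now = self._clock()
        hit = self._cache.get(lsid)
        if hit is not None and hit[1] > now:
            return hit[0]
        self.upstream_calls += 1
        answer, ttl = self._fetch(lsid)
        self._cache[lsid] = (answer, now + ttl)
        return answer


# Fake clock and fake authority so the example is self-contained.
now = [0.0]
locations = {"urn:lsid:example.org:specimen:1": "http://old.example.org/1"}
proxy = CachingProxy(lambda lsid: (locations[lsid], 60), clock=lambda: now[0])

lsid = "urn:lsid:example.org:specimen:1"
proxy.resolve(lsid)                            # first call goes upstream
locations[lsid] = "http://new.example.org/1"   # authority moves the data
print(proxy.resolve(lsid))                     # still cached: old location
now[0] = 61.0
print(proxy.resolve(lsid))                     # expired: new location
```

Here expiry is tracked per LSID; whether the real mechanism works per object or per authority is exactly the kind of detail to confirm in the specification.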
* lsid namespaces and file formats
I don't think the namespace part of the LSID
(urn:lsid:authority:namespace:object:version) is intended to be
semantically loaded except for the relevant lsid authority. There is a
way in the metadata to state what format the data comes in. It's not the
traditional text/javascript mime-type tag - instead the format is another
LSID! For example the FASTA protein sequence file format is:
urn:lsid:i3c.org:formats:fasta. Clients that understand LSIDs, like the
Launchpad application, can be set to attach applications to LSID formats
so that clicking on an LSID with a given format opens up an appropriate
application.
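A minimal sketch of attaching an application to a format LSID, in the spirit of the Launchpad behaviour described above. Only the FASTA format LSID comes from the post; the `HANDLERS` registry and function names are invented, and this says nothing about how Launchpad is actually implemented.

```python
FASTA_FORMAT = "urn:lsid:i3c.org:formats:fasta"   # format LSID quoted in the post


def read_fasta(text):
    """Parse FASTA text into {header: sequence}."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif line and header is not None:
            records[header] += line
    return records


# Hypothetical registry: format LSID -> function that opens that format.
HANDLERS = {FASTA_FORMAT: read_fasta}


def open_with_handler(format_lsid, payload):
    handler = HANDLERS.get(format_lsid)
    if handler is None:
        raise LookupError(f"no application registered for {format_lsid}")
    return handler(payload)


print(open_with_handler(FASTA_FORMAT, ">seq1 demo\nMKV\nLAA\n"))
```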
Sorry to be so scattershot - it's hard to come into the middle of a huge
topic like this. I'm glad to see all this discussion - it's going to make
working on my talk much easier (I think...)
Dave
> -the 'GUID combination' is not enforced and therefore not always used
> -Some collections belong to 2 or more Institutions or to none
> -If part of the collection moves to another institute, the guid
> combination is changed for that part.
> -The InstitutionCode should be unique, and providers were asking what to
> do if the code they wanted to use was already chosen, and who decides which
> institute may use an institutioncode if two institutes want to use it.
> There is no body responsible for that and there are no rules: the first
> Institute can claim a code, or the biggest or the most well known??
> -In different science areas different InstitutionCodes within one
> Organisation were in use, which one to choose.
> -This 'GUID' can only be used for specimen, not for other life science
> objects.
Wholeheartedly agree on all counts!!! That's why I still see it as a "soft"
ID (even with enforcement of unique registered Institution Codes, and
enforced uniqueness of CollectionCode+CatalogNumber within a single
InstitutionCode). It's a stop-gap to solve some of the problems, until a
real GUID system is up, running, and broadly adopted.
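A short sketch of why the three-part combination behaves as a "soft" ID: the key is built from attributes that change when a specimen moves. The codes and catalog number below are illustrative values only.

```python
from typing import NamedTuple


class SpecimenKey(NamedTuple):
    institution_code: str
    collection_code: str
    catalog_number: str

    def as_string(self):
        return ":".join(self)


# Illustrative values only.
before = SpecimenKey("BPBM", "Fish", "12345")
# The same physical specimen after a transfer to another institution:
after = before._replace(institution_code="USNM")

print(before.as_string(), "->", after.as_string())
print(before == after)   # False: the 'identifier' did not survive the move
```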
> Now let's look at LSID syntax:
> urn:lsid:authority:namespace:object_identifier (:revision_number)
> About the first part; authority:
> It is natural to want this to be unique. Therefore we can expect the
> same problems as mentioned above, plus a lack of clarity about the
> difference between issuing_authority and current_authority for the data.
As to uniqueness, I think that's (part of) the point of using a URL, instead
of just an institution name or abbreviation. URLs seem to be effectively
unique. As to the confusion about "issuing_authority" vs.
"current_authority", count me among the befuddled. My interpretation of
Dave Vieglais' posts was that the "Authority" URL was assumed to be the URL
where the GUID is resolved to the data it represents. But Bob Morris' posts
suggest otherwise ("The authority name is the /issuing/ authority. It's an
authority for the LSID, not for its resolution or the underlying data.").
Perhaps I misunderstood Dave's post? My primary concern about LSIDs is that
(I thought) the URL used for the "authority" portion of the LSID must be
live, online, active, and perpetual in order to resolve the data. If this
is not the case, (i.e., if, as Bob says, it is only intended to indicate the
*issuer*, not the current authority), then my concerns about LSIDs are
greatly reduced.
> The problems with authority are important for the involved authorities
> only, not for the rest of the life science community.
Agreed!! And further, the authority makes sense for "local" or "owned" data
(e.g., specimens, and attributes thereof), but not for "public" data (e.g.,
taxa, and attributes thereof).
> So discussions about it and establishing an authority that takes
> decisions in political conflicts are a waste of time.
> We can solve it by using a unique number only and maintaining a list that
> gives information about each number. It should be clear that these are only
> the initial issuing authority/authorities.
Agreed for sure on the last sentence! And if I interpret the penultimate
sentence correctly, then full agreement there as well. I had started
writing a response to Bob Morris' post last night, but it got too late so I
didn't finish it. It included the following:
***************************************************
Bob Morris wrote:
> I find nothing in the LSID current(?) proposed recommendation
> http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02 that specifies that
> the "object identification" part of an LSID may not be a string of hex
> digits and dashes. (Though I continue to fail to see why people are so
> in love with these).
I'm not particularly enamored by MACs or hex strings over other forms of
unique identifiers, but I WOULD like to see a protocol for bioinformatics
LSID generation that decreased the possibility of duplicate <ObjectID>
portions of the LSIDs from different issuers to effectively zero. Maybe
there is no meaningful technical reason to do this -- but I can't help but
feel that if LSIDs are later determined not to be ideal for bioinformatics
purposes, it may prove to be DAMN useful that the <ObjectID> portion alone
serves effectively as its own gUid (emphasis on "U" intentional), even
stripped of the remaining LSID components.
***************************************************
Wouter: Is this what you meant by using a unique number?
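The idea in the excerpt above, an <ObjectID> that remains unique even when stripped from the rest of the LSID, could be sketched with a random UUID as the object part. The `mint_lsid` function is invented for illustration; using a UUID is one possible choice, not something the LSID recommendation requires.

```python
import uuid


def mint_lsid(authority, namespace):
    """Return (lsid, object_id); object_id is a random UUID."""
    object_id = str(uuid.uuid4())
    return f"urn:lsid:{authority}:{namespace}:{object_id}", object_id


lsid, object_id = mint_lsid("lsid.bishopmuseum.org", "specimen")
print(lsid)
# The ObjectID can be cut out of the LSID and still parsed as a UUID.
print(uuid.UUID(lsid.rsplit(":", 1)[1]) == uuid.UUID(object_id))
```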
I also agree with Wouter (in response to Kevin Richards' concern about
single-server bottleneck) that this should not be a concern. I seriously
doubt that the sum total of all Life Sciences data calls would ever even
approach the level of calls that Google receives. True, there will likely
not be the same sort of money backing a life sciences GUID system that is
behind Google's server farm -- but even still, given the fundamental
importance such a system would have to such a wide variety of people
receiving such a large chunk of grant money, I can certainly imagine
justifying money on the order of 6 or 7 figures ($/euro) for such a server
system, and I imagine that would be ample money by today's technological
benchmarks (or, more likely, the benchmark a few more cycles of Moore's Law
hence, which is probably when a system of this sort would start to receive
really high volume requests) to get a system capable of meeting the demand.
I think the real bottlenecks are sociological (i.e., getting these damned
fickle life science practitioners to agree on anything), not technological.
> The last part; revision id: whether you need it depends: do you give the
> physical objects a GUID or the data records?
Agreed! Another excerpt from my attempted reply to Bob Morris:
***************************************************
Bob Morris wrote:
> This brings up a point I may have missed in this discussion: LSIDs are
> designed to give an identifier to /data/ not to physical objects. This
> probably means that a fully compliant use of LSIDs for "specimens" will
> be assigned to specimen records, not to specimen objects. This is good,
> because if a physical object is moved it presumably gets a new specimen
> record, which record, perhaps, has some metadata that tells the curation
> history of the underlying object, including its previous LSIDs.
> Seems like the analog of taxonomic synonomy to me....
I disagree -- as I stated before, I strongly feel that the number should be
applied to the physical *object* (or virtual representation thereof -- such
as an original description), not the electronic data record. If Bishop
Museum sends a specimen to Smithsonian, the specimen now curated at
Smithsonian is not, to my mind, a synonym of its earlier life at Bishop
Museum. Nor is its data. Both the specimen, and its associated data,
should be considered as fixed into perpetuity. As more and more data are
transferred from the physical to an electronic form, those data should be
associated to the same GUID for the object -- not multiple versions of the
electronic representation of data related to that object. If the ID *must*
be tied to the data record, instead of the object, then I take back
everything I said earlier about versioning. In the data-centric scenario,
versioning becomes absolutely *vital*. Personally, I think trying to manage
GUIDs as record identifiers, rather than object identifiers, would introduce
unnecessary and excessive complexity. Biologists are interested in the
objects. The data records are just a convenient mechanism of information
conveyance -- not important entities unto themselves. This may not apply to
all aspects of Life Sciences, but I think it should apply to objects we're
discussing here (specimens, taxa, references, etc.).
***************************************************
> With the first choice you do
> not need a revision number because the physical object will not change (or
> do they with living collections?).
Living collections do change, and so do unvouchered observations, and so do
records representing populations (rather than specific physical organisms).
And even preserved specimens change over time (tracking condition,
preservation status, etc.). But I think the GUID should be fixed to the
physical object, and that whatever dynamic properties of that object are
worth recording over time should be associated back to the physical object.
Where things get more complicated is how to define the "object". To some
collections, the unit of "Object" may be multi-taxon/multi-specimen (e.g.,
fossil), or single-taxon/multi-specimen (e.g., lot), or
single-taxon/single-specimen (individual specimen), or
single-taxon/partial-specimen (a part of a specimen, like a skeleton vs. a
skin). Single "objects" of any of these sorts may be fractioned (e.g.,
Isotypes, or simply splitting up a multi-specimen lot to send out to
different institutions). So, one important question in such cases is
whether one of the "child" objects retains the GUID of the "parent" object,
and new GUIDs are assigned only to the remaining "child" objects (the way
Linnaean taxonomy works for taxonomic concepts, and the way most
institutions deal with catalog numbers for specimens). Or, do *all* child
objects receive new GUIDs, each referring back to an historical "parent"
object that no longer exists? The temptation is to support the latter, but
in this case, what of a specimen that partially deteriorates and a portion of
it is destroyed, rather than sent to a different institution? Logically,
the remaining specimen should be treated no differently than it would have
if the deteriorated portion was instead sent to a different institution,
rather than destroyed, and hence receive a new GUID. But that's starting to
sound an awful lot like condition monitoring. Perhaps this distinction
should be left to the discretion of the GUID issuer/Object owner on a
case-by-case basis? (Yikes! Inconsistency!) Or, perhaps this is where
versioning comes in (where the versions are actual object versions, not
electronic data versions)? This seems like a more complicated problem than
the ones we have been discussing so far.
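The two fractioning policies discussed above can be put side by side in a small sketch. The `CollectionObject` class and both function names are invented for illustration; neither policy is endorsed here.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


def new_guid():
    return str(uuid.uuid4())


@dataclass
class CollectionObject:
    guid: str = field(default_factory=new_guid)
    parent_guid: Optional[str] = None


def split_retaining(parent, parts):
    """Policy A: one part keeps the parent's GUID, the others get new ones."""
    extras = [CollectionObject(parent_guid=parent.guid) for _ in range(parts - 1)]
    return [parent] + extras


def split_all_new(parent, parts):
    """Policy B: every part gets a new GUID that points back to the parent."""
    return [CollectionObject(parent_guid=parent.guid) for _ in range(parts)]


lot = CollectionObject()
a = split_retaining(lot, 3)
b = split_all_new(lot, 3)
print(sum(o.guid == lot.guid for o in a))   # 1
print(sum(o.guid == lot.guid for o in b))   # 0
```

Under policy B the parent GUID no longer refers to any existing object, which is the case that makes partial deterioration awkward.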
> If a GUID for a 'physical object' is
> chosen, a thing like a species name or author name or country
> should not get a GUID.
I disagree. For species names, the GUID would apply to the name's original
description/creation event. Metadata for such never change -- they can only
be corrected. For author names, I would argue that the object to which the
GUID is applied should be thought of as the *name* of the author as a
virtual physical object; not the author as a physical object. Multiple
AuthorName objects could be linked to each other via an Alias scheme, and/or
tied to a common "Person" (which could either be a separate GUID namespace,
or be defined as a set of linked AuthorName objects). In the context of
biological objects, place descriptors of all sorts are really just
surrogates to defined two(three?)-dimensional physical spaces. The GUIDs for
such should primarily be established for the physical space, not the name or
other descriptors applied to that space. The GEOnet Names Server (GNS;
http://earth-info.nima.mil/gns/html/index.html) seems to me to be a useful
model to follow. They have two ID numbers for each record. One is the
"Unique Feature Identifier" (UFI): "A number which uniquely identifies the
feature [=place].", and the other is the "Unique Name Identifier" (UNI): "A
number which uniquely identifies a name.".
The point is, these representations to which I think GUIDs should be applied
are effectively permanent/persistent.
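The two-identifier model described for GNS (one ID for the place, another for each name applied to it) might look like this in miniature. The identifiers and the grouping of names below are invented and are not actual GNS records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameRecord:
    uni: int    # identifies the name
    ufi: int    # identifies the feature (place) the name refers to
    name: str


# Invented identifiers; these are not actual GNS records.
records = [
    NameRecord(uni=1, ufi=100, name="Hawaii"),
    NameRecord(uni=2, ufi=100, name="Big Island"),
    NameRecord(uni=3, ufi=200, name="Maui"),
]


def names_for_feature(records, ufi):
    return [r.name for r in records if r.ufi == ufi]


print(names_for_feature(records, 100))
```

The same shape would fit the AuthorName/Person idea: several name objects, each with its own ID, linked to one common entity.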
Aloha,
Rich
Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef(a)bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html
Perhaps it would be useful to look at the issues being discussed about a bio
identifier/locator/GUID in comparison to the same things that are needed for
Internet communications. How do you find a web page in a directory on a
server somewhere in the world? The solution used by our Internet forefathers
was to create layers and have standards and standard-handling methods at
each layer.
To connect to a web page across the Internet involves multiple standards,
standard bodies, handlers, forwarders, duplicaters, etc. The entire chain
of events works because there are systems and people all working their part
of the network everyday. It's not just a single authority running
everything and it's not just every institution for themselves. It's a
coordinated combination approach.
This is going to be an overly simplistic description of the Internet (and
probably inaccurate in some details) but I hope it conveys the analogy.
The most unique thing of all on the Ethernet is the MAC address which is
assigned to the NIC at the lowest layer. Just numbers and letters. Only the
computers and network hardware deal with this number.
Under the TCP/IP protocol for Ethernet communication, an IP address is
assigned to the NIC/MAC address. Just numbers with no meaning. Sometimes
humans actually use the IP address, but it is mostly used by the computers.
IP addresses have to be unique world-wide to make the Internet work. The
Internet Corporation for Assigned Names and Numbers (ICANN- www.icann.org)
provides that uniqueness by assigning all the IP numbers in unique blocks or
ranges of numbers to "Internet Registries". There are Regional, National
and Local Internet Registries that subdivide and "license" IP addresses to
ISPs, who in turn license IP addresses to organizations. Some organizations
are so big that they bypass any ISP. So, there is a hierarchy of how the
"unique identifiers" are managed. There is in fact a central authority, but
it delegates to decentralized authorities. Is there an analogy for BioGUIDs
to have a central body who divvies out the unique numbers (like IP
addresses) to decentralized bodies or large organizations?
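One way to picture the question above, a central body handing out blocks of numbers that lower-level bodies then issue from, is the sketch below. The `Registry` class, the registry names and the block sizes are all hypothetical.

```python
class Registry:
    """Hypothetical registry that owns the half-open ID range [start, end)."""

    def __init__(self, name, start, end):
        self.name = name
        self.start = start
        self.end = end
        self._next = start

    def _take(self, size):
        if self._next + size > self.end:
            raise ValueError(f"{self.name}: range exhausted")
        first = self._next
        self._next += size
        return first

    def delegate(self, child_name, size):
        """Hand a sub-block to a lower-level registry."""
        first = self._take(size)
        return Registry(child_name, first, first + size)

    def issue(self):
        """Issue a single identifier from this registry's own block."""
        return self._take(1)


central = Registry("central", 0, 1_000_000)
museum_a = central.delegate("museum-a", 1000)
museum_b = central.delegate("museum-b", 1000)
print(museum_a.issue(), museum_b.issue())
```

Because the blocks never overlap, each institution can issue numbers offline without risk of colliding with another's.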
Since IP addresses are hard to memorize (and so too would be a BioGUID),
"domain names" are used. Starting with a domain name, you can first find the
name and/or IP address of a device, called the Domain Name Server, that can
locate the IP address of other computers. This is a form of indirect
addressing. ICANN also manages the top-level namespace for the Internet.
They decide what the valid domain "extensions" are (like .com, .uk) so that
everybody, everywhere knows where to look them up. Then, the domain name
extensions are separated among the Regional, National, and Local Internet
Registries around the world. There is a scheme for where to find the IP
addresses for every domain extension (e.g. .com is on the ARIN registry,
.com.uk is on the ).
Then there is a layer of Domain Registrars who have been accredited by ICANN
to assign domain names for the domain extensions - e.g. tdwg.org.
The domain name registrars are told by the owner of the domain where to find
their particular Domain Name Servers which may be many to enable redundancy
- Primary, Secondary, Tertiary, etc. These redundant Domain Name Servers
synchronize with each other at particular times of day and may be located
all around the world. They are the main "switchboard" for a particular
organization's computer names and associated IP addresses.
Then the individual organization can create multiple computers for the
domain name - e.g. www.tdwg.org - and add them to the Domain Name Server
listing. There can be many computers for a domain, for instance:
info.tdwg.org, www2.tdwg.org, myname.tdwg.org. Each of these can be a
different computer with a different IP address. The redundant Domain Name
Servers all contain the list of all these names and what IP addresses they
are.
So, it all works through a series of layers, each connected to the other
with indirect references. At the bottom there are the unique and cryptic IP
and MAC numbers. In between them and the humans are the layering of names.
And the methods for changing names. You can start with an IP address and
find the domain it is in and the computer name assigned to it. Or, you can
start with a name like www.tdwg.org and find its IP address.
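The two directions of lookup mentioned above (name to address, and address back to name) can be shown with a toy table. This is not how DNS is implemented; the `NameTable` class is invented, and the addresses come from the range reserved for documentation.

```python
class NameTable:
    """Toy two-way table: host name <-> address."""

    def __init__(self):
        self._addr_by_name = {}
        self._name_by_addr = {}

    def add(self, name, address):
        self._addr_by_name[name] = address
        self._name_by_addr[address] = name

    def forward(self, name):
        return self._addr_by_name[name]

    def reverse(self, address):
        return self._name_by_addr[address]


table = NameTable()
# 192.0.2.0/24 is reserved for documentation; these are not real addresses.
table.add("www.tdwg.org", "192.0.2.10")
table.add("info.tdwg.org", "192.0.2.11")
print(table.forward("www.tdwg.org"), table.reverse("192.0.2.11"))
```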
The players in the Internet networking fabric all now play by these layered
rules. They all know them and follow them in order to keep the Internet
running. This stuff happens out of sight to everyone but the networking
people and we all take it for granted and assume it is simple. But, it's
invisible not because it's simple, but rather because it's disciplined. And
a lot of hardware devices have been constructed to follow and enforce the
rules.
In our discussions of how a BioGUID would be implemented - how assigned, how
managed, how identified and located, how made resilient to failure - we need
to be mindful that there is probably not going to be a simple, this way or
that way solution. It probably needs to be organized into layers of
abstraction, but it will also need to be disciplined.
I think we will need something like BioICANN with BioDomainRegistries,
BioDomainExtensions, BioDomains and BioDNSs that provide the access paths to
the BioGUIDs.
Chuck Miller
CIO
Missouri Botanical Garden
-----Original Message-----
From: Wouter Addink [mailto:wouter@ETI.UVA.NL]
Sent: Monday, September 27, 2004 5:42 AM
To: TDWG-SDD(a)LISTSERV.NHM.KU.EDU
Subject: Re: Globally Unique Identifier & Donald Hobern's PPT
Hi,
I found the PowerPoint document from Donald really helpful on some of the
discussed issues. Unfortunately referring to it is a little difficult
because the pages do not contain a unique identifier :)
But first some real-world experience with the 'GUID' used today in the GBIF
specimen network. The combination InstitutionCode, CollectionCode and
CatalogNumber was chosen for that. Problems experienced last year:
-the 'GUID combination' is not enforced and therefore not always used
-Some collections belong to 2 or more Institutions or to none
-If part of the collection moves to another institute, the guid combination
is changed for that part.
-The InstitutionCode should be unique, and providers were asking what to do
if the code they wanted to use was already chosen, and who decides which
institute may use an institutioncode if two institutes want to use it. There
is no body responsible for that and there are no rules: the first Institute
can claim a code, or the biggest or the most well known??
-In different science areas different InstitutionCodes within one
Organisation were in use, which one to choose.
-This 'GUID' can only be used for specimen, not for other life science
objects.
Now let's look at LSID syntax:
urn:lsid:authority:namespace:object_identifier (:revision_number)
About the first part; authority:
It is natural to want this to be unique. Therefore we can expect the same
problems as mentioned above, plus a lack of clarity about the difference
between issuing_authority and current_authority for the data. The problems
with authority are important for the involved authorities only, not for the
rest of the life science community. So discussions about it and establishing
an authority that takes decisions in political conflicts are a waste of
time. We can solve it by using a unique number only and maintaining a list
that gives information about each number. It should be clear that these are
only the initial issuing authority/authorities.
About the second part; namespace:
Things like 'Specimen' or 'Experiment'. In contrast with the first part,
problems with this part are interesting for the whole lifescience community
because applications will want to use this to decide whether the data can be
used for a specific application. Standardisation of namespaces is necessary.
I think it should be divided in two parts (not currently present in LSID)
like MIME type image/jpeg etc: 'observation/abcd' or 'specimen/darwincore'
for example. If we look at Donald's PPT we see that in model 1 (LSID
assigned centrally) the namespace is chosen centrally, and in model 2 (LSID
assigned by each provider) the provider is free to choose one. Even language
variations of naming a namespace can already give problems, so this is why I
strongly favor a central mechanism here for assigning LSIDs, unless the
provider is somehow forced to use a certain namespace class. The potential
bottleneck problem is not really an issue I think (see also DNS mechanisms).
If we choose a central mechanism the issuing authority will always be GBIF (or
do we need different authorities for different parts of Life Science?), so
there are no problems with that in this case either.
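The two-part namespace suggested above could be handled roughly as in the Python sketch below. This split is only the proposal made in this message and is not part of LSID. `KNOWN_CLASSES`, `split_namespace` and `is_usable` are hypothetical names, and the set of registered classes is an assumption standing in for whatever list would be agreed centrally.

```python
from typing import Tuple

# Hypothetical centrally agreed object classes (an assumption for this sketch).
KNOWN_CLASSES = {"specimen", "observation"}


def split_namespace(namespace: str) -> Tuple[str, str]:
    """Split 'class/format' (MIME-type style) and check the class is registered."""
    object_class, sep, data_format = namespace.partition("/")
    if not sep or not object_class or not data_format or "/" in data_format:
        raise ValueError(f"expected 'class/format', got {namespace!r}")
    object_class = object_class.lower()
    if object_class not in KNOWN_CLASSES:
        raise ValueError(f"unregistered object class: {object_class!r}")
    return object_class, data_format.lower()


def is_usable(namespace: str, wanted_class: str) -> bool:
    """Let an application decide whether it can use data from this namespace."""
    try:
        object_class, _ = split_namespace(namespace)
    except ValueError:
        return False
    return object_class == wanted_class


print(split_namespace("specimen/darwincore"))
print(is_usable("observation/abcd", "specimen"))
print(is_usable("exemplaar/darwincore", "specimen"))  # a language variant is rejected
```

The last call illustrates the language-variation problem: without a shared list of classes, an application has no way to know that a provider's local word means "specimen".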
The third part; object id: no problems there.
The last part; revision id: whether you need it depends: do you give the
physical objects a GUID or the data records? With the first choice you do
not need a revision number because the physical object will not change (or
do they with living collections?). At first I thought that a GUID should be
put on physical object: if you are looking for data, you are looking for
data about a certain physical object, the source of the data is not (very)
important. The same data elements in different sources about the same object
should be equal, else there are errors. Donald's PPT gives the example that
someone wants to refer to a LSID in a publication as a source. In that case
you want to refer to a data source with a certain version. Then you need to
give a GUID to the data and also you need revisions. Data is not persistent,
it changes all the time. Giving a persistent identifier to it is very
difficult and not many data systems have full revisions support. If a GUID
for a 'physical object' is chosen, a thing like a species name or author
name or country should not get a GUID. These are more a kind of attributes:
most data will use one or more species names as 'metadata'. There needs to
be a central datasource for each of these 'metadata', like a NameBank for
species names (with its own ID). I am not sure whether LSID was designed for
a GUID to data or to physical objects. The use of namespace and object id
instead of databasename and recordid seems to indicate that it was designed
for physical objects, but why then the optional revision id? Instead of a
revision id you can also assign a new GUID with every change, but then how
to point to a new version from an old version of data (if you have the GUID
of the old version, how to get the GUID of the new one).
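One possible answer to that last question is a lookup from each superseded GUID to its successor, sketched below in Python. This is only an illustration under assumptions: the `successor_of` table, `new_version` and `latest` are hypothetical, the table would have to live somewhere that every client can reach, and random `uuid4` values stand in for whatever GUID scheme is actually chosen.

```python
import uuid
from typing import Dict, Optional

# Hypothetical registry: GUID of a superseded version -> GUID of its successor.
successor_of: Dict[str, str] = {}


def new_version(previous: Optional[str] = None) -> str:
    """Issue a fresh GUID and, if given, record it as the successor of `previous`."""
    guid = str(uuid.uuid4())
    if previous is not None:
        successor_of[previous] = guid
    return guid


def latest(guid: str) -> str:
    """Follow successor links from any old GUID to the newest version."""
    seen = set()
    while guid in successor_of:
        if guid in seen:
            raise ValueError("cycle in successor links")
        seen.add(guid)
        guid = successor_of[guid]
    return guid


v1 = new_version()
v2 = new_version(v1)
v3 = new_version(v2)
print(latest(v1) == v3)
```

A citation that recorded `v1` still identifies exactly the data that was cited, while `latest(v1)` leads to the current record.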
Requirements in Donald's PPT:
- If a GUID is on a physical object, the GUID need not refer uniquely to a
  single data element; it must only be unique itself. That is also not a
  requirement in the LSID specification. There will be overlap between the
  objects, so an object can belong to more than one ID. For instance a
  researcher can have his own ID and also belong to the ID of the Institute he
  is working for. The data for overlapping elements like researcher name must
  be equal.
- I would restrict the identifiers to life science objects.
Issues to be resolved in Donald's PPT:
It would be beneficial to maintain the GUID in the datasource itself (at
least for the owner of the datasource), but not absolutely necessary. I see
GUID in data records as a 'tightly coupled' model (which requires some work
for existing databases). I can imagine also a 'loosely coupled' model where
provider software is modified to get the identifier from a central server
(or mirror).
Wouter Addink
Having joined late, forgive me if my comments below repeat. (If not
here, then perhaps in Dave Thau's or Donald Hobern's documents?)
Discussion on a wiki instead of email would help...
Kevin Richards wrote:
> I believe a centralised system that maintains and assigns the LSIDs would be
> too large a job for one organisation and would create a bottle-neck in the
> system (where every call/request or assignment of IDs will need to be
> passed through one web server located at GBIF).
>
> The reality of the data that will be transferred using the SDD format, is
> that it is quite decentralised and is represented in quite different ways
> at each data source. I think therefore it should be up to the organisation
> containing the data to provide an LSID and resolve these LSIDs. For example
> we are considering an LSID and a MAC GUID for the unique ID of our
> taxonomic name data here at Landcare Research. Which would be something
> like
> URN:LSID:LandcareResearch:TaxonName:86BA062A-ADC6-4516-956F-34CDA0F465EC.
> With a centralised system this would not be allowed - ie the
> LSID would probably be limited to integers, which would then need to be
> resolved at the individual organisation to find the matching data.
>
I find nothing in the LSID current(?) proposed recommendation
http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02 that specifies that
the "object identification" part of an LSID may not be a string of hex
digits and dashes. (Though I continue to fail to see why people are so
in love with these).
However, what is the case is that the complete LSID must be globally
unique and /persistent/.
FWIW, the recommendation is that the authority identification should
normally be an Internet domain name, e.g. landcareresearch.co.nz, or else
the corresponding name be registered with the appropriate authority.
> Having a non-centralised system would put more work on each organisation
> involved, and create problems when data is moved, or organisations are
> closed, but this just means that procedures need to be put in place to
> handle such situations. It is possible that there may be intermediate
> services that provide the LSID resolution for a bunch of databases/data
> sources and serve up this data to those who request it.
>
> I also think there is a bit of a tie between the authority name of the LSID
> and the URL that is used for obtaining the data. This doesn't need to be
> so.
It isn't so. The authority name is the /issuing/ authority. It's an
authority for the LSID, not for its resolution or the underlying data.
The obligation on the issuer is only that the complete name must be
syntactically correct and globally unique forever. The obligation on the
party receiving the LSID is that it must never be assigned to another
datum [see para below beginning "This brings up"]. However, several
different LSIDs can be assigned to the same datum.
> The job of the central Resolver would be to match the
> authority name to the url for obtaining the data.
Actually, in the proposed recommendation, the LSID resolution protocol
does not return data, but rather a list of data retrieval services which
know about the given LSID. The client calls one of these to fetch data
or metadata corresponding to the LSID.
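The two-step pattern described above (first ask which services know the LSID, then call one of them) can be sketched in Python as follows. This is an in-memory illustration only, not the protocol defined in the recommendation: `SERVICES_BY_AUTHORITY`, `make_service`, `resolve` and `fetch` are hypothetical names, `example.org` is a placeholder authority, and real services would be network endpoints rather than local functions.

```python
from typing import Callable, Dict, List

# A "data retrieval service" is modelled here as a function from LSID to bytes.
DataService = Callable[[str], bytes]


def make_service(records: Dict[str, bytes]) -> DataService:
    def get_data(lsid: str) -> bytes:
        return records[lsid]  # raises KeyError if this service lacks the record
    return get_data


LSID_A = "urn:lsid:example.org:Specimen:1"

stale_mirror = make_service({})  # listed for the authority, but holds nothing
primary = make_service({LSID_A: b"specimen record 1"})

# Hypothetical lookup table from issuing authority to its retrieval services.
SERVICES_BY_AUTHORITY: Dict[str, List[DataService]] = {
    "example.org": [stale_mirror, primary],
}


def resolve(lsid: str) -> List[DataService]:
    """Step 1: return services for the LSID's authority, never the data itself."""
    authority = lsid.split(":")[2]
    return SERVICES_BY_AUTHORITY.get(authority, [])


def fetch(lsid: str) -> bytes:
    """Step 2: the client tries the returned services until one has the data."""
    for service in resolve(lsid):
        try:
            return service(lsid)
        except KeyError:
            continue
    raise LookupError(f"no service returned data for {lsid}")


print(fetch(LSID_A))
```

Because the authority string is only used as a lookup key in step 1, the data can move to a different service without the LSID changing.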
This brings up a point I may have missed in this discussion: LSIDs are
designed to give an identifier to /data/ not to physical objects. This
probably means that a fully compliant use of LSIDs for "specimens" will
be assigned to specimen records, not to specimen objects. This is good,
because if a physical object is moved it presumably gets a new specimen
record, which record, perhaps, has some metadata that tells the curation
history of the underlying object, including its previous LSIDs.
Seems like the analog of taxonomic synonymy to me....
Probably TDWG and GBIF should be examining what metadata to provide on
"TDWG compliant LSIDs" in various contexts, especially specimen records.
>So the authority name in
> the LSID could just be any name.
Not if compliant with the LSID proposal. It must itself have some global
uniqueness guarantees, e.g. suitable registration. See above.
>
> Kevin Richards
>
>
>
> On Thu, 23 Sep 2004 13:29:26 -1000, Richard Pyle
> <deepreef(a)BISHOPMUSEUM.ORG> wrote:
>
>
>>I want to start by wholeheartedly endorsing Wouter's plea for
>>non-information-bearing (meaningless) GUIDs. This feature is CRITICAL to
>>the long-term success of any GUID system. It is absolutely imperative that
>>there NEVER be any motivation to change the content of a GUID (i.e., it
>>should be permanent). If the GUID itself contains any information
>>whatsoever, there may be motivation to change that information at a later
>>time.
>>
>>For this reason, I had initially preferred the DOI approach, but over time,
>>I am gradually warming up to the LSID approach. While components of an LSID
>>do, indeed, represent information, they represent the one piece of
>>information that I think may legitimately belong embedded within a GUID:
>>context. That is, the context, or domain, of the GUID itself. The context
>>in this case would be the "issuer" of the GUID -- not necessarily the
>>current "owner" of the GUID (see more discussion on this below). Though the
>>organization that issued a GUID may eventually disappear, the fact that the
>>organization was the one to issue the GUID in the first place will never
>>change, and thus represents a permanent and unchanging component of the
>>GUID. Without the context portion, the GUID itself is really nothing more
>>than a random string of characters. In summary, I'm warming up to the LSID
>>approach because it represents embedded context, without the risk of
>>temptation to change the content of a GUID after it has been issued.
>>
>>Regarding Donald's PPT file, I have a couple of comments and questions:
>>(Assumes Title slide is "Slide 1")
>>
>>Slide 2:
>>You note there is "No reliable mechanism" to relate the same record from
>>different providers to each other. But in the context of DarwinCore, the
>>combination of [InstitutionCode]+[CollectionCode]+[CatalogNumber] should
>>represent a virtual GUID (provided that the Global Provider Registry ensures
>>no duplication of [InstitutionCode]). I do realize that words like "should"
>>and "reliable" are critical here. Perhaps the DarwinCore implementation
>>should enforce the requirement of uniqueness of
>>[CollectionCode]+[CatalogNumber] within a single [InstitutionCode], and
>>further ensure globally unique [InstitutionCode] values via the Global
>>Provider Registry.
>>
>>Slide 3:
>>Wouldn't most of the problems indicated in the first four bulleted points be
>>largely solved by the Global Provider Registry? Using the [InstitutionCode]
>>would allow lookup in the registry for a (current/active) metadata URL, and
>>the metadata URL would provide information on where to access a particular
>>[CollectionCode]+[CatalogNumber] piece of data.
>>
>>The issue of specimens changing numbers and/or collections is problematic,
>>of course.
>>
>>The issue of versioning is a bit dicey, in my mind (e.g., at what resolution
>>of information change)? Some things, like changing taxonomic determinations
>>(i.e., "real" changes) need to be handled in a robust way. Other things,
>>like the correction of typos and different styles of representing the exact
>>same information (e.g., R.L. Pile==>R.L. Pyle; or R.L. Pyle==>Pyle, R.L.)
>>probably don't need to be versioned. Other sorts of changes (e.g., the
>>elaboration of previously existing information, such as the addition of
>>retroactively-generated georeference coordinates) fall somewhere in-between.
>>
>>Slide 4:
>>We should all get behind SEEK in addressing these issues (Taxon concept
>>mapping). Ultimately, we minimally need a GUID pool for References
>>(inclusive of unpublished works), and a GUID pool for what I call
>>"Protonyms" (original creations of IC_N Code-compliant names). The union of
>>these two GUIDs (what I would call "Assertions") would itself represent a
>>GUID to a "potential concept" (Berendsohn). (Note: my preference would be to
>>define Protonyms as a subtype of Assertions, and therefore Protonym GUIDs
>>would be a subset drawn from the same pool as Assertion GUIDs -- but this is
>>a technical discussion for another time).
>>
>>Slide 5:
>>Nice summary!!
>>
>>Slide 6:
>>Good stuff here, but I'll respond with some of my personal opinions:
>>
>>- RevisionID: see points of concern already expressed above
>>
>>- Specimen Record LSIDs: I gather from subsequent slides that you recognize
>>two alternative approaches: having the "owner" of a specimen assign the LSID
>>within the context of their own <domainName>, or adopting GBIF as the
>>international standard issuer for ALL specimen GUIDs. In other words, GBIF
>>would represent the centralized issuer of GUIDs for all biological
>>specimens, and the biological specimen community would/should rally around
>>GBIF for this purpose, and adopt GBIF specimen GUIDs as their own. I
>>personally have no problem with this (I do not live in fear of "Big Brother"
>>centralization when it serves the benefit of all, as I believe it would in
>>this case) -- but I know there are many who might have a problem with it,
>>and therefore it might not garner widespread adoption without large volumes
>>of "fuss".
>>
>>If, on the other hand, each organization issues its own GUIDs for its own
>>set of specimens, then the question is when, if ever, GBIF would assign a
>>specimen GUID? Perhaps as a surrogate for institutions that lack the
>>technological ability to assign their own LSIDs? But I wonder, how many
>>institutions that could serve electronic data of their holdings to the
>>internet would lack the ability to assign their own LSIDs?
>>
>>As you've outlined in subsequent slides, I see two alternative paths: A)
>>Get the biological world to rally around GBIF as the centralized provider of
>>GUIDs for specimens for all collections; or B) Have each
>>collection/institution issue its own set of LSIDs for its own specimens, and
>>have GBIF adopt those LSIDs for its own internal purposes. I could get
>>behind either approach, but I see danger in the adoption of a mixture of
>>these two approaches. I'll defer elaboration, but a lot of it has to do with
>>potential confusion about whether the GUID applies fundamentally to the
>>physical specimen, or the electronic conglomeration of data associated with
>>the specimen. Also, I think we should avoid the risk of assigning two
>>separate GUIDs for the same "single data element" (sensu your Slide 5).
>>
>>- Name record LSIDs: I understand the example of an IPNI LSID for a plant
>>name, and presumably there would be analogous "Catalog of Fishes" LSIDs for
>>each fish name, etc. But I don't think that would be a wise approach.
>>Unlike specimen records, where there are fairly unambiguous "owner"
>>institutions (or at least "original owner" institutions that issued a GUID),
>>taxonomic aggregators (IPNI, ITIS, Species2000, GBIF, uBio, etc.) are most
>>certainly not owners of the taxonomic names that they include in their
>>databases. We would want to avoid the risk of duplicate GUIDs for the same
>>name, and thus the need for mapping, e.g., an IPNI GUID for a name to its
>>ITIS equivalent. Again, I can't help but think that the world will be a
>>better place if we can avoid assigning multiple GUIDs to the same "single
>>data element".
>>
>>One approach would be to rally around GBIF, and rely on them to issue GUIDs
>>for all taxon names. However, I also recognize that we do not exist in a
>>political/personality vacuum with regards to "ownership" of taxonomic names,
>>or the electronic representations thereof. Therefore, the closest thing
>>that exists to an "owner" of a taxonomic name is the Commission of
>>Nomenclature (and its respective Code of Nomenclature) under which the name
>>was established. Thus, when it comes to assigning GUIDs for names (not
>>concepts), I would propose the following:
>>
>>urn:lsid:ICZN.org:TaxonName:XXXXXX (all zoological names)
>>urn:lsid:ICBN.org:TaxonName:XXXXXX (all botanical names)
>>urn:lsid:ICNB[or LBSN??].org:TaxonName:XXXXXX (all bacteriological names)
>>urn:lsid:ICTV[or ICVCN??].org:TaxonName:XXXXXX (all virus names)
>>
>>In an ideal world, we'd get to the point where there would be a need for
>>only one registrar of nomenclature, e.g.:
>>urn:lsid:BioCode.org:TaxonName:XXXXXXX
>>
>>Or, perhaps:
>>urn:lsid:gbif.net:TaxonName:XXXXXXX
>>
>>But I don't think we're quite there yet.
>>
>>In any case, the idea would be for the taxon name aggregators to adopt the
>>unambiguously unique GUID for each taxon name.
>>
>>Taxonomic concepts are a whole 'nother ball of wax....
>>
>>Slide 8:
>>I actually prefer this approach (GBIF as the central issuer of specimen
>>GUIDs), for a variety of reasons. One of the main reasons is that it would
>>assure uniqueness of an integer within a given <namespace> (e.g.,
>>Specimens), which would make things a bit easier for those of us who like to
>>use integers as primary keys in databases. In other words, it avoids the
>>possibility of urn:lsid:bishopmuseum.org:Specimen:1234567 colliding with
>>urn:lsid:usnm.gov:Specimen:1234567, when reducing the GUID to just its
>>integer component for local application purposes (where context can be
>>enforced by other means). However, I should point something out regarding
>>the "Advantage" part of this slide, which is that the "problem" of
>>transferring record locations doesn't exist, provided that the <domainName>
>>component of the LSID is taken as the issuer of the GUID, not as the current
>>owner of the specimen. In other words, if Bishop Museum assigned GUID
>>urn:lsid:bishopmuseum.org:Specimen:1234567 to a specimen, and then gave that
>>specimen to Smithsonian, then Smithsonian would retain the complete GUID
>>intact as: urn:lsid:bishopmuseum.org:Specimen:1234567.
>>
>>The danger comes when you try to use the <domainName> component as metadata
>>to represent the current location of the specimen and/or its electronically
>>represented data. This is where Wouter's original point about 'meaningless'
>>GUIDs comes into play. If the whole point of using LSIDs is to embed the
>>"current location" information within the ID itself so that applications can
>>retrieve additional data associated with the GUID directly, then I have some
>>concerns (mostly addressed already).
>>
>>Why is there a reference to urn:lsid:gbif.net:TaxonConcept:106734 at the top
>>of this slide???
>>
>>Slide 9:
>>Again, I'm not sure I understand on this slide why there is a reference to
>>urn:lsid:ipni.org:TaxonName:82090-3:1.1
>>Also, in this model, what function does the LSID serve that is not met by
>>the concatenated [InstitutionCode]+[CollectionCode]+[CatalogNumber] (in the
>>context of Global Provider Registry).
>>
>>Slide 10 (taxon concepts and literature):
>>This message is already getting too long... :-)
>>I already touched on this above under "Slide 4". I definitely agree that we
>>need a GUID system for References. This should include more than just
>>published references. It doesn't quite exist yet among the existing
>>Reference registrars (as far as I can tell) to accommodate the specific
>>needs of taxonomists (e.g. referring to a subsection of a reference as
>>representing an original taxonomic description), so I do see a need to
>>create a Reference GUID system specific to biology. I could rant for pages
>>on this, but I'll summarize simply with a plea to *DEFINE* a Concept GUID as
>>an intersection between a Name GUID and a Reference GUID (i.e., what I
>>would call an "Assertion"). Not all Name-Reference combinations will be
>>worthy of recognition as a distinct "Concept", but all are *potentially*
>>representative of a concept (Berendsohn), and thus all should be drawn from
>>the same pool of GUIDs as Concept GUIDs. In other words, "Concepts" should
>>be thought of as a subtype of Name-Reference instances. I would go further
>>to suggest (as I did above) that "Name" GUIDs should also be a subtype of
>>Name-Reference instances (non-exclusive of Concept subtype instances), using
>>the Name-Reference instance that represents the Code-recognized original
>>description of the name as the "handle" to the Name.
>>
>>By this approach, you need only two GUID object classes <objectClass>: one
>>for References, and one for Name-Reference intersections (Assertions). The
>>latter of these could serve as the source for both Concept GUIDs and Name
>>GUIDs.
>>
>>Last Slide:
>>
>>My own answers to your questions:
>>
>>1) Are LSIDs the most appropriate technology?
>>
>> I'm increasingly coming to that conclusion.
>>
>>2) Should identifiers be assigned and resolved centrally or via a fully
>>distributed model (or should providers have the option of using either
>>model)?
>>
>> I think the best option would be central. The next option would be fully
>>distributed. Leaving it as an option would, in my opinion, be a BIG
>>mistake.
>>
>>3) Which objects should receive identifiers?
>>
>> Specimens, References, Name-Reference intersections (Assertions), and
>>perhaps Agents. [TaxonNames and Concepts can be subsets of Name-Reference
>>intersections].
>>
>>3a) Should we develop a set of object classes for biodiversity informatics
>>and assign identifiers to instances of all of these?
>>
>> I think so, yes. Of course, it depends a bit on who you mean by "we". I'm
>>thinking sensu lato.
>>
>>3b) Should identifiers be associated with real world objects (e.g.
>>specimens), or with digitised records representing them (e.g. perhaps
>>multiple records representing different digitisation attempts by different
>>researchers for the same specimen), or both?
>>
>> I would say definitely real-world objects (treating things like
>>Code-recognized original descriptions of taxon names, and citable references
>>as "real-world objects"). I do NOT think we should have separate GUIDs for
>>digital representations thereof. Alternative digital representations are
>>simply clutter that will eventually be weeded out of the system, once we all
>>get organized on this stuff, and harness the power of the internet to
>>implement a global editing/QA system.
>>
>>4) What should be done about existing records without identifiers?
>>
>> As far as I know, ALL records are currently without identifiers (unless
>>someone established a widely accepted GUID system and I missed the
>>announcement...)
>>
>>4a) Should they be left alone?
>>
>> Ultimately, no.
>>
>>4b) Should they all be updated with identifiers?
>>
>> Ultimately, yes.
>>
>>4c) Should the provider software be modified to generate "soft" identifiers
>>(ones which we cannot guarantee in all cases to be unique) based e.g. on the
>>combination of InstitutionCode, CollectionCode and CatalogNumber?
>>
>> As an interim solution, perhaps. See my comments under "Slide 2" above.
>>
>>5) Are revision identifiers a useful feature?
>>
>> I would like to think not. If the information is truly dynamic over time
>>(e.g., re-determinations of taxonomic identity of specimens), then
>>individual instances should probably receive their own set of GUIDs (as
>>opposed to versions of the "parent" GUID). If the information is static
>>over time, and changes represent objective corrections, then I don't see a
>>real need to track that within the context of a GUID (record edit history
>>may or may not need to be tracked, but this seems to me to be a separate
>>issue from GUIDs).
>>
>>5b) How many providers will be able to provide and handle them?
>>
>> If versioning is incorporated, then it should be designed such that a
>>"default" version is provided automatically when versioning is not handled.
>>
>>
>>Sorry for the long post, but I feel that this issue is extremely important
>>at this point in bioinformatics history.
>>
>>Aloha,
>>Rich
>>
>>Richard L. Pyle, PhD
>>Natural Sciences Database Coordinator, Bishop Museum
>>1525 Bernice St., Honolulu, HI 96817
>>Ph: (808)848-4115, Fax: (808)847-8252
>>email: deepreef(a)bishopmuseum.org
>>http://www.bishopmuseum.org/bishop/HBS/pylerichard.html
>>
>>
>>>-----Original Message-----
>>>From: TDWG - Structure of Descriptive Data
>>>[mailto:TDWG-SDD@LISTSERV.NHM.KU.EDU]On Behalf Of Donald Hobern
>>>Sent: Thursday, September 23, 2004 6:22 AM
>>>To: TDWG-SDD(a)LISTSERV.NHM.KU.EDU
>>>Subject: Re: Globally Unique Identifier
>>>
>>>
>>>This is precisely one of the key questions we need to address with any
>>>identifier framework we adopt. I think we could easily use LSIDs in a
>>>way that should overcome your concerns, and I think that the built-in
>>>mechanisms for discovery and metadata access within the LSID model are
>>>really exciting.
>>>
>>>I have just put together a PowerPoint presentation to explain some of
>>>what I think we could achieve with globally unique identifiers and
>>>particularly with LSIDS. It can be downloaded from:
>>>
>>>http://circa.gbif.net/Public/irc/gbif/dadi/library?l=/architecture/globallyuniqueidentifier/
>>>
>>>It may be clearest if you go through it as a slide show rather than in
>>>edit mode.
>>>
>>>Thanks,
>>>
>>>Donald
>>>
>>>---------------------------------------------------------------
>>>Donald Hobern (dhobern(a)gbif.org)
>>>Programme Officer for Data Access and Database Interoperability
>>>Global Biodiversity Information Facility Secretariat
>>>Universitetsparken 15, DK-2100 Copenhagen, Denmark
>>>Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
>>>---------------------------------------------------------------
>>>
>>>
>>>-----Original Message-----
>>>From: TDWG - Structure of Descriptive Data
>>>[mailto:TDWG-SDD@LISTSERV.NHM.KU.EDU] On Behalf Of Wouter Addink
>>>Sent: 23. september 2004 17:38
>>>To: TDWG-SDD(a)LISTSERV.NHM.KU.EDU
>>>Subject: Re: Globally Unique Identifier
>>>
>>>It seems that DOI allows for any existing IDs to be used as part of the
>>>unique identifier. That seems to me a fast-to-adopt short term solution
>>>but not a good idea for the long term. At first sight I very much liked the
>>>LSID specification, but the longer I think about it, the less I like some
>>>parts. What I think is missing in the LSID specification is that the unique
>>>identifier should be 'meaningless' apart from being an identifier, so that
>>>it is time independent (and to avoid possible political problems). Any
>>>solution with a URN I can think of has some meaning, which makes solutions
>>>like a MAC-address generated GUID favorable in my opinion. And any meaning
>>>you need (like an authority of an object) can be specified in metadata
>>>instead of using it in the identifier. What is not very clear to me in the
>>>LSID specification is where the LSID generated by a LSIDAssigningService is
>>>actually stored.
>>>
>>>Wouter Addink
>>>
>>>----- Original Message -----
>>>From: "Gregor Hagedorn" <G.Hagedorn(a)BBA.DE>
>>>To: <TDWG-SDD(a)LISTSERV.NHM.KU.EDU>
>>>Sent: Wednesday, September 08, 2004 6:20 PM
>>>Subject: Re: Globally Unique Identifier
>>>
>>>
>>>
>>>>I am not quite sure, but to me it seems with "GUID" you refer to the
>>>>numeric, MAC-address generated GUID type. I have nothing against
>>>>these. However, any URN in my view is a GUID that has most of the
>>>>properties you mention:
>>>>
>>>>
>>>>>- it is guaranteed to be unique globally, and can be created anywhere,
>>>>>anytime by any server or client machine - it has no meaning as to
>>>>>where the data is physically located and will therefore not confuse any
>>>>>user about this
>>>>
>>>>>- most id
>>>>>mechanisms, especially URI/URN ids require a 'governing body' to
>>>>>handle namespaces/urls to ensure every URN is unique, whereas a GUID
>>>>>is always unique
>>>>
>>>>The governing body is restricted to the primary web address, and in
>>>>most cases such an address is already available. Being a member of a
>>>>governmental institution that explicitly forbids the use without
>>>>prior consent, and forbids the use of its domain name once you are no
>>>>longer working for them, I realize some potential for problems.
>>>>
>>>>
>>>>>I do think a URL of some kind would be useful for things such as
>>>>>global searches of multiple databases, as this will allow the search
>>>>>to go directly to the data source where the name, reference, etc comes
>>>>>from. But this should not be part of its ID. Maybe a name/id should
>>>>>have several forms, a GUID for an ID and a URL + a GUID for a fully
>>>>>specified name.
>>>>>
>>>>>What are the current thoughts on these ideas?
>>>>
>>>>A GUID is only part of the problem. The other half of the problem is
>>>>actually getting at the resource. URN schemes like DOI or LSID (I
>>>>prefer the latter) intend to define resolution mechanisms. That makes
>>>>the URN not yet a URL - in my view the good comes with the good,
>>>>location and reorganization independence.
>>>>
>>>>I believe GBIF should install such an LSID resolver, which is why in
>>>>the UBIF proxy model, under Links, I propose to support a general URL
>>>>(including potentially URNS), a typed LSID and a typed DOI. This
>>>>could be simplified to have just a URN (LSID and DOI are URNs), but
>>>>that would then require string parsing to determine and recognize the
>>>>preferred resolvable GUID types. Comments on splitting/not splitting
>>>>this are welcome!
>>>>
>>>>There may be some need to define a non-resolvable URN/numeric GUID as
>>>>well. However, that would not be under the linking question. Is it
>>>>correct that linking requires resolvability, or am I thinking in the
>>>>wrong direction?
>>>>
>>>>Gregor
>>>>
>>>>
>>>>----------------------------------------------------------
>>>>Gregor Hagedorn (G.Hagedorn(a)bba.de)
>>>>Institute for Plant Virology, Microbiology, and Biosafety
>>>>Federal Research Center for Agriculture and Forestry (BBA)
>>>>Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220
>>>>14195 Berlin, Germany Fax: +49-30-8304-2203
>>>>
>>>>Often wrong but never in doubt!