tdwg-content
> > Agreed! And that might be the better approach (for a lot of reasons). My
> > only concern would be to what extent desktop database applications can use
> > MAC values as primary keys (compared to using something like long integers)
> > efficiently and effectively, when manipulating large datasets in real time.
> > I guess this may be a trivial point in the grand scheme of things -- but
> > ultimately taxonomists will want to be able to work with large datasets on
> > their personal computers.
> >
>
> eh??? MAC addresses /are/ just long integers.
Sorry....I meant "Long" integer in the sense of DB applications, which are
32-bit (4 byte) integers. I had read once that MAC-style GUIDs are not
optimized on desktop DB applications for use as primary keys. However,
having just done some Google-snooping, I can't confirm that (in fact, if
anything, I've found the opposite). So I'm starting to think that my
concerns about using MAC-style IDs as primary keys are probably
a Red Herring.
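
A minimal sketch of the size difference being discussed, assuming Python's standard uuid module and using the example GUID quoted elsewhere in this thread. The helper name fits_in_long is hypothetical. It only illustrates that a GUID is a 128-bit value that cannot be stored in a 32-bit "Long" column; it says nothing about how any particular desktop database indexes either type.

```python
import uuid

# Example GUID taken from elsewhere in this thread.
guid = uuid.UUID("92AB5B37-70E9-4f05-9E97-CBABD08513ED")

as_int = guid.int      # the same identifier viewed as a single integer
as_bytes = guid.bytes  # 16 bytes, versus 4 bytes for a 32-bit "Long"


def fits_in_long(value: int) -> bool:
    """Return True if value fits a signed 32-bit integer column."""
    return -(2**31) <= value < 2**31


print(as_int.bit_length())   # number of bits needed for this GUID
print(len(as_bytes))         # storage size in bytes
print(fits_in_long(as_int))  # whether a 32-bit key could hold it
```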
> Once reduced to a digital representation, /everything/ is just a long
> integer. Hence in the end, for machine use, the differences between
> different ID schemes come down to a very few criteria, all having to do
> with programming ease and the effective computability of each of the
> assertions about the acquisition of IDs and manipulations of them that
> are required. No two schemes are distinguishable solely on the basis
> that one of them appears to have a representation as integers and one of
> them appears not to.
Agreed -- but aren't DB applications optimized to utilize certain kinds of
ID schemes more effectively/efficiently than other kinds of ID schemes?
Many thanks for the enlightening (and interesting) insights on MAC IDs. It
always amazes me that, though I may understand computers better than 95% of
people on Earth, the gap between me and the top 1% is much larger than the
gap between me and, say, the fish in my aquarium.
I think the important parts of this discussion surround the functional
parameters of the GUIDs for biological objects:
1) Should issuance of IDs be controlled from a single source; or freely
created by anyone with a computer, anywhere, anytime; or something
in-between?
2) Is it important that all biological objects use the same scheme for ID
sourcing, or is it advantageous to choose a scheme optimal for each class of
object (e.g., privately owned and managed specimen data, vs. publicly owned
and managed taxonomic nomenclature data)?
3) Should contextual information about how to resolve the ID be embedded
within the ID itself, or should the responsibility of context be relegated
to presentation protocols?
4) What is the greater risk to information flow in our context: reliance on
a single-point source for all the data resolution, or reliance on many
sources functioning simultaneously in order to resolve all of the data? What
options for mitigating impediments to data resolution are available to each
of these approaches?
Aloha,
Rich
Richard Pyle wrote:
> ...
> Agreed! And that might be the better approach (for a lot of reasons). My
> only concern would be to what extent desktop database applications can use
> MAC values as primary keys (compared to using something like long integers)
> efficiently and effectively, when manipulating large datasets in real time.
> I guess this may be a trivial point in the grand scheme of things -- but
> ultimately taxonomists will want to be able to work with large datasets on
> their personal computers.
>
eh??? MAC addresses /are/ just long integers.
Once reduced to a digital representation, /everything/ is just a long
integer. Hence in the end, for machine use, the differences between
different ID schemes come down to a very few criteria, all having to do
with programming ease and the effective computability of each of the
assertions about the acquisition of IDs and manipulations of them that
are required. No two schemes are distinguishable solely on the basis
that one of them appears to have a representation as integers and one of
them appears not to. What's more interesting is: for a given scheme, to
what representation other than as a long integer can you reduce the ID
and how does that help with the programming about, and administration
of, the identifiers, what is the scope of "Global" in the acronym GUID,
and what needs to be specified to convert the ID to another ID of
importance to the enterprise (e.g. converting a specimen GUID to a
(museum, room, drawer, pin) tuple).
For example, MAC addresses were designed to be globally unique only
among devices on the same wire. They were historically assigned in
blocks to manufacturers of network interface cards, and the only
motivation to have no repetition was to ensure that the NICs of two
manufacturers could be connected to the same wire. In fact, MAC
addresses are now routinely programmable and rarely designate some
worldwide distinction between two specific physical objects. Indeed,
anybody who buys a consumer router for home use is often---and sometimes
unknown to them---exposing to their ISP a MAC address for the router
hardware that is actually the MAC address of one of the NICs on their
internal network. Subject to the very criticism David V. has leveled
against them---lack of context /offered in the spec/---MAC addresses as
used on networks require for resolution into physical objects an Address
Resolution Protocol (ARP) that depends on the wire protocol. For
example, http://www.faqs.org/rfcs/rfc826.html is the ARP for ethernet,
whereas http://www.networksorcery.com/enp/rfc/rfc2625.txt is the ARP for
fibre channels. (Actually, ARPs are meant to select a MAC given some
other kind of identifier, usually a routing identifier like an IP
address. It's the Reverse ARP protocol that maps a MAC address to a
routing id, and that protocol is the one that---through a layered series of
protocol translations---can eventually yield communication
with---although not the MAC address of---a temporarily unique physical
device).
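
A minimal sketch of the points above about MAC addresses, assuming the usual 48-bit layout in which the upper 24 bits are the block assigned to a manufacturer and one bit of the first octet flags a locally administered (software-assigned) address. The function names and the sample addresses are made up for illustration. An address cloned from another device would still carry the original vendor block, so this flag does not detect every reprogrammed address.

```python
def parse_mac(text: str) -> int:
    """Convert 'aa:bb:cc:dd:ee:ff' (or '-' separated) to a 48-bit integer."""
    octets = text.replace("-", ":").split(":")
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {len(octets)}")
    value = 0
    for octet in octets:
        value = (value << 8) | int(octet, 16)
    return value


def oui(mac: int) -> int:
    """Return the upper 24 bits: the block assigned to a manufacturer."""
    return mac >> 24


def is_locally_administered(mac: int) -> bool:
    """Return True when the locally-administered bit of the first octet is set."""
    return bool((mac >> 40) & 0x02)


mac = parse_mac("02:00:5e:10:00:01")  # made-up address for illustration
print(hex(mac), hex(oui(mac)), is_locally_administered(mac))
```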
"The Problem: The world is a jungle in general, and the networking
game contributes many animals. At nearly every layer of a network
architecture there are several potential protocols that could be
used. " David C. Plummer, rfc826, November 1982
"RFC 826: it sucks big time. it's a piece of crap!! "
http://www.faqs.org/qa/rfcc-681.html, ando, 3/24/2004
"Protocol analysis in C01-AIM. The process was manual and laborious" ,
The Pandemonium of protocols, poster on network security found on the
web.
http://www.ee.oulu.fi/research/ouspg/frontier/management/poster/frontier-po…
L'Chaim
Bob Morris
Little Bit O'History: Dave Plummer's address on RFC826 was Symbolics
Inc., one of two companies making Lisp Machines in the early eighties.
For reasons /really/ irrelevant to this discussion, Symbolics sold the
first and only brand of laser printer commercially available, a wet
toner device briefly manufactured by Canon. It came with nothing but a
bit mapped raster image processor and could barely be lifted by two
people. Within a year it was replaced by dry toner technology whose
general architecture is what laser printers look like today. At
Interleaf, Inc., I wrote some driver software for it. On the wall we
displayed my first output, meant to be some line art, and which we
titled "Blizzard in Antarctica". The second was titled "Coal Mine at
Midnight".
> ...
>>
> Cheers,
> Rich
> > It seems to me that a lot of complexity would disappear if we could all get
> > behind a single issuer of GUIDs, and mirror the capability to resolve those
> > GUIDs on dozens or hundreds of servers around the world, and only use the
> > GUIDs in a semantic context that is self-evident.
>
> Perhaps. But what about if an institution wants to provide IDs for more
> than just specimen or name objects? Should we always rely on a single
> authority to provide a mechanism for doing that? I don't think that
> would go very far.
No -- we come together as a community to agree upon mechanisms for GUID
assignment and exchange among well-defined classes that we routinely wish to
share and aggregate information about (Specimen/Observations,
Names/Concepts/Assertions, References, Agents, and maybe a couple of
others). If Bishop Museum wanted to provide IDs for, say, Baseball cards, or
Star Wars memorabilia, or some other object type, then it would coordinate
with other holders of data relating to baseball cards [or perhaps trading
cards in general] or Star Wars memorabilia [or perhaps movie memorabilia in
general], and they would decide amongst themselves whether DOIs or LSIDs or
UUIDs, or MACs, or whatever make the most sense for their particular needs.
If Bishop Museum wanted to provide IDs for something like condition reports
of specimens over time, or changing dynamics of wild populations, or DNA
sequences, or Images, or any number of other objects that are relevant to
bioinformatics, then they would propose a new class of objects to TDWG,
along with an appropriate schema and recommend standards for implementation,
and other institutions with like-data would discuss the optimal approach to
dealing with those sorts of data. The "community" would debate the various
options in the context of integration with existing data exchange protocols,
etc., and a standard would emerge.
Don't get me wrong....the letter "G" is my favorite letter in DiGIR. And
the more I think about it, the more I understand why you want the IDs to
have embedded context that can be resolved automatically, without any custom
tweaking to accommodate new classes of IDs. But the costs (or at least my
perception of the costs) do tend to frighten me a bit. It makes a lot more
sense for specimen data. If Bishop Museum's server is down -- tough luck,
you can't have our data. But I would *HATE* to rely on any one particular
server to get public-domain information like taxon names whenever I needed
it. Yes, the single-point failure problem is real. But there are
technological solutions to minimize those sorts of concerns (on an
intermittent basis) down to something indistinguishable from zero. I don't
think I have ever gone to Google and found it down. And there can be
redundancy built into the system.
So now let me ask you this: Are the advantages of serving the GUID needs of
both specimens and taxon names via an identical, generalized GUID scheme
greater than the advantages of custom-tailoring the GUID scheme for each
different sort of data (owned, vs. public-domain)? I think I already know
the short answer is "Yes", but I'm interested in the reasons (e.g., common
set of software tools to deal with both, etc.)
> The DN portion is meant to be resolvable by the DNS system. So yes,
> there is a dependency on the continued existence of the DN, but it can
> be set up to be resolved by any LSID service endpoint.
I think I need to learn more about LSIDs before forming too strong of an
opinion either for, or against them.
> If we use the MAC approach + a context such as an LSID or DOI form, then
> there is absolutely no need for a central issuing agency.
Agreed! And that might be the better approach (for a lot of reasons). My
only concern would be to what extent desktop database applications can use
MAC values as primary keys (compared to using something like long integers)
efficiently and effectively, when manipulating large datasets in real time.
I guess this may be a trivial point in the grand scheme of things -- but
ultimately taxonomists will want to be able to work with large datasets on
their personal computers.
> Again, just use a MAC based GUID inside an LSID context. If you have
> any MS dev tools on your machine type "guidgen" at the command prompt.
> Voila! Globally unique identifiers. No matter how many times you
> push the "New GUID" button.
Yes, and this may be the best approach. MACs are scary for people to read,
but as I said before, people really shouldn't be reading them.
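
A minimal sketch of minting an identifier locally and placing it in an LSID-style context, assuming Python's standard uuid module in place of guidgen. The function name new_lsid is hypothetical, and mymuseum.org is the example authority used elsewhere in this thread. Note that uuid1() normally uses the host's MAC address as its node value but may fall back to a random value when no address can be determined.

```python
import uuid


def new_lsid(authority: str, namespace: str) -> str:
    """Mint an identifier locally, with no central issuing agency involved."""
    # uuid1() combines a timestamp with a node value (normally the host's
    # MAC address); uuid4() would use random bits instead.
    object_id = uuid.uuid1()
    return f"urn:lsid:{authority}:{namespace}:{object_id}"


print(new_lsid("mymuseum.org", "specimen"))
print(new_lsid("mymuseum.org", "specimen"))  # a different value each call
```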
But in order to avoid having a bottleneck for data resolution at the point
of MAC issuance, you would need to get the data mirrored efficiently across
many servers. This could be done like DNS (as I understand it), where the
changes propagate through a haphazard web of servers. But I think I'd be
more comfortable having some sort of centralized hub or coordinator (like
GBIF) to ensure data mirroring is done efficiently and completely. On this
point, I could very-well be persuaded to change my view.
> > Again, I'll have to think about this some more. I certainly don't think
> > that the "system" should be incapable of dealing with new classes -- sort of
> > like how anyone can develop their own Federation Schema and use DiGIR to
> > establish specific information networks. But I'd hate to see a breakdown in
> > the global transmission of biodiversity information simply because different
> > subgroups establish their own special-needs, non-mutually-compatible classes
> > for dealing with essentially the same kinds of information (especially if
> > they do not also conform to a generalized international standard).
>
> Bah. That's the whole point of this - to facilitate data exchange. If
> a small subgroup wants to start exchanging data in an abbreviated
> format, so what?
No problem if they also continue to provide the relevant bits via the
"conventional" means. But BIG problem if everyone in Europe gravitates
towards one conventional standard, and everyone in the U.S. gravitates
towards a different flavor, and everyone in Asia gravitates towards yet a
different flavor. I'm thinking Cell Phone standards, NTSC vs. PAL,
competing HDTV standards, DVD+R vs. DVD-R, etc., etc.
Already we have some problems in dealing with the fact that ICZN and ICBN do
not have perfectly compatible Codes of nomenclature. Imagine if different
geographic regions adopted their own versions of nomenclatural Codes; or if
the Fish people got together and decided they wanted slightly different rules
to apply to their names. Such freedom would not likely contribute favorably
to scientific progress. Taxonomists conform to respective Codes of
nomenclature not because they are perfect in how they establish names, but
because the community has converged on a single standard.
> As long as the identifiers being used are able to
> resolve the type of object being passed around,
...which they presumably wouldn't be if the providing server were
offline....no?
> and the objects conform
> to their definitions, it shouldn't be a problem. By initially
> establishing a robust framework for Scientific Names and perhaps
> specimen data / collections, then there will be little need for others
> to recreate new ways to represent that data.
...so why compromise the optimality of the system in order to accommodate
those who might prefer to define objects in a slightly different way from
everyone else?
> The benefits of a robust
> reliable representation and provision of cheap, effective software tools
> will hopefully overcome the steep learning curve needed to even
> understand what's in some of these schemas.
Yes, but this is something we both agree on!
> But if I want to say to you, hey look at this specimen xxx while we're
> chatting from around the world using an instant messenger while
> collaborating on some project, wouldn't it be nice to just be able to
> type in lsid:mymuseum.org:specimen:1234 and have your client retrieve
> that exact data and associated metadata directly?
But such a chat would certainly provide an opportunity to suggest context.
Wouldn't it be easier if I said "Hey, go to GBIF and look up SpecimenID
1234"? This scenario seems to apply equally to both of our world
perspectives. The only advantage I would see for LSIDs in this case is if I
forgot to mention to my colleague that I wanted her to look up a specimen,
rather than a taxon name, and just told her to "look up 1234". Not likely
in a human-human conversation.
> A trivial example but
> one that can form the foundation of some cool stuff for data exchange
> and interaction. I thought that was the whole point of these GUID
> things. But maybe I'm mistaken?
Yes, I certainly agree that this is the whole point of GUIDs! What I don't
understand is why LSIDs (with domain-active requirements) fulfill this more
effectively (all costs and benefits considered) than other GUID systems.
> Except in the somewhat bizarre case when you need the old version of the
> object.
Yeah, but as you say this is a bizarre case. In those bizarre cases, you
can send an email to the data manager and ask for a report on the edit
history of the record. If they got it, they got it. But I don't see this as
being such a routine need that it needs to be accommodated by the GUID
system (unless, as I suggested earlier, that it was otherwise completely
transparent and could be ignored without consequence).
> Yeah, good question. Maybe this should be on the GBIF DADI list or TDWG
> general? Or even the LSID list?
If it moves, let me know where it goes. Right now, it's time for bed....
Cheers,
Rich
> > I believe you are in the majority on this. But when I think it all through,
> > I still feel that consolidation of GUID issuance will be more advantageous
> > in the long term.
> >
>
> Nope. You'll have to try harder to convince me :-)
Wasn't trying to convince you....just signaling that you had yet to convince
me! :-)
I can see the discussions will be lively in Christchurch... :-)
> > If I read you correctly, I gather you are saying that the issuance of
> > numbers would be distributed and isolated, but the issuers would fall under
> > a centralized authority. I'm not sure I understand why this system is
> > necessarily advantageous over a centralized issuer.
> >
>
> Because there's no single point of failure, it is more scalable, and in
> the (unlikely) event the centralized authority no longer exists, it
> would be a fairly trivial matter to delegate root authority to another
> trusted party.
O.K., I can see that. But in the world I described, the only effect of a
failure would be an inability to retrieve new IDs -- all existing ID's would
still be resolvable at any of the dozens or hundreds of mirror sites.
Moreover, I would assume that the server that issued the numbers would be
designed to be as reliable as possible (think NYSE). In the event that the
centralized authority no longer exists, it would be, as you say, a trivial
matter to delegate number issuance to another "trusted" mirror site.
Also, what of my point that a dataset containing ID's with heterogeneous
sources would allow *ALL* sources to be single points of failure? Is it not
true that an LSID would require that the issuing domain be online in order
to retrieve the data associated with the ID? So if my dataset included IDs
from 15 issuing domains, all 15 would need to be active at the time I run my
query in order to get a complete return of data? This seems like an even
less robust system.
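
A rough worked example of the concern above, under two loudly stated assumptions: every server has the same availability, and failures are independent. The 0.99 figure is purely illustrative and not a measurement of any real service. With those assumptions, needing all 15 issuing domains online at once gives roughly an 86% chance of a complete result, while needing any one of 15 mirrors gives a chance indistinguishable from 100%.

```python
def all_up(availability: float, sources: int) -> float:
    """Probability that every one of `sources` independent servers is up."""
    return availability ** sources


def any_up(availability: float, mirrors: int) -> float:
    """Probability that at least one of `mirrors` independent servers is up."""
    return 1 - (1 - availability) ** mirrors


p = 0.99  # assumed per-server availability, purely illustrative
print(f"all 15 issuing domains up: {all_up(p, 15):.3f}")
print(f"any of 15 mirrors up:      {any_up(p, 15):.3f}")
```

This only models the "all sources must be online" question; it does not capture the separate worry about a single issuer disappearing permanently.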
> >>It is quite likely that there will be multiple LSID generators and
> >>issuers. There is no real reason why this should be prevented, except
> >>to ensure that appropriate measures are taken to avoid duplication of
> >>GUIDs for the same object (taxonomic concept in this case).
> >
> > Actually, I was talking about Taxonomic Names, specifically -- but if Names
> > are considered as represented by a subset of Concepts (as I hope they will
> > be), then it's the same GUID pool.
> >
>
> Not sure what you mean here- If Joe enters a citation someplace and Rich
> uses its LSID within a Taxonomic Object he entered, why does it have to
> be in the same pool? As long as the LSID resolved to the appropriate
> object, all would be good.
No....Reference IDs (I assume by "citation", you mean what I mean when I say
"Reference"?) would be a different pool from Names.
What I meant was, in my world "Name" Id's would be drawn from the same pool
of Name+Reference IDs that Concept IDs would be. In other words, there
would be one pool of IDs for Name+Reference instances. A subset of these
would be Concept-bearing Name+Reference instances. And a subset of the
Concept-bearing instances (sub-subset of Name+Reference instances) would be
Name-bearing instances. Those Name-bearing instances would be the "Name"
component of other Name+Reference instances. It's a recursive relationship,
via well-defined subtypes. But this is an entirely separate topic of
discussion, having more to do with the question of "what is the essence of a
taxonomic concept", and "what is the essence of a taxonomic name" -- that is
tangential to the more general discussion at hand.
> Yeah, but it really concerns me having a single point of failure for
> such a critical system.
As I described above, it doesn't have to be a single point of failure, and
the only "failure" would be a delay in receiving new ID's. And there could
be a defined "chain of command" such that if the primary server (ID issuer)
goes down, the calls are automatically re-routed to the next mirror in the
chain of command, which then automatically assumes authority for issuing
numbers....and so on down the chain.
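
A minimal sketch of the "chain of command" idea, assuming each issuer is simply a callable that either returns a new ID or signals that it is down. The names IssuerDown, issue_id, primary, and first_mirror are all hypothetical, as is the ID format. The sketch does not address how a mirror that takes over would avoid reissuing numbers the failed primary had already handed out.

```python
from typing import Callable, Sequence


class IssuerDown(Exception):
    """Raised by an issuer that cannot currently hand out IDs."""


def issue_id(chain: Sequence[Callable[[], str]]) -> str:
    """Ask each issuer in order of authority; return the first ID obtained."""
    for issuer in chain:
        try:
            return issuer()
        except IssuerDown:
            continue  # fall through to the next mirror in the chain
    raise IssuerDown("no issuer in the chain is reachable")


def primary() -> str:
    # Stand-in for a primary server that is offline.
    raise IssuerDown("primary offline")


def first_mirror() -> str:
    # Stand-in for the next mirror, which assumes issuing authority.
    return "ID-0001"


print(issue_id([primary, first_mirror]))
```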
I *think* I understand the fundamental differences in our perspective. I
see the bioinformatics world operating more smoothly within a very specific,
well-understood and universally agreed-upon context. You seem to prefer a
completely generic, self-describing system, that is not necessarily
restricted in scope to biology (correct?) I understand your perspective,
and I certainly see its appeal -- I just think that the case-specific
implementation is even more appealing (to me, anyway).
> How an LSID is resolved is described in detail in the document:
>
> http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02
>
> Section 8.3 describes the use of DNS for resolution.
>
> Basically, the LSID client:
>
> 1. Parses the LSID urn:lsid:DN:NS:ID[:Rev]
> 2. Using DNS, locate the SRV record for DN, which points to the service
> 3. Using DNS again, resolve the location of the service
> 4. Once you have the service endpoint, basically ask it for the object
> with NS:ID:Rev
>
> That's a gross simplification, and it appears that the LSID definition
> now treats DNS resolution as one resolution mechanism, rather than the
> only one.
O.K., this clears up a lot in my mind. But what intrigues me now is: What
are the other (potential?) resolution mechanisms?
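
A minimal sketch of step 1 only (parsing), assuming the urn:lsid:DN:NS:ID[:Rev] form quoted above. The LSID type and parse_lsid function are hypothetical names, and the DNS lookups in steps 2 and 3 and the service call in step 4 are deliberately left out. The validation here is much looser than whatever the specification actually requires.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str           # DN: domain name of the issuing authority
    namespace: str           # NS
    object_id: str           # ID
    revision: Optional[str]  # Rev, optional


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:DN:NS:ID[:Rev]' into its components."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


print(parse_lsid("urn:lsid:mymuseum.org:specimen:1234"))
print(parse_lsid("urn:lsid:mymuseum.org:specimen:1234:2"))
```

A resolver would then use the authority field to find the service and ask it for the object named by the remaining fields.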
Obviously, my grasp of LSIDs was fundamentally flawed. As much as the
"single-point failure" problem concerns you, the "must have all DNs within a
dataset containing multiple LSIDs active & online" concerns me. In a sense,
it means there are many single-point failures that can impede data flow.
> > Embedding issuer context in a GUID makes sense to me. Restricting
> > resolution of GUID to the embedded issuer *only*, seems like a very
> > dangerous system to me.
> >
>
> Yeah, but once again - if the single issuer no longer exists, then
> everything is gone. That would be a real drag.
The specimens, presumably, wouldn't be gone -- they would simply move to a
new location. And what happens when one specific specimen is transferred
from Museum "A" to Museum "B"? Does it keep the original LSID (meaning
that the original Museum must continue to maintain and update data for a
specimen it no longer owns)? Or is it issued a new LSID by the receiving
institution, in which case the specimen now has more than one ID? What
happens to a legacy dataset that still has a pointer to the old LSID? Yes,
I can see work-arounds to all of these (mechanisms for auto-forwarding,
maintenance of indexes of duplicate LSIDs, etc.) But when you add up all of
that sort of baggage, the
redundant-mirrored-central-ID-issuer-with-defined-chain-of-command-cascade
system seems easier to manage, more flexible, and more reliable.
> > Well...that's partly why I emphasized that I think GUIDs should be for
> > computer-computer data exchange only. But even if printed for a pair of
> > human eyes to read, surely there would be *some* stated context. E.g.,
> > "ITIS TSN 1234567"; "BPBM 123456"; "GBIF Specimen ID 9876543"; "ICZN NameID
> > 92AB5B37-70E9-4f05-9E97-CBABD08513ED"; etc....
> >
>
> So formalize that a little and you might have something more
> consistently machine parsable like: ITIS.ORG:TSN:1234567;
> BPBM.EDU:something:123456;GBIF.ORG:Specimen:9876543, ...
>
> Add in the system identifier for resolution (urn:lsid:...) and you have
> LSIDs. The result is a far more consistent, legible and widely useful
> mechanism for referencing objects. Allowing an author to arbitrarily
> provide the context for identifiers gets us little further along.
Yes, but the difference would be that in my world, any one of many mirrored
sites could resolve GBIF.ORG:Specimen:9876543; whereas the LSID protocol you
described above requires the issuer to resolve it.
> > How hard would it be in such cases to include within the Methods section of
> > the document, something to the effect of "All taxon IDs listed in this paper
> > refer to GBIF Specimen ID's, which can be resolved at gbif.net". If the
> > problem is one involving a pair of human eyes reading a number, then the
> > problem can be solved in the context of a pair of human eyes reading the
> > context.
> >
>
> Sure, but do that consistently, by all authors? And do it in a way that
> is without ambiguity? Machine parsable (for electronic publication)?
> Easily reusable in other documents?
If it's human-human data exchange, it doesn't need to be consistent by all
authors. ICZN requires new species-group names to be represented by a
holotype specimen. There are no rules about how an author indicates the
Holotype specimen -- only that it is done so more or less unambiguously.
Perhaps ICZN rules should be strengthened -- but the point is, human-human
communication of this sort works fairly well even without rigid rules for
consistency, and even tolerates a fair amount of ambiguity.
Machine parsable is another issue. I see documents, as you describe them,
as a medium of human->human (or computer->human) information exchange. If
the data exist electronically already, why pass them from machine to machine
via "dumbed-down" human-readable documents that need to be subsequently
re-interpreted by a machine? Better to start teaching Kindergarteners to
read XML as though it were prose! :-)
It's getting late, but I'll try to send off a "Part II" before I call it a
night.
Aloha,
Rich
> Yes, certainly, a GUID within the context of an XML document is pretty
> well defined by the schema, dtd or just its loose association with
> other elements in the document.
>
> But what about if one appears in a journal article, a citation in a
> policy document, etc?
Well...that's partly why I emphasized that I think GUIDs should be for
computer-computer data exchange only. But even if printed for a pair of
human eyes to read, surely there would be *some* stated context. E.g.,
"ITIS TSN 1234567"; "BPBM 123456"; "GBIF Specimen ID 9876543"; "ICZN NameID
92AB5B37-70E9-4f05-9E97-CBABD08513ED"; etc....
> It would be nice to be able to provide a unique
> identifier as perhaps a footnote for a scientific name mentioned in a
> document.
How hard would it be in such cases to include within the Methods section of
the document, something to the effect of "All taxon IDs listed in this paper
refer to GBIF Specimen ID's, which can be resolved at gbif.net". If the
problem is one involving a pair of human eyes reading a number, then the
problem can be solved in the context of a pair of human eyes reading the
context.
> Or perhaps a system might be developed that provided an LSID
> for a DiGIR query document- so the dataset could be completely recreated
> just by hitting on the LSID (yes, one is under construction). One could
> imagine simply passing the LSID to another infrastructure that say,
> estimated potential distribution, or highlighted relevant news reports
> from an AP feed mentioning the species for which the query was created.
> Using a simple, meaningless GUID buys us none of this potential, and
> forces us to always use a wrapper to provide a contextual basis on how
> to interpret the identifier.
I guess my question is, why *must* the wrapper be integral to the ID itself?
Why can't the contextual basis be established around the ID, at the time the
ID is presented/transferred, as needed? If the cost of embedding the
context within the GUID is that all links to, say, Bishop Museum ichthyology
GUIDs for specimens become useless if the collection is transferred to
another institution and the embedded DomainName terminated, then I say put
the burden of context establishment on the ID exchange system
("presentation layer"), rather than embedded within the ID itself.
Aloha,
Rich
> I have to disagree - kind of. A non-information-bearing GUID such as
> one generated by a MAC, eg
>
> {92AB5B37-70E9-4f05-9E97-CBABD08513ED}
>
> is completely useless unless it only appears within the context of a
> system that provides more information about what it actually is.
Yes, that would be an assumption. But not an unreasonable one. I'm trying
to imagine a scenario where I am presented with a series of MAC id's where I
don't inherently understand the context. I suppose if I came in to work and
found such a number scribbled on a piece of paper, with no other
information, I'd be in a fix to figure out what the number refers to. But
obviously that's not a realistic scenario. I suspect that such IDs would be
used by computers (not humans), and would only be exchanged among computers
in some sort of semantic context; e.g., within the context of a DwC2 XML
file, nestled between appropriate tags:
<GlobalUniqueIdentifier>92AB5B37-70E9-4f05-9E97-CBABD08513ED</GlobalUniqueIdentifier>
...these themselves nestled within further context tags.
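As a minimal sketch of what "exchanged in a semantic context" could look like on the receiving end, the Python below reads the identifier out of its enclosing tag. Only the GlobalUniqueIdentifier element name comes from the example above; the wrapping <record> element and the read_guid function are assumptions made for illustration.

```python
import uuid
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the GlobalUniqueIdentifier tag is taken from
# the message; the enclosing <record> element is invented for illustration.
FRAGMENT = """
<record>
  <GlobalUniqueIdentifier>92AB5B37-70E9-4f05-9E97-CBABD08513ED</GlobalUniqueIdentifier>
</record>
"""


def read_guid(xml_text: str) -> uuid.UUID:
    """Return the GUID found inside the GlobalUniqueIdentifier element."""
    root = ET.fromstring(xml_text)
    node = root.find("GlobalUniqueIdentifier")
    if node is None or not node.text:
        raise ValueError("no GlobalUniqueIdentifier element found")
    # uuid.UUID checks the syntax and normalises letter case.
    return uuid.UUID(node.text.strip())


guid = read_guid(FRAGMENT)
```

The receiving program never has to guess what the string means, because the tag around it says so.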
> That's
> the point of the LSID or DOI, they provide GUIDs that identify what
> system can be used to resolve them. If GUIDs for names or specimens or
> whatever are to be used in other systems, then it is essential that the
> GUID can be associated with a resolving system.
I tend to agree -- which is why I preferred DOIs (and increasingly, LSIDs)
to MAC ID's (which show up all over the place in all sorts of contexts).
Even still, though, I think we'll find that all electronic exchanges
involving GUIDs of which we speak, will do so within an evident context.
> Both the DOI and LSID approaches are structured and provide context.
> The DOI system uses the NISO Z39.84-2000 standard for categorization,
> the LSID uses the domain name system. Both provide a context essential
> for reuse of an identifier outside its original context.
Yes, but I initially preferred DOIs to LSIDs because there tends to be less
"context baggage" associated with them. My sense of DOIs is that each
institution would not create its own DOI category; but rather there would be
a single agreed-upon DOI category that is independent of any particular
institution (with all the potential for political baggage an
institution-specified context might afford).
> This was one of the first recommendations to GBIF - to provide a
> registry of institution codes for exactly this purpose. Having a tool
> that verified the uniqueness of records within a collection as exposed
> by its provider (either biocase or digir) would help this uniqueness
> problem. Now that the UDDI registry is available, we could in theory
> use the institution identifiers in there.
More power to you (and GBIF, and the future of DiGIR)! But in my view, it
should still be seen only as a temporary solution, until we can get our acts
together with more specific (and less information-contingent) ID systems.
> I strongly disagree that there should be a single GUID issuer or
> resolver.
I believe you are in the majority on this. But when I think it all through,
I still feel that consolidation of GUID issuance will be more advantageous
in the long term.
> What we really need is an organization that operates kind of
> like a certificate authority- GBIF could act as the root from which
> other trusted GUID issuers may be created. In this way we can avoid the
> arbitrary creation of GUIDs yet still provide considerable flexibility
> and de-centralization in the community.
If I read you correctly, I gather you are saying that the issuance of
numbers would be distributed and isolated, but the issuers would fall under
a centralized authority. I'm not sure I understand why this system is
necessarily advantageous over a centralized issuer.
> It would be a relatively simple task to include a LSID resolver service
> along with a DiGIR provider. I have prototyped such a system a while
> back, but other issues prevented deployment. With such an
> implementation, it would be trivial to assign unique identifiers to
> specimens - but first the problems institutions seem to have even
> providing unique identifiers within a collection must be resolved.
AGREED!
> > As you've outlined in subsequent slides, I see two alternative paths: A)
> > Get the biological world to rally around GBIF as the centralized provider of
> > GUIDs for specimens for all collections; or B) Have each
> > collection/institution issue its own set of LSIDs for its own specimens, and
> > have GBIF adopt those LSIDs for its own internal purposes. I could get
> > behind either approach, but I see danger in the adoption of a mixture of
> > these two approaches. I'll defer elaboration, but a lot of it has to do with
> > potential confusion about whether the GUID applies fundamentally to the
> > physical specimen, or the electronic conglomeration of data associated with
> > the specimen. Also, I think we should avoid the risk of assigning two
> > separate GUIDs for the same "single data element" (sensu your Slide 5).
> >
>
> A mixture would still work, provided there was appropriate coordination
> between the efforts.
With the level of coordination required, you might as well go for the "brass
ring" (in my opinion). But maybe what I see as the "brass ring" is seen as
a dud to others.
> > Thus, when it comes to assigning GUIDs for names (not
> > concepts), I would propose the following:
> >
> > urn:lsid:ICZN.org:TaxonName:XXXXXX (all zoological names)
> > urn:lsid:ICBN.org:TaxonName:XXXXXX (all botanical names)
> > urn:lsid:ICNB[or LBSN??].org:TaxonName:XXXXXX (all bacteriological names)
> > urn:lsid:ICTV[or ICVCN??].org:TaxonName:XXXXXX (all virus names)
> >
> > In an ideal world, we'd get to the point where there would be a need for
> > only one registrar of nomenclature, e.g.:
> > urn:lsid:BioCode.org:TaxonName:XXXXXXX
> >
> > Or, perhaps:
> > urn:lsid:gbif.net:TaxonName:XXXXXXX
>
> It is quite likely that there will be multiple LSID generators and
> issuers. There is no real reason why this should be prevented, except
> to ensure that appropriate measures are taken to avoid duplication of
> GUIDs for the same object (taxonomic concept in this case).
Actually, I was talking about Taxonomic Names, specifically -- but if Names
are considered as represented by a subset of Concepts (as I hope they will
be), then it's the same GUID pool.
> So a
> critical piece of infrastructure for a name service that was intending
> to assign GUIDs would be a mechanism for determining if the object they
> are about to assign the GUID to is not already present in the system,
> held at some other location. There needs to be something like a global
> "findThisObject(taxon_object)" that absolutely guarantees that the
> instance doesn't exist some other place. And if duplicates were to
> occur, then there must also be a mechanism for indicating equivalence
> between GUIDs, or perhaps a way of deleting the duplicate (how to decide
> which is the duplicate?).
I agree with all of this, but it seems that the infrastructure you describe
would yield a higher total cost than the single GUID provider approach
would.
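To give a feel for the bookkeeping that "indicating equivalence between GUIDs" would involve, here is a minimal Python sketch using a union-find structure. The class name, its methods, and the example.org LSIDs are all hypothetical; nothing here is an existing GBIF or LSID service.

```python
class GuidEquivalence:
    """Tracks which GUIDs have been declared to denote the same object."""

    def __init__(self) -> None:
        self._parent = {}

    def find(self, guid: str) -> str:
        """Return the representative GUID of the group containing guid."""
        self._parent.setdefault(guid, guid)
        root = guid
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited GUID straight at the root.
        while self._parent[guid] != root:
            self._parent[guid], guid = root, self._parent[guid]
        return root

    def declare_same(self, a: str, b: str) -> None:
        """Record that two GUIDs were issued for the same object."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Arbitrary but deterministic choice of the surviving representative.
            self._parent[max(ra, rb)] = min(ra, rb)

    def same(self, a: str, b: str) -> bool:
        return self.find(a) == self.find(b)


# Placeholder identifiers, invented for the example.
a = "urn:lsid:example-a.org:TaxonName:1"
b = "urn:lsid:example-b.org:TaxonName:77"
c = "urn:lsid:example-c.org:TaxonName:500"
d = "urn:lsid:example-a.org:TaxonName:2"

eq = GuidEquivalence()
eq.declare_same(a, b)
eq.declare_same(b, c)
```

Every issuer would have to consult and update such a table, which is part of the cost being compared against a single issuer.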
> Forcing the use of a single DN such as BioCode.org for all names would
> seem to be a mistake, since that implies a single resolver service for
> all names- with obvious implications in case of failure. Perhaps there
> can be multiple resolver services with a single DN? That would probably
> work fine then.
Hmmm...I'm not sure I follow. If I interpret your word "resolver"
correctly, then I see no reason why BioCode.org LSIDs could only be resolved
by one server. Is that what the DomainName component of a LSID is
specifically for? That is, "go to this domain to resolve the meaning of
this LSID"? I thought the DomainName component was simply to give
uniqueness to an LSID in the form of representing the issuer (analogous to
the function of InstitutionCode in DwC). I see no reason why there couldn't
be dozens, or hundreds of mirrored caches of the complete dataset all over
the world, maintained automatically in synchrony with the "master" set
(which would presumably, but not necessarily, reside at BioCode.org). Any
one of the mirrors could resolve any BioCode.org LSID. With such a system,
resolving an LSID would require only that *any one* of potentially dozens of
mirrored servers be functional.
If I understand you correctly, and an LSID is resolved only by the server at
the Domain embedded within the LSID, then a dataset containing a
heterogeneous assortment of LSIDs would need *all* of potentially dozens of
distributed servers to be functional.
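As a rough back-of-the-envelope illustration of the two cases, assume (purely for the sake of argument) that every server is reachable independently with the same probability. The function names and the figures of 24 servers and 99% uptime are illustrative assumptions, not measurements.

```python
def any_mirror_up(p_up: float, n_servers: int) -> float:
    """Chance that at least one of n identical mirrors is reachable."""
    return 1.0 - (1.0 - p_up) ** n_servers


def all_issuers_up(p_up: float, n_servers: int) -> float:
    """Chance that every one of n separate issuer-run resolvers is reachable."""
    return p_up ** n_servers


# Assumed figures, chosen only for illustration.
mirrored = any_mirror_up(0.99, 24)      # any one mirror is enough
distributed = all_issuers_up(0.99, 24)  # all issuers are needed

print(f"mirrored:    {mirrored:.6f}")
print(f"distributed: {distributed:.6f}")
```

Under these assumptions the mirrored arrangement is reachable essentially always, while a dataset that needs all 24 independent resolvers at once succeeds only about 79% of the time.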
> The LSID service must be able to resolve the object. When the object
> moves some other place, then there will need to be a mechanism for the
> LSID service to forward the resolution to the appropriate service. The
> really big problem is when an institution no longer exists - so the
> hypothetical example of Bishop museum consuming all the Smithsonian fish
> collections - the Smithsonian LSID resolver would perhaps no longer
> exist, and so those LSIDs become meaningless.
In that case, I would vehemently oppose the use of LSIDs -- especially ones
issued from multiple sources, which rely on the issuer existing into
perpetuity. It seems MUCH more feasible to me that the GUIDs only be used
within a prescribed context, than it would to require that all LSID issuers
exist into perpetuity, and be functional at all times that someone needs to
resolve the information associated with any particular ID value.
Embedding issuer context in a GUID makes sense to me. Restricting
resolution of GUID to the embedded issuer *only*, seems like a very
dangerous system to me.
> Perhaps there's a
> delegation mechanism that can be used? So when a DN can't be resolved,
> the system backs down to a default DN, such as gbif.org that would then
> indicate that smithsonian.org is now bishop.org?
But it's not that simple, is it? If there is an LSID:
urn:lsid:bishopmuseum.org:Specimen:1234567
and another LSID, to a completely different specimen:
urn:lsid:smithsonian.gov:Specimen:1234567
...then simply re-directing all bishopmuseum.org requests to Smithsonian
wouldn't work....would it? Or would Smithsonian recognize the domain and
deal with it accordingly?
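A minimal Python sketch of the question: if a receiving resolver keeps the complete LSID as its key, the two specimens stay distinct even though their numbers match. The Lsid type, parse_lsid, and the merged table are hypothetical, and treating the authority as case-insensitive is an assumption.

```python
from typing import NamedTuple


class Lsid(NamedTuple):
    authority: str   # the issuer's domain name
    namespace: str
    object_id: str


def parse_lsid(text: str) -> Lsid:
    """Split 'urn:lsid:<authority>:<namespace>:<object>' into its parts."""
    parts = text.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    # Assumption: the authority is compared case-insensitively.
    return Lsid(parts[2].lower(), parts[3], parts[4])


bishop = parse_lsid("urn:lsid:bishopmuseum.org:Specimen:1234567")
smithsonian = parse_lsid("urn:lsid:smithsonian.gov:Specimen:1234567")

# Hypothetical merged resolver table, keyed by the complete LSID.
merged = {
    bishop: "specimen originally numbered by Bishop Museum",
    smithsonian: "specimen originally numbered by Smithsonian",
}

same_number = bishop.object_id == smithsonian.object_id
still_distinct = merged[bishop] != merged[smithsonian]
```

A redirect that discards the authority and looks up only the number would lose exactly this distinction.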
It seems to me that a lot of complexity would disappear if we could all get
behind a single issuer of GUIDs, and mirror the capability to resolve those
GUIDs on dozens or hundreds of servers around the world, and only use the
GUIDs in a semantic context that is self-evident.
Re-reading something I wrote:
> > I would go further
> > to suggest (as I did above) that "Name" GUIDs should also be a subtype of
> > Name-Reference instances (non-exclusive of Concept subtype instances), using
> > the Name-Reference instance that represents the Code-recognized original
> > description of the name as the "handle" to the Name.
Actually, it's probably safe to say that all "name-bearing" Name+Reference
instances (i.e., original descriptions) are, virtually by definition,
also "concept-bearing" Name+Reference instances. So, not only would
name-bearing and concept-bearing Name+Reference instances be non-exclusive
of each other, it would probably be safe to think of name-bearing instances
as a subset (Subtype) of Concept-bearing instances, which themselves are a
subset (Subtype) of all Name+Reference instances.
> > My own answers to your questions:
> >
> > 1) Are LSIDs the most appropriate technology?
> >
> > I'm increasingly coming to that conclusion.
>
> I agree. The LSID system is easy to implement, stable, scalable and
> does everything we need. The DOI system is good as well, but the fee
> scheme bothers me (though I understand there are ways around that).
My understanding is that it would be easy to develop a DOI-like system that
is not part of the fee-based DOI system, and I still find it appealing
because it could be as simple as an integer ID and a very basic context tag.
As for LSIDs -- If I understand correctly that the purpose of the
<DomainName> portion of the LSID is to point to the one (and only?) server
that can resolve the ID, then all of a sudden I don't like them at all. If
it's true that the embedded Domain portion of an LSID *requires* that the
domain exist for as long as the GUID exists in order for the GUID to be
useful, then I definitely have reservations. If, on the other hand, the
Domain portion can be seen as representing the issuer (somewhat analogous to
the function of "InstitutionCode" in DwC), and could be resolved by any
server set up to deal with the <namespace> part of the LSID, then I'm much
less concerned.
> > I think the best option would be central. The next option would be fully
> > distributed. Leaving it as an option would, in my opinion, be a BIG
> > mistake.
>
> I disagree- the assignment of identifiers should be by the curators of
> the data. However, I do strongly consider that there should be some
> sort of trust scheme in place, where identifiers are issued only by
> entities trusted by the rest of the system. A scheme similar to that
> used by certificate authorities and delegates should be adequate.
Maybe I'm misunderstanding the use of the word "issuers", but in my mind,
the issuer's job is only to provide a guaranteed-unique set of ID's. It
would not, necessarily, be the location where the ID is applied to its
associated data.
In Donald's PowerPoint file, he made reference to "mechanisms for data
providers to request and use blocks of LSIDs from central service". Here's
how I imagine a system would work:
GBIF (or some other central entity) establishes a service that can generate
unique <objectID> numbers within its own LSID context. The same service
also maintains a complete set of data associated with each <objectID>.
Major (and minor) institutions (essentially your set of "Trusted" entities)
would establish mirrored copies of the complete set of all data (or,
perhaps, only a filtered subset of the complete data), but would not be able
to issue new GUIDs directly. However, the mirrored sites could serve as a
real-time "pass-through" to the central site so as to be functionally able
to provide new GUIDs in real time, by retrieving them directly (in real
time) from the central server. Also, the mirrored sites would all maintain
synchrony of their copies of the data with the central "master" copy, on a
realistic time frame (e.g., every 24 hours, or on-demand if a data provider
chose to initiate a synchronization command).
If a curator of a local institution's data needed to assign a new batch of
numbers for a new set of specimens, the curator would issue a request to the
central server (or via one of the mirrored sites as a pass-through request) for
a block of N numbers. The central server would never re-issue those same
numbers again to anyone else. But those numbers remain "empty" until the
curator assigns them to data, and uploads that data either to the central
server or to one of the mirrors. In other words, even though the numbers
are "issued" by a central server, they are applied to real data only by
local curators.
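The request-a-block workflow described above could be sketched roughly as follows. This is a toy, in-memory illustration only: the CentralIssuer class, its method names, and the sample record are hypothetical, and a real service would need persistence, authentication, and mirroring.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CentralIssuer:
    """Hands out blocks of integers that are never issued twice."""
    next_id: int = 1
    records: Dict[int, Optional[dict]] = field(default_factory=dict)

    def request_block(self, n: int) -> range:
        """Reserve n consecutive numbers; they stay 'empty' until assigned."""
        if n <= 0:
            raise ValueError("block size must be positive")
        block = range(self.next_id, self.next_id + n)
        self.next_id += n
        for object_id in block:
            self.records[object_id] = None
        return block

    def assign(self, object_id: int, data: dict) -> None:
        """Attach curator-supplied data to a previously issued number."""
        if object_id not in self.records:
            raise KeyError(f"{object_id} was never issued")
        if self.records[object_id] is not None:
            raise ValueError(f"{object_id} already has data")
        self.records[object_id] = data


issuer = CentralIssuer()
first = issuer.request_block(3)    # e.g. requested by one curator
second = issuer.request_block(2)   # a later request from someone else

# The curator, not the issuer, decides what each number refers to.
issuer.assign(first[0], {"catalogNumber": "BPBM 123456"})
```

Issuance and assignment are separate steps here, which is the point: the central service guarantees uniqueness, while the local curator supplies the meaning.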
A big issue, of course, is control over editing of data associated with a
given GUID. In the case of specimens, the central server and mirrored sites
could (perhaps at the discretion of the data curator who initially requested
the number) restrict subsequent editing of those data to a defined set of
password-protected user accounts. In the case of more public data, such as
taxon names and publications, the control of data editing would be less
restrictive (e.g., either fully accessible to the public, or accessible to
anyone who goes to the trouble to register themselves as a taxonomist with
the central server or with any of the mirrored sites).
Maybe this approach would not be practical for specimen data -- but I think
it would be the optimal approach for taxon data. Perhaps those two
fundamentally different kinds of data (owned, vs. public domain) need
fundamentally different approaches to GUID issuance and assignment?
> > 3) Which objects should receive identifiers?
> >
> > Specimens, References, Name-Reference intersections (Assertions), and
> > perhaps Agents. [TaxonNames and Concepts can be subsets of Name-Reference
> > intersections].
>
> Any object. It doesn't matter what it is, just that it can be resolved,
> and when you find it, you can figure out what it is. Sensible use of
> the NameSpace portion of the LSID will help a lot with this. A trusted
> organization should issue the NameSpace portion to avoid NS conflicts.
I'd have to think this through some more. Leaving it too open might lead to
a plethora of (potentially overlapping, but not quite equivalent)
NameSpaces, which seems like it could turn into a real mess, really quickly.
Centralized ID systems such as social security numbers in the U.S.,
telephone numbers, etc. definitely have some advantage over totally open
systems. I suppose that the pool of NS's would be self-cleaning simply by
use or non-use....but I still wonder how much better this approach would be
over the status quo.
> > 3a) Should we develop a set of object classes for biodiversity informatics
> > and assign identifiers to instances of all of these?
> >
> > I think so, yes. Of course, it depends a bit on who you mean by "we". I'm
> > thinking sensu lato.
>
> Sure, and these could be a core from which others can be built. But we
> should absolutely not restrict the capability of the "system" to accept
> new classes - even classes that represent the same information in a
> different way that may be appropriate to a group of users.
Again, I'll have to think about this some more. I certainly don't think
that the "system" should be incapable of dealing with new classes -- sort of
like how anyone can develop their own Federation Schema and use DiGIR to
establish specific information networks. But I'd hate to see a breakdown in
the global transmission of biodiversity information simply because different
subgroups establish their own special-needs, non-mutually-compatible classes
for dealing with essentially the same kinds of information (especially if
they do not also conform to a generalized international standard).
> > 4) What should be done about existing records without identifiers?
> >
> > As far as I know, ALL records are currently without identifiers (unless
> > someone established a widely accepted GUID system and I missed the
> > announcement...)
>
> All records currently have some sort of identifier, the problem is their
> uniqueness is not rigorously enforced or even evaluated, so their
> usefulness is probably limited.
O.K., in that case I misunderstood the meaning of "identifiers". All
historical identifiers (e.g., catalog numbers for specimens) should be
maintained, preserved, and cross-referenced to GUIDs just like any other
metadata about the physical object. I think of catalog numbers not so much
as unique identifiers, but as "labels" -- not altogether unlike taxonomic
names. In the databases I manage, I do not use catalog numbers as
identifiers -- the computer generates the UID, which is never seen, read,
written, or typed by a human. That's how I'd like to see the sorts of GUIDs
we're discussing be implemented -- i.e., for the benefit of
computer-computer data exchange; not human-human data exchange or
human-computer/computer-human data exchange.
> > 4c) Should the provider software be modified to generate "soft" identifiers
> > (ones which we cannot guarantee in all cases to be unique) based e.g. on the
> > combination of InstitutionCode, CollectionCode and CatalogNumber?
> >
> > As an interim solution, perhaps. See my comments under "Slide 2" above.
>
> Yes, but not soft. The providers should assign their own identifiers,
> but there must be a mechanism to ensure that identifiers are being
> properly assigned.
Agreed -- but I still think of these as "soft" identifiers, because
CatalogNumber values can change over time, in certain circumstances. GUIDs
should *never* need to be changed (even if the institution that issued them
vanishes without a trace from the face of the Earth).
> Revision information is very helpful in dealing with errors such as
> keystroke errors or other such details that do not change the object.
I agree that revision information *can* be helpful in dealing with errors;
but I don't see that function as being integral to the assignment of GUID
values.
> Not many. It seems most collections don't record any history in their
> record edits, so without a major alteration in the way the data are
> stored, it will be a significant undertaking to provide useful revision
> information.
For what it's worth, the databases I have developed for my institution are
designed to log every change made to every field (except
performance-enhancing, purely derivative fields), including what the
previous value was, who made the change, and when the change was made. When
records are deleted, a "snapshot" of the value of every non-null field is
logged, including the time the record was deleted, and by whom. The reason
I say all of this is to underscore that my stance on not including
versioning IDs as part of a GUID system is NOT from lack of appreciation for
the value of preserving edit histories (something I clearly value very, very
much -- given that the total diskspace occupied by my edit logs exceeds the
total diskspace occupied by the "real" data!)
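A minimal Python sketch of the idea that edit history can live beside a record without touching its identifier. The Change and LoggedRecord types are hypothetical and are not a description of the databases mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class Change:
    guid: str
    field_name: str
    old_value: Any
    new_value: Any
    changed_by: str
    changed_at: datetime


class LoggedRecord:
    """A record whose GUID is fixed and whose field edits are logged."""

    def __init__(self, guid: str, values: Dict[str, Any]) -> None:
        self.guid = guid
        self.values = dict(values)
        self.log: List[Change] = []

    def update(self, field_name: str, new_value: Any, user: str) -> None:
        old_value = self.values.get(field_name)
        if old_value == new_value:
            return  # nothing changed, so nothing to log
        self.log.append(Change(self.guid, field_name, old_value, new_value,
                               user, datetime.now(timezone.utc)))
        self.values[field_name] = new_value


record = LoggedRecord("92AB5B37-70E9-4f05-9E97-CBABD08513ED",
                      {"collector": "R.L. Pile"})
record.update("collector", "R.L. Pyle", user="curator")
```

The typo correction is fully recoverable from the log, yet the GUID carries no revision component.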
In closing, I apologize to those who find my overly-long posts on this topic
to be an annoyance. I also am starting to wonder: is this the appropriate
email forum to have this discussion?
Aloha,
Rich
Mahalo for your informative discussion, Rich.
A few questions. You're pretty active on this so maybe you can help me out.
What about duplicate specimens? Although a specimen may be MO 1234, K 5678
and P AABB, they may in fact all be SMITH 10001 and duplicates of the exact
same specimen, not different specimens. Is that one GUID or 3? When
attempting to use world-wide specimen records via GBIF for biodiversity
counts and species analyses, these duplicates artificially inflate the
counts significantly in some cases.
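The inflation can be shown with a small Python sketch that groups herbarium records by collector and collector number. The field names and the group_duplicates function are hypothetical, and the sketch takes no position on whether the sheets should share one GUID or carry three.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# The three herbarium sheets from the example above, all SMITH 10001.
holdings = [
    {"catalog": "MO 1234", "collector": "SMITH", "number": "10001"},
    {"catalog": "K 5678", "collector": "SMITH", "number": "10001"},
    {"catalog": "P AABB", "collector": "SMITH", "number": "10001"},
]


def group_duplicates(rows: List[dict]) -> Dict[Tuple[str, str], List[str]]:
    """Group catalog numbers by (collector, collector number)."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for row in rows:
        groups[(row["collector"], row["number"])].append(row["catalog"])
    return dict(groups)


groups = group_duplicates(holdings)
raw_count = len(holdings)      # what a naive record count reports
gathering_count = len(groups)  # distinct collecting events
```

Here a naive count reports three records where there was one gathering.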
What about triplicate names? IPNI is often given as the example for a set
of name records. But, IPNI can have three records for the same exact name
and reference--one from IK, APNI and Grey Cards. IPNI has no plans to ever
deduplicate these records due to the nature of the creation of the IPNI
collaboration. So, do the three duplicate records get three GUIDs?
Where are the GUIDs actually to be perpetually located after they are
assigned? Are all the originating organizations supposed to modify their
databases to add the GUID attribute and then build a mechanism to send out
their records and then receive the GUID back from somewhere and finally
update their records with it so the record+GUID can then in turn be
published from their database onto the web?
Couldn't agree more on the need for a single index/GUIDs to all references,
but beyond that is needed the single database containing all the GUIDS plus
the standard abbreviations and descriptions for them. Nobody has this
database. There are subsets like BPH and TL2. But no single, definitive
list of all references, online, in one place with GUIDs. This science needs
that in the worst way.
If a concept is Name+Reference, then don't IPNI and Tropicos contain
millions of concept records?
Thanks,
Chuck Miller
Chief Information Officer
Missouri Botanical Garden
-----Original Message-----
From: Richard Pyle [mailto:deepreef@BISHOPMUSEUM.ORG]
Sent: Thursday, September 23, 2004 6:29 PM
To: TDWG-SDD(a)LISTSERV.NHM.KU.EDU
Subject: Re: Globally Unique Identifier
I want to start by wholeheartedly endorsing Wouter's plea for
non-information-bearing (meaningless) GUIDs. This feature is CRITICAL to
the long-term success of any GUID system. It is absolutely imperative that
there NEVER be any motivation to change the content of a GUID (i.e., it
should be permanent). If the GUID itself contains any information
whatsoever, there may be motivation to change that information at a later
time.
For this reason, I had initially preferred the DOI approach, but over time,
I am gradually warming up to the LSID approach. While components of an LSID
do, indeed, represent information, they represent the one piece of
information that I think may legitimately belong embedded within a GUID:
context. That is, the context, or domain, of the GUID itself. The context
in this case would be the "issuer" of the GUID -- not necessarily the
current "owner" of the GUID (see more discussion on this below). Though the
organization that issued a GUID may eventually disappear, the fact that the
organization was the one to issue the GUID in the first place will never
change, and thus represents a permanent and unchanging component of the
GUID. Without the context portion, the GUID itself is really nothing more
than a random string of characters. In summary, I'm warming up to the LSID
approach because it represents embedded context, without the risk of
temptation to change the content of a GUID after it has been issued.
Regarding Donald's PPT file, I have a couple of comments and questions:
(Assumes Title slide is "Slide 1")
Slide 2:
You note there is "No reliable mechanism" to relate the same record from
different providers to each other. But in the context of DarwinCore, the
combination of [InstitutionCode]+[CollectionCode]+[CatalogNumber] should
represent a virtual GUID (provided that the Global Provider Registry ensures
no duplication of [InstitutionCode]). I do realize that words like "should"
and "reliable" are critical here. Perhaps the DarwinCore implementation
should enforce the requirement of uniqueness of
[CollectionCode]+[CatalogNumber] within a single [InstitutionCode], and
further ensure globally unique [InstitutionCode] values via the Global
Provider Registry.
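A minimal sketch of the kind of uniqueness check this would imply, written in Python. The duplicate_triplets function and the sample codes are hypothetical; in particular the "Fish" collection code is invented for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (InstitutionCode, CollectionCode, CatalogNumber)
Triplet = Tuple[str, str, str]


def duplicate_triplets(records: Iterable[Triplet]) -> List[Triplet]:
    """Return every triplet that occurs more than once."""
    counts = Counter(records)
    return sorted(t for t, n in counts.items() if n > 1)


# Sample data invented for illustration.
records = [
    ("BPBM", "Fish", "123456"),
    ("BPBM", "Fish", "123457"),
    ("BPBM", "Fish", "123456"),   # repeats the first record's triplet
    ("USNM", "Fish", "123456"),   # same number, different institution: fine
]

clashes = duplicate_triplets(records)
```

The same catalog number under a different institution is not a clash, which is why globally unique [InstitutionCode] values matter.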
Slide 3:
Wouldn't most of the problems indicated in the first four bulleted points be
largely solved by the Global Provider Registry? Using the [InstitutionCode]
would allow lookup in the registry for a (current/active) metadata URL, and
the metadata URL would provide information on where to access a particular
[CollectionCode]+[CatalogNumber] piece of data.
The issue of specimens changing numbers and/or collections is problematic,
of course.
The issue of versioning is a bit dicey, in my mind (e.g., at what resolution
of information change)? Some things, like changing taxonomic determinations
(i.e., "real" changes) need to be handled in a robust way. Other things,
like the correction of typos and different styles of representing the exact
same information (e.g., R.L. Pile==>R.L. Pyle; or R.L. Pyle==>Pyle, R.L.)
probably don't need to be versioned. Other sorts of changes (e.g., the
elaboration of previously existing information, such as the addition of
retroactively-generated georeference coordinates) fall somewhere in-between.
Slide 4:
We should all get behind SEEK in addressing these issues (Taxon concept
mapping). Ultimately, we minimally need a GUID pool for References
(inclusive of unpublished works), and a GUID pool for what I call
"Protonyms" (original creations of IC_N Code-compliant names). The union of
these two GUIDs (what I would call "Assertions") would itself represent a
GUID to a "potential concept" (Berendsohn). (Note: my preference would be to
define Protonyms as a subtype of Assertions, and therefore Protonym GUIDs
would be a subset drawn from the same pool as Assertion GUIDs -- but this is
a technical discussion for another time).
Slide 5:
Nice summary!!
Slide 6:
Good stuff here, but I'll respond with some of my personal opinions:
- RevisionID: see points of concern already expressed above
- Specimen Record LSIDs: I gather from subsequent slides that you recognize
two alternative approaches: having the "owner" of a specimen assign the LSID
within the context of their own <domainName>, or adopting GBIF as the
international standard issuer for ALL specimen GUIDs. In other words, GBIF
would represent the centralized issuer of GUIDs for all biological
specimens, and the biological specimen community would/should rally around
GBIF for this purpose, and adopt GBIF specimen GUIDs as their own. I
personally have no problem with this (I do not live in fear of "Big Brother"
centralization when it serves the benefit of all, as I believe it would in
this case) -- but I know there are many who might have a problem with it,
and therefore it might not garner widespread adoption without large volumes
of "fuss".
If, on the other hand, each organization issues its own GUIDs for its own
set of specimens, then the question is when, if ever, GBIF would assign a
specimen GUID? Perhaps as a surrogate for institutions that lack the
technological ability to assign their own LSIDs? But I wonder, how many
institutions that could serve electronic data of their holdings to the
internet would lack the ability to assign their own LSIDs?
As you've outlined in subsequent slides, I see two alternative paths: A)
Get the biological world to rally around GBIF as the centralized provider of
GUIDs for specimens for all collections; or B) Have each
collection/institution issue its own set of LSIDs for its own specimens, and
have GBIF adopt those LSIDs for its own internal purposes. I could get
behind either approach, but I see danger in the adoption of a mixture of
these two approaches. I'll defer elaboration, but a lot of it has to do with
potential confusion about whether the GUID applies fundamentally to the
physical specimen, or the electronic conglomeration of data associated with
the specimen. Also, I think we should avoid the risk of assigning two
separate GUIDs for the same "single data element" (sensu your Slide 5).
- Name record LSIDs: I understand the example of an IPNI LSID for a plant
name, and presumably there would be analogous "Catalog of Fishes" LSIDs for
each fish name, etc. But I don't think that would be a wise approach.
Unlike specimen records, where there are fairly unambiguous "owner"
institutions (or at least "original owner" institutions that issued a GUID),
taxonomic aggregators (IPNI, ITIS, Species2000, GBIF, uBio, etc.) are most
certainly not owners of the taxonomic names that they include in their
databases. We would want to avoid the risk of duplicate GUIDs for the same
name, and thus the need for mapping, e.g., an IPNI GUID for a name to its
ITIS equivalent. Again, I can't help but think that the world will be a
better place if we can avoid assigning multiple GUIDs to the same "single
data element".
One approach would be to rally around GBIF, and rely on them to issue GUIDs
for all taxon names. However, I also recognize that we do not exist in a
political/personality vacuum with regards to "ownership" of taxonomic names,
or the electronic representations thereof. Therefore, the closest thing
that exists to an "owner" of a taxonomic name is the Commission of
Nomenclature (and its respective Code of Nomenclature) under which the name
was established. Thus, when it comes to assigning GUIDs for names (not
concepts), I would propose the following:
urn:lsid:ICZN.org:TaxonName:XXXXXX (all zoological names)
urn:lsid:ICBN.org:TaxonName:XXXXXX (all botanical names)
urn:lsid:ICNB[or LBSN??].org:TaxonName:XXXXXX (all bacteriological names)
urn:lsid:ICTV[or ICVCN??].org:TaxonName:XXXXXX (all virus names)
In an ideal world, we'd get to the point where there would be a need for
only one registrar of nomenclature, e.g.:
urn:lsid:BioCode.org:TaxonName:XXXXXXX
Or, perhaps:
urn:lsid:gbif.net:TaxonName:XXXXXXX
But I don't think we're quite there yet.
In any case, the idea would be for the taxon name aggregators to adopt the
unambiguously unique GUID for each taxon name.
Taxonomic concepts are a whole 'nother ball of wax....
Slide 8:
I actually prefer this approach (GBIF as the central issuer of specimen
GUIDs), for a variety of reasons. One of the main reasons is that it would
assure uniqueness of an integer within a given <namespace> (e.g.,
Specimens), which would make things a bit easier for those of us who like to
use integers as primary keys in databases. In other words, it avoids the
possibility of urn:lsid:bishopmuseum.org:Specimen:1234567 colliding with
urn:lsid:usnm.gov:Specimen:1234567, when reducing the GUID to just its
integer component for local application purposes (where context can be
enforced by other means). However, I should point something out regarding
the "Advantage" part of this slide, which is that the "problem" of
transferring record locations doesn't exist, provided that the <domainName>
component of the LSID is taken as the issuer of the GUID, not as the current
owner of the specimen. In other words, if Bishop Museum assigned GUID
urn:lsid:bishopmuseum.org:Specimen:1234567 to a specimen, and then gave that
specimen to Smithsonian, then Smithsonian would retain the complete GUID
intact as: urn:lsid:bishopmuseum.org:Specimen:1234567.
The danger comes when you try to use the <domainName> component as metadata
to represent the current location of the specimen and/or its electronically
represented data. This is where Wouter's original point about 'meaningless'
GUIDs comes into play. If the whole point of using LSIDs is to embed the
"current location" information within the ID itself so that applications can
retrieve additional data associated with the GUID directly, then I have some
concerns (mostly addressed already).
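To make the distinction concrete, here is a small sketch that splits an LSID into its parts and treats the <domainName> strictly as the issuer. The parser and the holder table are illustrative assumptions, not part of any LSID toolkit.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str            # the issuer, NOT the current owner
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split urn:lsid:<authority>:<namespace>:<object>[:<revision>]."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    return LSID(*parts[2:])


bishop = parse_lsid("urn:lsid:bishopmuseum.org:Specimen:1234567")
usnm = parse_lsid("urn:lsid:usnm.gov:Specimen:1234567")

# Reduced to the integer alone, two different specimens collide.
print(int(bishop.object_id) == int(usnm.object_id))
# As complete identifiers they remain distinct.
print(bishop == usnm)

# The current holder lives in metadata keyed by the full LSID.
# A transfer changes the metadata and leaves the identifier untouched.
current_holder = {bishop: "Bishop Museum"}
current_holder[bishop] = "Smithsonian"
print(bishop.authority, "->", current_holder[bishop])
```

With a single central issuer the integer collision cannot happen in the first place, which is the convenience argued for above.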
Why is there a reference to urn:lsid:gbif.net:TaxonConcept:106734 at the top
of this slide?
Slide 9:
Again, I'm not sure I understand why this slide includes a reference to
urn:lsid:ipni.org:TaxonName:82090-3:1.1
Also, in this model, what function does the LSID serve that is not met by
the concatenated [InstitutionCode]+[CollectionCode]+[CatalogNumber] (in the
context of the Global Provider Registry)?
Slide 10 (taxon concepts and literature):
This message is already getting too long... :-)
I already touched on this above under "Slide 4". I definitely agree that we
need a GUID system for References. This should include more than just
published references. As far as I can tell, none of the existing Reference
registrars yet accommodates the specific needs of taxonomists (e.g.,
referring to a subsection of a reference as representing an original
taxonomic description), so I do see a need to create a Reference GUID system
specific to biology. I could rant for pages on this, but I'll summarize
simply with a plea to *DEFINE* a Concept GUID as an intersection between a
Name GUID and a Reference GUID (i.e., what I would call an "Assertion"). Not
all Name-Reference combinations will be
worthy of recognition as a distinct "Concept", but all are *potentially*
representative of a concept (Berendsohn), and thus all should be drawn from
the same pool of GUIDs as Concept GUIDs. In other words, "Concepts" should
be thought of as a subtype of Name-Reference instances. I would go further
to suggest (as I did above) that "Name" GUIDs should also be a subtype of
Name-Reference instances (non-exclusive of Concept subtype instances), using
the Name-Reference instance that represents the Code-recognized original
description of the name as the "handle" to the Name.
By this approach, you need only two GUID object classes <objectClass>: one
for References, and one for Name-Reference intersections (Assertions). The
latter of these could serve as the source for both Concept GUIDs and Name
GUIDs.
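One way to picture the two-class model is the sketch below. The field names, the two boolean flags, and the example.org identifiers are assumptions made for illustration only; the flags stand in for whatever process decides which Assertions count as Name-bearers or Concept-bearers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    guid: str
    citation: str


@dataclass(frozen=True)
class Assertion:
    """A Name-Reference intersection: one name as used in one reference."""
    guid: str
    name_string: str
    reference: Reference
    is_original_description: bool = False  # Code-recognized; the Name "handle"
    is_concept: bool = False               # judged to define a distinct Concept


def name_guids(assertions):
    """Name GUIDs are the Assertions that are original descriptions."""
    return [a.guid for a in assertions if a.is_original_description]


def concept_guids(assertions):
    """Concept GUIDs are drawn from the same pool of Assertion GUIDs."""
    return [a.guid for a in assertions if a.is_concept]


# Placeholder data: a made-up name used in two made-up references.
original = Reference("urn:lsid:example.org:Reference:1", "Smith, 1900")
revision = Reference("urn:lsid:example.org:Reference:2", "Jones, 1950")
assertions = [
    Assertion("urn:lsid:example.org:Assertion:1", "Aus bus", original,
              is_original_description=True, is_concept=True),
    Assertion("urn:lsid:example.org:Assertion:2", "Aus bus", revision,
              is_concept=True),
]
print(name_guids(assertions))
print(concept_guids(assertions))
```

The first Assertion is both a Name handle and a Concept, which reflects the non-exclusive subtyping described above.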
Last Slide:
My own answers to your questions:
1) Are LSIDs the most appropriate technology?
I'm increasingly coming to that conclusion.
2) Should identifiers be assigned and resolved centrally or via a fully
distributed model (or should providers have the option of using either
model)?
I think the best option would be central. The next best option would be
fully distributed. Leaving it as an option would, in my opinion, be a BIG
mistake.
3) Which objects should receive identifiers?
Specimens, References, Name-Reference intersections (Assertions),
and perhaps Agents. [TaxonNames and Concepts can be subsets of
Name-Reference intersections].
3a) Should we develop a set of object classes for biodiversity informatics
and assign identifiers to instances of all of these?
I think so, yes. Of course, it depends a bit on who you mean by
"we". I'm thinking sensu lato.
3b) Should identifiers be associated with real world objects (e.g.
specimens), or with digitised records representing them (e.g. perhaps
multiple records representing different digitisation attempts by different
researchers for the same specimen), or both?
I would say definitely real-world objects (treating things like
Code-recognized original descriptions of taxon names, and citable references
as "real-world objects"). I do NOT think we should have separate GUIDs for
digital representations thereof. Alternative digital representations are
simply clutter that will eventually be weeded out of the system, once we all
get organized on this stuff, and harness the power of the internet to
implement a global editing/QA system.
4) What should be done about existing records without identifiers?
As far as I know, ALL records are currently without identifiers
(unless someone established a widely accepted GUID system and I missed the
announcement...)
4a) Should they be left alone?
Ultimately, no.
4b) Should they all be updated with identifiers?
Ultimately, yes.
4c) Should the provider software be modified to generate "soft" identifiers
(ones which we cannot guarantee in all cases to be unique) based e.g. on the
combination of InstitutionCode, CollectionCode and CatalogNumber?
As an interim solution, perhaps. See my comments under "Slide 2"
above.
5) Are revision identifiers a useful feature?
I would like to think not. If the information is truly dynamic over
time (e.g., re-determinations of taxonomic identity of specimens), then
individual instances should probably receive their own set of GUIDs (as
opposed to versions of the "parent" GUID). If the information is static
over time, and changes represent objective corrections, then I don't see a
real need to track that within the context of a GUID (record edit history
may or may not need to be tracked, but this seems to me to be a separate
issue from GUIDs).
5b) How many providers will be able to provide and handle them?
If versioning is incorporated, then it should be designed such that
a "default" version is provided automatically when versioning is not
handled.
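A rough sketch of that last point, assuming the "default" is simply the most recently stored revision (other policies are possible). The store class and the payload strings are hypothetical; the LSID is the one shown on Slide 9.

```python
class VersionedStore:
    """Keeps every revision of a record; unversioned requests get a default."""

    def __init__(self):
        self._revisions = {}  # base LSID -> list of (revision, data), oldest first

    def put(self, base_lsid, revision, data):
        self._revisions.setdefault(base_lsid, []).append((revision, data))

    def get(self, lsid):
        parts = lsid.split(":")
        base = ":".join(parts[:5])
        revision = parts[5] if len(parts) > 5 else None
        history = self._revisions[base]
        if revision is None:
            # Caller does not handle versions: return the default (latest).
            return history[-1][1]
        for stored_revision, data in history:
            if stored_revision == revision:
                return data
        raise KeyError(lsid)


store = VersionedStore()
base = "urn:lsid:ipni.org:TaxonName:82090-3"
store.put(base, "1.0", "placeholder record, revision 1.0")
store.put(base, "1.1", "placeholder record, revision 1.1")

print(store.get(base))            # no revision given, default is returned
print(store.get(base + ":1.0"))   # explicit revision
```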
Sorry for the long post, but I feel that this issue is extremely important
at this point in bioinformatics history.
Aloha,
Rich
Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef(a)bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html
> -----Original Message-----
> From: TDWG - Structure of Descriptive Data
> [mailto:TDWG-SDD@LISTSERV.NHM.KU.EDU]On Behalf Of Donald Hobern
> Sent: Thursday, September 23, 2004 6:22 AM
> To: TDWG-SDD(a)LISTSERV.NHM.KU.EDU
> Subject: Re: Globally Unique Identifier
>
>
> This is precisely one of the key questions we need to address with any
> identifier framework we adopt. I think we could easily use LSIDs in a
> way that should overcome your concerns, and I think that the built-in
> mechanisms for discovery and metadata access within the LSID model are
> really exciting.
>
> I have just put together a PowerPoint presentation to explain some of
> what I think we could achieve with globally unique identifiers and
> particularly with LSIDS. It can be downloaded from:
>
> http://circa.gbif.net/Public/irc/gbif/dadi/library?l=/architecture/globallyuniqueidentifier/
>
> It may be clearest if you go through it as a slide show rather than in
> edit mode.
>
> Thanks,
>
> Donald
>
> ---------------------------------------------------------------
> Donald Hobern (dhobern(a)gbif.org)
> Programme Officer for Data Access and Database Interoperability Global
> Biodiversity Information Facility Secretariat Universitetsparken 15,
> DK-2100 Copenhagen, Denmark
> Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
> ---------------------------------------------------------------
>
>
> -----Original Message-----
> From: TDWG - Structure of Descriptive Data
> [mailto:TDWG-SDD@LISTSERV.NHM.KU.EDU] On Behalf Of Wouter Addink
> Sent: 23. september 2004 17:38
> To: TDWG-SDD(a)LISTSERV.NHM.KU.EDU
> Subject: Re: Globally Unique Identifier
>
> It seems that DOI allows for any existing IDs to be used as part of the
> unique identifier. That seems to me as a fast to adopt short term solution
> but not a good idea for the long term. At first sight I very much liked the
> LSID specification, but the longer I think about it, the less I like some
> parts. What I think is missing in the LSID specification is that the unique
> identifier should be 'meaningless' apart from being an identifier to become
> time independent (and to avoid possible political problems). Any solution
> with a URN I can think of has some meaning, which makes solutions like a
> MAC-address generated GUID favorable in my opinion. And any meaning you need
> (like an authority of an object) can be specified in metadata instead of
> using it in the identifier. What is not very clear to me in the LSID
> specification is where the LSID generated by a LSIDAssigningService is
> actually stored.
>
> Wouter Addink
>
> ----- Original Message -----
> From: "Gregor Hagedorn" <G.Hagedorn(a)BBA.DE>
> To: <TDWG-SDD(a)LISTSERV.NHM.KU.EDU>
> Sent: Wednesday, September 08, 2004 6:20 PM
> Subject: Re: Globally Unique Identifier
>
>
> >I am not quite sure, but to me it seems with "GUID" you refer to the
> >numeric, MAC-address generated GUID type. I have nothing against
> >these. However, any URN in my view is a GUID that has most of the
> >properties you mention:
> >
> >> - it is guaranteed to be unique globally, and can be created anywhere,
> >> anytime by any server or client machine - it has no meaning as to
> >> where the data is physically located and will there not confuse any
> >> user about this
> >
> >> - most id
> >> mechanisms, especially URI/URN ids require a 'governing body' to
> >> handle namespaces/urls to ensure every URN is unique, whereas a
> >> GUID is always unique
> >
> > The governing body is restricted to the primary web address, and in
> > most cases such an address is already available. Being a member of a
> > governmental institution that explicitly forbids the use without
> > prior consent, and forbids the use of its domain name once you are
> > no longer working for them, I realize some potential for problem.
> >
> >> I do think a URL of some kind would be useful for things such as
> >> global searches of multiple databases, as this will allow the
> >> search to go directly to the data source where the name, reference,
> >> etc comes from. But this should not be part of its ID. Maybe a
> >> name/id should have several forms, a GUID for an ID and a URL + a
> >> GUID for a fully specified name.
> >>
> >> What are the current thoughts on these ideas?
> >
> > A GUID is only part of the problem. The other half of the problem is
> > actually getting at the resource. URN schemes like DOI or LSID (I
> > prefer the latter) intend to define resolution mechanisms. That makes
> > the URN not yet a URL - in my view the good comes with the good,
> > location and reorganization independence.
> >
> > I believe GBIF should install such an LSID resolver, which is why in
> > the UBIF proxy model, under Links, I propose to support a general
> > URL (including potentially URNS), a typed LSID and a typed DOI. This
> > could be simplified to have just a URN (LSID and DOI are URNs), but
> > that would then require string parsing to determine and recognize
> > the preferred resolvable GUID types. Comments on splitting/not
> > splitting this are welcome!
> >
> > There may be some need to define a non-resolvable URN/numeric GUID
> > as well. However, that would not be under the linking question. Is
> > it correct that linking requires resolvability, or am I thinking
> > into a wrong direction?
> >
> > Gregor
> >>
> >
> >
> > ----------------------------------------------------------
> > Gregor Hagedorn (G.Hagedorn(a)bba.de)
> > Institute for Plant Virology, Microbiology, and Biosafety Federal
> > Research Center for Agriculture and Forestry (BBA)
> > Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220
> > 14195 Berlin, Germany Fax: +49-30-8304-2203
> >
> > Often wrong but never in doubt!
This is precisely one of the key questions we need to address with any
identifier framework we adopt. I think we could easily use LSIDs in a
way that should overcome your concerns, and I think that the built-in
mechanisms for discovery and metadata access within the LSID model are
really exciting.
I have just put together a PowerPoint presentation to explain some of
what I think we could achieve with globally unique identifiers and
particularly with LSIDS. It can be downloaded from:
http://circa.gbif.net/Public/irc/gbif/dadi/library?l=/architecture/globallyuniqueidentifier/
It may be clearest if you go through it as a slide show rather than in
edit mode.
Thanks,
Donald
---------------------------------------------------------------
Donald Hobern (dhobern(a)gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
---------------------------------------------------------------
-----Original Message-----
From: TDWG - Structure of Descriptive Data
[mailto:TDWG-SDD@LISTSERV.NHM.KU.EDU] On Behalf Of Wouter Addink
Sent: 23. september 2004 17:38
To: TDWG-SDD(a)LISTSERV.NHM.KU.EDU
Subject: Re: Globally Unique Identifier
It seems that DOI allows for any existing IDs to be used as part of the
unique identifier. That seems to me as a fast to adopt short term solution
but not a good idea for the long term. At first sight I very much liked the
LSID specification, but the longer I think about it, the less I like some
parts. What I think is missing in the LSID specification is that the unique
identifier should be 'meaningless' apart from being an identifier to become
time independent (and to avoid possible political problems). Any solution
with a URN I can think of has some meaning, which makes solutions like a
MAC-address generated GUID favorable in my opinion. And any meaning you need
(like an authority of an object) can be specified in metadata instead of
using it in the identifier. What is not very clear to me in the LSID
specification is where the LSID generated by a LSIDAssigningService is
actually stored.
Wouter Addink
----- Original Message -----
From: "Gregor Hagedorn" <G.Hagedorn(a)BBA.DE>
To: <TDWG-SDD(a)LISTSERV.NHM.KU.EDU>
Sent: Wednesday, September 08, 2004 6:20 PM
Subject: Re: Globally Unique Identifier
>I am not quite sure, but to me it seems with "GUID" you refer to the
> numeric, MAC-address generated GUID type. I have nothing against
> these. However, any URN in my view is a GUID that has most of the
> properties you mention:
>
>> - it is guaranteed to be unique globally, and can be created anywhere,
>> anytime by any server or client machine - it has no meaning as to
>> where the data is physically located and will there not confuse any
>> user about this
>
>> - most id
>> mechanisms, especially URI/URN ids require a 'governing body' to
>> handle namespaces/urls to ensure every URN is unique, whereas a GUID
>> is always unique
>
> The governing body is restricted to the primary web address, and in
> most cases such an address is already available. Being a member of a
> governmental institution that explicitly forbids the use without
> prior consent, and forbids the use of its domain name once you are no
> longer working for them, I realize some potential for problem.
>
>> I do think a URL of some kind would be useful for things such as
>> global searches of multiple databases, as this will allow the search
>> to go directly to the data source where the name, reference, etc comes
>> from. But this should not be part of its ID. Maybe a name/id should
>> have several forms, a GUID for an ID and a URL + a GUID for a fully
>> specified name.
>>
>> What are the current thoughts on these ideas?
>
> A GUID is only part of the problem. The other half of the problem is
> actually getting at the resource. URN schemes like DOI or LSID (I
> prefer the latter) intend to define resolution mechanisms. That makes
> the URN not yet a URL - in my view the good comes with the good,
> location and reorganization independence.
>
> I believe GBIF should install such an LSID resolver, which is why in
> the UBIF proxy model, under Links, I propose to support a general URL
> (including potentially URNS), a typed LSID and a typed DOI. This
> could be simplified to have just a URN (LSID and DOI are URNs), but
> that would then require string parsing to determine and recognize the
> preferred resolvable GUID types. Comments on splitting/not splitting
> this are welcome!
>
> There may be some need to define a non-resolvable URN/numeric GUID as
> well. However, that would not be under the linking question. Is it
> correct that linking requires resolvability, or am I thinking into a
> wrong direction?
>
> Gregor
>>
>
>
> ----------------------------------------------------------
> Gregor Hagedorn (G.Hagedorn(a)bba.de)
> Institute for Plant Virology, Microbiology, and Biosafety
> Federal Research Center for Agriculture and Forestry (BBA)
> Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220
> 14195 Berlin, Germany Fax: +49-30-8304-2203
>
> Often wrong but never in doubt!
It seems that DOI allows for any existing IDs to be used as part of the
unique identifier. That seems to me as a fast to adopt short term solution
but not a good idea for the long term. At first sight I very much liked the
LSID specification, but the longer I think about it, the less I like some
parts. What I think is missing in the LSID specification is that the unique
identifier should be 'meaningless' apart from being an identifier to become
time independent (and to avoid possible political problems). Any solution
with a URN I can think of has some meaning, which makes solutions like a
MAC-address generated GUID favorable in my opinion. And any meaning you need
(like an authority of an object) can be specified in metadata instead of
using it in the identifier. What is not very clear to me in the LSID
specification is where the LSID generated by a LSIDAssigningService is
actually stored.
Wouter Addink
----- Original Message -----
From: "Gregor Hagedorn" <G.Hagedorn(a)BBA.DE>
To: <TDWG-SDD(a)LISTSERV.NHM.KU.EDU>
Sent: Wednesday, September 08, 2004 6:20 PM
Subject: Re: Globally Unique Identifier
>I am not quite sure, but to me it seems with "GUID" you refer to the
> numeric, MAC-address generated GUID type. I have nothing against
> these. However, any URN in my view is a GUID that has most of the
> properties you mention:
>
>> - it is guaranteed to be unique globally, and can be created anywhere,
>> anytime by any server or client machine - it has no meaning as to
>> where the data is physically located and will there not confuse any
>> user about this
>
>> - most id
>> mechanisms, especially URI/URN ids require a 'governing body' to
>> handle namespaces/urls to ensure every URN is unique, whereas a GUID
>> is always unique
>
> The governing body is restricted to the primary web address, and in
> most cases such an address is already available. Being a member of a
> governmental institution that explicitly forbids the use without
> prior consent, and forbids the use of its domain name once you are no
> longer working for them, I realize some potential for problem.
>
>> I do think a URL of some kind would be useful for things such as
>> global searches of multiple databases, as this will allow the search
>> to go directly to the data source where the name, reference, etc comes
>> from. But this should not be part of its ID. Maybe a name/id should
>> have several forms, a GUID for an ID and a URL + a GUID for a fully
>> specified name.
>>
>> What are the current thoughts on these ideas?
>
> A GUID is only part of the problem. The other half of the problem is
> actually getting at the resource. URN schemes like DOI or LSID (I
> prefer the latter) intend to define resolution mechanisms. That makes
> the URN not yet a URL - in my view the good comes with the good,
> location and reorganization independence.
>
> I believe GBIF should install such an LSID resolver, which is why in
> the UBIF proxy model, under Links, I propose to support a general URL
> (including potentially URNS), a typed LSID and a typed DOI. This
> could be simplified to have just a URN (LSID and DOI are URNs), but
> that would then require string parsing to determine and recognize the
> preferred resolvable GUID types. Comments on splitting/not splitting
> this are welcome!
>
> There may be some need to define a non-resolvable URN/numeric GUID as
> well. However, that would not be under the linking question. Is it
> correct that linking requires resolvability, or am I thinking into a
> wrong direction?
>
> Gregor
>>
>
>
> ----------------------------------------------------------
> Gregor Hagedorn (G.Hagedorn(a)bba.de)
> Institute for Plant Virology, Microbiology, and Biosafety
> Federal Research Center for Agriculture and Forestry (BBA)
> Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220
> 14195 Berlin, Germany Fax: +49-30-8304-2203
>
> Often wrong but never in doubt!
> Mahalo for your informative discussion, Rich.
> A few questions. You're pretty active on this so maybe you can help me out.
> What about duplicate specimens? Although a specimen may be MO 1234, K 5678
> and P AABB, they may in fact all be SMITH 10001 and duplicates of the exact
> same specimen, not different specimens. Is that one GUID or 3?
In my view, we would assign only ONE GUID, which represents the actual,
physical specimen. That this one specimen has multiple catalog numbers
assigned to it is simply additional information associated with that one
specimen (in the same way that a specimen may have more than one taxonomic
name applied to it, by different investigators at different times). This is
part of the problem with using the "soft" GUID surrogate of
[InstitutionCode]+[CollectionCode]+[CatalogNumber]. A simple solution would
be to select one of these catalog numbers (e.g., SMITH 10001) as the
"current" catalog number, and enter that in the appropriate DarwinCore (DwC)
fields (either [CollectionCode]+[CatalogNumber] or
[InstitutionCode]+[CatalogNumber], in this case). The MaNIS implementation
of DwC included an "OtherCatalogNumbers" element, which would store the other
numbers.
I imagine two main problems:
1) Data for the single specimen may be represented more than once in an
Aggregator, if different providers represent the "soft" GUID for the
specimen with two different catalog numbers. For human-viewed search
results, it would probably be evident soon by looking at the other data that
the two records are the same. For statistical search results, the specimen
would be counted more than once, which could cause errors in the numeric
results of statistical queries.
2) If the record is only represented by one of its catalog numbers, then how
is someone supposed to locate it by one of the other catalog numbers? One
way is to include support for an "OtherCatalogNumbers" element, in such a way
that it can be searched in addition to the "soft" GUID of
[InstitutionCode]+[CollectionCode]+[CatalogNumber]. But that's a bit
convoluted.
So, the real solution, in my mind, is to implement a "hard" GUID
("GlobalUniqueIdentifier":
http://darwincore.calacademy.org/Documentation/DarwinCore2DraftHTML). That
way, the specimen could be represented in four different Provider records,
but easily combined into one by an Aggregator via the shared GUID.
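As a sketch of what the Aggregator side might look like, assuming each Provider record carries the shared GUID alongside its own catalog number. The record layout and the example.org LSID are made up for illustration; the catalog numbers are the ones from the question.

```python
from collections import defaultdict


def merge_by_guid(provider_records):
    """Collapse Provider records that share a GUID into one specimen entry."""
    merged = defaultdict(lambda: {"catalog_numbers": set(), "providers": set()})
    for record in provider_records:
        entry = merged[record["guid"]]
        entry["catalog_numbers"].add(record["catalog_number"])
        entry["providers"].add(record["provider"])
    return dict(merged)


shared_guid = "urn:lsid:example.org:Specimen:1"  # hypothetical identifier
records = [
    {"provider": "MO", "guid": shared_guid, "catalog_number": "MO 1234"},
    {"provider": "K", "guid": shared_guid, "catalog_number": "K 5678"},
    {"provider": "P", "guid": shared_guid, "catalog_number": "P AABB"},
]

specimens = merge_by_guid(records)
print(len(specimens))  # one specimen, so it is counted once in statistics
print(sorted(specimens[shared_guid]["catalog_numbers"]))
```

Every catalog number stays searchable as an attribute of the one merged specimen, which covers the second problem listed above as well.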
> When attempting to use world-wide specimen records via GBIF for
> biodiversity counts and species analyses, these duplicates artificially
> inflate the counts significantly in some cases.
Yes -- that's what I meant by "statistical search results". Presumably,
DiGIR Providers should only provide data on specimens that they currently
hold. For instance, if BPBM 12345 was donated to Smithsonian, and now has
the new catalog number USNM 987654, then Bishop Museum should not include
the record in its DiGIR provider under its original catalog number (BPBM
12345). Bishop could either represent it with the current catalog number
(USNM 987654), in which case an Aggregator could easily identify it as the
same specimen, or Bishop should exclude it from its DiGIR provider
altogether.
Of course, none of this is perfect -- there are likely to be all kinds of
errors of this sort when institutions wholesale dump their electronic
catalogs online in the form of DiGIR providers. But the same is true of
"hard" GUIDs. What's to stop Bishop Museum from assigning one GUID to its
record of BPBM 12345, and Smithsonian assigning another GUID to its record
of USNM 987654? The correct answer is, "nothing, really" -- except to
whatever extent the people in charge of assigning these GUIDs to specimens
in their charge are careful to avoid making such duplications. But nobody
is perfect -- which is why *any* GUID system is going to require some sort
of integrated "inadvertent duplication index", to keep a permanent index of
"objective" duplications (not to be confused with "subjective" record
equivalencies, such as this taxonomic concept is equivalent to that
taxonomic concept).
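Such an index could be as simple as a permanent redirect table from each retired GUID to the one that is kept. The sketch below is one possible shape for it; the class and the example.org LSIDs are hypothetical.

```python
class DuplicationIndex:
    """Permanent record of objective duplicates: retired GUID -> kept GUID."""

    def __init__(self):
        self._redirect = {}

    def canonical(self, guid):
        """Follow redirects until reaching a GUID that was never retired."""
        while guid in self._redirect:
            guid = self._redirect[guid]
        return guid

    def record_duplicate(self, retired, kept):
        """Declare that `retired` identifies the same object as `kept`."""
        kept = self.canonical(kept)
        retired = self.canonical(retired)
        if retired != kept:  # already merged; also prevents redirect cycles
            self._redirect[retired] = kept


index = DuplicationIndex()
bpbm = "urn:lsid:example.org:Specimen:100"  # assigned for BPBM 12345
usnm = "urn:lsid:example.org:Specimen:200"  # assigned later for USNM 987654
index.record_duplicate(retired=usnm, kept=bpbm)

print(index.canonical(usnm))
print(index.canonical(bpbm))
```

Retired GUIDs are never deleted, so anything that already cited one still resolves to the surviving identifier.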
> What about triplicate names? IPNI is often given as the example for a set
> of name records. But, IPNI can have three records for the same exact name
> and reference--one from IK, APNI and Grey Cards. IPNI has no plans to ever
> deduplicate these records due to the nature of the creation of the IPNI
> collaboration. So, do the three duplicate records get three GUIDs?
Not intentionally -- no (at least not in my view). But I can very easily
see how they would inadvertently be assigned different GUIDs -- hence the
need to be able to seamlessly deal with objective duplicates when they are
discovered.
> Where are the GUIDs actually to be perpetually located after they are
> assigned?
That's the crux of the question posed in Donald's PowerPoint file. My
inclination is to pick a more centralized organization that seems likely to
survive in the long run (GBIF seems to me to be a leading candidate;
although for taxonomic names, I would still favor the respective
nomenclatural Commissions).
> Are all the originating organizations supposed to modify their databases
> to add the GUID attribute and then build a mechanism to send out their
> records and then receive the GUID back from somewhere and finally update
> their records with it so the record+GUID can then in turn be published from
> their database onto the web?
I would like to think so, yes. Certainly all organizations that set up a
DiGIR provider. If you follow the link (above) to the DwC2 draft, you'll
see that the first element is "GlobalUniqueIdentifier", which is required in
the current draft. A stop-gap solution is to concatenate a "soft" GUID in
the form of:
URN:catalog:[InstitutionCode]:[CollectionCode]:[CatalogNumber]
...but personally, I see this only as a temporary solution. I'd rather see
the bioinformatics community bite the bullet and commit to a "hard" GUID
system.
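For what it's worth, the stop-gap is trivial to generate, and the same few lines show why it is only "soft". The function below is an illustrative sketch; the collection code "Fish" is a placeholder, while the catalog numbers are the BPBM/USNM example from earlier in this message.

```python
def soft_guid(institution_code, collection_code, catalog_number):
    """Concatenate the three DwC fields into the stop-gap URN form."""
    return f"URN:catalog:{institution_code}:{collection_code}:{catalog_number}"


# The same physical specimen, before and after it changed hands.
before = soft_guid("BPBM", "Fish", "12345")
after = soft_guid("USNM", "Fish", "987654")

print(before)
print(after)
print(before == after)  # one specimen, two "identifiers"
```

The soft form changes whenever the specimen is recataloged, which is exactly the case a "hard" GUID is meant to survive.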
> Couldn't agree more on the need for a single index/GUIDs to all references,
> but beyond that is needed the single database containing all the GUIDS plus
> the standard abbreviations and descriptions for them. Nobody has this
> database. There are subsets like BPH and TL2. But no single, definitive list
> of all references, online, in one place with GUIDs. This science needs that
> in the worst way.
I agree on all counts. Which is why I think someone (GBIF?) needs to build
it. It won't suddenly materialize out of nothing -- it will have to be
built over time. If you want to assign a GUID to a Taxon Name, you must
first enter the citation details for its original description Reference in
the Reference GUID issuer.
> If a concept is Name+Reference, then don't IPNI and Tropicos contain
> millions of concept records?
It depends on what you mean by "Concept". Note in my last email that I
explicitly identified "Concepts" as a *subset* of Name+Reference instances.
Who decides which Name+Reference instances are "concept-bearers" and which
are not? Tough question -- but one that is being thought about by the SEEK
folks. Similarly, who decides which Name+Reference instances are
"Name-bearers"? That's easier to answer: the respective Code of
Nomenclature.
Will there be millions of concept records? Well, given that there are
millions of names, I imagine there will probably be tens of millions of
Concepts to which those names have been, or will be, applied. There will,
of course, be BILLIONS of Name+Reference instances. I say this with
confidence because, in my view, every identification label of every specimen
in the world could potentially be considered as a "Reference", and there are
presumably billions of specimens out there. But I'm not terribly concerned
about such large numbers. As of this moment, there are 4,285,199,774 web
pages indexed by Google, yet it can find what I'm looking for with AMAZING
speed and efficiency -- and that's without any semantic context. What we're
talking about here is highly structured data in a tightly controlled
semantic context. Computers are exceedingly good at managing vast
quantities of data very quickly -- and they're getting better and faster all
the time. By the time we (the Bioinformatics community) get around to
digitizing billions of specimens and Name+Reference instances, the hard
drive on my laptop will be measured in Terabytes.
Aloha,
Rich
Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef(a)bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html