Comments regarding several emails inline:
Richard Pyle wrote:
By contrast, the core object in GNUB is a taxon name usage instance -- which
is a purely abstract notion of the usage of a taxon name within some
documentation source (like a publication). In this case, the text-string
name is merely a property of the GUID-identified object, and would be an
extremely BAD choice to use as a unique identifier.
It is possible that I'm not understanding what you are saying here, but
if you are saying that the only name-related property of your GNUB
taxon instances will be one which has a name string literal as its
object, then I think that is a big mistake. That will require any
client using your taxon instance metadata to re-process the literal
name string to cross reference it with lexical variants, parse it into
its pieces, etc. That should only need to be done once and then
referenced via a GUID for the name (i.e. in the sense of
tn:TaxonName).
This is why GNUB needs
to generate a unique identifier to represent this core data object. The
form that identifier takes (UUID, LSID, integer, DOI, whatever) from the
perspective of the end user should be completely irrelevant, because the
user should rarely (if ever) see it, and should certainly *never* be in a
position to type it on a keyboard (we can discuss the appearance of ZooBank
LSIDs on printed pages separately).
OK, again maybe I'm not understanding what you are saying here, but if
you are saying that you don't intend to expose your unique GNUB
identifiers to the public, then as far as I'm concerned you are setting
up GNUB to be irrelevant from the start. You mention a number of cool
taxonomist-geek type things that you hope to accomplish with GNUB. But
from my perspective as a non-taxonomist-geek, the main purpose I have
for GNUB is as a place to anchor dwc:Identification instances so that I
can indicate whether my identified resource is a representative of the
same taxon that is being referred to by somebody else (or at least to
make it possible for somebody to figure that out via computery
cleverness, Semantic Web or otherwise). How am I going to do that if
you don't provide me with a good (i.e. meeting the 8 criteria of my
last email) GUID to use as the object of my dwc:Identification
properties? For over a year, I've heard you lament that the whole
problem is that people make identifications and don't indicate the
sensu/sec. reference for the names they use. You are now creating a
system that would allow people to unambiguously make it clear what
taxon they mean but you aren't giving them any way to say what it is?
Again, I may just be misunderstanding what you wrote here.
Kevin Richards wrote:
Oh, now that I have read Rich's email here, it seems we are in agreement, of sorts. I think there is obviously a need for both of these "identifier" approaches - ie a record based ID that no one should really ever need to interact with directly, and a human friendly "ID" that allows people to discuss the same semantic "thing".
Yes. This "record based ID" can be anything you want. I really
don't and shouldn't have to care about that. The "human friendly ID
that allows people to discuss the same semantic thing" is precisely
what the TDWG GUID Applicability Statement (a ratified TDWG standard,
thanks to Kevin) is talking about. As I read that standard, I don't
see any requirement that a GUID be "human friendly", but I would
consider "human friendliness" to be a desirable "best practice"
(influenced somewhat by http://www.w3.org/Provider/Style/URI and
http://www.w3.org/TR/cooluris/) - if we have a choice of creating
externally exposed GUIDs that are either human-friendly or not
human-friendly, and if either works equally well, why not choose ones
that are human-friendly?
It is interesting all this discussion of identifiers when in the end it doesn’t matter too much what the identifier is, just that you have an identifier at all. The important thing is the semantics, the "are we talking about the same thing" question - so this is where I believe RDF/semantic web comes in - I might see if I can come up with some RDF/sem web example for TDWG that could demonstrate this, hmmm...
Already done in the context of tc:Taxon and tn:TaxonName and posted on
this list in January:
http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002204.html .
http://biodiversity.org.au/apni.taxon/118883
an identifier that is both friendly to humans and computers. Through
content negotiation a computer gets
http://biodiversity.org.au/apni.taxon/118883.rdf
and the human gets
http://biodiversity.org.au/apni.taxon/118883.html
The resource itself has rdf:type tc:TaxonConcept (defined in the
ontology to be equivalent to tc:Taxon), well-known because it is part
of the TDWG ontology. In these examples, the approach for referring to
name strings through tc:hasName, the subsequent reference to a name
record (http://biodiversity.org.au/apni.name/36530), and the structure
of that name record in RDF
(http://biodiversity.org.au/apni.name/36530.rdf) follow the approach of
the TCS standard (as incarnated in the TDWG ontology) very precisely.
I can't see anything in these examples that doesn't follow TDWG
standards and what I know of as "best practices". Thank you, Paul...
Also we have many examples of appropriate HTTP URI GUID use from Pete,
although not involving tc:Taxon and tn:TaxonName specifically.
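The content negotiation described above for the APNI URI could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the ".rdf" and ".html" suffixes are taken from the APNI examples above, and the Accept-header handling is deliberately simplified (no wildcards). It is not a description of how the APNI server is actually implemented.

```python
def _quality(accept_header, media_type):
    """Return the q-value the client gave media_type (0.0 if not listed)."""
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if fields[0].lower() != media_type:
            continue
        for param in fields[1:]:
            if param.startswith("q="):
                return float(param[2:])
        return 1.0  # listed without an explicit q-value
    return 0.0


def representation_for(guid_uri, accept_header):
    """Pick the representation URL for a generic HTTP URI GUID.

    Hypothetical rule: serve RDF only when the client prefers
    application/rdf+xml over text/html; otherwise serve HTML.
    """
    rdf_q = _quality(accept_header, "application/rdf+xml")
    html_q = _quality(accept_header, "text/html")
    suffix = ".rdf" if rdf_q > html_q else ".html"
    return guid_uri + suffix


guid = "http://biodiversity.org.au/apni.taxon/118883"
print(representation_for(guid, "application/rdf+xml"))
print(representation_for(guid, "text/html,application/xhtml+xml;q=0.9"))
```

The point of the sketch is that the GUID itself never changes; only the representation that comes back depends on who is asking.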
Richard Pyle wrote:
I like your list of what we want "GUIDs" (see below) to do, and I think it's
an excellent starting point for a bar we should all strive for. I'm
particularly grateful to learn that the existing ZooBank service fails so
many of them. I've forwarded your post to Rob Whitton, who will be working
on Gen-2 of ZooBank in the coming weeks, and asked him if we can use your 8
tests as a metric to adhere to.
Better yet, read the TDWG GUID Applicability Statement
http://www.tdwg.org/standards/150/ and http://www.w3.org/TR/cooluris/
. My 8 points are just a paraphrase out of my head. Striving is not
good enough. Follow the standard.
"But really, from the perspective of the end-user, does it matter
if it's an identifier or a service? Ultimately, they ask the questions,
and the answers appear on their computer screens."
I would answer this question by saying "yes, it does matter!" -
it is important that a well-designed GUID do more than just throw
something up onto a human user's web browser.
I absolutely agree with you, but that's not the distinction I was making in
my quoted text. I was only talking about whether we call something an
"identifier" (not GUID, which has more specific implications), or a
"service", in the context of human-machine conversations. I think your
enumeration of things we want GUIDs to do is a very good framework for
discussion. I would only caution that "GUID" means different things to
different people (some people use it synonymously with UUID, for example),
and also that GUID does not imply "actionable".
Again I would say read http://www.tdwg.org/standards/150/ . When I say
"GUID" I am not throwing around a colloquial term. I intend for it to
have the exact technical meaning that it is given in the TDWG
standard. At this point in time (i.e. after we finally have a ratified
standard on GUIDs), nobody in our community has any business designing
and exposing GUIDs without having read this document and completely
understanding its requirements and recommendations. I should not have
to be "explaining" any of this to anybody on the list. It is explained
clearly and concisely in the standard. I really am somewhat
flabbergasted about how participants in TDWG, which I think is supposed
to be a biodiversity standards organization, generally don't seem to
read and follow the ratified standards. I think the process could be
helped somewhat if the TDWG website were cleaned up a bit to make the
obsolete stuff less easy to find and the important, current stuff
easier to find. Also, I don't understand why all important documents
aren't linked to the permanent URI page (e.g.
http://www.tdwg.org/standards/150/) in pdf format. That would allow
users to view the page directly in a web browser rather than having to
open a zip file and then open a Word document.
There has been a bit of a
debate over the importance of embedding "actionability" into identifiers
inherently (the Tim Berners-Lee perspective)
Wrong. "GUIDs should be resolvable" (direct quote of recommendation 7
from the GUID applicability statement).
, vs thinking about
"identification" separately from how we perform some action on it. For
example, UUIDs and Social Security numbers are extremely useful identifiers,
even though they are not inherently actionable. It's amazingly easy to
perform action on a non-actionable identifier by simply appending it to an
actionable prefix. For example, going back to the list of "identifiers":
A. A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
B. urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
C. http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
D. http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
E. http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
F. http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
G. http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
H. http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523&submit=Go
There are two different ways of looking at this:
1) There are 8 different identifiers
2) There is one identifier (A)
A is an identifier but A does not meet the requirement of the GUID
Applicability statement. Quote recommendation 2: "HTTP GET resolution must
be provided for non-self resolving GUIDs". Pick one of your proxied
HTTP URIs, call it your GUID and stop there. (Note: the emphasis on
"must" is present in the standards document, not added by me.)
, and 6 ways to perform action on it (B-E,
G-H).
If you treat them all as distinct identifiers, then let me add a few more to
the list:
I. http://zoobank.org/?lsid=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
J. http://zoobank.org/?id=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
K. http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
L. http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
Don't add more of them to the list. Recommendation 3: "Providers must
assign at most one GUID to any particular object." Recommendation 4:
"Only one globally unique identifier should be assigned to each
object".
Note that all four of the above, plus B-D in the original list, are all
resolved through zoobank.org. Why are there so many different ways to
perform action on the "same" identifier? Because I wanted the ZooBank
resolution service to be flexible. And, because in my mind, there is only
one identifier (A); and lots of different ways to retrieve the metadata of
the object it represents.
I would assert that what you "want" and what you have in your mind is
at odds with the TDWG standard for GUIDs.
Now consider this from the TB-L perspective. Eleven different identifiers
for the same object (excluding F). Does that mean we need to generate
owl:sameAs statements for all pair-wise relationships? That's a lot of
owl:sameAs statements! Maybe I'm the bad guy for foolishly allowing so many
different ways to resolve ZooBank identifiers, and for needlessly fabricating
so many "different" identifiers for the same thing. Fair enough.
But I still think we're a lot better off by disentangling identifiers from
the services we use to perform action on them.
This may be your opinion, but it is at odds with the ratified standard
which says (recommendation 2) that "HTTP GET resolution must be
provided for non-self-resolving GUIDs".
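To put a number on the "lot of owl:sameAs statements" worry, here is a small worked count. It is only a sketch: it rebuilds the eleven forms A-E and G-L from the lists above (F is excluded, as in the quoted text) by appending the UUID or LSID to each prefix, and counts the unordered pairs. The variable names are mine.

```python
from itertools import combinations

UUID = "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"
LSID = "urn:lsid:zoobank.org:act:" + UUID

# The eleven forms from the lists above (F, the Google search, is left out).
identifiers = {
    "A": UUID,
    "B": LSID,
    "C": "http://zoobank.org/" + LSID,
    "D": "http://zoobank.org/" + UUID,
    "E": "http://lsid.tdwg.org/" + LSID,
    "G": "http://lsid.tdwg.org/summary/" + LSID,
    "H": "http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q="
         + LSID + "&submit=Go",
    "I": "http://zoobank.org/?lsid=" + LSID,
    "J": "http://zoobank.org/?id=" + LSID,
    "K": "http://zoobank.org/?id=" + UUID,
    "L": "http://zoobank.org/?uuid=" + UUID,
}

# Every unordered pair would need its own owl:sameAs statement if all
# pair-wise relationships were asserted explicitly.
pairs = list(combinations(sorted(identifiers), 2))
print(len(identifiers), len(pairs))  # prints: 11 55
```

So eleven forms of one identifier means 55 pair-wise statements for a single object. Because owl:sameAs is symmetric and transitive, a reasoner could get by with 10 statements linking each form to one chosen form, but exposing a single GUID avoids the question entirely.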
One of the arguments on the TB-L side is that a non-actionable identifier by
itself is useless if you cannot inherently perform action on it. For
example, if you were walking through the park and stumbled upon a slip of
paper with "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" written on it, you
probably wouldn't be able to do much with it. But in reality, that's not
what happens. We never expose identifiers as simple context-free
identifiers in their non-resolvable form. These identifiers are *always*
exposed in some context. The problem is that if you treat the "resolution
metadata" (as I call it -- e.g., "urn:lsid:zoobank.org:act:" or
"http://zoobank.org/") as *part* of the identifier (as you have to do if you
make things like "urn:lsid:ubio.org:namebank:11815"), then it becomes
difficult for an application to distinguish between
"http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", and
"http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"; which, to a
human, obviously refers to the same thing. In other words, absent all those
owl:sameAs statements, an application could break if it harvests content
from different sources that use different resolution metadata for the "same"
(sensu Pyle) identifier.
The problem here is caused by you when you create and expose so many
different HTTP URI forms of your UUID. Stop doing that (recommendation
4).
Maybe what we need to think about is a registry of "persistent resolution
services", which our community relies on. That way, we can apply the
owl:sameAs statements to the resolution services, rather than to every
single individual identifier.
There is no need for this. Make a single HTTP URI version of your UUID
and stick with it. Preferably one without the query string, and use
mod_rewrite (or an equivalent URL-rewriting mechanism) to transform the
simple, clear, and permanent version of the URI into whatever flavor of
temporary URL you like at the moment. Every application today
understands HTTP GET. No need for a registry.
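A minimal sketch of that idea, written in Python rather than as actual rewrite rules. The assumptions are mine: I have picked form D from the list above (http://zoobank.org/ plus the bare UUID) as the single permanent URI, and form L (the "?uuid=" query) as the stand-in for whatever internal URL the server uses this year. Neither choice comes from ZooBank, and the function names are hypothetical.

```python
import re

# Matches a UUID in the 8-4-4-4-12 hexadecimal layout.
UUID_RE = re.compile(
    r"[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}"
)

# Assumption: form D is the one permanent, publicly exposed GUID.
CANONICAL_PREFIX = "http://zoobank.org/"
# Assumption: the server's current internal URL pattern (free to change).
INTERNAL_PREFIX = "http://zoobank.org/?uuid="


def canonical_uri(any_form):
    """Reduce any of the variant forms to the single canonical HTTP URI."""
    match = UUID_RE.search(any_form)
    if match is None:
        raise ValueError("no UUID found in %r" % any_form)
    return CANONICAL_PREFIX + match.group(0).upper()


def internal_url(canonical):
    """Rewrite the permanent URI into the temporary internal URL."""
    uuid = canonical[len(CANONICAL_PREFIX):]
    if not canonical.startswith(CANONICAL_PREFIX) or not UUID_RE.fullmatch(uuid):
        raise ValueError("not a canonical URI: %r" % canonical)
    return INTERNAL_PREFIX + uuid


legacy = ("http://zoobank.org/?id=urn:lsid:zoobank.org:act:"
          "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523")
print(canonical_uri(legacy))
print(internal_url(canonical_uri(legacy)))
```

Only the output of canonical_uri() would ever be published; the internal URL can change without anybody outside noticing.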
An important question that I think has been underlying much of this
discussion is whether GUIDs are actually needed for names.
I think the answer is clearly "yes". The problem is defining what is meant
by the word "name".
Go with the TCS standard and the TDWG ontology as it exists currently.
and parts thereof, then it does make sense to apply GUIDs to that
kind of entity. I am thinking about a tn:TaxonName as defined in the
TDWG ontology (see
http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/TaxonName.rdf),
which comes out of the TCS schema (see
http://code.google.com/p/darwin-sw/wiki/ClassTaxon for info and links
regarding TCS).
A tn:TaxonName is "An object that represents a single scientific
biological name..." i.e. an "object"
NOT defined as a string.
While it's nice to see the explicit representation of a "name" as an object,
rather than a string; unfortunately that doesn't address the elephant in the
room; that is, that different people have different notions of what "a
single scientific biological name" is. I'm not talking subtly different
shades of fundamentally the same thing; I'm talking about fundamentally
different things with different implied sets of properties. This is one of
the issues I continued to hammer on during the development of TCS, and the
one that gave me the biggest qualms about TCS 1.0. My hope was that it
would be resolved in TCS 2.0.
There ain't no TCS 2.0 . There is only TCS 1.2 . I'm sorry about it,
but that's the ratified standard.
I wanted to reduce both names and concepts to
the same core entity: usage instances. That's exactly what we're doing with
GNUB.
There have been any number of things that I would "like" to be the way
I want. However, the point of standards is that they get hammered out
in a form that satisfies the community in a general way. Individual
people often are left without everything that they wanted. From within
our own personal projects, we can do anything we darn well please. But
when it comes to communicating with others, we should discipline
ourselves to follow the standards. I understand that for existing
systems, there is considerable time and money required to retrofit old
systems to a new standard. But GNUB is not an "old system". It is
being built from scratch and I would assert that where it comes to
interfacing it with the outside world, it should follow standards such
as they exist at the moment. At the moment, people are allowed to
think about and describe names without reducing them solely to usage
instances as you would like. I spent about an hour yesterday composing
a rant about how counterproductive it is for taxonomy and computer
geeks to create tools and systems that won't ever actually be used by
the people who need them. I decided that it wasn't helpful to actually
post it, but now I'm thinking that maybe I should have...
That's only true to the extent that tn:TaxonName may be too broadly
(imprecisely) defined (just like dwc:Taxon).
dwc:Taxon doesn't really have much of a useful definition, so I'm
with you there. tn:TaxonName is actually rather precisely defined, at
least if you look at the RDF
(http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/TaxonName.rdf)
and relate it to the TCS documents on which it is based
(http://www.tdwg.org/standards/117/ , again it would be extremely
useful to have a pdf version of the User Guide directly linked to that
page so that people could look at it in their browsers rather than
having to download a zip archive. Note also Kennedy et al. 2005
http://www.springerlink.com/content/7bv5pa3falxwrrvx/ which I found
helpful for understanding the rationale for TCS). In my opinion, TCS
(and by extension, the TDWG ontology) puts a rather restrictive collar
and leash on taxon names. I quote from the user guide page 9:
"<TaxonName> elements do not represent taxa. They serve only as
abstract nomenclatural data structures that encapsulate the core rules
of the different nomenclatural codes. Their purpose is to prevent
nomenclatural statements becoming confused with statements about the
circumscription of, and relationships between, different taxon
concepts. No taxonomic opinion can be expressed using
<TaxonName> elements in TCS. As a rule of thumb if you are
dealing with anything beyond a type specimen and references to it, you
are talking about a TaxonConcept of some form." This does not seem
like a broad and imprecise definition to me. One is allowed to
describe the pieces of the name and that's about it.
When I look carefully at how the TDWG ontology deals with taxon names
and taxon concepts, it seems very simple and "usable" to me. If one
defines a Taxon to be composed of a name component and a sensu/sec.
component as several people (including you, I think) on this list have
done and as TCS has done (I think), then representing it in RDF becomes
tractable. One anchors the name part to a tn:TaxonName instance
(properly collared and chained and wearing a GUID as a dog tag). How
one anchors the sensu/sec. part is still a subject for discussion. I
have been thinking about the following approach. It is based on a Venn
diagram that I have in my head which I created from your descriptions
of TNUs on this list. The Venn diagram has a big rectangle labeled
"nominal taxon". Inside that is a smaller rectangle named "taxon name
usage (TNU)". Inside that is an even smaller rectangle named "taxon
concept". In this view, Taxon concepts are
well-described/circumscribed by a publication. TNUs (which include
taxon concepts) are associated with a particular person's idea of what
the taxon is, but which may or may not be described in a publication.
Nominal taxa are all instances of a scientific name use including those
where we have no idea who applied the name or what set of organisms
they intended to be included in the taxon. In terms of RDF metadata:
1. Go ahead and let the rdf:type of the thing be tc:Taxon
2. Make the object of tc:hasName be a GUID (i.e. as described by the
TDWG GUID Applicability Statement, not some other kind of
GUID)-identified resource, preferably from a well-known source like
uBio.
3. If the sensu/sec. is described in a publication (in my mind a true
taxon concept), then the object of tc:accordingTo is an HTTP proxied
DOI, HTTP URI of a BHL-scanned publication, or if both of those fail,
something non-resolvable but globally-unique like an ISBN or URL of a
stable web page.
4. If the sensu/sec. is not described in a publication, but is
associated with a particular person (in my mind a TNU that isn't a true
taxon concept), then the object of tc:accordingTo could be the URI of a
foaf:Person or foaf:Group.
5. If the sensu/sec. is completely unknown, then the taxon is a nominal
taxon that is not a TNU. I don't know whether it is better for the
taxon to simply lack a tc:accordingTo property or to have a
tc:accordingTo property that somehow says "we don't know anything about
the sensu/sec.".
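The five points above could be sketched like this. It is only an illustration and several things in it are assumptions rather than settled practice: the tc: namespace URI is what I believe the TDWG ontology uses, all of the example.org URIs are made up, the function names are hypothetical, and for case 5 the sketch simply omits tc:accordingTo, which is one of the two options mentioned and not a recommendation.

```python
# Assumed namespace for the tc: prefix in the TDWG ontology.
TC = "http://rs.tdwg.org/ontology/voc/TaxonConcept#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def describe_taxon(taxon_uri, name_uri, according_to_uri=None):
    """Return the core triples for a taxon as (subject, predicate, object).

    according_to_uri is a publication URI (case 3), a foaf:Person or
    foaf:Group URI (case 4), or None for a nominal taxon (case 5).
    """
    triples = [
        (taxon_uri, RDF_TYPE, TC + "Taxon"),       # point 1
        (taxon_uri, TC + "hasName", name_uri),     # point 2
    ]
    if according_to_uri is not None:               # points 3 and 4
        triples.append((taxon_uri, TC + "accordingTo", according_to_uri))
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join("<%s> <%s> <%s> ." % t for t in triples)


# All three URIs below are hypothetical placeholders.
concept = describe_taxon(
    "http://example.org/taxon/1",
    "http://example.org/name/36530",
    "http://example.org/publication/doi-proxy",
)
nominal = describe_taxon(
    "http://example.org/taxon/2",
    "http://example.org/name/36530",
)
print(to_ntriples(concept))
print(to_ntriples(nominal))
```

Both taxa point at the same name GUID; the only difference between a taxon concept, a TNU, and a nominal taxon in this sketch is what (if anything) sits at the other end of tc:accordingTo.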
I realize that you probably aren't going to like this because it isn't
as sophisticated and nuanced as you would like for your GNUB TNUs to
be. However, there would be nothing that would prohibit you from
creating and adding a myriad of clever properties to the tc:Taxon
instance RDF to make it do all of the things you want. The practice I
have described would break down the act of defining a taxon into
well-known, standardized pieces and it is a practice that could
fairly easily be followed by people without sophisticated IT
resources. It would allow for the transfer and comparison of taxon
information and make it possible to reconcile at some central
location (like GNUB) the taxa that are described in a distributed
network of users. Doing something like this is, I believe, the entire
reason for the existence of TCS, the TDWG ontology, old TDWG TAG
roadmaps, etc. Please apply some self-discipline to follow the
ratified standards or risk blowing us all back to 2005 where we would
have to discuss all of the settled things again. If that is going to
happen, I will give up on TDWG because I'll be retired before it is
done over again.
In some ways what I'm talking about here is really (as I understand it)
the principle that underlies REST. Within your big GNUB kingdom and my
little Bioimages kingdom, we are free to do whatever clever things we
want, structure databases as we wish, do clever programming stuff or
whatever. But when you and I talk, we follow commonly established
rules, namely we talk using the HTTP protocol and identify the things
that we want to talk about using HTTP URIs. Since we are talking
specifically about biodiversity informatics, we should choose to follow
more restrictive rules about the identifiers themselves (following the
TDWG GUID applicability statement) and the nature of the RDF (following
the GUID applicability statement, well-known vocabularies such as the
TDWG ontology, FOAF, DCMI, Darwin Core, geo, etc.). If we fail to do
that, then every interaction that I have with another entity requires
me to establish in advance the rules of that interaction. The Web
works well because people follow a defined set of rules about URLs and
HTML. I would assert that we now (at last) have a similar model
available to us in the biodiversity informatics community if
organizations would just have the self-discipline to use it.
Roderic Page wrote:
Reading this thread makes me despair. It's as if we are determined not to make progress, forever debating identifiers and what they identify, with seemingly little hope of resolution, and no clear vision of what the goals are. We wallow in acronym soup, and enjoy the technical challenges, but don't actually get anywhere
I have to say that I'm not as pessimistic as Rod is. Maybe that's just
because I haven't been involved in the process as long as he has and
haven't had sufficient time to develop appropriate cynicism. But I
think there has been real progress, even in the couple years I've been
tracking TDWG. We DO have a GUID Applicability Statement Standard
now. We DO have a Darwin Core standard that defines terms which could
be used to describe properties of biodiversity resources. We DO have
DOIs that are HTTP proxied and which return real metadata. We DO have
people in our community who know how to write RDF and set up content
negotiation for GUIDs as described in standards and best practices. I
would also say that we do have a relatively clear vision of what the
goals are. When I look at the old TAG roadmaps from 2006-2008
http://www.tdwg.org/uploads/media/TAG_Roadmap_01.doc (2006)
http://www.tdwg.org/fileadmin/subgroups/tag/TAG_Roadmap_2007_final.pdf
(2007)
http://www.tdwg.org/fileadmin/subgroups/tag/TAG_Roadmap_2008.pdf (2008)
the goals laid out there are the same ones I hear people talking about
now. The difference is that we now have the tools and standards to do
what was desired in 2006-8. We also have a funded project (BiSciCol)
that is making progress toward developing a system that will track when
changes occur in metadata for resources that are described by GUIDs.
So I'm actually pretty optimistic about the whole venture assuming that
we can get people and organizations to actually read and try to follow
the standards that we have already agreed upon.
Steve
--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.
delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu