Taxon debate synthesis?

Thu Nov 17 11:13:10 CET 2005

Hi Jessie & Ricardo,

As an implementer (not a domain scientist), I've been thinking about
some of these same issues while designing and prototyping DiGIR 2.  I'll
place my comments in-line:

Kennedy, Jessie wrote:

>Hi Ricardo
>
>
>
>>    I think we should really have one GUID system for anything.
>>    For one, a single GUID system will favor integration within and
>>across domains. For example, if you get a GUID somewhere, you don't
>>need to figure out which GUID system you use to resolve it, you just
>>resolve it within the one GUID system and get metadata describing the
>>object. Can be a taxon concept, which can have GUIDs linking it to
>>specimens and observation data, which in turn can be resolved using
>>the same architecture.
>>
>
>I agree this is easier for us to implement - I think I mentioned that
>but I don't think that's always the right reason for doing something.
>
>
>
I agree with both of you.  One reason for implementing a single GUID
system is ease of resolution, but it's not the only reason.  As I see
it, the alternative, which provides multiple GUID systems, each
obligated by a social contract to provide GUIDs that are restricted to a
certain type of data object, does not really provide any guarantee that
you know what type of object you will get when you resolve a GUID.
Inside of a GUID system, it is quite difficult to enforce that a certain
form of GUID will resolve to a certain type of data object.  Even with a
centralized service, problem cases can occur.  People can publish GUIDs
for the wrong type of object, either by mistake or through malice.
However, even if this is prevented through technological means, there is
another difficult case.

This case occurs when one part of the community needs to extend an
existing schema or ontology.  Even if they design a clean extension of
an existing class whose instances can polymorphically appear to be
instances of the existing base class, this new class of data objects
will not have exactly the same type as the existing class.  If the
authority that is responsible for serving instances of the base class
refuses to allow the extension, the portion of the community that
requires it will fork and start their own authority despite the fact
that the two classes of data object are related.  If the central
authority allows the extension, its implicit contract with clients that
it serves only known types of objects will be broken.

So, it may always be the case, even with several authorities that
attempt to serve only objects of specific types, that a piece of client
software will have to gracefully handle data objects of unknown types.
In my opinion, the best way to handle the problem is to place the burden
of figuring out what type a data object has or what it means
semantically on the client software, not on the GUID infrastructure.
The client software will have to understand the data objects anyway, and
the first step in this process is usually figuring out what type a data
object has.
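
A minimal sketch of what "gracefully handle data objects of unknown
types" could look like on the client side.  It assumes resolved objects
arrive as plain dictionaries with an optional "type" key; the type name
"TaxonName", the property "nameString", and all function names are
hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a type identifier to a handler function.
HANDLERS: Dict[str, Callable[[dict], str]] = {}


def handles(type_id: str):
    """Decorator that registers a handler for one known type."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        HANDLERS[type_id] = func
        return func
    return register


@handles("TaxonName")
def show_name(obj: dict) -> str:
    return f"name: {obj.get('nameString', '?')}"


def generic_view(obj: dict) -> str:
    # Fallback for unknown or missing types: list whatever properties exist
    # instead of failing.
    props = ", ".join(f"{k}={v}" for k, v in sorted(obj.items()) if k != "type")
    return f"unrecognized object ({props})"


def handle(obj: dict) -> str:
    # The client, not the GUID infrastructure, decides what the object is.
    handler = HANDLERS.get(obj.get("type"), generic_view)
    return handler(obj)
```

An object whose type was introduced by a community extension (say
"ExtendedName") falls through to generic_view rather than raising an
error, so the client can still show the user something.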

>>    Second, I tend to think that such a system will work better if the
>>assigning authorities are decentralized, i.e. agents are free to
>>assign GUIDs to anything.
>>
>
>The only issue here - and I don't think there is an easy answer - is the
>trade off between easy issuing of GUIDs and difficulty of knowing what
>you get back from them, what you can do with them and which GUIDs are of
>real value....I guess what you're saying is let the users decide that
>even if they don't really know what the issues are. I'd like to for
>example decide that if IPNI issue TaxonNames GUIDs then I can decide
>wholesale to accept that they've done a good job and know they are
>names, know they are not concepts, know what I will get and therefore
>what I can do with them - I don't really want to have to figure that out
>somehow at run-time.
>
>
>
I think that you bring up several different issues here. In my mind the
issue of what kind of data object you get back when you resolve a GUID
(and therefore what you can do with it) is orthogonal to the issues of
data quality and trust.  The first issue can be solved with technology, the
second issue can be addressed with technology (but never truly solved)
and the third depends on human intervention.

The most easily addressable issue is knowing how to understand what you
get back when you resolve a GUID.  My only comment here is that it's
likely that the end user will not be dealing directly with the resolved
data objects.  Instead I imagine that they'll be working with a piece of
client software that understands a set of ontologies (and can perhaps
dynamically fetch new ones).  This client software will have to
understand the data objects semantically and give the user some options
on how to resolve GUIDs that are referenced in a resolved data object,
how to merge many objects into a single graph, and how to display or
output these data objects in a form that is useful.

This has some serious implications for what type of data representation
system we use.  Building a client like this that can deal with data
objects described with XML Schema would be difficult (perhaps even
impossible).  I agree with Roderic Page that RDF makes the problem
easier to solve.  While this is supposed to be a discussion of GUIDs,
you can't cleanly separate the GUID mechanism from the data
representation system (because design decisions taken in one system
constrain the design of the other).

The second issue, whether or not a resolved data object is of real value,
can be addressed in several ways.  First, it needs to be syntactically
valid and semantically meaningful to be of real use.  Syntactic validity
is easy to test but depends on how we decide to represent data objects.
The constraint that data objects be semantically meaningful can be
partially addressed if we use real ontologies that can be reasoned over,
but depends in part upon how crazy we get with our ontologies.  It is
possible to "semantically validate" data objects against an ontology and
this service should be part of client software.

The third issue, trust, is interesting.  My feeling is that there will
be two different ways that an end user can get data objects and that
trust will work differently for the two methods.  The first method will
be by using some kind of portal or search engine.  If a domain scientist
is interested in finding some taxonomic concepts so she can use them in
her work, she's not going to start by using a GUID resolution system.
Instead, she'll work with a portal or piece of client software that has
its own user interface or protocol and perhaps some algorithm for
selecting the most likely concepts that match some search constraints.
She will query this service and examine the results to select one or
more taxonomic concepts.  From that point on, she can refer to them in
her work using GUIDs.  In this case trust is evaluated by her during her
selection of a service to query and her selection of the specific taxon
concepts that seem to make sense.

The second method of working with GUIDs occurs when a domain scientist
resolves a GUID to get back an object (through the use of client
software).  In this case she can decide to trust the result based not
just on well-formedness and whether or not the data object seems to make
sense, but also based on who created the data object.  Information on
the ownership of data objects could be represented in the metadata about
those objects or it could be represented in the GUID itself (as in the
authority part of an LSID).  But if there is only one central issuing
authority and metadata about creator or owner is not stored with each
object, then it's hard to resolve the trust issue.
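
As a rough illustration of using the authority part of an LSID for
trust decisions, here is a small sketch.  It relies only on the general
LSID layout (urn:lsid:authority:namespace:object, with an optional
trailing revision).  The trusted-authority set and the function names
are hypothetical, and comparing authorities case-insensitively is a
simplifying choice made for this example.

```python
from typing import NamedTuple, Optional, Set


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(urn: str) -> Lsid:
    """Split an LSID URN into its colon-separated components."""
    parts = urn.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {urn!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


# Hypothetical list of authorities this user has chosen to trust.
TRUSTED_AUTHORITIES: Set[str] = {"authority"}


def is_trusted(urn: str, trusted: Set[str] = TRUSTED_AUTHORITIES) -> bool:
    # Trust is decided from who issued the identifier, not from its type.
    return parse_lsid(urn).authority.lower() in trusted
```

This only covers the case where ownership is encoded in the GUID itself;
ownership recorded in the object's metadata would need a separate check
after resolution.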

>>    What controls the chaos in this case are the metadata services. You
>>can issue GUIDs right and left, but then what will add value to them
>>is how you cross link them and maybe more importantly, how the
>>community uses the objects you issued GUIDs for (again this relates to
>>cross linking objects).
>>
>
>I can understand this to some extent but (possibly because of my
>ignorance) can't quite see how this will work. I'm a little confused
>about what say IPNI or TROPICOS would do here - I'm imagining they would
>issue GUIDs for what they have in their databases....how would this
>work and how and who would cross link their GUIDs. What would they issue
>GUIDs for? How would we refer to their concepts or names? What would we
>get returned I'm not sure from this proposal.
>TCS relies on lots of cross linking but all of the cross linking for one
>concept is encapsulated within the one concept (and thereby GUID).  So
>for example we differentiate between relationships which were mentioned
>in the definition of that concept between the concept and other
>concepts, and relationships by others regarding that concept which are
>opinions of others and not part of the definition of the concept as
>defined. How would we do that?
>
>
>>    Third, if you want, you still can have a centralized metadata
>>service that works like an authority saying which taxon GUIDs are
>>important.
>>    I just don't think you can enforce data correctness at the issuing
>>authority level.
>>
>>
>
>Yes if I was clear about what I was getting if I resolved a taxon GUID.
>I'm still not entirely clear what we're getting back.
>I agree you can't enforce correctness at the issuing authority level but
>I think you can set up agreements as to what your protocol for
>publishing is and what you will do with your objects that you have
>assigned GUIDs to etc.
>
>Can anyone tell me come to think about it what we are planning on
>providing here? If anyone can issue GUIDs (I guess using their own GUID
>server) for any taxon thing - or really anything that anybody wants to
>consider a taxon thing. What "extra" value are we talking about
>providing?
>
>
>
>
>>    Regarding your comment that computers can't handle metadata, I
>>think we need to explore that a little bit more with examples. Could
>>some of us in the group present more concrete examples on how
>>something like that would be done? I think that if we define one or
>>more ontologies (i.e. a classification scheme for things getting
>>GUIDs) computers can then say: I know what class this object belongs
>>to, it's a NameUsage, or a TissueFromSpecimen, and act upon it. If the
>>system doesn't know what kind of object that is, it can just ignore it
>>or try to render it using an XSLT stylesheet that can be used to
>>display that content to the user (if that's available).
>>
>I guess what I was trying to get at here is - does the metadata tell me
>what kind of object I'm getting or does it contain the data (or at least
>some of the data) that I've been thinking is the values for the concept.
>I'm confused about that after reading some of the posts.
>
>I don't want to stop us progressing but I want to be sure that where
>we're going is going to add some value to what is out there already.
>
>
>

In my mind, GUIDs are simply pointers to data objects.  The agreements
should be about how data objects are represented in the community. In
other words, what ontologies are used and how these ontologies model the
semantic properties of data objects and the relationships between
different types of data objects.  How it works exactly depends upon what
GUID scheme we select and how we decide to represent knowledge in the
community.

Here's an example of one way of doing things.  I apologize in advance
for going into depth, but I think it's required in order to illustrate
what I'm talking about.

If we select LSIDs as the GUID mechanism and RDF-Schema or OWL
ontologies as the knowledge representation layer, then LSIDs will
resolve to chunks of RDF (metadata, since the data portion of LSIDs
should be reserved for holding complex binary objects like images or
sound files that can't be "understood").  Each chunk of RDF represents
one or more data objects that should be instances of some ontological
class.  Because every data object, no matter what it is, is represented
in the same way (RDF), the client software only has to parse one format
(and does not have to know about countless XML Schemas and how to parse
and understand instance documents from them).  There are already many
APIs available for working with RDF (we use Jena
http://jena.sourceforge.net in DiGIR2).

Inside a chunk of RDF, a single data object is identified by a resource
URI (in our case this would be the LSID URN).  The representation of a
data object takes the form of a series of assertions about that object.
These assertions are statements in the form of subject, predicate, and
object.  So, if the data object represents a taxonomic name, that name's
LSID is the subject of any assertion about it.  The predicate would be
some URI that uniquely identifies a property of the class that the data
object belongs to, for instance its name or its relationship to
another data object.  Finally the object of the statement is some
value.  What makes RDF powerful and flexible is that objects (values)
can be other resources.  So, it's possible to represent a data object
like this:

Assertion 1:
    Subject: <urn:lsid:authority:namespace:123>
    Predicate: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
    Object: <http://tdwg.org/ontologies/tcs/3.0#taxonomicName>

Which says that the data object identified by the LSID
urn:lsid:authority:namespace:123 is an instance of the class
taxonomicName which is defined in the OWL ontology whose namespace is
http://tdwg.org/ontologies/tcs/3.0

Assertion 2:
    Subject: <urn:lsid:authority:namespace:123>
    Predicate: <http://tdwg.org/ontologies/tcs/3.0#nameString>
    Object: "Physeter catodon"

Which says that the object identified by the LSID
urn:lsid:authority:namespace:123 has a property called nameString (which
is defined in the OWL ontology with the namespace
http://tdwg.org/ontologies/tcs/3.0) and its value is set to the string
"Physeter catodon".  (Forgive me if this is not correct TCS; I'm trying to
keep it simple.)

Assertion 3:
    Subject: <urn:lsid:authority:namespace:123>
    Predicate: <http://tdwg.org/ontologies/tcs/3.0#isBasionymOf>
    Object: <urn:lsid:authority:namespace:256>

Which says that the object identified by the LSID
urn:lsid:authority:namespace:123 has a property called isBasionymOf
(which is defined in the OWL ontology with the namespace
http://tdwg.org/ontologies/tcs/3.0) and its value is set to the data
object identified by the LSID urn:lsid:authority:namespace:256.

So, as soon as you resolve urn:lsid:authority:namespace:123, you know
that it is a taxonomic name, that it has some properties (like
nameString) and that it is related to some other data object in a well
defined way.  The beautiful thing is that even if the piece of software
that has to resolve and understand an LSID doesn't know what type of
data object it is (this would be the case if assertion 1 was missing),
it can still know some things about it and how it relates to other
things (because it would recognize the predicates).  It is also possible
for the client software to fetch the original ontology and make further
inferences from it.
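
The three assertions can be sketched as plain (subject, predicate,
object) tuples to show how a client might still learn something when the
type assertion is absent.  This is only an illustration: plain strings
stand in for both URIs and literals (a real RDF API such as Jena keeps
them distinct), and the describe function and KNOWN_PREDICATES set are
hypothetical names.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TCS = "http://tdwg.org/ontologies/tcs/3.0#"
NAME = "urn:lsid:authority:namespace:123"

# The three assertions from the example, as (subject, predicate, object).
TRIPLES = [
    (NAME, RDF_TYPE, TCS + "taxonomicName"),
    (NAME, TCS + "nameString", "Physeter catodon"),
    (NAME, TCS + "isBasionymOf", "urn:lsid:authority:namespace:256"),
]

# Predicates this hypothetical client already understands.
KNOWN_PREDICATES = {TCS + "nameString", TCS + "isBasionymOf"}


def describe(subject, triples):
    """Collect declared types and recognized properties for one subject."""
    types = [o for s, p, o in triples if s == subject and p == RDF_TYPE]
    known = {p: o for s, p, o in triples
             if s == subject and p in KNOWN_PREDICATES}
    return {"types": types, "known": known}


# Simulate the case where assertion 1 (the type) is missing.
UNTYPED = [t for t in TRIPLES if t[1] != RDF_TYPE]
```

Calling describe(NAME, UNTYPED) returns an empty list of types but still
reports the nameString and isBasionymOf values, because the predicates
alone are enough for the client to recognize them.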

I've tried to give a concrete example above of how it might work with
LSIDs and RDF.  It's a rather simple example and does not cover all
cases (such as whether urn:lsid:authority:namespace:256 should be
automatically resolved as part of the resolution of
urn:lsid:authority:namespace:123), but it's a start.
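
One possible answer to the automatic-resolution question is to let the
client follow referenced LSIDs only up to a configurable depth.  The
sketch below assumes an in-memory dictionary in place of a real
resolver; the identifiers ending in 300, the "ex:" predicates, and the
function names are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]

A = "urn:lsid:authority:namespace:123"
B = "urn:lsid:authority:namespace:256"
C = "urn:lsid:authority:namespace:300"

# Hypothetical stand-in for an LSID resolver; a real client would make a
# network call instead of a dictionary lookup.
STORE: Dict[str, List[Triple]] = {
    A: [(A, "ex:isBasionymOf", B)],
    B: [(B, "ex:relatedTo", C)],
    C: [(C, "ex:relatedTo", A)],  # cycle back to the start
}


def fetch(lsid: str) -> List[Triple]:
    return STORE.get(lsid, [])


def resolve_graph(start: str,
                  fetch: Callable[[str], List[Triple]],
                  max_depth: int = 1) -> Dict[str, List[Triple]]:
    """Resolve start, then follow LSID-valued objects up to max_depth hops."""
    seen: Dict[str, List[Triple]] = {}
    frontier = [(start, 0)]
    while frontier:
        lsid, depth = frontier.pop()
        if lsid in seen:
            continue  # already resolved; also guards against cycles
        triples = fetch(lsid)
        seen[lsid] = triples
        if depth >= max_depth:
            continue  # resolved this object but do not follow its links
        for _subject, _predicate, obj in triples:
            if obj.startswith("urn:lsid:") and obj not in seen:
                frontier.append((obj, depth + 1))
    return seen
```

With max_depth=0 only the requested object is resolved; with max_depth=1
the directly referenced object is fetched as well.  Leaving the depth to
the client keeps the policy decision out of the GUID infrastructure.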

RDF with LSIDs is not the only option, but I think it provides a great
deal of power and flexibility.  While it places a large burden on the
client software to be able to "understand" data objects, it also uses a
well defined, standard set of technologies to do so.  It makes it easier
to develop ontologies from the bottom up (which is quite hard to do with
XML Schema because they have to import each other and because of other
technological difficulties posed by substitution groups and other Schema
mechanisms) and allows the client the chance to do something with data
objects, even if it only partially understands them.

It's useful to discuss GUID mechanisms independently of how data objects
are represented, but at some point (in the client software) these issues
are coupled and must be dealt with.  Otherwise it's difficult to
envision how the choice of a GUID scheme will constrain the choice of a
data object representation system, and vice versa.

-Steve

>Thanks,
>
>Jessie