Hi Jessie & Ricardo,
As an implementer (not a domain scientist), I've been thinking about some of these same issues while designing and prototyping DiGIR 2. I'll place my comments in-line:
Kennedy, Jessie wrote:
Hi Ricardo
I think we should really have one GUID system for anything. For one, a single GUID system will favor integration within and across domains. For example, if you get a GUID somewhere, you don't need to figure out which GUID system you use to resolve it, you just resolve it within the one GUID system and get metadata describing the object. Can be a taxon concept, which can have GUIDs linking it to specimens and observation data, which in turn can be resolved using the same architecture.
I agree this is easier for us to implement - I think I mentioned that but I don't think that's always the right reason for doing something.
I agree with both of you. One reason for implementing a single GUID system is ease of resolution, but it's not the only reason. As I see it, the alternative, which provides multiple GUID systems, each obligated by a social contract to provide GUIDs that are restricted to a certain type of data object, does not really provide any guarantee that you know what type of object you will get when you resolve a GUID. Inside of a GUID system, it is quite difficult to enforce that a certain form of GUID will resolve to a certain type of data object. Even with a centralized service, problem cases can occur. People can publish GUIDs for the wrong type of object, either by mistake or through malice. However, even if this is prevented through technological means, there is another difficult case.
This case occurs when one part of the community needs to extend an existing schema or ontology. Even if they design a clean extension of an existing class whose instances can polymorphically appear to be instances of the existing base class, this new class of data objects will not have exactly the same type as the existing class. If the authority that is responsible for serving instances of the base class refuses to allow the extension, the portion of the community that requires it will fork and start their own authority despite the fact that the two classes of data object are related. If the central authority allows the extension, its implicit contract with clients that it serves only known types of objects will be broken.
So, it may always be the case, even with several authorities that attempt to serve only objects of specific types, that a piece of client software will have to gracefully handle data objects of unknown types. In my opinion, the best way to handle the problem is to place the burden of figuring out what type a data object has or what it means semantically on the client software, not on the GUID infrastructure. The client software will have to understand the data objects anyway, and the first step in this process is usually figuring out what type a data object has.
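To make the idea of "gracefully handling unknown types" a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not how DiGIR 2 does it: the class URIs, the SUBCLASS_OF table and the handler are all made up, and a data object is simplified to a plain dictionary. The client first looks for a handler for the declared type, then walks up to any base class it knows about, and finally falls back to a generic description.

```python
# Hypothetical class URIs; none of these come from a real ontology.
TAXON_NAME = "http://example.org/onto#TaxonName"
ANNOTATED_NAME = "http://example.org/onto#AnnotatedTaxonName"

# Subclass links the client has learned from the ontologies it understands.
SUBCLASS_OF = {ANNOTATED_NAME: TAXON_NAME}


def render_name(obj):
    return "name: " + obj.get("nameString", "?")


# Handlers exist only for the classes this client was written against.
HANDLERS = {TAXON_NAME: render_name}


def handle(obj):
    """Pick a handler by declared type, walking up to known base classes."""
    cls = obj.get("type")
    seen = set()
    while cls is not None and cls not in seen:
        seen.add(cls)
        if cls in HANDLERS:
            return HANDLERS[cls](obj)
        cls = SUBCLASS_OF.get(cls)
    # Unknown type: degrade gracefully instead of failing.
    props = sorted(k for k in obj if k != "type")
    return "unrecognized object with properties: " + ", ".join(props)


print(handle({"type": ANNOTATED_NAME, "nameString": "Physeter catodon"}))
print(handle({"type": "http://example.org/onto#Mystery", "colour": "grey"}))
```

The point of the sketch is that an instance of an extension class is still usable through its base class, and an instance of a completely unknown class does not break the client.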
Second, I tend to think that such a system will work better if the assigning authorities are decentralized, i.e. agents are free to assign GUIDs to anything.
The only issue here, and I don't think there is an easy answer, is the trade-off between easy issuing of GUIDs and the difficulty of knowing what you get back from them, what you can do with them and which GUIDs are of real value. I guess what you're saying is let the users decide that, even if they don't really know what the issues are. I'd like, for example, to decide that if IPNI issue TaxonNames GUIDs then I can decide wholesale to accept that they've done a good job and know they are names, know they are not concepts, know what I will get and therefore what I can do with them. I don't really want to have to figure that out somehow at run-time.
I think that you bring up several different issues here. In my mind the issue of what kind of data object you get back when you resolve a GUID (and therefore what you can do with it) is orthogonal to the issues of data quality and trust. The first issue can be solved with technology, the second issue can be addressed with technology (but never truly solved) and the third depends on human intervention.
The most easily addressable issue is knowing how to understand what you get back when you resolve a GUID. My only comment here is that it's likely that the end user will not be dealing directly with the resolved data objects. Instead I imagine that they'll be working with a piece of client software that understands a set of ontologies (and can perhaps dynamically fetch new ones). This client software will have to understand the data objects semantically and give the user some options on how to resolve GUIDs that are referenced in a resolved data object, how to merge many objects into a single graph, and how to display or output these data objects in a form that is useful.
This has some serious implications for what type of data representation system we use. Building a client like this that can deal with data objects described with XML Schema would be difficult (perhaps even impossible). I agree with Roderic Page that RDF makes the problem easier to solve. While this is supposed to be a discussion of GUIDs, you can't cleanly separate the GUID mechanism from the data representation system (because design decisions taken in one system constrain the design of the other).
The second issue, whether or not a resolved data object is of real value can be addressed in several ways. First it needs to be syntactically valid and semantically meaningful to be of real use. Syntactic validity is easy to test but depends on how we decide to represent data objects. The constraint that data objects be semantically meaningful can be partially addressed if we use real ontologies that can be reasoned over, but depends in part upon how crazy we get with our ontologies. It is possible to "semantically validate" data objects against an ontology and this service should be part of client software.
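As a very rough illustration of what "semantic validation" could look like in client software, here is a Python sketch. It assumes a data object arrives as a list of (subject, predicate, object) statements, and it replaces a real ontology with a hand-written ALLOWED table, which is purely hypothetical. It is a crude closed-world check, far weaker than reasoning over an OWL ontology, and the TCS property names are illustrative rather than actual TCS.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TCS = "http://tdwg.org/ontologies/tcs/3.0#"  # illustrative namespace

# Hypothetical stand-in for what an ontology would declare:
# class URI -> property URIs expected on instances of that class.
ALLOWED = {
    TCS + "taxonomicName": {TCS + "nameString", TCS + "isBasionymOf"},
}


def validate(triples):
    """Return a list of problems found in the statements about one object."""
    types = [o for (_s, p, o) in triples if p == RDF_TYPE]
    if not types:
        return ["no rdf:type assertion"]
    problems = []
    allowed = set()
    for t in types:
        if t not in ALLOWED:
            problems.append("unknown class: " + t)
        allowed |= ALLOWED.get(t, set())
    for (_s, p, _o) in triples:
        if p != RDF_TYPE and p not in allowed:
            problems.append("property not declared for class: " + p)
    return problems


subject = "urn:lsid:authority:namespace:123"
good = [
    (subject, RDF_TYPE, TCS + "taxonomicName"),
    (subject, TCS + "nameString", "Physeter catodon"),
]
print(validate(good))
print(validate(good + [(subject, TCS + "colour", "grey")]))
```

An empty list means no problems were found under these simplified rules; it says nothing about whether the data is actually correct.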
The third issue, trust, is interesting. My feeling is that there will be two different ways that an end user can get data objects and that trust will work differently for the two methods. The first method will be by using some kind of portal or search engine. If a domain scientist is interested in finding some taxonomic concepts so she can use them in her work, she's not going to start by using a GUID resolution system. Instead, she'll work with a portal or piece of client software that has its own user interface or protocol and perhaps some algorithm for selecting the most likely concepts that match some search constraints. She will query this service and examine the results to select one or more taxonomic concepts. From that point on, she can refer to them in her work using GUIDs. In this case trust is evaluated by her during her selection of a service to query and her selection of the specific taxon concepts that seem to make sense.
The second method of working with GUIDs occurs when a domain scientist resolves a GUID to get back an object (through the use of client software). In this case she can decide to trust the result based not just on well formedness and whether or not the data object seems to make sense, but also based on who created the data object. Information on the ownership of data objects could be represented in the metadata about those objects or it could be represented in the GUID itself (as in the authority part of an LSID). But if there is only one central issuing authority and metadata about creator or owner is not stored with each object, then it's hard to resolve the trust issue.
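For the case where ownership is carried in the GUID itself, here is a small Python sketch of pulling the authority out of an LSID and checking it against a list of authorities the user has chosen to trust. It relies on the LSID layout urn:lsid:authority:namespace:object with an optional trailing revision. The trusted authority name and the is_trusted helper are invented for the example.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID into its parts, raising ValueError if it is malformed."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("not an LSID: " + text)
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: " + text)
    if any(p == "" for p in parts[2:5]):
        raise ValueError("LSID has an empty component: " + text)
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


# Hypothetical list of authorities this user has decided to trust.
TRUSTED_AUTHORITIES = {"names.example.org"}


def is_trusted(text):
    return parse_lsid(text).authority.lower() in TRUSTED_AUTHORITIES


print(parse_lsid("urn:lsid:authority:namespace:123"))
print(is_trusted("urn:lsid:names.example.org:names:42"))
```

This only tells you who issued the identifier. Whether that issuer is worth trusting is still a human decision.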
What controls the chaos in this case are the metadata services. You can issue GUIDs right and left, but then what will add value to them is how you cross link them and maybe more importantly, how the community uses the objects you issued GUIDs for (again this relates to cross linking objects).
I can understand this to some extent but (possibly because of my ignorance) can't quite see how this will work. I'm a little confused about what, say, IPNI or TROPICOS would do here. I'm imagining they would issue GUIDs for what they have in their databases, but how would this work, and how and who would cross link their GUIDs? What would they issue GUIDs for? How would we refer to their concepts or names? I'm not sure from this proposal what we would get returned. TCS relies on lots of cross linking but all of the cross linking for one concept is encapsulated within the one concept (and thereby GUID). So for example we differentiate between relationships which were mentioned in the definition of that concept between the concept and other concepts, and relationships by others regarding that concept which are opinions of others and not part of the definition of the concept as defined. How would we do that?
Third, if you want, you still can have a centralized metadata service that works like an authority saying which taxon GUIDs are important. I just don't think you can enforce data correctness at the issuing authority level.
Yes if I was clear about what I was getting if I resolved a taxon GUID. I'm still not entirely clear what we're getting back. I agree you can't enforce correctness at the issuing authority level but I think you can set up agreements as to what your protocol for publishing is and what you will do with your objects that you have assigned GUIDs to etc.
Come to think of it, can anyone tell me what we are planning on providing here? If anyone can issue GUIDs (I guess using their own GUID server) for any taxon thing, or really anything that anybody wants to consider a taxon thing, what "extra" value are we talking about providing?
Regarding your comment that computers can't handle metadata, I think we need to explore that a little bit more with examples. Could some of us in the group present more concrete examples on how something like that would be done? I think that if we define one or more ontologies (i.e. a classification scheme for things getting GUIDs) computers can then say: I know what class this object belongs to, it's a NameUsage, or a TissueFromSpecimen, and act upon it. If the system doesn't know what kind of object that is, it can just ignore it or try to render it using an XSLT stylesheet that can be used to display that content to the user (if that's available).
I guess what I was trying to get at here is: does the metadata tell me what kind of object I'm getting, or does it contain the data (or at least some of the data) that I've been thinking of as the values for the concept? I'm confused about that after reading some of the posts.
I don't want to stop us progressing but I want to be sure that where we're going is going to add some value to what is out there already.
In my mind, GUIDs are simply pointers to data objects. The agreements should be about how data objects are represented in the community. In other words, what ontologies are used and how these ontologies model the semantic properties of data objects and the relationships between different types of data objects. How it works exactly depends upon what GUID scheme we select and how we decide to represent knowledge in the community.
Here's an example of one way of doing things. I apologize in advance for going into depth, but I think it's required in order to illustrate what I'm talking about.
If we select LSIDs as the GUID mechanism and RDF-Schema or OWL ontologies as the knowledge representation layer, then LSIDs will resolve to chunks of RDF (metadata, since the data portion of LSIDs should be reserved for holding complex binary objects like images or sound files that can't be "understood"). Each chunk of RDF represents one or more data objects that should be instances of some ontological class. Because every data object, no matter what it is, is represented in the same way (RDF), the client software only has to parse one format (and does not have to know about countless XML Schemas and how to parse and understand instance documents from them). There are already many APIs available for working with RDF (we use Jena http://jena.sourceforge.net in DiGIR2).
Inside a chunk of RDF, a single data object is identified by a resource URI (in our case this would be the LSID URN). The representation of a data object takes the form of a series of assertions about that object. These assertions are statements in the form of subject, predicate, and object. So, if the data object represents a taxonomic name, that name's LSID is the subject of any assertion about it. The predicate would be some URI that uniquely identifies a property of the class that the data object belongs to, for instance its name or its relationship to another data object. Finally, the object of the statement is some value. What makes RDF powerful and flexible is that objects (values) can be other resources. So, it's possible to represent a data object like this:
Assertion 1:
  Subject:   urn:lsid:authority:namespace:123
  Predicate: http://www.w3.org/1999/02/22-rdf-syntax-ns#type
  Object:    http://tdwg.org/ontologies/tcs/3.0#taxonomicName

Which says that the data object identified by the LSID urn:lsid:authority:namespace:123 is an instance of the class taxonomicName, which is defined in the OWL ontology whose namespace is http://tdwg.org/ontologies/tcs/3.0

Assertion 2:
  Subject:   urn:lsid:authority:namespace:123
  Predicate: http://tdwg.org/ontologies/tcs/3.0#nameString
  Object:    "Physeter catodon"

Which says that the object identified by the LSID urn:lsid:authority:namespace:123 has a property called nameString (which is defined in the OWL ontology with the namespace http://tdwg.org/ontologies/tcs/3.0) and its value is set to the string "Physeter catodon" (Forgive me if this is not correct TCS, I'm trying to keep it simple).

Assertion 3:
  Subject:   urn:lsid:authority:namespace:123
  Predicate: http://tdwg.org/ontologies/tcs/3.0#isBasionymOf
  Object:    urn:lsid:authority:namespace:256

Which says that the object identified by the LSID urn:lsid:authority:namespace:123 has a property called isBasionymOf (which is defined in the OWL ontology with the namespace http://tdwg.org/ontologies/tcs/3.0) and its value is set to the data object identified by the LSID urn:lsid:authority:namespace:256.
So, as soon as you resolve urn:lsid:authority:namespace:123, you know that it is a taxonomic name, that it has some properties (like nameString) and that it is related to some other data object in a well defined way. The beautiful thing is that even if the piece of software that has to resolve and understand an LSID doesn't know what type of data object it is (this would be the case if assertion 1 was missing), it can still know some things about it and how it relates to other things (because it would recognize the predicates). It is also possible for the client software to fetch the original ontology and make further inferences from it.
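For those who prefer code, the three assertions can be written as plain (subject, predicate, object) tuples. The Python sketch below is deliberately library-free (a real client would use an RDF API such as Jena), and the KNOWN_PREDICATES table and the describe function are hypothetical. It shows that a client which recognizes the predicates can still say something useful when the type assertion is absent.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TCS = "http://tdwg.org/ontologies/tcs/3.0#"
SUBJECT = "urn:lsid:authority:namespace:123"

# The three assertions from the example, one tuple per statement.
triples = [
    (SUBJECT, RDF_TYPE, TCS + "taxonomicName"),
    (SUBJECT, TCS + "nameString", "Physeter catodon"),
    (SUBJECT, TCS + "isBasionymOf", "urn:lsid:authority:namespace:256"),
]

# Predicates this hypothetical client knows how to interpret.
KNOWN_PREDICATES = {
    TCS + "nameString": "name string",
    TCS + "isBasionymOf": "is basionym of",
}


def describe(statements, subject):
    """Return (declared type or None, list of understood facts) for a subject."""
    declared = None
    facts = []
    for s, p, o in statements:
        if s != subject:
            continue
        if p == RDF_TYPE:
            declared = o
        elif p in KNOWN_PREDICATES:
            facts.append((KNOWN_PREDICATES[p], o))
    return declared, facts


print(describe(triples, SUBJECT))
# Drop assertion 1: the type is unknown, but both properties are still understood.
print(describe(triples[1:], SUBJECT))
```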
I've tried to give a concrete example above of how it might work with LSIDs and RDF. It's a rather simple example and does not cover all cases (such as whether urn:lsid:authority:namespace:256 should be automatically resolved as part of the resolution of urn:lsid:authority:namespace:123), but it's a start.
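The open question of whether referenced LSIDs get resolved automatically could be left to the client as a policy. Here is a Python sketch of one such policy, under heavy assumptions: resolution is faked with a dictionary lookup (FAKE_STORE and fake_resolve are stand-ins for a real LSID resolver), the predicate names are shortened placeholders, and the client follows LSID-valued objects up to a configurable number of hops, merging everything into one set of statements.

```python
def is_lsid(value):
    return isinstance(value, str) and value.lower().startswith("urn:lsid:")


def resolve_graph(start, resolve, max_depth=1):
    """Merge the statements for start and for any LSIDs it references,
    following references for at most max_depth hops."""
    graph = set()
    seen = set()
    frontier = [start]
    for _hop in range(max_depth + 1):
        next_frontier = []
        for lsid in frontier:
            if lsid in seen:
                continue  # guards against cycles between objects
            seen.add(lsid)
            for triple in resolve(lsid):
                graph.add(triple)
                if is_lsid(triple[2]):
                    next_frontier.append(triple[2])
        frontier = next_frontier
    return graph


A = "urn:lsid:authority:namespace:123"
B = "urn:lsid:authority:namespace:256"

# Stand-in for the network: LSID -> statements returned on resolution.
FAKE_STORE = {
    A: [(A, "ex:isBasionymOf", B)],
    B: [(B, "ex:nameString", "placeholder name"), (B, "ex:seeAlso", A)],
}


def fake_resolve(lsid):
    return FAKE_STORE.get(lsid, [])


print(len(resolve_graph(A, fake_resolve, max_depth=0)))
print(len(resolve_graph(A, fake_resolve, max_depth=1)))
```

With max_depth=0 only the requested object is fetched. Larger values trade extra network round trips for a more complete graph.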
RDF with LSIDs is not the only option, but I think it provides a great deal of power and flexibility. While it places a large burden on the client software to be able to "understand" data objects, it also uses a well defined, standard set of technologies to do so. It makes it easier to develop ontologies from the bottom up (which is quite hard to do with XML Schemas because they have to import each other and because of other technological difficulties posed by substitution groups and other Schema mechanisms) and gives the client the chance to do something with data objects, even if it only partially understands them.
It's useful to discuss GUID mechanisms independently of how data objects are represented, but at some point (in the client software) these issues are coupled and must be dealt with. Otherwise it's difficult to envision how the choice of a GUID scheme will constrain the choice of a data object representation system, and vice versa.
-Steve
Thanks,
Jessie