[Tdwg-tag] Primary Objects as XML Structures or OWL Classes

Mon Feb 20 19:57:39 CET 2006

I agree with Bob that our data model specifications should be decoupled 
from possible representation schemes.  In my opinion, these 
specifications should take the form of UML static structures with 
accompanying explanatory documents.  The use of BNF grammars is a good 
idea, but I worry that they might become difficult to manage as they 
grow and that non-CS people in the community would find it difficult to 
understand them.

I also think that the technical architecture group should not be 
concerned with the data models themselves.  Instead we have to worry 
about how to map existing data sets into the shared models, how to link 
instances of different models together, how to locate one or more data 
objects that meet certain criteria, how to merge collections of data 
objects from one or more models, how to visualize trees or graphs of 
data objects, how to serialize and deserialize data objects into 
different representations, etc.  In short, we have to design a network 
of services that allow us to work with data objects and collections of data 
objects in a fairly generic fashion and leave the actual creation of the 
models up to the subject matter experts (though we might supply a bit of 
KR advice).

These services and processes will also require documentation and might 
be specified with the same combination of UML (sequence or activity 
diagrams) and explanatory documentation.  I think all of us agree that 
these ought to be designed in a language-independent manner and be built 
upon a small stack of existing technologies like HTTP and XML for 
message transport.

At some point though we have to agree on a representation format.  If 
we're talking about building a set of distributed services that will 
allow us to locate, acquire, and work with biodiversity data, then I 
think we need to propose an architecture that has a few fixed points, 
one of which should be representation format.  I for one don't want to 
have to design tools that can ingest both XML Schema instances and RDF 
described by ontologies.

The representation format we select ought to be flexible enough to 
accommodate the data models described by the subject matter experts.  It 
should also minimize the burden on the software engineers and developers 
that have to design and maintain the processes, tools, and services that 
satisfy the above use cases (mapping, serializing/deserializing, 
searching, merging, visualizing, etc.).  Ideally the representation 
format should allow us to choose from a collection of existing tools and 
frameworks to use while implementing (because no one has the time or 
money to create all this from scratch).  This means that we need to 
evaluate each candidate representation format with the above use cases 
in mind.  Every representation scheme (RDF, XML, Java classes, etc.) has 
its strengths and weaknesses and this process of talking about each use 
case with respect to implementing it over each of the representation 
formats will allow us to better understand the trade-offs of selecting 
one format over another.

Here's an example of the types of discussions I'd like to see from TAG: 

Portions of the TCS data model describe specimens, publications, and 
other things that are not names or concepts.  In a perfect world, TCS 
would not define its own data model for specimens but would instead use 
an existing model designed by the curators of collections (perhaps with 
Darwin Core as a starting point).  The same is true for publications.  
Instances of TCS should then use GUIDs to point to instances of Specimen 
and Publication. 

Now, imagine a hypothetical system that provides for visualization of 
TCS, a Taxon Concept Browser that allows researchers to search for and 
view TCS instances in order to select a set of concepts to use in their 
own work.  At some point, this system will have to have an in-memory 
graph of data objects of different types including TCS, Specimen, and 
Publication.  This graph could be constructed in a variety of ways, but 
the most likely method will start by parsing a TCS document that 
contains several taxon concept instances.  Each instance will be 
examined for references to other objects named by LSID.  Each of these 
LSIDs will be resolved (with care so as to not create cycles), resulting 
in a chunk of serialized data which will be turned into an in-memory 
instance and inserted into the graph.  This is at heart the merge case I 
was talking about above.
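
A minimal sketch of the graph-building step described above, assuming 
each resolved object can be reduced to a type name plus the list of 
LSIDs it refers to.  The `resolve` callable, the record shape, and the 
example.org LSIDs are all hypothetical stand-ins; real LSID resolution 
would involve network calls and deserialization.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical record shape: (object type, LSIDs this object refers to).
Record = Tuple[str, List[str]]


def build_graph(
    roots: Iterable[str],
    resolve: Callable[[str], Record],
) -> Dict[str, Record]:
    """Resolve every LSID reachable from `roots` exactly once."""
    graph: Dict[str, Record] = {}
    queue = deque(roots)
    while queue:
        lsid = queue.popleft()
        if lsid in graph:
            # Already resolved; skipping it is what prevents cycles.
            continue
        record = resolve(lsid)
        graph[lsid] = record
        queue.extend(record[1])
    return graph


# Toy in-memory store standing in for a network of LSID authorities.
CONCEPT = "urn:lsid:example.org:concepts:1"
SPECIMEN = "urn:lsid:example.org:specimens:7"
PUBLICATION = "urn:lsid:example.org:pubs:3"

FAKE_STORE: Dict[str, Record] = {
    CONCEPT: ("TaxonConcept", [SPECIMEN, PUBLICATION]),
    SPECIMEN: ("Specimen", [CONCEPT]),  # points back at the concept
    PUBLICATION: ("Publication", []),
}

graph = build_graph([CONCEPT], FAKE_STORE.__getitem__)
```

Each LSID is resolved at most once, so the back-reference from the 
specimen to the concept does not cause an endless loop.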

So, to satisfy this case with XML Schema, TCS's specimen element might 
be implemented with a simpleType that has a restriction base of string 
and a pattern match constraint designed to allow validation of LSID 
URNs.  This effectively decouples Specimen from TCS and allows instances 
of them to refer to each other without having to import each other's 
schema.  The Taxon Concept Browser's instance graph composer would know 
about the schemas for TCS and Specimen.  It would take the resolved 
chunk of XML from specimen LSIDs and attempt to deserialize (unmarshall) 
it before merging it into the graph that will be visualized.
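
To make the pattern-match idea concrete, here is a rough sketch of the 
kind of check such a constraint would perform, written as a Python 
regular expression rather than as an XML Schema facet.  The pattern is 
an illustrative assumption about the general shape of an LSID URN 
(authority, namespace, object, optional revision); it is not the 
normative LSID grammar.

```python
import re

# Assumed shape: urn:lsid:<authority>:<namespace>:<object>[:<revision>]
# The character classes are a simplification chosen for illustration.
LSID_PATTERN = re.compile(
    r"urn:lsid:[A-Za-z0-9.\-]+:[A-Za-z0-9_.\-]+:[A-Za-z0-9_.\-]+"
    r"(:[A-Za-z0-9_.\-]+)?",
    re.IGNORECASE,
)


def is_lsid(value: str) -> bool:
    """Return True if `value` looks like an LSID URN under the assumed shape."""
    return LSID_PATTERN.fullmatch(value) is not None
```

A check like this only says that a string looks like an LSID; it says 
nothing about whether the LSID resolves or what type of object it 
names.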

At first glance this appears to work fairly well, however there are a 
few issues with this design.  First, it precludes the direct embedding 
of Specimen instances in TCS instances.  There are many reasons why one 
might want to do this.  One reason is to avoid unnecessary LSID 
resolution calls (which add latency) in the case where specimen and 
taxon concept objects are coming from the same server.  Another reason 
for embedding specimen instances in taxon concept instances is to make 
things easier on a user who might want to download an entire taxon 
concept graph to their local machine for processing by a desktop 
application.  Without embedding they may be forced to download several 
different files.  We could fix this by changing the definition of the 
specimen element in TCS so it can be either a subtree or an LSID, but 
then the TCS schema would have to import specimen and vice versa.  This 
is of course impossible so we'd have to do one of three things: make the 
TCS specimen element xsd:any, derive both TCS and Specimen from the same 
base XML Schema which minimally defines an LSID element, or design a 
complicated scheme for embedding both instances and schema (akin to how 
WFS works) in a single instance document.  Each of these has its own 
drawbacks: xsd:any makes it difficult or impossible to use most 
XML-to-Object binding tools; schema inheritance is difficult, can be 
accomplished only with a social agreement between everyone in the 
community, and allows for only a weak form of validation; and the 
embedded-schema approach is burdensome to developers.
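
As a rough illustration of what the "either a subtree or an LSID" 
option means for consuming code, the sketch below reads a specimen 
slot that may hold either form.  The element names (`concept`, 
`specimen`, `Specimen`, `CatalogNumber`) are made up for the example 
and are not taken from the TCS schema.

```python
import xml.etree.ElementTree as ET
from typing import Tuple, Union


def read_specimen_slot(element: ET.Element) -> Tuple[str, Union[str, ET.Element]]:
    """Classify a specimen slot as an embedded subtree or an LSID reference."""
    if len(element) > 0:
        # The slot has child elements: treat the first one as an embedded instance.
        return ("embedded", element[0])
    # No children: treat the text content as an LSID to be resolved later.
    return ("reference", (element.text or "").strip())


doc = ET.fromstring(
    "<concept>"
    "<specimen>urn:lsid:example.org:specimens:7</specimen>"
    "<specimen><Specimen><CatalogNumber>42</CatalogNumber></Specimen></specimen>"
    "</concept>"
)
slots = [read_specimen_slot(s) for s in doc.findall("specimen")]
```

Hand-written code can branch like this easily; the difficulty 
described above is that schema-driven binding tools generally cannot 
generate it from an untyped slot.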

An additional problem with the XML solution is that it is relatively 
brittle when it comes to change in data models over time.  Darwin Core 
has a high adoption rate because it is very simple.  However, the 
simplicity that drove its adoption also encouraged different parts of 
the community to customize it to fit their needs; I know of at least 
three variants of Darwin Core in common use and there could be many more 
that I haven't encountered.  Some of these variants were declared in 
their own namespaces, but others were not.  This has made it quite 
difficult to write code that can ingest all variants of Darwin Core to 
extract even the minimal set of common elements such as ScientificName.  
Finally it is not possible to validate most variants of Darwin Core (for 
a variety of reasons).  This makes them poor candidates for 
XML-to-Object binding tools.  In our example above, if a new variant of 
the Specimen Schema were introduced, then the XML-to-Object binding code 
that backs the deserialization of Specimen instances into the graph 
would most likely not be able to handle the new version.  So, in order 
to use the Taxon Concept Browser in a heterogeneous network that has 
more than one version of the Specimen schema, even if the goal is simply 
to display the minimal set of elements common to each version, we would 
have to release a new version of the tool every time we deploy a new Schema.
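
For comparison, here is a sketch of the kind of defensive, 
schema-ignoring code one ends up writing to pull a common element out 
of several variants.  It matches on the local element name and ignores 
namespaces entirely; the two sample documents and their namespace URI 
are invented for the example and do not correspond to any real Darwin 
Core variant.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def local_name(tag: str) -> str:
    """Strip ElementTree's '{namespace}' prefix from a tag, if present."""
    return tag.rsplit("}", 1)[-1]


def find_scientific_name(xml_text: str) -> Optional[str]:
    """Return the first non-empty ScientificName, whatever its namespace."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) == "ScientificName" and element.text:
            return element.text.strip()
    return None


# Two invented variants: one namespaced, one not, with different wrappers.
variant_a = (
    '<record xmlns="http://example.org/dwc-a">'
    "<ScientificName>Puma concolor</ScientificName>"
    "</record>"
)
variant_b = (
    "<rec><Extra>x</Extra>"
    "<ScientificName> Puma concolor </ScientificName></rec>"
)
```

This works for the minimal common set, but it gives up validation and 
typed binding altogether, which is the trade-off at issue here.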

That's not to say that RDF will solve all the problems.  While it might 
make the design of flexible, modular data models and the software that 
uses them a bit easier, no one has ever proved that it will scale.  
Additionally there is the temptation with RDF to catch what I call 
Ontology Fever.  In terminal cases, this disease results in an obsession 
with using OWL Full to model the entire universe reductively at the 
level of the laws of physics.  Any distributed data network afflicted by 
this disease is destined to die.  That's why I prefer RDF-Schema to OWL 
(though I think OWL may end up eventually playing some role if we move 
towards RDF).  For the same reason I think the primary use case is not 
inference over OWL-described RDF, but search over flexible RDF-Schema 
described data models.  I personally think that RDF might make some use 
cases, especially the merge case, easier to handle.  So I'd like to see 
further discussion of the use cases above for both XML Schema and RDF.
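
One way to see why the merge case might be easier: if data objects are 
held as sets of (subject, predicate, object) triples, merging two 
descriptions is a set union.  The sketch below uses plain Python 
tuples rather than an RDF library, ignores blank nodes, and uses 
made-up `ex:` property names, so it shows only the idea.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def merge(*graphs: Set[Triple]) -> Set[Triple]:
    """Merge any number of triple sets; duplicate statements collapse."""
    merged: Set[Triple] = set()
    for graph in graphs:
        merged |= graph
    return merged


def objects(graph: Set[Triple], subject: str, predicate: str) -> Set[str]:
    """Return every object asserted for the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


CONCEPT = "urn:lsid:example.org:concepts:1"
SPECIMEN = "urn:lsid:example.org:specimens:7"

# Hypothetical statements from two different providers.
concept_graph = {(CONCEPT, "ex:hasVoucher", SPECIMEN)}
specimen_graph = {
    (SPECIMEN, "ex:catalogNumber", "42"),
    (CONCEPT, "ex:hasVoucher", SPECIMEN),  # same statement, repeated
}

merged = merge(concept_graph, specimen_graph)
```

Because the specimen is identified by the same LSID in both sets, no 
schema import or embedding rule is needed to join the two 
descriptions.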

In summary, the design of our shared data models is more a social 
process than a technical one and I agree with Bob that it should be 
carried out using a representation-agnostic modeling language.  The 
technically difficult bit is designing the network of services that will 
allow one to use the data models.  We have an intuitive idea of what the 
use cases are for such a system, but I'd like to see more discussion on 
that topic.  Roger has started this off nicely by considering the 
differences between resolution and search but I'd like to continue the 
discussion into the other use cases like merging, visualization, etc.  
TAG seems like the best place to do so.

-Steve

Roger Hyam wrote:

>
> Hi Bob,
>
>> I'm rushing off to the GISIN meeting at AGADIR and might not have 
>> much time to respond more before midweek, or maybe even until I get 
>> back next week, but:
>>
>> 0.  I _wish_ this discussion were taking place in a wiki, with RSS or 
>> email notification,  so it is easier to follow if you cannot keep up 
>> with the email
>
> The way I was planning on running the TAG discussions was to have 
> 'discussions' on the mailing list and summarize them to the wiki. The 
> motivation behind this is to work towards the wiki being a readable 
> document for the uninitiated. It should not be necessary for someone 
> coming new to a field to have to read all the discussions that have 
> taken place to reach a conclusion. These discussions should be 
> available but it is the job of an editor/facilitator to create a 
> readable narrative from possibly wandering dialog.
>
> The wiki is here: http://www.tdwg.hyam.net/twiki/bin/view/TAG
>
> The URL will change at some point in the next few months but I will 
> make sure all URLs forward to the appropriate place on the new server. 
> There is no RSS feed on it at present; I'll see about setting one up 
> either now or when we move it to the main server.
>
> The mailing list archive is here: 
> http://lists.tdwg.org/pipermail/tdwg-tag_lists.tdwg.org/ so any thread 
> can be followed and resurrected at any time.
>
> I take on board what you are saying though and will try and create 
> links between the wiki and list archive.
>
>> 1.  I don't think specifications of high level things like "objects" 
>> should be done in a serialization constraint language such as RDF or XML 
>> Schema. Instead, it should have something more general as the 
>> normative definition and have _representation_ in one or more of such 
>> constraint languages. This is the mechanism of W3C usually. Many 
>> (Most?) W3C standards have a normative BNF definition, and one or 
>> more representations to allow implementers to actually do business.  
>> OMG favors UML for this, etc.  There is nothing inherently normative 
>> about, say RDF or XML Schema, for, say TaxonConcepts. If you take the 
>> serialization language as the normative language, then in the future 
>> you just end up having to support several serialization languages 
>> when you find you want to extend your specification with something 
>> for which the chosen one is insufficiently expressive. This, in fact, 
>> is what is going on now with the cries for RDF over XML Schema. Put 
>> another way, if you choose language L as the normative language, you 
>> are not building a specification, but rather a set of constraints on 
>> applications written in L. Such things do not have as long a life as 
>> actual specifications do and mature standards bodies do not seem to 
>> use serialization languages as the root specification language, as 
>> far as I can tell.  My conclusion is that specifications should not 
>> be in anything like RDF or XML Schema, but in something else---BNF is 
>> probably adequate for most TDWG standards---with working subgroups 
>> responsible for publishing a serialization definition implementing 
>> the standard in languages useful for one or another purpose, e.g. 
>> LSID resolution.
>>
> Yes I think you are right. We should be specifying our objects in a 
> high level 'language' like UML (not so sure about BNF but I am not so 
> familiar with it) . There has been talk about OWL Lite as a subset of 
> UML. This was actually the next topic I was going to suggest and I'll 
> kick of  a thread on it soon if no one else does.
>
> Can I take it from your reply that you think:
>
>    1. There should be commonality between all TDWG 'objects' and that
>       that commonality should be their specification in UML/BNF/Other
>       technology? (Yes to my question 1).
>    2. There should be alternative ways to serialize these objects.
>       Some of the serialization may support different aspects of the
>       objects (Yes to my question 2).
>    3. XML Schema or RDF/S are not appropriate ways to define such objects
>
> Have I read this correctly?
>
> Roger
>
>
>> Bob
>>
>>
>>
>>
>> On 2/17/06, *Roger Hyam* <roger at tdwg.org> wrote:
>>
>>
>>     Hi All,
>>
>>     In a previous post I suggested definitions for Resolving,
>>     Searching and Querying from the point of view of the TAG. There
>>     has been a muted response which I take as meaning there aren't
>>     any strong objections to these definitions. We can come back to
>>     them later if need be. You can read the post here if you missed it:
>>
>>     http://lists.tdwg.org/pipermail/tdwg-tag_lists.tdwg.org/2006-February/000009.html
>>
>>     I'd like to look at the implications of the first two definitions:
>>
>>   1. *Resolving.* This means to convert a pointer into a data object. 
>>      Examples would be to resolve an LSID and get back either data or
>>      metadata or resolve a url and get back a web page in html.
>>
>>   2. *Searching.* This means to select a set of objects (or their
>>      proxies) on the basis of the values of their properties. The
>>      objects are  predefined (implicitly part of the call) and we are
>>      simply looking for them. An example would be finding pages on Google.
>>
>>    
>>
>>     Both these definitions imply the existence of data 'Objects' or
>>     'Structures' that are understood by the clients when they are
>>     received. The kinds of objects that jump to mind are Specimens,
>>     TaxonNames, TaxonConcepts, NaturalCollections, Collectors,
>>     Publications, People, Expeditions, etc. A piece of client
>>     software should be able to know what to do with an object when it
>>     gets it: how to display it to the user, map it to a db, etc.
>>
>>     My two leading questions are:
>>
>>        1. *Should there be commonality to all the objects?* If yes -
>>           what should it be? XML Schema location or OWL Class or
>>           something else? If no - then how should clients handle new
>>           objects dynamically - or shouldn't they be doing that kind
>>           of thing.
>>        2. *Should we have multiple ways of representing the SAME
>>           objects?* e.g. Should there be only one way to encode a
>>           Specimen or should it be possible to have several encodings
>>           running in parallel. If there is only one way how do we
>>           handle upgrades (where we have to run two types of encoding
>>           together during the roll out of the new one) AND how do we
>>           reach consensus on the 'perfect' way of encoding each and
>>           every object in our domain?
>>
>>     The answers I have for my leading questions are:
>>
>>        1. Yes - We should have some commonality between objects or it
>>           will be really difficult to write client code - but what
>>           that commonality is has to be decided.
>>        2. Yes - The architecture has to handle multiple versions/ways
>>           of encoding any particular object type because any one
>>           version is not likely to be ideal for everyone forever.
>>
>>     Are the two conclusions I come to here reasonable? Is this too
>>     high level and not making any sense?
>>
>>     I'd be grateful for your thoughts on this,
>>
>>     Roger
>>
>>
>>-- 
>>
>>-------------------------------------
>> Roger Hyam
>> Technical Architect
>> Taxonomic Databases Working Group
>>-------------------------------------
>> 
>>http://www.tdwg.org
>> roger at tdwg.org
>> +44 1578 722782
>>-------------------------------------
>>    
>>
>>
>>     _______________________________________________
>>     Tdwg-tag mailing list
>>     Tdwg-tag at lists.tdwg.org
>>     http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org
>>
>>
>>
>
>