I agree with Bob that our data model specifications should be decoupled from possible representation schemes. In my opinion, these specifications should take the form of UML static structures with accompanying explanatory documents. The use of BNF grammars is a good idea, but I worry that they might become difficult to manage as they grow and that non-CS people in the community would find them hard to understand.
I also think that the technical architecture group should not be concerned with the data models themselves. Instead we have to worry about how to map existing data sets into the shared models, how to link instances of different models together, how to locate one or more data objects that meet certain criteria, how to merge collections of data objects from one or more models, how to visualize trees or graphs of data objects, how to serialize and deserialize data objects into different representations, etc. In short, we have to design a network of services that allow us to work with data objects and collections of data objects in a fairly generic fashion and leave the actual creation of the models up to the subject matter experts (though we might supply a bit of KR advice).
These services and processes will also require documentation and might be specified with the same combination of UML (sequence or activity diagrams) and explanatory documentation. I think all of us agree that these ought to be designed in a language-independent manner and be built upon a small stack of existing technologies like HTTP and XML for message transport.
At some point, though, we have to agree on a representation format. If we're talking about building a set of distributed services that will allow us to locate, acquire, and work with biodiversity data, then I think we need to propose an architecture that has a few fixed points, one of which should be the representation format. I for one don't want to have to design tools that can ingest both XML Schema instances and RDF described by ontologies.
The representation format we select ought to be flexible enough to accommodate the data models described by the subject matter experts. It should also minimize the burden on the software engineers and developers who have to design and maintain the processes, tools, and services that satisfy the above use cases (mapping, serializing/deserializing, searching, merging, visualizing, etc.). Ideally the representation format should allow us to choose from a collection of existing tools and frameworks when implementing (because no one has the time or money to create all this from scratch). This means that we need to evaluate each candidate representation format with the above use cases in mind. Every representation scheme (RDF, XML, Java classes, etc.) has its strengths and weaknesses, and talking through each use case with respect to implementing it over each of the representation formats will allow us to better understand the trade-offs of selecting one format over another.
Here's an example of the types of discussions I'd like to see from TAG:
Portions of the TCS data model describe specimens, publications, and other things that are not names or concepts. In a perfect world, TCS would not define its own data model for specimens but would instead use an existing model designed by the curators of collections (perhaps with Darwin Core as a starting point). The same is true for publications. Instances of TCS should then use GUIDs to point to instances of Specimen and Publication.
Now, imagine a hypothetical system that provides for visualization of TCS: a Taxon Concept Browser that allows researchers to search for and view TCS instances in order to select a set of concepts to use in their own work. At some point, this system will need an in-memory graph of data objects of different types, including TCS, Specimen, and Publication. This graph could be constructed in a variety of ways, but the most likely method will start by parsing a TCS document that contains several taxon concept instances. Each instance will be examined for references to other objects named by LSID. Each of these LSIDs will be resolved (with care so as not to create cycles), resulting in a chunk of serialized data which will be turned into an in-memory instance and inserted into the graph. This is at heart the merge case I was talking about above.
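To make this concrete, the graph composer might encounter a chunk of TCS that looks roughly like the following (the element names, namespace, and LSIDs are invented for illustration and are not taken from the actual TCS schema):

    <tcs:TaxonConcept xmlns:tcs="http://example.org/tcs"
                      id="urn:lsid:example.org:concepts:1234">
      <tcs:Name>Aus bus L.</tcs:Name>
      <!-- a pointer, not an embedded object: resolving this LSID yields the Specimen -->
      <tcs:VoucherSpecimen>urn:lsid:example.org:specimens:5678</tcs:VoucherSpecimen>
    </tcs:TaxonConcept>

The composer would extract the specimen LSID, resolve it, deserialize the result, and splice the resulting Specimen object into the graph alongside the concept.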
So, to satisfy this case with XML Schema, TCS's specimen element might be implemented with a simpleType that has a restriction base of string and a pattern match constraint designed to allow validation of LSID URNs. This effectively decouples Specimen from TCS and allows instances of them to refer to each other without having to import each other's schema. The Taxon Concept Browser's instance graph composer would know about the schemas for TCS and Specimen. It would take the resolved chunk of XML from specimen LSIDs and attempt to deserialize (unmarshal) it before merging it into the graph that will be visualized.
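A minimal sketch of what I mean (the type name and pattern are mine, and a production-quality pattern would need more care):

    <xs:simpleType name="LsidType">
      <xs:restriction base="xs:string">
        <!-- urn:lsid:authority:namespace:object, with an optional :revision -->
        <xs:pattern value="urn:lsid:[^:]+:[^:]+:[^:]+(:[^:]+)?"/>
      </xs:restriction>
    </xs:simpleType>

    <xs:element name="VoucherSpecimen" type="LsidType"/>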
At first glance this appears to work fairly well; however, there are a few issues with this design. First, it precludes the direct embedding of Specimen instances in TCS instances. There are many reasons why one might want to do this. One reason is to avoid unnecessary LSID resolution calls (which add latency) in the case where specimen and taxon concept objects are coming from the same server. Another reason for embedding specimen instances in taxon concept instances is to make things easier on a user who might want to download an entire taxon concept graph to their local machine for processing by a desktop application; without embedding they may be forced to download several different files. We could fix this by changing the definition of the specimen element in TCS so it can be either a subtree or an LSID, but then the TCS schema would have to import Specimen and vice versa. This is of course impossible, so we'd have to do one of three things: make the TCS specimen element xsd:any, derive both TCS and Specimen from the same base XML Schema which minimally defines an LSID element, or design a complicated scheme for embedding both instances and schema (akin to how WFS works) in a single instance document. Each of these has its own drawbacks: xsd:any makes it difficult or impossible to use most XML-to-Object binding tools; schema inheritance is difficult, can be accomplished only with a social agreement between everyone in the community, and allows for only a weak form of validation; and the embedded-schema approach is burdensome to developers.
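For what it's worth, the "subtree or LSID" version of the element would look something like this, which is where the circular import comes from (again, all names invented):

    <!-- TCS would now have to import the Specimen schema... -->
    <xs:element name="VoucherSpecimen">
      <xs:complexType>
        <xs:choice>
          <xs:element name="SpecimenLsid" type="LsidType"/>
          <xs:element ref="spec:Specimen"/>
          <!-- ...and Specimen would have to import TCS for its own back-references -->
        </xs:choice>
      </xs:complexType>
    </xs:element>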
An additional problem with the XML solution is that it is relatively brittle when it comes to changes in data models over time. Darwin Core has a high adoption rate because it is very simple. However, the simplicity that drove its adoption also encouraged different parts of the community to customize it to fit their needs; I know of at least three variants of Darwin Core in common use and there could be many more that I haven't encountered. Some of these variants were declared in their own namespaces, but others were not. This has made it quite difficult to write code that can ingest all variants of Darwin Core to extract even the minimal set of common elements such as ScientificName. Finally, it is not possible to validate most variants of Darwin Core (for a variety of reasons), which makes them poor candidates for XML-to-Object binding tools. In our example above, if a new variant of the Specimen schema were introduced, the XML-to-Object binding code that backs the deserialization of Specimen instances into the graph would most likely not be able to handle the new version. So, in order to use the Taxon Concept Browser in a heterogeneous network that has more than one version of the Specimen schema, even if the goal is simply to display the minimal set of elements common to each version, we would have to release a new version of the tool every time we deploy a new schema.
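To illustrate, here are two made-up Darwin Core variants of the sort I have in mind - the conceptual content is identical, but generic code cannot rely on any particular namespace or nesting being present:

    <!-- variant A: declared in its own namespace -->
    <dwc:record xmlns:dwc="http://example.org/darwin/variantA">
      <dwc:ScientificName>Aus bus L.</dwc:ScientificName>
    </dwc:record>

    <!-- variant B: no namespace, different nesting -->
    <record>
      <taxon>
        <ScientificName>Aus bus L.</ScientificName>
      </taxon>
    </record>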
That's not to say that RDF will solve all the problems. While it might make the design of flexible, modular data models and the software that uses them a bit easier, no one has ever proved that it will scale. Additionally, there is the temptation with RDF to catch what I call Ontology Fever. In terminal cases, this disease results in an obsession with using OWL Full to model the entire universe reductively at the level of the laws of physics. Any distributed data network afflicted by this disease is destined to die. That's why I prefer RDF Schema to OWL (though I think OWL may eventually play some role if we move towards RDF). For the same reason I think the primary use case is not inference over OWL-described RDF, but search over flexible RDF Schema-described data models. I personally think that RDF might make some use cases, especially the merge case, easier to handle (a sketch of why is below). So I'd like to see further discussion of the use cases above for both XML Schema and RDF.
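Here is the kind of thing I mean about merging: two chunks of RDF about the same LSIDs, returned by different servers, combine by simple statement union with no schema imports involved (the property names and URIs are invented):

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:ex="http://example.org/terms#">
      <!-- statements returned by the taxon concept server -->
      <rdf:Description rdf:about="urn:lsid:example.org:concepts:1234">
        <ex:voucherSpecimen rdf:resource="urn:lsid:example.org:specimens:5678"/>
      </rdf:Description>
      <!-- statements returned by the specimen server; the merged graph is just the union -->
      <rdf:Description rdf:about="urn:lsid:example.org:specimens:5678">
        <ex:scientificName>Aus bus L.</ex:scientificName>
      </rdf:Description>
    </rdf:RDF>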
In summary, the design of our shared data models is more a social process than a technical one and I agree with Bob that it should be carried out using a representation-agnostic modeling language. The technically difficult bit is designing the network of services that will allow one to use the data models. We have an intuitive idea of what the use cases are for such a system, but I'd like to see more discussion on that topic. Roger has started this off nicely by considering the differences between resolution and search but I'd like to continue the discussion into the other use cases like merging, visualization, etc. TAG seems like the best place to do so.
-Steve
Roger Hyam wrote:
Hi Bob,
I'm rushing off to the GISIN meeting in Agadir and might not have much time to respond further before midweek, or maybe even until I get back next week, but:
- I _wish_ this discussion were taking place in a wiki, with RSS or email notification, so it is easier to follow if you cannot keep up with the email
The way I was planning on running the TAG discussions was to have 'discussions' on the mailing list and summarize them to the wiki. The motivation behind this is to work towards the wiki being a readable document for the uninitiated. It should not be necessary for someone new to a field to have to read all the discussions that have taken place to reach a conclusion. These discussions should be available, but it is the job of an editor/facilitator to create a readable narrative from a possibly wandering dialog.
The wiki is here: http://www.tdwg.hyam.net/twiki/bin/view/TAG
The URL will change at some point in the next few months, but I will make sure all URLs forward to the appropriate place on the new server. There is no RSS feed on it at present; I'll see about setting one up either now or when we move it to the main server.
The mailing list archive is here: http://lists.tdwg.org/pipermail/tdwg-tag_lists.tdwg.org/ so any thread can be followed and resurrected at any time.
I take on board what you are saying, though, and will try to create links between the wiki and the list archive.
- I don't think specifications of high level things like "objects" should be done in a serialization constraint language such as RDF or XML Schema. Instead, it should have something more general as the normative definition and have _representations_ in one or more of such constraint languages. This is the usual W3C mechanism. Many (most?) W3C standards have a normative BNF definition, and one or more representations to allow implementers to actually do business. OMG favors UML for this, etc. There is nothing inherently normative about, say, RDF or XML Schema, for, say, TaxonConcepts. If you take the serialization language as the normative language, then in the future you just end up having to support several serialization languages when you find you want to extend your specification with something for which the chosen one is insufficiently expressive. This, in fact, is what is going on now with the cries for RDF over XML Schema. Put another way, if you choose language L as the normative language, you are not building a specification, but rather a set of constraints on applications written in L. Such things do not have as long a life as actual specifications do, and mature standards bodies do not seem to use serialization languages as the root specification language, as far as I can tell. My conclusion is that specifications should not be in anything like RDF or XML Schema, but in something else---BNF is probably adequate for most TDWG standards---with working subgroups responsible for publishing a serialization definition implementing the standard in languages useful for one or another purpose, e.g. LSID resolution.
Yes, I think you are right. We should be specifying our objects in a high level 'language' like UML (I'm not so sure about BNF, but then I am not very familiar with it). There has been talk about OWL Lite as a subset of UML. This was actually the next topic I was going to suggest and I'll kick off a thread on it soon if no one else does.
Can I take it from your reply that you think:
- There should be commonality between all TDWG 'objects', and that commonality should be their specification in UML/BNF/some other technology? (Yes to my question 1).
- There should be alternative ways to serialize these objects, and some of the serializations may support different aspects of the objects (Yes to my question 2).
- Neither XML Schema nor RDF/S is an appropriate way to define such objects.
Have I read this correctly?
Roger
Bob
On 2/17/06, *Roger Hyam* <roger@tdwg.org> wrote:
Hi All,

In a previous post I suggested definitions for Resolving, Searching and Querying from the point of view of the TAG. There has been a muted response, which I take to mean there aren't any strong objections to these definitions. We can come back to them later if need be. You can read the post here if you missed it: http://lists.tdwg.org/pipermail/tdwg-tag_lists.tdwg.org/2006-February/000009.html

I'd like to look at the implications of the first two definitions:
*Resolving.* This means to convert a pointer into a data object. Examples would be to resolve an LSID and get back either data or metadata, or to resolve a URL and get back a web page in HTML.
*Searching.* This means to select a set of objects (or their proxies) on the basis of the values of their properties. The objects are predefined (implicitly part of the call) and we are simply looking for them. An example would be finding pages on Google.
Both these definitions imply the existence of data 'Objects' or 'Structures' that are understood by the clients when they are received. The kinds of objects that jump to mind are Specimens, TaxonNames, TaxonConcepts, NaturalCollections, Collectors, Publications, People, Expeditions, etc. A piece of client software should be able to know what to do with an object when it gets one - how to display it to the user or map it to a database, etc.

My two leading questions are:

1. *Should there be commonality to all the objects?* If yes - what should it be? XML Schema location or OWL Class or something else? If no - then how should clients handle new objects dynamically - or shouldn't they be doing that kind of thing?

2. *Should we have multiple ways of representing the SAME objects?* e.g. should there be only one way to encode a Specimen, or should it be possible to have several encodings running in parallel? If there is only one way, how do we handle upgrades (where we have to run two types of encoding together during the roll out of the new one) AND how do we reach consensus on the 'perfect' way of encoding each and every object in our domain?

The answers I have for my leading questions are:

1. Yes - we should have some commonality between objects or it will be really difficult to write client code - but what that commonality is has to be decided.

2. Yes - the architecture has to handle multiple versions/ways of encoding any particular object type because any one version is not likely to be ideal for everyone forever.

Are the two conclusions I come to here reasonable? Is this too high level and not making any sense? I'd be grateful for your thoughts on this,

Roger
--
Roger Hyam Technical Architect Taxonomic Databases Working Group
http://www.tdwg.org roger@tdwg.org
+44 1578 722782