[Tdwg-tag] Primary Objects as XML Structures or OWL Classes
smperry at ku.edu
Mon Feb 20 19:57:39 CET 2006
I agree with Bob that our data model specifications should be decoupled
from possible representation schemes. In my opinion, these
specifications should take the form of UML static structures with
accompanying explanatory documents. The use of BNF grammars is a good
idea, but I worry that they might become difficult to manage as they
grow and that non-CS people in the community would find it difficult to read.
I also think that the technical architecture group should not be
concerned with the data models themselves. Instead we have to worry
about how to map existing data sets into the shared models, how to link
instances of different models together, how to locate one or more data
objects that meet certain criteria, how to merge collections of data
objects from one or more models, how to visualize trees or graphs of
data objects, how to serialize and deserialize data objects into
different representations, etc. In short, we have to design a network
of services that allow us to work with data objects and collections of data
objects in a fairly generic fashion and leave the actual creation of the
models up to the subject matter experts (though we might supply a bit of guidance).
These services and processes will also require documentation and might
be specified with the same combination of UML (sequence or activity
diagrams) and explanatory documentation. I think all of us agree that
these ought to be designed in a language-independent manner and be built
upon a small stack of existing technologies like HTTP and XML for transport and serialization.
At some point though we have to agree on a representation format. If
we're talking about building a set of distributed services that will
allow us to locate, acquire, and work with biodiversity data, then I
think we need to propose an architecture that has a few fixed points,
one of which should be representation format. I for one don't want to
have to design tools that can ingest both XML Schema instances and RDF
described by ontologies.
The representation format we select ought to be flexible enough to
accommodate the data models described by the subject matter experts. It
should also minimize the burden on the software engineers and developers
that have to design and maintain the processes, tools, and services that
satisfy the above use cases (mapping, serializing/deserializing,
searching, merging, visualizing, etc.). Ideally the representation
format should allow us to choose from a collection of existing tools and
frameworks to use while implementing (because no one has the time or
money to create all this from scratch). This means that we need to
evaluate each candidate representation format with the above use cases
in mind. Every representation scheme (RDF, XML, Java classes, etc.) has
its strengths and weaknesses and this process of talking about each use
case with respect to implementing it over each of the representation
formats will allow us to better understand the trade-offs of selecting
one format over another.
Here's an example of the types of discussions I'd like to see from TAG:
Portions of the TCS data model describe specimens, publications, and
other things that are not names or concepts. In a perfect world, TCS
would not define its own data model for specimens but would instead use
an existing model designed by the curators of collections (perhaps with
Darwin Core as a starting point). The same is true for publications.
Instances of TCS should then use GUIDs to point to instances of Specimen and Publication.
Now, imagine a hypothetical system that provides for visualization of
TCS, a Taxon Concept Browser that allows researchers to search for and
view TCS instances in order to select a set of concepts to use in their
own work. At some point, this system will have to have an in-memory
graph of data objects of different types including TCS, Specimen, and
Publication. This graph could be constructed in a variety of ways, but
the most likely method will start by parsing a TCS document that
contains several taxon concept instances. Each instance will be
examined for references to other objects named by LSID. Each of these
LSIDs will be resolved (with care so as to not create cycles), resulting
in a chunk of serialized data which will be turned into an in-memory
instance and inserted into the graph. This is at heart the merge case I
was talking about above.
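The resolve-and-merge loop described above can be sketched as follows. Note that `resolve_lsid`, `deserialize`, and `extract_lsids` are hypothetical helpers standing in for the network call, the unmarshalling step, and the reference scan; the visited set is the "care" taken to avoid cycles:

```python
def build_graph(root_doc, resolve_lsid, deserialize, extract_lsids):
    """Breadth-first merge: start from a root document, resolve every
    LSID reference it mentions, and guard against cycles with a visited set."""
    graph = {}                            # lsid -> in-memory object
    queue = list(extract_lsids(root_doc))
    visited = set()
    while queue:
        lsid = queue.pop(0)
        if lsid in visited:               # cycle / duplicate guard
            continue
        visited.add(lsid)
        data = resolve_lsid(lsid)         # network call: serialized chunk
        obj = deserialize(data)           # turn the chunk into an instance
        graph[lsid] = obj
        queue.extend(extract_lsids(obj))  # follow nested references
    return graph
```

A real composer would do this per-schema, but the control flow is the same regardless of representation format.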
So, to satisfy this case with XML Schema, TCS's specimen element might
be implemented with a simpleType that has a restriction base of string
and a pattern match constraint designed to allow validation of LSID
URNs. This effectively decouples Specimen from TCS and allows instances
of them to refer to each other without having to import each other's
schema. The Taxon Concept Browser's instance graph composer would know
about the schemas for TCS and Specimen. It would take the resolved
chunk of XML from specimen LSIDs and attempt to deserialize (unmarshal)
it before merging it into the graph that will be visualized.
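The pattern facet such a simpleType would carry can be tried out directly. The regex below is a rough sketch of the `urn:lsid:authority:namespace:object[:revision]` shape, not the normative LSID grammar:

```python
import re

# Rough approximation of the LSID URN shape:
#   urn:lsid:<authority>:<namespace>:<object>[:<revision>]
LSID_PATTERN = re.compile(
    r"^urn:lsid:[A-Za-z0-9.\-]+:[A-Za-z0-9.\-]+:[A-Za-z0-9.\-]+(:[A-Za-z0-9.\-]+)?$"
)

def is_lsid(value: str) -> bool:
    """True if the value looks like an LSID URN (approximate check)."""
    return bool(LSID_PATTERN.match(value))
```

The same expression, dropped into an xsd:pattern facet on a restriction of xsd:string, gives the validation described above.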
At first glance this appears to work fairly well; however, there are a
few issues with this design. First, it precludes the direct embedding
of Specimen instances in TCS instances. There are many reasons why one
might want to do this. One reason is to avoid unnecessary LSID
resolution calls (which add latency) in the case where specimen and
taxon concept objects are coming from the same server. Another reason
for embedding specimen instances in taxon concept instances is to make
things easier on a user who might want to download an entire taxon
concept graph to their local machine for processing by a desktop
application. Without embedding they may be forced to download several
different files. We could fix this by changing the definition of the
specimen element in TCS so it can be either a subtree or an LSID, but
then the TCS schema would have to import specimen and vice versa. This
is of course impossible so we'd have to do one of three things: make the
TCS specimen element xsd:any, derive both TCS and Specimen from the same
base XML Schema which minimally defines an LSID element, or design a
complicated scheme for embedding both instances and schema (akin to how
WFS works) in a single instance document. Each of these has its own
drawbacks: xsd:any makes it difficult or impossible to use most
XML-to-Object binding tools; schema inheritance is difficult, can be
accomplished only through a social agreement among everyone in the
community, and allows for only a weak form of validation; and the
embedded-schema approach is burdensome to developers.
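Whichever workaround is chosen, consuming code ends up with a dispatch like the following. This is a sketch using ElementTree; the element name and the shape of the embedded form are illustrative, not taken from the actual TCS schema:

```python
import xml.etree.ElementTree as ET

def read_specimen(elem, resolve_lsid):
    """Accept a specimen reference that is either a bare LSID string
    (reference form) or an embedded subtree (embedded form)."""
    text = (elem.text or "").strip()
    if text.startswith("urn:lsid:") and len(elem) == 0:
        return resolve_lsid(text)          # reference form: resolve remotely
    return {child.tag: child.text for child in elem}  # embedded form

ref = ET.fromstring("<specimen>urn:lsid:example.org:spec:1</specimen>")
emb = ET.fromstring("<specimen><catalogNumber>42</catalogNumber></specimen>")
```

Every consumer of the format has to carry this two-way branch, which is part of the burden on developers mentioned above.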
An additional problem with the XML solution is that it is relatively
brittle when it comes to change in data models over time. Darwin Core
has a high adoption rate because it is very simple. However, the
simplicity that drove its adoption also encouraged different parts of
the community to customize it to fit their needs; I know of at least
three variants of Darwin Core in common use and there could be many more
that I haven't encountered. Some of these variants were declared in
their own namespaces, but others were not. This has made it quite
difficult to write code that can ingest all variants of Darwin Core to
extract even the minimal set of common elements such as ScientificName.
Finally, it is not possible to validate most variants of Darwin Core (for
a variety of reasons). This makes them a poor candidate for
XML-to-Object binding tools. In our example above, if a new variant of
the Specimen Schema were introduced, then the XML-to-Object binding code
that backs the deserialization of Specimen instances into the graph
would most likely not be able to handle the new version. So, in order
to use the Taxon Concept Browser in a heterogeneous network that has
more than one version of the Specimen schema, even if the goal is simply
to display the minimal set of elements common to each version, we would
have to release a new version of the tool every time we deploy a new Schema.
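One defensive tactic for that heterogeneity, at the cost of giving up validation and binding tools entirely, is to match on local names and ignore whichever namespace a given variant declares (or fails to declare). A sketch, with made-up variant namespaces:

```python
import xml.etree.ElementTree as ET

def find_by_local_name(root, local):
    """Collect element text by local name, ignoring namespaces, so the
    same code tolerates namespaced and un-namespaced Darwin Core variants."""
    hits = []
    for elem in root.iter():
        tag = elem.tag.rsplit("}", 1)[-1]   # strip '{namespace}' prefix if any
        if tag == local:
            hits.append(elem.text)
    return hits

doc = ET.fromstring(
    '<records>'
    '<ScientificName xmlns="http://example.org/dwc-variant-a">Quercus robur</ScientificName>'
    '<ScientificName>Pinus sylvestris</ScientificName>'
    '</records>'
)
```

This recovers the minimal common elements, but it is exactly the kind of lowest-common-denominator code that XML Schema was supposed to make unnecessary.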
That's not to say that RDF will solve all the problems. While it might
make the design of flexible, modular data models and the software that
uses them a bit easier, no one has ever proved that it will scale.
Additionally there is the temptation with RDF to catch what I call
Ontology Fever. In terminal cases, this disease results in an obsession
with using OWL Full to model the entire universe reductively at the
level of the laws of physics. Any distributed data network afflicted by
this disease is destined to die. That's why I prefer RDF-Schema to OWL
(though I think OWL may end up eventually playing some role if we move
towards RDF). For the same reason I think the primary use case is not
inference over OWL-described RDF, but search over flexible RDF-Schema
described data models. I personally think that RDF might make some use
cases, especially the merge case, easier to handle. So I'd like to see
further discussion of the use cases above for both XML Schema and RDF.
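Part of why the merge case looks easier with RDF is that a model instance is just a set of subject-predicate-object triples, so merging documents from different sources reduces to set union. The plain-Python sketch below uses invented example triples (a real system would use an RDF library and would also have to handle blank nodes, which are ignored here):

```python
# An RDF graph is a set of (subject, predicate, object) triples.
tcs_triples = {
    ("urn:lsid:example.org:tc:1", "rdf:type", "TaxonConcept"),
    ("urn:lsid:example.org:tc:1", "hasSpecimen", "urn:lsid:example.org:spec:7"),
}
specimen_triples = {
    ("urn:lsid:example.org:spec:7", "rdf:type", "Specimen"),
    ("urn:lsid:example.org:spec:7", "catalogNumber", "42"),
}

# The whole merge step: no schema imports, no element-order constraints,
# and unknown predicates from a new variant are simply carried along.
merged = tcs_triples | specimen_triples
```

Contrast this with the XML case above, where merging required coordinated schemas and version-matched binding code.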
In summary, the design of our shared data models is more a social
process than a technical one and I agree with Bob that it should be
carried out using a representation-agnostic modeling language. The
technically difficult bit is designing the network of services that will
allow one to use the data models. We have an intuitive idea of what the
use cases are for such a system, but I'd like to see more discussion on
that topic. Roger has started this off nicely by considering the
differences between resolution and search but I'd like to continue the
discussion into the other use cases like merging, visualization, etc.
TAG seems like the best place to do so.
Roger Hyam wrote:
> Hi Bob,
>> I'm rushing off to the GISIN meeting at AGADIR and might not have
>> much time to respond more before midweek, or maybe even until I get
>> back next week, but:
>> 0. I _wish_ this discussion were taking place in a wiki, with RSS or
>> email notification, so it is easier to follow if you cannot keep up
>> with the email
> The way I was planning on running the TAG discussions was to have
> 'discussions' on the mailing list and summarize them to the wiki. The
> motivation behind this is to work towards the wiki being a readable
> document for the uninitiated. It should not be necessary for someone
> coming new to a field to have to read all the discussions that have
> taken place to reach a conclusion. These discussions should be
> available but it is the job of an editor/facilitator to create a
> readable narrative from possibly wandering dialog.
> The wiki is here: http://www.tdwg.hyam.net/twiki/bin/view/TAG
> The URL will change at some point in the next few months but I will
> make sure all URLs forward to the appropriate place on the new server.
> There is no RSS feed on it at present; I'll see about setting one up
> either now or when we move it to the main server.
> The mailing list archive is here:
> http://lists.tdwg.org/pipermail/tdwg-tag_lists.tdwg.org/ so any thread
> can be followed and resurrected at any time.
> I take on board what you are saying though and will try and create
> links between the wiki and list archive.
>> 1. I don't think specifications of high level things like "objects"
>> should be done in a serlization constraint languge such as RDF or XML
>> Schema. Instead, it should have something more general as the
>> normative definition and have _representation_ in one or more of such
>> constraint languages. This is the mechanism of W3C usually. Many
>> (Most?) W3C standards have a normative BNF definition, and one or
>> more representations to allow implementers to actually do business.
>> OMG favors UML for this, etc. There is nothing inherently normative
>> about, say RDF or XML Schema, for, say TaxonConcepts. If you take the
>> serialization language as the normative language, then in the future
>> you just end up having to support several serialization languages
>> when you find you want to extend your specification with something
>> for which the chosen one is insufficiently expressive. This, in fact,
>> is what is going on now with the cries for RDF over XML Schema. Put
>> another way, if you choose language L as the normative language, you
>> are not building a specification, but rather a set of constraints on
>> applications written in L. Such things do not have as long a life as
>> actual specifications do and mature standards bodies do not seem to
>> use serialization languages as the root specification language, as
>> far as I can tell. My conclusion is that specifications should not
>> be in anything like RDF or XML Schema, but in something else---BNF is
>> probably adequate for most TDWG standards---with working subgroups
>> responsible for publishing a serialization definition implementing
>> the standard in languages useful for one or another purpose, e.g.
>> LSID resolution.
> Yes I think you are right. We should be specifying our objects in a
> high level 'language' like UML (not so sure about BNF but I am not so
> familiar with it). There has been talk about OWL Lite as a subset of
> UML. This was actually the next topic I was going to suggest and I'll
> kick off a thread on it soon if no one else does.
> Can I take it from your reply that you think:
> 1. There should be commonality between all TDWG 'objects' and that
> that commonality should be their specification in UML/BNF/Other
> technology? (Yes to my question 1).
> 2. There should be alternative ways to serialize these objects.
> Some of the serialization may support different aspects of the
> objects (Yes to my question 2).
> 3. XML Schema or RDF/S are not appropriate ways to define such objects
> Have I read this correctly?
>> On 2/17/06, *Roger Hyam* <roger at tdwg.org <mailto:roger at tdwg.org>> wrote:
>> Hi All,
>> In a previous post I suggested definitions for Resolving,
>> Searching and Querying from the point of view of the TAG. There
>> has been a muted response which I take as meaning there aren't
>> any strong objections to these definitions. We can come back to
>> them later if need be. You can read the post here if you missed it:
>> I'd like to look at the implications of the first two definitions:
>> 1. *Resolving.* This means to convert a pointer into a data object.
>> Examples would be to resolve an LSID and get back either data or
>> metadata or resolve a url and get back a web page in html.
>> 2. *Searching.* This means to select a set of objects (or their
>> proxies) on the basis of the values of their properties. The
>> objects are predefined (implicitly part of the call) and we are
>> simply looking for them. An example would be finding pages on Google.
>> Both these definitions imply the existence of data 'Objects' or
>> 'Structures' that are understood by the clients when they are
>> received. The kinds of objects that jump to mind are Specimens,
>> TaxonNames, TaxonConcepts, NaturalCollections, Collectors,
>> Publications, People, Expeditions etc etc. A piece of client
>> software should be able to know what to do with an object when it
>> gets one - how to display it to the user or map it to a db, etc.
>> My two leading questions are:
>> 1. *Should there be commonality to all the objects?* If yes -
>> what should it be? XML Schema location or OWL Class or
>> something else? If no - then how should clients handle new
>> objects dynamically - or shouldn't they be doing that kind
>> of thing.
>> 2. *Should we have multiple ways of representing the SAME
>> objects?* e.g. Should there be only one way to encode a
>> Specimen or should it be possible to have several encodings
>> running in parallel. If there is only one way how do we
>> handle upgrades (where we have to run two types of encoding
>> together during the roll out of the new one) AND how do we
>> reach consensus on the 'perfect' way of encoding each and
>> every object in our domain?
>> The answers I have for my leading questions are:
>> 1. Yes - We should have some commonality between objects or it
>> will be really difficult to write client code - but what
>> that commonality is has to be decided.
>> 2. Yes - The architecture has to handle multiple versions/ways
>> of encoding any particular object type because any one
>> version is not likely to be ideal for everyone forever.
>> Are the two conclusions I come to here reasonable? Is this too
>> high level and not making any sense?
>> I'd be grateful for your thoughts on this,
>> Roger Hyam
>> Technical Architect
>> Taxonomic Databases Working Group
>> roger at tdwg.org <mailto:roger at tdwg.org>
>> +44 1578 722782
>> Tdwg-tag mailing list
>> Tdwg-tag at lists.tdwg.org <mailto:Tdwg-tag at lists.tdwg.org>
> Roger Hyam
> Technical Architect
> Taxonomic Databases Working Group
> roger at tdwg.org
> +44 1578 722782