Re: [Tdwg-tag] roger@tdwg.org

28 Mar 2006

Hi Gregor,

I've placed some comments regarding your exchange with Roger in-line:

Gregor Hagedorn wrote:
...
Hi Roger
...
TAG list url is here with the archive:
http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org
thanks, I registered.
...
...
I need help to understand rdf. Whereas xml schema has a conceptual mapping to 
database or oo-programming design, rdf seems to have none; I lack anything I can 
relate it to. I still have not seen any software to help me understand what 
you produced.
XML Schema was not designed to have a conceptual mapping to databases or 
object-oriented frameworks.  There is a set of tools and a series of 
conventions for loading XML schema instances into objects and for 
mapping schemas into relational table structures, but most of these 
systems only work if you use a subset of XML Schema language features.  
For example, XML Schema features like substitution groups and xsd:any 
cause many of these tools to have problems.
...
...
RDF is no more complex than xml schema. The RDFS way of doing things is 
far more object orientated than schema. It forces you to have classes 
and properties whereas arbitrary XML document structures can be 
ambiguous as to whether they are defining objects or properties of 
objects so - I don't see your reasoning.
I never encountered atomizing every statement into subject-predicate-object in 
OO design...
That's very true and whether you think of RDF as a graph of 3-tuples or 
whether you envision it as a set of "objects" that are instances of 
classes depends on the type of problem you're trying to solve.  Triples 
are the lowest level but thinking in terms of the abstraction of objects 
and classes can be helpful for some tasks.
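
To make the two views concrete, here is a minimal sketch in Python, not 
tied to any RDF library.  The identifiers and property names (the 
urn:example: URNs and the ex: properties) are hypothetical and serve 
only as illustration.  The same data is held first as plain triples and 
then regrouped by subject into an object-like view:

```python
from collections import defaultdict

# Hypothetical identifiers and property names, for illustration only.
triples = [
    ("urn:example:specimen:1", "rdf:type", "ex:Specimen"),
    ("urn:example:specimen:1", "ex:collector", "A. Collector"),
    ("urn:example:specimen:1", "ex:identifiedAs", "urn:example:concept:7"),
    ("urn:example:concept:7", "rdf:type", "ex:TaxonConcept"),
    ("urn:example:concept:7", "ex:nameString", "Aus bus"),
]


def as_objects(triples):
    """Group triples by subject: each subject maps to property -> list of values."""
    objects = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        objects[subject][predicate].append(obj)
    # Convert the nested defaultdicts to plain dicts for readability.
    return {subject: dict(props) for subject, props in objects.items()}


objects = as_objects(triples)
for subject, props in objects.items():
    print(subject, props)
```

Nothing is added or lost in the regrouping; the "object" is just the set 
of triples that share a subject, read as a record.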

As an aside, some XML databases reduce XML Schema instances to a 
low-level structure called a flattened tree that can be analogous to 
triples.  It is possible to decompose any XML instance into an ordered 
list of XPath = value pairs where the XPaths are concrete and used to 
refer to any element or attribute in the document.  This is one of two 
approaches for building an XML database from scratch.
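
The flattening described above can be sketched with Python's standard 
library.  This is only an illustration under simplifying assumptions: 
the sample document and its element names are made up, mixed content 
and tail text are ignored, and namespaces are not handled.  It is not a 
description of how any particular XML database is implemented.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Return an ordered list of (concrete XPath, value) pairs for a document."""
    root = ET.fromstring(xml_text)
    pairs = []

    def walk(elem, path):
        # Attributes first, then the element's own text, then its children.
        for name, value in elem.attrib.items():
            pairs.append((f"{path}/@{name}", value))
        text = (elem.text or "").strip()
        if text:
            pairs.append((path, text))
        counts = {}  # per-tag position among siblings, giving concrete paths
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, f"/{root.tag}[1]")
    return pairs


# Hypothetical sample document, for illustration only.
doc = (
    '<unit id="u1">'
    "<name>Aus bus</name>"
    "<collector>A. Collector</collector>"
    "<collector>B. Collector</collector>"
    "</unit>"
)

pairs = flatten(doc)
for xpath, value in pairs:
    print(f"{xpath} = {value}")
```

Each pair plays roughly the role of a predicate and object whose 
subject is the document itself, which is where the analogy to triples 
comes from.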
...
"to whether they are defining objects or properties of objects so": xml schema 
is about classes, not objects (instances). Can you give an example of what you 
find confusing in xml schema? I don't see it.
XML Schema is a grammar for accepting or rejecting documents.  It does 
not define classes of objects and the relationships among those 
classes.  With care and common agreement from the stakeholders of an XML 
Schema one can create a schema such that it describes classes of 
objects, but this is by agreement not by design.  The only class of 
object described by the ABCD schema is an ABCD document, not a specimen 
or a publication or a name or what have you.
...
Of course you do have the strange animal of mixed content in xml schema, but 
ignoring this (none of the TDWG standards use it) you have classes and each 
class has a type. The type can be simple or complex, just like in OO languages.
I agree that you can define simple and complex types in an XML Schema, 
however these are syntactic types.  An XML Schema type (simple or 
complex) is simply a rule for accepting or rejecting an XML subtree.  It 
does not define what a thing is and how it relates to other things, it 
merely describes the form a thing must have in order to be acceptable to 
a validating XML parser.

Because XML Schema was designed to be a grammar for the validation of 
XML trees and not a semantic typing system, using it to build a global 
collection of interrelated data objects introduces a variety of issues:

First, XML Schema is very limited in the relationships you can define 
between types.  One of the most-used relationships in OOA/OOD is 
inheritance and XML Schema does not provide proper inheritance.  In XML 
Schema, substitution groups can be used as an inheritance-like language 
function, but they only work within a single schema.  To do more than 
that, one must start importing other schemas which can cause some 
surprising problems. 

Second, there is no global identity property in XML.  One can use IDs, 
but they are local to a single instance document.  The use of GUIDs will 
enable us to build a large collection of interrelated data objects of 
different types.  To accomplish this in XML we would have to agree on 
how to represent GUIDs in all of the TDWG schemas.  Again, this is 
something we can accomplish, but it will be accomplished by agreement 
as opposed to being enforced by the technology stack.

Third, XML Schema introduces the problem of schema interoperability.  If 
I have a TCS XML Schema that allows pointers to instances of a 
publication XML Schema and I want instances of my TCS schema to be able 
to represent publications either as GUIDs or as actual data, then I must 
design my TCS schema to import my publication schema.  This is fine for 
taxon concepts and publications, but what about taxon concepts and 
specimens?  The Specimen XML Schema would have to import the TCS schema 
(because a specimen can be identified as an instance of a particular 
taxon concept) and TCS would have to import the specimen schema.  This 
is a circular import, and it is not allowed.  Furthermore, there is no 
sophisticated XML instance pre-processor system (as in C compilers) that 
supports conditional imports.  In order to do this with XML we would 
have to change our requirements such that we only ever allow references 
to data objects defined under a foreign schema by GUID and never allow 
copies of those foreign data objects to be embedded in our XML 
instance.  In plain English this means our TCS instance can't embed a 
publication data object, it can only refer to it by GUID.  Once again, 
this builds greater dependency upon the GUID framework which exists by 
agreement only due to the second problem listed above.

These are only three of a great many issues with using XML Schema to 
build a large collection of interrelated data objects.  RDF (along with 
RDF-Schema and/or OWL) solves many of these problems.  To be fair, RDF 
also has its drawbacks, including but not limited to the complexity of 
client APIs and the inefficiency of triple stores.  I'd be happy to 
discuss problems on both sides of this ontological divide at more 
length if anyone else is interested.
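
As a rough illustration of the third issue, here is a sketch in Python 
of how triple-based descriptions avoid the circular import.  All 
identifiers and property names are hypothetical.  Two independently 
produced descriptions refer to each other by identifier only, and 
merging them is a plain set union with no import step:

```python
# Hypothetical identifiers and properties, for illustration only.
# Triples as they might come from a specimen provider.
specimen_triples = {
    ("urn:example:specimen:1", "ex:identifiedAs", "urn:example:concept:7"),
}

# Triples as they might come from a taxon concept provider.
concept_triples = {
    ("urn:example:concept:7", "ex:voucher", "urn:example:specimen:1"),
    ("urn:example:concept:7", "ex:publishedIn", "urn:example:publication:3"),
}

# Merging needs no schema import in either direction.
graph = specimen_triples | concept_triples


def dangling_references(graph, prefix="urn:example:"):
    """Identifiers that are referred to but not (yet) described in the graph."""
    subjects = {s for s, _, _ in graph}
    return {o for _, _, o in graph if o.startswith(prefix) and o not in subjects}


unresolved = dangling_references(graph)
print(unresolved)
```

The publication is referenced but not described, so it shows up as 
unresolved; its description could be merged in later in the same way.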
...
I already tried the primer, but it did not help me; it seemed to talk
of use cases more from artificial intelligence, which are hard for me to follow.
...
The RDF primer is a good place to start reading:
http://www.w3.org/TR/rdf-primer/
It is less than 100 printed pages so can probably be read in an evening 
and understood in several evenings!
There is a tutorial here:
http://www.w3schools.com/rdf/default.asp
and loads of books and things
The key to understanding it I found was that it is about describing 
resources not validating documents. When using XML Schema we are trying 
to create a set of rules to validate a document that describes the 
resource. We are effectively designing forms. With RDF we are describing 
the attributes of the resource that we want to use to describe it. Thus 
the two things are not mutually exclusive - which I hoped to demonstrate 
with my code.
That may be a good pointer to the problems I have. Because I do not think we 
are describing resources. In my mind we are sharing scientific data. I want the 
data, not the resources.
Resources only act as identifiers for things, for data objects of a 
particular type.  What is important is the description of those things 
(the data).  In the RDF universe I've been imagining, resources are 
GUIDs for things like names, specimens, observations, publications, 
people, institutions, sequences, etc.

One thing we haven't talked about is the fundamental unit of data 
exchange in an RDF universe.  It's not a document (as in the XML 
universe), nor is it a statement (a triple); instead, it is a set of 
triples that form a concise description of a resource.  See 
http://swdev.nokia.com/uriqa/CBD.html (a W3C proposal).
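
A simplified sketch of the idea in Python follows.  It assumes a graph 
held as a set of string triples, with blank nodes written as strings 
starting with "_:", and it leaves out the handling of reified 
statements that the CBD document also covers.  The identifiers and 
properties are hypothetical.

```python
def concise_bounded_description(graph, resource):
    """Collect the triples about `resource`, following blank-node objects only."""
    description = set()
    pending = [resource]
    seen = set()
    while pending:
        node = pending.pop()
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in graph:
            if s == node:
                description.add((s, p, o))
                # Blank nodes have no identity of their own, so their
                # statements travel with the description. Named resources
                # are left as references.
                if o.startswith("_:"):
                    pending.append(o)
    return description


# Hypothetical data, for illustration only.
graph = {
    ("urn:example:specimen:1", "ex:collector", "A. Collector"),
    ("urn:example:specimen:1", "ex:locality", "_:b1"),
    ("_:b1", "ex:country", "Exampleland"),
    ("urn:example:specimen:1", "ex:identifiedAs", "urn:example:concept:7"),
    ("urn:example:concept:7", "ex:nameString", "Aus bus"),
}

cbd = concise_bounded_description(graph, "urn:example:specimen:1")
for triple in sorted(cbd):
    print(triple)
```

The description of the specimen includes the locality details held in 
the blank node but stops at the taxon concept, which has its own 
identifier and can be requested separately.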

-Steve

Steven Perry