Re: [Tdwg-tag] w3c xml-schema discussion

29 Mar 2006

      Dear Steve,

many thanks for your comments. You definitely pointed out many places where my 
language was inaccurate.
...
XML Schema is a grammar for accepting or rejecting documents.  It does 
not define classes of objects and the relationships among those 
classes.  With care and common agreement from the stakeholders of an XML 
Schema one can create a schema such that it describes classes of 
objects, but this is by agreement not by design.  The only class of 
object described by the ABCD schema is an ABCD document, not a specimen 
or a publication or a name or what have you.
I agree, but then you can (and TCS and UBIF/SDD do) use schema in a way that, 
by design, avoids global elements, substitution groups, etc. Instead, we use 
types, each of which is intended to map to a class.
...
I agree that you can define simple and complex types in an XML Schema, 
however these are syntactic types.  An XML Schema type (simple or 
complex) is simply a rule for accepting or rejecting an XML subtree.  It 
does not define what a thing is and how it relates to other things, it 
merely describes the form a thing must have in order to be acceptable to 
a validating XML parser.
Yes, but if you write these rules by means of class inheritance, extension and 
polymorphism, and you add a note that this is not meant to be random, you 
are surely entitled to interpret this as design. You could just as well claim that 
a Java (or any other) OO architecture is not about defining what a thing is and how 
it relates to other things. Strictly speaking you are correct, but I believe that no 
strict separation is in place here.
...
First, XML Schema is very limited in the relationships you can define 
between types.  One of the most-used relationships in OOA/OOD is 
inheritance and XML Schema does not provide proper inheritance.
I believe it does. w3c-schema has type derivation, extension, and even type 
polymorphism (all somewhat limited by the parsing-determinism optimisations in 
schema). You can have extension on both simple and complex types. For 
polymorphism, it even has the special xsi:type attribute (strictly a separate 
schema, but documented in the w3c schema documentation).
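To illustrate what derivation by extension and xsi:type look like in practice, here is a minimal sketch. The type names (ObjectType, SpecimenType), element names and instance values are made up for illustration and come from no TDWG schema. The Python standard library has no schema validator, so the code only inspects the schema and the instance; it does not validate anything.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical schema fragment: SpecimenType is derived from ObjectType
# by extension (it adds a Collector element to the inherited content).
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="ObjectType">
    <xs:sequence>
      <xs:element name="Label" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="SpecimenType">
    <xs:complexContent>
      <xs:extension base="ObjectType">
        <xs:sequence>
          <xs:element name="Collector" type="xs:string"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:element name="Object" type="ObjectType"/>
</xs:schema>
"""

# Hypothetical instance: the element is declared with the base type, but
# the instance announces the derived type through xsi:type.
INSTANCE = """
<Object xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:type="SpecimenType">
  <Label>Specimen 1</Label>
  <Collector>A. Collector</Collector>
</Object>
"""


def base_types(schema_xml):
    """Map each complex type derived by extension to the base it extends."""
    root = ET.fromstring(schema_xml)
    result = {}
    for ctype in root.iter(f"{{{XS}}}complexType"):
        ext = ctype.find(f"{{{XS}}}complexContent/{{{XS}}}extension")
        if ext is not None:
            result[ctype.get("name")] = ext.get("base")
    return result


def announced_type(instance_xml):
    """Return the value of the xsi:type attribute on the root element."""
    return ET.fromstring(instance_xml).get(f"{{{XSI}}}type")


derivations = base_types(SCHEMA)
runtime_type = announced_type(INSTANCE)
print(derivations)
print(runtime_type)
```

A validating parser would use the xsi:type value to check the element against the derived type rather than the declared base type.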
...
substitution groups can be used as an inheritance-like language 
function, but they only work within a single schema.  To do more than 
that, one must start importing other schemas which can cause some 
surprising problems.
I believe what you say may be true for substitution groups, but not for 
extension.
...
Second, there is no global identity property in XML.  One can use id's, 
but they are local to a single instance document.  The use of GUIDs will 
enable us to build a large  collection of interrelated data objects of 
different types.  To accomplish this in XML we would have to agree on 
how to represent GUIDs in all of the TDWG schema.  Again, this is 
something we can accomplish, but it will be accomplished by agreement 
as opposed to being enforced by the technology stack.
I think this is erroneous. Whenever you define a data element as the simple 
type xs:anyURI you inform any parser that you mean this to be a GUID. Whether 
parsers use that information is another question; it certainly is not validated 
by current validators.

Also, there is no restriction that id attributes must be local. In fact they 
can be freely typed, including as xs:anyURI.

===
By the way, conversely in SDD we have identified a major problem in forcing 
people to use URIs for every internal reference. The problem is partly the 
learning curve (school children trying to develop their own LUCID key to 
backyard plants should NOT be bothered with defining their GUID scheme first, 
and as a biologist I may say that biologists often would like to be treated 
the same...) and partly legal (e.g. my current base address is my employer's, 
but as soon as I leave or retire, I am legally forced to no longer use bba.de 
under any circumstances).

I am not sure how to overcome this; perhaps someone should indeed register a 
urn:local scheme.
===
...
Third, XML Schema introduces the problem of schema interoperability.  If 
I have a TCS XML Schema that allows pointers to instances of a 
publication XML Schema and I want instances of my TCS schema to be able 
to represent publications either as GUIDs or as actual data, then I must 
design my TCS schema to import my publication schema.  This is fine for 
taxon concepts and publications, but what about taxon concepts and 
specimens?  The Specimen XML Schema would have to import the TCS schema 
(because a specimen can be identified as an instance of a particular 
taxon concept) and TCS would have to import the specimen schema.  This 
is circular import and it is not allowed.
I fully agree with this being a serious problem.

In principle it is possible to overcome this with the use of type polymorphism. 
UBIF would define an abstract base type (and yes, if we needed more base types we 
would need to extend the UBIF schema, creating a new version of it).

However, in testing in 2002/2003 it turned out that major xml tools did not 
handle multiple-namespace schemata correctly, so we never got very far down 
that road. So I cannot say how realistic the solution is with current software.

I agree this is an open problem with w3c schema.
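The import problem can be pictured as a dependency graph. The following is a minimal sketch: the schema names are those used in this discussion, but the two import layouts are simplified assumptions, in particular the idea that every schema imports only UBIF and refers to other objects through its abstract base types.

```python
def find_cycle(imports):
    """Return one import cycle as a list of schema names, or None."""
    visiting, done = [], set()

    def visit(name):
        if name in done:
            return None
        if name in visiting:
            # Found a schema that is already on the current import path.
            return visiting[visiting.index(name):] + [name]
        visiting.append(name)
        for dep in imports.get(name, ()):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(name)
        return None

    for name in imports:
        cycle = visit(name)
        if cycle:
            return cycle
    return None


# Assumed layout 1: schemas import each other directly.
direct = {
    "TCS": ["Publication", "Specimen"],
    "Specimen": ["TCS"],
    "Publication": [],
}

# Assumed layout 2: every schema imports only a shared base schema and
# derives its types from abstract base types defined there.
via_base = {
    "UBIF": [],
    "TCS": ["UBIF"],
    "Specimen": ["UBIF"],
    "Publication": ["UBIF"],
}

print(find_cycle(direct))    # ['TCS', 'Specimen', 'TCS']
print(find_cycle(via_base))  # None
```

The sketch says nothing about whether tools handle the second layout correctly; it only shows that the import graph itself no longer contains a cycle.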
...
This is only three out of a great many issues with using XML Schema to 
build a large collection of interrelated data objects.  RDF (along with 
RDF-Schema and/or OWL) solves many of these problems.  To be fair, RDF 
also has its drawbacks, including but not limited to the complexity of 
client APIs and the inefficiency of triple stores.  I'd be happy to discuss 
problems on both 
sides of this ontological divide at more length if anyone else is 
interested.
I am and I believe we should be. Please do so, to help us get a clearer 
picture. Current ("mainstream") usage seems to point to xml-schema, but I think 
ontological approaches are exciting. I just feel we lose quite a bit as well, 
simply because RDF may be so general that it does not allow us to write software 
for more constrained (and therefore easier to analyse) cases. Although an RDBMS 
can be used as a triple store, that is not what it is designed for, so my 
current impression is that we do lose the time-proven utility of ER models 
implemented in an RDBMS. I may still be wrong; I am only starting to learn about RDF/S.
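As a rough illustration of that trade-off, the sketch below stores the same two records once in an ER-style table and once in a generic triple table, using SQLite from the Python standard library. Table names, column names and the records themselves are made up for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- ER-style: one typed table, constraints enforced by the database.
CREATE TABLE specimen (
    id      TEXT PRIMARY KEY,
    taxon   TEXT NOT NULL,
    country TEXT NOT NULL
);
INSERT INTO specimen VALUES ('s1', 'Ipomoea violacea', 'USA');
INSERT INTO specimen VALUES ('s2', 'Ipomoea violacea', 'Mexico');

-- Triple-style: one generic table, no per-property constraints.
CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT);
INSERT INTO triple VALUES ('s1', 'taxon', 'Ipomoea violacea');
INSERT INTO triple VALUES ('s1', 'country', 'USA');
INSERT INTO triple VALUES ('s2', 'taxon', 'Ipomoea violacea');
INSERT INTO triple VALUES ('s2', 'country', 'Mexico');
""")

params = ("Ipomoea violacea", "USA")

# ER model: a single-table query with two conditions.
er_rows = con.execute(
    "SELECT id FROM specimen WHERE taxon = ? AND country = ? ORDER BY id",
    params,
).fetchall()

# Triple table: every additional property costs another self-join.
triple_rows = con.execute(
    """
    SELECT t.subject
    FROM triple AS t
    JOIN triple AS c ON c.subject = t.subject
    WHERE t.predicate = 'taxon' AND t.object = ?
      AND c.predicate = 'country' AND c.object = ?
    ORDER BY t.subject
    """,
    params,
).fetchall()

print(er_rows, triple_rows)
```

Both queries return the same record, but only the first table lets the database enforce that every specimen has exactly one taxon and one country.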
...
Resources only act as identifiers for things, for data objects of a 
particular type.  What is important is the description of those things 
(the data).  In the RDF universe I've been imagining, resources are 
GUIDs for things like names, specimens, observations, publications, 
people, institutions, sequences, etc.
What are the resources, what are metadata and data when expressing knowledge 
about a taxon or specimen in saying:

Ipomoea violacea in the USA: "Flowers frequently dark to light blue, sometimes 
bordering on violet (G. Hagedorn, 29.3.2006)" and then "Flowers dark or light 
blue to purplish (Much. Better, 30.3.2006)"

We have object parts, characters, states, frequency modifiers, IPR metadata, 
versions, etc.

SDD expresses this through xml-schema. I find it very hard to think how to 
express this in RDF triples. Maybe attempting this would help us understand what 
we lose in RDF.
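One naive attempt, purely as a starting point for discussion, might look like the sketch below. The triples are plain Python tuples rather than real RDF; every identifier (the "ex:" prefix, node names, property names) is invented here, and treating the second description as a revision of the first is an assumption, not something the example text states.

```python
# Each triple is (subject, predicate, object). All names are hypothetical.
TRIPLES = [
    ("ex:desc1", "ex:taxon", "Ipomoea violacea"),
    ("ex:desc1", "ex:region", "USA"),
    ("ex:desc1", "ex:objectPart", "flower"),
    ("ex:desc1", "ex:character", "colour"),
    ("ex:desc1", "ex:author", "G. Hagedorn"),
    ("ex:desc1", "ex:date", "2006-03-29"),
    ("ex:desc1", "ex:statement", "ex:desc1_a"),
    ("ex:desc1", "ex:statement", "ex:desc1_b"),
    # "frequently dark to light blue": a frequency modifier plus a range.
    ("ex:desc1_a", "ex:frequency", "frequently"),
    ("ex:desc1_a", "ex:rangeFrom", "dark blue"),
    ("ex:desc1_a", "ex:rangeTo", "light blue"),
    # "sometimes bordering on violet": a frequency modifier plus a state.
    ("ex:desc1_b", "ex:frequency", "sometimes"),
    ("ex:desc1_b", "ex:state", "bordering on violet"),
    # Second description; the "revises" link is an assumption of this sketch.
    ("ex:desc2", "ex:revises", "ex:desc1"),
    ("ex:desc2", "ex:author", "Much. Better"),
    ("ex:desc2", "ex:date", "2006-03-30"),
]


def objects(subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def describe(node):
    """Collect all properties of one node into a dict of lists."""
    out = {}
    for s, p, o in TRIPLES:
        if s == node:
            out.setdefault(p, []).append(o)
    return out


statements = [describe(n) for n in objects("ex:desc1", "ex:statement")]
for st in statements:
    print(st)
```

Even this partial version needs intermediate nodes for each modified state, and nothing in the triples themselves says that a statement must have a frequency or that a range needs both ends.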

Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203