[Tdwg-tag] A very simple question stated again.

Tue Mar 28 19:54:40 CEST 2006

Sorry to propose a very complex answer to a very simple question, but 
here it is:

Roger Hyam wrote:

>
> Hi All,
>
> Gregor posted my rather long winded description of confusion about 
> semantics in XML Schema to the list and it may have confused you. It 
> can be summed up with a simple question to which a simple answer is 
> all that suffices.
>
> *"Are the semantics encoded in the XML Schema or in the structure of 
> the XML instance documents that validate against that schema? Is it 
> possible to 'understand' an instance document without reference to the 
> schema?"*
>
> Possible answers are:
>
>    1. *Yes:* you can understand an XML instance document in the
>       absence of a schema it validates against i.e. just from the
>       structure of the elements and the namespaces used.
>    2. *No*: you require the XML Schema to understand the document.
>

I cannot state strongly enough that the answer is NO (number 2) -- you 
must have an XML schema in order to "understand" an instance.  XML is 
"self-documenting" only to humans.  I would venture that almost no one 
uses XML directly in their work.  Instead people who use data often 
collect it and load it into a client application that can then do 
something useful with it (for example, a geographic or scientific 
information system like Arc or Matlab or the GBIF portal).  In this 
scenario, humans aren't consuming the XML, software is.  When a user 
goes to the GBIF portal and requests search results in tab-delimited 
format, they're downloading the results of a piece of software that has 
consumed the XML and produced a text file.

In any case where software must consume XML, it needs to know the 
structure of the XML.  One common way to do this is to use XML-binding 
tools like Castor that create bindings between a programming language 
and an XML document structure.  Given a properly constructed XML Schema, 
these tools create a set of object-oriented classes that can parse and 
"understand" XML instances under that schema.  These classes are used by 
a software application to consume instances of the given schema.  The 
other common method for working with XML in software is to create a 
custom deserializer that "understands" instances of a schema using 
domain-specific information hard-coded by a programmer.  While this 
second option does not directly depend upon the XML Schema at run time, 
the programmer who writes the rules embedded in the deserializer relies 
on the XML Schema to learn what is acceptable.
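
As a rough illustration of the second approach, here is a minimal sketch 
of a hand-written deserializer in Python.  The element names 
(specimen, catalogNumber, collector, taxonConceptRef) and the example 
LSID are invented for illustration and are not taken from any TDWG 
schema.  The point is only that the names the code looks for are the 
programmer's reading of the schema, frozen into the software.

```python
import xml.etree.ElementTree as ET

# Hypothetical specimen instance; these element names are assumptions
# made up for this sketch, not part of any real schema.
SPECIMEN_XML = """
<specimen>
  <catalogNumber>12345</catalogNumber>
  <collector>A. Collector</collector>
  <taxonConceptRef>urn:lsid:example.org:concepts:678</taxonConceptRef>
</specimen>
"""


def parse_specimen(xml_text):
    """Hand-written deserializer for the hypothetical specimen format.

    Every element name below is hard-coded knowledge that the programmer
    took from the schema; the code never reads the schema itself.
    """
    root = ET.fromstring(xml_text.strip())
    if root.tag != "specimen":
        raise ValueError("unexpected root element: " + root.tag)
    return {
        "catalog_number": root.findtext("catalogNumber"),
        "collector": root.findtext("collector"),
        "taxon_concept_ref": root.findtext("taxonConceptRef"),
    }


record = parse_specimen(SPECIMEN_XML)
print(record)
```

If the schema later renames catalogNumber, this code keeps running but 
silently returns None for that field, which is exactly the kind of 
hidden schema dependency described above.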

Both of these approaches depend upon XML Schema for reasons other than 
validation.  Validation, however, is the only function that XML Schema 
was explicitly designed to address.  At heart, XML Schema is a grammar 
for accepting or rejecting documents.  It is not a description of a data 
model.  This raises the question of what it means to "understand" an XML 
instance or an XML Schema.

Roger asks *"Are the semantics encoded in the XML Schema or in the 
structure of the XML instance documents that validate against that schema?"*

I argue that the semantics are embedded in the contents of XML 
instances, and that the XML Schema does not address the semantics at 
all, although it is necessary (though not sufficient) for interpreting 
them.  The XML Schema merely constrains the syntax of an acceptable 
document.  
Syntax is not semantics.  Natural (human) languages are a great deal 
more powerful and expressive than XML, but my point can be illustrated 
with a simple syntactically valid English sentence that makes no 
semantic sense whatsoever: "The moon jumped over the cow".  For another 
example, see Donald's argument about using Darwin Core to exchange stamp 
collection data.

In order for a piece of software to consume XML it must first know the 
syntactic structure of the XML before it can do something useful.  
Simple systems that convert one representation into another without any 
translation (like the GBIF portal when it creates tab-delimited 
representations) don't really require a semantic understanding of the 
data.  However, any sort of analysis tool or data-cleaning tool must be 
smarter and these smart tools can provide a great deal of value to end 
users.

XML requires human intervention in order to be understood.  Because XML 
Schema only constrains the syntax of documents, not the semantics of 
inter-related data objects, programmers must embed domain-specific 
knowledge in software in order to do any non-trivial processing of XML, 
especially of interrelated XML instances defined under multiple schemas.

> This is not a trivial question. The answers may require different 
> approaches to an overall architecture.
>
> Versioning of schemas, for example, becomes irrelevant if the answer 
> is Yes - as the meaning is implicit in the structure you can throw the 
> schema away and not lose anything. XML is 'self describing' so you 
> would think this must be true. The schema is just a useful device to 
> help you construct XML in the correct format.
>
> If the answer is No then we need clear statements about how all 
> instances must always bear links to a permanently retrievable schema - 
> or they become meaningless. We need very tight version control of 
> schemas and a method of linking between the versions so we can track 
> how the meaning has changed. We also need clear statements on what 
> happens when you can validate a document with multiple schemas? Does 
> this imply multiple meanings? Schemas must be archived with any data etc.
>
> If you respond to this message please state a preference for either 1 
> or 2. There is no middle road on this one!
>
At heart the real problem is schema interoperability.  We need 
interoperability both within schemas and across schemas.  When two 
pieces of software exchange data in XML they both need to know the 
structure of the data (its schema) and be assured that they're using 
the same version.  This can be addressed by rigorous schema versioning. 

The difficult problem manifests when we start talking about 
interoperability across schemas (for example across a specimen schema 
and a taxon concept schema).  We can avoid the circular XML Schema 
import problem (which is made much more difficult if we have to strictly 
version schemas) by making references across schema instances with 
GUIDs.  For example, a Specimen schema instance can refer to a TCS 
instance for its identified taxon concept using an LSID.  However, to a 
piece of software that consumes XML based on XML Schema, this LSID is 
simply a string.  A specimen instance that refers to a taxon concept 
might validate just as easily if that LSID were:
1.) an invalid LSID
2.) an LSID pointing to a publication instance (instead of to a taxon 
concept)
3.) a valid LSID pointing to a valid taxon concept
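
To make the three cases concrete, here is a small Python sketch.  It 
assumes the reference is declared as a plain string in the schema, so 
all three values would be accepted by schema validation.  Even if we 
add a pattern check, it only catches case 1.  The regular expression is 
a simplified approximation of the LSID form 
urn:lsid:authority:namespace:object[:revision], and the example LSIDs 
are made up.

```python
import re

# Simplified, assumed approximation of LSID syntax:
#   urn:lsid:<authority>:<namespace>:<object>[:<revision>]
LSID_PATTERN = re.compile(
    r"^urn:lsid:[^:\s]+:[^:\s]+:[^:\s]+(:[^:\s]+)?$", re.IGNORECASE
)


def looks_like_lsid(value):
    """Purely syntactic check; says nothing about what the LSID names."""
    return LSID_PATTERN.match(value) is not None


# Hypothetical values for the three cases listed above.
candidates = {
    "1. invalid LSID": "not-an-lsid",
    "2. LSID of a publication": "urn:lsid:example.org:publications:42",
    "3. LSID of a taxon concept": "urn:lsid:example.org:concepts:678",
}

for label, value in candidates.items():
    print(label, "->", looks_like_lsid(value))
```

Cases 2 and 3 are indistinguishable at this level.  Telling them apart 
requires resolving the LSID and looking at what comes back.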

The problem is that software that consumes instances of different XML 
schemas that are made interoperable by GUIDs must be an order of 
magnitude more intelligent than what we're building now in order to 
"understand" what it is working with.  In order to semantically 
validate the specimen from the above example, the software should first 
validate and parse the specimen instance, then resolve the LSID that is 
encoded in its taxon concept element, fetch the XML metadata for the 
taxon concept and then validate that taxon concept instance.  If the end 
user wants to display the name of the taxon concept along with the rest 
of the data about a specimen, the taxon concept XML would also have to 
be parsed. 
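
The steps above can be sketched in Python as follows.  Everything here 
is a stand-in: the element names are invented, the "resolver" is an 
in-memory dictionary rather than a real LSID resolution service, and 
the schema validation steps are only marked with comments.  It is meant 
to show the shape of the work, not a real implementation.

```python
import xml.etree.ElementTree as ET

# Stand-in for LSID resolution (an assumption for this sketch): maps an
# LSID to the XML that a real resolver might return for it.
FAKE_RESOLVER = {
    "urn:lsid:example.org:concepts:678":
        "<TaxonConcept><Name>Puma concolor</Name></TaxonConcept>",
    "urn:lsid:example.org:publications:42":
        "<Publication><Title>Some paper</Title></Publication>",
}


def make_specimen(ref):
    """Build a hypothetical specimen instance that refers to `ref`."""
    return ("<specimen><taxonConceptRef>%s</taxonConceptRef></specimen>"
            % ref)


def check_taxon_reference(specimen_xml):
    # Step 1: parse the specimen (schema validation would happen here).
    specimen = ET.fromstring(specimen_xml)
    lsid = specimen.findtext("taxonConceptRef")
    if lsid is None:
        return "no taxon concept reference"
    # Step 2: resolve the LSID and fetch the referenced document.
    target_xml = FAKE_RESOLVER.get(lsid)
    if target_xml is None:
        return "unresolvable"
    # Step 3: parse the target (it would be validated against its own
    # schema here) and confirm it really is a taxon concept.
    target = ET.fromstring(target_xml)
    if target.tag != "TaxonConcept":
        return "wrong type: " + target.tag
    # Step 4: only now can the name be shown next to the specimen data.
    return "ok: " + target.findtext("Name", default="(unnamed)")


for ref in ("not-an-lsid",
            "urn:lsid:example.org:publications:42",
            "urn:lsid:example.org:concepts:678"):
    print(ref, "->", check_taxon_reference(make_specimen(ref)))
```

Note that step 3 forces the specimen-consuming code to know the taxon 
concept format as well, even though the two schemas never import each 
other.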

So, to consume instances of different schemas that are interrelated with 
GUIDs, the software has to know about each schema involved (specifically 
each version of each schema).  What this means in practice is that if a 
new schema were introduced, or a new version of an existing schema came 
into production, every piece of software that consumes it or related 
schemas must be updated.  This is a software maintenance nightmare.
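
A small Python sketch of why this hurts.  It assumes, purely for 
illustration, that each schema version has its own namespace URI and 
that version 1.1 renamed one element.  The URIs and element names are 
hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_specimen_v1_0(root, ns):
    # Hypothetical version 1.0 calls the field catalogNumber.
    return {"catalog_number": root.findtext("{%s}catalogNumber" % ns)}


def parse_specimen_v1_1(root, ns):
    # Hypothetical version 1.1 renamed it to catalogNo.
    return {"catalog_number": root.findtext("{%s}catalogNo" % ns)}


# One entry per schema version the software knows about.  Every new
# schema or version means another entry and another parser here.
PARSERS = {
    "http://example.org/specimen/1.0": parse_specimen_v1_0,
    "http://example.org/specimen/1.1": parse_specimen_v1_1,
}


def namespace_of(root):
    """Return the namespace URI of an element, or '' if it has none."""
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""


def consume(xml_text):
    root = ET.fromstring(xml_text)
    ns = namespace_of(root)
    parser = PARSERS.get(ns)
    if parser is None:
        raise LookupError("no parser registered for schema " + repr(ns))
    return parser(root, ns)


doc = ('<specimen xmlns="http://example.org/specimen/1.1">'
       '<catalogNo>12345</catalogNo></specimen>')
print(consume(doc))
```

An instance published under a version that is not in the table cannot 
be consumed until someone ships a software update, and the same is true 
for every schema reachable through a GUID.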

What I'm trying to point out is that using GUIDs does not break 
dependencies between XML Schemas (or versions of the same schema).  It 
merely pushes the problem to a higher level in the process of consuming 
XML.  Can anyone propose a real solution to this problem using XML Schema?

-Steve

> All the best,
>
> Roger
>
>-- 
>
>-------------------------------------
> Roger Hyam
> Technical Architect
> Taxonomic Databases Working Group
>-------------------------------------
> http://www.tdwg.org
> roger at tdwg.org
> +44 1578 722782
>-------------------------------------