Sorry to propose a very complex answer to a very simple question, but
here it is:
Roger Hyam wrote:
Hi All,
Gregor posted my rather long-winded description of confusion about
semantics in XML Schema to the list and it may have confused you. It
can be summed up with a simple question to which a simple answer is
all that suffices.
*"Are the semantics encoded in the XML Schema or in the structure of
the XML instance documents that validate against that schema? Is it
possible to 'understand' an instance document without reference to the
schema?"*
Possible answers are:
1. *Yes:* you can understand an XML instance document in the
absence of a schema it validates against i.e. just from the
structure of the elements and the namespaces used.
2. *No*: you require the XML Schema to understand the document.
I cannot state strongly enough that the answer is NO (number 2) -- you
must have an XML schema in order to "understand" an instance. XML is
"self-documenting" only to humans. I would venture that almost no one
uses XML directly in their work. Instead people who use data often
collect it and load it into a client application that can then do
something useful with it (for example, a geographic or scientific
information system like Arc or Matlab or the GBIF portal). In this
scenario, humans aren't consuming the XML, software is. When a user
goes to the GBIF portal and requests search results in tab-delimited
format, they're downloading the results of a piece of software that has
consumed the XML and produced a text file.
In any case where software must consume XML, it needs to know the
structure of the XML. One common way to do this is to use XML-binding
tools like Castor that create bindings between a programming language
and an XML document structure. Given a properly constructed XML Schema,
these tools create a set of object-oriented classes that can parse and
"understand" XML instances under that schema. These classes are used by
a software application to consume instances of the given schema. The
other common method for working with XML in software is to create a
custom deserializer that "understands" instances of a schema through
domain-specific rules hard-coded by a programmer. While this
second option does not directly depend upon the XML Schema, the
programmer who writes the rules embedded in the deserializer relies on
the XML Schema to determine what is acceptable.
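To make the second approach concrete, here is a minimal sketch of a hand-written deserializer in Python. The namespace, the element names (catalogNumber, taxonConceptRef) and the Specimen class are all invented for illustration; they are not taken from any real TDWG schema. The point is only that the programmer's reading of the schema ends up hard-coded in the software.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical namespace and element names, invented for this example.
NS = {"sp": "http://example.org/specimen/1.0"}


@dataclass
class Specimen:
    catalog_number: str
    taxon_concept_ref: str


def parse_specimen(xml_text: str) -> Specimen:
    """Hand-written deserializer: the element names looked up below are
    the programmer's reading of the schema, hard-coded into the software."""
    root = ET.fromstring(xml_text)
    catalog = root.findtext("sp:catalogNumber", namespaces=NS)
    taxon = root.findtext("sp:taxonConceptRef", namespaces=NS)
    if catalog is None or taxon is None:
        raise ValueError("document does not have the structure this code expects")
    return Specimen(catalog.strip(), taxon.strip())


DOC = """<sp:specimen xmlns:sp="http://example.org/specimen/1.0">
  <sp:catalogNumber>ABC-123</sp:catalogNumber>
  <sp:taxonConceptRef>urn:lsid:example.org:concepts:42</sp:taxonConceptRef>
</sp:specimen>"""

specimen = parse_specimen(DOC)
print(specimen)
```

Nothing in this code reads the schema at run time, yet it is useless for any document whose structure differs from what the schema (as the programmer understood it) prescribes.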
Both of these approaches depend upon XML Schema for reasons other than
validation. Validation however is the only function that XML Schema was
explicitly designed to address. At heart XML Schema is a grammar for
accepting or rejecting documents. It is not a description of a data
model. This raises the question of what it means to "understand" an XML
instance or an XML Schema.
Roger asks *"Are the semantics encoded in the XML Schema or in the
structure of the XML instance documents that validate against that schema?"*
I argue that the semantics are embedded in the contents of XML
instances. The XML Schema does not address the semantics at all,
although it is necessary (though not sufficient) for interpreting them.
The XML Schema merely constrains the syntax of an acceptable document.
Syntax is not semantics. Natural (human) languages are a great deal
more powerful and expressive than XML, but my point can be illustrated
with a simple syntactically valid English sentence that makes no
semantic sense whatsoever: "The moon jumped over the cow". For another
example, see Donald's argument about using Darwin Core to exchange stamp
collection data.
In order for a piece of software to consume XML it must first know the
syntactic structure of the XML before it can do something useful.
Simple systems that convert one representation into another without any
translation (like the GBIF portal when it creates tab-delimited
representations) don't really require a semantic understanding of the
data. However, any sort of analysis tool or data-cleaning tool must be
smarter and these smart tools can provide a great deal of value to end
users.
XML requires human intervention in order to be understood. Because it
only constrains the syntax of documents, not the semantics of
interrelated data objects, programmers must embed domain-specific
knowledge in software in order to do any non-trivial processing of XML,
especially of interrelated XML instances defined under multiple schemas.
This is not a trivial question. The answers may require different
approaches to an overall architecture.
Versioning of schemas, for example, becomes irrelevant if the answer
is Yes - as the meaning is implicit in the structure you can throw the
schema away and not lose anything. XML is 'self describing' so you
would think this must be true. The schema is just a useful device to
help you construct XML in the correct format.
If the answer is No then we need clear statements about how all
instances must always bear links to a permanently retrievable schema -
or they become meaningless. We need very tight version control of
schemas and a method of linking between the versions so we can track
how the meaning has changed. We also need clear statements on what
happens when you can validate a document with multiple schemas. Does
this imply multiple meanings? Schemas must be archived with any data, etc.
If you respond to this message please state a preference for either 1
or 2. There is no middle road on this one!
At heart the real problem is schema interoperability. We need
interoperability both within schemas and across schemas. When two
pieces of software exchange data in XML they both need to know the
structure of the data (its schema) and be assured that they're using
the same version. This can be addressed by rigorous schema versioning.
The difficult problem manifests when we start talking about
interoperability across schemas (for example across a specimen schema and
a taxon concept schema). We can avoid the circular XML Schema import
problem (which is made much more difficult if we have to strictly
version schemas) by making references across schema instances with
GUIDs. For example, a Specimen schema instance can refer to a TCS
instance for its identified taxon concept using an LSID. However, to a
piece of software that consumes XML based on XML Schema, this LSID is
simply a string. A specimen instance that refers to a taxon concept
might validate just as easily if that LSID were:
1.) an invalid LSID
2.) an LSID pointing to a publication instance (instead of to a taxon
concept)
3.) a valid LSID pointing to a valid taxon concept
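A small sketch of the three cases, assuming the schema types the reference as a plain string. The LSID values are made up, and the regular expression is a simplified reading of the urn:lsid:authority:namespace:object[:revision] form, not a full implementation of the LSID specification.

```python
import re

# Simplified LSID shape: urn:lsid:<authority>:<namespace>:<object>[:<revision>]
LSID_RE = re.compile(
    r"^urn:lsid:[^:\s]+:[^:\s]+:[^:\s]+(?::[^:\s]+)?$", re.IGNORECASE
)


def passes_string_check(value):
    # Roughly all that a schema typing the field as a plain string can check.
    return isinstance(value, str)


def looks_like_lsid(value):
    # A syntactic check only: it says nothing about what the LSID points to.
    return LSID_RE.match(value) is not None


# Made-up values standing in for the three cases listed above.
candidates = {
    "invalid LSID": "not-an-lsid",
    "LSID of a publication": "urn:lsid:example.org:publications:7",
    "LSID of a taxon concept": "urn:lsid:example.org:concepts:42",
}

for label, value in candidates.items():
    print(label, passes_string_check(value), looks_like_lsid(value))
```

Even the stricter syntactic check only rules out the first case. Telling the second from the third requires resolving the LSID and looking at what comes back.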
The problem is that the software that consumes instances of different
XML schemas that are made interoperable by GUIDs must be an order of
magnitude more intelligent than what we're building now in order to
"understand" what it is working with. In order to semantically
validate the specimen from the above example, the software should first
validate and parse the specimen instance, then resolve the LSID which is
encoded in a taxon concept element, fetch the XML metadata for the taxon
concept and then validate that taxon concept instance. If the end user
wants to display the name of the taxon concept along with the rest of
the data about a specimen, the taxon concept XML would also have to be
parsed.
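The steps just described can be sketched roughly as follows. Everything here is hypothetical: the element names are not real TCS or specimen schema elements, and the resolver is a dictionary standing in for a real LSID resolution service. Validation against an XML Schema is left out because the Python standard library does not provide it; only the parsing and cross-checking steps are shown.

```python
import xml.etree.ElementTree as ET

# Stand-in for an LSID resolver; a real one would fetch data over the network.
FAKE_REGISTRY = {
    "urn:lsid:example.org:concepts:42":
        "<TaxonConcept><Name>Puma concolor</Name></TaxonConcept>",
    "urn:lsid:example.org:publications:7":
        "<Publication><Title>Some paper</Title></Publication>",
}


def resolve_lsid(lsid):
    """Return the XML for an LSID, or raise LookupError if it is unknown."""
    try:
        return FAKE_REGISTRY[lsid]
    except KeyError:
        raise LookupError(f"cannot resolve {lsid!r}")


def taxon_name_for(specimen_xml):
    specimen = ET.fromstring(specimen_xml)          # parse the specimen
    lsid = specimen.findtext("taxonConceptRef")     # pull out the reference
    if lsid is None:
        raise ValueError("specimen has no taxonConceptRef element")
    target = ET.fromstring(resolve_lsid(lsid.strip()))  # resolve and parse
    if target.tag != "TaxonConcept":                # is it the right kind of thing?
        raise ValueError(f"{lsid.strip()} points to a {target.tag}, not a taxon concept")
    return target.findtext("Name")                  # needed only for display


def make_specimen(lsid):
    return f"<Specimen><taxonConceptRef>{lsid}</taxonConceptRef></Specimen>"


print(taxon_name_for(make_specimen("urn:lsid:example.org:concepts:42")))
```

Note that taxon_name_for has to know the structure of both document types, which is exactly the dependency that the GUID was supposed to remove.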
So, to consume instances of different schemas that are interrelated with
GUIDs, the software has to know about each schema involved (specifically
each version of each schema). What this means in practice is that if a
new schema were introduced, or a new version of an existing schema came
into production, every piece of software that consumes it or related
schemas must be updated. In practice this is a software maintenance
nightmare.
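One way to picture the maintenance burden is a consumer that dispatches on the namespace of the root element. This is only a sketch under invented assumptions: the namespaces, the element names and the idea that version 2.0 renamed an element are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces for two versions of a made-up specimen schema.
V1 = "http://example.org/specimen/1.0"
V2 = "http://example.org/specimen/2.0"


def read_v1(root):
    return root.findtext(f"{{{V1}}}catalogNumber")


def read_v2(root):
    # Assumption for the example: version 2.0 renamed the element.
    return root.findtext(f"{{{V2}}}catalogueNumber")


# Every schema version the software can consume needs an entry here.
HANDLERS = {
    f"{{{V1}}}specimen": read_v1,
    f"{{{V2}}}specimen": read_v2,
}


def consume(xml_text):
    root = ET.fromstring(xml_text)
    handler = HANDLERS.get(root.tag)  # ElementTree tags look like "{namespace}name"
    if handler is None:
        raise NotImplementedError(f"no handler for {root.tag}; software update needed")
    return handler(root)


doc_v1 = f'<specimen xmlns="{V1}"><catalogNumber>ABC-123</catalogNumber></specimen>'
doc_v2 = f'<specimen xmlns="{V2}"><catalogueNumber>ABC-123</catalogueNumber></specimen>'
print(consume(doc_v1), consume(doc_v2))
```

A document under a version 3.0 namespace would fall through to NotImplementedError until someone writes and deploys a new handler, and the same is true for every other consumer of that schema.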
What I'm trying to point out is that using GUIDs does not break
dependencies between XML Schemas (or versions of the same schema), it
merely pushes the problem to a higher level in the process of consuming
XML. Can anyone propose a real solution to this problem using XML Schema?
-Steve
All the best,
Roger
--
-------------------------------------
Roger Hyam
Technical Architect
Taxonomic Databases Working Group
-------------------------------------
http://www.tdwg.org
roger@tdwg.org
+44 1578 722782
-------------------------------------
------------------------------------------------------------------------
_______________________________________________
Tdwg-tag mailing list
Tdwg-tag@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org