Hi Steve,
Great post. You gave a definitive NO but then further down your answer said YES!
If I can quote one bit:
"The XML Schema merely constrains the syntax of an acceptable document. Syntax is not semantics."
I think this sums it up. At the moment we basically don't do semantics; we only do syntax.
Knowing the syntax of two languages (XML Schema applications) does not get us anywhere in trying to use the two languages together. We have to have some mechanism to link the meaning of the words between the languages. You finish up with more or less the same question that I was coming to from a different direction: how do we do this in XML Schema?
So maybe there is no Yes/No answer to the question....
Thanks,
Roger
Steven Perry wrote:
Sorry to propose a very complex answer to a very simple question, but here it is:
Roger Hyam wrote:
Hi All,
Gregor posted my rather long-winded description of confusion about semantics in XML Schema to the list and it may have confused you. It can be summed up with a simple question, to which a simple answer should suffice.
*"Are the semantics encoded in the XML Schema or in the structure of the XML instance documents that validate against that schema? Is it possible to 'understand' an instance document without reference to the schema?"*
Possible answers are:
- *Yes:* you can understand an XML instance document in the absence of a schema it validates against i.e. just from the structure of the elements and the namespaces used.
- *No*: you require the XML Schema to understand the document.
I cannot state strongly enough that the answer is NO (number 2) -- you must have an XML schema in order to "understand" an instance. XML is "self-documenting" only to humans. I would venture that almost no one uses XML directly in their work. Instead people who use data often collect it and load it into a client application that can then do something useful with it (for example, a geographic or scientific information system like Arc or Matlab or the GBIF portal). In this scenario, humans aren't consuming the XML, software is. When a user goes to the GBIF portal and requests search results in tab-delimited format, they're downloading the results of a piece of software that has consumed the XML and produced a text file.
In any case where software must consume XML, it needs to know the structure of the XML. One common way to do this is to use XML-binding tools like Castor that create bindings between a programming language and an XML document structure. Given a properly constructed XML Schema, these tools create a set of object-oriented classes that can parse and "understand" XML instances under that schema. These classes are used by a software application to consume instances of the given schema. The other common method for working with XML in software is to create a custom deserializer that "understands" instances of a schema given hard-coded domain-specific information by a programmer. While this second option does not directly depend upon the XML Schema, the programmer who creates the rules embedded in the deserializer uses the XML Schema as the source of the rules about what is acceptable.
Both of these approaches depend upon XML Schema for reasons other than validation. Validation, however, is the only function that XML Schema was explicitly designed to address. At heart XML Schema is a grammar for accepting or rejecting documents. It is not a description of a data model. This raises the question of what it means to "understand" an XML instance or an XML Schema.
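As a rough illustration of the second approach, here is a minimal sketch of a hand-written deserializer in Python. The element names (`Specimen`, `CatalogNumber`, `Collector`, `Latitude`, `Longitude`) are hypothetical and are not taken from any real TDWG schema. The point is only that the rules about what is acceptable live in the programmer's code, copied by hand from whatever the schema says.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Specimen:
    catalog_number: str
    collector: str
    latitude: float
    longitude: float


def deserialize_specimen(xml_text: str) -> Specimen:
    """Turn a (hypothetical) <Specimen> document into a Specimen object."""
    root = ET.fromstring(xml_text)
    if root.tag != "Specimen":
        raise ValueError(f"expected <Specimen>, got <{root.tag}>")

    def required(name: str) -> str:
        # Hard-coded rule: every one of these child elements must be present.
        value = root.findtext(name)
        if value is None or not value.strip():
            raise ValueError(f"missing required element <{name}>")
        return value.strip()

    return Specimen(
        catalog_number=required("CatalogNumber"),
        collector=required("Collector"),
        # Hard-coded rule: coordinates are decimal numbers.
        latitude=float(required("Latitude")),
        longitude=float(required("Longitude")),
    )


EXAMPLE = """
<Specimen>
  <CatalogNumber>ABC-123</CatalogNumber>
  <Collector>R. Hyam</Collector>
  <Latitude>55.95</Latitude>
  <Longitude>-3.19</Longitude>
</Specimen>
"""

specimen = deserialize_specimen(EXAMPLE)
print(specimen)
```

If the schema changes, nothing in this code notices; someone has to read the new schema and edit the deserializer by hand.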
Roger asks *"Are the semantics encoded in the XML Schema or in the structure of the XML instance documents that validate against that schema?"*
I argue that the semantics are embedded in the contents of XML instances, and that the XML Schema does not address the semantics at all, although it is necessary (though not sufficient) for interpreting them. The XML Schema merely constrains the syntax of an acceptable document. Syntax is not semantics. Natural (human) languages are a great deal more powerful and expressive than XML, but my point can be illustrated with a simple syntactically valid English sentence that makes no semantic sense whatsoever: "The moon jumped over the cow". For another example, see Donald's argument about using Darwin Core to exchange stamp collection data.
In order for a piece of software to consume XML it must first know the syntactic structure of the XML before it can do something useful. Simple systems that convert one representation into another without any translation (like the GBIF portal when it creates tab-delimited representations) don't really require a semantic understanding of the data. However, any sort of analysis tool or data-cleaning tool must be smarter and these smart tools can provide a great deal of value to end users.
XML requires human intervention in order to be understood. Because it only constrains the syntax of documents, not the semantics of inter-related data objects, programmers must embed domain-specific knowledge in software in order to do any non-trivial processing of XML, especially of interrelated XML instances defined under multiple schemas.
This is not a trivial question. The answers may require different approaches to an overall architecture.
Versioning of schemas, for example, becomes irrelevant if the answer is Yes: as the meaning is implicit in the structure, you can throw the schema away and not lose anything. XML is 'self describing' so you would think this must be true. The schema is just a useful device to help you construct XML in the correct format.
If the answer is No then we need clear statements about how all instances must always bear links to a permanently retrievable schema, or they become meaningless. We need very tight version control of schemas and a method of linking between the versions so we can track how the meaning has changed. We also need clear statements on what happens when you can validate a document with multiple schemas. Does this imply multiple meanings? Schemas must be archived with any data etc.
If you respond to this message please state a preference for either 1 or 2. There is no middle road on this one!
At heart the real problem is schema interoperability. We need interoperability both within schemas and across schemas. When two pieces of software exchange data in XML they both need to know the structure of the data (its schema) and be assured that they're using the same version. This can be addressed by rigorous schema versioning.
The difficult problem manifests when we start talking about interoperability across schemas (for example across a specimen schema and a taxon concept schema). We can avoid the circular XML Schema import problem (which is made much more difficult if we have to strictly version schemas) by making references across schema instances with GUIDs. For example, a Specimen schema instance can refer to a TCS instance for its identified taxon concept using an LSID. However, to a piece of software that consumes XML based on XML Schema, this LSID is simply a string. A specimen instance that refers to a taxon concept might validate just as easily if that LSID were:
1. an invalid LSID
2. an LSID pointing to a publication instance (instead of to a taxon concept)
3. a valid LSID pointing to a valid taxon concept
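The three cases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `specimen` and `taxonConceptRef` element names and the `example.org` LSIDs are made up, and `structurally_valid` is a hand-written stand-in for schema validation of an element typed as a plain string, not a real XML Schema validator. The pattern follows the general LSID form `urn:lsid:<authority>:<namespace>:<object>[:<revision>]`.

```python
import re
import xml.etree.ElementTree as ET

# General LSID shape: urn:lsid:<authority>:<namespace>:<object>[:<revision>]
LSID_PATTERN = re.compile(
    r"^urn:lsid:[^:\s]+:[^:\s]+:[^:\s]+(:[^:\s]+)?$", re.IGNORECASE
)


def make_specimen(ref: str) -> str:
    """Build a hypothetical specimen instance carrying one reference."""
    return f"<specimen><taxonConceptRef>{ref}</taxonConceptRef></specimen>"


def structurally_valid(specimen_xml: str) -> bool:
    """Stand-in for schema validation where the reference is just a string."""
    root = ET.fromstring(specimen_xml)
    ref = root.find("taxonConceptRef")
    return (
        root.tag == "specimen"
        and ref is not None
        and bool((ref.text or "").strip())
    )


def looks_like_lsid(value: str) -> bool:
    """Purely syntactic check; says nothing about what the LSID points to."""
    return LSID_PATTERN.match(value) is not None


CASES = {
    "invalid": "not-an-lsid",
    "publication": "urn:lsid:example.org:publications:42",
    "taxon": "urn:lsid:example.org:taxonconcepts:42",
}

results = {
    name: (structurally_valid(make_specimen(value)), looks_like_lsid(value))
    for name, value in CASES.items()
}

for name, (structure_ok, lsid_ok) in results.items():
    print(f"{name}: structure ok={structure_ok}, LSID syntax ok={lsid_ok}")
```

All three instances pass the structural check. A tighter pattern on the string would catch the first case, but nothing at this level can tell the publication reference from the taxon concept reference.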
The problem is that the software that consumes instances of different XML schemas that are made interoperable by GUIDs must be an order of magnitude more intelligent than what we're building now in order to "understand" what it is working with. In order to semantically validate the specimen from the above example, the software should first validate and parse the specimen instance, then resolve the LSID which is encoded in a taxon concept element, fetch the XML metadata for the taxon concept and then validate that taxon concept instance. If the end user wants to display the name of the taxon concept along with the rest of the data about a specimen, the taxon concept XML would also have to be parsed.
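The sequence of steps might look roughly like the following Python sketch. Everything concrete in it is an assumption made for illustration: the element names, the `example.org` LSIDs, and especially `REGISTRY`, which is an in-memory dictionary standing in for a real LSID resolution service. Schema validation of each document is reduced here to parsing plus a check of the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an LSID resolver: LSID -> XML metadata document.
REGISTRY = {
    "urn:lsid:example.org:taxonconcepts:42":
        "<TaxonConcept><Name>Puma concolor</Name></TaxonConcept>",
    "urn:lsid:example.org:publications:7":
        "<Publication><Title>Some paper</Title></Publication>",
}


def make_specimen(ref: str) -> str:
    return f"<Specimen><TaxonConceptRef>{ref}</TaxonConceptRef></Specimen>"


def semantic_check(specimen_xml: str) -> tuple[bool, str]:
    """Return (ok, taxon name or reason for failure)."""
    # Step 1: parse the specimen instance (schema validation would go here).
    specimen = ET.fromstring(specimen_xml)
    ref = (specimen.findtext("TaxonConceptRef") or "").strip()
    if not ref:
        return False, "no taxon concept reference"

    # Step 2: resolve the LSID and fetch the referenced metadata.
    target_xml = REGISTRY.get(ref)
    if target_xml is None:
        return False, f"cannot resolve {ref}"

    # Step 3: parse the referenced document and check that it is the right kind.
    target = ET.fromstring(target_xml)
    if target.tag != "TaxonConcept":
        return False, f"{ref} points to a <{target.tag}>, not a taxon concept"

    # Step 4: extract the name so it can be shown alongside the specimen.
    name = target.findtext("Name")
    if name is None:
        return False, "taxon concept has no name"
    return True, name


good = semantic_check(make_specimen("urn:lsid:example.org:taxonconcepts:42"))
wrong_kind = semantic_check(make_specimen("urn:lsid:example.org:publications:7"))
dangling = semantic_check(make_specimen("urn:lsid:example.org:taxonconcepts:999"))

for result in (good, wrong_kind, dangling):
    print(result)
```

Note that step 3 requires the specimen-consuming code to know what a taxon concept document looks like, which is exactly the cross-schema dependency under discussion.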
So, to consume instances of different schemas that are interrelated with GUIDs, the software has to know about each schema involved (specifically each version of each schema). What this means in practice is that if a new schema were introduced, or a new version of an existing schema came into production, every piece of software that consumes it or related schemas must be updated. In practice this is a software maintenance nightmare.
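A small Python sketch of why this hurts, assuming (purely for illustration) that each schema version has its own namespace URI under `example.org` and that the consuming software keeps a table of readers keyed by the namespaced root element:

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces, one per schema version.
SPECIMEN_V1 = "http://example.org/specimen/1.0"
TCS_V1 = "http://example.org/tcs/1.0"


def read_specimen_v1(root):
    return {"kind": "specimen", "id": root.findtext(f"{{{SPECIMEN_V1}}}Id")}


def read_taxon_concept_v1(root):
    return {"kind": "taxon concept", "name": root.findtext(f"{{{TCS_V1}}}Name")}


# One reader per (schema, version) the software has been taught about.
READERS = {
    f"{{{SPECIMEN_V1}}}Specimen": read_specimen_v1,
    f"{{{TCS_V1}}}TaxonConcept": read_taxon_concept_v1,
}


def consume(xml_text: str) -> dict:
    root = ET.fromstring(xml_text)
    reader = READERS.get(root.tag)
    if reader is None:
        # A schema or version nobody has written a reader for yet.
        return {"kind": "unsupported", "tag": root.tag}
    return reader(root)


old = consume(
    '<Specimen xmlns="http://example.org/specimen/1.0"><Id>S-1</Id></Specimen>'
)
new = consume(
    '<Specimen xmlns="http://example.org/specimen/2.0"><Id>S-1</Id></Specimen>'
)
print(old)
print(new)
```

The version 2.0 document carries the same data, but the software cannot use it until someone adds a new entry to `READERS`, and the same is true of every other application that follows references into that schema.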
What I'm trying to point out is that using GUIDs does not break dependencies between XML Schemas (or versions of the same schema); it merely pushes the problem to a higher level in the process of consuming XML. Can anyone propose a real solution to this problem using XML Schema?
-Steve
All the best,
Roger
--
Roger Hyam Technical Architect Taxonomic Databases Working Group
http://www.tdwg.org roger@tdwg.org
+44 1578 722782
Tdwg-tag mailing list Tdwg-tag@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org