A very simple question stated again.
Hi All,
Gregor posted my rather long-winded description of my confusion about semantics in XML Schema to the list, and it may have confused you. It can be summed up in a simple question to which a simple answer should suffice.
*"Are the semantics encoded in the XML Schema or in the structure of the XML instance documents that validate against that schema? Is it possible to 'understand' an instance document without reference to the schema?"*
Possible answers are:
1. *Yes:* you can understand an XML instance document in the absence of a schema it validates against, i.e. just from the structure of the elements and the namespaces used.
2. *No:* you require the XML Schema to understand the document.
This is not a trivial question. The answers may require different approaches to an overall architecture.
Versioning of schemas, for example, becomes irrelevant if the answer is Yes - as the meaning is implicit in the structure, you can throw the schema away and not lose anything. XML is 'self describing', so you would think this must be true. The schema is just a useful device to help you construct XML in the correct format.
If the answer is No, then we need clear statements about how all instances must always bear links to a permanently retrievable schema - or they become meaningless. We need very tight version control of schemas and a method of linking between versions so we can track how the meaning has changed. We also need clear statements on what happens when you can validate a document against multiple schemas - does this imply multiple meanings? Schemas must be archived with any data, etc.
If you respond to this message please state a preference for either 1 or 2. There is no middle road on this one!
All the best,
Roger
"Are the semantics encoded in the XML Schema or in the structure of the XML instance documents that validate against that schema? Is it possible to 'understand' an instance document without reference to the schema?"
Possible answers are:
1. Yes: you can understand an XML instance document in the absence of a schema it validates against, i.e. just from the structure of the elements and the namespaces used.
2. No: you require the XML Schema to understand the document.
Roger, If you postulate that the instance document is valid against the schema, and that the element and attribute names are meaningful to the reader (a human, or software written by a human who understands their meaning), then the only additional semantics the schema could provide would be in the annotations/documentation, if any exist in the schema.
I'm not entirely sure what you include in [data] "structure", but if you only mean concepts such as tuples, trees, sets, lists, bags, etc., then I would disagree that semantics are encoded substantially in data structure (of the XML instance doc or any other record). It is true that without proper structure, semantics cannot be encoded, but I think semantics are encoded predominantly in class/element-attribute names and any referenced documentation (i.e., natural language). If you replace meaningful names with surrogate keys (e.g., integers) and thereby obscure any meaning conveyed by the names, then the instance document would lose a lot of its meaning.
I'm not exactly sure how this relates to the earlier discussion about XML Schema, RDF, and more powerful modeling methodologies like UML, but I hope it helps.
Cheers,
-Stan
Hi Stan,
Can I take that as a Yes then? Or is it a No?
Concrete example:
Take this instance document.
<?xml version="1.0" encoding="UTF-8"?>
<ExampleDataSet xmlns="http://example.org/specimens#">
  <Specimen>
    <Collector>
      <Name>John Doe</Name>
    </Collector>
  </Specimen>
</ExampleDataSet>
Is John Doe a person or a research vessel?
If the document validates against this schema:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified"
           targetNamespace="http://example.org/specimens#"
           xmlns:specimens="http://example.org/specimens#">
  <xs:element name="ExampleDataSet">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="specimens:Specimen"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="Specimen">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Collector" type="specimens:personType"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:complexType name="personType">
    <xs:sequence>
      <xs:element name="Name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
then it looks like John Doe is a person.
Without looking at the schema John Doe could as easily have been a team of people or an expedition or a sampling machine etc. We can change the meaning of the instance document depending on the schema it validates against - if we say meaning is in the schema - and there are many schemas that this document could validate against.
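To make that concrete, here is a sketch of one such alternative schema fragment - only the changed declarations are shown, and the vesselType name is invented for illustration; the same instance document would validate unchanged:

<!-- Inside the Specimen declaration, Collector is simply retyped: -->
<xs:element name="Collector" type="specimens:vesselType"/>

<!-- Structurally identical to personType, but now intended to mean a research vessel: -->
<xs:complexType name="vesselType">
  <xs:sequence>
    <xs:element name="Name" type="xs:string"/>
  </xs:sequence>
</xs:complexType>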
This is the reason for my question. If we interpret meaning from the type hierarchy in XML Schema then we are stuck with single inheritance (good in Java but maybe not so hot in data modeling). It also means that we hard-link structure to meaning. It is very difficult for someone to come along and extend this schema, saying "I have an element and my element represents a collector that isn't a person", because the notion of collector is hard-coded to the structure for representing a person. They can't abstractly say "This machine is a type of collector".
Another way to imply meaning is through namespaces. The element http://example.org/specimens#Collector could resolve to a document that tells us what we mean by collector. Then we wouldn't have to worry so much about the 'structure' of the document but about the individual meanings of elements. We could still use XML Schema to encode useful structures but the meanings of elements would come from the namespace. (And I didn't even mention that this is how RDF does it - oh bother - now I have...).
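For illustration only, the namespace document that http://example.org/specimens#Collector resolves to could be a small piece of RDF Schema along these lines (the wording of the definition is invented here):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdfs:Class rdf:about="http://example.org/specimens#Collector">
    <rdfs:label>Collector</rdfs:label>
    <!-- The human-readable definition is what actually carries the meaning. -->
    <rdfs:comment>The agent - a person, a team, a vessel or a machine -
      that gathered the specimen.</rdfs:comment>
  </rdfs:Class>
</rdf:RDF>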
My central question is how we map between existing and future schemas. If we can't say where the meaning is encoded in our current schemas then we can't even start the process.
All the best,
Roger
Ignoring the often observed cumbersomeness of versioning in XML Schema, your example seems a bit of a red herring; to me it serves more to support a claim---with which I agree---that it is *easier* in RDF than in XML Schema to clarify relations, and easier in XML Schema to fail to do so. For example, were it the *intent* to ensure that Collector isA Human, it is certainly possible to do so in RDF. This then would require an extender to introduce a new superclass, say PseudoCollector, along with all the baggage(?) of enforcing the properties it shares with the subclass. I don't know enough about DL to assert this with any confidence, but allowing arbitrary superclassing also sounds to me like it might cause pain to reasoners.
I take the gist of the previously posted reference to the final pages of http://www.omg.org/docs/ad/05-08-01.pdf to be that even in a modeling tool *more expressive* than OWL, there is likely to remain the embedding of semantics in naming conventions---a cranky, but somewhat successful, mechanism that is part of what Stan seems to be exploring.
To me, the issue of pain to reasoning engines is not small. To my mind, machine reasoning is the biggest motivation for considering RDF-based representations, and it is well understood in the research community that it is quite easy for this utility to vanish into the thin air of exponential time or other complexities. I am reminded of the Feb 15 posting by Steve Perry, which contained the scary (to me) sentence: "We mostly use text editors for developing ontologies because we've occasionally found Protege to be unstable with large complicated OWL models." (*)
Bob
(*) In fairness to Steve, who is far more qualified than I am in these matters, on Feb 20 he posted a rather detailed analysis that makes me question my belief about the main utility of RDF, where he writes:
"For the same reason I think the primary use case is not inference over OWL-described RDF, but search over flexible RDF-Schema described data models. I personally think that RDF might make some use cases, especially the merge case, easier to handle. So I'd like to see further discussion of the use cases above for both XML Schema and RDF."
Alas, my finding his arguments convincing does nothing to assuage my terrors. If biologist-friendly OWL tools are lacking for non-toy ontology development, it is likely that biologist-friendly tools for RDF/RDFS are not even on the production-quality horizon.
p.s.
I vote "Sometimes. And maybe often enough it is not a problem."
[MetaVote: I needed to extend your voting ontology to cover what I believe at the moment. :-) ]
Hi Bob,
What tools are you using with SDD and the other UBIF based schemas?
Thanks,
Roger
For producing tools we have been fans of Castor, but it has some flaws around key/keyref that force us to do some hand coding that shouldn't be necessary. We have a new P2P-based collaborative object editing project---especially aimed at image annotation---in which we are recoding a hand-coded prototype C# SDD instance document editor into Java, and our preliminary opinion is that Apache XMLBeans is good for the problem of making schema-driven (== schema independent) tools. Our generator that produces programs XXX2SDD and SDD2XXX (for data encoded by any reasonable descriptive-data XML schema XXX) is based on Castor and needs hand configuration of some metadata not easily expressible in XML Schema (especially regular expressions).
In a system we built for NatureServe to provide schema-driven, distributed, role-based access control of sensitive data (e.g. geolocations of endangered species) we had to hand-craft a schema-path hash enumeration to generate XSLT as the control filter, but that works mainly because XML Schema is itself expressed in XML, so a parser plus the relationship of XPath to XML Schema is enough. [This effort is less about SDD than it is about what will/would become a representation of observations.] Similarly, a tool we built for adding human-readable heuristics to the otherwise meaningless integer key/keyref pairs is driven mostly by parsing the subject schema---which is supposed to insulate it from changes to the SDD schema---and we have yet to look at whether these two apparently similar tasks in two different projects are in fact a common task.
We find these kinds of frameworks pretty productive, but I don't claim that *they* are the tools for biologists, only that they can *produce* the biology-friendly tools. I worry that such frameworks are not mature for RDF (else why has it proved so difficult to teach biologists how to use ontology tools, and why, for example, is the premier tool, Protege, unable to handle the complexity of ecological ontologies?). I'm going to guess that purely parser-driven tools must be OK for RDFS. For example, I suppose there must be something around for generating SPARQL queries based on <something>. I have a lot to learn before I can offer opinions based on arguments other than those expressed in an old U.S.(?) cultural idiom: "Where's the beef?". Or like SETI: just because we haven't found it doesn't mean it isn't there; it only means that we can't tell if we are any closer to finding it than before we started looking. That's an enviable position for research projects---it keeps our money flowing---but maybe not for production architecture proposals.
For what it's worth, we have also acceded to the position of our particular biologist clientele (mostly field naturalists) that "biologist friendly" means "looks like Excel". We implemented a VBA application for management of property lists on Excel cells but haven't tried to exploit frameworks, and it is stuck to a particular (simple) schema [or maybe even none at all---I forget, since generating VBA is not something I want my lab to aspire to...]. And, as it turns out, Excel ain't so bad at managing triples on a small scale. In the heat of a whirlwind meeting recently about invasive species information, Kevin Thiele forgot that he had started learning about triples and reinvented them---in Excel! See http://wiki.cs.umb.edu/twiki/pub/IASPS/TerminologySummary/GISINSchemaworkgro... and my biologist-friendly (I hope) commentary on it at http://wiki.cs.umb.edu/twiki/bin/view/IASPS/SampleDefinitions
We are also getting our feet wet in aspect-oriented programming. The Spring framework is proving to have a very big following, but at the moment I have no clue where it, or its relatives, might fit in the discussions at hand.
Bob
Roger wrote:
... It also means that we hard link structure to meaning. It is very difficult for some one to come along and extend this schema saying "I have an element and my element represents a collector that isn't a person" because the notion of collector is hard coded to the structure for representing a person. They can't abstractly say "This machine is a type of collector".
Partly you can. You can create a revised schema in which the complex type PersonType is replaced by AgentType, and PersonType as well as SoftwareAgent, InstitutionalAgent, etc. are derived from AgentType.
The original document would remain valid both syntactically (which is what XML Schema checks) and semantically (the additional interpretation of schema typing).
What you lose is that in earlier documents the data in AgentType were not undecided about their subtype, but fixed to the specific subtype PersonType. However, since the content of PersonType is a simple string, it is very likely that many documents under the first version did in fact *not* have a person there, despite the name of the type, so relatively little is lost.
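A sketch of the evolution Gregor describes, meant to slot into the schema from Roger's example (the AgentType and MachineType names are invented here); old instances remain valid, and new ones can use xsi:type to be explicit about the subtype:

<!-- The general base type carries the content the old personType had: -->
<xs:complexType name="AgentType">
  <xs:sequence>
    <xs:element name="Name" type="xs:string"/>
  </xs:sequence>
</xs:complexType>

<!-- Subtypes add nothing here, but give instances a place to be specific: -->
<xs:complexType name="personType">
  <xs:complexContent>
    <xs:extension base="specimens:AgentType"/>
  </xs:complexContent>
</xs:complexType>

<xs:complexType name="MachineType">
  <xs:complexContent>
    <xs:extension base="specimens:AgentType"/>
  </xs:complexContent>
</xs:complexType>

<!-- In the Specimen declaration, Collector is retyped to the general AgentType;
     an instance may then say xsi:type="specimens:MachineType" if it wishes. -->
<xs:element name="Collector" type="specimens:AgentType"/>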
You can do evolution with schema. You cannot validate semantics with schema. However, turning your example to RDF, I think it is likely that there is no way to validate that a string or URI does indeed refer to a person. You can validate internal semantic consistency, but not actual usage.
My central question is how we map between existing and future schemas. If we can't say where the meaning is encoded in our current schemas then we can't even start the process.
Meaning is not encoded, but documented. You do not have to make use of element nesting (= OO composition), but it often helps in mapping XML data to software design and it usually helps intuition.
The technical answer to questions about mapping is XSLT. But devising these mappings is a lot of work, where semantic processors clearly would help.
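As a minimal sketch of what such a mapping looks like in XSLT (the target "agents" vocabulary and its element names are invented here) - note that the knowledge that Collector corresponds to Agent lives only in the stylesheet, not in either schema:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:spec="http://example.org/specimens#"
    xmlns:ag="http://example.org/agents#">
  <!-- Map the source Collector element onto the hypothetical target vocabulary. -->
  <xsl:template match="spec:Collector">
    <ag:Agent>
      <ag:AgentName><xsl:value-of select="spec:Name"/></ag:AgentName>
    </ag:Agent>
  </xsl:template>
  <!-- Templates for the remaining elements omitted for brevity. -->
</xsl:stylesheet>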
I understand what we can gain by using semantic tools. I still have not clearly understood what we gain by RDF. What is wrong with simply tagging the schema elements with id attributes (or even rdf attributes - schema-schema is defined to support attributes from any other namespace) and then have an external ontology based on this?
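A sketch of that suggestion, assuming an invented "sem" annotation namespace and ontology URI: XML Schema allows attributes from foreign namespaces on its own elements, so validators simply ignore the extra attribute.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:sem="http://example.org/semantic-tags#"
           targetNamespace="http://example.org/specimens#"
           elementFormDefault="qualified">
  <!-- sem:concept plays no role in validation; an external ontology
       (or any reader) can use it to look up the intended meaning. -->
  <xs:element name="Collector" type="xs:string"
              sem:concept="http://example.org/ontology#Collector"/>
</xs:schema>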
Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany          Fax: +49-30-8304-2203
At risk of sounding like a broken record (there's an anachronism for you),
I understand what we can gain by using semantic tools. I still have not clearly understood what we gain by RDF. What is wrong with simply tagging the schema elements with id attributes (or even rdf attributes - schema-schema is defined to support attributes from any other namespace) and then have an external ontology based on this?
... this is essentially the approach taken by GML, and if I understand correctly, by ISO 19109. However, I do not know of tools that validate the internal semantic consistency of GML expressions or ISO 19109-compliant models.
Flip
This 'semantic tagging' is something that may be useful in some situations. The two examples I have been thinking about are:
1. Marking up existing schemas as you suggest. This might make it easier to transform between systems, but it would need investigating. We could also map the namespaces of the elements in instance documents and do it that way.
2. Marking up regular text in taxonomic literature, which could be done by extending XHTML. We could simply extend the <span> and <div> tags to have a tdwg="" attribute that contains the URI or URN of the core class of thing between the tags (sketched below). Regular browsers and text processors would ignore the additional attribute but could be extended to handle it. Simple applications could be written in JavaScript to extend browser functionality to give term expansion, associated searching, etc. Data could be extracted with XSLT. Some clever person might even extend DreamWeaver or another editor to support authoring of the tagging by regular taxonomists who are using word processors at the moment. (This is a fantasy example and not a concrete suggestion of doing it this way!)
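A sketch of that second, fantasy example (the tdwg attribute and the ontology URIs are invented): ordinary XHTML that browsers render as usual, while XSLT or JavaScript can pick out the tagged terms.

<p xmlns="http://www.w3.org/1999/xhtml">
  Collected by
  <span tdwg="http://example.org/ontology#Collector">John Doe</span>
  aboard the
  <span tdwg="http://example.org/ontology#ResearchVessel">RV Example</span>.
</p>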
I think the point is that most of these technologies overlap and we are probably not talking about using 100% of anything in a solution model.
All the best,
Roger
*"Are the semantics encoded in the XML Schema or in the structure of the XML instance documents that validate against that schema? Is it possible to 'understand' an instance document without reference to the schema?"*
I would say:
A little semantics is embedded in the instance document; with intuitive design quite a bit more can be guessed.
Much more semantics is embedded in the schema: Type information, complete knowledge of optional properties, cardinality constraints, extension points. More semantics is commonly added in the schema annotations.
Certain kinds of semantics - like the semantics of repeated use of the same type (simple, like string, or complex, self-created types = classes) in different properties of a class - are not expressed formally in schema. Nor are they formally expressed in UML static class modeling or ER modeling.
Other semantics, like relations between classes (XML Schema "complex types"), are expressible through identity constraints, but unfortunately most people skip the work of doing it.
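For readers who have not used them, a minimal sketch of such identity constraints as a schema fragment (all names invented, no target namespace assumed): every Specimen's collectorRef must match the id of some Agent in the same document.

<xs:element name="ExampleDataSet">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="Agent" maxOccurs="unbounded">
        <xs:complexType>
          <xs:attribute name="id" type="xs:string"/>
        </xs:complexType>
      </xs:element>
      <xs:element name="Specimen" maxOccurs="unbounded">
        <xs:complexType>
          <xs:attribute name="collectorRef" type="xs:string"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
  <!-- Every Agent/@id is a key... -->
  <xs:key name="agentKey">
    <xs:selector xpath="Agent"/>
    <xs:field xpath="@id"/>
  </xs:key>
  <!-- ...and every Specimen/@collectorRef must refer to one of those keys. -->
  <xs:keyref name="collectorAgentRef" refer="agentKey">
    <xs:selector xpath="Specimen"/>
    <xs:field xpath="@collectorRef"/>
  </xs:keyref>
</xs:element>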
XML is 'self describing' so you would think this must be true.
Perhaps XML is "self-guessable" :-)
If the answer is No, then we need clear statements about how all instances must always bear links to a permanently retrievable schema - or they become meaningless.
In XML Schema this is done through the namespace and the schema location (xsi:schemaLocation). Note that the schema location is only a hint and is not required to be followed. Consumers may use their own version of a schema on their own responsibility.
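In an instance that hint looks like the sketch below (the specimens.xsd location is invented here); a validator may use it or substitute its own copy of the schema.

<ExampleDataSet xmlns="http://example.org/specimens#"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://example.org/specimens# http://example.org/specimens.xsd">
  <Specimen>
    <Collector><Name>John Doe</Name></Collector>
  </Specimen>
</ExampleDataSet>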
We need very tight version control of schemas and a method of linking between the versions so we can track how the meaning has changed.
Updating a deployed schema in the same namespace implies that you state that at least previous documents are valid under the new schema. Depending on the management of schema users, the reverse may or may not be required as well.
Guidelines exist as to how forward and backward compatibility can be achieved in XML Schema (provide extension points with xs:any and the self or non-self namespace options, and provide them within a container element), enabling a schema to become extensible. If these guidelines are followed, a new schema version may be produced that has the same namespace but a different version attribute. The version attribute documents "evolutionary changes".
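A sketch of that pattern, reusing the example schema (the Extensions container and the version number are invented for illustration):

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/specimens#"
           xmlns:specimens="http://example.org/specimens#"
           elementFormDefault="qualified"
           version="1.1">
  <xs:element name="Specimen">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Collector" type="specimens:personType"/>
        <!-- Extension point: elements from other namespaces are allowed here
             and are only lax-validated, so older consumers can skip them. -->
        <xs:element name="Extensions" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:any namespace="##other" processContents="lax"
                      minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:complexType name="personType">
    <xs:sequence>
      <xs:element name="Name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>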
We also need clear statements on what happens when you can validate a document against multiple schemas - does this imply multiple meanings? Schemas must be archived with any data, etc.
Schema is primarily about syntax, not semantics (although some semantics are implied by syntax). Clearly, additional semantics are desirable, and that makes this discussion worthwhile.
However, I can see great benefits in syntax. According to Roger's post, RDFS cares very little about this ("we could use external OWL-based ..."). Knowing that the syntax is correct enables you to guarantee that the imported/consumed data fulfill a number of validation constraints built into your local data structure. Having these constraints enforced then allows you to write code relying on assumptions. If you import unconstrained data, you would have to write super-error-tolerant code that has exceptions for all possible inconsistencies in the data.
Is this a problem with using RDF/S?
If you respond to this message please state a preference for either 1 or 2. There is no middle road on this one!
I disagree; I believe formal semantic definitions are a question of degree, not "yes" or "no". Even OWL claims only to do the "doable" things and does not aim to solve all of AI's problems.
Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany          Fax: +49-30-8304-2203
Sorry to propose a very complex answer to a very simple question, but here it is:
Roger Hyam wrote:
Hi All,
Gregor posted my rather long winded description of confusion about semantics in XML Schema to the list and it may have confused you. It can be summed up with a simple question to which a simple answer is all that suffices.
*"Are the semantics encoded in the XML Schema or in the structure of the XML instance documents that validate against that schema? Is it possible to 'understand' an instance document without reference to the schema?"*
Possible answers are:
- *Yes:* you can understand an XML instance document in the absence of a schema it validates against i.e. just from the structure of the elements and the namespaces used.
- *No*: you require the XML Schema to understand the document.
I cannot state strongly enough that the answer is NO (number 2) -- you must have an XML schema in order to "understand" an instance. XML is "self-documenting" only to humans. I would venture that almost no one uses XML directly in their work. Instead people who use data often collect it and load it into a client application that can then do something useful with it (for example, a geographic or scientific information system like Arc or Matlab or the GBIF portal). In this scenario, humans aren't consuming the XML, software is. When a user goes to the GBIF portal and requests search results in tab-delimited format, they're downloading the results of a piece of software that has consumed the XML and produced a text file.
In any case where software must consume XML, it needs to know the structure of the XML. One common way to do this is to use XML-binding tools like Castor that create bindings between a programming language and an XML document structure. Given a properly constructed XML Schema, these tools create a set of object-oriented classes that can parse and "understand" XML instances under that schema. These classes are used by a software application to consume instances of the given schema. The other common method for working with XML in software is to create a custom deserializer that "understands" instances of a schema given hard-coded domain specific information by a programmer. While this second option does not directly depend upon the XML Schema, the programmer who creates the rules embedded in the deserializer uses the XML schema to encode the rules about what is acceptable.
Both of these approaches depend upon XML Schema for reasons other than validation. Validation, however, is the only function that XML Schema was explicitly designed to address. At heart XML Schema is a grammar for accepting or rejecting documents. It is not a description of a data model. This raises the question of what it means to "understand" an XML instance or an XML Schema.
Roger asks *"Are the semantics encoded in the XML Schema or in the structure of the XML instance documents that validate against that schema?"*
I argue that the semantics are embedded in the contents of XML instances, and that the XML Schema does not address the semantics at all but that they are necessary (though not sufficient) for doing so. The XML Schema merely constrains the syntax of an acceptable document. Syntax is not semantics. Natural (human) languages are a great deal more powerful and expressive than XML, but my point can be illustrated with a simple syntactically valid English sentence that makes no semantic sense whatsoever: "The moon jumped over the cow". For another example, see Donald's argument about using Darwin Core to exchange stamp collection data.
In order for a piece of software to consume XML it must first know the syntactic structure of the XML before it can do something useful. Simple systems that convert one representation into another without any translation (like the GBIF portal when it creates tab-delimited representations) don't really require a semantic understanding of the data. However, any sort of analysis tool or data-cleaning tool must be smarter and these smart tools can provide a great deal of value to end users.
XML requires human intervention in order to be understood. Because it only constrains the syntax of documents, not the semantics of inter-related data objects, programmers must embed domain specific knowledge in software in order to do any non-trivial processing of XML, especially of interrelated XML instances defined under multiple schema.
This is not a trivial question. The answers may require different approaches to an overall architecture.
Versioning of schemas, for example, becomes irrelevant if the answer is Yes - as the meaning is implicit in the structure, you can throw the schema away and not lose anything. XML is 'self describing', so you would think this must be true. The schema is just a useful device to help you construct XML in the correct format.
If the answer is No, then we need clear statements about how all instances must always bear links to a permanently retrievable schema - or they become meaningless. We need very tight version control of schemas and a method of linking between versions so we can track how the meaning has changed. We also need clear statements on what happens when you can validate a document against multiple schemas - does this imply multiple meanings? Schemas must be archived with any data, etc.
If you respond to this message please state a preference for either 1 or 2. There is no middle road on this one!
At heart the real problem is schema interoperability. We need interoperability both within schemas and across schemas. When two pieces of software exchange data in XML they both need to know the structure of the data (its schema) and be assured that they're using the same version. This can be addressed by rigorous schema versioning.
The difficult problem manifests when we start talking about interoperability across schemas (for example, across a specimen schema and a taxon concept schema). We can avoid the circular XML Schema import problem (which is made much more difficult if we have to strictly version schemas) by making references across schema instances with GUIDs. For example, a Specimen schema instance can refer to a TCS instance for its identified taxon concept using an LSID. However, to a piece of software that consumes XML based on XML Schema, this LSID is simply a string. A specimen instance that refers to a taxon concept might validate just as easily if that LSID were:
1. an invalid LSID
2. an LSID pointing to a publication instance (instead of to a taxon concept)
3. a valid LSID pointing to a valid taxon concept
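A sketch of the situation Steve describes (the IdentifiedAs element and the LSID itself are invented): to a schema validator the reference below is just a string; nothing checks that it resolves, or that it resolves to a taxon concept rather than, say, a publication.

<Specimen xmlns="http://example.org/specimens#">
  <Collector><Name>John Doe</Name></Collector>
  <!-- Cross-schema reference by GUID: opaque to XML Schema validation. -->
  <IdentifiedAs>urn:lsid:example.org:taxonconcepts:12345</IdentifiedAs>
</Specimen>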
The problem is that the software that consumes instances of different XML schema that are made interoperable by GUIDs must be an order of magnitude more intelligent than what we're building now in order to "understand" what they're working with. In order to semantically validate the specimen from the above example, the software should first validate and parse the specimen instance, then resolve the LSID which is encoded in a taxon concept element, fetch the XML metadata for the taxon concept and then validate that taxon concept instance. If the end user wants to display the name of the taxon concept along with the rest of the data about a specimen, the taxon concept XML would also have to be parsed.
So, to consume instances of different schemas that are interrelated with GUIDs, the software has to know about each schema involved (specifically each version of each schema). What this means in practice is that if a new schema were introduced, or a new version of an existing schema came into production, every piece of software that consumes it or related schemas must be updated. In practice this is a software maintenance nightmare.
What I'm trying to point out is that using GUIDs does not break dependencies between XML Schemas (or versions of the same schema); it merely pushes the problem to a higher level in the process of consuming XML. Can anyone propose a real solution to this problem using XML Schema?
-Steve
Hi Steve,
Great post. You gave a definitive NO but then further down your answer said YES!
If I can quote one bit:
"The XML Schema merely constrains the syntax of an acceptable document. Syntax is not semantics."
I think this sums it up. At the moment we basically don't do semantics we only do syntax.
Knowing the syntax of two languages (XML Schema applications) does not get us anywhere in trying to use the two languages together. We have to have some mechanism to link the meaning of the words between the languages. You finish up with more or less the same question - that I was coming to from a different direction. How do we do this in XML Schema?
So maybe there is no Yes/No answer to the question....
Thanks,
Roger
participants (6)
- Blum, Stan
- Bob Morris
- Gregor Hagedorn
- Phillip C. Dibner
- Roger Hyam
- Steven Perry