TCS in RDF for use in LSIDs and possible generic mechanism.
Hi Everyone,
I am cross-posting this to the TCS list and the TAG list because it is relevant to both, but responses should fall neatly into things to do with nomenclature (for the TCS list) and things to do with technology (for the TAG list). The bit about avowed serializations of RDF below is TAG relevant.
The move towards using LSIDs and the implied use of RDF for metadata has led to the question: "Can we do TCS in RDF?". I have put together a package of files that encodes the TaxonName part of TCS as an RDF vocabulary. It is not 100% complete but could form the basis of a solution.
You can download it here: http://biodiv.hyam.net/schemas/TCS_RDF/tcs_rdf_examples.zip
For the impatient you can see a summary of the vocabulary here: http://biodiv.hyam.net/schemas/TCS_RDF/TaxonNames.html
and an example XML document here: http://biodiv.hyam.net/schemas/TCS_RDF/instance.xml
It has actually been quite easy (though time consuming) to represent the semantics of the TCS XML Schema in RDF. Generally, elements within the TaxonName element have become properties of the TaxonName class, with some minor name changes. Several other classes were needed to represent NomenclaturalNotes and Typification events. The only difficult part was Typification. A nomenclatural type is both a property of a name and, if it is a lectotype, a separate object that merely references a type and a name. The result is a compromise: an object that can also be embedded as a property. I use instances for controlled vocabularies, which may or may not prove controversial.
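As a rough sketch of that mapping (the namespace is invented - see the note below about namespaces not being finalized), a fragment of the vocabulary might look something like this in RDFS:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <!-- the class corresponding to the TCS TaxonName element -->
  <rdfs:Class rdf:about="http://example.org/TaxonNames#TaxonName">
    <rdfs:label>Taxon Name</rdfs:label>
  </rdfs:Class>
  <!-- a former child element, recast as a property of TaxonName -->
  <rdf:Property rdf:about="http://example.org/TaxonNames#specificEpithet">
    <rdfs:label>specific epithet</rdfs:label>
    <rdfs:domain rdf:resource="http://example.org/TaxonNames#TaxonName"/>
  </rdf:Property>
</rdf:RDF>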
What is lost in using only RDFS is control over validation. It is not possible to specify that certain combinations of properties are permissible and others are not. There are two approaches to adding more 'validation':
OWL Ontologies
An OWL ontology could be built that makes assertions about the items in the RDF vocabulary. It would be possible to use necessary and sufficient conditions to assert that instances of TaxonName are valid members of an OWL class such as BotanicalSubspeciesName, for example. In fact, far more control could be introduced this way than is present in the current XML Schema. What is important to note is that any such OWL ontology could be kept separate from the common vocabulary suggested here. Different users could develop their own ontologies for their own purposes. This is a good thing, as it is probably impossible to come up with a single, agreed ontology that handles the full complexity of the domain.
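As a sketch only (the ontology namespace and the infraspecificEpithet property are hypothetical), a necessary-and-sufficient definition of such a class might look roughly like this in OWL:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://example.org/ontology#BotanicalSubspeciesName">
    <owl:equivalentClass>
      <owl:Class>
        <owl:intersectionOf rdf:parseType="Collection">
          <!-- must be a TaxonName ... -->
          <owl:Class rdf:about="http://example.org/TaxonNames#TaxonName"/>
          <!-- ... with exactly one infraspecific epithet (hypothetical property) -->
          <owl:Restriction>
            <owl:onProperty rdf:resource="http://example.org/TaxonNames#infraspecificEpithet"/>
            <owl:cardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#nonNegativeInteger">1</owl:cardinality>
          </owl:Restriction>
        </owl:intersectionOf>
      </owl:Class>
    </owl:equivalentClass>
  </owl:Class>
</rdf:RDF>

Any number of such ontologies could sit on top of the same shared vocabulary without the vocabulary itself having to change.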
I would argue strongly that we should not build a single central ontology that summarizes all we know about nomenclature - we couldn't do it within my lifetime :)
Avowed Serializations
Because RDF can be serialized as XML it is possible for an XML document to both validate against an XML Schema AND be valid RDF. This may be a useful generic solution so I'll explain it here in an attempt to make it accessible to those not familiar with the technology.
The same RDF data can be serialized in XML in many ways, and different code libraries will do it differently, though all of them can read the serializations produced by others. It is possible to pick one of the ways of serializing a particular set of RDF data and design an XML Schema to validate the resulting structure. I am stuck for a way to describe this, so I am going to use the term 'avowed serialization' (avowed means 'openly declared') as opposed to 'arbitrary serialization'. This is the approach taken by the PRISM group (http://www.prismstandard.org) for their standard, and it gives a number of benefits as a bridging technology:
1. Publishing applications that are not RDF aware (even simple scripts) can produce regular XML Schema validated XML documents that just happen to also be RDF compliant.
2. Consuming applications can assume that all data is just RDF and not worry about the particular XML Schema used. These are the applications that are likely to have to merge different kinds of data from different suppliers, so they benefit most from treating it like RDF.
3. Because it is regular structured XML it can be transformed using XSLT into other document formats such as 'legacy' non-RDF compliant structures - if required.
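As a minimal sketch of what an avowed serialization might look like (namespaces invented; instance.xml in the package is the real example), something along these lines can both validate against an XML Schema and parse as ordinary RDF:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:tn="http://example.org/TaxonNames#">
  <!-- a typed node: an XML Schema can fix exactly this element layout -->
  <tn:TaxonName rdf:about="urn:lsid:example.com:names:1234">
    <tn:genusEpithet>Bellis</tn:genusEpithet>
    <tn:specificEpithet>perennis</tn:specificEpithet>
  </tn:TaxonName>
</rdf:RDF>

Because the element names, nesting and order are pinned down by convention, an XML Schema can describe exactly this shape, while an RDF parser just sees the usual triples.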
There is one direction in which data would not flow without some effort. The same data published in an arbitrary serialization rather than the avowed one could be transformed, probably via several XSLT steps, into the avowed serialization and therefore made available to legacy applications using point 3 above. This may or may not be worth the bother. Some of the code involved would be generic to all transformations, so the effort may not be too great. It would certainly be possible for restricted data sets. A sketch of one such step follows.
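Purely as an illustration (namespaces hypothetical), one XSLT step could rewrite generic rdf:Description nodes into the typed-node form an avowed schema expects, copying everything else unchanged:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:tn="http://example.org/TaxonNames#">

  <!-- identity transform: copy everything unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- rewrite rdf:Description nodes typed as TaxonName into typed-node form -->
  <xsl:template match="rdf:Description[rdf:type/@rdf:resource='http://example.org/TaxonNames#TaxonName']">
    <tn:TaxonName>
      <xsl:apply-templates select="@* | node()[not(self::rdf:type)]"/>
    </tn:TaxonName>
  </xsl:template>

</xsl:stylesheet>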
To demonstrate this, instance.xml is included in the package along with avowed.xsd and two supporting files. instance.xml will validate against avowed.xsd and parse correctly in the W3C RDF parser.
I have not provided XSLT to convert instance.xml to the standard TCS format, though I believe it could be done quite easily if required. Converting arbitrary documents from the current TCS format to the structure represented in avowed.xsd would be trickier, but feasible, and certainly possible for the restricted uses of the schema that are typical of individual data suppliers.
Contents
This is what the files in this package are:
README.txt = this file
TaxonNames.rdfs = an RDF vocabulary that represents the TCS TaxonNames object.
TaxonNames.html = documentation generated from TaxonNames.rdfs - much more readable.
instance.xml = an example XML document that is both an RDF-compliant use of the vocabulary and XML Schema compliant.
avowed.xsd = the XML Schema that instance.xml validates against.
dc.xsd = XML Schema used by avowed.xsd.
taxonnames.xsd = XML Schema used by avowed.xsd.
rdf2html.css = style formatting for TaxonNames.html
rdfs2html.xsl = XSLT style sheet to generate the docs from TaxonNames.rdfs
tcs_1.01.xsd = the TCS XML Schema, for reference.
Needs for other Vocabularies
What is obvious looking at the vocabulary for TaxonNames here is that we need vocabularies for people, teams of people, literature and specimens as soon as possible.
Need for conventions
In order for all exchanged objects to be discoverable in a reasonable way we need to have conventions on the use of rdfs:label for Classes and Properties and dc:title for instances.
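Purely as an illustration of such a convention (namespaces and identifiers invented, enclosing rdf:RDF element and xmlns declarations omitted):

<!-- a class in the vocabulary carries an rdfs:label -->
<rdfs:Class rdf:about="http://example.org/TaxonNames#TaxonName">
  <rdfs:label>Taxon Name</rdfs:label>
</rdfs:Class>

<!-- an exchanged instance carries a dc:title -->
<tn:TaxonName rdf:about="urn:lsid:example.com:names:1234">
  <dc:title>Bellis perennis</dc:title>
</tn:TaxonName>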
The namespaces used in these examples are fantasy as we have not finalized them yet.
Minor changes in TCS
There are a few points where I have intentionally not followed TCS 1.01 (there are probably others where it is accidental).
* basionym is a direct pointer to a TaxonName rather than a NomenclaturalNote. I couldn't see why it was a nomenclatural note in the 1.01 version, as it is a simple pointer to a name.
* The genus element has been renamed to the genusEpithet property. The contents of the element are not to be used alone and are not a genus name in themselves (uninomial should be used in that case), so genusEpithet is more appropriate - even if it is not common English usage.
* Addition of a referenceTo property. The vocabulary may be used to mark up an occurrence of a name that is not the publication of a new name. In these cases the thing being marked up is actually a pointer to another object, either a TaxonName issued by a nomenclator or a TaxonConcept, so we need a reference field. Here is an example (assuming a namespace):
<TaxonName referenceTo="urn:lsid:example.com:myconcepts:1234"><genusEpithet>Bellis</genusEpithet><specificEpithet>perennis</specificEpithet></TaxonName>
This could appear in an XHTML document, for example.
Comments Please
All this amounts to a complex suggestion of how things could be done, i.e. we develop central vocabularies that go no further than RDFS but permit exchange and validation of data using avowed serializations and OWL ontologies.
What do you think?
Roger
Hi Roger,
Just a short question...
Where can I get the temp.xsd schema with namespace http://www.w3.org/1999/02/22-rdf-syntax-ns# ?
I need it to validate the document instance against avowed.xsd. I wanted to try it with the PyWrapper schema parser.
Javi.
Just thoughts/comments on the use of XML Schema for validating RDF documents.
I'm afraid that by using XML Schema to validate RDF documents, we would be creating unnecessary constraints on the system. Some services may want to serve data in formats other than RDF/XML, for example N-Triple or Turtle, for various reasons. Neither of these can be validated by an XML Schema. For example, I've been working on indexing large quantities of data represented as RDF using standard IR techniques. N-Triple has distinct benefits over other representations because its grammar is trivial. Another benefit of N-Triple is that one can use simple concatenation to build a model without being required to use an in-memory model through an RDF library such as Jena. For example, I can build a large single document containing N-Triples about millions of resources. The index maintains the file position and size for each resource indexed. The benefit of using N-Triple is that upon querying, I can simply use fast random access into the file, based on the position and size stored in the index, to read chunks of N-Triple and immediately start streaming the results across the wire.
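For a sense of how trivial the grammar is (the identifiers below are invented), the record for a single resource is just a handful of lines like these, which can simply be appended to a growing file without any parsing or in-memory model:

# three triples about one resource; one triple per line, terminated by " ."
<urn:lsid:example.com:names:1234> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/TaxonNames#TaxonName> .
<urn:lsid:example.com:names:1234> <http://example.org/TaxonNames#genusEpithet> "Bellis" .
<urn:lsid:example.com:names:1234> <http://example.org/TaxonNames#specificEpithet> "perennis" .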
With the additional constraint of using only RDF/XML as the output format, the above indexer example would either need custom code to serialize N-Triple -> RDF/XML, or would have to use a library to read it into an in-memory model and serialize it as RDF/XML.
Another concern is that we would lose the serialization support we get from standard libraries. Jena, Redland, SemWeb, or any other library that can produce and consume RDF is not likely to produce RDF/XML in the same format. Producers of RDF would not only be required to use RDF/XML as opposed to other formats such as N-Triple, but would be required to write custom serialization code to translate the in-memory model of the library of their choice into the structured RDF response that fits the XML Schema. It seems to me we would really be removing one of the technical benefits of using RDF. Services and consumers really should not need to be concerned about the specific structure of the bits of RDF across the wire so long as it is valid RDF.
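To illustrate (namespaces and identifiers invented, and the enclosing rdf:RDF element omitted), here are two equally valid RDF/XML renderings of the same two triples, as different libraries might emit them; only one of the two would match a fixed XML Schema:

<!-- form 1: typed node element -->
<tn:TaxonName rdf:about="urn:lsid:example.com:names:1234">
  <tn:specificEpithet>perennis</tn:specificEpithet>
</tn:TaxonName>

<!-- form 2: generic rdf:Description with an explicit rdf:type triple -->
<rdf:Description rdf:about="urn:lsid:example.com:names:1234">
  <rdf:type rdf:resource="http://example.org/TaxonNames#TaxonName"/>
  <tn:specificEpithet>perennis</tn:specificEpithet>
</rdf:Description>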
In my humble opinion, any constraints and validation should be either at the level of the ontology (OWL-Lite, OWL-DL, RDFS/OWL) or through a reasoner that can be packaged and distributed for use within any application that desires to utilize our products.
Cheers, Rob
Rob,
It may help to see this from the other side. You are quite right that we would not want an RDF model to be constrained by the fact that some people wish to share data using XML Schema validation.
However, if the documents validated this way are a subset of the documents that are valid according to the RDF model, it may give these people a chance to make their data available for use in an RDF-enabled world.
Consumers of RDF data can freely handle alternative representations such as N-Triple as well as the XML-encoded form. Because such heterogeneity is always possible, it may not be a big deal if some of the providers are unaware that their documents are expressed as RDF/XML, so long as they are valid as such.
Does that make sense?
Donald
--------------------------------------------------------------- Donald Hobern (dhobern@gbif.org) Programme Officer for Data Access and Database Interoperability Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480 ---------------------------------------------------------------
I'm pretty confused by this exchange, probably because I am so new to issues of RDF serialization.
I would have thought that the people who want XML Schema validation, or any other non-RDF validation, do so because they have a need to embed in LSID data or metadata something which only has a non-RDF serialization, and that their real need is to be able to validate the embedded data after its extraction and deserialization into pure XML. Image pixels come to mind, accompanied by an assertion that the representation has some specified encoding and when decoded represents a media file meeting some ISO standard. (As an aside, if there is a standard way of embedding binary data in RDF, it would help our image LSID effort to have a pointer to it.) SDD (which we should perhaps call "SDD-XML") is another, mainly because it is almost certainly as expressive as full RDF itself, and there are no RDF editing tools on the horizon which even SEEK asserts biologists will happily embrace, from which I infer that some such current XML Schema TDWG projects will have a rather long life. From experience in a project with NatureServe, I am guessing that Observations have similar issues.
Thinking about image data is probably the least controversial, albeit the simplest, case.
So if all the above is meaningful and correct, it seems to me that the issue may be one of settling on TDWG standards, consistent with current RDF best practices, about how to signal that some embedded stuff is actually in some "non-RDF" serialization, with enough standardized RDF to guide extraction, and that this is probably idiosyncratic to the kinds of embedded data. Isn't this basically the same issue as custom serializers/deserializers for SOAP? And it hasn't already been addressed by the RDF community???
Or maybe I am so clueless that this all sounds like the rantings of a character from a Borges story. (Which is often what I feel like when contemplating RDF. The Library of Babel comes to mind, as does the story whose title I forget which is about two giants locked in mortal combat, except that each is dreaming the combat in some kind of shared dream and the one who wins in the dream gets to wake up.) :-) ...
Bob Morris wrote:
[...] Or maybe I am so clueless that this all sounds like the rantings of a character from a Borges story. [...]
But meant:
"Or maybe I am so clueless that my writing above all sounds like the rantings of a character from a Borges story. [...]"
I did not mean to suggest that the previous authors were ranting. But do read Borges' "The Library of Babel" and see if it is possible to believe me (though it's true! It's true!).
Bob
Hi Bob,
Comments below:
Bob Morris wrote:
I'm pretty confused by this exchange, probably because I am so new to issues of RDF serialization.
I would have thought that the people who want XML Schema validation, or any other non-RDF validation, do so because they have a need to embed in LSID data or metadata something which only has a non-RDF serialization, and that their real need is to be able to validate the embedded data after its extraction and deserialization into pure XML. Image pixels come to mind, accompanied by an assertion that the representation has some specified encoding and when decoded represents a media file meeting some ISO standard. (As an aside, if there is a standard way of embedding binary data in RDF it would help our image LSID effort to have a pointer to it). SDD (which we should perhaps call "SDD-XML) is another, mainly because it is almost certainly as expressive as full RDF itself, and there are no RDF editing tools on the horizon which even SEEK asserts biologists will happily embrace, from which I infer that some such current XML-Schema TDWG projects will have a rather long life. From experience in a project with NatureServe, I am guessing that Observations have similar issues.
Can you point me in the direction of a tool that will generate interfaces for data editing from an arbitrary XML Schema - and works out of the box? I presume that, as you bemoan the lack of them for RDF, they exist for XML Schema based documents.
I am not talking about XML editors (Spy and friends) here; I am talking about pukka data editors for regular biologists to use. I gave up trying to find these tools a while back, but they may now have reached maturity. There was all this promise that one would be able to define a document structure in XML Schema, distribute it to clients, and they would just see groovy forms to fill in and manage data in a database somewhere. No more slog designing user interfaces - just one generic tool. I got excited about XForms but soon got over that. Is there a tool I can download and use with SDD, TCS and ABCD?
If I could get my hands on such a tool I could try it with one of my 'avowed' serialization schemas and then demonstrate that I am editing RDF with it - but that really would be confusing. Where are the generic XML tools you imply?
Donald Hobern wrote:
Rob,
It may help to see this from the other side. You are quite right that we would not want an RDF model to be constrained by the fact that some people wish to share data using XML Schema validation. However, if the documents validated this way are a valid subset of the valid documents according to the RDF model, it may give these people a chance to make their data available for use in an RDF-enabled world.
Consumers of RDF data can freely handle alternative representations such as N-Triple as well as the XML-encoded form. Because such heterogeneity is always possible, it may not be a big deal if some of the providers are unaware that their documents are expressed in XML RDF, so long as they are valid as such.
Does that make sense? Donald
Donald Hobern (dhobern@gbif.org) Programme Officer for Data Access and Database Interoperability Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
-----Original Message----- From: Tdwg-tag-bounces@lists.tdwg.org [mailto:Tdwg-tag-bounces@lists.tdwg.org] On Behalf Of Robert Gales Sent: 22 March 2006 16:09 To: roger@tdwg.org Cc: Tdwg-tag@lists.tdwg.org Subject: Re: [Tdwg-tag] TCS in RDF for use in LSIDs and possiblegeneric mechanism.
Just thoughts/comments on the use of XML Schema for validating RDF documents.
I'm afraid that by using XML Schema to validate RDF documents, we would be creating unnecessary constraints on the system. Some services may want to serve data in formats other than RDF/XML, for example N-Triple or Turtle for various reasons. Neither of these would be able to be validated by an XML Schema. For example, I've been working on indexing large quantities of data represented as RDF using standard IR techniques. N-Triple has distinct benefits over other representations because its grammar is trivial. Another benefit of N-Triple is that one can use simple concatenation to build a model without being required to use an in memory model through an RDF library such as Jena. For example, I can build a large single document containing N-Triples about millions of resources. The index maintains file position and size for each resource indexed. The benefit of using N-Triple is that upon querying, I can simple use fast random access to the file based on the position and size stored in the index to read in chunks of N-Triple based on the size and immediately start streaming the results across the wire.
With the additional constraint of using only RDF/XML as the output format, the above indexer example would either need to custom serialize N-Tripe -> RDF/XML or use a library to read it into an in-memory model to serialize it as RDF/XML.
Another concern is that we will be reducing any serialization potential we have from standard libraries. Jena, Redland, SemWeb, or any other library that can produce and consume RDF is not likely to produce RDF/XML in the same format. Producers of RDF now will not only be required to use RDF/XML as opposed to other formats such as N-Triple, but will be required to write custom serialization code to translate the in-memory model for the library of their choice into the structured RDF response that fits the XML Schema. It seems to me, we are really removing one of the technical benefits of using RDF. Services and consumers really should not need to be concerned about the specific structure of the bits of RDF across the wire so long as its valid RDF.
In my humble opinion, any constraints and validation should be either at the level of the ontology (OWL-Lite, OWL-DL, RDFS/OWL) or through a reasoner that can be packaged and distributed for use within any application that desires to utilize our products.
Cheers, Rob
Roger Hyam wrote:
Hi Everyone,
I am cross posting this to the TCS list and the TAG list because it is relevant to both but responses should fall neatly into things to do with nomenclature (for the TCS list) and things to do with technology - for the TAG list. The bit about avowed serializations of RDF below are TAG relevant.
The move towards using LSIDs and the implied use of RDF for metadata has lead to the question: "Can we do TCS is RDF?". I have put together a package of files to encode the TaxonName part of TCS as an RDF vocabulary. It is not 100% complete but could form the basis of a
solution.
You can download it here: http://biodiv.hyam.net/schemas/TCS_RDF/tcs_rdf_examples.zip
For the impatient you can see a summary of the vocabulary here: http://biodiv.hyam.net/schemas/TCS_RDF/TaxonNames.html
and an example xml document here: http://biodiv.hyam.net/schemas/TCS_RDF/instance.xml
It has actually been quite easy (though time consuming) to represent the semantics in the TCS XML Schema as RDF. Generally elements within the TaxonName element have become properties of the TaxonName class with some minor name changes. Several other classes were needed to represent NomenclaturalNotes and Typification events. The only difficult part was with Typification. A nomenclatural type is both a property of a name and, if it is a lectotype, a separate object that merely references a type and a name. The result is a compromise in an object that can be embedded as a property. I use instances for controlled vocabularies that may be controversial or may not.
What is lost in only using RDFS is control over validation. It is not possible to specify that certain combinations of properties are permissible and certain not. There are two approaches to adding more 'validation':
OWL Ontologies
An OWL ontology could be built that makes assertions about the items in the RDF ontology. It would be possible to use necessary and sufficient properties to assert that instances of TaxonName are valid members of an OWL class for BotanicalSubspeciesName for example. In fact far more control could be introduced in this way than is present in the current XML Schema. What is important to note is that any such OWL ontology could be separate from the common vocabulary suggested here. Different users could develop their own ontologies for their own purposes. This is a good thing as it is probably impossible to come up with a single, agreed ontology that handles the full complexity of the domain.
I would argue strongly that we should not build a single central ontology that summarizes all we know about nomenclature - we couldn't do it within my lifetime :)
Avowed Serializations
Because RDF can be serialized as XML it is possible for an XML document to both validate against an XML Schema AND be valid RDF. This may be a useful generic solution so I'll explain it here in an attempt to make it accessible to those not familiar with the technology.
The same RDF data can be serialized in XML in many ways and different code libraries will do it differently though all code libraries can read the serializations produced by others. It is possible to pick one of the ways of serializing a particular set of RDF data and design a XML Schema to validate the resulting structure. I am stuck for a way to describe this so I am going to use the term 'avowed serialization' (Avowed means 'openly declared') as opposed to 'arbitrary serialization'. This is the approach taken by the prismstandard.org http://www.prismstandard.orggroup for their standard and it gives a number of benefits as a bridging technology:
- Publishing applications that are not RDF aware (even simple scripts) can produce regular XML Schema validated XML documents that just happen to also be RDF compliant.
- Consuming applications can assume that all data is just RDF and not worry about the particular XML Schema used. These are the applications that are likely to have to merge different kinds of data from different suppliers so they benefit most from treating it like RDF.
- Because it is regular structured XML it can be transformed using XSLT into other document formats such as 'legacy' non-RDF compliant structures - if required.
There is one direction that data would not flow without some effort. The same data published in an arbitrary serialization rather than the avowed one could be transformed, probably via several XSLT steps, into the avowed serialization and therefore made available to legacy applications using 3 above. This may not be worth the bother or may be useful. Some of the code involved would be generic to all transformations so may not be too great. It would certainly be possible for restricted data sets.
To demonstrate this, instance.xml is included in the package along with avowed.xsd and two supporting files. instance.xml validates against avowed.xsd and parses correctly in the W3C RDF parser.
I have not provided XSLT to convert instance.xml to the standard TCS format, though I believe it could be done quite easily if required. Converting arbitrary documents from the current TCS to the structure represented in avowed.xsd would be trickier but feasible, and certainly possible for the restricted uses of the schema that are typical of individual data suppliers.
Contents
This is what the files in this package are:
README.txt = this file
TaxonNames.rdfs = An RDF vocabulary that represents the TCS TaxonNames object.
TaxonNames.html = Documentation generated from TaxonNames.rdfs - much more readable.
instance.xml = an example XML document that is both an RDF-compliant use of the vocabulary and XML Schema compliant.
avowed.xsd = XML Schema that instance.xml validates against.
dc.xsd = XML Schema that is used by avowed.xsd.
taxonnames.xsd = XML Schema that is used by avowed.xsd.
rdf2html.css = the style formatting for TaxonNames.html
rdfs2html.xsl = XSLT style sheet to generate the docs from TaxonNames.rdfs
tcs_1.01.xsd = the TCS XML Schema, for reference.
Needs for other Vocabularies
What is obvious looking at the vocabulary for TaxonNames here is that we need vocabularies for people, teams of people, literature and specimens as soon as possible.
Need for conventions
In order for all exchanged objects to be discoverable in a reasonable way we need to have conventions on the use of rdfs:label for Classes and Properties and dc:title for instances.
The namespaces used in these examples are placeholders, as we have not finalized them yet.
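By way of illustration, the convention would amount to something like this (namespaces again being placeholders):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:tn="http://example.org/TaxonNames#">
  <!-- Classes and Properties carry rdfs:label -->
  <rdfs:Class rdf:about="http://example.org/TaxonNames#TaxonName">
    <rdfs:label>Taxon Name</rdfs:label>
  </rdfs:Class>
  <rdf:Property rdf:about="http://example.org/TaxonNames#genusEpithet">
    <rdfs:label>Genus Epithet</rdfs:label>
  </rdf:Property>
  <!-- instances carry dc:title -->
  <tn:TaxonName rdf:about="urn:lsid:example.com:names:567">
    <dc:title>Bellis perennis L.</dc:title>
  </tn:TaxonName>
</rdf:RDF>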
Minor changes in TCS
There are a few points where I have intentionally not followed TCS 1.01 (there are probably others where it is accidental).
- basionym is a direct pointer to a TaxonName rather than a NomenclaturalNote. I couldn't see why it was a nomenclatural note in the 1.01 version as it is a simple pointer to a name.
- Changed the name of the genus element to a genusEpithet property. The contents of the element are not to be used alone and are not a genus name in themselves (a uninomial should be used in that case), so genusEpithet is more appropriate - even if it is not common English usage.
- Addition of a referenceTo property. The vocabulary may be used to mark up an occurrence of a name that is not the publication of a new name. In these cases the thing being marked up is actually a pointer to another object, either a TaxonName issued by a nomenclator or a TaxonConcept, so we need to have a reference field. Here is an example (assuming a namespace):

<TaxonName referenceTo="urn:lsid:example.com:myconcepts:1234">
  <genusEpithet>Bellis</genusEpithet>
  <specificEpithet>perennis</specificEpithet>
</TaxonName>
This could possibly appear in an XHTML document, for example.
Comments Please
All this amounts to a fairly complex suggestion of how things could be done: we develop central vocabularies that go no further than RDFS but permit exchange and validation of data using avowed serializations and OWL ontologies.
What do you think?
Roger
--
Roger Hyam Technical Architect Taxonomic Databases Working Group
http://www.tdwg.org roger@tdwg.org
+44 1578 722782
Tdwg-tag mailing list Tdwg-tag@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org
Roger,
Microsoft InfoPath 2003, which comes with Microsoft Office 2003 Professional, is supposed to do what you want.
It doesn't create a nice looking user interface automagically from an XML Schema, though. You need to load the schema and then design the form, by dragging the XML elements into it and adding captions, labels, etc. But once you do that, the user interface for filling in the form is kinda nice.
It comes with a number of silly templates for business use (invoices, time cards, etc). I loaded TCS schema v1.0 and designed a simple form using it just for fun. See the attached picture.
I hope this helps.
Regards,
Ricardo
Roger Hyam wrote:
Can you point me in the direction of a tool that will generate interfaces for data editing from an arbitrary XML Schema - and works out of the box? I presume that, as you bemoan the lack of them for RDF, they exist for XML Schema based documents.
I am not talking about XML editors (Spy and friends) here; I am talking about pukka data editors for regular biologists to use. I gave up trying to find these tools a while back but they may now have reached maturity. There was all this promise that one would be able to define a document structure in XML Schema, distribute this to clients, and they would just see groovy forms to fill in and manage data in a database somewhere. No more slog in designing user interfaces - just one generic tool. I got excited about XForms but soon got over that. Is there a tool I can download and use with SDD, TCS and ABCD?
If I could get my hands on such a tool I could try it with one of my 'avowed' serialization schemas and then demonstrate that I am editing RDF with it - but that really would be confusing. Where are the generic XML tools you imply exist?
Thanks Ricardo.
Now I am not a great fan of MS Office but I am actually going to say that this looks pretty cool :)
I just put a quick form together based on avowed.xsd, filled out a copy and validated it against the W3C RDF validator.
So yes, you can build forms on the basis of an XML Schema and yes, the results can be valid RDF. So here we have a 'cool tool' for biologists to create XML by filling in forms - and the resulting XML can also be valid RDF!
The only fly in the ointment is that every desktop needs a copy of InfoPath, there is no cross-platform support, etc. Good inside larger organisations though.
Roger
Hi Roger and Ricardo, Altova (the makers of XMLSpy) has a similar product for creating forms on top of XML Schemas; I think it's called StyleVision. However, I don't see either of these tools as fitting for a 'global' roll-out where each biologist is required to own a license. For widespread use, a custom web interface seems more appropriate to me.
// Fini
http://www.franz.com/resources/educational_resources/white_papers/AllegroCac... is a rather interesting piece about RDF scalability. They claim to load 300,000 triples/sec from a triple store based on Allegro Common Lisp.
Allegro CL is also at the heart of Ora Lassila's Wilbur toolkit. OINK, a way cool new Wilbur application, is described at http://www.lassila.org/blog/archive/2006/03/oink.html. [Does Wilbur run on the free version of Allegro?] Wilbur loads 2600 triples/sec. Lassila is generally regarded, with Hendler and Berners-Lee, as one of the founders of the Semantic Web.
[Bill Campbell, a colleague of mine and author of UMB-Scheme distributed with RedHat Linux, once remarked "XML is just Lisp with pointy brackets". The above might support: "RDF is just CLOS with pointy brackets". Which, by the way, is positive.]
Does anyone know what triple retrieval claims Oracle is making for its triple store support?
There is a good current survey of RDF programming support at http://www.wiwiss.fu-berlin.de/suhl/bizer/toolkits/
--Bob
Bob,
This is interesting stuff. I don't know what claims Oracle is making for its triple store, but there are many other database-backed triple stores out there and I've examined several of them in depth.
With most of the current crop of triple stores (including Jena's, which we use in DiGIR2), triples are stored in a single de-normalized table that has columns for subject, predicate, and object. This table is heavily indexed to allow quick lookups, for example, to find statements with a given subject and predicate. The difficult thing in measuring triple store performance is not raw throughput (how fast triples can be loaded or listed) but query performance. With many of the triple stores I've examined, raw throughput is limited only by how fast the underlying database is at performing SQL inserts and selects. Granted, there is some overhead in the RDF framework that sits atop the database, but performance for insertions and basic retrievals is dominated by the underlying database.
With sophisticated queries the story is quite different. For a long time every triple store had its own query language. Now that the world is starting to standardize on SPARQL, I hope to see a standard set of SPARQL-based metrics that will allow query performance comparisons to be made across triple store implementations. SPARQL is very powerful and allows a large variety of useful queries. However, much of SPARQL cannot be pushed down into SQL queries. This puts any triple store designed to work over a relational database at risk of having to load all triples into memory for examination by the RDF framework in order to answer sophisticated SPARQL queries. The simplest example of such a query is one that uses the filter(regex()) pattern, because most relational databases cannot perform XPath's matches regex function.
I hope to have more information about Oracle's performance claims soon and I'll share them with the list when I get them.
-Steve
Steve,
I find the triple store debate interesting because the goal posts seem to shift. No one would claim to design a conventional relational structure (or perhaps generate one from an XML Schema?) that was guaranteed to perform equally quickly for any arbitrary query. All real-world relational schemas are optimized for a particular purpose. When people talk about triple stores they expect to be able to ask them *anything* and get a responsive answer - when that is never expected of a relational db.
As an example: Donald would be mad to build the GBIF data portal as a basic triple store because 90% of queries are currently going to be the same i.e. by taxon name and geographical area. Even if there is a triple store out the back some place one is going to want to optimize for the most common queries. If you want to ask something weird you are going to have to wait!
Enabling a client to ask an arbitrary query of a data store with no knowledge of the underlying structure (only the semantics) and guaranteeing response times seems, to me, to be a general problem of any system - whether we are in a triple based world or a XML Schema based one. It also seems to be one we don't need to answer.
I imagine that the people who are looking at optimizing triple stores are looking at using the queries to build 'clever' indexes that amount to separate tables for triple patterns that occur regularly a little like MS SQL Server does with regular indexes. But then this is just me speculating.
We have to accept that data publishers are only going to offer a limited range of queries of their data. Complex queries have to be answered by gathering a subset of data (probably from several publishers) locally or in a grid and then querying that in interesting ways. Triple stores would be great for this local cache as it will be smaller and can sit in memory etc. The way to get data into these local caches is by making sure the publishers supply it in RDF using common vocabularies - even if they don't care a fig about RDF and are just using an XML Schema which has an RDF mapping.
Can we make a separation between the use of RDF/S for transfer, for query and for storage or are these things only in my mind?
Thanks for your input on this,
Roger
Hi Roger,
Thanks for the excellent reply. You raise some serious concerns and touch on some of the issues that Robert Gales, Dave Vieglais, and I have been investigating.
Roger Hyam wrote:
Steve,
I find the triple store debate interesting because the goal posts seem to shift.
Perhaps the goal posts seem to shift because I haven't clearly made my point. We've been envisioning many different services that might participate in a semantic network. These include but aren't limited to providers, indexers, aggregators/mirrors, LSID authorities, analysis services and translation services. In addition to these we see a bunch of web and desktop applications that use these services. Only some of these services and applications are good candidates for implementation over a triple store. I'll address that more fully in a minute. Despite the fact that not all services will be backed by them, we see triple stores playing an important role.
No one would claim to design a conventional relational structure (or perhaps generate one from an XML Schema?) that was guaranteed to perform equally quickly for any arbitrary query. All real-world relational schemas are optimized for a particular purpose. When people talk about triple stores they expect to be able to ask them *anything* and get a responsive answer - when that is never expected of a relational db.
I certainly never meant to imply that any arbitrary query could be guaranteed answerable in a reasonable amount of time. As you point out, it's well known that this is not true of either relational databases that use SQL or of triple stores that use SPARQL. However, since SPARQL query-enabled triple stores may play a significant role in some of the services I listed above, I'm interested in measuring relative performance of different triple store implementations. Where we decide to use them, we ought to use the fastest stores available.
As an example: Donald would be mad to build the GBIF data portal as a basic triple store because 90% of queries are currently going to be the same i.e. by taxon name and geographical area. Even if there is a triple store out the back some place one is going to want to optimize for the most common queries. If you want to ask something weird you are going to have to wait!
I agree that any hypothetical RDF-based GBIF portal is a bad candidate for implementation over a triple store. To build something like a GBIF portal, I'd use a design that is a combination of an indexer and an aggregator. In my mind, an aggregator is a piece of software that harvests RDF from providers and stores it in a persistent cache. An indexer either harvests RDF or scans an aggregator's cache in order to build indexes that can be used to rapidly answer simple queries. Index services don't provide a SPARQL query interface so they don't need to be implemented over a triple store. Instead an index service could be backed by a field-enabled on-disk inverted index. This is the same technology that backs search engines and *it is very different from a general purpose database*. So an index service is a kind of search engine. A hypothetical GBIF index service could be designed to index only the scientific name and geography fields of the concise bounded descriptions (metadata objects) that represent specimens.
The query interface for an indexer is much like a search engine except that the query string can be a boolean expression of field/query-term pairs. For example, class:"DarwinCoreSpecimen" AND scientificName:"Anthophora linsleyi" AND country:"US"
Also like a search engine, the result of an index service query is a list of URIs. With an indexer these URIs are the LSIDs of the matching CBDs (metadata objects). If the client of a hypothetical GBIF index service wants RDF returned instead of a list of LSIDs, it can fetch the corresponding RDF chunks from the aggregator or resolve each LSID against its authority.
This is a scalable design built using well-understood search engine technology. It allows one to perform simple searches very rapidly, but it has a downside that I'll address below.
Enabling a client to ask an arbitrary query of a data store with no knowledge of the underlying structure (only the semantics) and guaranteeing response times seems, to me, to be a general problem of any system - whether we are in a triple based world or a XML Schema based one. It also seems to be one we don't need to answer.
I imagine that the people who are looking at optimizing triple stores are looking at using the queries to build 'clever' indexes that amount to separate tables for triple patterns that occur regularly a little like MS SQL Server does with regular indexes. But then this is just me speculating.
We have to accept that data publishers are only going to offer a limited range of queries of their data. Complex queries have to be answered by gathering a subset of data (probably from several publishers) locally or in a grid and then querying that in interesting ways. Triple stores would be great for this local cache as it will be smaller and can sit in memory etc. The way to get data into these local caches is by making sure the publishers supply it in RDF using common vocabularies - even if they don't care a fig about RDF and are just using an XML Schema which has an RDF mapping.
I understand where you're coming from here and I sympathize with your design to keep the barrier-to-entry low for data providers. However, I hope we can aim higher than this for several reasons.
The first is pragmatic: if data providers offer only limited query capabilities, then they become more difficult to use. As an example, imagine that a data provider serves specimens and supports queries by the taxonomic rank including genus, species or subspecies as well as by country. If I want to get all specimens collected in Kansas I have to query for and download all specimens for the US and then filter out results from the other 49 states before I can do anything meaningful with the data. Likewise, if I want to get all specimens for a particular order like Hymenoptera, then I'm forced to do several queries for genera that I think are under it, then aggregate the result and filter out any false positives before I can actually use the data. To sum up this first point, providing only minimal query capabilities on providers can increase the number of queries that must be made to perform a search and can lead to excessive traffic, not to mention inconvenience to the users of providers. This is one of the criticisms that also applies to an index service; if a field that you're interested in is not indexed, then you can't query on it.
You might argue that no one would restrict searches to only those three ranks. That argument brings me to my second point. When you place restrictions on the *type* of queries supported by a provider it's quite easy to inadvertently prevent a large number of simple but useful queries. The two example queries above (state = Kansas, order = Hymenoptera) are both simple searches yet they were not directly allowed by the provider. If the reason behind restricting queries is to lighten the load on providers and enable rapid response times, then, in the examples above, the client was frustrated (because she had to find an alternative method to get the desired data) and the goal was not met because the provider incurred the expense of transmitting a huge amount of data in the first case and handling many separate queries in the second.
To me, the major benefit of RDF is that it gives us a flexible set of data structures that can be extended and expanded over time. RDF makes it easier to interoperate with new data models within our domain such as descriptive data and taxon concepts. It could also make it easier to integrate our data with data sets from other disciplines such as geology, physical oceanography, or climatology. One consequence of this is that we can't reliably know now what queries will be most beneficial in the future. Right now the low hanging fruit is specimen queries by taxa and gross locality, but will this be true tomorrow? Building and deploying general-use data provider software is expensive, but if we restrict queries on providers then we have to have many different domain-specific data provider packages (one for specimen, one for names, etc.). Are we going to disallow the same provider from serving both specimens and descriptive data at the same time? I'd hate to see us restrict queries to a limited set now only to find that we have to change this set in the future.
So how do we guarantee reasonable performance on providers? I don't think we need to. Performance is only one criterion for queries. The other two important ones are precision and recall. If users of a particular application think performance is more important than precision, then the application should be configured to use an index service. This is a great solution for portal-style browsing applications. If, on the other hand, the application is designed to collect data for an analysis such as niche modeling, then precision is much more important than performance and it should be allowed to pose complicated queries to providers.
I think what's important is making sure that providers do not suffer inadvertent denial of service attacks by overeager clients posing too many simultaneous long-running queries. I think the best way to prevent this is to allow providers to limit the number of threads they will service and limit the time they will allocate to any single query (to say 5 minutes). If they can't respond in that time, then they ought to send the HTTP 408 timeout response. This is a moderate level of protection for providers that also allows us the flexibility to pose new queries over expanded data models in the future without rewriting or redeploying code.
Sorry for the long-winded response. In part I wanted to get these ideas out before the TAG meeting. At the meeting I'd be happy to present some of these ideas in more depth (with pretty diagrams) and perhaps stage a demonstration of a prototype provider (DiGIR2), a prototype index service, and a prototype application that uses them.
-Steve
Roger wrote:
No one would claim to design a conventional relational structure (or perhaps generate one from an XML Schema?) that was guaranteed to perform equally quickly for any arbitrary query. All real world relational schemas are optimized for a particular purpose. When people
I would not expect a triple store to behave well for all possible queries. However, my feeling is that a triple store in an RDBMS will very soon hit the ceiling. Essentially it is an "overnormalized" model, i.e. the parts of an entity are considered independent entities themselves.
With RDBMSs (based on experience more than theory) I can say that a good relational model holds up for an astonishingly wide range of queries. That is, I only exceptionally "optimize" the model for queries (largely by adding additional indices), and my experience is that the query optimizer (exactly because I do not tell it how to solve the query) works well with rather unexpected queries.
As an example: Donald would be mad to build the GBIF data portal as a basic triple store because 90% of queries are currently going to be the same i.e. by taxon name and geographical area. ...
I agree that any hypothetical RDF-based GBIF portal is a bad candidate for implementation over a triple store. To build something like a GBIF portal, I'd use a design that is a combination of an indexer and an aggregator. In my mind, an aggregator is a piece of software that harvests RDF from providers and stores it in a persistent cache. An indexer either harvests RDF or scans an aggregator's cache in order to build indexes that can be used to rapidly answer simple queries. ...
This discussion is exactly the one-way street use case of data which I can see that RDF is good for. However, in the case of taxonomic and descriptive data would it not be absurd to be unable to build upon existing data?
The underlying assumption in GBIF seems to be that knowledge is institutionalized, and adding knowledge is done only within the institution. I believe that this is true for specimens, and may be desirable to become true for names. However, these examples are rather the exception (and essentially boring infrastructure cases - no-one is interested in specimens or names per se).
The assumption does not hold true for the truly relevant forms of knowledge on biology, including species concepts, taxonomic revisions, organism properties and interactions, and identification. Here knowledge has for centuries been expressed in short articles or monographs. The goal of many of us (SDD and others) is to move this into digital form and to become able to share this knowledge. That means that I must be able to write software that fully captures all data in a document - and a triple store seems to be the only way to handle RDF-based data.
--- Personally I think that the current data-provider/data-aggregator mode of GBIF is already counterproductive in the case of species data. In my scientific work I am unable to make use of GBIF information, because it no longer corresponds to interpretable trust networks, but refers to uninterpretable aggregators and superaggregators like Species 2000. ---
This is not to say that RDF is not the right choice for exactly the knowledge documents TAxMLit, SDD and others want to share (i.e. edit on two sides, with different, TDWG-standard-compatible software...). But I am worried that the generality of RDF *may* make it impossible to limit the number of options, choices etc. that software has to deal with, and ONLY allows super-generalized software like reasoning engines and triple stores.
Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19, 14195 Berlin, Germany
Tel: +49-30-8304-2220  Fax: +49-30-8304-2203
Gregor,
Thanks as ever for your thoughts. I think that (at the risk of digressing too far) I should make a few comments on the points at which your comments relate to GBIF.
1. I repeat what I have said many times before. TDWG must not optimise its standards either for what GBIF is today or for what we may like it to become if such optimisation will in any way be harmful for other purposes. We need a good information architecture which will support the widest possible range of applications.
2. Please don't judge GBIF's goals on the limited achievements of the current data portal (which will be discarded as soon as I can replace it - some time later this year). The aim is certainly to provide clear "trust networks" (all the way back to original sources) and to allow all data to be filtered by such criteria.
3. Right now we are offering access to very little data that I would characterise as "species data". I would have been massively surprised if the current portal were providing you with anything valuable for your work. I'm really looking for guidance from people such as yourself who have clear expectations of how data should be presented for their purposes. Please give your thoughts.
4. I would also say that the level of aggregation that may be appropriate for specimen/observation data (to facilitate rapid search by taxonomic, geospatial and temporal criteria) is unlikely to be the same for many other classes of information. In such cases I expect central services such as GBIF to be much more clearly acting as brokers to help users find and retrieve information. Again I'd really like to receive your comments on how e.g. the international pool of SDD data might best be handled.
I'll also comment briefly again on the general subject of RDF, since I get the impression that some people think I (or GBIF) has a secret agenda in this area.
As far as I am concerned, the following are the things I would really like to see as elements in the overall TDWG architecture. Most other things could be worked out in just about any way that will work and I would still be very happy.
1. Clear, well-understood ontology of primary data classes
2. Data modelled as extensible sets of properties for these classes (more Darwin Core-like, rather than as monolithic documents)
3. Modeling in a 'neutral' language such as UML with clearly defined mechanisms for generating actual working representations
4. A well-defined way to represent data in RDF for situations which need it (e.g. LSID metadata)
5. An LSID identifier property available for use with any object
6. A clear path to allow us to use TAPIR queries to perform searches for data objects (much simpler with objects like these than for whole documents)
For number 2 above, RDF or an OWL language would certainly be a good fit, but I know that a Darwin Core-like (GML-like) approach could easily give us what we need and I would be thrilled with any approach that met this criterion.
As a side issue, I'm not sure how easy it really would be for us to use an RDBMS-based approach to support the integration of all of the disparate and relevant information which is (as you say) scattered through so many sources. A world in which anyone can annotate any data element would seem much more suitable.
I'd also like to emphasise that the choice of a TDWG representation for data (XML schema, RDF, whatever) should serve the needs of data exchange and will not necessarily be the appropriate way to store the data at either end (any more than we would expect all collection databases to have a flat table which looks like Darwin Core). The database you construct to support efficient queries need not be the same as the one that I construct, or the object model inside someone else's application. The critical issue is how easy it is for two parties to exchange the set of objects and properties that they wish to share.
Thanks,
Donald
--------------------------------------------------------------- Donald Hobern (dhobern@gbif.org) Programme Officer for Data Access and Database Interoperability Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480 ---------------------------------------------------------------
Donald writes:
- Please don't judge GBIF's goals on the limited achievements of the current data portal (which will be discarded as soon as I can replace it - some time later this year). The aim is certainly to provide clear "trust networks" (all the way back to original sources) and to allow all data to be filtered by such criteria.
I did not mean to belittle anything GBIF has achieved - I probably should have made that clearer. I am not arguing against GBIF's or your achievements in using big data providers; I am just trying to express my thought that part (and I believe a big part) of the future may lie in something less organized, document-centric rather than institutional-database-provider-centric.
This has some implications for the RDF debate, if we argue that big data providers will set up the conversion tools to publish excerpts of their proprietary data structures in RDF, and that RDF import use cases are relevant mostly for aggregators/indexing services.
Your point about trust-networks is excellent (and I did participate in your survey asking for suggestions about the data portal...)
retrieve information. Again I'd really like to receive your comments on how e.g. the international pool of SDD data might best be handled.
I have no easy answer to this. Not only is SDD implemented only at a beta stage, but the Delta/Lucid/DeltaAccess documents which could currently be expressed in SDD are also rarely made available - partly because there seems to be not enough value in making them available (which using them through GBIF could change in the future), partly because people have reservations about making their work available.
As a side issue, I'm not sure how easy it really would be for us to use an RDBMS-based approach to support the integration of all of the disparate and relevant information which is (as you say) scattered through so many sources. A world in which anyone can annotate any data element would seem much more suitable.
Perhaps indeed; I just cannot think it through, it seems to blow my brain. I feel a major point is what Steven said about the CBD being the real unit of information, not the triples. This rings a bell in me, but I cannot hear it loud enough yet. I am sorry if this is causing confusing posts from me.
An example: In SDD we think modifiers are very important.
"Flower red" and "Flowers almost never red"
could in RDF be:
a) Species - FlowerColor - red
b) Species - FlowerColor - red
   ReificationOfTheAbove - Modifier - "almost never"
Getting this as independent, extensible tuples (getting the first but not the second) is not real information. The whole - the "CBD" - is the unit of information which I can criticize, reject or approve.
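For what it is worth, (b) might look something like the following in RDF/XML - a rough sketch only, with made-up namespaces and property names, using the standard RDF reification vocabulary:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:d="http://example.org/descriptions#">
  <!-- the bare statement: the species has flower colour red -->
  <rdf:Description rdf:about="http://example.org/taxa#SpeciesX">
    <d:flowerColor rdf:resource="http://example.org/terms#red"/>
  </rdf:Description>
  <!-- the reified statement, so that the modifier can be attached to it -->
  <rdf:Statement rdf:about="http://example.org/statements#1">
    <rdf:subject rdf:resource="http://example.org/taxa#SpeciesX"/>
    <rdf:predicate rdf:resource="http://example.org/descriptions#flowerColor"/>
    <rdf:object rdf:resource="http://example.org/terms#red"/>
    <d:modifier>almost never</d:modifier>
  </rdf:Statement>
</rdf:RDF>

A consumer that happens to fetch only the first block gets exactly the misleading half-information described above.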
Similarly, for taxon concepts expressed through character circumscription, the concept that can be analyzed or criticized is only the total of all descriptive statements, nothing less.
Now, in the XML Schema world, this boundary is assumed (though nowhere guaranteed) to be a document. However, in SDD we ran into exactly the opposite problem, in that we did mean to extend across servers and documents (although in a less atomized way than RDF). RDF would be a solution for these problems - perhaps we will understand the problems better if we better understand how to define and refer to CBDs when using RDF? I do not understand this yet.
Donald writes:
The database you construct to support efficient queries need not be the same as the one that I construct, or the object model inside someone else's application. The critical issue is how easy it is for two parties to exchange the set of objects and properties that they wish to share.
If you want interoperability of documents, you have to be able to match imported data losslessly into your inner information model. I feel that it will be very difficult to import data into your (permanent, editable) data store unless you at least use a very similar basic object ontology and a similar concept of cardinality constraints. Currently a major problem in consuming DarwinCore flat structures is that you are left to guess about the relationships between multiple element instances. Better "boxing" into object types in DwC clearly overcomes this, but if you have two different boxing models (object ontology, internal information model) the problem probably appears worse than before.
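To illustrate what I mean (the element names here are invented for the example, not actual DarwinCore terms): a flat record with repeated elements leaves the grouping ambiguous, while a "boxed" version makes it explicit.

<!-- flat: which determiner goes with which name? -->
<Record>
  <scientificName>Bellis perennis</scientificName>
  <scientificName>Bellis sylvestris</scientificName>
  <identifiedBy>Smith</identifiedBy>
  <identifiedBy>Jones</identifiedBy>
</Record>

<!-- boxed: the relationships are explicit -->
<Record>
  <Identification>
    <scientificName>Bellis perennis</scientificName>
    <identifiedBy>Smith</identifiedBy>
  </Identification>
  <Identification>
    <scientificName>Bellis sylvestris</scientificName>
    <identifiedBy>Jones</identifiedBy>
  </Identification>
</Record>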
Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19, 14195 Berlin, Germany
Tel: +49-30-8304-2220  Fax: +49-30-8304-2203
Hi all,
RDF to me appears to be on a level of abstraction that makes it very hard for me to follow the documentation and discussion. Most of the examples are embedded in artificial intelligence / reasoning use cases that I have no experience with.
I am a biologist and I feel comfortable with UML, ER modeling, XML Schema modeling, and - surprise - relational databases. I believe many others are as well - how many datastores are actually built upon RDBMS technology?
To me XML Schema maps nicely to both UML-like OO modeling and relational DBMSs. I can guess about the advantages of opening this all up and seeing the world as a huge set of unstructured statement tuples. But it also scares me.
Angst is a bad advisor. But what if only a minority of the few people currently involved can follow at the RDF abstraction level? A few questions I have:
* Would we be first in line to try rdf for such complex models as biodiversity informatics?
* Do Genbank/EMBL with their hundreds of employees and programmers use rdf? Internally/externally? The molecular bioinformatics is probably 1000 times larger than our biodiversity informatics.
* Why are GML, SVG etc. based on xml schema and not RDFS? Is this just historical?
* Are there any tools around that let me import RDF into a relational database? (Simple tools for XML-Schema-based import/export are almost a standard part of databases now, or you can use comfortable graphical tools like Altova MapForce.)
-- I am just trying to test some tools to help me visualize RDFS productions (like the one Roger has sent around) on a level comparable with the UML-like XML Schema editors (Spy, Stylus, Oracle, etc.). I will try Altova SemanticWorks and Protege over the next week. The screenshots seem to be about AI and the semantic web much more than about information models (those creatures where you try to simplify the world to make it manageable...).
Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19, 14195 Berlin, Germany
Tel: +49-30-8304-2220  Fax: +49-30-8304-2203
I can contribute a few answers:
- Would we be first in line to try rdf for such complex models as biodiversity informatics?
I think there's no harm, and likely some benefit, to experimentation. However, I believe our models need to be specified in some way that is (as) independent (as possible) of serialization / encoding.
This of course completely avoids the question of what we should use as a practical matter for communications and online schemas. In that department, I would tend to go with XML Schema until standard RDF encodings have been developed for models upon which we (are likely to) depend. Mixed implementations might also be appropriate and manageable in some cases.
- Why are GML, SVG etc. based on xml schema and not RDFS? Is this just historical?
In the case of GML, I guess you could say it's historical. An early version of GML was actually coded in RDF, and GML implicitly incorporates some of the same semantic notions as RDF. XML Schema has been used since then for formal specifications because when it was created, RDF and the tools that support it were not sufficiently mature. There is somewhat of a movement now to develop RDF serializations, but I'm not aware of any formal projects.
Flip
Phillip C. Dibner Ecosystem Associates (650) 948-3537 (650) 948-7895 Fax
Thanks, Flip.
I agree that, as far as we can, we should start with data models which are defined independently of the encoding. The key factor is to move towards a cleaner object-oriented model (with well-defined objects and relationships); then we should be able to play with different encodings much more easily than today.
Best wishes,
Donald
--------------------------------------------------------------- Donald Hobern (dhobern@gbif.org) Programme Officer for Data Access and Database Interoperability Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480 ---------------------------------------------------------------
Gregor,
I can understand your angst, but I would like to suggest that XML Schema really only provides good support for some aspects of OO modelling. Extending classes is a real problem.
A data model encoded in RDF can still make use of an ontology language to provide greater rigour in the way that objects are defined.
As was indicated in some of the earlier messages here, it is even possible to put together a data model which looks fundamentally the same as one defined using XML Schema, but which uses RDF technologies under the covers and which is consequently easier to extend.
For me however the biggest factors of importance in a revision of our data models would be:
1. A cleaner separation between different object classes (not all versioned in a single schema).
2. A good model to support easy extension (using a multiple inheritance approach) so that different (potentially overlapping) communities can add extra information in the ways that best suit them.
3. An underlying ontology that is sufficient for us at least to identify the object class of each record.
RDF technologies are an excellent way to do this. GML has managed to produce many of the same features, but has probably done so largely by replicating the essentials of RDF modelling.
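To illustrate what I mean by easy extension through multiple inheritance (point 2 above), here is a rough sketch - the class names and namespaces are invented, not a proposal. A single class can be declared a subclass of two independently maintained classes, so a record typed with it carries both communities' properties:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <!-- Invented example: a specimen class belonging both to a core
         collections vocabulary and to a geospatial extension -->
    <rdfs:Class rdf:about="http://example.org/ext#GeoreferencedSpecimen">
      <rdfs:subClassOf rdf:resource="http://example.org/core#Specimen"/>
      <rdfs:subClassOf rdf:resource="http://example.org/geo#LocatedThing"/>
    </rdfs:Class>
  </rdf:RDF>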
Thanks,
Donald
--------------------------------------------------------------- Donald Hobern (dhobern@gbif.org) Programme Officer for Data Access and Database Interoperability Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480 ---------------------------------------------------------------
-----Original Message----- From: Tdwg-tag-bounces@lists.tdwg.org [mailto:Tdwg-tag-bounces@lists.tdwg.org] On Behalf Of Gregor Hagedorn Sent: 24 March 2006 18:37 To: Tdwg-tag@lists.tdwg.org Subject: [Tdwg-tag] RDF instead of xml schema
Hi all,
To me, RDF sits at a level of abstraction that makes it very hard to follow the documentation and discussion. Most of the examples are embedded in artificial intelligence / reasoning use cases that I have no experience with.
I am a biologist and I feel comfortable with UML, ER modeling, XML Schema modeling, and - surprise - relational databases. I believe many others do as well - how many datastores are actually built upon RDBMS technology?
To me XML Schema maps nicely to both UML-like OO modeling and relational DBMSs. I can guess at the advantages of opening this all up and seeing the world as a huge set of unstructured statement tuples. But it also scares me.
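To make concrete what I mean: a specimen record that sits comfortably in one table row becomes, in RDF, a bag of separate statements, something like the following (all URIs invented):

  <http://example.org/specimen/123> <http://example.org/terms#collector> "A. Smith" .
  <http://example.org/specimen/123> <http://example.org/terms#collectionDate> "1998-05-04" .
  <http://example.org/specimen/123> <http://example.org/terms#determination> <http://example.org/names/456> .

Each line is one subject-predicate-object tuple, and the 'record' exists only as whatever happens to share the same subject.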
Angst is a bad advisor. But then, what if only a minority of the few people currently involved can follow at the RDF level of abstraction? A few questions I have:
* Would we be first in line to try RDF for such complex models as biodiversity informatics?
* Do GenBank/EMBL, with their hundreds of employees and programmers, use RDF? Internally/externally? Molecular bioinformatics is probably 1000 times larger than our biodiversity informatics.
* Why are GML, SVG etc. based on XML Schema and not RDFS? Is this just historical?
* Are there any tools around that let me import RDF into a relational database? (Simple tools for XML Schema-based import/export are almost a standard part of databases now, or you can use comfortable graphical tools like Altova MapForce.)
I am just trying to test some tools that help me visualize RDFS productions (like the ones Roger has sent around) on a level comparable with the UML-like XML Schema editors (Spy, Stylus, Oracle, etc.). I will try Altova SemanticWorks and Protege over the next week. The screenshots seem to be about AI and the semantic web much more than about information models (those creatures where you try to simplify the world to make it manageable...).
Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany          Fax: +49-30-8304-2203
Hi,
Just a comment on the GML part.
RDF technologies are an excellent way to do this. GML has managed to produce many of the same features, but has probably done so largely by replicating the essentials of RDF modelling.
I understand that GML has provided standard explicit encodings for only some things. These include most of the base things that people need to share, like geometry, topology, observations, coordinate reference systems, etc. These items are covered by fixed schema components.
On the other hand, GML does not want to invent another schema language to cover a broader range of application domains, for example biodiversity informatics. For schema definition they have elected (at least for the near term) to use XML Schema. GML has used other schema languages in the past, such as DTD and RDF, but it does not try to create another schema language just for GML.
So, do not consider GML just a modeling language. It provides a framework where application models can be created, together with the geographical stuff, using the "most popular" schema language of the moment, while keeping interoperability possible.
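As a rough sketch of what an application schema on top of GML looks like (the element and type names are invented, not from any real schema): you define your own feature type in ordinary XML Schema by extending the GML feature model, and geometry, coordinate reference systems and so on come from the fixed GML components.

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             xmlns:gml="http://www.opengis.net/gml"
             xmlns:bio="http://example.org/bio"
             targetNamespace="http://example.org/bio"
             elementFormDefault="qualified">
    <xs:import namespace="http://www.opengis.net/gml" schemaLocation="gml.xsd"/>
    <!-- Invented feature type: a collecting site that inherits the standard
         GML feature properties (name, description, boundedBy) -->
    <xs:element name="CollectingSite" type="bio:CollectingSiteType"
                substitutionGroup="gml:_Feature"/>
    <xs:complexType name="CollectingSiteType">
      <xs:complexContent>
        <xs:extension base="gml:AbstractFeatureType">
          <xs:sequence>
            <xs:element name="habitat" type="xs:string" minOccurs="0"/>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:schema>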
I hope this clarifies things for someone.
Javier.
Hi Rob,
Thanks for your contribution. My comments below:
Robert Gales wrote:
Just thoughts/comments on the use of XML Schema for validating RDF documents.
I'm afraid that by using XML Schema to validate RDF documents, we would be creating unnecessary constraints on the system. Some services may want to serve data in formats other than RDF/XML, for example N-Triple or Turtle, for various reasons. Neither of these could be validated by an XML Schema. For example, I've been working on indexing large quantities of data represented as RDF using standard IR techniques. N-Triple has distinct benefits over other representations because its grammar is trivial. Another benefit of N-Triple is that one can use simple concatenation to build a model without being required to use an in-memory model through an RDF library such as Jena. For example, I can build a single large document containing N-Triples about millions of resources. The index maintains the file position and size for each resource indexed. The benefit of using N-Triple is that, upon querying, I can simply use fast random access to the file, based on the position and size stored in the index, to read in chunks of N-Triple and immediately start streaming the results across the wire.
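For instance, a fragment of such a file would look something like this (URIs invented) - one complete, self-contained triple per line, which is what makes concatenation and byte-offset indexing so cheap:

  <urn:lsid:example.org:names:1> <http://example.org/terms#nameComplete> "Bellis perennis L." .
  <urn:lsid:example.org:names:1> <http://example.org/terms#rank> <http://example.org/ranks#species> .
  <urn:lsid:example.org:names:2> <http://example.org/terms#nameComplete> "Bellis annua L." .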
This sounds like really interesting work! And it illustrates why what I am proposing is useful. If you wanted to introduce data into your index from arbitrary data providers would you prefer:
1. RDF as N-Triple: I guess this would be your favorite, but it is unlikely that all data sources would give it in this form - though some might.
2. RDF as XML: I guess this is second best, as you can convert it to N-Triple and then append it to your index. You don't care how the serialization into XML is done, as a library can read it and convert it so long as it is valid.
3. XML according to some arbitrary schema: This is what you will get today. This is a nightmare, as you would have to work out a mapping from the 'semantics' that may be in the document structure or schema into RDF triples.
What I am suggesting is that publishers who want to do 3 (which is a potential nightmare to consumers and indexers) could, by careful schema design, make themselves into 2 above - which makes them interoperable with an RDF world.
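To make the contrast between 2 and 3 concrete (the element names, vocabulary namespace and LSID below are all invented), an arbitrary schema might deliver something like:

  <record>
    <name id="1234">Bellis perennis</name>
    <rank>species</rank>
  </record>

where the meaning of 'id', 'name' and 'rank' lives only in that schema's documentation. The same fact serialized as RDF/XML is self-describing triples that any RDF library can merge with data from other suppliers:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:tn="http://example.org/TaxonNames#">
    <tn:TaxonName rdf:about="urn:lsid:example.org:names:1234">
      <tn:nameComplete>Bellis perennis</tn:nameComplete>
      <tn:rank rdf:resource="http://example.org/ranks#species"/>
    </tn:TaxonName>
  </rdf:RDF>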
With the additional constraint of using only RDF/XML as the output format, the above indexer example would either need to custom-serialize N-Triple -> RDF/XML or use a library to read it into an in-memory model to serialize it as RDF/XML.
If you want to return stuff from your index as N-Triple then your customers are going to have to be able to handle it. If they can't handle it you won't get any customers. If you serialize just the query results as RDF/XML then it may be a lot easier for people to consume. Perhaps you could offer a choice. I would suggest that if your customers wanted to have a particular 'avowed' serialization of the RDF then they should do it themselves but that it might be easier to do from XML than N-Triple.
Another concern is that we will be reducing any serialization potential we have from standard libraries. Jena, Redland, SemWeb, or any other library that can produce and consume RDF is not likely to produce RDF/XML in the same format. Producers of RDF will not only be required to use RDF/XML as opposed to other formats such as N-Triple, but will be required to write custom serialization code to translate the in-memory model for the library of their choice into the structured RDF response that fits the XML Schema. It seems to me that we are really removing one of the technical benefits of using RDF. Services and consumers really should not need to be concerned about the specific structure of the bits of RDF across the wire so long as it's valid RDF.
I agree with you fully but we are starting from scratch here and we have to take everyone with us - and see who we can pick up along the way. I get a lot of messages from people (written, verbal and body language) saying that they reckon doing things in RDF is dangerous because "it will never work". They may be right or they may be sticking with what they are comfortable with. If I can say "OK, forget about RDF, just make your XML look like this (which happens to be valid RDF)" then everyone can come to the same party.
1. If you 'speak' RDF/XML then anyone on the network can understand you. You can do this with any old script.
2. If you can understand RDF/XML you can listen to anyone on the network. You can do this with any old RDF parser...
3. If you don't understand RDF/XML then you will have to put some effort in to understand everyone but you will be able to understand a subset of people who use 'avowed' serializations that you care about.
What is important here is that if RDF really is a terrible thing then the consumers of data in category 3 will grow in number and nobody will bother with triples in a few years. On the other hand, if RDF is so great then consumers in category 3 will die out. Darwin kind of had it right for these things. I hope that the use of 'avowed' serializations will just let nature take its course. I sure as hell don't want the responsibility :)
In my humble opinion, any constraints and validation should be either at the level of the ontology (OWL-Lite, OWL-DL, RDFS/OWL) or through a reasoner that can be packaged and distributed for use within any application that desires to utilize our products.
Yes. Ideally that is how it should be done. If we have a basic RDFS ontology for the shared objects then people can extend this for their own purposes with OWL ontologies. We will never get agreement on a complete OWL ontology for the whole domain, for sociological as well as technical reasons.
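As a sketch of what such an extension might look like (the ontology namespace and exact property names are illustrative, not the finalized vocabulary), someone could assert that, for their purposes, a botanical species name is a TaxonName carrying exactly one specific epithet:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
           xmlns:owl="http://www.w3.org/2002/07/owl#">
    <!-- Illustrative OWL restriction layered on top of the shared RDFS vocabulary -->
    <owl:Class rdf:about="http://example.org/myontology#BotanicalSpeciesName">
      <rdfs:subClassOf rdf:resource="http://example.org/TaxonNames#TaxonName"/>
      <rdfs:subClassOf>
        <owl:Restriction>
          <owl:onProperty rdf:resource="http://example.org/TaxonNames#specificEpithet"/>
          <owl:cardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#nonNegativeInteger">1</owl:cardinality>
        </owl:Restriction>
      </rdfs:subClassOf>
    </owl:Class>
  </rdf:RDF>

A reasoner, rather than an XML validator, then decides whether a given TaxonName instance fits that class.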
I think my mistake here is calling this a 'generic' solution. It is a bridging technology.
Does this make sense?
Roger
Cheers, Rob
Roger Hyam wrote:
Hi Everyone,
I am cross posting this to the TCS list and the TAG list because it is relevant to both but responses should fall neatly into things to do with nomenclature (for the TCS list) and things to do with technology - for the TAG list. The bit about avowed serializations of RDF below are TAG relevant.
The move towards using LSIDs and the implied use of RDF for metadata has lead to the question: "Can we do TCS is RDF?". I have put together a package of files to encode the TaxonName part of TCS as an RDF vocabulary. It is not 100% complete but could form the basis of a solution.
You can download it here: http://biodiv.hyam.net/schemas/TCS_RDF/tcs_rdf_examples.zip
For the impatient you can see a summary of the vocabulary here: http://biodiv.hyam.net/schemas/TCS_RDF/TaxonNames.html
and an example xml document here: http://biodiv.hyam.net/schemas/TCS_RDF/instance.xml
It has actually been quite easy (though time consuming) to represent the semantics in the TCS XML Schema as RDF. Generally elements within the TaxonName element have become properties of the TaxonName class with some minor name changes. Several other classes were needed to represent NomenclaturalNotes and Typification events. The only difficult part was with Typification. A nomenclatural type is both a property of a name and, if it is a lectotype, a separate object that merely references a type and a name. The result is a compromise in an object that can be embedded as a property. I use instances for controlled vocabularies that may be controversial or may not.
What is lost in only using RDFS is control over validation. It is not possible to specify that certain combinations of properties are permissible and certain not. There are two approaches to adding more 'validation':
OWL Ontologies
An OWL ontology could be built that makes assertions about the items in the RDF ontology. It would be possible to use necessary and sufficient properties to assert that instances of TaxonName are valid members of an OWL class for BotanicalSubspeciesName for example. In fact far more control could be introduced in this way than is present in the current XML Schema. What is important to note is that any such OWL ontology could be separate from the common vocabulary suggested here. Different users could develop their own ontologies for their own purposes. This is a good thing as it is probably impossible to come up with a single, agreed ontology that handles the full complexity of the domain.
I would argue strongly that we should not build a single central ontology that summarizes all we know about nomenclature - we couldn't do it within my lifetime :)
Avowed Serializations
Because RDF can be serialized as XML it is possible for an XML document to both validate against an XML Schema AND be valid RDF. This may be a useful generic solution so I'll explain it here in an attempt to make it accessible to those not familiar with the technology.
The same RDF data can be serialized in XML in many ways and different code libraries will do it differently though all code libraries can read the serializations produced by others. It is possible to pick one of the ways of serializing a particular set of RDF data and design a XML Schema to validate the resulting structure. I am stuck for a way to describe this so I am going to use the term 'avowed serialization' (Avowed means 'openly declared') as opposed to 'arbitrary serialization'. This is the approach taken by the prismstandard.org http://www.prismstandard.orggroup for their standard and it gives a number of benefits as a bridging technology:
- Publishing applications that are not RDF aware (even simple scripts) can produce regular XML Schema validated XML documents that just happen to also be RDF compliant.
- Consuming applications can assume that all data is just RDF and not worry about the particular XML Schema used. These are the applications that are likely to have to merge different kinds of data from different suppliers so they benefit most from treating it like RDF.
- Because it is regular structured XML it can be transformed using XSLT into other document formats such as 'legacy' non-RDF compliant structures - if required.
There is one direction in which data would not flow without some effort. The same data published in an arbitrary serialization rather than the avowed one could be transformed, probably via several XSLT steps, into the avowed serialization and therefore made available to legacy applications as in point 3 above. This may or may not be worth the bother. Some of the code involved would be generic to all such transformations, so the effort may not be too great. It would certainly be possible for restricted data sets.
To demonstrate this, instance.xml is included in the package along with avowed.xsd and two supporting files. instance.xml will validate against avowed.xsd and parse correctly in the W3C RDF parser.
I have not provided XSLT to convert instance.xml to the standard TCS format, though I believe it could be done quite easily if required. Converting arbitrary documents from the current TCS to the structure represented in avowed.xsd would be more tricky, but it is feasible and certainly possible for the restricted uses of the schema that are typical of individual data suppliers.
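To give a feel for the approach (this is a simplified sketch, not the actual avowed.xsd), the avowed schema simply pins down one particular RDF/XML layout - fixed element names, nesting and order - so that ordinary XML Schema validation can be applied to a document that is also plain RDF:

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             targetNamespace="http://example.org/TaxonNames#"
             elementFormDefault="qualified">
    <!-- Sketch only: the rdf:about attribute and the rdf:RDF wrapper element
         would be declared in a small imported schema for the RDF namespace -->
    <xs:import namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               schemaLocation="rdf.xsd"/>
    <xs:element name="TaxonName">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="genusEpithet" type="xs:string" minOccurs="0"/>
          <xs:element name="specificEpithet" type="xs:string" minOccurs="0"/>
        </xs:sequence>
        <xs:attribute ref="rdf:about"/>
      </xs:complexType>
    </xs:element>
  </xs:schema>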
Contents
This is what the files in this package are:
README.txt = this file
TaxonNames.rdfs = an RDF vocabulary that represents the TCS TaxonNames object.
TaxonNames.html = documentation from TaxonNames.rdfs - much more readable.
instance.xml = an example of an XML document that is an RDF-compliant use of the vocabulary and also XML Schema compliant.
avowed.xsd = the XML Schema that instance.xml validates against.
dc.xsd = an XML Schema that is used by avowed.xsd.
taxonnames.xsd = an XML Schema that is used by avowed.xsd.
rdf2html.css = the style formatting for TaxonNames.html
rdfs2html.xsl = an XSLT style sheet to generate docs from TaxonNames.rdfs
tcs_1.01.xsd = the TCS XML Schema, for reference.
Needs for other Vocabularies
What is obvious looking at the vocabulary for TaxonNames here is that we need vocabularies for people, teams of people, literature and specimens as soon as possible.
Need for conventions
In order for all exchanged objects to be discoverable in a reasonable way we need to have conventions on the use of rdfs:label for Classes and Properties and dc:title for instances.
The namespaces used in these examples are fantasy as we have not finalized them yet.
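A sketch of the convention (again, the vocabulary namespace is fantasy):

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:tn="http://example.org/TaxonNames#">
    <!-- Classes and properties carry rdfs:label so generic tools can display them -->
    <rdfs:Class rdf:about="http://example.org/TaxonNames#TaxonName">
      <rdfs:label>Taxon Name</rdfs:label>
    </rdfs:Class>
    <!-- Instances carry dc:title for the same reason -->
    <tn:TaxonName rdf:about="urn:lsid:example.org:names:1234">
      <dc:title>Bellis perennis L.</dc:title>
    </tn:TaxonName>
  </rdf:RDF>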
Minor changes in TCS
There are a few points where I have intentionally not followed TCS 1.01 (there are probably others where it is accidental).
* basionym is a direct pointer to a TaxonName rather than a NomenclaturalNote. I couldn't see why it was a nomenclatural note in the 1.01 version, as it is a simple pointer to a name.
* The genus element has been renamed to the genusEpithet property. The contents of the element are not to be used alone and are not a genus name in themselves (uninomial should be used in that case), so genusEpithet is more appropriate - even if it is not common English usage.
* Addition of a referenceTo property. The vocabulary may be used to mark up an occurrence of a name that is not the publishing of a new name. In these cases the thing being marked up is actually a pointer to another object, either a TaxonName issued by a nomenclator or a TaxonConcept, so we need to have a reference field. Here is an example (assuming namespaces):

<TaxonName referenceTo="urn:lsid:example.com:myconcepts:1234"><genusEpithet>Bellis</genusEpithet><specificEpithet>perennis</specificEpithet></TaxonName>

This could appear in an XHTML document, for example.

Comments Please
All this amounts to a complex suggestion of how things could be done: we develop central vocabularies that go no further than RDFS but permit exchange and validation of data using avowed serializations and OWL ontologies.
What do you think?
Roger
--
Roger Hyam Technical Architect Taxonomic Databases Working Group
http://www.tdwg.org roger@tdwg.org
+44 1578 722782
Hi Roger,
Not a problem. Given the number of deployed services, I completely understand the intent of "avowed serializations" to bridge the gap between the current architecture and an RDF-based architecture. However, given the fact that we *cannot* achieve true backwards compatibility, should this really be a primary driving factor for the new architecture? At least in my mind, backwards compatibility would mean that existing software solutions would not require updating to function within the new architecture.
As you've noted, even with avowed serializations, producers and consumers would require updating to play nicely with the architecture. (Existing DiGIR and BioCase providers would require updating, etc.) If we must incur the expense of updating existing software to support avowed serializations, why not just update them to fully support RDF? I just don't feel that the utility of avowed serializations outweighs the cost to implement it, particularly if that cost could be redirected to upgrading existing services to fully support RDF.
I'm also a bit concerned about how we would handle schema interoperation/integration/extension with validation using XML Schema. These issues were two of the reasons RDF was appealing as a modeling language. At first glance it would seem to me that we would need one schema for every potential combination that people would be interested in, at least if validation against XML Schema is a requirement. This, to me at least, is in direct opposition to using RDF for the benefits of schema integration and extensibility.
Anyway, I'll have to think about this a bit more, Rob
Hi Rob,
More comments below.
Robert Gales wrote:
Hi Roger,
Not a problem. Given the amount of deployed services, I completely understand the intent of "avowed serializations" to bridge the gap between the current architecture and a RDF-based architecture. However, given the fact that we *cannot* achieve true backwards compatibility should this really be a primary driving factor for the new architecture? At least in my mind, backwards compatibility would mean that existing software solutions would not require updating to function within the new architecture.
I was thinking of it more as forward compatibility.
As you've noted, even with avowed serializations, producers and consumers would require updating to play nicely with the architecture. (Existing DiGIR and BioCase providers would require updating, etc.) If we must incur the expense of updating existing software to support avowed serializations, why not just update them to fully support RDF? I just don't feel that the utility of avowed serializations outweighs the cost to implement it, particularly if that cost could be redirected to upgrading existing services to fully support RDF.
This is a very good point. I imagine getting existing applications to return RDF/XML in response to existing queries would be fairly easy. It could be just a matter of returning another XML Schema-based document, as I have demonstrated. Upgrading them so that they can be queried as if they were triple stores (using SPARQL or our own system) is another matter altogether, and a place we may never reach.
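By "queried as if they were triple stores" I mean accepting something like the following (a sketch against an invented vocabulary) rather than a fixed set of search parameters:

  PREFIX tn: <http://example.org/TaxonNames#>
  SELECT ?name ?epithet
  WHERE {
    ?name a tn:TaxonName ;
          tn:genusEpithet "Bellis" ;
          tn:specificEpithet ?epithet .
  }

That implies a full triple-store back end, which is a much bigger change than just emitting RDF/XML.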
I'm also a bit concerned about how we would handle schema interoperation/integration/extension with validation using XML Schema. These issues were two of the reasons RDF was appealing as a modeling language. At first glance it would seem to me that we would need one schema for every potential combination that people would be interested in, at least if validation against XML Schema is a requirement. This, to me at least, is in direct opposition to using RDF for the benefits of schema integration and extensibility.
You are correct. That is why the future might be "RDF like" but it is not going to happen tomorrow. If someone is putting together an XML Schema today (and, thanks to Altova software, they probably are) then they should at least think about making the structure of that schema go node-arc-node-arc so that it is easy to map into RDF if need be - plus it gives a clear document design.
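By node-arc-node I mean the 'striped' layout in which elements alternate between things and their properties, which is exactly how RDF/XML lays a graph out on the page. A rough sketch, with invented element names and namespace declarations omitted:

  <tn:TaxonName>                        <!-- node: a thing -->
    <tn:publishedIn>                    <!-- arc: a property of it -->
      <pub:Publication>                 <!-- node: another thing -->
        <pub:title>Species Plantarum</pub:title>
      </pub:Publication>
    </tn:publishedIn>
  </tn:TaxonName>

A schema structured this way can be read off as triples almost mechanically; one that mixes levels cannot.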
Anyway, I'll have to think about this a bit more,
Keep on thinking. How about wrappers for BioCASE and DiGIR providers?
All the best,
Roger
participants (10)
- Bob Morris
- Donald Hobern
- Fini A. Alring
- Gregor Hagedorn
- Javier de la Torre
- Phillip C. Dibner
- Ricardo Scachetti Pereira
- Robert Gales
- Roger Hyam
- Steven Perry