Re: [Tdwg-tag] TCS in RDF for use in LSIDs and possible generic mechanism.

22 Mar 2006

Just thoughts/comments on the use of XML Schema for validating RDF 
documents.

I'm afraid that by using XML Schema to validate RDF documents, we would 
be creating unnecessary constraints on the system.  Some services may 
want to serve data in formats other than RDF/XML, such as N-Triples or 
Turtle, and neither of these can be validated by an XML Schema.  For 
example, I've been working on indexing large quantities of data 
represented as RDF using standard IR techniques.  N-Triples has distinct 
benefits over other representations because its grammar is trivial.  
Another benefit is that one can build a model by simple concatenation, 
without having to go through an in-memory model in an RDF library such 
as Jena.  For example, I can build a single large document containing 
N-Triples about millions of resources, with the index maintaining the 
file position and size of each resource indexed.  At query time I can 
simply use fast random access to the file, based on the position and 
size stored in the index, to read in the matching chunk of N-Triples and 
immediately start streaming the results across the wire.
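
A rough sketch of that indexing idea, not the actual indexer: the 
in-memory buffer stands in for a large file on disk, and the LSIDs, the 
property namespace and the helper names are all made up for 
illustration.

```python
import io

def append_resource(store, index, subject, triples):
    """Append one resource's N-Triples lines and record where they landed."""
    chunk = "".join(f"{s} {p} {o} .\n" for s, p, o in triples).encode("utf-8")
    offset = store.seek(0, io.SEEK_END)  # plain concatenation, no RDF model needed
    store.write(chunk)
    index[subject] = (offset, len(chunk))

def read_resource(store, index, subject):
    """Random access: seek to the stored offset and read exactly `size` bytes."""
    offset, size = index[subject]
    store.seek(offset)
    return store.read(size)

store = io.BytesIO()  # stands in for a large file opened in binary mode
index = {}            # subject -> (byte offset, byte length)

# Hypothetical subjects and properties, for illustration only.
append_resource(store, index, "<urn:lsid:example.com:names:1>", [
    ("<urn:lsid:example.com:names:1>", "<http://example.org/tn#genusEpithet>", '"Bellis"'),
    ("<urn:lsid:example.com:names:1>", "<http://example.org/tn#specificEpithet>", '"perennis"'),
])
append_resource(store, index, "<urn:lsid:example.com:names:2>", [
    ("<urn:lsid:example.com:names:2>", "<http://example.org/tn#genusEpithet>", '"Quercus"'),
])

# The bytes read back are already valid N-Triples and can be streamed as-is.
chunk = read_resource(store, index, "<urn:lsid:example.com:names:2>")
print(chunk.decode("utf-8"), end="")
```

Because every line of N-Triples is a complete statement, the chunk read 
back needs no surrounding document structure before it is sent out.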

With the additional constraint of using only RDF/XML as the output 
format, the above indexer example would need either to custom serialize 
N-Triples -> RDF/XML or to use a library to read the data into an 
in-memory model and serialize it as RDF/XML.

Another concern is that we will be reducing any serialization potential 
we have from standard libraries.  Jena, Redland, SemWeb, or any other 
library that can produce and consume RDF is not likely to produce 
RDF/XML in the same format.  Producers of RDF will not only be required 
to use RDF/XML as opposed to other formats such as N-Triples, but will 
also be required to write custom serialization code to translate the 
in-memory model of their chosen library into the structured RDF 
response that fits the XML Schema.  It seems to me that we would really 
be removing one of the technical benefits of using RDF.  Services and 
consumers really should not need to be concerned about the specific 
structure of the bits of RDF across the wire so long as it's valid RDF.

In my humble opinion, any constraints and validation should be either at 
the level of the ontology (OWL-Lite, OWL-DL, RDFS/OWL) or through a 
reasoner that can be packaged and distributed for use within any 
application that desires to utilize our products.

Cheers,
Rob

Roger Hyam wrote:
...
Hi Everyone,
I am cross posting this to the TCS list and the TAG list because it is 
relevant to both, but responses should fall neatly into things to do 
with nomenclature (for the TCS list) and things to do with technology 
(for the TAG list). The bit about avowed serializations of RDF below is 
TAG relevant.
The move towards using LSIDs and the implied use of RDF for metadata has 
led to the question: "Can we do TCS in RDF?". I have put together a 
package of files to encode the TaxonName part of TCS as an RDF 
vocabulary. It is not 100% complete but could form the basis of a solution.
You can download it here:
http://biodiv.hyam.net/schemas/TCS_RDF/tcs_rdf_examples.zip
For the impatient you can see a summary of the vocabulary here: 
http://biodiv.hyam.net/schemas/TCS_RDF/TaxonNames.html
and an example xml document here: 
http://biodiv.hyam.net/schemas/TCS_RDF/instance.xml
It has actually been quite easy (though time consuming) to represent the 
semantics in the TCS XML Schema as RDF. Generally elements within the 
TaxonName element have become properties of the TaxonName class with 
some minor name changes. Several other classes were needed to represent 
NomenclaturalNotes and Typification events. The only difficult part was 
with Typification. A nomenclatural type is both a property of a name 
and, if it is a lectotype, a separate object that merely references a 
type and a name. The result is a compromise in an object that can be 
embedded as a property. I use instances for controlled vocabularies that 
may be controversial or may not.
What is lost in only using RDFS is control over validation. It is not 
possible to specify that certain combinations of properties are 
permissible and certain not. There are two approaches to adding more 
'validation':
OWL Ontologies
An OWL ontology could be built that makes assertions about the items in 
the RDF ontology. It would be possible to use necessary and sufficient 
properties to assert that instances of TaxonName are valid members of an 
OWL class for BotanicalSubspeciesName for example. In fact far more 
control could be introduced in this way than is present in the current 
XML Schema. What is important to note is that any such OWL ontology 
could be separate from the common vocabulary suggested here. Different 
users could develop their own ontologies for their own purposes. This is 
a good thing as it is probably impossible to come up with a single, 
agreed ontology that handles the full complexity of the domain.
I would argue strongly that we should not build a single central 
ontology that summarizes all we know about nomenclature - we couldn't do 
it within my lifetime :)
Avowed Serializations
Because RDF can be serialized as XML it is possible for an XML document 
to both validate against an XML Schema AND be valid RDF.  This may be a 
useful generic solution so I'll explain it here in an attempt to make it 
accessible to those not familiar with the technology.
The same RDF data can be serialized in XML in many ways and different 
code libraries will do it differently though all code libraries can read 
the serializations produced by others. It is possible to pick one of the 
ways of serializing a particular set of RDF data and design an XML Schema 
to validate the resulting structure. I am stuck for a way to describe 
this so I am going to use the term 'avowed serialization' (Avowed means 
'openly declared') as opposed to 'arbitrary serialization'. This is the 
approach taken by the prismstandard.org 
<http://www.prismstandard.org> group for their standard and it gives a 
number of benefits as a bridging technology:
   1. Publishing applications that are not RDF aware (even simple
      scripts) can produce regular XML Schema validated XML documents
      that just happen to also be RDF compliant.
   2. Consuming applications can assume that all data is just RDF and
      not worry about the particular XML Schema used. These are the
      applications that are likely to have to merge different kinds of
      data from different suppliers so they benefit most from treating
      it like RDF.
   3. Because it is regular structured XML it can be transformed using
      XSLT into other document formats such as 'legacy' non-RDF
      compliant structures - if required.
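
As a rough illustration of point 1, here is a sketch of a script that 
knows nothing about RDF and simply writes fixed-shape XML with the 
Python standard library. The vocabulary namespace and the LSID are 
invented for the example, and the element layout is only a guess at 
what an avowed serialization might look like, not the structure 
defined in avowed.xsd.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TN_NS = "http://example.org/TaxonNames#"  # hypothetical vocabulary namespace

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("tn", TN_NS)

def taxon_name_document(lsid, genus, species):
    """Build one fixed XML shape; no RDF library or triple model involved."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # A typed node element identified by rdf:about ...
    name = ET.SubElement(root, f"{{{TN_NS}}}TaxonName",
                         {f"{{{RDF_NS}}}about": lsid})
    # ... with one child element per property, always in the same order.
    ET.SubElement(name, f"{{{TN_NS}}}genusEpithet").text = genus
    ET.SubElement(name, f"{{{TN_NS}}}specificEpithet").text = species
    return ET.tostring(root, encoding="unicode")

doc = taxon_name_document("urn:lsid:example.com:names:1234", "Bellis", "perennis")
print(doc)
```

The output always has the same element order and nesting, which is what 
an XML Schema could validate, while an RDF parser would read it as two 
statements about one resource typed as a TaxonName.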
There is one direction in which data would not flow without some effort. 
The same data published in an arbitrary serialization rather than the 
avowed one would have to be transformed, probably via several XSLT 
steps, into the avowed serialization before it could be made available 
to legacy applications as in point 3 above. This may or may not be worth 
the bother. Some of the code involved would be generic to all 
transformations, so the effort may not be too great. It would certainly 
be possible for restricted data sets.
To demonstrate this instance.xml is included in the package along with 
avowed.xsd and two supporting files. instance.xml will validate against 
avowed.xsd and parse correctly in the w3c RDF parser.
I have not provided XSLT to convert instance.xml to the TCS standard 
format though I believe it could be done quite easily if required. 
Converting arbitrary documents from the current TCS to the structure 
represented in avowed.xsd would be more tricky but feasible and 
certainly possible for restricted uses of the schema that are typical 
from individual data suppliers.
Contents
This is what the files in this package are:
README.txt = this file
TaxonNames.rdfs = An RDF vocabulary that represents the TCS TaxonNames object.
TaxonNames.html = Documentation from TaxonNames.rdfs - much more readable.
instance.xml = an example XML document that is both an RDF compliant use 
of the vocabulary and XML Schema compliant.
avowed.xsd = XML Schema that instance.xml validates against.
dc.xsd = XML Schema that is used by avowed.xsd.
taxonnames.xsd = XML Schema that is used by avowed.xsd.
rdf2html.css = the style formatting for TaxonNames.html
rdfs2html.xsl = XSLT style sheet to generate docs from TaxonNames.rdfs
tcs_1.01.xsd = the TCS XML Schema for reference.
Needs for other Vocabularies
What is obvious looking at the vocabulary for TaxonNames here is that we 
need vocabularies for people, teams of people, literature and specimens 
as soon as possible.
Need for conventions
In order for all exchanged objects to be discoverable in a reasonable 
way we need to have conventions on the use of rdfs:label for Classes and 
Properties and dc:title for instances.
The namespaces used in these examples are fantasy as we have not 
finalized them yet.
Minor changes in TCS
There are a few points where I have intentionally not followed TCS 1.01 
(there are probably others where it is accidental).
    * basionym is a direct pointer to a TaxonName rather than a
      NomenclaturalNote. I couldn't see why it was a nomenclatural note
      in the 1.01 version as it is a simple pointer to a name.
    * changed name of genus element to genusEpithet property. The
      contents of the element are not to be used alone and are not a
      genus name in themselves (uninomial should be used in this case)
      so genusEpithet is more appropriate - even if it is not common
      English usage.
    * Addition of referenceTo property. The vocabulary may be used to
      mark up an occurrence of a name that is not a publishing of a new
      name. In these cases the thing being marked up is actually a
      pointer to another object, either a TaxonName issued by a
      nomenclator or a TaxonConcept. In these cases we need to have a
      reference field. Here is an example (assuming namespace):
        <TaxonName referenceTo="urn:lsid:example.com:myconcepts:1234">
          <genusEpithet>Bellis</genusEpithet>
          <specificEpithet>perennis</specificEpithet>
        </TaxonName>
      This could possibly appear in an XHTML document, for example.
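
To illustrate how a consumer might pick such references out of an XHTML 
page, here is a small sketch using the Python standard library. The 
vocabulary namespace, the surrounding paragraph and the function name 
are assumptions made up for the example; only the TaxonName markup 
follows the snippet above.

```python
import xml.etree.ElementTree as ET

TN_NS = "http://example.org/TaxonNames#"  # hypothetical vocabulary namespace

# An invented XHTML paragraph with a marked-up name occurrence embedded in it.
xhtml = (
    '<p xmlns="http://www.w3.org/1999/xhtml">The common daisy, '
    f'<TaxonName xmlns="{TN_NS}" referenceTo="urn:lsid:example.com:myconcepts:1234">'
    "<genusEpithet>Bellis</genusEpithet> <specificEpithet>perennis</specificEpithet>"
    "</TaxonName>, flowers in spring.</p>"
)

def name_references(markup):
    """Return (referenceTo, name text) for each TaxonName element found."""
    root = ET.fromstring(markup)
    found = []
    for name in root.iter(f"{{{TN_NS}}}TaxonName"):
        # Join the text of the child elements and normalise whitespace.
        text = " ".join("".join(name.itertext()).split())
        found.append((name.get("referenceTo"), text))
    return found

references = name_references(xhtml)
print(references)
```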
Comments Please
All this amounts to a complex suggestion of how things could be done. 
i.e. we develop central vocabularies that go no further than RDFS but 
permit exchange and validation of data using avowed serializations and 
OWL ontologies.
What do you think?
Roger
--
-------------------------------------
 Roger Hyam
 Technical Architect
 Taxonomic Databases Working Group
-------------------------------------
 http://www.tdwg.org
 roger@tdwg.org
 +44 1578 722782
-------------------------------------

Robert Gales