[Tdwg-tag] TCS in RDF for use in LSIDs and possible generic mechanism.

Wed Mar 22 17:12:33 CET 2006

I'm pretty confused by this exchange, probably because I am so new to 
issues of RDF serialization.

I would have thought that the people who want XML Schema validation, or 
any other non-RDF validation, do so because they have a need to embed in 
LSID data or metadata something which only has a non-RDF serialization, 
and that their real need is to be able to validate the embedded data 
after its extraction and deserialization into pure XML.  Image pixels 
come to mind,  accompanied by an assertion that the representation has 
some specified encoding and when decoded represents a media file meeting 
some ISO standard. (As an aside, if there is a standard way of embedding 
binary data in RDF it would help our image LSID effort to have a pointer 
to it). SDD (which we should perhaps call "SDD-XML") is another, mainly 
because it is almost certainly as expressive as full RDF itself, and 
there are no RDF editing tools on the horizon which even SEEK asserts 
biologists will happily embrace, from which I infer that some such 
current XML-Schema TDWG projects will have a rather long life.  From 
experience in a project with NatureServe, I am guessing that 
Observations have similar issues.

Thinking about image data is probably the least controversial, albeit 
the simplest, case.

So if all the above is meaningful and correct, it seems to me that the 
issue may be one of settling on TDWG standards, consistent with current 
RDF best practices, about how to signal that some embedded stuff is 
actually in some "non-RDF" serialization, with enough standardized RDF 
to guide extraction, and that this is probably idiosyncratic to the 
kinds of embedded data. Isn't this basically the same issue as custom 
serializers/deserializers for SOAP? And hasn't it already been addressed 
by the RDF community?
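
On the aside about binary data: one option that RDF already allows is a 
typed literal whose datatype is xsd:base64Binary, so the triple itself says 
how to decode the payload. A minimal sketch of that idea follows. The LSID 
and the pixelData property URI are made up for illustration and are not 
TDWG terms.

```python
import base64

XSD_BASE64 = "http://www.w3.org/2001/XMLSchema#base64Binary"


def binary_triple(subject: str, predicate: str, payload: bytes) -> str:
    """Return one N-Triples line carrying payload as an xsd:base64Binary literal."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f'<{subject}> <{predicate}> "{encoded}"^^<{XSD_BASE64}> .'


def extract_payload(line: str) -> bytes:
    """Recover the original bytes from a line produced by binary_triple."""
    # The base64 alphabet contains no double quotes, so the lexical form
    # is simply the text between the first pair of quotes.
    lexical = line.split('"')[1]
    return base64.b64decode(lexical)


# Hypothetical identifiers, for illustration only.
SUBJECT = "urn:lsid:example.com:images:42"
PREDICATE = "http://example.org/terms#pixelData"

line = binary_triple(SUBJECT, PREDICATE, b"\x89PNG\r\n")
assert extract_payload(line) == b"\x89PNG\r\n"
```

The datatype URI is the signal that the literal holds encoded binary data. 
What the decoded bytes mean (which media format, which standard) would 
still need separate assertions.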

Or maybe I am so clueless that this all sounds like the rantings of a 
character from a Borges story. (Which is often what I feel like when 
contemplating RDF. The Library of Babel comes to mind, as does the story 
whose title I forget which is about two giants locked in mortal combat, 
except that each is dreaming the combat in some kind of shared dream and 
the one who wins in the dream gets to wake up.) :-) ...

Donald Hobern wrote:

>Rob,
>
>It may help to see this from the other side.  You are quite right that we
>would not want an RDF model to be constrained by the fact that some people
>wish to share data using XML Schema validation.  
>
>However, if the documents validated this way are a valid subset of the valid
>documents according to the RDF model, it may give these people a chance to
>make their data available for use in an RDF-enabled world.
>
>Consumers of RDF data can freely handle alternative representations such as
>N-Triple as well as the XML-encoded form.  Because such heterogeneity is
>always possible, it may not be a big deal if some of the providers are
>unaware that their documents are expressed in XML RDF, so long as they are
>valid as such.
>
>Does that make sense? 
>
>Donald
> 
>---------------------------------------------------------------
>Donald Hobern (dhobern at gbif.org)
>Programme Officer for Data Access and Database Interoperability 
>Global Biodiversity Information Facility Secretariat 
>Universitetsparken 15, DK-2100 Copenhagen, Denmark
>Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
>---------------------------------------------------------------
>
>
>-----Original Message-----
>From: Tdwg-tag-bounces at lists.tdwg.org
>[mailto:Tdwg-tag-bounces at lists.tdwg.org] On Behalf Of Robert Gales
>Sent: 22 March 2006 16:09
>To: roger at tdwg.org
>Cc: Tdwg-tag at lists.tdwg.org
>Subject: Re: [Tdwg-tag] TCS in RDF for use in LSIDs and possible generic
>mechanism.
>
>Just thoughts/comments on the use of XML Schema for validating RDF 
>documents.
>
>I'm afraid that by using XML Schema to validate RDF documents, we would 
>be creating unnecessary constraints on the system.  Some services may 
>want to serve data in formats other than RDF/XML, for example N-Triple 
>or Turtle for various reasons.  Neither of these would be able to be 
>validated by an XML Schema.  For example, I've been working on indexing 
>large quantities of data represented as RDF using standard IR 
>techniques.  N-Triple has distinct benefits over other representations 
>because its grammar is trivial.  Another benefit of N-Triple is that one 
>can use simple concatenation to build a model without being required to 
>use an in memory model through an RDF library such as Jena.  For 
>example, I can build a large single document containing N-Triples about 
>millions of resources.  The index maintains file position and size for 
>each resource indexed.  The benefit of using N-Triple is that upon 
>querying, I can simply use fast random access to the file based on the 
>position and size stored in the index to read in chunks of N-Triple 
>based on the size and immediately start streaming the results across the 
>wire.
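
The indexing scheme described above can be sketched roughly as follows. 
This is an illustration under assumptions, not Rob's actual indexer: the 
function names, the sample LSIDs and the example.org property URIs are all 
hypothetical, and the "index" is just an in-memory dictionary.

```python
import os
import tempfile


def build_store(path, resources):
    """Append each resource's N-Triples to one file; return {uri: (offset, size)}."""
    index = {}
    with open(path, "wb") as out:
        for uri, lines in resources.items():
            chunk = "".join(line + "\n" for line in lines).encode("utf-8")
            # Record where this resource's chunk starts and how long it is.
            index[uri] = (out.tell(), len(chunk))
            out.write(chunk)
    return index


def read_resource(path, index, uri):
    """Seek straight to a resource's chunk and return it without parsing."""
    offset, size = index[uri]
    with open(path, "rb") as src:
        src.seek(offset)
        return src.read(size).decode("utf-8")


# Hypothetical sample data.
RESOURCES = {
    "urn:lsid:example.com:names:1": [
        '<urn:lsid:example.com:names:1> <http://example.org/tcs#genusEpithet> "Bellis" .',
        '<urn:lsid:example.com:names:1> <http://example.org/tcs#specificEpithet> "perennis" .',
    ],
    "urn:lsid:example.com:names:2": [
        '<urn:lsid:example.com:names:2> <http://example.org/tcs#genusEpithet> "Quercus" .',
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "store.nt")
    idx = build_store(store, RESOURCES)
    chunk = read_resource(store, idx, "urn:lsid:example.com:names:2")
    assert chunk.count("\n") == 1
```

Because every N-Triples line is a complete statement, any chunk read this 
way is itself a valid N-Triples document and can be streamed as-is. The 
same trick would not work for RDF/XML, where a fragment needs its 
enclosing root element and namespace declarations.
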
>
>With the additional constraint of using only RDF/XML as the output 
>format, the above indexer example would either need to custom serialize 
>N-Triple -> RDF/XML or use a library to read it into an in-memory model 
>to serialize it as RDF/XML.
>
>Another concern is that we will be reducing any serialization potential 
>we have from standard libraries.  Jena, Redland, SemWeb, or any other 
>library that can produce and consume RDF is not likely to produce 
>RDF/XML in the same format.  Producers of RDF now will not only be 
>required to use RDF/XML as opposed to other formats such as N-Triple, 
>but will be required to write custom serialization code to translate the 
>in-memory model for the library of their choice into the structured RDF 
>response that fits the XML Schema.  It seems to me, we are really 
>removing one of the technical benefits of using RDF.  Services and 
>consumers really should not need to be concerned about the specific 
>structure of the bits of RDF across the wire so long as it's valid RDF.
>
>In my humble opinion, any constraints and validation should be either at 
>the level of the ontology (OWL-Lite, OWL-DL, RDFS/OWL) or through a 
>reasoner that can be packaged and distributed for use within any 
>application that desires to utilize our products.
>
>Cheers,
>Rob
>
>Roger Hyam wrote:
>  
>
>>Hi Everyone,
>>
>>I am cross posting this to the TCS list and the TAG list because it is 
>>relevant to both but responses should fall neatly into things to do with 
>>nomenclature (for the TCS list) and things to do with technology - for 
>>the TAG list. The bit about avowed serializations of RDF below is TAG 
>>relevant.
>>
>>The move towards using LSIDs and the implied use of RDF for metadata has 
>>led to the question: "Can we do TCS in RDF?". I have put together a 
>>package of files to encode the TaxonName part of TCS as an RDF 
>>vocabulary. It is not 100% complete but could form the basis of a 
>>solution.
>>
>>You can download it here:
>>http://biodiv.hyam.net/schemas/TCS_RDF/tcs_rdf_examples.zip
>>
>>For the impatient you can see a summary of the vocabulary here: 
>>http://biodiv.hyam.net/schemas/TCS_RDF/TaxonNames.html
>>
>>and an example xml document here: 
>>http://biodiv.hyam.net/schemas/TCS_RDF/instance.xml
>>
>>It has actually been quite easy (though time consuming) to represent the 
>>semantics in the TCS XML Schema as RDF. Generally elements within the 
>>TaxonName element have become properties of the TaxonName class with 
>>some minor name changes. Several other classes were needed to represent 
>>NomenclaturalNotes and Typification events. The only difficult part was 
>>with Typification. A nomenclatural type is both a property of a name 
>>and, if it is a lectotype, a separate object that merely references a 
>>type and a name. The result is a compromise in an object that can be 
>>embedded as a property. I use instances for controlled vocabularies, which 
>>may or may not be controversial.
>>
>>What is lost in only using RDFS is control over validation. It is not 
>>possible to specify that certain combinations of properties are 
>>permissible and certain not. There are two approaches to adding more 
>>'validation':
>>
>>
>>      OWL Ontologies
>>
>>An OWL ontology could be built that makes assertions about the items in 
>>the RDF ontology. It would be possible to use necessary and sufficient 
>>properties to assert that instances of TaxonName are valid members of an 
>>OWL class for BotanicalSubspeciesName for example. In fact far more 
>>control could be introduced in this way than is present in the current 
>>XML Schema. What is important to note is that any such OWL ontology 
>>could be separate from the common vocabulary suggested here. Different 
>>users could develop their own ontologies for their own purposes. This is 
>>a good thing as it is probably impossible to come up with a single, 
>>agreed ontology that handles the full complexity of the domain.
>>
>>I would argue strongly that we should not build a single central 
>>ontology that summarizes all we know about nomenclature - we couldn't do 
>>it within my lifetime :)
>>
>>
>>      Avowed Serializations
>>
>>Because RDF can be serialized as XML it is possible for an XML document 
>>to both validate against an XML Schema AND be valid RDF.  This may be a 
>>useful generic solution so I'll explain it here in an attempt to make it 
>>accessible to those not familiar with the technology.
>>
>>The same RDF data can be serialized in XML in many ways and different 
>>code libraries will do it differently though all code libraries can read 
>>the serializations produced by others. It is possible to pick one of the 
>>ways of serializing a particular set of RDF data and design a XML Schema 
>>to validate the resulting structure. I am stuck for a way to describe 
>>this so I am going to use the term 'avowed serialization' (Avowed means 
>>'openly declared') as opposed to 'arbitrary serialization'. This is the 
>>approach taken by the prismstandard.org 
>><http://www.prismstandard.org> group for their standard and it gives a 
>>number of benefits as a bridging technology:
>>
>>   1. Publishing applications that are not RDF aware (even simple
>>      scripts) can produce regular XML Schema validated XML documents
>>      that just happen to also be RDF compliant.
>>   2. Consuming applications can assume that all data is just RDF and
>>      not worry about the particular XML Schema used. These are the
>>      applications that are likely to have to merge different kinds of
>>      data from different suppliers so they benefit most from treating
>>      it like RDF.
>>   3. Because it is regular structured XML it can be transformed using
>>      XSLT into other document formats such as 'legacy' non-RDF
>>      compliant structures - if required.
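
Benefit 1 in the list above, a simple script with no RDF library emitting 
XML in one fixed shape, might look something like this. It is only a 
sketch: the tcs namespace URI is a placeholder (the message notes the real 
namespaces were not finalized), the LSID is invented, and the element 
layout is an assumption rather than the structure defined in avowed.xsd.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TCS_NS = "http://example.org/tcs#"  # hypothetical placeholder namespace

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("tn", TCS_NS)


def taxon_name_document(lsid, genus, species):
    """Build rdf:RDF > TaxonName > epithets, always in this one fixed shape."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    name = ET.SubElement(
        root, f"{{{TCS_NS}}}TaxonName", {f"{{{RDF_NS}}}about": lsid}
    )
    ET.SubElement(name, f"{{{TCS_NS}}}genusEpithet").text = genus
    ET.SubElement(name, f"{{{TCS_NS}}}specificEpithet").text = species
    return ET.tostring(root, encoding="unicode")


doc = taxon_name_document("urn:lsid:example.com:names:1", "Bellis", "perennis")
print(doc)
```

The script only knows about elements and attributes, yet the output uses 
the ordinary RDF/XML pattern of a typed node with rdf:about and property 
elements, so an RDF parser should read it as triples while an XML Schema 
can pin down the element order.
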
>>
>>There is one direction in which data would not flow without some effort. 
>>The same data published in an arbitrary serialization rather than the 
>>avowed one could be transformed, probably via several XSLT steps, into the 
>>avowed serialization and therefore made available to legacy applications 
>>using 3 above. This may not be worth the bother or may be useful. Some 
>>of the code involved would be generic to all transformations, so the 
>>effort may not be too great. It would certainly be possible for 
>>restricted data sets.
>>
>>To demonstrate this instance.xml is included in the package along with 
>>avowed.xsd and two supporting files. instance.xml will validate against 
>>avowed.xsd and parse correctly in the w3c RDF parser.
>>
>>I have not provided XSLT to convert instance.xml to the TCS standard 
>>format though I believe it could be done quite easily if required. 
>>Converting arbitrary documents from the current TCS to the structure 
>>represented in avowed.xsd would be more tricky but feasible and 
>>certainly possible for restricted uses of the schema that are typical 
>>from individual data suppliers.
>>
>>
>>      Contents
>>
>>This is what the files in this package are:
>>
>>README.txt = this file
>>TaxonNames.rdfs = An RDF vocabulary that represents TCS TaxonNames object.
>>TaxonNames.html = Documentation from TaxonNames.rdfs - much more readable.
>>instance.xml = an example XML document that is both an RDF-compliant use 
>>of the vocabulary and XML Schema compliant.
>>avowed.xsd = XML Schema that instance.xml validates against.
>>dc.xsd = XML Schema that is used by avowed.xsd.
>>taxonnames.xsd = XML Schema that is used by avowed.xsd.
>>rdf2html.css = the style formatting for TaxonNames.html
>>rdfs2html.xsl = XSLT style sheet to generate docs from TaxonNames.rdfs
>>tcs_1.01.xsd = the TCS XML Schema for reference.
>>
>>
>>      Needs for other Vocabularies
>>
>>What is obvious looking at the vocabulary for TaxonNames here is that we 
>>need vocabularies for people, teams of people, literature and specimens 
>>as soon as possible.
>>
>>
>>      Need for conventions
>>
>>In order for all exchanged objects to be discoverable in a reasonable 
>>way we need to have conventions on the use of rdfs:label for Classes and 
>>Properties and dc:title for instances.
>>
>>The namespaces used in these examples are fantasy as we have not 
>>finalized them yet.
>>
>>
>>      Minor changes in TCS
>>
>>There are a few points where I have intentionally not followed TCS 1.01 
>>(there are probably others where it is accidental).
>>
>>    * basionym is a direct pointer to a TaxonName rather than a
>>      NomenclaturalNote. I couldn't see why it was a nomenclatural note
>>      in the 1.01 version as it is a simple pointer to a name.
>>    * changed name of genus element to genusEpithet  property. The
>>      contents of the element are not to be used alone and are not a
>>      genus name in themselves (uninomial should be used in this case)
>>      so genusEpithet is more appropriate - even if it is not common
>>      English usage.
>>    * Addition of referenceTo property. The vocabulary may be used to
>>      mark up an occurrence of a name that is not a publishing of a new
>>      name. In these cases the thing being marked up is actually a
>>      pointer to another object, either a TaxonName issued by a
>>      nomenclator or a TaxonConcept. In these cases we need to have a
>>      reference field. Here is an example (assuming namespace)
>>      <TaxonName referenceTo="urn:lsid:example.com:myconcepts:1234">
>>        <genusEpithet>Bellis</genusEpithet>
>>        <specificEpithet>perennis</specificEpithet>
>>      </TaxonName>
>>      This could possibly appear in an XHTML document for example.
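
A consumer of such a marked-up name occurrence could resolve the pointer 
and rebuild the display name with a few lines of plain XML handling. The 
sketch below assumes, as the example does, that no namespace prefix is 
needed; the function name is invented for illustration.

```python
import xml.etree.ElementTree as ET

# The example from the message, with no namespace declared (an assumption).
SNIPPET = (
    '<TaxonName referenceTo="urn:lsid:example.com:myconcepts:1234">'
    "<genusEpithet>Bellis</genusEpithet>"
    "<specificEpithet>perennis</specificEpithet>"
    "</TaxonName>"
)


def read_name_occurrence(xml_text):
    """Return (referenced LSID, display name) for one TaxonName element."""
    element = ET.fromstring(xml_text)
    target = element.get("referenceTo")
    parts = [
        element.findtext("genusEpithet"),
        element.findtext("specificEpithet"),
    ]
    # Skip epithets that are absent rather than printing "None".
    return target, " ".join(p for p in parts if p)


target, label = read_name_occurrence(SNIPPET)
assert target == "urn:lsid:example.com:myconcepts:1234"
assert label == "Bellis perennis"
```
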
>>
>>
>>      Comments Please
>>
>>All this amounts to a complex suggestion of how things could be done. 
>>i.e. we develop central vocabularies that go no further than RDFS but 
>>permit exchange and validation of data using avowed serializations and 
>>OWL ontologies.
>>
>>What do you think?
>>
>>Roger
>>
>>
>>-- 
>>
>>-------------------------------------
>> Roger Hyam
>> Technical Architect
>> Taxonomic Databases Working Group
>>-------------------------------------
>> http://www.tdwg.org
>> roger at tdwg.org
>> +44 1578 722782
>>-------------------------------------
>>
>>
>>------------------------------------------------------------------------
>>
>>_______________________________________________
>>Tdwg-tag mailing list
>>Tdwg-tag at lists.tdwg.org
>>http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org

-- 
Robert A. Morris
Professor of Computer Science
UMASS-Boston
ram at cs.umb.edu
http://www.cs.umb.edu/efg
http://www.cs.umb.edu/~ram
phone (+1)617 287 6466