[Tdwg-tag] TCS in RDF for use in LSIDs and possible generic mechanism.

Wed Mar 22 17:51:53 CET 2006

Hi Bob,

Comments below:

Bob Morris wrote:
> I'm pretty confused by this exchange, probably because I am so new to 
> issues of RDF serialization.
>
> I would have thought that the people who want XML Schema validation, 
> or any other non-RDF validation, do so because they have a need to 
> embed in LSID data or metadata something which only has a non-RDF 
> serialization, and that their real need is to be able to validate the 
> embedded data after its extraction and deserialization into pure XML.  
> Image pixels come to mind,  accompanied by an assertion that the 
> representation has some specified encoding and when decoded represents 
> a media file meeting some ISO standard. (As an aside, if there is a 
> standard way of embedding binary data in RDF it would help our image 
> LSID effort to have a pointer to it). SDD (which we should perhaps 
> call "SDD-XML") is another, mainly because it is almost certainly as 
> expressive as full RDF itself, and there are no RDF editing tools on 
> the horizon which even SEEK asserts biologists will happily embrace, 
> from which I infer that some such current XML-Schema TDWG projects 
> will have a rather long life.  From experience in a project with 
> NatureServe, I am guessing that Observations have similar issues.
>
Can you point me in the direction of a tool that will generate 
interfaces for data editing from an arbitrary XML Schema and that works 
out of the box? I presume that, as you bemoan the lack of them for RDF, 
they exist for XML Schema based documents.

I am not talking about XML editors (Spy and friends) here; I am talking 
about pukka data editors for regular biologists to use. I gave up trying 
to find these tools a while back, but they may now have reached maturity. 
There was all this promise that one would be able to define a document 
structure in XML Schema and distribute it to clients, and they would 
just see groovy forms to fill in and manage data in a database 
somewhere. No more slog designing user interfaces, just one generic 
tool. I got excited about XForms but soon got over that. Is there a tool 
I can download and use with SDD, TCS and ABCD?

If I could get my hands on such a tool I could try it with one of my 
'avowed' serialization schemas and then demonstrate that I am editing 
RDF with it - but that really would be confusing. Where are the generic 
XML tools you imply?

> Thinking about image data is probably the least controversial, albeit 
> the simplest, case.
>
> So if all the above is meaningful and correct, it seems to me that the 
> issue may be one of settling on TDWG standards, consistent with 
> current RDF best practices, about how to signal that some embedded 
> stuff is actually in some "non-RDF" serialization, with enough 
> standardized RDF to guide extraction, and that this is probably 
> idiosyncratic to the kinds of embedded data. Isn't this basically the 
> same issue as custom serializers/deserializers for SOAP? And it hasn't 
> already been addressed by the RDF community???
>
> Or maybe I am so clueless that this all sounds like the rantings of a 
> character from a Borges story. (Which is often what I feel like when 
> contemplating RDF. The Library of Babel comes to mind, as does the 
> story whose title I forget which is about two giants locked in mortal 
> combat, except that each is dreaming the combat in some kind of shared 
> dream and the one who wins in the dream gets to wake up.) :-) ...
>
>
>
> Donald Hobern wrote:
>
>> Rob,
>>
>> It may help to see this from the other side.  You are quite right that
>> we would not want an RDF model to be constrained by the fact that some
>> people wish to share data using XML Schema validation.  However, if the
>> documents validated this way are a valid subset of the valid documents
>> according to the RDF model, it may give these people a chance to make
>> their data available for use in an RDF-enabled world.
>>
>> Consumers of RDF data can freely handle alternative representations
>> such as N-Triple as well as the XML-encoded form.  Because such
>> heterogeneity is always possible, it may not be a big deal if some of
>> the providers are unaware that their documents are expressed in XML
>> RDF, so long as they are valid as such.
>>
>> Does that make sense?
>> Donald
>>
>> ---------------------------------------------------------------
>> Donald Hobern (dhobern at gbif.org)
>> Programme Officer for Data Access and Database Interoperability 
>> Global Biodiversity Information Facility Secretariat 
>> Universitetsparken 15, DK-2100 Copenhagen, Denmark
>> Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
>> ---------------------------------------------------------------
>>
>>
>> -----Original Message-----
>> From: Tdwg-tag-bounces at lists.tdwg.org
>> [mailto:Tdwg-tag-bounces at lists.tdwg.org] On Behalf Of Robert Gales
>> Sent: 22 March 2006 16:09
>> To: roger at tdwg.org
>> Cc: Tdwg-tag at lists.tdwg.org
>> Subject: Re: [Tdwg-tag] TCS in RDF for use in LSIDs and possible generic
>> mechanism.
>>
>> Just thoughts/comments on the use of XML Schema for validating RDF 
>> documents.
>>
>> I'm afraid that by using XML Schema to validate RDF documents, we 
>> would be creating unnecessary constraints on the system.  Some 
>> services may want to serve data in formats other than RDF/XML, for 
>> example N-Triple or Turtle for various reasons.  Neither of these 
>> would be able to be validated by an XML Schema.  For example, I've 
>> been working on indexing large quantities of data represented as RDF 
>> using standard IR techniques.  N-Triple has distinct benefits over 
>> other representations because its grammar is trivial.  Another 
>> benefit of N-Triple is that one can use simple concatenation to build 
>> a model without being required to use an in memory model through an 
>> RDF library such as Jena.  For example, I can build a large single 
>> document containing N-Triples about millions of resources.  The index 
>> maintains file position and size for each resource indexed.  The 
>> benefit of using N-Triple is that upon querying, I can simply use 
>> fast random access to the file based on the position and size stored 
>> in the index to read in chunks of N-Triple based on the size and 
>> immediately start streaming the results across the wire.
>>
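
The offset index Rob describes can be sketched roughly as follows. This is a minimal illustration, not his actual indexer: it assumes all triples for a given subject sit next to each other in the file, and the function names, the `example.org` property URIs and the LSIDs are made up for the example.

```python
import io
from typing import BinaryIO, Dict, Tuple


def build_index(f: BinaryIO) -> Dict[str, Tuple[int, int]]:
    """Map each subject to (byte offset, byte size) of its block of triples.

    Assumption: every triple for a subject is adjacent in the file, so one
    (offset, size) pair is enough to recover the whole resource.
    """
    index: Dict[str, Tuple[int, int]] = {}
    f.seek(0)
    offset = 0
    for line in f:  # binary iteration yields each line including its newline
        stripped = line.strip()
        if stripped and not stripped.startswith(b"#"):
            # The subject is the first whitespace-delimited token of a triple.
            subject = stripped.split(None, 1)[0].decode("utf-8")
            start = index.get(subject, (offset, 0))[0]
            index[subject] = (start, offset + len(line) - start)
        offset += len(line)
    return index


def read_resource(f: BinaryIO, index: Dict[str, Tuple[int, int]], subject: str) -> bytes:
    """Seek straight to the stored position and read exactly `size` bytes."""
    start, size = index[subject]
    f.seek(start)
    return f.read(size)


# Hypothetical data: two resources concatenated into one N-Triples document.
data = (
    b'<urn:lsid:example.com:names:1> <http://example.org/tn#genusEpithet> "Bellis" .\n'
    b'<urn:lsid:example.com:names:1> <http://example.org/tn#specificEpithet> "perennis" .\n'
    b'<urn:lsid:example.com:names:2> <http://example.org/tn#genusEpithet> "Quercus" .\n'
)
store = io.BytesIO(data)
index = build_index(store)
chunk = read_resource(store, index, "<urn:lsid:example.com:names:2>")
print(chunk.decode("utf-8"), end="")
```

Because the chunk is already valid N-Triples, it can be written to the response as-is; no in-memory RDF model is needed at query time.
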
>> With the additional constraint of using only RDF/XML as the output 
>> format, the above indexer example would either need to custom 
>> serialize N-Triple -> RDF/XML or use a library to read it into an 
>> in-memory model to serialize it as RDF/XML.
>>
>> Another concern is that we will be reducing any serialization 
>> potential we have from standard libraries.  Jena, Redland, SemWeb, or 
>> any other library that can produce and consume RDF is not likely to 
>> produce RDF/XML in the same format.  Producers of RDF now will not 
>> only be required to use RDF/XML as opposed to other formats such as 
>> N-Triple, but will be required to write custom serialization code to 
>> translate the in-memory model for the library of their choice into 
>> the structured RDF response that fits the XML Schema.  It seems to 
>> me, we are really removing one of the technical benefits of using 
>> RDF.  Services and consumers really should not need to be concerned 
>> about the specific structure of the bits of RDF across the wire so 
>> long as it's valid RDF.
>>
>> In my humble opinion, any constraints and validation should be either 
>> at the level of the ontology (OWL-Lite, OWL-DL, RDFS/OWL) or through 
>> a reasoner that can be packaged and distributed for use within any 
>> application that desires to utilize our products.
>>
>> Cheers,
>> Rob
>>
>> Roger Hyam wrote:
>>  
>>
>>> Hi Everyone,
>>>
>>> I am cross posting this to the TCS list and the TAG list because it 
>>> is relevant to both but responses should fall neatly into things to 
>>> do with nomenclature (for the TCS list) and things to do with 
>>> technology - for the TAG list. The bit about avowed serializations 
>>> of RDF below is TAG relevant.
>>>
>>> The move towards using LSIDs and the implied use of RDF for metadata 
>>> has led to the question: "Can we do TCS in RDF?". I have put 
>>> together a package of files to encode the TaxonName part of TCS as 
>>> an RDF vocabulary. It is not 100% complete but could form the basis 
>>> of a solution.
>>>
>>> You can download it here:
>>> http://biodiv.hyam.net/schemas/TCS_RDF/tcs_rdf_examples.zip
>>>
>>> For the impatient you can see a summary of the vocabulary here: 
>>> http://biodiv.hyam.net/schemas/TCS_RDF/TaxonNames.html
>>>
>>> and an example xml document here: 
>>> http://biodiv.hyam.net/schemas/TCS_RDF/instance.xml
>>>
>>> It has actually been quite easy (though time consuming) to represent 
>>> the semantics in the TCS XML Schema as RDF. Generally elements 
>>> within the TaxonName element have become properties of the TaxonName 
>>> class with some minor name changes. Several other classes were 
>>> needed to represent NomenclaturalNotes and Typification events. The 
>>> only difficult part was with Typification. A nomenclatural type is 
>>> both a property of a name and, if it is a lectotype, a separate 
>>> object that merely references a type and a name. The result is a 
>>> compromise in an object that can be embedded as a property. I use 
>>> instances for controlled vocabularies that may be controversial or 
>>> may not.
>>>
>>> What is lost in only using RDFS is control over validation. It is 
>>> not possible to specify that certain combinations of properties are 
>>> permissible and certain not. There are two approaches to adding more 
>>> 'validation':
>>>
>>>
>>>      OWL Ontologies
>>>
>>> An OWL ontology could be built that makes assertions about the items 
>>> in the RDF ontology. It would be possible to use necessary and 
>>> sufficient properties to assert that instances of TaxonName are 
>>> valid members of an OWL class for BotanicalSubspeciesName for 
>>> example. In fact far more control could be introduced in this way 
>>> than is present in the current XML Schema. What is important to note 
>>> is that any such OWL ontology could be separate from the common 
>>> vocabulary suggested here. Different users could develop their own 
>>> ontologies for their own purposes. This is a good thing as it is 
>>> probably impossible to come up with a single, agreed ontology that 
>>> handles the full complexity of the domain.
>>>
>>> I would argue strongly that we should not build a single central 
>>> ontology that summarizes all we know about nomenclature - we 
>>> couldn't do it within my lifetime :)
>>>
>>>
>>>      Avowed Serializations
>>>
>>> Because RDF can be serialized as XML it is possible for an XML 
>>> document to both validate against an XML Schema AND be valid RDF.  
>>> This may be a useful generic solution so I'll explain it here in an 
>>> attempt to make it accessible to those not familiar with the 
>>> technology.
>>>
>>> The same RDF data can be serialized in XML in many ways and 
>>> different code libraries will do it differently though all code 
>>> libraries can read the serializations produced by others. It is 
>>> possible to pick one of the ways of serializing a particular set of 
>>> RDF data and design an XML Schema to validate the resulting 
>>> structure. I am stuck for a way to describe this so I am going to 
>>> use the term 'avowed serialization' (Avowed means 'openly declared') 
>>> as opposed to 'arbitrary serialization'. This is the approach taken 
>>> by the prismstandard.org <http://www.prismstandard.org> group for 
>>> their standard and it gives a number of benefits as a bridging 
>>> technology:
>>>
>>>   1. Publishing applications that are not RDF aware (even simple
>>>      scripts) can produce regular XML Schema validated XML documents
>>>      that just happen to also be RDF compliant.
>>>   2. Consuming applications can assume that all data is just RDF and
>>>      not worry about the particular XML Schema used. These are the
>>>      applications that are likely to have to merge different kinds of
>>>      data from different suppliers so they benefit most from treating
>>>      it like RDF.
>>>   3. Because it is regular structured XML it can be transformed using
>>>      XSLT into other document formats such as 'legacy' non-RDF
>>>      compliant structures - if required.
>>>
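
Point 1 in the list above can be illustrated with a short script that knows nothing about RDF and simply writes XML elements in a fixed order. This is a sketch only: the `http://example.org/TaxonNames#` namespace and the function name are placeholders (the real namespaces are not settled), and the element layout is an assumption, not the structure defined in avowed.xsd.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TN_NS = "http://example.org/TaxonNames#"  # placeholder namespace, not the real one

# Use readable prefixes in the output instead of ns0, ns1, ...
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("tn", TN_NS)


def taxon_name_document(lsid: str, genus: str, species: str) -> str:
    """Write one TaxonName as plain XML with a fixed element order.

    The fixed order is what an XML Schema could validate; the rdf:RDF
    wrapper and rdf:about attribute are what let an RDF parser read the
    same text as a typed node with two literal-valued properties.
    """
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    name = ET.SubElement(root, f"{{{TN_NS}}}TaxonName", {f"{{{RDF_NS}}}about": lsid})
    ET.SubElement(name, f"{{{TN_NS}}}genusEpithet").text = genus
    ET.SubElement(name, f"{{{TN_NS}}}specificEpithet").text = species
    return ET.tostring(root, encoding="unicode")


doc = taxon_name_document("urn:lsid:example.com:names:1234", "Bellis", "perennis")
print(doc)
```

The publishing script only ever deals with elements and attributes; whether a consumer treats the result as schema-validated XML or as RDF is the consumer's choice.
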
>>> There is one direction that data would not flow without some effort. 
>>> The same data published in an arbitrary serialization rather than 
>>> the avowed one could be transformed, probably via several XSLT 
>>> steps, into the avowed serialization and therefore made available to 
>>> legacy applications using 3 above. This may not be worth the bother 
>>> or may be useful. Some of the code involved would be generic to all 
>>> transformations, so the effort may not be too great. It would certainly be 
>>> possible for restricted data sets.
>>>
>>> To demonstrate this instance.xml is included in the package along 
>>> with avowed.xsd and two supporting files. instance.xml will validate 
>>> against avowed.xsd and parse correctly in the w3c RDF parser.
>>>
>>> I have not provided XSLT to convert instance.xml to the TCS standard 
>>> format though I believe it could be done quite easily if required. 
>>> Converting arbitrary documents from the current TCS to the structure 
>>> represented in avowed.xsd would be more tricky but feasible and 
>>> certainly possible for restricted uses of the schema that are 
>>> typical from individual data suppliers.
>>>
>>>
>>>      Contents
>>>
>>> This is what the files in this package are:
>>>
>>> README.txt = this file
>>> TaxonNames.rdfs = An RDF vocabulary that represents the TCS TaxonNames 
>>> object.
>>> TaxonNames.html = Documentation from TaxonNames.rdfs - much more 
>>> readable.
>>> instance.xml = an example of an XML document that is both an RDF 
>>> compliant use of the vocabulary and XML Schema compliant.
>>> avowed.xsd = XML Schema that instance.xml validates against.
>>> dc.xsd = XML Schema that is used by avowed.xsd.
>>> taxonnames.xsd = XML Schema that is used by avowed.xsd.
>>> rdf2html.css = the style formatting for TaxonNames.html
>>> rdfs2html.xsl = XSLT style sheet to generate docs from TaxonNames.rdfs
>>> tcs_1.01.xsd = the TCS XML Schema for reference.
>>>
>>>
>>>      Needs for other Vocabularies
>>>
>>> What is obvious looking at the vocabulary for TaxonNames here is 
>>> that we need vocabularies for people, teams of people, literature 
>>> and specimens as soon as possible.
>>>
>>>
>>>      Need for conventions
>>>
>>> In order for all exchanged objects to be discoverable in a 
>>> reasonable way we need to have conventions on the use of rdfs:label 
>>> for Classes and Properties and dc:title for instances.
>>>
>>> The namespaces used in these examples are fantasy as we have not 
>>> finalized them yet.
>>>
>>>
>>>      Minor changes in TCS
>>>
>>> There are a few points where I have intentionally not followed TCS 
>>> 1.01 (there are probably others where it is accidental).
>>>
>>>    * basionym is a direct pointer to a TaxonName rather than a
>>>      NomenclaturalNote. I couldn't see why it was a nomenclatural note
>>>      in the 1.01 version as it is a simple pointer to a name.
>>>    * changed name of genus element to genusEpithet  property. The
>>>      contents of the element are not to be used alone and are not a
>>>      genus name in themselves (uninomial should be used in this case)
>>>      so genusEpithet is more appropriate - even if it is not common
>>>      English usage.
>>>    * Addition of referenceTo property. The vocabulary may be used to
>>>      mark up an occurrence of a name that is not a publishing of a new
>>>      name. In these cases the thing being marked up is actually a
>>>      pointer to another object, either a TaxonName issued by a
>>>      nomenclator or a TaxonConcept. In these cases we need to have a
>>>      reference field. Here is an example (assuming namespace)
>>>      <TaxonName referenceTo="urn:lsid:example.com:myconcepts:1234">
>>>        <genusEpithet>Bellis</genusEpithet>
>>>        <specificEpithet>perennis</specificEpithet>
>>>      </TaxonName>
>>>      This could possibly appear in an XHTML document for example.
>>>
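
As a rough illustration of how a consumer might pick up such marked-up name occurrences, the sketch below scans a well-formed XHTML fragment for TaxonName elements and collects each name with its referenceTo target. The surrounding paragraph text and the function name are invented for the example, and namespaces are left out, as in the example markup.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Hypothetical XHTML fragment containing one marked-up occurrence of a name.
XHTML_FRAGMENT = (
    "<p>Daisies ("
    '<TaxonName referenceTo="urn:lsid:example.com:myconcepts:1234">'
    "<genusEpithet>Bellis</genusEpithet>"
    "<specificEpithet>perennis</specificEpithet>"
    "</TaxonName>"
    ") were common in the plot.</p>"
)


def find_name_references(markup: str) -> List[Tuple[str, Optional[str]]]:
    """Return (name string, referenceTo value) for every TaxonName element."""
    root = ET.fromstring(markup)
    found = []
    for tn in root.iter("TaxonName"):
        parts = [tn.findtext("genusEpithet", ""), tn.findtext("specificEpithet", "")]
        name = " ".join(p for p in parts if p)
        # referenceTo points at the TaxonName or TaxonConcept being cited.
        found.append((name, tn.get("referenceTo")))
    return found


references = find_name_references(XHTML_FRAGMENT)
print(references)
```
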
>>>
>>>      Comments Please
>>>
>>> All this amounts to a complex suggestion of how things could be 
>>> done, i.e. we develop central vocabularies that go no further than 
>>> RDFS but permit exchange and validation of data using avowed 
>>> serializations and OWL ontologies.
>>>
>>> What do you think?
>>>
>>> Roger
>>>
>>>
>>> -- 
>>>
>>> -------------------------------------
>>> Roger Hyam
>>> Technical Architect
>>> Taxonomic Databases Working Group
>>> -------------------------------------
>>> http://www.tdwg.org
>>> roger at tdwg.org
>>> +44 1578 722782
>>> -------------------------------------
>>>
>>>
>>> ------------------------------------------------------------------------ 
>>>
>>>
>>> _______________________________________________
>>> Tdwg-tag mailing list
>>> Tdwg-tag at lists.tdwg.org
>>> http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org
>>>   
>

-- 

-------------------------------------
 Roger Hyam
 Technical Architect
 Taxonomic Databases Working Group
-------------------------------------
 http://www.tdwg.org
 roger at tdwg.org
 +44 1578 722782
-------------------------------------