[Tdwg-tag] TCS in RDF for use in LSIDs and possible generic mechanism.

Thu Mar 23 09:05:04 CET 2006

Hi Rob,

More comments below.

Robert Gales wrote:
> Hi Roger,
>
> Not a problem.  Given the amount of deployed services, I completely 
> understand the intent of "avowed serializations" to bridge the gap 
> between the current architecture and a RDF-based architecture.  
> However, given the fact that we *cannot* achieve true backwards 
> compatibility should this really be a primary driving factor for the 
> new architecture?    At least in my mind, backwards compatibility 
> would mean that existing software solutions would not require updating 
> to function within the new architecture.
I was thinking of it more as forward compatibility.
> As you've noted, even with avowed serializations, producers and 
> consumers would require updating to play nicely with the architecture. 
> (Existing DiGIR and BioCase providers would require updating, etc.)  
> If we must incur the expense of updating existing software to support 
> avowed serializations, why not just update them to fully support RDF? 
>  I just don't feel that the utility of avowed serializations outweighs 
> the cost to implement it, particularly if that cost could be 
> redirected to upgrading existing services to fully support RDF.
>
This is a very good point. I imagine getting existing applications to 
return RDF/XML in response to existing queries would be fairly easy. It 
could be just a matter of returning another XML Schema based document as 
I have demonstrated. Upgrading them so that they can be queried as if 
they were triple stores (using SPARQL or our own system) is another 
matter altogether and a place we may never reach.
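
A minimal sketch of that first, easy step: an existing provider emitting RDF/XML from a flat record with nothing but an XML library. The vocabulary namespace, the LSID and the choice of fields are placeholders (the real namespaces are not finalized), and the function name is purely illustrative.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical vocabulary namespace; the real one has not been finalized.
TN_NS = "http://example.org/TaxonNames#"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("tn", TN_NS)


def record_to_rdfxml(lsid, fields):
    """Serialize one flat record as a fixed-shape RDF/XML document."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    name = ET.SubElement(
        root, f"{{{TN_NS}}}TaxonName", {f"{{{RDF_NS}}}about": lsid}
    )
    # Emit properties in a fixed order so the output shape stays
    # predictable enough to be described by an XML Schema.
    for prop in ("genusEpithet", "specificEpithet"):
        if prop in fields:
            ET.SubElement(name, f"{{{TN_NS}}}{prop}").text = fields[prop]
    return ET.tostring(root, encoding="unicode")


doc = record_to_rdfxml(
    "urn:lsid:example.com:names:1234",
    {"genusEpithet": "Bellis", "specificEpithet": "perennis"},
)
print(doc)
```

The provider never builds a triple model here; it only writes elements in a fixed order, which is the point of the approach.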
> I'm also a bit concerned about how we would handle schema 
> interoperation/integration/extension with validation using XML 
> Schema.  These issues were two of the reasons RDF was appealing as a 
> modeling language.  At first glance it would seem to me that we would 
> need one schema for every potential combination that people would be 
> interested in, at least if validation against XML schema is a 
> requirement.  This, to me at least, is in direct opposition to using 
> RDF for the benefits of schema integration and extensibility.
>
You are correct. That is why the future might be "RDF like" but it is 
not going to happen tomorrow. If someone is putting together an XML 
Schema today (and, thanks to Altova software, they probably are) then 
they should at least think about making the structure of that schema go 
node-arc-node-arc so that it is easy to map into RDF if need be - plus 
it gives clear document design.
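
To show why a node-arc-node-arc structure maps easily into RDF, here is a rough sketch that walks such a document and reads off triples. It only handles this strict alternating subset (typed nodes, rdf:about, rdf:resource, plain literals); it is not a general RDF/XML parser. The namespace and the typification and typeKind property names are invented for the example.

```python
import itertools
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TN_NS = "http://example.org/TaxonNames#"  # hypothetical namespace

# typification and typeKind are made-up property names for illustration.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:tn="{TN_NS}">
  <tn:TaxonName rdf:about="urn:lsid:example.com:names:1234">
    <tn:genusEpithet>Bellis</tn:genusEpithet>
    <tn:basionym rdf:resource="urn:lsid:example.com:names:99"/>
    <tn:typification>
      <tn:Typification>
        <tn:typeKind>lectotype</tn:typeKind>
      </tn:Typification>
    </tn:typification>
  </tn:TaxonName>
</rdf:RDF>"""


def to_uri(tag):
    """Turn ElementTree's '{namespace}local' tag into a plain URI."""
    namespace, local = tag[1:].split("}", 1)
    return namespace + local


def walk_node(node, triples, blank_ids):
    """Treat an element as a node; its children are arcs (properties)."""
    subject = node.get(f"{{{RDF_NS}}}about")
    if subject is None:
        subject = f"_:b{next(blank_ids)}"  # anonymous node
    triples.append((subject, RDF_NS + "type", to_uri(node.tag)))
    for arc in node:
        predicate = to_uri(arc.tag)
        resource = arc.get(f"{{{RDF_NS}}}resource")
        if resource is not None:
            triples.append((subject, predicate, resource))
        elif len(arc) > 0:
            # The arc contains a nested node: recurse, then link to it.
            nested = walk_node(arc[0], triples, blank_ids)
            triples.append((subject, predicate, nested))
        else:
            triples.append((subject, predicate, f'"{arc.text or ""}"'))
    return subject


def striped_xml_to_triples(xml_text):
    triples, blank_ids = [], itertools.count(1)
    for node in ET.fromstring(xml_text):
        walk_node(node, triples, blank_ids)
    return triples


triples = striped_xml_to_triples(SAMPLE)
for triple in triples:
    print(*triple)
```

Because elements strictly alternate between things and properties, the mapping needs no knowledge of the particular schema.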

> Anyway, I'll have to think about this a bit more,
Keep on thinking. How about wrappers to BioCASE and DiGIR providers?

All the best,

Roger

> Rob
>
> Roger Hyam wrote:
>> Hi Rob,
>>
>> Thanks for your contribution. My comments below:
>>
>> Robert Gales wrote:
>>> Just thoughts/comments on the use of XML Schema for validating RDF 
>>> documents.
>>>
>>> I'm afraid that by using XML Schema to validate RDF documents, we 
>>> would be creating unnecessary constraints on the system.  Some 
>>> services may want to serve data in formats other than RDF/XML, for 
>>> example N-Triple or Turtle for various reasons.  Neither of these 
>>> would be able to be validated by an XML Schema.  For example, I've 
>>> been working on indexing large quantities of data represented as RDF 
>>> using standard IR techniques.  N-Triple has distinct benefits over 
>>> other representations because its grammar is trivial.  Another 
>>> benefit of N-Triple is that one can use simple concatenation to 
>>> build a model without being required to use an in memory model 
>>> through an RDF library such as Jena.  For example, I can build a 
>>> large single document containing N-Triples about millions of 
>>> resources.  The index maintains file position and size for each 
>>> resource indexed.  The benefit of using N-Triple is that upon 
>>> querying, I can simply use fast random access to the file based on 
>>> the position and size stored in the index to read in chunks of 
>>> N-Triple based on the size and immediately start streaming the 
>>> results across the wire.
>>>
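
The offset-and-size scheme described above might look roughly like this. It is only a sketch: a plain dict stands in for the real IR index, a temporary file stands in for the large N-Triples document, and the identifiers and property URI are invented.

```python
import os
import tempfile


def append_resource(data_file, index, subject, ntriple_lines):
    """Append one resource's N-Triples and record (offset, size) for it."""
    chunk = "".join(line + "\n" for line in ntriple_lines).encode("utf-8")
    data_file.seek(0, os.SEEK_END)  # always append at the end of the file
    index[subject] = (data_file.tell(), len(chunk))
    data_file.write(chunk)


def read_resource(data_file, index, subject):
    """Return the stored N-Triples for a subject without parsing them."""
    offset, size = index[subject]
    data_file.seek(offset)
    return data_file.read(size).decode("utf-8")


# Hypothetical identifiers and property URI, for illustration only.
NAME_1 = "<urn:lsid:example.com:names:1>"
NAME_2 = "<urn:lsid:example.com:names:2>"
PROP = "<http://example.org/TaxonNames#genusEpithet>"

index = {}  # stands in for the real IR index
with tempfile.TemporaryFile() as store:  # binary read/write mode by default
    append_resource(store, index, NAME_1, [f'{NAME_1} {PROP} "Bellis" .'])
    append_resource(store, index, NAME_2, [f'{NAME_2} {PROP} "Quercus" .'])
    chunk = read_resource(store, index, NAME_2)

print(chunk, end="")
```

Since every N-Triples line is a complete statement, the chunk read back is itself a valid N-Triples document and can be streamed as-is.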
>> This sounds like really interesting work! And it illustrates why what 
>> I am proposing is useful. If you wanted to introduce data into your 
>> index from arbitrary data providers would you prefer:
>>
>>    1. RDF as N-Triple: I guess this would be your favorite but it is
>>       unlikely that all data sources would give it in this - though some
>>       might.
>>    2. RDF as XML: I guess this is second best as you can convert it to
>>       N-Triple then append it to your index. You don't care how the
>>       serialization into XML is done as a library can read it and
>>       convert it so long as it is valid.
>>    3. XML according to some arbitrary schema: This is what you will get
>>       today. This is a nightmare as you would have to work out a mapping
>>       from the 'semantics' that may be in the document structure or
>>       schema into RDF triples.
>>
>> What I am suggesting is that publishers who want to do 3 (which is a 
>> potential nightmare to consumers and indexers) could, by careful 
>> schema design, make themselves into 2 above - which makes them 
>> interoperable with an RDF world.
>>
>>> With the additional constraint of using only RDF/XML as the output 
>>> format, the above indexer example would either need to custom 
>>> serialize N-Triple -> RDF/XML or use a library to read it into an 
>>> in-memory model to serialize it as RDF/XML.
>>>
>> If you want to return stuff from your index as N-Triple then your 
>> customers are going to have to be able to handle it. If they can't 
>> handle it you won't get any customers. If you serialize just the 
>> query results as RDF/XML then it may be a lot easier for people to 
>> consume. Perhaps you could offer a choice. I would suggest that if 
>> your customers wanted to have a particular 'avowed' serialization of 
>> the RDF  then they should do it themselves but that it might be 
>> easier to do from XML than N-Triple.
>>> Another concern is that we will be reducing any serialization 
>>> potential we have from standard libraries.  Jena, Redland, SemWeb, 
>>> or any other library that can produce and consume RDF is not likely 
>>> to produce RDF/XML in the same format.  Producers of RDF now will 
>>> not only be required to use RDF/XML as opposed to other formats such 
>>> as N-Triple, but will be required to write custom serialization code 
>>> to translate the in-memory model for the library of their choice 
>>> into the structured RDF response that fits the XML Schema.  It seems 
>>> to me, we are really removing one of the technical benefits of using 
>>> RDF.  Services and consumers really should not need to be concerned 
>>> about the specific structure of the bits of RDF across the wire so 
>>> long as it's valid RDF.
>> I agree with you fully but we are starting from scratch here and we 
>> have to take everyone with us - and see who we can pick up along the 
>> way. I get a lot of messages from people (written, verbal and body 
>> language) saying that they reckon doing things in RDF is dangerous 
>> because "it will never work". They may be right or they may be 
>> sticking with what they are comfortable with. If I can say "OK, 
>> forget about RDF, just make your XML look like this (which happens to 
>> be valid RDF)" then everyone can come to the same party.
>>
>>    1. If you 'speak' RDF/XML then anyone on the network can understand
>>       you. You can do this with any old script.
>>    2. If you can understand RDF/XML you can listen to anyone on the
>>       network. You can do this with any old RDF parser...
>>    3. If you don't understand RDF/XML then you will have to put some
>>       effort in to understand everyone but you will be able to
>>       understand a subset of people who use 'avowed' serializations that
>>       you care about.
>>
>> What is important here is that if RDF really is a terrible thing then 
>> the consumers of data in category 3  will grow in number and nobody 
>> will bother with triples in a few years. On the other hand if RDF is 
>> so great then consumers in category 2 will die out. Darwin kind of 
>> had it right for these things. I hope that the use of 'avowed' 
>> serializations will just let nature take its course. I sure as hell 
>> don't want the responsibility :)
>>> In my humble opinion, any constraints and validation should be 
>>> either at the level of the ontology (OWL-Lite, OWL-DL, RDFS/OWL) or 
>>> through a reasoner that can be packaged and distributed for use 
>>> within any application that desires to utilize our products.
>>>
>> Yes. Ideally that is how it should be done. If we have a basic RDFS 
>> ontology for the shared objects then people can extend this for their 
>> own purposes with OWL ontologies. We will never get agreement on a 
>> complete OWL ontology for the whole domain for sociological as well 
>> as technical reasons.
>>
>> I think my mistake here is calling this a 'generic' solution. It is a 
>> bridging technology.
>>
>> Does this make sense?
>>
>> Roger
>>
>>
>>> Cheers,
>>> Rob
>>>
>>> Roger Hyam wrote:
>>>> Hi Everyone,
>>>>
>>>> I am cross posting this to the TCS list and the TAG list because it 
>>>> is relevant to both but responses should fall neatly into things to 
>>>> do with nomenclature (for the TCS list) and things to do with 
>>>> technology - for the TAG list. The bit about avowed serializations 
>>>> of RDF below is TAG relevant.
>>>>
>>>> The move towards using LSIDs and the implied use of RDF for 
>>>> metadata has led to the question: "Can we do TCS in RDF?". I have 
>>>> put together a package of files to encode the TaxonName part of TCS 
>>>> as an RDF vocabulary. It is not 100% complete but could form the 
>>>> basis of a solution.
>>>>
>>>> You can download it here:
>>>> http://biodiv.hyam.net/schemas/TCS_RDF/tcs_rdf_examples.zip
>>>>
>>>> For the impatient you can see a summary of the vocabulary here: 
>>>> http://biodiv.hyam.net/schemas/TCS_RDF/TaxonNames.html
>>>>
>>>> and an example xml document here: 
>>>> http://biodiv.hyam.net/schemas/TCS_RDF/instance.xml
>>>>
>>>> It has actually been quite easy (though time consuming) to 
>>>> represent the semantics in the TCS XML Schema as RDF. Generally 
>>>> elements within the TaxonName element have become properties of the 
>>>> TaxonName class with some minor name changes. Several other classes 
>>>> were needed to represent NomenclaturalNotes and Typification 
>>>> events. The only difficult part was with Typification. A 
>>>> nomenclatural type is both a property of a name and, if it is a 
>>>> lectotype, a separate object that merely references a type and a 
>>>> name. The result is a compromise in an object that can be embedded 
>>>> as a property. I use instances for controlled vocabularies that may 
>>>> be controversial or may not.
>>>>
>>>> What is lost in only using RDFS is control over validation. It is 
>>>> not possible to specify that certain combinations of properties are 
>>>> permissible and certain not. There are two approaches to adding 
>>>> more 'validation':
>>>>
>>>>
>>>>       OWL Ontologies
>>>>
>>>> An OWL ontology could be built that makes assertions about the 
>>>> items in the RDF ontology. It would be possible to use necessary 
>>>> and sufficient properties to assert that instances of TaxonName are 
>>>> valid members of an OWL class for BotanicalSubspeciesName for 
>>>> example. In fact far more control could be introduced in this way 
>>>> than is present in the current XML Schema. What is important to 
>>>> note is that any such OWL ontology could be separate from the 
>>>> common vocabulary suggested here. Different users could develop 
>>>> their own ontologies for their own purposes. This is a good thing 
>>>> as it is probably impossible to come up with a single, agreed 
>>>> ontology that handles the full complexity of the domain.
>>>>
>>>> I would argue strongly that we should not build a single central 
>>>> ontology that summarizes all we know about nomenclature - we 
>>>> couldn't do it within my lifetime :)
>>>>
>>>>
>>>>       Avowed Serializations
>>>>
>>>> Because RDF can be serialized as XML it is possible for an XML 
>>>> document to both validate against an XML Schema AND be valid RDF.  
>>>> This may be a useful generic solution so I'll explain it here in an 
>>>> attempt to make it accessible to those not familiar with the 
>>>> technology.
>>>>
>>>> The same RDF data can be serialized in XML in many ways and 
>>>> different code libraries will do it differently though all code 
>>>> libraries can read the serializations produced by others. It is 
>>>> possible to pick one of the ways of serializing a particular set of 
>>>> RDF data and design an XML Schema to validate the resulting 
>>>> structure. I am stuck for a way to describe this so I am going to 
>>>> use the term 'avowed serialization' (Avowed means 'openly 
>>>> declared') as opposed to 'arbitrary serialization'. This is the 
>>>> approach taken by the prismstandard.org 
>>>> <http://www.prismstandard.org> group for their standard and it gives 
>>>> a number of benefits as a bridging technology:
>>>>
>>>>    1. Publishing applications that are not RDF aware (even simple
>>>>       scripts) can produce regular XML Schema validated XML documents
>>>>       that just happen to also be RDF compliant.
>>>>    2. Consuming applications can assume that all data is just RDF and
>>>>       not worry about the particular XML Schema used. These are the
>>>>       applications that are likely to have to merge different kinds of
>>>>       data from different suppliers so they benefit most from treating
>>>>       it like RDF.
>>>>    3. Because it is regular structured XML it can be transformed using
>>>>       XSLT into other document formats such as 'legacy' non-RDF
>>>>       compliant structures - if required.
>>>>
>>>> There is one direction that data would not flow without some 
>>>> effort. The same data published in an arbitrary serialization 
>>>> rather than the avowed one could be transformed, probably via 
>>>> several XSLT steps, into the avowed serialization and therefore 
>>>> made available to legacy applications using 3 above. This may not 
>>>> be worth the bother or may be useful. Some of the code involved 
>>>> would be generic to all transformations, so the effort may not be 
>>>> too great. It would certainly be possible for restricted data sets.
>>>>
>>>> To demonstrate this instance.xml is included in the package along 
>>>> with avowed.xsd and two supporting files. instance.xml will 
>>>> validate against avowed.xsd and parse correctly in the W3C RDF parser.
>>>>
>>>> I have not provided XSLT to convert instance.xml to the TCS 
>>>> standard format though I believe it could be done quite easily if 
>>>> required. Converting arbitrary documents from the current TCS to 
>>>> the structure represented in avowed.xsd would be more tricky but 
>>>> feasible and certainly possible for restricted uses of the schema 
>>>> that are typical from individual data suppliers.
>>>>
>>>>
>>>>       Contents
>>>>
>>>> This is what the files in this package are:
>>>>
>>>> README.txt = this file
>>>> TaxonNames.rdfs = An RDF vocabulary that represents TCS TaxonNames 
>>>> object.
>>>> TaxonNames.html = Documentation from TaxonNames.rdfs - much more 
>>>> readable.
>>>> instance.xml = an example XML document that uses the vocabulary and 
>>>> is both RDF compliant and XML Schema compliant.
>>>> avowed.xsd = XML Schema that instance.xml validates against.
>>>> dc.xsd = XML Schema that is used by avowed.xsd.
>>>> taxonnames.xsd = XML Schema that is used by avowed.xsd.
>>>> rdf2html.css = the style formatting for TaxonNames.html
>>>> rdfs2html.xsl = XSLT style sheet to generate docs from TaxonNames.rdfs
>>>> tcs_1.01.xsd = the TCS XML Schema for reference.
>>>>
>>>>
>>>>       Needs for other Vocabularies
>>>>
>>>> What is obvious looking at the vocabulary for TaxonNames here is 
>>>> that we need vocabularies for people, teams of people, literature 
>>>> and specimens as soon as possible.
>>>>
>>>>
>>>>       Need for conventions
>>>>
>>>> In order for all exchanged objects to be discoverable in a 
>>>> reasonable way we need to have conventions on the use of rdfs:label 
>>>> for Classes and Properties and dc:title for instances.
>>>>
>>>> The namespaces used in these examples are fantasy as we have not 
>>>> finalized them yet.
>>>>
>>>>
>>>>       Minor changes in TCS
>>>>
>>>> There are a few points where I have intentionally not followed TCS 
>>>> 1.01 (there are probably others where it is accidental).
>>>>
>>>>     * basionym is a direct pointer to a TaxonName rather than a
>>>>       NomenclaturalNote. I couldn't see why it was a nomenclatural
>>>>       note in the 1.01 version as it is a simple pointer to a name.
>>>>     * Changed the name of the genus element to the genusEpithet property. The
>>>>       contents of the element are not to be used alone and are not a
>>>>       genus name in themselves (uninomial should be used in this case)
>>>>       so genusEpithet is more appropriate - even if it is not common
>>>>       English usage.
>>>>     * Addition of referenceTo property. The vocabulary may be used to
>>>>       mark up an occurrence of a name that is not a publishing of a
>>>>       new name. In these cases the thing being marked up is actually a
>>>>       pointer to another object, either a TaxonName issued by a
>>>>       nomenclator or a TaxonConcept. In these cases we need to have a
>>>>       reference field. Here is an example (assuming namespace):
>>>>
>>>>       <TaxonName referenceTo="urn:lsid:example.com:myconcepts:1234">
>>>>         <genusEpithet>Bellis</genusEpithet>
>>>>         <specificEpithet>perennis</specificEpithet>
>>>>       </TaxonName>
>>>>
>>>>       This could possibly appear in an XHTML document, for example.
>>>>
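
As a rough illustration of how a consumer might use referenceTo, the sketch below reads the example element and separates the displayed name from the object it points to. It assumes the element is not namespace-qualified, as in the example as written; with the real (not yet finalized) namespace the lookups would need qualified names. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

SNIPPET = (
    '<TaxonName referenceTo="urn:lsid:example.com:myconcepts:1234">'
    "<genusEpithet>Bellis</genusEpithet>"
    "<specificEpithet>perennis</specificEpithet>"
    "</TaxonName>"
)


def read_name_reference(xml_text):
    """Return the name as written plus the identifier it refers to."""
    elem = ET.fromstring(xml_text)
    parts = [elem.findtext("genusEpithet"), elem.findtext("specificEpithet")]
    return {
        # Join only the epithets that are actually present.
        "label": " ".join(part for part in parts if part),
        # None when the element is a name in its own right, not a pointer.
        "referenceTo": elem.get("referenceTo"),
    }


result = read_name_reference(SNIPPET)
print(result)
```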
>>>>
>>>>       Comments Please
>>>>
>>>> All this amounts to a complex suggestion of how things could be 
>>>> done. i.e. we develop central vocabularies that go no further than 
>>>> RDFS but permit exchange and validation of data using avowed 
>>>> serializations and OWL ontologies.
>>>>
>>>> What do you think?
>>>>
>>>> Roger
>>>>
>>>>
>>>> -- 
>>>>
>>>> -------------------------------------
>>>>  Roger Hyam
>>>>  Technical Architect
>>>>  Taxonomic Databases Working Group
>>>> -------------------------------------
>>>>  http://www.tdwg.org
>>>>  roger at tdwg.org
>>>>  +44 1578 722782
>>>> -------------------------------------
>>>>
>>>>
>>>
>>
>>
>>
>

-- 

-------------------------------------
 Roger Hyam
 Technical Architect
 Taxonomic Databases Working Group
-------------------------------------
 http://www.tdwg.org
 roger at tdwg.org
 +44 1578 722782
-------------------------------------