[Tdwg-tag] TCS in RDF for use in LSIDs and possible generic mechanism.
Roger Hyam
roger at tdwg.org
Wed Mar 22 17:51:53 CET 2006
Hi Bob,
Comments below:
Bob Morris wrote:
> I'm pretty confused by this exchange, probably because I am so new to
> issues of RDF serialization.
>
> I would have thought that the people who want XML Schema validation,
> or any other non-RDF validation, do so because they have a need to
> embed in LSID data or metadata something which only has a non-RDF
> serialization, and that their real need is to be able to validate the
> embedded data after its extraction and deserialization into pure XML.
> Image pixels come to mind, accompanied by an assertion that the
> representation has some specified encoding and when decoded represents
> a media file meeting some ISO standard. (As an aside, if there is a
> standard way of embedding binary data in RDF it would help our image
> LSID effort to have a pointer to it). SDD (which we should perhaps
> call "SDD-XML") is another, mainly because it is almost certainly as
> expressive as full RDF itself, and there are no RDF editing tools on
> the horizon which even SEEK asserts biologists will happily embrace,
> from which I infer that some such current XML-Schema TDWG projects
> will have a rather long life. From experience in a project with
> NatureServe, I am guessing that Observations have similar issues.
>
Can you point me in the direction of a tool that will generate
interfaces for data editing from an arbitrary XML Schema - and works out
of the box? I presume that, as you bemoan the lack of them for RDF, they
exist for XML Schema based documents.
I am not talking about XML editors (Spy and friends) here; I am talking
about pukka data editors for regular biologists to use. I gave up trying
to find these tools a while back, but they may now have reached maturity. There
was all this promise that one would be able to define a document
structure in XML Schema and distribute this to clients and they would
just see groovy forms to fill in and manage data in a database
somewhere. No more slog in designing user interfaces just one generic
tool. I got excited about XForms but soon got over that. Is there a tool
I can download and use with SDD, TCS and ABCD?
If I could get my hands on such a tool I could try it with one of my
'avowed' serialization schemas and then demonstrate that I am editing
RDF with it - but that really would be confusing. Where are the generic
XML tools you imply?
> Thinking about image data is probably the least controversial, albeit
> the simplest, case.
>
> So if all the above is meaningful and correct, it seems to me that the
> issue may be one of settling on TDWG standards, consistent with
> current RDF best practices, about how to signal that some embedded
> stuff is actually in some "non-RDF" serialization, with enough
> standardized RDF to guide extraction, and that this is probably
> idiosyncratic to the kinds of embedded data. Isn't this basically the
> same issue as custom serializers/deserializers for SOAP? And it hasn't
> already been addressed by the RDF community???
>
> Or maybe I am so clueless that this all sounds like the rantings of a
> character from a Borges story. (Which is often what I feel like when
> contemplating RDF. The Library of Babel comes to mind, as does the
> story whose title I forget which is about two giants locked in mortal
> combat, except that each is dreaming the combat in some kind of shared
> dream and the one who wins in the dream gets to wake up.) :-) ...
>
>
>
> Donald Hobern wrote:
>
>> Rob,
>>
>> It may help to see this from the other side. You are quite right
>> that we
>> would not want an RDF model to be constrained by the fact that some
>> people
>> wish to share data using XML Schema validation.
>> However, if the documents validated this way are a valid subset of
>> the valid
>> documents according to the RDF model, it may give these people a
>> chance to
>> make their data available for use in an RDF-enabled world.
>>
>> Consumers of RDF data can freely handle alternative representations
>> such as
>> N-Triple as well as the XML-encoded form. Because such heterogeneity is
>> always possible, it may not be a big deal if some of the providers are
>> unaware that their documents are expressed in RDF/XML, so long as
>> they are
>> valid as such.
>>
>> Does that make sense?
>> Donald
>>
>> ---------------------------------------------------------------
>> Donald Hobern (dhobern at gbif.org)
>> Programme Officer for Data Access and Database Interoperability
>> Global Biodiversity Information Facility Secretariat
>> Universitetsparken 15, DK-2100 Copenhagen, Denmark
>> Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
>> ---------------------------------------------------------------
>>
>>
>> -----Original Message-----
>> From: Tdwg-tag-bounces at lists.tdwg.org
>> [mailto:Tdwg-tag-bounces at lists.tdwg.org] On Behalf Of Robert Gales
>> Sent: 22 March 2006 16:09
>> To: roger at tdwg.org
>> Cc: Tdwg-tag at lists.tdwg.org
>> Subject: Re: [Tdwg-tag] TCS in RDF for use in LSIDs and possible generic
>> mechanism.
>>
>> Just thoughts/comments on the use of XML Schema for validating RDF
>> documents.
>>
>> I'm afraid that by using XML Schema to validate RDF documents, we
>> would be creating unnecessary constraints on the system. Some
>> services may want to serve data in formats other than RDF/XML, for
>> example N-Triple or Turtle for various reasons. Neither of these
>> would be able to be validated by an XML Schema. For example, I've
>> been working on indexing large quantities of data represented as RDF
>> using standard IR techniques. N-Triple has distinct benefits over
>> other representations because its grammar is trivial. Another
>> benefit of N-Triple is that one can use simple concatenation to build
>> a model without being required to use an in-memory model through an
>> RDF library such as Jena. For example, I can build a large single
>> document containing N-Triples about millions of resources. The index
>> maintains file position and size for each resource indexed. The
>> benefit of using N-Triple is that upon querying, I can simply use
>> fast random access to the file based on the position and size stored
>> in the index to read in chunks of N-Triple based on the size and
>> immediately start streaming the results across the wire.
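[Editorial illustration: the indexing scheme Rob describes above can be sketched in a few lines of Python. The file layout, function names and URIs here are invented for illustration, not taken from Rob's actual system. The key point the sketch shows is that, because N-Triple is line oriented, a store can be built by plain concatenation, and an index of (byte offset, size) pairs allows each resource's chunk to be read back with a single seek, with no in-memory RDF model or Jena-style library.]

```python
# Sketch of an N-Triple store built by concatenation, with an index of
# (byte offset, size) per resource for fast random-access retrieval.
import os
import tempfile

def append_resource(path, index, resource_uri, ntriples):
    """Append one resource's N-Triple lines; record (offset, size)."""
    data = ntriples.encode("utf-8")
    with open(path, "ab") as f:
        offset = f.tell()        # current end of file = start of chunk
        f.write(data)
    index[resource_uri] = (offset, len(data))

def fetch_resource(path, index, resource_uri):
    """Random-access read of one resource's triples, ready to stream."""
    offset, size = index[resource_uri]
    with open(path, "rb") as f:
        f.seek(offset)           # jump straight to the stored chunk
        return f.read(size).decode("utf-8")

store = os.path.join(tempfile.mkdtemp(), "store.nt")
index = {}
append_resource(store, index, "urn:lsid:example.com:names:1",
                '<urn:lsid:example.com:names:1> '
                '<http://purl.org/dc/elements/1.1/title> '
                '"Bellis perennis" .\n')
append_resource(store, index, "urn:lsid:example.com:names:2",
                '<urn:lsid:example.com:names:2> '
                '<http://purl.org/dc/elements/1.1/title> '
                '"Bellis annua" .\n')
print(fetch_resource(store, index, "urn:lsid:example.com:names:1"))
```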
>>
>> With the additional constraint of using only RDF/XML as the output
>> format, the above indexer example would either need to custom
>> serialize N-Triple -> RDF/XML or use a library to read it into an
>> in-memory model to serialize it as RDF/XML.
>>
>> Another concern is that we will be reducing any serialization
>> potential we have from standard libraries. Jena, Redland, SemWeb, or
>> any other library that can produce and consume RDF is not likely to
>> produce RDF/XML in the same format. Producers of RDF now will not
>> only be required to use RDF/XML as opposed to other formats such as
>> N-Triple, but will be required to write custom serialization code to
>> translate the in-memory model for the library of their choice into
>> the structured RDF response that fits the XML Schema. It seems to
>> me, we are really removing one of the technical benefits of using
>> RDF. Services and consumers really should not need to be concerned
>> about the specific structure of the bits of RDF across the wire so
>> long as it's valid RDF.
>>
>> In my humble opinion, any constraints and validation should be either
>> at the level of the ontology (OWL-Lite, OWL-DL, RDFS/OWL) or through
>> a reasoner that can be packaged and distributed for use within any
>> application that desires to utilize our products.
>>
>> Cheers,
>> Rob
>>
>> Roger Hyam wrote:
>>
>>
>>> Hi Everyone,
>>>
>>> I am cross posting this to the TCS list and the TAG list because it
>>> is relevant to both but responses should fall neatly into things to
>>> do with nomenclature (for the TCS list) and things to do with
>>> technology - for the TAG list. The bit about avowed serializations
>>> of RDF below is TAG relevant.
>>>
>>> The move towards using LSIDs and the implied use of RDF for metadata
>>> has led to the question: "Can we do TCS in RDF?". I have put
>>> together a package of files to encode the TaxonName part of TCS as
>>> an RDF vocabulary. It is not 100% complete but could form the basis
>>> of a
>>>
>> solution.
>>
>>
>>> You can download it here:
>>> http://biodiv.hyam.net/schemas/TCS_RDF/tcs_rdf_examples.zip
>>>
>>> For the impatient you can see a summary of the vocabulary here:
>>> http://biodiv.hyam.net/schemas/TCS_RDF/TaxonNames.html
>>>
>>> and an example xml document here:
>>> http://biodiv.hyam.net/schemas/TCS_RDF/instance.xml
>>>
>>> It has actually been quite easy (though time consuming) to represent
>>> the semantics in the TCS XML Schema as RDF. Generally elements
>>> within the TaxonName element have become properties of the TaxonName
>>> class with some minor name changes. Several other classes were
>>> needed to represent NomenclaturalNotes and Typification events. The
>>> only difficult part was with Typification. A nomenclatural type is
>>> both a property of a name and, if it is a lectotype, a separate
>>> object that merely references a type and a name. The result is a
>>> compromise in an object that can be embedded as a property. I use
>>> instances for controlled vocabularies, which may or may not be
>>> controversial.
>>>
>>> What is lost in only using RDFS is control over validation. It is
>>> not possible to specify that certain combinations of properties are
>>> permissible and certain not. There are two approaches to adding more
>>> 'validation':
>>>
>>>
>>> OWL Ontologies
>>>
>>> An OWL ontology could be built that makes assertions about the items
>>> in the RDF ontology. It would be possible to use necessary and
>>> sufficient properties to assert that instances of TaxonName are
>>> valid members of an OWL class for BotanicalSubspeciesName for
>>> example. In fact far more control could be introduced in this way
>>> than is present in the current XML Schema. What is important to note
>>> is that any such OWL ontology could be separate from the common
>>> vocabulary suggested here. Different users could develop their own
>>> ontologies for their own purposes. This is a good thing as it is
>>> probably impossible to come up with a single, agreed ontology that
>>> handles the full complexity of the domain.
>>>
>>> I would argue strongly that we should not build a single central
>>> ontology that summarizes all we know about nomenclature - we
>>> couldn't do it within my lifetime :)
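[Editorial illustration: the kind of necessary-and-sufficient assertion Roger describes might look like the fragment below, in RDF/XML. The class name BotanicalSubspeciesName is Roger's own example; the infraspecificEpithet property and the cardinality condition are assumed here purely for illustration, not taken from the actual vocabulary.]

```xml
<!-- Sketch: a TaxonName counts as a BotanicalSubspeciesName
     if and only if it carries an infraspecificEpithet.
     Property and namespace details are hypothetical. -->
<owl:Class rdf:about="#BotanicalSubspeciesName"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <owl:equivalentClass>
    <owl:Class>
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Class rdf:about="#TaxonName"/>
        <owl:Restriction>
          <owl:onProperty rdf:resource="#infraspecificEpithet"/>
          <owl:minCardinality
              rdf:datatype="http://www.w3.org/2001/XMLSchema#nonNegativeInteger"
              >1</owl:minCardinality>
        </owl:Restriction>
      </owl:intersectionOf>
    </owl:Class>
  </owl:equivalentClass>
</owl:Class>
```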
>>>
>>>
>>> Avowed Serializations
>>>
>>> Because RDF can be serialized as XML it is possible for an XML
>>> document to both validate against an XML Schema AND be valid RDF.
>>> This may be a useful generic solution so I'll explain it here in an
>>> attempt to make it accessible to those not familiar with the
>>> technology.
>>>
>>> The same RDF data can be serialized in XML in many ways and
>>> different code libraries will do it differently though all code
>>> libraries can read the serializations produced by others. It is
>>> possible to pick one of the ways of serializing a particular set of
>>> RDF data and design a XML Schema to validate the resulting
>>> structure. I am stuck for a way to describe this so I am going to
>>> use the term 'avowed serialization' (Avowed means 'openly declared')
>>> as opposed to 'arbitrary serialization'. This is the approach taken
>>> by the prismstandard.org <http://www.prismstandard.org> group for
>>> their standard and it gives a number of benefits as a bridging
>>> technology:
>>>
>>> 1. Publishing applications that are not RDF aware (even simple
>>> scripts) can produce regular XML Schema validated XML documents
>>> that just happen to also be RDF compliant.
>>> 2. Consuming applications can assume that all data is just RDF and
>>> not worry about the particular XML Schema used. These are the
>>> applications that are likely to have to merge different kinds of
>>> data from different suppliers so they benefit most from treating
>>> it like RDF.
>>> 3. Because it is regular structured XML it can be transformed using
>>> XSLT into other document formats such as 'legacy' non-RDF
>>> compliant structures - if required.
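[Editorial illustration: a minimal document in the spirit of the instance.xml described below, showing what an avowed serialization might look like. The namespace URI is a placeholder, since the mail notes the real namespaces are not yet finalized. The point is that the document has a fixed, predictable shape an XML Schema could validate, while also being well-formed RDF/XML.]

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the tn namespace URI is invented. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:tn="http://example.org/TaxonNames#">
  <tn:TaxonName rdf:about="urn:lsid:example.com:names:1234">
    <tn:genusEpithet>Bellis</tn:genusEpithet>
    <tn:specificEpithet>perennis</tn:specificEpithet>
  </tn:TaxonName>
</rdf:RDF>
```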
>>>
>>> There is one direction that data would not flow without some effort.
>>> The same data published in an arbitrary serialization rather than
>>> the avowed one could be transformed, probably via several XSLT
>>> steps, into the avowed serialization and therefore made available to
>>> legacy applications using 3 above. This may or may not be worth the
>>> bother. Some of the code involved would be generic to all
>>> transformations, so the effort may not be too great. It would
>>> certainly be possible for restricted data sets.
>>>
>>> To demonstrate this instance.xml is included in the package along
>>> with avowed.xsd and two supporting files. instance.xml will validate
>>> against avowed.xsd and parse correctly in the w3c RDF parser.
>>>
>>> I have not provided XSLT to convert instance.xml to the TCS standard
>>> format though I believe it could be done quite easily if required.
>>> Converting arbitrary documents from the current TCS to the structure
>>> represented in avowed.xsd would be more tricky but feasible and
>>> certainly possible for restricted uses of the schema that are
>>> typical from individual data suppliers.
>>>
>>>
>>> Contents
>>>
>>> This is what the files in this package are:
>>>
>>> README.txt = this file
>>> TaxonNames.rdfs = An RDF vocabulary that represents TCS TaxonNames
>>> object.
>>> TaxonNames.html = Documentation from TaxonNames.rdfs - much more
>>> readable.
>>> instance.xml = an example XML document that is both an RDF-compliant
>>> use of the vocabulary and XML Schema compliant.
>>> avowed.xsd = XML Schema that instance.xml validates against.
>>> dc.xsd = XML Schema that is used by avowed.xsd.
>>> taxonnames.xsd = XML Schema that is used by avowed.xsd.
>>> rdf2html.css = the style formatting for TaxonNames.html
>>> rdfs2html.xsl = XSLT style sheet to generate docs from TaxonNames.rdfs
>>> tcs_1.01.xsd = the TCS XML Schema for reference.
>>>
>>>
>>> Needs for other Vocabularies
>>>
>>> What is obvious looking at the vocabulary for TaxonNames here is
>>> that we need vocabularies for people, teams of people, literature
>>> and specimens as soon as possible.
>>>
>>>
>>> Need for conventions
>>>
>>> In order for all exchanged objects to be discoverable in a
>>> reasonable way we need to have conventions on the use of rdfs:label
>>> for Classes and Properties and dc:title for instances.
>>>
>>> The namespaces used in these examples are fantasy as we have not
>>> finalized them yet.
>>>
>>>
>>> Minor changes in TCS
>>>
>>> There are a few points where I have intentionally not followed TCS
>>> 1.01 (there are probably others where it is accidental).
>>>
>>> * basionym is a direct pointer to a TaxonName rather than a
>>> NomenclaturalNote. I couldn't see why it was a nomenclatural note
>>> in the 1.01 version as it is a simple pointer to a name.
>>> * changed name of genus element to genusEpithet property. The
>>> contents of the element are not to be used alone and are not a
>>> genus name in themselves (uninomial should be used in this case)
>>> so genusEpithet is more appropriate - even if it is not common
>>> English usage.
>>> * Addition of referenceTo property. The vocabulary may be used to
>>> mark up an occurrence of a name that is not a publishing of a new
>>> name. In these cases the thing being marked up is actually a
>>> pointer to another object, either a TaxonName issued by a
>>> nomenclator or a TaxonConcept. In these cases we need to have a
>>> reference field. Here is an example (assuming namespace)
>>> <TaxonName referenceTo="urn:lsid:example.com:myconcepts:1234">
>>>   <genusEpithet>Bellis</genusEpithet>
>>>   <specificEpithet>perennis</specificEpithet>
>>> </TaxonName>
>>
>>
>>> This could possibly appear in a XHTML document for example.
>>>
>>>
>>> Comments Please
>>>
>>> All this amounts to a complex suggestion of how things could be
>>> done, i.e. we develop central vocabularies that go no further than
>>> RDFS but permit exchange and validation of data using avowed
>>> serializations and OWL ontologies.
>>>
>>> What do you think?
>>>
>>> Roger
>>>
>>>
>>> --
>>>
>>> -------------------------------------
>>> Roger Hyam
>>> Technical Architect
>>> Taxonomic Databases Working Group
>>> -------------------------------------
>>> http://www.tdwg.org
>>> roger at tdwg.org
>>> +44 1578 722782
>>> -------------------------------------
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>> _______________________________________________
>>> Tdwg-tag mailing list
>>> Tdwg-tag at lists.tdwg.org
>>> http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org
>>>
>>
>>
>>
>> _______________________________________________
>> Tdwg-tag mailing list
>> Tdwg-tag at lists.tdwg.org
>> http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org
>>
>>
>
--
-------------------------------------
Roger Hyam
Technical Architect
Taxonomic Databases Working Group
-------------------------------------
http://www.tdwg.org
roger at tdwg.org
+44 1578 722782
-------------------------------------