Re: [tdwg-tnc] LSIDs and taxon concepts

8 Oct 2007

Éamonn,

When I hear validity I always think "Valid for whom?" Maybe I am too  
much of a relativist/liberal etc. Anyhow...

There are three scenarios

1) The consumer and producer of the document have an agreement that
the serialized RDF will conform to a particular XML Schema. It is not
possible to include a schemaLocation attribute in serialized RDF, so
this has to be asserted somewhere else. If you (as a consumer) are
using TAPIR then you will have specified the output model (an XML
Schema) when you asked for the data in the first place, and the data
should be valid against that schema or the provider is being
naughty. You could then validate it against another schema of your
own prior to import.

2) If you are not getting the RDF from TAPIR then you don't know the
document structure. It could be serialized in a number of ways. In
this case you should use an RDF parser (they are available for most
languages) to generate an in-memory (or database-backed) model and
programmatically work over this to do as you please. Using the
resource-centered approach, your code would do things like get all the
TaxonConcepts in the graph, then work over them asking for properties
and values and doing something sensible with them. This is similar to
what you would do if you got a schema-validated XML document that you
had bound to Java objects using JAXB, or even if you read a valid
document into a DOM of some kind.

3) If you are a really sophisticated semantic web wise client you  
would put RDF straight into a model (or triple store) with your own  
ontology that made assertions about what it considered valid  
TaxonConcepts. You would then just ask the model for a list of valid  
TaxonConcepts.
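
As a rough sketch of scenarios 2 and 3 (plain Python, with a set of
tuples standing in for the model a real RDF parser would build; the
prefixed names, the blank node labels and the "valid for me" rule are
all made up for illustration, and a real scenario 3 client would use
an ontology and a reasoner rather than a hand-written function):

```python
# Toy stand-in for the in-memory model an RDF parser would build:
# a graph is just a set of (subject, predicate, object) triples.
RDF_TYPE = "rdf:type"

graph = {
    ("_:c1", RDF_TYPE, "tc:TaxonConcept"),
    ("_:c1", "tc:hasName", "_:n1"),
    ("_:c1", "tc:accordingToString", "Brown and Smith 1995"),
    ("_:n1", RDF_TYPE, "tn:TaxonName"),
    ("_:n1", "tn:genusPart", "Rhododendron"),
    ("_:n1", "tn:specificEpithet", "ponticum"),
    # A second concept that carries no accordingToString.
    ("_:c2", RDF_TYPE, "tc:TaxonConcept"),
    ("_:c2", "tc:hasName", "_:n1"),
}


def subjects_of_type(g, rdf_class):
    """Scenario 2: find every resource of a class, however it was serialized."""
    return sorted(s for (s, p, o) in g if p == RDF_TYPE and o == rdf_class)


def values(g, subject, predicate):
    """Ask one resource for the values of one property."""
    return sorted(o for (s, p, o) in g if s == subject and p == predicate)


def valid_for_me(g, concept):
    """Scenario 3, crudely: a local rule for what *this* client accepts.

    Hypothetical rule: a concept needs a name and an accordingToString.
    """
    has_name = bool(values(g, concept, "tc:hasName"))
    has_according_to = bool(values(g, concept, "tc:accordingToString"))
    return has_name and has_according_to


concepts = subjects_of_type(graph, "tc:TaxonConcept")
accepted = [c for c in concepts if valid_for_me(graph, c)]
```

The point is that "valid" lives in valid_for_me, on the consumer's
side, and a different consumer is free to apply a different rule to
the same graph.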

I would be interested to meet anyone who is actually validating XML  
documents they get back from a supplier using XML Schema and relying  
on that validity alone to import the data into their own data model.  
I have talked with people who generate Java from XML Schema but they  
usually then mess with it to get it to work for their application.

If I received an ABCD document that wasn't valid, for example, I would
have a choice of rejecting it entirely or trying to work out if the
bit that broke the validity affects my data model. If it is valid I
still have to check that the contents of the elements fit my model.
Whatever happens I have to walk over the DOM programmatically, so
there seems little point in actually validating it first; I may as
well just let my own code fail in a way I understand and can recover
from.
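
A minimal sketch of what I mean, in Python with the standard library's
ElementTree. The record layout here is hypothetical (these are not
real ABCD element names) and UnusableRecord is just a name I made up;
the idea is only that my own code checks the parts my model needs and
recovers record by record:

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout, NOT real ABCD element names.
DOC = """
<records>
  <record id="1"><name>Rhododendron ponticum L.</name></record>
  <record id="2"><collector>Anon.</collector></record>
  <record id="3"><name>Rhododendron ponticum L.</name><unexpected>x</unexpected></record>
</records>
"""


class UnusableRecord(Exception):
    """Raised when a record lacks something my own data model needs."""


def to_model(record):
    """Map one element onto my model; only the parts I care about are checked."""
    name = record.findtext("name")
    if not name:
        raise UnusableRecord(f"record {record.get('id')} has no name")
    return {"id": record.get("id"), "name": name.strip()}


def import_records(xml_text):
    """Walk the document, keeping usable records and noting the rest."""
    imported, rejected = [], []
    for record in ET.fromstring(xml_text.strip()).iter("record"):
        try:
            imported.append(to_model(record))
        except UnusableRecord as err:
            rejected.append(str(err))  # recover and carry on with the rest
    return imported, rejected


imported, rejected = import_records(DOC)
```

Record 3 has an element a schema might well reject, but it doesn't
touch my model so it is imported; record 2 would be rejected by my
code whether or not the document as a whole was schema-valid.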

There is an analogy I use. When I post a letter I have to have a  
valid envelope, stamp and address on it. I don't expect the postman  
to open the letter and say "Hey that is bad grammar I am not  
delivering it!".  When I read a letter I can understand it if the  
grammar is bad in parts - especially if those parts are unimportant  
to me. It might be valid to me but not to the postman.

(BTW there is a postal strike in the UK at the moment - not sure if  
that strengthens the analogy or invalidates it)

I hope this helps,

Roger

On 8 Oct 2007, at 15:25, Eamonn O Tuama wrote:
...
Hi Roger,
I'd like you to comment on the issue of validation. In RDF, with its
Open World assumption, we lose the ability to validate. So how easy
is it to take RDF output and, if an application requires it, re-format
it so that it can be validated against an XML Schema, i.e., take your
example below and link it to an XML Schema? I understand that there
are multiple ways in RDF to express the same thing, so would that
create problems for a schema if the RDF was coming from different data
providers? Or do we have control of that because of the LSID
vocabularies?
Éamonn
-----Original Message-----
From: tdwg-tnc-bounces@lists.tdwg.org
[mailto:tdwg-tnc-bounces@lists.tdwg.org] On Behalf Of Roger Hyam
Sent: 05 October 2007 16:51
To: Richard Pyle
Cc: tdwg-tnc@lists.tdwg.org
Subject: Re: [tdwg-tnc] LSIDs and taxon concepts
Paul, Rich et al.
I'll try and answer all the questions in a single mail and also keep
it short.
Taxon Concept Schema (TCS) is an XML Schema that was standardized by
TDWG in 2005, but "TCS" is also used as shorthand for distinguishing
between Taxon Names and Taxon Concepts.
The fundamental thing that TCS does (both the schema and the way of
modeling) is separate TaxonNames (or nomenclatural acts) from
TaxonConcepts (actual delimited or implied taxa that one would
identify something to).
In order to issue LSIDs for TaxonNames or TaxonConcepts it is
necessary to represent them in RDF rather than XML Schema. RDF is far
more modular by nature than XML Schema and so two vocabularies were
put together to represent TaxonNames and TaxonConcepts (rather than
one schema) but unless you are issuing pure nomenclatural data you
will usually use both.
http://rs.tdwg.org/ontology/voc/TaxonName
http://rs.tdwg.org/ontology/voc/TaxonConcept
The TaxonName vocabulary is being used by IPNI, Index Fungorum and
soon ZooBank. It is also being used by GBIF and anyone else who uses
the TaxonConcept or TaxonOccurrence because it is embedded within
these vocabularies. In fact it could be used anywhere someone wants
to break apart a name string.
The TaxonOccurrence vocabulary is being issued by the CATE project
(I don't have the reference to hand), Species2k/Catalogue of Life
are going to use it for their checklist, and of course it is used by
others who issue TaxonOccurrence data.
I'll show an example of the embedding as it makes things clearer.
This is abbreviated for clarity. Suppose we want to express an
occurrence of a taxon (perhaps as a specimen):
<to:TaxonOccurrence rdf:about="urn:lsid:example.com:specimens:1234">
  <dc:title>Hyam.R.D. 284927 - Rhododendron ponticum L.</dc:title>
  <to:collector>Roger Hyam</to:collector>
  <!-- .... other stuff ... -->
  <to:identifiedTo>
    <to:Identification>
      <to:expertName>Chris Browning</to:expertName>
      <to:taxonName>Rhododendron ponticum L.</to:taxonName>
      <to:taxon>
        <tc:TaxonConcept>
          <tc:hasName>
            <tn:TaxonName>
              <tn:genusPart>Rhododendron</tn:genusPart>
              <tn:specificEpithet>ponticum</tn:specificEpithet>
              <tn:authorship>L.</tn:authorship>
            </tn:TaxonName>
          </tc:hasName>
          <tc:accordingToString>Brown and Smith 1995</tc:accordingToString>
          <tcom:publishedIn>Some monograph by some guys</tcom:publishedIn>
          <tcom:microReference>page 32</tcom:microReference>
          <tcom:publishedInCitation rdf:resource="http://some.uri.to.some.citation.for.the.pub"/>
        </tc:TaxonConcept>
      </to:taxon>
    </to:Identification>
  </to:identifiedTo>
</to:TaxonOccurrence>
So we have a TaxonOccurrence (really like a DarwinCore record but
with embedding). In order to express the identification of this
specimen in more detail than just a string we include a TaxonConcept
and a TaxonName. Neither the concept nor the name has an identity
(they are both anonymous) but they are both objects of that type.
They could be replaced by references to external instances. There are
also properties to allow the supplier to "cop out" of embedding or
referencing anything and simply include a string if that is all they
have in their database.
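
From the consumer's side, the three options (embedded anonymous
object, reference to an external instance, or plain string) might be
handled something like this. It is only a sketch in Python: the class
names, field names and the example LSID are hypothetical and are not
part of the vocabularies.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class EmbeddedConcept:
    """An anonymous TaxonConcept carried inside the occurrence record."""
    genus_part: str
    specific_epithet: str
    according_to: Optional[str] = None


@dataclass
class ConceptReference:
    """A pointer to a TaxonConcept published elsewhere (e.g. by LSID)."""
    uri: str


# The "cop-out" form is just a plain string.
Taxon = Union[EmbeddedConcept, ConceptReference, str]


def describe(taxon: Taxon) -> str:
    """Produce a display label whichever form the supplier chose."""
    if isinstance(taxon, EmbeddedConcept):
        label = f"{taxon.genus_part} {taxon.specific_epithet}"
        if taxon.according_to:
            return f"{label} according to {taxon.according_to}"
        return label
    if isinstance(taxon, ConceptReference):
        return f"see {taxon.uri}"
    return taxon  # string-only supplier


embedded = EmbeddedConcept("Rhododendron", "ponticum", "Brown and Smith 1995")
referenced = ConceptReference("urn:lsid:example.com:concepts:5678")
string_only = "Rhododendron ponticum L."
labels = [describe(t) for t in (embedded, referenced, string_only)]
```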
So in issuing a TaxonOccurrence record I use both TaxonConcept and
TaxonName vocabularies. I am not using TCS in the sense of the XML
Schema but I am using it in the sense of the notions involved.
This is where we are headed with integrated standards and semantic
approaches.
I hope this helps.
BTW I hope it answers Rich's question as it is possible to add
reference info in the TaxonConcept to say where it was published
using the common properties defined in:
http://rs.tdwg.org/ontology/voc/Common
All the best,
Roger
On 5 Oct 2007, at 13:58, Richard Pyle wrote:
...
Hi Paul and others,
This leads me to a couple of questions about serving TCS data. For
example, strictly speaking, ZooBank will be returning metadata in
accordance with the "TaxonNameLsidVoc"
(http://wiki.tdwg.org/twiki/bin/view/TAG/TaxonNameLsidVoc), which is
based on TCS, but is not TCS per se (ZooBank is concerned with taxon
names, not concepts). There is also the TaxonConceptLsidVoc
(http://wiki.tdwg.org/twiki/bin/view/TAG/TaxonConceptLsidVoc), which
together with the TaxonNameLsidVoc and other more general ontologies,
collectively represent the same information as a TCS XML document.
I guess that one of the things I'm not clear on is whether RDF
returned for an LSID counts as "TCS", or does TCS specifically mean a
document structured according to the TCS XML Schema?
Also, what are we really serving when we say we're serving TCS
documents? Name-only data is part of TCS, but I wouldn't think of it
as TCS per se. I think you need it in the context of an "accordingTo"
instance. (By the way -- Roger -- I'd always thought of "accordingTo"
as referring to a PublicationCitation, not an Actor or Team. A topic
of discussion for another day...)
But my point is, I've got hundreds of thousands of database records
for [Name accordingTo Publication], which each represent a pointer to
a taxon concept (that is, "concept" sensu Kennedy, not sensu Pyle).
And for many of these, I also have information on synonymies within
the Publication (i.e., taxon concepts defined at the resolution of
names, which means at the implied resolution of type specimens). What
I don't have, however, are robust sets of "taxon concept" records that
go into more specific detail regarding the definition of the concept
itself (in terms of non-type specimens and/or character data, for
example). Also, I don't have much in the way of third-party
RelationshipAssertions to define how these alternate concepts map to
each other.
This leads to the question I've been meaning to ask, which is "How
much information do I need before I call it a TCS document?" I would
say raw names data alone don't cut it -- you would need at least an
"accordingTo" before you could call it a concept/TCS document. But if
all I have is an accordingTo (with no additional specimens or
characters or RelationshipAssertions), do I still call it TCS?
Sorry if I'm over-thinking this...
Aloha,
Rich
...
-----Original Message-----
From: tdwg-tnc-bounces@lists.tdwg.org
[mailto:tdwg-tnc-bounces@lists.tdwg.org] On Behalf Of Paul Allen
Sent: Friday, October 05, 2007 2:11 AM
To: tdwg-tnc@lists.tdwg.org
Subject: [tdwg-tnc] LSIDs and taxon concepts
Hi all,
I'm new to this list and hope that the following are
appropriate questions.
In Bratislava, I wasn't keeping detailed enough notes on
projects and their current and future plans wrt TCS.
What sites are currently publishing TCS-formatted data or
will be within the year? I know that zoobank.org will be
publishing TCS data in the near future. Is GBIF? ITIS? Species2000?
What sites are publishing real "taxon concept" data (in TCS
format or not)?
Conversely, what sites are simply publishing "nominal taxon
concepts" as opposed to detailed authoritative taxon concepts?
Is this the kind of thing for which we should generate a
survey to send to sites (i.e. their plans for publishing TCS)
or distribute to TDWG members?
Thanks,
Paul
------------------------
Paul Allen, Assistant Director
Information Science             pea1@cornell.edu
Cornell Lab of Ornithology     (800) 843-BIRD
159 Sapsucker Woods Road       (607) 254-2480 (direct)
Ithaca, NY 14850               (607) 254-2415 (fax)
http://www.birds.cornell.edu/
http://bna.birds.cornell.edu/
http://www.ebird.org/
http://bird.atlasing.org/
------------------------
_______________________________________________
tdwg-tnc mailing list
tdwg-tnc@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tnc