[tdwg-tag] Any TCS users with experiences to report? [SEC=UNCLASSIFIED]

Fri Nov 2 02:31:02 CET 2012

Firstly, the XML schema:

http://biodiversity.org.au/xml/ibis is an xml namespace, which works a bit differently to RDF namespaces.

RDF does not have an explicit mechanism for finding schema metadata. By convention (and it is just a convention), we usually find the schema for a namespace by assuming that the namespace URI will work as a URL that can be fetched, and that fetching it  will pull back a schema description (possibly in any one of several formats, using HTTP content negotiation).

In XML, however, namespaces are explicitly linked to schema documents by the xsi:schemaLocation attribute.

The xml generated by biodiversity.org.au

	http://biodiversity.org.au/apni.taxon/54321.xml

Comes back with  the declaration 

<app:documents xmlns:app="http://biodiversity.org.au/xml/servicelayer/content" xmlns:ibis="http://biodiversity.org.au/xml/ibis" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:cfg="http://biodiversity.org.au/xml/servicelayer/configuration" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation=" http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/dcmitype.xsd http://biodiversity.org.au/xml/servicelayer/content http://anbg.gov.au/ala/schemas/xml/app.xsd http://biodiversity.org.au/xml/ibis http://biodiversity.org.au/xml/ibis-20120706.xsd http://biodiversity.org.au/xml/apni http://biodiversity.org.au/xml/apni-20120706.xsd http://biodiversity.org.au/xml/afd http://biodiversity.org.au/xml/afd-20120706.xsd http://biodiversity.org.au/xml/col http://biodiversity.org.au/xml/col-20110615.xsd ">

Although it's a bit buried in there, XML parsers can see from this that the xml namespace "http://biodiversity.org.au/xml/ibis" has a location of "http://biodiversity.org.au/xml/ibis-20120706.xsd". All of our XML is supposed to validate, and last time I checked it did.

By the way - note the date on the filename. We have changed the schema from time to time. Another change is upcoming: the addition of an "excluded" flag for concepts that have been considered for APC and have been explicitly excluded (for a variety of reasons). This will be managed by a new schema document being available on our server and the generated xsi:schemaLocation attribute being changed.

Secondly, TCS:

The issue with TCS is that it is very difficult to extend. To use a bit of TCS in some other schema, you would import the element types and extend them. But TCS mostly does not expose its element types as named types that can be referenced externally - it's all done inline. This means that the only place a TCS "ScientificName" or "Rank" element can appear is somewhere inside a TCS DataSet element. This is not in itself a show-stopper: we could simply generate a DataSet wrapper when we produce output in response to fetches.

But there were other issues such as (and I can only recall one or two at the moment - this mail is not a full defence of our decision to not go with TCS):

A TaxonConcepts element may have multiple TaxonRelationships element. We would like to attach additional data to each relationship to capture information that TCS cannot. There is a ProviderSpecificData element, but this is at the end of the TaxonConcept element, and I could not work out a way to stuff the extra data for each relationship into that ProviderSpecificData element in such a way as it was attached to the correct relationship - although re-looking at it now I see a "ref" attribute and perhaps that is meant to do the job.

There are multiple TCS "relationship types", but these did not quite match the data we had. It is not possible to put anything but a TCS relationship type enum into the "type" attribute of a TaxonRelationship element, so we wind up having to provide two fields - the "real" type and the nearest TCS equivalent. The "real" type needs to go in the ProviderSpecificData section - miles away (in the document) from what is supposed to be the primary place where the relationship is described. It's ugly. Furthermore, some of our relationships don't really match the TCS ones at all well - to the point that using a TCS type would be misleading. The TCS enumeration does not have a "other" value, so there was a bit of an impasse.

In any event, we were looking at either putting some relationships in the TCS array, and some in the PSD array, or putting corresponding arrays in each. Of course, in the provider specific data section we cannot use any of the TCS elements, because the element types are not exposed and can only appear in a TCS DataSet at the correct spot. It just got to the point where the ProviderSpecificData section was bigger and more interesting than the TCS, so we broke it out into a separate XML document (which was bundles with the TCS using an ibis:documents wrapper), at which point we couldn't help but ask "Can some one explain again why we are trying to do this?". After more discussions with both the zoologists and the botanists, attempting to work out which TCS enumerated values I should use for what, we gave up.

TCS does an admirable job of being watertight. If you have any valid XML document with any TCS element, then you know that it will be enclosed in a DataSet element and come bundled with enough context to make sense of it. It's a model for shifting around entire, self-contained *sets* of data. Entire taxonomies, sitting as big files on a disk (or in an xml store). But our service layer serves up fragments - one or several taxa in response to a request, and TCS turned out to not be a good model for what we do. The history of trying to use it has left us with a legacy of having multiple relationship-type fields (relationship "description" and relationship "category") whose product does not form a sensible set of values.

What we have now is a site-specific schema that captures and exposes the data we have. Admittedly, this means that the grand goal we are all trying to accomplish - a consistent worldwide net of data - is not as far down the track as we were hoping to go. It means that the problem of working out how data set 1 matches data set 2 is pushed off onto aggregators, a job that is in general impossible for an aggregator to accomplish.  If we could have fitted our data into TCS, if everyone else could also have done so, then that would have been wonderful. We were reluctant to abandon it, but to get our data out the door we eventually did.

On 02/11/2012, at 9:41 AM, <Tony.Rees at csiro.au> <Tony.Rees at csiro.au> wrote:

> Hi TDWG persons,
>  
> I am involved in an activity here to set a local standard for storing taxonomic name, identifier and (probably) hierarchy information in metadata records using our profile of ISO 19115 for the latter, and the question will come up as to whether to use elements from TCS, DwC, EML, NCBII extension to ISO 19115, or other. By default I would expect the front runner to be TCS but it appears few if any major systems have ever gone that route – I have looked at ITIS, COL, TROPICOS, WoRMS, IPNI, GBIF, AFD/APNI, more… the nearest would perhaps be AFD/APNI (hence copying Paul on this email) however their “ibis” schema, though apparently based originally on TCS,http://biodiversity.org.au/xml/ibis-20120909.xsd , does not make any explicit reference to the TCS schema so far as I can see. (Note also the cited schema definition http://biodiversity.org.au/xml/ibis [or presumablyhttp://biodiversity.org.au/xml/ibis.xsd] does not seem to exist, but maybe I am missing something).
>  
> I am in the interesting position of also wishing to make apps which both publish and consume taxonomic name information so *could* implement TCS for these, but if no-one else is doing so maybe that is not a path to future data harmonisation, and something like DwC might be better.
>  
> It does seem odd that we have a standard endorsed in 2005 by TDWG which is apparently unused by any current major players in the real world. Any thoughts?
>  
> Regards - Tony
>  
> Tony Rees
> Manager, Divisional Data Centre,
> CSIRO Marine and Atmospheric Research,
> GPO Box 1538,
> Hobart, Tasmania 7001, Australia
> Ph: 0362 325318 (Int: +61 362 325318)
> Fax: 0362 325000 (Int: +61 362 325000)
> e-mail: Tony.Rees at csiro.au
> Manager, OBIS Australia regional node, http://www.obis.org.au/
> Biodiversity informatics research activities: http://www.cmar.csiro.au/datacentre/biodiversity.htm
> Personal info: http://www.fishbase.org/collaborators/collaboratorsummary.cfm?id=1566
> LinkedIn profile: http://www.linkedin.com/pub/tony-rees/18/770/36
>  
> From: tdwg-tag-bounces at lists.tdwg.org [mailto:tdwg-tag-bounces at lists.tdwg.org] On Behalf Of Paul Murray
> Sent: Wednesday, 7 March 2012 12:52 PM
> To: Steve Baskauf
> Cc: "Éamonn Ó Tuama (GBIF)"; TDWG TAG
> Subject: Re: [tdwg-tag] Creating a TDWG standard for documenting Data Standards [SEC=UNCLASSIFIED]
>  
>  
> On 07/03/2012, at 3:11 AM, Steve Baskauf wrote:
> 
> 
> Dag and Éamonn,
> 
> In the context of the discussion which has been going on in the TDWG RDF mailing list, I have been thinking more about the issue of how to deal with DwC terms which state "Recommended best practice is to use a controlled vocabulary...".  That would be dcterms:type, dwc:language, dwc:basisOfRecord, dwc:sex, dwc:lifeStage, dwc:reproductiveCondition, dwc:behavior, dwc:establishmentMeans, dwc:occurrenceStatus, dwc:disposition, dwc:continent, dwc:waterBody, dwc:islandGroup, dwc:island, dwc:country, dwc:verbatimCoordinateSystem, dwc:georeferenceVerificationStatus, dwc:identificationVerificationStatus, dwc:taxonRank; dwc:nomenclaturalCode, dwc:taxonomicStatus, dwc:relationshipOfResource, and dwc:measurementType .
>  
>  
> We here have had all sorts of problems using other people's vocabularies - they never quite match the data we have. Our solution has been to use the standard terms where possible, but to mint our own where needed. We create RDF objects and to declare them as being the correct type.
>  
> For instance, 
>           http://biodiversity.org.au/voc/afd/AFD#RelationshipTypeTerm
>  
> Is declared to be a subclass of
>           http://rs.tdwg.org/ontology/voc/TaxonConcept#TaxonRelationshipTerm
>  
> And we have a few specific items of that type:
>     http://biodiversity.org.au/voc/afd/RelationshipTypeTerm#has-emendation
>     http://biodiversity.org.au/voc/afd/RelationshipTypeTerm#has-invalid-name
>     http://biodiversity.org.au/voc/afd/RelationshipTypeTerm#has-junior-homonym
>     http://biodiversity.org.au/voc/afd/RelationshipTypeTerm#has-miscellaneous-literature-name
>  
> These individuals are therefore correctly typed to be legitimately be used as a TDWG  relationshipCategory. 
>  
> Your lists of dwc:disposition values does not need to be exhaustive. It's legitimate (from a machine point of view) for a site to create their own terms. However, this does mean that the world becomes fragmented into a number of site-specific vocabularies that cannot be machine-reasoned over. The underlying reason for this is that that is in fact the way the world actually is at the moment, and there's not a lot of help for it.
>  
> -------------------------------------------------------------
>  
> There are two or three approaches to using a standard vocabulary when your own data does not quite match it.
>  
> You can use the standard term that is *closest in meaning* to your own term. The difficulty here is that if the meaning of the standard term implies things that are not true of your data, using it  means that you are asserting things that are in fact not true, and for that reason I suggest that it's not the way to go.
>  
> You can use the standard term whose definition encompasses your term. The difficulty here is that some vocabularies (notably Taxon Concept Schema) don't have "other" or "unspecified" values for their enumerations - they are not exhaustive.
>  
> In either of these cases, you will want to supplement the standard term with another value specific to your own data set, whose definition you make available. There are a few ways to do that.
>  
> You can use the "define your own term" mechanism and assert both
>   _:_ tdwg:has_relationship_type tdwg:is-subtaxon-of  .
>   _:_ tdwg:has_relationship_type my-voc:is-recently-declared-subtaxon-of  .
>  
> You can have a completely separate predicate:
>   _:_ tdwg:has_relationship_type tdwg:is-subtaxon-of  .
>   _:_ myvoc:has_relationship_type my-voc:is-recently-declared-subtaxon-of  .
>  
> You can also be terribly clever and declare your own predicate to be a super-property of the TDWG predicate, one whose range is a union. This isn't terribly useful to people using your data unless the tdwg triple is also asserted.
>  
> Another alternative is to create an OWL rule that says 
> "if a thing has relationship-type my-voc:is-recently-declared-subtaxon-of, then it also has relationship-type tdwg:is-subtaxon-of"
>  
> But this creates a performance hit.
>  
> -------------------------------------------------------------
>  
> That little discussion aside, my main concern is that you don't get mired in attempting to exhaustively list all the different island types (etc) as part of the vocabulary that you are creating. It's a never-ending job. It might be an idea to have the design guideline that no enumeration class defined by the vocabulary shall have more than 10 values. It's arbitrary, but it will keep people from being carried away subdividing types into a hierarchy that they think is a good idea, but which doesn't match the data people already have.
>  
> I'd also suggest that that every enumeration (ie, ist of individuals) include two special values:
>  
> NOT_SPECIFIED. This value is not present in the source, underlying data. It isn't in the database, the respondent didn't fill out the form fully. Perhaps "NULL" might be a better name - assuming people at this level know what it means.
> OTHER. This means the value is some specific value, but it's not covered in the TDWG list. I am not sure if this value should be explicitly used if you are publishing your own vocabulary and using terms from that. I'm inclined to say it should not be, because doing that would result in two values for predicates that naturally should be functional.
>  
> These special values *can* be done as a single instance, which means you could easily pull all "not specifieds" out of a dataset, but that means that either the ranges would have to be declared as a union, which is messy, or the individuals would have to be declared as having all possible types, which would break disjoint class declarations.
>  
> If you have received this transmission in error please notify us immediately by return e-mail and delete all copies. If this e-mail or any attachments have been sent to you in error, that error does not constitute waiver of any confidentiality, privilege or copyright in respect of information in the e-mail or attachments. Please consider the environment before printing this email.
> 
> _______________________________________________
> tdwg-tag mailing list
> tdwg-tag at lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg-tag

If you have received this transmission in error please notify us immediately by return e-mail and delete all copies. If this e-mail or any attachments have been sent to you in error, that error does not constitute waiver of any confidentiality, privilege or copyright in respect of information in the e-mail or attachments.

Please consider the environment before printing this email.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.tdwg.org/pipermail/tdwg-tag/attachments/20121102/d8243414/attachment-0001.html