[tdwg-tag] Creating a TDWG standard for documenting Data Standards [SEC=UNCLASSIFIED]

Wed Mar 7 02:52:18 CET 2012

On 07/03/2012, at 3:11 AM, Steve Baskauf wrote:

> Dag and Éamonn,
> 
> In the context of the discussion which has been going on in the TDWG RDF mailing list, I have been thinking more about the issue of how to deal with DwC terms which state "Recommended best practice is to use a controlled vocabulary...". That would be dcterms:type, dwc:language, dwc:basisOfRecord, dwc:sex, dwc:lifeStage, dwc:reproductiveCondition, dwc:behavior, dwc:establishmentMeans, dwc:occurrenceStatus, dwc:disposition, dwc:continent, dwc:waterBody, dwc:islandGroup, dwc:island, dwc:country, dwc:verbatimCoordinateSystem, dwc:georeferenceVerificationStatus, dwc:identificationVerificationStatus, dwc:taxonRank, dwc:nomenclaturalCode, dwc:taxonomicStatus, dwc:relationshipOfResource, and dwc:measurementType.

We here have had all sorts of problems using other people's vocabularies - they never quite match the data we have. Our solution has been to use the standard terms where possible, but to mint our own where needed. We create RDF objects and declare them as being of the correct type.

For instance, 
	http://biodiversity.org.au/voc/afd/AFD#RelationshipTypeTerm

Is declared to be a subclass of
	http://rs.tdwg.org/ontology/voc/TaxonConcept#TaxonRelationshipTerm

And we have a few specific items of that type:
    http://biodiversity.org.au/voc/afd/RelationshipTypeTerm#has-emendation
    http://biodiversity.org.au/voc/afd/RelationshipTypeTerm#has-invalid-name
    http://biodiversity.org.au/voc/afd/RelationshipTypeTerm#has-junior-homonym
    http://biodiversity.org.au/voc/afd/RelationshipTypeTerm#has-miscellaneous-literature-name

These individuals are therefore correctly typed, so they can legitimately be used as a TDWG relationshipCategory.
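
The declarations above could be written in Turtle roughly as follows. This is a sketch only: the prefix names afd, afdterm and tc are chosen here for illustration, and the actual AFD vocabulary file may state these axioms differently.

```turtle
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix afd:     <http://biodiversity.org.au/voc/afd/AFD#> .
@prefix afdterm: <http://biodiversity.org.au/voc/afd/RelationshipTypeTerm#> .
@prefix tc:      <http://rs.tdwg.org/ontology/voc/TaxonConcept#> .

# The local class is declared to be a subclass of the TDWG class ...
afd:RelationshipTypeTerm rdfs:subClassOf tc:TaxonRelationshipTerm .

# ... so, under RDFS inference, each locally minted individual
# is also a tc:TaxonRelationshipTerm.
afdterm:has-emendation                    a afd:RelationshipTypeTerm .
afdterm:has-invalid-name                  a afd:RelationshipTypeTerm .
afdterm:has-junior-homonym                a afd:RelationshipTypeTerm .
afdterm:has-miscellaneous-literature-name a afd:RelationshipTypeTerm .
```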

Your list of dwc:disposition values does not need to be exhaustive. It's legitimate (from a machine point of view) for a site to create its own terms. However, this does mean that the world becomes fragmented into a number of site-specific vocabularies that cannot be machine-reasoned over. The underlying reason is that this is in fact how the world is at the moment, and there's not a lot of help for it.

-------------------------------------------------------------

There are two or three approaches to using a standard vocabulary when your own data does not quite match it.

You can use the standard term that is *closest in meaning* to your own term. The difficulty here is that if the standard term implies things that are not true of your data, then by using it you are asserting things that are in fact not true. For that reason I suggest that it's not the way to go.

You can use the standard term whose definition encompasses your term. The difficulty here is that some vocabularies (notably Taxon Concept Schema) don't have "other" or "unspecified" values for their enumerations - they are not exhaustive.

In either of these cases, you will want to supplement the standard term with another value specific to your own data set, whose definition you make available. There are a few ways to do that.

You can use the "define your own term" mechanism and assert both
  _:_ tdwg:has_relationship_type tdwg:is-subtaxon-of  .
  _:_ tdwg:has_relationship_type my-voc:is-recently-declared-subtaxon-of  .

You can have a completely separate predicate:
  _:_ tdwg:has_relationship_type tdwg:is-subtaxon-of  .
  _:_ my-voc:has_relationship_type my-voc:is-recently-declared-subtaxon-of  .

You can also be terribly clever and declare your own predicate to be a super-property of the TDWG predicate, one whose range is a union. This isn't terribly useful to people using your data unless the tdwg triple is also asserted.
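
A sketch of what that super-property declaration might look like. The two namespace URIs are placeholders, and the class names tdwg:RelationshipTerm and my-voc:RelationshipTerm are hypothetical stand-ins for whatever classes the two vocabularies actually define:

```turtle
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix tdwg:   <http://example.org/tdwg#> .     # placeholder namespace
@prefix my-voc: <http://example.org/my-voc#> .   # placeholder namespace

# The TDWG predicate is a sub-property of the local one ...
tdwg:has_relationship_type rdfs:subPropertyOf my-voc:has_relationship_type .

# ... and the local predicate accepts terms from either vocabulary.
my-voc:has_relationship_type
    a owl:ObjectProperty ;
    rdfs:range [
        a owl:Class ;
        owl:unionOf ( tdwg:RelationshipTerm my-voc:RelationshipTerm )
    ] .
```

Inference only runs from the sub-property up to the super-property, so a my-voc triple on its own never yields the tdwg triple. That is why the tdwg triple still has to be asserted.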

Another alternative is to create an OWL rule that says 
"if a thing has relationship-type my-voc:is-recently-declared-subtaxon-of, then it also has relationship-type tdwg:is-subtaxon-of"

But this creates a performance hit.
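
One way such a rule could be stated in OWL is as a subclass axiom between two hasValue restrictions. This is a sketch under assumptions, not a recommendation of a particular encoding; the namespace URIs are placeholders:

```turtle
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix tdwg:   <http://example.org/tdwg#> .     # placeholder namespace
@prefix my-voc: <http://example.org/my-voc#> .   # placeholder namespace

# Anything with the local relationship type ...
[ a owl:Restriction ;
  owl:onProperty tdwg:has_relationship_type ;
  owl:hasValue   my-voc:is-recently-declared-subtaxon-of ]
    rdfs:subClassOf
# ... also has the standard TDWG relationship type.
[ a owl:Restriction ;
  owl:onProperty tdwg:has_relationship_type ;
  owl:hasValue   tdwg:is-subtaxon-of ] .
```

The tdwg triple then exists only once a reasoner has been run over the data, which is where the cost comes from.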

-------------------------------------------------------------

That little discussion aside, my main concern is that you don't get mired in attempting to exhaustively list all the different island types (etc.) as part of the vocabulary that you are creating. It's a never-ending job. It might be an idea to have the design guideline that no enumeration class defined by the vocabulary shall have more than 10 values. It's arbitrary, but it will keep people from getting carried away subdividing types into a hierarchy that they think is a good idea, but which doesn't match the data people already have.

I'd also suggest that every enumeration (i.e., list of individuals) include two special values:

NOT_SPECIFIED. This value is not present in the underlying source data: it isn't in the database, or the respondent didn't fill out the form fully. Perhaps "NULL" might be a better name - assuming people at this level know what it means.
OTHER. This means the value is some specific value, but it's not covered in the TDWG list. I am not sure if this value should be explicitly used if you are publishing your own vocabulary and using terms from that. I'm inclined to say it should not be, because doing that would result in two values for predicates that naturally should be functional.

These special values *can* be done as a single instance, which means you could easily pull all "not specifieds" out of a dataset, but that means that either the ranges would have to be declared as a union, which is messy, or the individuals would have to be declared as having all possible types, which would break disjoint class declarations.
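
The alternative to a single shared instance is a separate pair of special values per enumeration, sketched below. The ex: namespace and every class and individual name here are hypothetical, made up purely for illustration:

```turtle
@prefix ex: <http://example.org/voc#> .   # placeholder namespace

# One enumeration class with its own special values.
ex:sex-NOT_SPECIFIED a ex:SexTerm .
ex:sex-OTHER         a ex:SexTerm .

# A second enumeration gets its own pair, so each predicate's range
# stays a single class and disjointness declarations remain intact.
ex:lifeStage-NOT_SPECIFIED a ex:LifeStageTerm .
ex:lifeStage-OTHER         a ex:LifeStageTerm .
```

The trade-off is the one noted above: pulling every "not specified" out of a dataset now means matching one individual per enumeration instead of a single shared one.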
