Re: [tdwg-content] A plea around basisOfRecord (Was: Proposed new Darwin Core terms - abundance, abundanceAsPercent)

14 Oct 2013

Rod,
http://code.google.com/p/tdwg-rdf/wiki/BiodiversityOntologies which has 
been online since last March.
Steve

Roderic Page wrote:
...
...
What might help is a way to visualise the TDWG LSID ontology in terms 
of the interconnections between the different classes. I'm not aware 
of such a visualisation (nor of an equivalent one for the Darwin Core 
classes).
In any event, it seems odd to have two distinct ontologies that are 
both in use, and which overlap so significantly.
Regards
Rod
On 13 Oct 2013, at 16:12, Donald Hobern [GBIF] wrote:
...
It’s been a couple of weeks but I said I’d try to write something 
about a more general concern I have around the way we use 
basisOfRecord and dcterms:type to hold values like occurrence, event 
and materialSample.  This is something that has concerned me for 
years and that, I worry, is making everything we all do much messier 
than it need be.
I believe that the way we have come to use Darwin Core basisOfRecord 
is confused and unhelpful.  I really wish we used Darwin Core like this:
1.       basisOfRecord should be used ONLY to indicate the type of 
evidence that lies behind a record – a key aspect of whether the 
record is likely to be useful for different purposes
2.       basisOfRecord values should be taken from a hierarchical 
vocabulary with three main branches:
a.       “specimens” (i.e. biological material that can be reviewed), 
with a hierarchy of subordinate values such as “pinnedSpecimen”, 
“herbariumSheet”, etc.
b.      derived, non-biological evidence (not sure what name), with a 
hierarchy of subordinate values such as “dnaSequence”, 
“soundRecording”, “stillImage”, etc.
c.       asserted observations with no revisitable evidence other 
than the authority of the observer
3.       TDWG should deliver a basic ontology in the form of a graph 
of key relationships between the most significant conceptual entities 
in our world (TaxonName, TaxonConcept, Identification, Collection, 
Specimen, Locality, Agent, …)
4.       This ontology should not attempt to map all the complexity 
of biodiversity-related data – just provide the high-level map and 
key relationships (TaxonConcept hasName TaxonName, Specimen heldIn 
Collection, etc.) – it should leave definition of other properties as 
a separate, open-ended activity for the community
5.       This ontology should be reviewed at regular intervals and 
versioned as necessary to address critical gaps – provided that 
backwards compatibility is maintained (splitting a class into 
multiple constituent classes probably won’t break anything, so start 
simple)
6.       The Darwin Core vocabulary should be published as a flat, 
open-ended list of terms with clear definitions that can be freely 
combined as columns in denormalised records
7.       Every Darwin Core term should be documented to be tightly 
associated with a single, fixed class in the ontology (e.g. 
scientificName and specificEpithet are ALWAYS considered to be 
properties of a TaxonName whether or not that TaxonName object is 
clearly referenced or separated out)
8.       Every data publisher should be encouraged to share all 
relevant data elements in their source data in the most convenient 
normalised or denormalised form, provided they use the recognised 
Darwin Core properties for elements that match the definition for 
those terms, and provided they give some metadata for other 
elements.  Possible forms include:
a.       A completely hierarchical, ABCD-like, XML representation
b.      A completely flat denormalised, simple-DwC-like, CSV 
representation, if the data includes no elements with higher cardinality
c.       A set of flat, relational, CSV representations, as with 
Darwin Core Archive star schemas, but with freedom to have more 
complex graphed relationships as needed
9.       Each table of CSV data in 8b and 8c is a view that 
corresponds to a linear subgraph of the TDWG ontology, identified by 
the classes of the DwC properties used – this allows us to infer the 
“shape” of the data in terms of the ontology
10.   If we do this, we do not need to worry about whether a record 
is a checklist record, an event, an occurrence, a material sample or 
whatever else, although we could use the dcterms:type property, or 
some new property, to hold this detail as a further clue to intent 
and possible use for the record
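
As a rough illustration of point 2, the hierarchical vocabulary could be held as a simple child-to-parent map. This is only a sketch: "derivedEvidence" and "assertedObservation" are placeholder branch names (no names are proposed above for branches b and c), and the helper functions `lineage` and `has_evidence_type` are hypothetical.

```python
# Hypothetical child -> parent map for a hierarchical basisOfRecord vocabulary.
# "specimens" and the leaf values come from the proposal; "derivedEvidence" and
# "assertedObservation" are placeholder branch names assumed for illustration.
PARENT = {
    "specimens": None,
    "pinnedSpecimen": "specimens",
    "herbariumSheet": "specimens",
    "derivedEvidence": None,
    "dnaSequence": "derivedEvidence",
    "soundRecording": "derivedEvidence",
    "stillImage": "derivedEvidence",
    "assertedObservation": None,
}


def lineage(value):
    """Return the value followed by its ancestors, most specific first."""
    if value not in PARENT:
        raise KeyError(f"unknown basisOfRecord value: {value}")
    chain = []
    while value is not None:
        chain.append(value)
        value = PARENT[value]
    return chain


def has_evidence_type(value, branch):
    """True if `value` is `branch` itself or one of its descendants."""
    return branch in lineage(value)
```

A consumer that only cares whether reviewable biological material exists could then call `has_evidence_type(record_value, "specimens")` without knowing every subordinate value.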
Here is an example.  In today’s terms, what sort of DwC record is 
this?  Do I really have to replace “recordId” with “eventId”, 
“occurrenceId” or similar? And which should I choose?
*recordId, decimalLatitude, decimalLongitude, coordinatePrecision, 
eventDate, scientificName, individualCount*
I think it is clear that this record tells us that there was a 
recording event at a particular time and place where someone or some 
process recorded a given number of individual organisms which were 
identified as representatives of a taxon concept with a name 
corresponding to the supplied scientific name.  In other words this 
gives us some properties from a subgraph that might include, say, 
instances of TDWG Event, Locality, Date, Occurrence, Identification, 
TaxonConcept and TaxonName classes. None of these is specifically 
referenced but we can unambiguously fold the flat record onto the 
ontology.  We can moreover then use the combination of supplied 
elements to decide whether this record would be of interest to GBIF, 
a national information facility, a tool cataloguing uses of 
scientific names, etc.  The same will also apply if multiple CSV 
tables are provided as in 8c.
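
The folding step described above can be sketched in a few lines. The term-to-class assignments below are assumptions for illustration (only the link from scientificName and specificEpithet to TaxonName is stated in point 7), and the names `TERM_CLASS`, `shape_of` and `interests` are hypothetical. recordId is deliberately left unassigned, since which class it identifies is exactly the open question.

```python
# Assumed term -> class assignments; not an agreed TDWG mapping.
TERM_CLASS = {
    "decimalLatitude": "Locality",
    "decimalLongitude": "Locality",
    "coordinatePrecision": "Locality",
    "eventDate": "Event",
    "scientificName": "TaxonName",
    "specificEpithet": "TaxonName",
    "individualCount": "Occurrence",
}


def shape_of(columns):
    """Group a flat record's columns by the ontology class of each term.

    Columns with no known class are collected under the key None.
    """
    shape = {}
    for column in columns:
        shape.setdefault(TERM_CLASS.get(column), []).append(column)
    return shape


def interests(shape, required_classes):
    """True if the record touches every class a consumer says it needs."""
    return set(required_classes) <= set(shape)


header = ["recordId", "decimalLatitude", "decimalLongitude",
          "coordinatePrecision", "eventDate", "scientificName",
          "individualCount"]
shape = shape_of(header)
```

A consumer interested in dated, georeferenced, named records might then test `interests(shape, {"Locality", "Event", "TaxonName"})`, while a tool cataloguing uses of scientific names would only require `{"TaxonName"}`.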
I have thought about this for a long time and cannot yet think of an 
area in which this would not work efficiently – and unambiguously – 
for all concerned.  There are some cases where multiple instances of 
the same ontology class would be referenced within a single record, 
which may mean more care is needed by the publisher (e.g. if an 
insect specimen record includes a reference to a host plant). There 
may be cases where automated review of the data indicates that there 
are impossible combinations or ambiguities that the publisher must 
resolve.  However, I believe we could use this approach to generalise 
all mobilisation and consumption of biodiversity data (including all 
the things we have addressed under ABCD, SDD, TCS, Plinian Core, 
etc.) and to make it genuinely possible for any data holder to share 
all the data they have in a form that makes sense to them, while 
allowing others to consume these data intelligently.
Right now, I think our confused use of basisOfRecord is almost the 
only thing that stops us from exploring this.  We have blurred the 
question of the evidence for a record with the question of the 
“shape” of the record as a subgraph.  These are different things.  
Separating them will allow us to get away from some of our 
unresolvable debates and open up the doors to much simpler data 
sharing and reuse.
Thanks,
Donald
----------------------------------------------------------------------
Donald Hobern - GBIF Director - dhobern@gbif.org
Global Biodiversity Information Facility http://www.gbif.org/
GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark
Tel: +45 3532 1471  Mob: +45 2875 1471  Fax: +45 2875 1480
----------------------------------------------------------------------
---------------------------------------------------------
Roderic Page
Professor of Taxonomy
Institute of Biodiversity, Animal Health and Comparative Medicine
College of Medical, Veterinary and Life Sciences
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK
Email:  r.page@bio.gla.ac.uk
Tel:  +44 141 330 4778
Fax:  +44 141 330 2792
Skype:  rdmpage
Facebook:  http://www.facebook.com/rdmpage
LinkedIn:  http://uk.linkedin.com/in/rdmpage
Twitter:  http://twitter.com/rdmpage
Blog:  http://iphylo.blogspot.com
Home page:  http://taxonomy.zoology.gla.ac.uk/rod/rod.html
Wikipedia:  http://en.wikipedia.org/wiki/Roderic_D._M._Page
Citations:  http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ
ORCID:  http://orcid.org/0000-0002-7101-9767
-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
PMB 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 322-4942
If you fax, please phone or email so that I will know to look for it.
http://bioimages.vanderbilt.edu

Steve Baskauf