New subject: A plea around basisOfRecord (Was: Proposed new Darwin Core terms - abundance, abundanceAsPercent)

13 Oct 2013

It's been a couple of weeks, but I said I'd try to write something about a
more general concern I have around the way we use basisOfRecord and
dcterms:type to hold values like occurrence, event and materialSample.  This
is something that has concerned me for years and that, I worry, is making
everything we all do much messier than it need be.

I believe that the way we have come to use Darwin Core basisOfRecord is
confused and unhelpful.  I really wish we used Darwin Core like this:

1.       basisOfRecord should be used ONLY to indicate the type of evidence
that lies behind a record, a key aspect of whether the record is likely to
be useful for different purposes

2.       basisOfRecord values should be taken from a hierarchical vocabulary
with three main branches:

a.       specimens (i.e. biological material that can be reviewed), with a
hierarchy of subordinate values such as pinnedSpecimen, herbariumSheet,
etc.

b.      derived, non-biological evidence (not sure what name), with a
hierarchy of subordinate values such as dnaSequence, soundRecording,
stillImage, etc.

c.       asserted observations with no revisitable evidence other than the
authority of the observer

3.       TDWG should deliver a basic ontology in the form of a graph of key
relationships between the most significant conceptual entities in our world
(TaxonName, TaxonConcept, Identification, Collection, Specimen, Locality,
Agent, ...)

4.       This ontology should not attempt to map all the complexity of
biodiversity-related data, just provide the high-level map and key
relationships (TaxonConcept hasName TaxonName, Specimen heldIn Collection,
etc.); it should leave definition of other properties as a separate,
open-ended activity for the community

5.       This ontology should be reviewed at regular intervals and versioned
as necessary to address critical gaps, provided that backwards
compatibility is maintained (splitting a class into multiple constituent
classes probably won't break anything, so start simple)

6.       The Darwin Core vocabulary should be published as a flat,
open-ended list of terms with clear definitions that can be freely combined
as columns in denormalised records

7.       Every Darwin Core term should be documented to be tightly
associated with a single, fixed class in the ontology (e.g. scientificName
and specificEpithet are ALWAYS considered to be properties of a TaxonName
whether or not that TaxonName object is clearly referenced or separated out)

8.       Every data publisher should be encouraged to share all relevant
data elements in their source data in the most convenient normalised or
denormalised form, provided they use the recognised Darwin Core properties
for elements that match the definition for those terms, and provided they
give some metadata for other elements.  Possible forms include:

a.       A completely hierarchical, ABCD-like, XML representation

b.      A completely flat denormalised, simple-DwC-like, CSV representation,
if the data includes no elements with higher cardinality

c.       A set of flat, relational, CSV representations, as with Darwin Core
Archive star schemas, but with freedom to have more complex graphed
relationships as needed

9.       Each table of CSV data in 8b and 8c is a view that corresponds to a
linear subgraph of the TDWG ontology, identified by the classes of the DwC
properties used; this allows us to infer the shape of the data in terms
of the ontology

10.   If we do this, we do not need to worry about whether a record is a
checklist record, an event, an occurrence, a material sample or whatever
else, although we could use the dcterms:type property, or some new
property, to hold this detail as a further clue to intent and possible use
for the record
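
As a rough illustration of points 1 and 2, the evidence vocabulary could be held as a small nested structure. This is only a sketch: the branch labels `specimen`, `derivedEvidence` and `assertedObservation` are placeholder names (the second branch is explicitly unnamed above), and the helper functions `path_to` and `evidence_branch` are hypothetical. Only the leaf values are taken from the list above.

```python
# Hypothetical three-branch basisOfRecord vocabulary.
# Branch labels are placeholders; leaf values come from point 2.
VOCABULARY = {
    "specimen": {
        "pinnedSpecimen": {},
        "herbariumSheet": {},
    },
    "derivedEvidence": {
        "dnaSequence": {},
        "soundRecording": {},
        "stillImage": {},
    },
    "assertedObservation": {},
}


def path_to(value, tree=VOCABULARY, prefix=()):
    """Return the path from a top-level branch down to value, or None."""
    for name, children in tree.items():
        path = prefix + (name,)
        if name == value:
            return path
        found = path_to(value, children, path)
        if found is not None:
            return found
    return None


def evidence_branch(value):
    """Return the top-level evidence branch for a value, or None if unknown."""
    path = path_to(value)
    return path[0] if path else None
```

With a structure like this, a consumer could ask only the evidence question (for example, whether a record is backed by reviewable material) without basisOfRecord also having to say what kind of record it is. A value such as `event` would simply not appear in the vocabulary.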

Here is an example.  In today's terms, what sort of DwC record is this?  Do
I really have to replace recordId with eventId, occurrenceId or
similar? And which should I choose?

recordId, decimalLatitude, decimalLongitude, coordinatePrecision, eventDate,
scientificName, individualCount

I think it is clear that this record tells us that there was a recording
event at a particular time and place where someone or some process recorded
a given number of individual organisms which were identified as
representatives of a taxon concept with a name corresponding to the supplied
scientific name.  In other words this gives us some properties from a
subgraph that might include, say, instances of TDWG Event, Locality, Date,
Occurrence, Identification, TaxonConcept and TaxonName classes. None of
these is specifically referenced but we can unambiguously fold the flat
record onto the ontology.  We can moreover then use the combination of
supplied elements to decide whether this record would be of interest to
GBIF, a national information facility, a tool cataloguing uses of scientific
names, etc.  The same will also apply if multiple CSV tables are provided as
in 8c.
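
The folding described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the term-to-class assignments in `TERM_CLASS`, the relationships in `EDGES` (only the TaxonConcept/TaxonName link is stated in this message), and the function names. No such ontology has been published. The idea is that each column names a class directly, and the classes on the paths between them are implied.

```python
from collections import deque

# Assumed term-to-class assignments (point 7); illustrative only.
TERM_CLASS = {
    "decimalLatitude": "Locality",
    "decimalLongitude": "Locality",
    "coordinatePrecision": "Locality",
    "eventDate": "Date",
    "individualCount": "Occurrence",
    "scientificName": "TaxonName",
}

# Assumed high-level relationships, treated as undirected for path finding.
EDGES = [
    ("Event", "Locality"),
    ("Event", "Date"),
    ("Occurrence", "Event"),
    ("Identification", "Occurrence"),
    ("Identification", "TaxonConcept"),
    ("TaxonConcept", "TaxonName"),
]


def neighbours(node):
    """Yield every class directly linked to node."""
    for a, b in EDGES:
        if a == node:
            yield b
        elif b == node:
            yield a


def shortest_path(start, goal):
    """Breadth-first search for the shortest chain of classes."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def infer_shape(columns):
    """Return the classes a flat record touches, including implied ones."""
    direct = {TERM_CLASS[c] for c in columns if c in TERM_CLASS}
    shape = set(direct)
    anchor = sorted(direct)[0] if direct else None
    for cls in direct:
        path = shortest_path(anchor, cls)
        if path:
            shape.update(path)
    return shape


record_columns = [
    "recordId", "decimalLatitude", "decimalLongitude",
    "coordinatePrecision", "eventDate", "scientificName", "individualCount",
]
shape = infer_shape(record_columns)
```

Under these assumed mappings, the example record folds onto Event, Locality, Date, Occurrence, Identification, TaxonConcept and TaxonName, even though Event, Identification and TaxonConcept are never referenced by any column. A consumer could then test the resulting set of classes to decide whether the record is of interest to it.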

I have thought about this for a long time and cannot yet think of an area in
which this would not work efficiently, and unambiguously, for all
concerned.  There are some cases where multiple instances of the same
ontology class would be referenced within a single record, which may mean
more care is needed by the publisher (e.g. if an insect specimen record
includes a reference to a host plant). There may be cases where automated
review of the data indicates that there are impossible combinations or
ambiguities that the publisher must resolve.  However, I believe we could use
this approach to generalise all mobilisation and consumption of biodiversity
data (including all the things we have addressed under ABCD, SDD, TCS,
Plinian Core, etc.) and to make it genuinely possible for any data holder to
share all the data they have in a form that makes sense to them, while
allowing others to consume these data intelligently.

Right now, I think our confused use of basisOfRecord is almost the only
thing that stops us from exploring this.  We have blurred the question of
the evidence for a record with the question of the shape of the record as
a subgraph.  These are different things.  Separating them will allow us to
get away from some of our unresolvable debates and open up the doors to much
simpler data sharing and reuse.

Thanks, 

Donald 

----------------------------------------------------------------------

Donald Hobern - GBIF Director - dhobern@gbif.org

Global Biodiversity Information Facility - http://www.gbif.org/

GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark

Tel: +45 3532 1471  Mob: +45 2875 1471  Fax: +45 2875 1480

----------------------------------------------------------------------
