[tdwg-content] tdwg-content Digest, Vol 63, Issue 15

Ramona Walls rlwalls2008 at gmail.com
Tue Sep 2 04:44:42 CEST 2014


Hi Markus,

Much of the "muddle" that I was referring to has been exposed in a series
of recent hackathons (to which we have tried to always invite someone from
GBIF), and has not yet been published, although [1,2,3] touch on the issue.
I will try to describe some of the problems we have encountered here.

For those of you who are tired of this thread already, the short answer is
that the problem lies in the lack of underlying clarity and semantics in
Darwin Core, not DwC-A. I think Darwin Core archives do a pretty good job
of capturing the content of Darwin Core terminology, but as those terms are
often (sometimes intentionally) ambiguous and imprecise, those aspects are
captured too.

Below are a few of the specific issues. I don't think much of this will be
new to the readers of this list.

1. Ambiguously defined classes: Darwin Core only has a handful of classes
(Occurrence, Event, Location, etc.), and those are intended to be used as
grouping classes, rather than strictly defined entities. Therefore, if
someone uses the corresponding properties (occurrenceID, eventID,
locationID), they are more or less free to attach those IDs to whatever
they want. In many cases, the type of entity referred to by occurrenceID can
be inferred from the object of basisOfRecord, but not always, and it is
still an inference with some error. *IF* DwC does go in the direction of
capturing sample/survey data, one thing that would help would be to have it
specified via a set of controlled vocabulary terms in basisOfRecord. I
won't even start with Taxon.

2. Domain- and range-less properties: DwC properties (the bulk of the terms
in DwC) are intentionally left without domains and ranges to make them
maximally re-usable. For traditional museum specimen data, this works
reasonably well, and the intended meaning is often clear. We recently
completed the exercise of specifying domains and ranges for the whole DwC
vocabulary, as interpreted for an Occurrence Core archive, and found that
many of them do not refer to DwC classes as their domain. This makes it
impossible to infer their meaning without a set of external assumptions.
Our interpreted ranges of DwC properties are a mix of data values
(literals) and potential classes in other ontologies. (I will share this
mapping on request - it is still under review.) The proposed properties
also have no domains or ranges. For example, the suggested range of
'quantity type' is a heterogeneous mix of entities (individuals, biomass,
%species, scale type) that sets off a huge red flag to my ontological self.
And what about the domain of 'quantity type' and 'quantity'? As they are in
the Occurrence extension table, they probably are meant to describe whatever
is represented by occurrenceID, but this is not specified.

Vague domains and ranges may work for simple sampling schemas that could be
interpreted as "taxon W (value of occurrenceID) has quantity X (value of
quantity) with unit Y (value of samplingUnit) of quantity type Z (value of
quantityType)", but even then it is likely to be ambiguous. Now suppose
W=123someID (with taxonName as Bufo bufo), X=9, Y=individuals, and Z is
left blank, because that is just what people do sometimes. Let's be
generous and suppose that there is also some sampling geometry and location
information. How to interpret this? Does this mean that there is a jar in a
museum with 9 toads in it, all of which came from a single collecting event
at that location? Does it mean that there was a survey done of the
location described, and 9 toads were counted but not collected? Does it
mean that they surveyed the area and estimated toad abundance at 9 toads
m^-2? Even if they are conscientious and put in m^2 as the sampling unit,
the meaning is still unclear. Did they measure an area of 1 m^2 and find 9
toads, or an area of 90 m^2 and find 90 toads, or did they measure an area
of 90 m^2 and find 9 toads? All of these would be valid interpretations. Of
course, many sampling schemas are much more complex than this, with nested
sampling protocols or repeat sampling of the same area. There is just no way
to capture that without more semantics than DwC can provide.

3. Using the wrong property for information: Even if the larger community
interprets a DwC property with some certainty, that doesn't stop people
from using it wrong. This is often not the fault of DwC or DwC archives, if
people don't take the time to make sure they are filling out the fields
properly. Even with very clearly defined ontology terms, some curators just
don't take the time to read and understand the definitions. However, the
ambiguity and vagueness of DwC terms surely contributes to this.
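
To make the toad example in point 2 concrete, here is a minimal Python
sketch. The class names and the two flattening conventions are hypothetical,
made up for illustration; only quantity and samplingUnit come from the
proposal. It shows two different field observations collapsing into the same
flat row, because the row has nowhere to record the area that was actually
sampled.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    """What happened in the field (hypothetical structure, not a DwC class)."""
    toads_counted: int
    area_sampled_m2: float


class FlatRow(NamedTuple):
    """The two proposed fields as they would appear in a flat table."""
    quantity: float
    samplingUnit: str


def flatten_as_density(obs: Observation) -> FlatRow:
    # Assumed convention A: quantity is a density (individuals per m^2).
    return FlatRow(obs.toads_counted / obs.area_sampled_m2, "m^2")


def flatten_as_count(obs: Observation) -> FlatRow:
    # Assumed convention B: quantity is the raw count and samplingUnit
    # only names the unit in which the sampled area was measured.
    return FlatRow(float(obs.toads_counted), "m^2")


dense_plot = Observation(toads_counted=9, area_sampled_m2=1.0)
sparse_plot = Observation(toads_counted=9, area_sampled_m2=90.0)

row_a = flatten_as_density(dense_plot)   # provider following convention A
row_b = flatten_as_count(sparse_plot)    # provider following convention B

# The two rows are indistinguishable, yet the true densities differ 90-fold.
print(row_a == row_b)
print(dense_plot.toads_counted / dense_plot.area_sampled_m2,
      sparse_plot.toads_counted / sparse_plot.area_sampled_m2)
```

A consumer who receives only the flat row cannot tell which convention the
provider followed, which is the sense in which I call the row ambiguous.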
-----
I think that sampling or survey data is just too structured to describe well
with a flat schema like DwC. That is why most of it to date has been stored
in relational databases. To repeat myself, I think it would make more sense
to first develop a sound semantic model for the data then extract a Darwin
Core archive format for transmitting it, rather than build the DwC archive
first and later try to map it to a semantic model.
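
As a rough illustration of the direction I mean (model first, flat export
second), here is a small Python sketch. Everything in it is an assumption
made for the example: the SamplingEvent and TaxonCount classes are not BCO or
DwC classes, and to_flat_rows is just one possible export convention. The
column names eventID and scientificName are existing DwC terms; quantity,
quantityType and samplingUnit are the proposed ones.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaxonCount:
    """Hypothetical model class: one taxon tallied during an event."""
    taxon_name: str
    individuals: int
    collected: bool  # True if specimens were taken, False if only observed


@dataclass
class SamplingEvent:
    """Hypothetical model class: the sampled area is stated explicitly."""
    event_id: str
    area_sampled_m2: float
    counts: List[TaxonCount] = field(default_factory=list)


def to_flat_rows(event: SamplingEvent) -> List[Dict[str, object]]:
    """Derive flat rows from the model using one fixed, documented convention."""
    rows = []
    for c in event.counts:
        rows.append({
            "eventID": event.event_id,
            "scientificName": c.taxon_name,
            # Convention chosen here: always export a density per m^2.
            "quantity": c.individuals / event.area_sampled_m2,
            "quantityType": "individuals",
            "samplingUnit": "m^2",
        })
    return rows


event = SamplingEvent(
    event_id="survey-1",
    area_sampled_m2=90.0,
    counts=[TaxonCount("Bufo bufo", 9, collected=False)],
)
rows = to_flat_rows(event)
print(rows[0])
```

Going from the model to the flat rows is a simple, lossy projection. Going
the other way would require guessing the area sampled and whether anything
was collected, which is why I would rather not start from the flat format.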

Ramona

[1] http://www.biomedcentral.com/1471-2105/15/257
[2] http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0089606
[3] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3746421/

------------------------------------------------------
Ramona L. Walls, Ph.D.
Scientific Analyst, The iPlant Collaborative, University of Arizona
Research Associate, Bio5 Institute, University of Arizona
Laboratory Research Associate, New York Botanical Garden


On Sat, Aug 30, 2014 at 3:00 AM, <tdwg-content-request at lists.tdwg.org>
wrote:

> Message: 1
> Date: Fri, 29 Aug 2014 12:02:40 +0200
> From: Markus Döring <m.doering at mac.com>
> Subject: Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6
>         (sigh)
> To: Ramona Walls <rlwalls2008 at gmail.com>
> Cc: TDWG Content Mailing List <tdwg-content at lists.tdwg.org>,    John Deck
>         <jdeck at berkeley.edu>
> Message-ID: <A6B43FE0-A595-4526-BD13-EE196D950435 at mac.com>
> Content-Type: text/plain; charset="windows-1252"
>
> Dear Ramona,
>
> could you point me to the evidence for muddled semantic mappings in
> existing dwc archives? I would like to better understand the problem. Is it
> a general issue with Darwin Core terms and their loose definition or just
> their application in dwc archives?
>
> best,
> Markus
>
>
>
> On 29 Aug 2014, at 06:35, Ramona Walls <rlwalls2008 at gmail.com> wrote:
>
> > I also really do appreciate GBIF's pressing need to serve survey/sample
> data, and I don't have much of a problem with the idea of an Event core
> or even adding new terms to DwC (in principle). Rather, I am urging caution
> in how we proceed with it.
> >
> > Éamonn, in response to your statement "Once the BCO model is available
> for uptake, it should be possible to develop a mapping between it and the
> simple DwC sample model" I will respond: possible, maybe; easy, no way;
> unambiguous, probably impossible. The problem I foresee is that once the
> "simple DwC sample model" is in place, people will start using it to do all
> kinds of not so simple things, and the mapping will become muddled. We have
> ample evidence that this is the case with existing Darwin Core archives.
> >
> > Going back to the five new terms that Éamonn proposed, I would like to
> see if we can link them NOW to existing ontology terms (as others have
> proposed), thus making their semantics explicit from the start, but still
> allowing GBIF/EU-BON to proceed with the work they need to do. This will
> not prevent people from misusing terms, but may at least help make mapping
> easier later. In cases where the terms can't be mapped to an existing term,
> BCO curators would be willing to help develop a term or set of terms that
> can convey meaning required, or work with other ontology developers to get
> the terms added elsewhere.
> >
> > Trying to be constructive, I attempted to do a quick and dirty,
> preliminary mapping of the five terms (quantity, quantity type, sampling
> geometry, sampling unit, and event series ID), bearing in mind that I am
> not an authority on OBOE or OGC ontologies. [Aside: Based on what I know,
> OGC ontologies are not yet sufficiently developed to provide the semantics
> we need, but I would love for someone to show me otherwise.]
> >
> > A serious problem with mapping these terms to existing ontologies is
> that some of them do NOT map to a single ontology term (namely, quantity,
> quantity type, and sampling geometry). This is evidence that the proposed
> terms could indeed be interpreted in multiple ways and further supports the
> argument that it would not be easy to retrospectively add them to a
> semantic framework at some later date.
> >
> > I think there is a path forward that would allow for both the
> expressiveness of OBOE and other ontologies and convenience of standard
> exchange formats.
> >
> > Ramona
> > ------------------------------------------------------
> > Ramona L. Walls, Ph.D.
> > Scientific Analyst, The iPlant Collaborative, University of Arizona
> > Research Associate, Bio5 Institute, University of Arizona
> > Laboratory Research Associate, New York Botanical Garden
> >
> >
> > On Thu, Aug 28, 2014 at 8:58 AM, Robert Guralnick <
> Robert.Guralnick at colorado.edu> wrote:
> >
> >   Hi all --- Ok, I think the scope of the issue is quite clear. Let me
> summarize:  1)  As Éamonn and the rest of GBIF have made quite clear,  "GBIF
> is faced with the immediate task of making sample-based data discoverable
> and accessible using its current ecosystem of tools" given a funding
> mandate from EU-BON.  2)  The solution for this problem is to develop an
> Event-core and to promote new terms to the Darwin Core to make this
> happen.   I will note a small inconsistency here:  the current ecosystem
> of standards and tools is Darwin Core (as it stands) and publishing systems
> such as IPT.  That ecosystem of tools includes mechanisms to extend Darwin
> Core where needed, via extensions.  The current ecosystem of tools doesn't
> include new Cores or new DwC terms, does it?
> >
> >   So this leads in nicely to the contentious issue(s) and places where
> there seems to be discussions --- these have to do with the nature of the
> changes suggested and the scope of those changes, both in terms of an Event
> core and DwC term additions.  Leaving aside the Event-core for now, the key
> questions simply about term additions to the Darwin Core that seem to be at
> heart here are: 1)  Is the intent of the Darwin Core to model surveys,
> which usually involve multiple kinds and types of sampling over multiple
> sites using multiple methods?  2)  Is the solution to invent new terms for
> the Darwin Core? If there are already terms from other efforts, wouldn't we
> work with those existing efforts to assure interoperability?
> >
> >    I appreciate the efforts of GBIF here fully, and am personally torn
> because on the one hand, I fully agree with the goal of extending Darwin
> Core to better represent  richer biodiversity data. On the other hand, I
> worry about process here and how to make that happen in a way that isn't
> too hasty or locks us into just the opposite of what I think many of us
> want with regards to sharing data more broadly than within just one
> ecosystem of tools.
> >
> > Best, Rob
> >
> >
> >
> >
> >
> > On Thu, Aug 28, 2014 at 6:30 AM, John Deck <jdeck at berkeley.edu> wrote:
> > I see the rationale for enabling this in Darwin Core Archives and adding
> the new terms.  However, back to what Matt Jones brought up: "won't we just
> end up with a new syntax that does essentially what O&M and OBOE do now?".
> >
> > We should include explicit references to existing terms/definitions that
> encapsulate what we're talking about, e.g. in our MaterialSample proposal
> last year we linked to an existing term in OBI, which has a much richer
> description and context for MaterialSample than what we considered (
> https://code.google.com/p/darwincore/issues/detail?id=167)
> >
> > Have we explored the possibility of doing this with OBOE?  I'm not
> suggesting we adopt OBOE wholesale, but it seems like we have a good
> opportunity to enable better semantic linking with that effort.
> >
> > John
> >
> > On Thu, Aug 28, 2014 at 4:23 AM, Éamonn Ó Tuama [GBIF] <eotuama at gbif.org>
> wrote:
> > Thanks, Ramona and Rob.
> >
> > I'd like to add a few points following on Markus's reply.
> >
> > I think your pressing of the need for a robust semantic model for
> > biodiversity sample/survey data is incontestable - we do need one and it
> > should enable rich data integration once it is defined and the tools and
> > data standards to support it become available. However, GBIF is faced with
> > the immediate task of making sample-based data discoverable and accessible
> > using its current ecosystem of tools (IPT) and exchange standards (DwC;
> > EML). Waiting for a functional, implementable semantic model and the tools
> > and support services for it is just not an option for us right now.
> >
> > We have already spent considerable time in analysing the merits of
> > Occurrence core vs Event core and have opted for an Event core for reasons
> > previously given. I don't believe we are trying to reconfigure Event ("an
> > action that occurs at a place and during a period of time") and regardless
> > of whether we use Occurrence or Event, the need for some additional terms
> > arises (e.g., quantity, quantityType, samplingGeometry, samplingUnit). Once
> > the BCO model is available for uptake, it should be possible to develop a
> > mapping between it and the simple DwC sample model.
> >
> > So GBIF's stance is that we need to take a two-pronged approach by exploring
> > how the IPT and DwC-A can be adapted for publishing sample-based data in the
> > near term while supporting the work of TDWG and groups such as the BCO in
> > advancing biodiversity informatics. GBIF has already engaged in the work of
> > the BCO and will continue to do so.
> >
> > Éamonn
> >
> > -----Original Message-----
> > From: tdwg-content-bounces at lists.tdwg.org
> > [mailto:tdwg-content-bounces at lists.tdwg.org] On Behalf Of Markus Döring
> > Sent: 28 August 2014 12:44
> > To: Ramona Walls
> > Cc: TDWG Content Mailing List
> > Subject: Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6
> >
> > Hi Ramona & Rob,
> >
> > The Event proposal does not try to change the semantics of an Event, it just
> > uses the existing Darwin Core Event "class" at the core in Darwin Core
> > archives. The actual change proposed is simply adding 3 new terms to the
> > Event "group" to better share information about sampling methods & efforts,
> > extending the existing limited capabilities of Darwin Core which already has
> > the terms dwc:samplingProtocol and dwc:samplingEffort. It also proposes 2
> > new terms for dealing with quantity of Occurrences, something that has been
> > discussed since 2012 now, when I had proposed a new abundance term [2].
> >
> >
> > In general application of Darwin Core is not at all limited to specimens and
> > observations. It is used for sharing taxonomic datasets already and its
> > definition and goal is broad. Let me cite some of the introduction to Darwin
> > Core [1]:
> >
> > What is the Darwin Core?
> > The Darwin Core is body of standards. It includes a glossary of terms (in
> > other contexts these might be called properties, elements, fields,
> columns,
> > attributes, or concepts) intended to facilitate the sharing of
> information
> > about biological diversity by providing reference definitions, examples,
> and
> > commentaries. The Darwin Core is primarily based on taxa, their
> occurrence
> > in nature as documented by observations, specimens, samples, and related
> > information.
> >
> > Motivation: The Darwin Core standard was originally conceived to
> facilitate
> > the discovery, retrieval, and integration of information about modern
> > biological specimens, their spatiotemporal occurrence, and their
> supporting
> > evidence housed in collections (physical or digital). The Darwin Core
> today
> > is broader in scope and more versatile. It is meant to provide a stable
> > standard reference for sharing information on biological diversity. As a
> > glossary of terms, the Darwin Core is meant to provide stable semantic
> > definitions with the goal of being maximally reusable in a variety of
> > contexts.
> >
> >
> > Markus
> >
> >
> > [1] http://rs.tdwg.org/dwc/index.htm
> > [2] https://code.google.com/p/darwincore/issues/detail?id=142
> >
> >
> > --
> > Markus Döring
> > Software Developer
> > Global Biodiversity Information Facility (GBIF)
> > mdoering at gbif.org
> > http://www.gbif.org
> >
> >
> >
> > >> On Tue, Aug 19, 2014 at 6:11 AM, Éamonn Ó Tuama [GBIF] <eotuama at gbif.org> wrote:
> > >>
> > >> Dear All,
> > >>
> > >>
> > >>
> > >> GBIF is committed to exploring ways in which the IPT and Darwin Core
> > >> Archive format can be extended for publishing sample-based data sets. In
> > >> association with the EU BON project [1], a customised version of the IPT [2]
> > >> has been deployed to test this using a special type of Darwin Core Archive
> > >> in which the core is an "Event" with associated taxon occurrences in an
> > >> "Occurrence" extension.
> > >>
> > >>
> > >>
> > >> The Darwin Core vocabulary already provides a rich set of terms, with many
> > >> relevant for describing sample-based data. Synthesising several sources of
> > >> input (GBIF organised workshop on sample data, May 2013 [3]; discussions on
> > >> the TDWG mailing list in late 2013; internal discussion among EU BON project
> > >> partners), five new terms relating to sample data were identified as
> > >> essential. The complete model including these new terms is fully described
> > >> with examples in the online document "Publishing sample data using the GBIF
> > >> IPT" [4].
> > >>
> > >>
> > >>
> > >> As a first step towards ratification, we would like to register the new
> > >> terms in the DwC Google Code tracker [5] if there are no major objections on
> > >> this list. The five terms are:
> > >>
> > >>
> > >>
> > >> 1.      quantity: the number or enumeration value of the quantityType
> > >> (e.g., individuals, biomass, biovolume, BraunBlanquetScale) per samplingUnit
> > >> or a percentage measure recorded for the sample.
> > >>
> > >> 2.      quantityType: the entity being referred to by quantity, e.g.,
> > >> individuals, biomass, %species, scale type.
> > >>
> > >> 3.      samplingGeometry: an indication of what kind of space was
> > >> sampled; select from point, line, area or volume.
> > >>
> > >> 4.      samplingUnit: the unit of measurement used for reporting the
> > >> quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3.
> > >> It is combined with quantity and quantityType to provide the complete
> > >> measurement, e.g., 9 individuals per day, 4 biomass-gm per metre^2.
> > >>
> > >> 5.      eventSeriesID: an identifier for a set of events that are
> > >> associated in some way, e.g., a monitoring series; may be a global unique
> > >> identifier or an identifier specific to the series.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> Best regards,
> > >>
> > >>
> > >>
> > >> Éamonn
> > >>
> > >>
> > >>
> > >> [1] http://eubon.eu
> > >>
> > >> [2] http://eubon-ipt.gbif.org
> > >>
> > >> [3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
> > >>
> > >> [4] http://links.gbif.org/sample_data_model
> > >>
> > >> [5] https://code.google.com/p/darwincore/issues/list
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> ____________________________________________________
> > >>
> > >> Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama at gbif.org),
> > >>
> > >> Senior Programme Officer for Interoperability,
> > >>
> > >> Global Biodiversity Information Facility Secretariat,
> > >>
> > >> Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK
> > >>
> > >> Phone: +45 3532 1494; Fax: +45 3532 1480