Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6

28 Aug 2014

Hi Ramona & Rob,
The Event proposal does not try to change the semantics of an Event, it just uses the existing Darwin Core Event "class" at the core in Darwin Core archives. The actual change proposed is simply adding 3 new terms to the Event "group" to better share information about sampling methods & efforts, extending the existing limited capabilities of Darwin Core which already has the terms dwc:samplingProtocol and dwc:samplingEffort. It also proposes 2 new terms for dealing with quantity of Occurrences, something that has been discussed since 2012 now, when I had proposed a new abundance term [2].
In general, the application of Darwin Core is not at all limited to specimens and observations. It is already used for sharing taxonomic datasets, and its definition and goal are broad. Let me cite some of the introduction to Darwin Core [1]:
What is the Darwin Core?
The Darwin Core is body of standards. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. The Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information.
Motivation: The Darwin Core standard was originally conceived to facilitate the discovery, retrieval, and integration of information about modern biological specimens, their spatiotemporal occurrence, and their supporting evidence housed in collections (physical or digital). The Darwin Core today is broader in scope and more versatile. It is meant to provide a stable standard reference for sharing information on biological diversity. As a glossary of terms, the Darwin Core is meant to provide stable semantic definitions with the goal of being maximally reusable in a variety of contexts.
Markus
[1] http://rs.tdwg.org/dwc/index.htm
[2] https://code.google.com/p/darwincore/issues/detail?id=142
--
Markus Döring
Software Developer
Global Biodiversity Information Facility (GBIF)
mdoering@gbif.org
http://www.gbif.org
On 27 Aug 2014, at 18:57, Ramona Walls rlwalls2008@gmail.com wrote:
...
I think it is important to consider the purpose of both Darwin Core and DwC archives  in deciding whether or not to expand them, but we should use that consideration to address the question at hand, which is whether or not to add an Event core and additional properties to describe events.
Describing the exchange format before the semantics is the wrong way to go, given that we now have a framework for developing semantics. Expanding Darwin Core before we adequately model survey data is bound to lead to problems later, when we try to retro-fit the semantics to Darwin Core Event archives. This is exactly the problem we are running into now with Occurrence archives, and we have the opportunity to avoid it.
I suggest we first use existing ontologies to model survey data, then deal with if and how to exchange that information in DwC-A. This is what I was hinting at in my first email, but should have said more explicitly.
Ramona

Ramona L. Walls, Ph.D.
Scientific Analyst, The iPlant Collaborative, University of Arizona
Research Associate, Bio5 Institute, University of Arizona
Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 9:14 AM, Robert Guralnick Robert.Guralnick@colorado.edu wrote:
It may be a sensible view for Darwin Core Archives and their intended use, but Tim's email suggests we should be putting the method of delivery ahead of the standard that delivers that content.  If this was just about DwC-As, why not develop a survey extension that links each occurrence to information about the survey process using the existing star-schema methods we have in place?  Why are we discussing adding terms to the Darwin Core or trying to fully reconfigure what we call an Event?  That is what is on the table, not DwC-As and how we use them.  Or am I missing something?
Best, Rob
On Wed, Aug 27, 2014 at 10:01 AM, Ramona Walls rlwalls2008@gmail.com wrote:
Thanks, Tim, and yes, DwC-A as a view (but not necessarily the primary archive) of data seems like the right point of view.
Ramona

Ramona L. Walls, Ph.D.
Scientific Analyst, The iPlant Collaborative, University of Arizona
Research Associate, Bio5 Institute, University of Arizona
Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 1:58 AM, Tim Robertson trobertson@gbif.org wrote:
Hi Ramona,
Those are good points, and I’d like to come back to the original thinking behind the DwC-A.
It was designed and intended to be a simple way of exposing a complete view of a dataset, primarily for building sophisticated indexes, inventories and allowing basic analytics (e.g. GBIF.org being one sophisticated index).  We found that the star schema provided the flexibility to do a lot, and with the bundled metadata (e.g. EML) was enough to trace provenance and allow users to determine if the dataset might be fit for various uses.  In many cases this represents the complete (e.g. lossless) view of a dataset.
What we are discussing here are far richer datasets, where shoe-horning content into the star schema becomes lossy for some, although we’re finding other cases where it is indeed lossless.  I believe we should be looking to harmonise ontologies / models etc as you mention but in parallel we should define one or more star schema views that can still be used for discovery / reporting / basic analytical purpose, and not long term archival of the dataset.  The dataset would then have the canonical rich form and an additional DwC-A view.  What I write here is applicable to all content types of course.
Please also note that many people put supplementary files in the DwC-A which are ignored by DwC-A readers but could be a way of keeping the richer view in the bundle.  If you wish, you can describe those supplementary files in the EML document.
Does this gel with the view of others as well?
Cheers,
Tim
On 27 Aug 2014, at 02:55, Ramona Walls rlwalls2008@gmail.com wrote:
...
I think Matt hit the nail on the head. Although Darwin Core can be used to exchange survey data, it lacks the semantics and structure necessary to archive the data without loss of information. I think the biodiversity community would be better served devoting energy to harmonizing existing technologies such as OGC, OBOE, and BCO, not to mention the many databases for storing plot or survey data. The goal should be to preserve the data in the most informative manner possible.
There is a strong case for wanting to search across all evidence for occurrences, including surveys and point occurrences, so I can see possible demand for a tool that would extract occurrences from survey data to a DwC archive. However, I am very concerned that making a DwC archive the primary exchange format for survey or plot data commits us to a path of losing information from the start, for all but the simplest sampling schemas.
Ramona

Ramona L. Walls, Ph.D.
Scientific Analyst, The iPlant Collaborative, University of Arizona
Research Associate, Bio5 Institute, University of Arizona
Laboratory Research Associate, New York Botanical Garden
On Fri, Aug 22, 2014 at 3:00 AM, tdwg-content-request@lists.tdwg.org wrote:
Today's Topics:

Re: Darwin Core: proposed news terms for expressing sample data (Matt Jones)
Re: Darwin Core: proposed news terms for expressing sample data (Donald Hobern [GBIF])

Message: 1
Date: Thu, 21 Aug 2014 18:52:06 -0800
From: Matt Jones jones@nceas.ucsb.edu
Subject: Re: [tdwg-content] Darwin Core: proposed news terms for
        expressing sample data
To: Éamonn Ó Tuama [GBIF] eotuama@gbif.org
Cc: TDWG Content Mailing List tdwg-content@lists.tdwg.org
Message-ID:
        CAFSW8xkx7uRP9PC2g3=JT_VJanqujH8nPXoz8GXwh+JwKw5Ccw@mail.gmail.com
Content-Type: text/plain; charset="utf-8"
This proposal is treading on ground that is quite similar to other
observations and measurements standards for data exchange that are already
mature, in particular:

OGC Observations and Measurements (http://www.opengeospatial.org/standards/om)

Extensible Observation Ontology (OBOE; https://semtools.ecoinformatics.org/oboe)

The former is a standard and broadly deployed, whereas the latter is part
of a research program in the use of ontologies for measurements.  Through
collaboration between the two projects, they've been modified to be
reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an
OWL-DL serialization. They largely express the same measurements and
sampling model once one gets beyond the terminology differences.
So, I'm wondering if it makes much sense to extend Darwin Core, which is at
heart an Occurrence exchange syntax, into this measurements area that is
well represented by these other existing specifications?  I'm curious to
hear why people would even want to do this.  And if we do go down this
path, won't we just end up with a new syntax that does essentially what O&M
and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, Éamonn Ó Tuama [GBIF] eotuama@gbif.org
wrote:
...
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event
core which is being used in the sense described by Anne and Rich. For any
event, you can have a list of species in an Occurrence extension and for
each species, you can include quantity and quantityType, e.g., biomass,
etc. The proposed term eventSeriesID was intended for linking together
related events, although it now looks like parentEventID might be a better,
more flexible term. The measurementOrFact extension is a good fit for
capturing environmental information relating to an event. See, e.g., the
Gialova Lagoon brackish water invertebrate test data set [1] where a set
of 18 environmental variables, including temp, pH, Rdx, particulate organic
matter, dissolved oxygen, salinity, chlorophyll-a were measured for each
sampling station-sampling period combination. An example mapping is:
Id   measurementType   measurementValue   measurementUnit   measurementRemarks
IA   Tmp (sed)         21.5               degree C          Tmp (sed): temperature at the bottom surface
**Controlled vocabularies**
Ideally, the values for samplingUnit and quantityType would be selected
from controlled vocabularies. This is, effectively, what we do by
presenting a small list of values in a drop-down menu. The current values
are what we derived for example data sets and discussion but they can
undoubtedly be extended and improved.
We capture "bucket" type measures through a combination of samplingEffort,
samplingGeometry and samplingUnit. For example, a pitfall trap (in a point
location) left out for 16 days might have samplingEffort: 16,
samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a
shore survey might have samplingEffort: 3, samplingGeometry: area and
samplingUnit: m^2.
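
The two cases above could be captured in a small record type. This is only a sketch: the class name `SamplingEffort`, its field names and the `describe` helper are hypothetical and not part of the proposal; the allowed geometries are taken from the samplingGeometry definition given in this thread.

```python
from dataclasses import dataclass

# Allowed values as listed in the proposed samplingGeometry definition.
GEOMETRIES = {"point", "line", "area", "volume"}


@dataclass(frozen=True)
class SamplingEffort:
    """Hypothetical container for the three proposed sampling terms."""

    sampling_effort: float   # samplingEffort, e.g. 16
    sampling_geometry: str   # samplingGeometry, e.g. "point"
    sampling_unit: str       # samplingUnit, e.g. "day"

    def __post_init__(self):
        # Reject geometries outside the small controlled list.
        if self.sampling_geometry not in GEOMETRIES:
            raise ValueError(f"unknown samplingGeometry: {self.sampling_geometry}")

    def describe(self) -> str:
        # Render effort, unit and geometry as one readable string.
        return f"{self.sampling_effort:g} {self.sampling_unit} ({self.sampling_geometry})"


# The pitfall trap left out for 16 days at a point location.
pitfall = SamplingEffort(16, "point", "day")
# Three m^2 quadrats in a shore survey.
quadrats = SamplingEffort(3, "area", "m^2")

print(pitfall.describe())
print(quadrats.describe())
```

A "bucket hours" or "trap nights" measure would need a samplingUnit value that is not in the current short list, which is where a controlled vocabulary would have to be extended.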
It would be very useful to see your compilation of scope, effort and
completeness measures to see if we can express them in our model and/or if
we need to reconsider our approach.
Éamonn
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:
tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Markus Döring
*Sent:* 20 August 2014 23:47
*To:* Robert Guralnick
*Cc:* TDWG Content Mailing List
*Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for
expressing sample data
Rob,
this proposal is for monitoring surveys really, not to be confused with
material samples like environmental or tissue samples which have a distinct
new dwc class MaterialSample.
We tend to overload the term sampling a lot, and it helps to treat material
samples differently from pure observational "sampling". That is why the
existing Event class was used as the core and classic Occurrence records as
extensions. A classic example is a vegetation survey where each plot
represents an Event record and each recorded species in that plot will be
an Occurrence extension record with a given quantity. Darwin Core already
offers individualCount to specify quantity, but it is a very specific way
of measuring "abundance" restricted to only some use cases. Abiotic
measurements about the plot (e.g. soil type, pH, temperature) can be
published using the measurements or facts extension linked to the Event
core.
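
The vegetation survey layout described above can be sketched as three small tables joined on eventID. This is an illustration only: all row values are invented, the helper names `read_rows` and `group_by_event` are hypothetical, and quantity / quantityType are the proposed (not yet ratified) terms.

```python
import csv
import io
from collections import defaultdict

# Invented example rows; each plot is one Event core record.
EVENT_CORE = """eventID,eventDate,samplingProtocol
plot-1,2014-06-01,vegetation plot
plot-2,2014-06-02,vegetation plot
"""

# Each recorded species in a plot is one Occurrence extension record.
OCCURRENCE_EXT = """eventID,scientificName,quantity,quantityType
plot-1,Quercus robur,12,individuals
plot-1,Fagus sylvatica,30,%cover
plot-2,Quercus robur,4,individuals
"""

# Abiotic measurements about the plot go in the measurements or facts extension.
MOF_EXT = """eventID,measurementType,measurementValue,measurementUnit
plot-1,pH,6.5,
plot-2,pH,7.1,
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def group_by_event(rows):
    """Index extension rows by the eventID of the core record they point to."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["eventID"]].append(row)
    return grouped


events = read_rows(EVENT_CORE)
occurrences = group_by_event(read_rows(OCCURRENCE_EXT))
measurements = group_by_event(read_rows(MOF_EXT))

for event in events:
    event_id = event["eventID"]
    print(event_id, event["eventDate"])
    for occ in occurrences[event_id]:
        print("  occurrence:", occ["scientificName"], occ["quantity"], occ["quantityType"])
    for mof in measurements[event_id]:
        print("  measurement:", mof["measurementType"], mof["measurementValue"])
```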
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick Robert.Guralnick@colorado.edu
wrote:
Anne -- I don't know the answers!  These are questions for Eamonn.  I
would presume that a sample could be a jumble of species or even just water
or soil samples, and biomass would refer to that sample - but maybe that
isn't a use case being considered?  The examples given in the longer
document all link an event_id to species name and some measure of quantity
for that species (to the species, not an individual specimen), so I assume
that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen annethessen@gmail.com
wrote:
Hi Rob
I would like to respond to your item number 2.
From my perspective, I deal with lots of published descriptions of taxa.
The text might say something like "I saw species A in the Chesapeake Bay,
the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The
biomass range obviously corresponds to at least three different
occurrences, but how to divide the biomass data? I would love to be able to
have an *event* to attach it all to. There are almost two different levels
of events - a sampling event and a "study event". The "study event" would
correspond to the type of event I would like to use in the above example.
It may not be ideal, but for the old literature that might be the best we
can do.
I have to admit that I don't know enough about trawl data to understand
why an event core would be a problem. It seems that the trawl would be an
event and each biomass measure (of each fish) would be attached to a
separate occurrence which is attached to that event. Am I understanding
this wrong?
btw - I found a workaround for the example I gave, so it's not impossible
to model with the current structure....
Anne
On 8/20/2014 1:16 PM, Robert Guralnick wrote:
Éamonn et al. --- Thanks for the clarifications.  I think these help a ton
but they raise a couple more questions for me.

1. I am surprised that you plan to use the MeasurementOrFact extension in
relation to the Event core, which seems like a novel (or perhaps awkward or
unintended?) mechanism for capturing environmental data, but the same
extension was not seen as relevant for describing samples? Can you
explain more about the thinking there?

2. There may be a subtle issue here in extending "Event" to be more what you
call a "Sampling Event Core".  My read of this is that Darwin Core serves
as a way to deal with point occurrences and Event reflects the context of a
single capture event (whether a single observation, or a bulk sample
capture).  The changes recommended seem to dramatically extend and change
that meaning?  It's simply a question that I don't have an answer to, but is
Darwin Core the right vehicle to start capturing repeated measures of
biomass values from trawls?   I don't have an answer, but man, terms like
quantityType (as a property of occurrence?) give me pause.

3. Is Sampling Unit a controlled vocabulary? For another project, I have
looked through - and captured scope, effort and completeness measures from -
a large number of published biotic area inventories.  The vast majority
of these are measured in units like bucket hours, or trap nights.  Is a
"bucket" part of SamplingGeometry or Sampling Unit?  I'd be happy to send
along all the many examples of how biotic inventories of an area are
completed and perhaps it might be good to see how those might be
represented using the terms you are proposing?
Best, Rob
On Wed, Aug 20, 2014 at 10:16 AM, Richard Pyle deepreef@bishopmuseum.org
wrote:
Same here - Events are central to the work that we do.
Aloha,
Rich
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:
tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Anne Thessen
*Sent:* Wednesday, August 20, 2014 2:59 AM
*To:* tdwg-content@lists.tdwg.org
*Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for
expressing sample data
Hello
I would just like to comment on *event core*.
I've been doing a lot of work translating published data into Darwin Core.
During that process I've wished several times that I could use Event as
core. I am happy to hear about that proposed change. It will make it easier
to model the data I am working with.
Anne
On 8/20/2014 7:04 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob,
Thank you for the feedback. I have tried to address the two main issues
you raise below. At the outset, I would like to emphasise that much of this
work is taking place in the context of the EU BON project which includes a
task on developing/enhancing tools and standards for data sharing with a
particular focus on the IPT for publishing sample-based data. So, we were
constrained by the need to publish sample-based data sets in the Darwin
Core Archive format and to demonstrate practical application using a
working prototype. When the discussion on the TDWG list faded out, we took
it to our EU BON partners whose requirements were essential input to
further development. We recognise that these discussions took place away
from TDWG (although the TDWG/EU BON contributors overlapped) and this is
the reason we are presenting  the outcomes here for further consideration.
**Event core**
As the SIGS report indicated, sample data can be modelled in Darwin Core
Archives using either Occurrence or Event as core. This was the starting
point for our evaluation but as things progressed the data wrangling pushed
the model back towards the Event core. We actually went through the
exercise of mapping multiple test datasets in an iterative process spanning
several months' work. In the end, we found that using an Event core better
matched the typical sample data we were dealing with, allowing use of a
measurement-or-fact extension to be included for the efficient expression
of environmental information associated with the event. The choice comes
down to an Occurrence core or an Event core + Occurrence extension. In both
cases, the true observation records are Occurrences. The big difference is
what type the core has and therefore to which kind of records you can
attach further facts and extra information with DwC-A extensions. Many
sampling datasets have very rich information about the site and event, so
it is very natural to hang facts from an Event core. When picking the
Occurrence core those facts would have to be repeated for each and every
occurrence record. Moreover, our approach doesn't stop anyone from using
the Occurrence core if they so wish. This just provides a different option
for datasets that better fit an Event core model.
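
As a rough illustration of the repetition point above (a sketch with invented values; the variable names are hypothetical), compare how often the event-level facts are stored under each layout:

```python
# One invented sampling event with site-level facts and three recorded species.
event = {"eventID": "trawl-1", "eventDate": "2014-08-01", "habitat": "brackish lagoon"}
species = ["species A", "species B", "species C"]

# Event core + Occurrence extension: event facts are stored once,
# and each occurrence row only carries the eventID link.
star = {
    "core": [event],
    "occurrence_ext": [
        {"eventID": event["eventID"], "scientificName": name} for name in species
    ],
}

# Occurrence core: every occurrence row repeats all the event facts.
flat = [{**event, "scientificName": name} for name in species]

# Count how many times eventDate is stored under each layout.
date_copies_star = sum("eventDate" in row for row in star["core"])
date_copies_flat = sum("eventDate" in row for row in flat)

print("eventDate stored", date_copies_star, "time(s) with an Event core")
print("eventDate stored", date_copies_flat, "time(s) with an Occurrence core")
```

The difference grows with the number of occurrences per event and with the number of facts recorded about the site and event.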
I want to stress that we are not building a "specific IPT version" to
support an Event core but, rather, we adapted the IPT so that it can be
configured to support any generic "core + extension" format to enable its
use for exploration of more data formats.  This is part of the core
codebase and there were no custom forks of the IPT for this work.  Our view
at GBIF is that if there are significant numbers of data publishers who are
keen to adopt, promote and use a (any) format, and the tools can be
configured to do so, then we should support it, and, if necessary, use a
custom namespace.
**New terms around abundance**
Yes, the discussion on TDWG did fade out but it was clear that the term
"abundance" as recommended by the SIGS report (along with
abundanceAsPercent) was confusing many when we were looking for term(s)
that reported quantitative measures of organisms in a sample. It also
became clear we would need to be able to state the type of quantity being
measured. An alternative suggestion for using the MeasurementsOrFact class
was immediately shot down.
As some of our main use cases were coming from the EU BON project,
discussion shifted to that forum and consensus formed about the currently
proposed terms. It was within this group that the additional terms
(samplingGeometry, samplingUnit, eventSeriesID) were proposed and where we
began testing with sample data sets.
Best regards,
Éamonn
*From:* robgur@gmail.com [mailto:robgur@gmail.com] *On
Behalf Of *Robert Guralnick
*Sent:* 19 August 2014 16:56
*To:* Éamonn Ó Tuama [GBIF]
*Cc:* TDWG Content Mailing List
*Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for
expressing sample data
Hi Éamonn --- I am curious about the outcomes presented in the SIGS
paper, in particular, this portion of the paper:
"Solutions without introducing an event core in Darwin Core Archives:
 During the review of the solutions for the uses cases, it became apparent
that either model could be applied to every use case. The core and
extensions bore a complementary relationship and between them could express
all the required information. The core simply provided the central anchor
in the star schema from which to join the additional information.
Therefore, using the Occurrence core, well established in the GBIF network
through uptake of the IPT, seemed more appropriate than inventing
CollectingEvent as an additional core type."
That SIGS paper has John Wieczorek and you both as authors, including
many luminaries across the biodiversity standards spectrum.  Given the
above, it's curious to see the EventCore come back again, along with a
specific IPT version to support it.
So I see two issues, conflated, in this post you just made.  One is
the need for an EventCore at all, and the nature of relating Event and
Occurrence/Material Sample.  The second is the introduction of new terms,
which seemingly have arrived after debate on similar terms - but framed
around abundance - stalled a year ago.  To my mind, these both require some
further discussion, because I don't (necessarily) see TDWG community
coherence around either issue?
Best, Rob
On Tue, Aug 19, 2014 at 6:11 AM, Éamonn Ó Tuama [GBIF] eotuama@gbif.org
wrote:
Dear All,
GBIF is committed to exploring ways in which the IPT and Darwin Core
Archive format can be extended for publishing sample-based data sets. In
association with the EU BON project [1], a customised version of the IPT
[2] has been deployed to test this using a special type of Darwin Core
Archive in which the core is an "Event" with associated taxon occurrences
in an "Occurrence" extension.
The Darwin Core vocabulary already provides a rich set of terms with many
relevant for describing sample-based data. Synthesising several sources of
input (GBIF organised workshop on sample data, May 2013 [3], discussions on
the TDWG mailing list in late 2013; internal discussion among EU BON
project partners), five new terms relating to sample data were identified
as essential. The complete model including these new terms is fully
described with examples in the online document "Publishing sample data
using the GBIF IPT" [4].
As a first step towards ratification, we would like to register the new
terms in the DwC Google Code tracker [5] if there are no major objections
on this list. The five terms are:

*quantity*: the number or enumeration value of the quantityType
(e.g., individuals, biomass, biovolume, BraunBlanquetScale) per
samplingUnit or a percentage measure recorded for the sample.

*quantityType*: the entity being referred to by quantity,
e.g., individuals, biomass, %species, scale type.

*samplingGeometry*: an indication of what kind of space was
sampled; select from point, line, area or volume.

*samplingUnit*: the unit of measurement used for reporting the
quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3.
It is combined with quantity and quantityType to provide the complete
measurement, e.g., 9 individuals per day,  4 biomass-gm per metre^2.

*eventSeriesID*: an identifier for a set of events that are
associated in some way, e.g., a monitoring series; may be a global unique
identifier or an identifier specific to the series.
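
The way quantity, quantityType and samplingUnit combine into a complete measurement might be sketched as below. The function name `format_measurement` is hypothetical and the handling of a missing samplingUnit is an assumption; only the two example strings come from the term definitions.

```python
def format_measurement(quantity, quantity_type, sampling_unit=None):
    """Combine the three proposed terms into one human-readable measurement.

    Assumption: when no samplingUnit is given (e.g. a percentage measure),
    only quantity and quantityType are reported.
    """
    if sampling_unit:
        return f"{quantity:g} {quantity_type} per {sampling_unit}"
    return f"{quantity:g} {quantity_type}"


# The two examples given in the samplingUnit definition.
print(format_measurement(9, "individuals", "day"))
print(format_measurement(4, "biomass-gm", "metre^2"))
```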
Best regards,
Éamonn
[1] http://eubon.eu
[2] http://eubon-ipt.gbif.org
[3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
[4] http://links.gbif.org/sample_data_model
[5] https://code.google.com/p/darwincore/issues/list

*Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama@gbif.org), *
*Senior Programme Officer for Interoperability, *
*Global Biodiversity Information Facility Secretariat, *
*Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK*
*Phone:  +45 3532 1494; Fax:  +45 3532 1480*

tdwg-content mailing list
tdwg-content@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content

--
Anne E. Thessen, Ph.D.
The Data Detektiv, Owner and Founder
Ronin Institute, Research Scholar
443.225.9185
