Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6
I think Matt hit the nail on the head. Although Darwin Core can be used to exchange survey data, it lacks the semantics and structure necessary to archive the data without loss of information. I think the biodiversity community would be better served devoting energy to harmonizing existing technologies such as OGC, OBOE, and BCO, not to mention the many databases for storing plot or survey data. The goal should be to preserve the data in the most informative manner possible.
There is a strong case for wanting to search across all evidence for occurrences, including surveys and point occurrences, so I can see possible demand for a tool that would extract occurrences from survey data to a DwC archive. However, I am very concerned that making a DwC archive the primary exchange format for survey or plot data commits us to a path of losing information from the start, for all but the simplest sampling schemas.
Ramona
------------------------------------------------------ Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Fri, Aug 22, 2014 at 3:00 AM, tdwg-content-request@lists.tdwg.org wrote:
Today's Topics:
- Re: Darwin Core: proposed news terms for expressing sample data (Matt Jones)
- Re: Darwin Core: proposed news terms for expressing sample data (Donald Hobern [GBIF])
Message: 1
Date: Thu, 21 Aug 2014 18:52:06 -0800
From: Matt Jones jones@nceas.ucsb.edu
Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
To: Éamonn Ó Tuama [GBIF] eotuama@gbif.org
Cc: TDWG Content Mailing List tdwg-content@lists.tdwg.org
Message-ID: <CAFSW8xkx7uRP9PC2g3=JT_VJanqujH8nPXoz8GXwh+JwKw5Ccw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
This proposal is treading on ground that is quite similar to other observations and measurements standards for data exchange that are already mature, in particular:
- OGC Observations and Measurements (http://www.opengeospatial.org/standards/om)
- Extensible Observation Ontology (OBOE; https://semtools.ecoinformatics.org/oboe)
The former is a standard and broadly deployed, whereas the latter is part of a research program in the use of ontologies for measurements. Through collaboration between the two projects, they've been modified to be reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an OWL-DL serialization. They largely express the same measurements and sampling model once one gets beyond the terminology differences.
So, I'm wondering if it makes much sense to extend Darwin Core, which is at heart an Occurrence exchange syntax, into this measurements area that is well represented by these other existing specifications? I'm curious to hear why people would even want to do this. And if we do go down this path, won't we just end up with a new syntax that does essentially what O&M and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, Éamonn Ó Tuama [GBIF] eotuama@gbif.org wrote:
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event core which is being used in the sense described by Anne and Rich. For any event, you can have a list of species in an Occurrence extension and for each species, you can include quantity and quantityType, e.g., biomass, etc. The proposed term eventSeriesID was intended for linking together related events, although it now looks like parentEventID might be a better, more flexible term. The measurementOrFact extension is a good fit for capturing environmental information relating to an event. See, e.g., the Gialova Lagoon brackish water invertebrate test data set [1] where a set of 18 environmental variables, including temp, pH, Rdx, particulate organic matter, dissolved oxygen, salinity, chlorophyll-a were measured for each sampling station-sampling period combination. An example mapping is:
Id | measurementType | measurementValue | measurementUnit | measurementRemarks
IA | Tmp (sed)       | 21.5             | degree C        | Tmp (sed): temperature at the bottom surface
**Controlled vocabularies**
Ideally, the values for samplingUnit and quantityType would be selected from controlled vocabularies. This is, effectively, what we do by presenting a small list of values in a drop-down menu. The current values are what we derived for example data sets and discussion but they can undoubtedly be extended and improved.
We capture "bucket" type measures through a combination of samplingEffort, samplingGeometry and samplingUnit. For example, a pitfall trap (in a point location) left out for 16 days might have samplingEffort: 16, samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a shore survey might have samplingEffort: 3, samplingGeometry: area and samplingUnit: m^2.
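To make the two examples above concrete, here is a minimal Python sketch. It is only an illustration: the class name, the `describe` helper and the validation set are hypothetical, and only samplingEffort, samplingGeometry and samplingUnit (with the values shown) come from the proposal.

```python
from dataclasses import dataclass

# Geometry values as listed in the proposal for samplingGeometry.
ALLOWED_GEOMETRIES = {"point", "line", "area", "volume"}


@dataclass(frozen=True)
class SamplingEffort:
    """Hypothetical container for the three proposed sampling terms."""
    effort: float   # samplingEffort
    geometry: str   # samplingGeometry
    unit: str       # samplingUnit

    def describe(self) -> str:
        # ":g" drops a trailing ".0" so that 16 prints as "16"
        return f"{self.effort:g} {self.unit} ({self.geometry})"


# The pitfall trap and shore-survey quadrat examples from the message
pitfall = SamplingEffort(effort=16, geometry="point", unit="day")
quadrats = SamplingEffort(effort=3, geometry="area", unit="m^2")

for sample in (pitfall, quadrats):
    if sample.geometry not in ALLOWED_GEOMETRIES:
        raise ValueError(f"unexpected samplingGeometry: {sample.geometry}")
    print(sample.describe())
```

Note that nothing here says whether "bucket" belongs in the geometry or the unit; that is exactly the open question Rob raises.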
It would be very useful to see your compilation of scope, effort and completeness measures to see if we can express them in our model and/or if we need to reconsider our approach.
Éamonn
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
*From:* tdwg-content-bounces@lists.tdwg.org [mailto: tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Markus Döring *Sent:* 20 August 2014 23:47 *To:* Robert Guralnick
*Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Rob,
this proposal is really for monitoring surveys, not to be confused with material samples like environmental or tissue samples, which have a distinct new dwc class MaterialSample.
We tend to overload the term sampling a lot and it helps to treat material samples differently from pure observational "sampling". That is why the existing Event class was used as the core and classic Occurrence records as extensions. A classic example is a vegetation survey where each plot represents an Event record and each recorded species in that plot will be an Occurrence extension record with a given quantity. Darwin Core already offers individualCount to specify quantity, but it is a very specific way of measuring "abundance" restricted to only some use cases. Abiotic measurements about the plot (e.g. soil type, pH, temperature) can be published using the measurements or facts extension linked to the Event core.
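The vegetation-survey layout described above can be sketched in a few lines of Python. This is an illustration only, not the IPT's implementation: the plot identifiers, species names and values are invented, and `rows_for` is a hypothetical helper. The eventID link between core and extension rows is the only structural assumption.

```python
# Hypothetical records; identifiers and values are invented for illustration.
events = [  # Event core: one row per plot
    {"eventID": "plot-1", "eventDate": "2014-06-01"},
    {"eventID": "plot-2", "eventDate": "2014-06-02"},
]
occurrences = [  # Occurrence extension: one row per species recorded in a plot
    {"eventID": "plot-1", "scientificName": "Species A",
     "quantity": 12, "quantityType": "individuals"},
    {"eventID": "plot-1", "scientificName": "Species B",
     "quantity": 3, "quantityType": "individuals"},
    {"eventID": "plot-2", "scientificName": "Species A",
     "quantity": 7, "quantityType": "individuals"},
]
measurements = [  # measurements or facts extension: abiotic data per plot
    {"eventID": "plot-1", "measurementType": "pH", "measurementValue": "6.5"},
    {"eventID": "plot-2", "measurementType": "pH", "measurementValue": "7.1"},
]


def rows_for(event_id, rows):
    """Return the extension rows that point at the given core record."""
    return [row for row in rows if row["eventID"] == event_id]


for event in events:
    species = [o["scientificName"]
               for o in rows_for(event["eventID"], occurrences)]
    facts = [(m["measurementType"], m["measurementValue"])
             for m in rows_for(event["eventID"], measurements)]
    print(event["eventID"], species, facts)
```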
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick <Robert.Guralnick@colorado.edu> wrote:
Anne -- I don't know the answers! These are questions for Eamonn. I would presume that a sample could be a jumble of species or even just water or soil samples, and biomass would refer to that sample - but maybe that isn't a use case being considered? The examples given in the longer document all link an event_id to species name and some measure of quantity for that species (to the species, not an individual specimen), so I assume that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen annethessen@gmail.com wrote:
Hi Rob,
I would like to respond to your item number 2. From my perspective, I deal with lots of published descriptions of taxa. The text might say something like "I saw species A in the Chesapeake Bay, the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The biomass range obviously corresponds to at least three different occurrences, but how to divide the biomass data? I would love to be able to have an *event* to attach it all to. There are almost two different levels of events - a sampling event and a "study event". The "study event" would correspond to the type of event I would like to use in the above example. It may not be ideal, but for the old literature that might be the best we can do. I have to admit that I don't know enough about trawl data to understand why an event core would be a problem. It seems that the trawl would be an event and each biomass measure (of each fish) would be attached to a separate occurrence which is attached to that event. Am I understanding this wrong? btw - I found a workaround for the example I gave, so it's not impossible to model with the current structure....
Anne
On 8/20/2014 1:16 PM, Robert Guralnick wrote:
Éamonn et al. --- Thanks for the clarifications. I think these help a ton but they raise a couple more questions for me.
- I am surprised that you plan to use the MeasurementOrFact extension in relation to the Event core, which seems like a novel (or perhaps awkward or unintended?) mechanism for capturing environmental data, but the same extension was not seen as relevant for describing samples? Can you explain more about the thinking there?
- There may be a subtle issue here extending "Event" to be more what you call a "Sampling Event Core". My read of this is that Darwin Core serves as a way to deal with point occurrences and Event reflects the context of a single capture event (whether a single observation, or a bulk sample capture). The changes recommended seem to dramatically extend and change that meaning? It's simply a question that I don't have an answer to, but is Darwin Core the right vehicle to start capturing repeated measures of biomass values from trawls? I don't have an answer but man, terms like quantityType (as a property of occurrence?) give me pause.
- Is Sampling Unit a controlled vocabulary? For another project, I have looked through - and captured scope, effort and completeness measures from - a large number of published biotic area inventories. The vast majority of these are measured in units like bucket hours, or trap nights. Is a "bucket" part of SamplingGeometry or Sampling Unit? I'd be happy to send along all the many examples of how biotic inventories of an area are completed and perhaps it might be good to see how those might be represented using the terms you are proposing?
Best, Rob
On Wed, Aug 20, 2014 at 10:16 AM, Richard Pyle <deepreef@bishopmuseum.org> wrote:
Same here - Events are central to the work that we do.
Aloha,
Rich
*From:* tdwg-content-bounces@lists.tdwg.org [mailto: tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Anne Thessen *Sent:* Wednesday, August 20, 2014 2:59 AM *To:* tdwg-content@lists.tdwg.org
*Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hello,
I would just like to comment on *event core*. I've been doing a lot of work translating published data into Darwin Core. During that process I've wished several times that I could use Event as core. I am happy to hear about that proposed change. It will make it easier to model the data I am working with.
Anne
On 8/20/2014 7:04 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob,
Thank you for the feedback. I have tried to address the two main issues you raise below. At the outset, I would like to emphasise that much of this work is taking place in the context of the EU BON project which includes a task on developing/enhancing tools and standards for data sharing with a particular focus on the IPT for publishing sample-based data. So, we were constrained by the need to publish sample-based data sets in the Darwin Core Archive format and to demonstrate practical application using a working prototype. When the discussion on the TDWG list faded out, we took it to our EU BON partners whose requirements were essential input to further development. We recognise that these discussions took place away from TDWG (although the TDWG/EU BON contributors overlapped) and this is the reason we are presenting the outcomes here for further consideration.
**Event core**
As the SIGS report indicated, sample data can be modelled in Darwin Core Archives using either Occurrence or Event as core. This was the starting point for our evaluation but as things progressed the data wrangling pushed the model back towards the Event core. We actually went through the exercise of mapping multiple test datasets in an iterative process spanning several months' work. In the end, we found that using an Event core better matched the typical sample data we were dealing with, allowing use of a measurement-or-fact extension to be included for the efficient expression of environmental information associated with the event. The choice comes down to an Occurrence core or an Event core + Occurrence extension. In both cases, the true observation records are Occurrences. The big difference is what type the core has and therefore to which kind of records you can attach further facts and extra information with DwC-A extensions. Many sampling datasets have very rich information about the site and event, so it is very natural to hang facts from an Event core. When picking the Occurrence core those facts would have to be repeated for each and every occurrence record. Moreover, our approach doesn't stop anyone from using the Occurrence core if they so wish. This just provides a different option for datasets that better fit an Event core model.
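The repetition argument can be illustrated with a small Python sketch. Everything in it is hypothetical (the station, its site facts, the species list and the `count_value` helper); it only shows that facts held once in an Event core get copied onto every row when the same data are flattened to an Occurrence core.

```python
# Hypothetical sampling event with site-level facts (values are invented).
event = {
    "eventID": "station-1",
    "locality": "hypothetical lagoon",
    "habitat": "brackish water",
}
species = ["Species A", "Species B", "Species C"]

# Event core + Occurrence extension: site facts are stored once and each
# occurrence row only carries the eventID that links it to the core.
event_core = [event]
occurrence_extension = [
    {"eventID": event["eventID"], "scientificName": name} for name in species
]

# Occurrence core: there is no Event row to hang facts from, so the site
# facts are copied onto every occurrence record.
occurrence_core = [{**event, "scientificName": name} for name in species]


def count_value(rows, key, value):
    """Count how many rows carry the given value for the given key."""
    return sum(1 for row in rows if row.get(key) == value)


print(count_value(event_core, "habitat", "brackish water"))       # stored once
print(count_value(occurrence_core, "habitat", "brackish water"))  # once per species
```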
I want to stress that we are not building a "specific IPT version" to support an Event core but, rather, we adapted the IPT so that it can be configured to support any generic "core + extension" format to enable its use for exploration of more data formats. This is part of the core codebase and there were no custom forks of the IPT for this work. Our view at GBIF is that if there are significant numbers of data publishers who are keen to adopt, promote and use a (any) format, and the tools can be configured to do so, then we should support it, and, if necessary, use a custom namespace.
**New terms around abundance**
Yes, the discussion on TDWG did fade out but it was clear that the term "abundance" as recommended by the SIGS report (along with abundanceAsPercent) was confusing many when we were looking for term(s) that reported quantitative measures of organisms in a sample. It also became clear we would need to be able to state the type of quantity being measured. An alternative suggestion for using the MeasurementOrFact class was immediately shot down.
As some of our main use cases were coming from the EU BON project, discussion shifted to that forum and consensus formed about the currently proposed terms. It was within this group that the additional terms (samplingGeometry, samplingUnit, eventSeriesID) were proposed and where we began testing with sample data sets.
Best regards,
Éamonn
*From:* robgur@gmail.com [mailto:robgur@gmail.com] *On Behalf Of *Robert Guralnick *Sent:* 19 August 2014 16:56 *To:* Éamonn Ó Tuama [GBIF] *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hi Éamonn --- I am curious about the outcomes presented in the SIGS paper, in particular, this portion of the paper:
"Solutions without introducing an event core in Darwin Core Archives: During the review of the solutions for the uses cases, it became apparent that either model could be applied to every use case. The core and extensions bore a complementary relationship and between them could express all the required information. The core simply provided the central anchor in the star schema from which to join the additional information. Therefore, using the Occurrence core, well established in the GBIF network through uptake of the IPT, seemed more appropriate than inventing CollectingEvent as an additional core type."
That SIGS paper has John Wieczorek and you both as authors, including many luminaries across the biodiversity standards spectrum. Given the above, it's curious to see the EventCore come back again, along with a specific IPT version to support it.
So I see two issues, conflated, in this post you just made. One is the need for an EventCore at all, and the nature of relating Event and Occurrence/Material Sample. The second is the introduction of new terms, which seemingly have arrived after debate on similar terms - but framed around abundance - stalled a year ago. To my mind, these both require some further discussion, because I don't (necessarily) see TDWG community coherence around either issue?
Best, Rob
On Tue, Aug 19, 2014 at 6:11 AM, Éamonn Ó Tuama [GBIF] <eotuama@gbif.org> wrote:
Dear All,
GBIF is committed to exploring ways in which the IPT and Darwin Core Archive format can be extended for publishing sample-based data sets. In association with the EU BON project [1], a customised version of the IPT [2] has been deployed to test this using a special type of Darwin Core Archive in which the core is an "Event" with associated taxon occurrences in an "Occurrence" extension.
The Darwin Core vocabulary already provides a rich set of terms with many relevant for describing sample-based data. Synthesising several sources of input (GBIF organised workshop on sample data, May 2013 [3]; discussions on the TDWG mailing list in late 2013; internal discussion among EU BON project partners), five new terms relating to sample data were identified as essential. The complete model including these new terms is fully described with examples in the online document "Publishing sample data using the GBIF IPT" [4].
As a first step towards ratification, we would like to register the new terms in the DwC Google Code tracker [5] if there are no major objections on this list. The five terms are:
*quantity*: the number or enumeration value of the quantityType (e.g., individuals, biomass, biovolume, BraunBlanquetScale) per samplingUnit or a percentage measure recorded for the sample.
*quantityType*: the entity being referred to by quantity, e.g., individuals, biomass, %species, scale type.
*samplingGeometry*: an indication of what kind of space was sampled; select from point, line, area or volume.
*samplingUnit*: the unit of measurement used for reporting the quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3. It is combined with quantity and quantityType to provide the complete measurement, e.g., 9 individuals per day, 4 biomass-gm per metre^2.
*eventSeriesID*: an identifier for a set of events that are associated in some way, e.g., a monitoring series; may be a global unique identifier or an identifier specific to the series.
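As a rough illustration of how quantity, quantityType and samplingUnit combine into a complete measurement, here is a short Python sketch. The function name and its handling of a missing unit are assumptions made for the example; the two inputs shown are the ones given in the samplingUnit definition.

```python
from typing import Optional


def format_measurement(quantity: float, quantity_type: str,
                       sampling_unit: Optional[str] = None) -> str:
    """Combine the three proposed terms into one readable measurement.

    Hypothetical helper: the proposal defines the terms, not this function.
    """
    # ":g" prints 9 as "9" and 4.5 as "4.5"
    if sampling_unit is None:
        return f"{quantity:g} {quantity_type}"
    return f"{quantity:g} {quantity_type} per {sampling_unit}"


# The two examples given in the samplingUnit definition
print(format_measurement(9, "individuals", "day"))
print(format_measurement(4, "biomass-gm", "metre^2"))
```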
Best regards,
Éamonn
[1] http://eubon.eu
[3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
[4] http://links.gbif.org/sample_data_model
[5] https://code.google.com/p/darwincore/issues/list
*Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama@gbif.org), *
*Senior Programme Officer for Interoperability, *
*Global Biodiversity Information Facility Secretariat, *
*Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK*
*Phone: +45 3532 1494; Fax: +45 3532 1480*
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
--
Anne E. Thessen, Ph.D.
The Data Detektiv, Owner and Founder
Ronin Institute, Research Scholar
443.225.9185
Hi Ramona,
Those are good points, and I’d like to come back to the original thinking behind the DwC-A.
It was designed and intended to be a simple way of exposing a complete view of a dataset, primarily for building sophisticated indexes, inventories and allowing basic analytics (e.g. GBIF.org being one sophisticated index). We found that the star schema provided the flexibility to do a lot, and with the bundled metadata (e.g. EML) was enough to trace provenance and allow users to determine if the dataset might be fit for various uses. In many cases this represents the complete (i.e. lossless) view of a dataset.
What we are discussing here are far richer datasets, where shoe-horning content into the star schema becomes lossy for some, although we're finding other cases where it is indeed lossless. I believe we should be looking to harmonise ontologies / models etc. as you mention, but in parallel we should define one or more star schema views that can still be used for discovery / reporting / basic analytical purposes, and not for long-term archival of the dataset. The dataset would then have the canonical rich form and an additional DwC-A view. What I write here is applicable to all content types of course.
Please also note that many people put supplementary files in the DwC-A which are ignored by DwC-A readers but could be a way of keeping the richer view in the bundle. If you wish, you can describe those supplementary files in the EML document.
Does this gel with the view of others as well?
Cheers, Tim
On 27 Aug 2014, at 02:55, Ramona Walls rlwalls2008@gmail.com wrote:
I think Matt hit the nail on the head. Although Darwin Core can be used to exchange survey data, it lacks the semantics and structure necessary to archive the data without loss of information. I think the biodiversity community would be better served devoting energy to harmonizing existing technologies such as OGC, OBOE, and BCO, not to mention the many databases for storing plot or survey data. The goal should be to preserve the data in the most informative manner possible.
There is a strong case for wanting to search across all evidence for occurrences, including surveys and point occurrences, so I can see possible demand for a tool that would extract occurrences from survey data to a DwC archive. However, I am very concerned that making a DwC archive the primary exchange format for survey or plot data commits us to a path of losing information from the start, for all but the simplest sampling schemas.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Fri, Aug 22, 2014 at 3:00 AM, tdwg-content-request@lists.tdwg.org wrote: Send tdwg-content mailing list submissions to tdwg-content@lists.tdwg.org
To subscribe or unsubscribe via the World Wide Web, visit http://lists.tdwg.org/mailman/listinfo/tdwg-content or, via email, send a message with subject or body 'help' to tdwg-content-request@lists.tdwg.org
You can reach the person managing the list at tdwg-content-owner@lists.tdwg.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of tdwg-content digest..."
Today's Topics:
- Re: Darwin Core: proposed news terms for expressing sample data (Matt Jones)
- Re: Darwin Core: proposed news terms for expressing sample data (Donald Hobern [GBIF])
Message: 1 Date: Thu, 21 Aug 2014 18:52:06 -0800 From: Matt Jones jones@nceas.ucsb.edu Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data To: ?amonn ? Tuama [GBIF] eotuama@gbif.org Cc: TDWG Content Mailing List tdwg-content@lists.tdwg.org Message-ID: CAFSW8xkx7uRP9PC2g3=JT_VJanqujH8nPXoz8GXwh+JwKw5Ccw@mail.gmail.com Content-Type: text/plain; charset="utf-8"
This proposal is treading on ground that is quite similar to other observations and measurements standards for data exchange that are already mature, in particular:
- OGC Observations and Measurements (
http://www.opengeospatial.org/standards/om)
- Extensible Observation Ontology (OBOE;
https://semtools.ecoinformatics.org/oboe)
The former is a standard and broadly deployed, whereas the latter is part of a research program in the use of ontologies for measurements. Through collaboration between the two projects, they've been modified to be reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an OWL-DL serialization. They largely express the same measurements and sampling model once one gets beyond the terminology differences.
So, I'm wondering if it make much sense to extend Darwin Core, which is at heart an Occurrence exchange syntax, into this measurements area that is well represented by these other existing specifications? I'm curious to hear why people would even want to do this. And if we do go down this path, won't we just end up with a new syntax that does essentially what O&M and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, ?amonn ? Tuama [GBIF] eotuama@gbif.org wrote:
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event core which is being used in the sense described by Anne and Rich. For any event, you can have a list of species in an Occurrence extension and for each species, you can include quantity and quantityType, e.g., biomass, etc. The proposed term eventSeriesID was intended for linking together related events, although it now looks like parentEventID might be a better, more flexible term. The measurementOrFact extension is a good fit for capturing environmental information relating to an event. See, e.g., the Gialova Lagoon brackish water invertebrate test data set [1] where a set of 18 environmental variables, including temp, pH, Rdx, particulate organic matter, dissolved oxygen, salinity, chlorophyll-a were measured for each sampling station-sampling period combination. An example mapping is:
Id measurementType measurementValue measurementUnit measurementRemarks
IA Tmp (sed) 21.5 degree C Tmp (sed): temperature at the bottom surface
**Controlled vocabularies**
Ideally, the values for samplingUnit and quantityType would be selected from controlled vocabularies. This is, effectively, what we do by presenting a small list of values in a drop-down menu. The current values are what we derived for example data sets and discussion but they can undoubtedly be extended and improved.
We capture ?bucket? type measures through a combination of samplingEffort, samplingGeometry and samplingUnit. For example, a pitfall trap (in a point location) left out for 16 days might have samplingEffort: 16, samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a shore survey might have samplingEffort: 3, samplingGeometry: area and samplingUnit: m^2.
It would be very useful to see your compilation of scope, effort and completeness measures to see if we can express them in our model and/or if we need to reconsider our approach.
?amonn
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
*From:* tdwg-content-bounces@lists.tdwg.org [mailto: tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Markus D?ring *Sent:* 20 August 2014 23:47 *To:* Robert Guralnick
*Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Rob,
this proposal if for monitoring surveys really, not to be confused with material samples like environmental or tissue samples which have a distinct new dwc class MaterialSample.
We tend to overload the term sampling a lot and it helps treating material samples different from pure observational "sampling". That is why the existing Event class was used as the core and classic Occurrence records as extensions. A classic example is a vegetation survey where each plot represents an Event record and each recorded species in that plot will be an Occurrence extension record with a given quantity. Darwin Core already offers individualCount to specify quantity, but it is a very specific way of measuring "abundance" restricted to only some use cases. Abiotic measurements about the plot (e.g. soil type, pH, temperature) can be published using the measurements or facts extension linked to the Event core.
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick Robert.Guralnick@colorado.edu wrote:
Anne -- I don't know the answers! These are questions for Eamonn. I would presume that a sample could be a jumble of species or even just water or soil samples, and biomass would refer to that sample - but maybe that isn't a use case being considered? The examples given in the longer document all link an event_id to species name and some measure of quantity for that species (to the species, not an individual specimen), so I assume that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen annethessen@gmail.com wrote:
Hi Rob I would like to respond to your item number 2. From my perspective, I deal with lots of published descriptions of taxa. The text might say something like "I saw species A in the Chesapeake Bay, the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The biomass range obviously corresponds to at least three different occurrences, but how to divide the biomass data? I would love to be able to have an *event* to attach it all to. There is almost two different levels of events - a sampling event and a "study event". The "study event" would correspond to the type of event I would like to use in the above example. It may not be ideal, but for the old literature that might be the best we can do. I have to admit that I don't know enough about trawl data to understand why an event core would be a problem. It seems that the trawl would be an event and each biomass measure (of each fish) would be attached to a separate occurrence which is attached to that event. Am I understanding this wrong? btw - I found a workaround for the example I gave, so it's not impossible to model with the current structure.... Anne
On 8/20/2014 1:16 PM, Robert Guralnick wrote:
?amonn et al. --- Thanks for the clarifications. I think these help a ton but it raises a couple more questions for me.
- I am surprised that you plan to use of MeasurementorFact extension in
relation to the Event core, which seems like a novel (or perhaps awkward or unintended?) mechanism for capturing environmental data, but the same extension was not be seen as relevant for describing samples? Can you explain more about the thinking there?
- There may be a subtle issue here extending "Event" to be more what you
call a "Sampling Event Core". My read of this is that Darwin Core serves as a way to deal with point occurrences and Event reflects the context of a single capture event (whether a single observation, or a bulk sample capture). The changes recommended seem to dramatically extend and change that meaning? Its simply a question that I don't have answer to, but is Darwin Core, the right vehicle to start capturing repeated measures of biomass values from trawls? I don't have answer but man, terms like quantityType (as a property of occurrence?) give me pause.
- Is Sampling Unit a controlled vocabulary? For another project, I have
looked through - and captured scope, effort and completeness measures from
- a large number of published biotic area inventories. The vast majorities
of these are measured in units like bucket hours, or trap nights. Is a "bucket" part of SamplingGeometry or Sampling Unit? I'd be happy to send along all the many examples of how biotic inventories of an area are completed and perhaps it might be good to see how those might be represented using the terms you are proposing?
Best, Rob
On Wed, Aug 20, 2014 at 10:16 AM, Richard Pyle deepreef@bishopmuseum.org wrote:
Same here ? Events are central to the work that we do.
Aloha,
Rich
*From:* tdwg-content-bounces@lists.tdwg.org [mailto: tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Anne Thessen *Sent:* Wednesday, August 20, 2014 2:59 AM *To:* tdwg-content@lists.tdwg.org
*Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hello I would just like to comment on *event core*. I've been doing a lot of work translating published data into Darwin Core. During that process I've wished several times that I could use Event as core. I am happy to hear about that proposed change. It will make it easier to model the data I am working with. Anne
On 8/20/2014 7:04 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob,
Thank you for the feedback. I have tried to address the two main issues you raise below. At the outset, I would like to emphasise that much of this work is taking place in the context of the EU BON project which includes a task on developing/enhancing tools and standards for data sharing with a particular focus on the IPT for publishing sample-based data. So, we were constrained by the need to publish sample-based data sets in the Darwin Core Archive format and to demonstrate practical application using a working prototype. When the discussion on the TDWG list faded out, we took it to our EU BON partners whose requirements were essential input to further development. We recognise that these discussions took place away from TDWG (although the TDWG/EU BON contributors overlapped) and this is the reason we are presenting the outcomes here for further consideration.
**Event core**
As the SIGS report indicated, sample data can be modelled in Darwin Core Archives using either Occurrence or Event as core. This was the starting point for our evaluation, but as things progressed the data wrangling pushed the model back towards the Event core. We actually went through the exercise of mapping multiple test datasets in an iterative process spanning several months' work. In the end, we found that an Event core better matched the typical sample data we were dealing with, and it allowed a measurement-or-fact extension to be included for the efficient expression of environmental information associated with the event.
The choice comes down to an Occurrence core or an Event core + Occurrence extension. In both cases, the true observation records are Occurrences. The big difference is what type the core has, and therefore to which kind of records you can attach further facts and extra information with DwC-A extensions. Many sampling datasets have very rich information about the site and event, so it is very natural to hang facts from an Event core. When picking the Occurrence core, those facts would have to be repeated for each and every occurrence record. Moreover, our approach doesn't stop anyone from using the Occurrence core if they so wish. It just provides a different option for datasets that better fit an Event core model.
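A minimal sketch of the repetition described above, in Python with made-up records. The helper name `flatten_to_occurrence_core` and the `site_ph` field are hypothetical and are not Darwin Core terms; the point is only that facts stored once on an Event row get copied onto every Occurrence row when Occurrence is the core:

```python
# Hypothetical sample data: one event (a plot visit) with two recorded species.
events = [
    {"eventID": "E1", "eventDate": "2014-08-20", "site_ph": 7.9},
]
occurrences = [
    {"occurrenceID": "O1", "eventID": "E1", "scientificName": "Species A"},
    {"occurrenceID": "O2", "eventID": "E1", "scientificName": "Species B"},
]


def flatten_to_occurrence_core(events, occurrences):
    """Copy each event's facts onto every occurrence that points at it."""
    by_id = {event["eventID"]: event for event in events}
    flat = []
    for occ in occurrences:
        row = dict(by_id[occ["eventID"]])  # event facts, repeated per row
        row.update(occ)                    # occurrence fields on top
        flat.append(row)
    return flat


flat = flatten_to_occurrence_core(events, occurrences)
for row in flat:
    print(row)
```

With an Event core the `site_ph` value is stored once; in the flattened form it appears on both occurrence rows.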
I want to stress that we are not building a "specific IPT version" to support an Event core but, rather, we adapted the IPT so that it can be configured to support any generic "core + extension" format to enable its use for exploration of more data formats. This is part of the core codebase and there were no custom forks of the IPT for this work. Our view at GBIF is that if there are significant numbers of data publishers who are keen to adopt, promote and use a (any) format, and the tools can be configured to do so, then we should support it, and, if necessary, use a custom namespace.
**New terms around abundance**
Yes, the discussion on TDWG did fade out but it was clear that the term "abundance" as recommended by the SIGS report (along with abundanceAsPercent) was confusing many when we were looking for term(s) that reported quantitative measures of organisms in a sample. It also became clear we would need to be able to state the type of quantity being measured. An alternative suggestion for using the MeasurementsOrFact class was immediately shot down.
As some of our main use cases were coming from the EU BON project, discussion shifted to that forum and consensus formed about the currently proposed terms. It was within this group that the additional terms (samplingGeometry, samplingUnit, eventSeriesID) were proposed and where we began testing with sample data sets.
Best regards,
Éamonn
*From:* robgur@gmail.com *On Behalf Of* Robert Guralnick *Sent:* 19 August 2014 16:56 *To:* Éamonn Ó Tuama [GBIF] *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hi Éamonn - I am curious about the outcomes presented in the SIGS paper, in particular, this portion of the paper:
"Solutions without introducing an event core in Darwin Core Archives: During the review of the solutions for the uses cases, it became apparent that either model could be applied to every use case. The core and extensions bore a complementary relationship and between them could express all the required information. The core simply provided the central anchor in the star schema from which to join the additional information. Therefore, using the Occurrence core, well established in the GBIF network through uptake of the IPT, seemed more appropriate than inventing CollectingEvent as an additional core type."
That SIGS paper has both John Wieczorek and you as authors, along with many luminaries across the biodiversity standards spectrum. Given the above, it's curious to see the EventCore come back again, along with a specific IPT version to support it.
So I see two issues, conflated, in this post you just made. One is the need for an EventCore at all, and the nature of relating Event and Occurrence/Material Sample. The second is the introduction of new terms, which seemingly have arrived after debate on similar terms - but framed around abundance - stalled a year ago. To my mind, these both require some further discussion, because I don't (necessarily) see TDWG community coherence around either issue?
Best, Rob
On Tue, Aug 19, 2014 at 6:11 AM, Éamonn Ó Tuama [GBIF] eotuama@gbif.org wrote:
Dear All,
GBIF is committed to exploring ways in which the IPT and Darwin Core Archive format can be extended for publishing sample-based data sets. In association with the EU BON project [1], a customised version of the IPT [2] has been deployed to test this using a special type of Darwin Core Archive in which the core is an "Event" with associated taxon occurrences in an "Occurrence" extension.
The Darwin Core vocabulary already provides a rich set of terms, many of them relevant for describing sample-based data. Synthesising several sources of input (GBIF organised workshop on sample data, May 2013 [3]; discussions on the TDWG mailing list in late 2013; internal discussion among EU BON project partners), five new terms relating to sample data were identified as essential. The complete model including these new terms is fully described with examples in the online document "Publishing sample data using the GBIF IPT" [4].
As a first step towards ratification, we would like to register the new terms in the DwC Google Code tracker [5] if there are no major objections on this list. The five terms are:
- *quantity*: the number or enumeration value of the quantityType (e.g., individuals, biomass, biovolume, BraunBlanquetScale) per samplingUnit or a percentage measure recorded for the sample.
- *quantityType*: the entity being referred to by quantity, e.g., individuals, biomass, %species, scale type.
- *samplingGeometry*: an indication of what kind of space was sampled; select from point, line, area or volume.
- *samplingUnit*: the unit of measurement used for reporting the quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3. It is combined with quantity and quantityType to provide the complete measurement, e.g., 9 individuals per day, 4 biomass-gm per metre^2.
- *eventSeriesID*: an identifier for a set of events that are associated in some way, e.g., a monitoring series; may be a global unique identifier or an identifier specific to the series.
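As a rough illustration of how quantity, quantityType and samplingUnit are meant to combine into a complete measurement, here is a small Python sketch. The function name `describe_measurement` is hypothetical and the formatting rule is an assumption based only on the two examples given in the samplingUnit definition:

```python
def describe_measurement(quantity, quantity_type, sampling_unit=None):
    """Render the proposed terms as a phrase such as '9 individuals per day'.

    sampling_unit is optional here because the quantity definition also
    allows a percentage measure recorded for the sample as a whole.
    """
    phrase = f"{quantity} {quantity_type}"
    if sampling_unit:
        phrase += f" per {sampling_unit}"
    return phrase


print(describe_measurement(9, "individuals", "day"))
print(describe_measurement(4, "biomass-gm", "metre^2"))
```

The two calls reproduce the example phrases from the term definition, "9 individuals per day" and "4 biomass-gm per metre^2".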
Best regards,
Éamonn
[1] http://eubon.eu
[3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
[4] http://links.gbif.org/sample_data_model
[5] https://code.google.com/p/darwincore/issues/list
*Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama@gbif.org),*
*Senior Programme Officer for Interoperability,*
*Global Biodiversity Information Facility Secretariat,*
*Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK*
*Phone: +45 3532 1494; Fax: +45 3532 1480*
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
--
Anne E. Thessen, Ph.D.
The Data Detektiv, Owner and Founder
Ronin Institute, Research Scholar
443.225.9185
Thanks, Tim, and yes, DwC-A as a view (but not necessarily the primary archive) of data seems like the right point of view.
Ramona
------------------------------------------------------ Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 1:58 AM, Tim Robertson trobertson@gbif.org wrote:
Hi Ramona,
Those are good points, and I’d like to come back to the original thinking behind the DwC-A.
It was designed and intended to be a simple way of exposing a complete *view* of a dataset, primarily for building sophisticated indexes, inventories and allowing basic analytics (e.g. GBIF.org being one sophisticated index). We found that the star schema provided the flexibility to do a lot and, with the bundled metadata (e.g. EML), was enough to trace provenance and allow users to determine if the dataset might be fit for various uses. In many cases this represents the complete (i.e. lossless) view of a dataset.
What we are discussing here are far richer datasets, where shoe-horning content into the star schema becomes lossy for some, although we’re finding other cases where it is indeed lossless. I believe we should be looking to harmonise ontologies / models etc as you mention, but in parallel we should define one or more star schema views that can still be used for discovery / reporting / basic analytical purposes, and not long term archival of the dataset. The dataset would then have the canonical rich form and an additional DwC-A view. What I write here is applicable to all content types of course.
Please also note that many people put supplementary files in the DwC-A which are ignored by DwC-A readers but could be a way of keeping the richer view in the bundle. If you wish, you can describe those supplementary files in the EML document.
Does this gel with the view of others as well?
Cheers, Tim
On 27 Aug 2014, at 02:55, Ramona Walls rlwalls2008@gmail.com wrote:
I think Matt hit the nail on the head. Although Darwin Core can be used to exchange survey data, it lacks the semantics and structure necessary to archive the data without loss of information. I think the biodiversity community would be better served devoting energy to harmonizing existing technologies such as OGC, OBOE, and BCO, not to mention the many databases for storing plot or survey data. The goal should be to preserve the data in the most informative manner possible.
There is a strong case for wanting to search across all evidence for occurrences, including surveys and point occurrences, so I can see possible demand for a tool that would extract occurrences from survey data to a DwC archive. However, I am very concerned that making a DwC archive the primary exchange format for survey or plot data commits us to a path of losing information from the start, for all but the simplest sampling schemas.
Ramona
Message: 1 Date: Thu, 21 Aug 2014 18:52:06 -0800 From: Matt Jones jones@nceas.ucsb.edu Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data To: Éamonn Ó Tuama [GBIF] eotuama@gbif.org Cc: TDWG Content Mailing List tdwg-content@lists.tdwg.org Message-ID: <CAFSW8xkx7uRP9PC2g3= JT_VJanqujH8nPXoz8GXwh+JwKw5Ccw@mail.gmail.com> Content-Type: text/plain; charset="utf-8"
This proposal is treading on ground that is quite similar to other observations and measurements standards for data exchange that are already mature, in particular:
- OGC Observations and Measurements (http://www.opengeospatial.org/standards/om)
- Extensible Observation Ontology (OBOE; https://semtools.ecoinformatics.org/oboe)
The former is a standard and broadly deployed, whereas the latter is part of a research program in the use of ontologies for measurements. Through collaboration between the two projects, they've been modified to be reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an OWL-DL serialization. They largely express the same measurements and sampling model once one gets beyond the terminology differences.
So, I'm wondering if it makes much sense to extend Darwin Core, which is at heart an Occurrence exchange syntax, into this measurements area that is well represented by these other existing specifications? I'm curious to hear why people would even want to do this. And if we do go down this path, won't we just end up with a new syntax that does essentially what O&M and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, Éamonn Ó Tuama [GBIF] eotuama@gbif.org wrote:
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event core which is being used in the sense described by Anne and Rich. For any event, you can have a list of species in an Occurrence extension and for each species, you can include quantity and quantityType, e.g., biomass, etc. The proposed term eventSeriesID was intended for linking together related events, although it now looks like parentEventID might be a better, more flexible term. The measurementOrFact extension is a good fit for capturing environmental information relating to an event. See, e.g., the Gialova Lagoon brackish water invertebrate test data set [1] where a set of 18 environmental variables, including temp, pH, Rdx, particulate organic matter, dissolved oxygen, salinity, chlorophyll-a were measured for each sampling station-sampling period combination. An example mapping is:
| Id | measurementType | measurementValue | measurementUnit | measurementRemarks |
| --- | --- | --- | --- | --- |
| IA | Tmp (sed) | 21.5 | degree C | Tmp (sed): temperature at the bottom surface |
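A minimal sketch, in Python, of writing such a measurement-or-fact row to a delimited text file. The column names and the single row are taken from the example mapping above; the tab delimiter, the lower-case `id` column name and the helper name `to_tsv` are assumptions made only for illustration:

```python
import csv
import io

# Column names follow the example mapping; "id" links each row to its event.
FIELDS = [
    "id",
    "measurementType",
    "measurementValue",
    "measurementUnit",
    "measurementRemarks",
]

rows = [
    {
        "id": "IA",
        "measurementType": "Tmp (sed)",
        "measurementValue": "21.5",
        "measurementUnit": "degree C",
        "measurementRemarks": "Tmp (sed): temperature at the bottom surface",
    },
]


def to_tsv(rows):
    """Serialise measurement rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


tsv = to_tsv(rows)
print(tsv)
```

Each of the 18 environmental variables would become one more row with the same `id`, so the event itself is described only once.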
**Controlled vocabularies**
Ideally, the values for samplingUnit and quantityType would be selected from controlled vocabularies. This is, effectively, what we do by presenting a small list of values in a drop-down menu. The current values are what we derived for example data sets and discussion but they can undoubtedly be extended and improved.
We capture "bucket" type measures through a combination of samplingEffort, samplingGeometry and samplingUnit. For example, a pitfall trap (in a point location) left out for 16 days might have samplingEffort: 16, samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a shore survey might have samplingEffort: 3, samplingGeometry: area and samplingUnit: m^2.
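The two examples above could be sketched in Python as follows. The class name `SamplingDescription` and the validation step are hypothetical; the allowed geometry values are the four listed in the proposed samplingGeometry definition (point, line, area, volume):

```python
from dataclasses import dataclass

# Values listed in the proposed samplingGeometry definition.
ALLOWED_GEOMETRIES = {"point", "line", "area", "volume"}


@dataclass(frozen=True)
class SamplingDescription:
    """Hypothetical container for the three sampling terms."""

    sampling_effort: float
    sampling_geometry: str
    sampling_unit: str

    def __post_init__(self):
        # Reject geometries outside the small controlled list.
        if self.sampling_geometry not in ALLOWED_GEOMETRIES:
            raise ValueError(
                f"unknown samplingGeometry: {self.sampling_geometry!r}"
            )


# A pitfall trap in a point location left out for 16 days.
pitfall = SamplingDescription(16, "point", "day")
# Three m^2 quadrats in a shore survey.
quadrats = SamplingDescription(3, "area", "m^2")
print(pitfall)
print(quadrats)
```

A value such as "bucket" would be rejected as a geometry in this sketch, which is one way to see why the question of where bucket hours or trap nights belong needs an explicit answer.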
It would be very useful to see your compilation of scope, effort and completeness measures to see if we can express them in our model and/or if we need to reconsider our approach.
Éamonn
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
*From:* tdwg-content-bounces@lists.tdwg.org *On Behalf Of* Markus Döring *Sent:* 20 August 2014 23:47 *To:* Robert Guralnick *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Rob,
this proposal is really for monitoring surveys, not to be confused with material samples like environmental or tissue samples, which have a distinct new dwc class MaterialSample.
We tend to overload the term sampling a lot, and it helps to treat material samples differently from pure observational "sampling". That is why the existing Event class was used as the core and classic Occurrence records as extensions. A classic example is a vegetation survey where each plot represents an Event record and each recorded species in that plot will be an Occurrence extension record with a given quantity. Darwin Core already offers individualCount to specify quantity, but it is a very specific way of measuring "abundance" restricted to only some use cases. Abiotic measurements about the plot (e.g. soil type, pH, temperature) can be published using the measurements or facts extension linked to the Event core.
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick Robert.Guralnick@colorado.edu wrote:
Anne -- I don't know the answers! These are questions for Eamonn. I would presume that a sample could be a jumble of species or even just water or soil samples, and biomass would refer to that sample - but maybe that isn't a use case being considered? The examples given in the longer document all link an event_id to species name and some measure of quantity for that species (to the species, not an individual specimen), so I assume that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen annethessen@gmail.com wrote:
Hi Rob,
I would like to respond to your item number 2. From my perspective, I deal with lots of published descriptions of taxa. The text might say something like "I saw species A in the Chesapeake Bay, the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The biomass range obviously corresponds to at least three different occurrences, but how to divide the biomass data? I would love to be able to have an *event* to attach it all to. There are almost two different levels of events: a sampling event and a "study event". The "study event" would correspond to the type of event I would like to use in the above example. It may not be ideal, but for the old literature that might be the best we can do.
I have to admit that I don't know enough about trawl data to understand why an event core would be a problem. It seems that the trawl would be an event and each biomass measure (of each fish) would be attached to a separate occurrence which is attached to that event. Am I understanding this wrong?
btw - I found a workaround for the example I gave, so it's not impossible to model with the current structure....
Anne
It may be a sensible view for Darwin Core Archives and their intended use, but Tim's email suggests we should be putting the method of delivery ahead of the standard that delivers that content. If this was just about DwC-As, why not develop a survey extension that links each occurrence to information about the survey process using the existing star-schema methods we have in place? Why are we discussing adding terms to the Darwin Core or trying to fully reconfigure what we call an Event? That is what is on the table, not DwC-As and how we use them. Or am I missing something?
Best, Rob
On Wed, Aug 27, 2014 at 10:01 AM, Ramona Walls rlwalls2008@gmail.com wrote:
Thanks, Tim, and yes, DwC-A as a view (but not necessarily the primary archive) of data seems like the right point of view.
Ramona
On Wed, Aug 27, 2014 at 1:58 AM, Tim Robertson trobertson@gbif.org wrote:
Hi Ramona,
Those are good points, and I’d like to come back to the original thinking behind the DwC-A.
It was designed and intended to be a simple way of exposing a complete *view* of a dataset, primarily for building sophisticated indexes and inventories and for allowing basic analytics (GBIF.org being one such sophisticated index). We found that the star schema provided the flexibility to do a lot and, with the bundled metadata (e.g. EML), was enough to trace provenance and allow users to determine whether the dataset might be fit for various uses. In many cases this represents the complete (i.e. lossless) view of a dataset.
What we are discussing here are far richer datasets, where shoe-horning content into the star schema becomes lossy for some, although we’re finding other cases where it is indeed lossless. I believe we should be looking to harmonise ontologies, models, etc. as you mention, but in parallel we should define one or more star schema views that can still be used for discovery, reporting and basic analytical purposes, though not for long-term archival of the dataset. The dataset would then have the canonical rich form and an additional DwC-A view. What I write here is applicable to all content types, of course.
Please also note that many people put supplementary files in the DwC-A; these are ignored by DwC-A readers but could be a way of keeping the richer view in the bundle. If you wish, you can describe those supplementary files in the EML document.
Does this gel with the view of others as well?
Cheers, Tim
On 27 Aug 2014, at 02:55, Ramona Walls rlwalls2008@gmail.com wrote:
I think Matt hit the nail on the head. Although Darwin Core can be used to exchange survey data, it lacks the semantics and structure necessary to archive the data without loss of information. I think the biodiversity community would be better served devoting energy to harmonizing existing technologies such as OGC, OBOE, and BCO, not to mention the many databases for storing plot or survey data. The goal should be to preserve the data in the most informative manner possible.
There is a strong case for wanting to search across all evidence for occurrences, including surveys and point occurrences, so I can see possible demand for a tool that would extract occurrences from survey data to a DwC archive. However, I am very concerned that making a DwC archive the primary exchange format for survey or plot data commits us to a path of losing information from the start, for all but the simplest sampling schemas.
Ramona
Message: 1
Date: Thu, 21 Aug 2014 18:52:06 -0800
From: Matt Jones jones@nceas.ucsb.edu
Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
To: Éamonn Ó Tuama [GBIF] eotuama@gbif.org
Cc: TDWG Content Mailing List tdwg-content@lists.tdwg.org
Message-ID: <CAFSW8xkx7uRP9PC2g3= JT_VJanqujH8nPXoz8GXwh+JwKw5Ccw@mail.gmail.com>
This proposal is treading on ground that is quite similar to other observations and measurements standards for data exchange that are already mature, in particular:
- OGC Observations and Measurements (http://www.opengeospatial.org/standards/om)
- Extensible Observation Ontology (OBOE; https://semtools.ecoinformatics.org/oboe)
The former is a standard and broadly deployed, whereas the latter is part of a research program in the use of ontologies for measurements. Through collaboration between the two projects, they've been modified to be reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an OWL-DL serialization. They largely express the same measurements and sampling model once one gets beyond the terminology differences.
So, I'm wondering if it makes much sense to extend Darwin Core, which is at heart an Occurrence exchange syntax, into this measurements area that is well represented by these other existing specifications? I'm curious to hear why people would even want to do this. And if we do go down this path, won't we just end up with a new syntax that does essentially what O&M and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, Éamonn Ó Tuama [GBIF] eotuama@gbif.org wrote:
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event core, which is being used in the sense described by Anne and Rich. For any event, you can have a list of species in an Occurrence extension and, for each species, you can include quantity and quantityType, e.g., biomass, etc. The proposed term eventSeriesID was intended for linking together related events, although it now looks like parentEventID might be a better, more flexible term. The measurementOrFact extension is a good fit for capturing environmental information relating to an event. See, e.g., the Gialova Lagoon brackish water invertebrate test data set [1], where a set of 18 environmental variables, including temp, pH, Rdx, particulate organic matter, dissolved oxygen, salinity and chlorophyll-a, were measured for each sampling station-sampling period combination. An example mapping is:
| Id | measurementType | measurementValue | measurementUnit | measurementRemarks |
| --- | --- | --- | --- | --- |
| IA | Tmp (sed) | 21.5 | degree C | Tmp (sed): temperature at the bottom surface |
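The mapping above could be serialised roughly as follows. This is only a sketch: the column names come from the example row, while the helper names (`to_measurement_rows`, `write_tsv`) and the choice of a tab-delimited layout are assumptions made for illustration, not part of the IPT or of any Darwin Core library.

```python
import csv
import io

# Columns taken from the example mapping; "id" is the key that points
# back to the Event core record the measurement belongs to.
FIELDS = ["id", "measurementType", "measurementValue",
          "measurementUnit", "measurementRemarks"]


def to_measurement_rows(event_id, readings):
    """Build one extension row per (type, value, unit, remarks) reading."""
    return [
        {
            "id": event_id,
            "measurementType": mtype,
            "measurementValue": value,
            "measurementUnit": unit,
            "measurementRemarks": remarks,
        }
        for (mtype, value, unit, remarks) in readings
    ]


def write_tsv(rows):
    """Serialise rows as tab-delimited text (hypothetical file layout)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# The single row quoted in the message above.
rows = to_measurement_rows(
    "IA",
    [("Tmp (sed)", 21.5, "degree C",
      "Tmp (sed): temperature at the bottom surface")],
)
tsv = write_tsv(rows)
```

Each additional environmental variable for the same station and period would simply become another row with the same `id`.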
**Controlled vocabularies**
Ideally, the values for samplingUnit and quantityType would be selected from controlled vocabularies. This is, effectively, what we do by presenting a small list of values in a drop-down menu. The current values are what we derived for example data sets and discussion, but they can undoubtedly be extended and improved.
We capture "bucket" type measures through a combination of samplingEffort, samplingGeometry and samplingUnit. For example, a pitfall trap (in a point location) left out for 16 days might have samplingEffort: 16, samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a shore survey might have samplingEffort: 3, samplingGeometry: area and samplingUnit: m^2.
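The two examples above can be written down as small records. The sketch below is an illustration only: the class name `SamplingDescription` and its validation rules are assumptions, and the allowed geometry values are the point/line/area/volume set from the term proposal quoted in this thread.

```python
from dataclasses import dataclass

# Geometry values as listed in the proposal: point, line, area or volume.
GEOMETRIES = {"point", "line", "area", "volume"}


@dataclass(frozen=True)
class SamplingDescription:
    """Hypothetical container for the three effort-related terms."""
    sampling_effort: float
    sampling_geometry: str
    sampling_unit: str

    def __post_init__(self):
        # Reject values outside the small controlled list.
        if self.sampling_effort <= 0:
            raise ValueError("samplingEffort must be positive")
        if self.sampling_geometry not in GEOMETRIES:
            raise ValueError(
                f"unknown samplingGeometry: {self.sampling_geometry!r}")

    def as_terms(self):
        """Return the record keyed by the proposed term names."""
        return {
            "samplingEffort": self.sampling_effort,
            "samplingGeometry": self.sampling_geometry,
            "samplingUnit": self.sampling_unit,
        }


# A pitfall trap left out for 16 days, and three m^2 quadrats on a shore.
pitfall_trap = SamplingDescription(16, "point", "day")
shore_quadrats = SamplingDescription(3, "area", "m^2")
```

Checking the geometry against a fixed set mirrors the drop-down list described above; the unit is left free-form here because the thread does not settle on a closed vocabulary for it.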
It would be very useful to see your compilation of scope, effort and completeness measures to see if we can express them in our model and/or if we need to reconsider our approach.
Éamonn
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] *On Behalf Of* Markus Döring
*Sent:* 20 August 2014 23:47
*To:* Robert Guralnick
*Cc:* TDWG Content Mailing List
*Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Rob,
this proposal is really for monitoring surveys, not to be confused with material samples like environmental or tissue samples, which have a distinct new dwc class MaterialSample.
We tend to overload the term sampling a lot, and it helps to treat material samples differently from pure observational "sampling". That is why the existing Event class was used as the core and classic Occurrence records as extensions. A classic example is a vegetation survey where each plot represents an Event record and each recorded species in that plot will be an Occurrence extension record with a given quantity. Darwin Core already offers individualCount to specify quantity, but it is a very specific way of measuring "abundance" restricted to only some use cases. Abiotic measurements about the plot (e.g. soil type, pH, temperature) can be published using the measurements or facts extension linked to the Event core.
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick Robert.Guralnick@colorado.edu wrote:
Anne -- I don't know the answers! These are questions for Eamonn. I would presume that a sample could be a jumble of species or even just water or soil samples, and biomass would refer to that sample - but maybe that isn't a use case being considered? The examples given in the longer document all link an event_id to species name and some measure of quantity for that species (to the species, not an individual specimen), so I assume that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen annethessen@gmail.com wrote:
Hi Rob,
I would like to respond to your item number 2. From my perspective, I deal with lots of published descriptions of taxa. The text might say something like "I saw species A in the Chesapeake Bay, the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The biomass range obviously corresponds to at least three different occurrences, but how to divide the biomass data? I would love to be able to have an *event* to attach it all to. There are almost two different levels of events: a sampling event and a "study event". The "study event" would correspond to the type of event I would like to use in the above example. It may not be ideal, but for the old literature that might be the best we can do.
I have to admit that I don't know enough about trawl data to understand why an event core would be a problem. It seems that the trawl would be an event and each biomass measure (of each fish) would be attached to a separate occurrence which is attached to that event. Am I understanding this wrong?
btw - I found a workaround for the example I gave, so it's not impossible to model with the current structure....
Anne
On 8/20/2014 1:16 PM, Robert Guralnick wrote:
Éamonn et al. --- Thanks for the clarifications. I think these help a ton, but they raise a couple more questions for me.
- I am surprised that you plan to use the MeasurementOrFact extension in relation to the Event core, which seems like a novel (or perhaps awkward or unintended?) mechanism for capturing environmental data, while the same extension was not seen as relevant for describing samples. Can you explain more about the thinking there?
- There may be a subtle issue here in extending "Event" to be more what you call a "Sampling Event Core". My read of this is that Darwin Core serves as a way to deal with point occurrences and Event reflects the context of a single capture event (whether a single observation or a bulk sample capture). The changes recommended seem to dramatically extend and change that meaning. It's simply a question that I don't have an answer to, but is Darwin Core the right vehicle to start capturing repeated measures of biomass values from trawls? I don't have an answer but, man, terms like quantityType (as a property of occurrence?) give me pause.
- Is Sampling Unit a controlled vocabulary? For another project, I have looked through, and captured scope, effort and completeness measures from, a large number of published biotic area inventories. The vast majority of these are measured in units like bucket hours or trap nights. Is a "bucket" part of SamplingGeometry or Sampling Unit? I'd be happy to send along all the many examples of how biotic inventories of an area are completed, and perhaps it might be good to see how those might be represented using the terms you are proposing?
Best, Rob
On Wed, Aug 20, 2014 at 10:16 AM, Richard Pyle deepreef@bishopmuseum.org wrote:
Same here - Events are central to the work that we do.
Aloha,
Rich
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] *On Behalf Of* Anne Thessen
*Sent:* Wednesday, August 20, 2014 2:59 AM
*To:* tdwg-content@lists.tdwg.org
*Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hello,
I would just like to comment on *event core*. I've been doing a lot of work translating published data into Darwin Core. During that process I've wished several times that I could use Event as core. I am happy to hear about that proposed change. It will make it easier to model the data I am working with.
Anne
On 8/20/2014 7:04 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob,
Thank you for the feedback. I have tried to address the two main issues you raise below. At the outset, I would like to emphasise that much of this work is taking place in the context of the EU BON project, which includes a task on developing/enhancing tools and standards for data sharing with a particular focus on the IPT for publishing sample-based data. So, we were constrained by the need to publish sample-based data sets in the Darwin Core Archive format and to demonstrate practical application using a working prototype. When the discussion on the TDWG list faded out, we took it to our EU BON partners, whose requirements were essential input to further development. We recognise that these discussions took place away from TDWG (although the TDWG/EU BON contributors overlapped) and this is the reason we are presenting the outcomes here for further consideration.
**Event core**
As the SIGS report indicated, sample data can be modelled in Darwin Core Archives using either Occurrence or Event as core. This was the starting point for our evaluation, but as things progressed the data wrangling pushed the model back towards the Event core. We actually went through the exercise of mapping multiple test datasets in an iterative process spanning several months' work. In the end, we found that using an Event core better matched the typical sample data we were dealing with, allowing a measurement-or-fact extension to be included for the efficient expression of environmental information associated with the event. The choice comes down to an Occurrence core or an Event core + Occurrence extension. In both cases, the true observation records are Occurrences. The big difference is what type the core has and therefore to which kind of records you can attach further facts and extra information with DwC-A extensions. Many sampling datasets have very rich information about the site and event, so it is very natural to hang facts from an Event core. When picking the Occurrence core, those facts would have to be repeated for each and every occurrence record. Moreover, our approach doesn't stop anyone from using the Occurrence core if they so wish. This just provides a different option for datasets that better fit an Event core model.
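The repetition described here can be made concrete with a toy comparison. Everything in this sketch is hypothetical: the plot identifiers, dates, habitat strings and species names are placeholders invented for illustration, and `flatten_to_occurrence_core` is not a function of the IPT or of any DwC-A library.

```python
# Hypothetical Event core: one row per plot, site-level facts stored once.
events = {
    "plot-1": {"eventDate": "2014-08-20", "habitat": "placeholder habitat A"},
    "plot-2": {"eventDate": "2014-08-21", "habitat": "placeholder habitat B"},
}

# Hypothetical Occurrence extension: each row points at its event.
occurrences = [
    {"eventID": "plot-1", "scientificName": "species A", "quantity": 9},
    {"eventID": "plot-1", "scientificName": "species B", "quantity": 4},
    {"eventID": "plot-2", "scientificName": "species A", "quantity": 2},
]


def flatten_to_occurrence_core(events, occurrences):
    """Copy each event's facts onto every occurrence that refers to it."""
    return [{**occ, **events[occ["eventID"]]} for occ in occurrences]


flat = flatten_to_occurrence_core(events, occurrences)

# Event core + extension stores each habitat once per event;
# the Occurrence core layout stores it once per occurrence.
star_copies = len(events)
flat_copies = sum(1 for row in flat if "habitat" in row)
```

With two plots and three occurrences the site facts are held twice in the Event core layout and three times in the flattened one; the gap grows with the number of species recorded per plot.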
I want to stress that we are not building a "specific IPT version" to support an Event core but, rather, we adapted the IPT so that it can be configured to support any generic "core + extension" format to enable its use for exploration of more data formats. This is part of the core codebase and there were no custom forks of the IPT for this work. Our view at GBIF is that if there are significant numbers of data publishers who are keen to adopt, promote and use a (any) format, and the tools can be configured to do so, then we should support it and, if necessary, use a custom namespace.
**New terms around abundance**
Yes, the discussion on TDWG did fade out, but it was clear that the term "abundance" as recommended by the SIGS report (along with abundanceAsPercent) was confusing many when we were looking for term(s) that reported quantitative measures of organisms in a sample. It also became clear we would need to be able to state the type of quantity being measured. An alternative suggestion for using the MeasurementOrFact class was immediately shot down.
As some of our main use cases were coming from the EU BON project, discussion shifted to that forum and consensus formed about the currently proposed terms. It was within this group that the additional terms (samplingGeometry, samplingUnit, eventSeriesID) were proposed and where we began testing with sample data sets.
Best regards,
Éamonn
*From:* robgur@gmail.com [mailto:robgur@gmail.com] *On Behalf Of* Robert Guralnick
*Sent:* 19 August 2014 16:56
*To:* Éamonn Ó Tuama [GBIF]
*Cc:* TDWG Content Mailing List
*Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hi Éamonn --- I am curious about the outcomes presented in the SIGS paper, in particular, this portion of the paper:
"Solutions without introducing an event core in Darwin Core Archives: During the review of the solutions for the uses cases, it became apparent that either model could be applied to every use case. The core and extensions bore a complementary relationship and between them could express all the required information. The core simply provided the central anchor in the star schema from which to join the additional information. Therefore, using the Occurrence core, well established in the GBIF network through uptake of the IPT, seemed more appropriate than inventing CollectingEvent as an additional core type."
That SIGS paper has John Wieczorek and you both as authors, including many luminaries across the biodiversity standards spectrum. Given the above, it's curious to see the EventCore come back again, along with a specific IPT version to support it.
So I see two issues, conflated, in this post you just made. One is the need for an EventCore at all, and the nature of relating Event and Occurrence/Material Sample. The second is the introduction of new terms, which seemingly have arrived after debate on similar terms - but framed around abundance - stalled a year ago. To my mind, these both require some further discussion, because I don't (necessarily) see TDWG community coherence around either issue?
Best, Rob
On Tue, Aug 19, 2014 at 6:11 AM, Éamonn Ó Tuama [GBIF] eotuama@gbif.org wrote:
Dear All,
GBIF is committed to exploring ways in which the IPT and Darwin Core Archive format can be extended for publishing sample-based data sets. In association with the EU BON project [1], a customised version of the IPT [2] has been deployed to test this using a special type of Darwin Core Archive in which the core is an "Event" with associated taxon occurrences in an "Occurrence" extension.
The Darwin Core vocabulary already provides a rich set of terms, with many relevant for describing sample-based data. Synthesising several sources of input (GBIF organised workshop on sample data, May 2013 [3]; discussions on the TDWG mailing list in late 2013; internal discussion among EU BON project partners), five new terms relating to sample data were identified as essential. The complete model including these new terms is fully described with examples in the online document "Publishing sample data using the GBIF IPT" [4].
As a first step towards ratification, we would like to register the new terms in the DwC Google Code tracker [5] if there are no major objections on this list. The five terms are:
- *quantity*: the number or enumeration value of the quantityType (e.g., individuals, biomass, biovolume, BraunBlanquetScale) per samplingUnit or a percentage measure recorded for the sample.
- *quantityType*: the entity being referred to by quantity, e.g., individuals, biomass, %species, scale type.
- *samplingGeometry*: an indication of what kind of space was sampled; select from point, line, area or volume.
- *samplingUnit*: the unit of measurement used for reporting the quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3. It is combined with quantity and quantityType to provide the complete measurement, e.g., 9 individuals per day, 4 biomass-gm per metre^2.
- *eventSeriesID*: an identifier for a set of events that are associated in some way, e.g., a monitoring series; may be a global unique identifier or an identifier specific to the series.
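As a rough illustration of how quantity, quantityType and samplingUnit combine into "the complete measurement", consider the sketch below. The function name `describe_quantity` and the handling of a missing unit (used here for percentage-style values) are assumptions for the example, not part of the proposal.

```python
def describe_quantity(quantity, quantity_type, sampling_unit=None):
    """Render the three proposed terms as one readable measurement."""
    if sampling_unit is None:
        # Assumed fallback for values reported without a per-unit basis.
        return f"{quantity:g} {quantity_type}"
    return f"{quantity:g} {quantity_type} per {sampling_unit}"


# The two combinations given in the term definition above.
per_day = describe_quantity(9, "individuals", "day")
per_area = describe_quantity(4, "biomass-gm", "metre^2")
```

Here `per_day` evaluates to "9 individuals per day" and `per_area` to "4 biomass-gm per metre^2", matching the wording used in the samplingUnit definition.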
Best regards,
Éamonn
[1] http://eubon.eu
[3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
[4] http://links.gbif.org/sample_data_model
[5] https://code.google.com/p/darwincore/issues/list
*Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama@gbif.org),*
*Senior Programme Officer for Interoperability,*
*Global Biodiversity Information Facility Secretariat,*
*Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK*
*Phone: +45 3532 1494; Fax: +45 3532 1480*
I think it is important to consider the purpose of both Darwin Core and DwC archives in deciding whether or not to expand them, but we should use that consideration to address the question at hand, which is whether or not to add an Event core and additional properties to describe events.
Describing the exchange format before the semantics is the wrong way to go, given that we now have a framework for developing semantics. Expanding Darwin Core before we adequately model survey data is bound to lead to problems later, when we try to retro-fit the semantics to Darwin Core Event archives. This is exactly the problem we are running into now with Occurrence archives, and we have the opportunity to avoid it.
I suggest we first use existing ontologies to model survey data, then deal with if and how to exchange that information in DwC-A. This is what I was hinting at in my first email, but should have said more explicitly.
Ramona
------------------------------------------------------ Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 9:14 AM, Robert Guralnick < Robert.Guralnick@colorado.edu> wrote:
It may be a sensible view for Darwin Core Archives and their intended use, but Tim's email suggests we should be putting the method of delivery ahead of the standard that delivers that content. If this was just about DwC-As, why not develop a survey extension that links each occurrence to information about the survey process using the existing star-schema methods we have in place? Why are we discussing adding terms to the Darwin Core or trying to fully reconfigure what we call an Event? That is what is on the table, not DwC-As and how we use them. Or am I missing something?
Best, Rob
On Wed, Aug 27, 2014 at 10:01 AM, Ramona Walls rlwalls2008@gmail.com wrote:
Thanks, Tim, and yes, DwC-A as a view (but not necessarily the primary archive) of data seems like the right point of view.
Ramona
On Wed, Aug 27, 2014 at 1:58 AM, Tim Robertson trobertson@gbif.org wrote:
Hi Ramona,
Those are good points, and I’d like to come back to the original thinking behind the DwC-A.
It was designed and intended to be a simple way of exposing a complete *view* of a dataset, primarily for building sophisticated indexes and inventories and for allowing basic analytics (GBIF.org being one such sophisticated index). We found that the star schema provided the flexibility to do a lot and, with the bundled metadata (e.g. EML), was enough to trace provenance and allow users to determine whether the dataset might be fit for various uses. In many cases this represents the complete (i.e. lossless) view of a dataset.
What we are discussing here are far richer datasets, where shoe-horning content into the star schema becomes lossy for some, although we’re finding other cases where it is indeed lossless. I believe we should be looking to harmonise ontologies, models, etc. as you mention, but in parallel we should define one or more star schema views that can still be used for discovery, reporting and basic analytical purposes, though not for long-term archival of the dataset. The dataset would then have the canonical rich form and an additional DwC-A view. What I write here is applicable to all content types, of course.
Please also note that many people put supplementary files in the DwC-A; these are ignored by DwC-A readers but could be a way of keeping the richer view in the bundle. If you wish, you can describe those supplementary files in the EML document.
Does this gel with the view of others as well?
Cheers, Tim
On 27 Aug 2014, at 02:55, Ramona Walls rlwalls2008@gmail.com wrote:
I think Matt hit the nail on the head. Although Darwin Core can be used to exchange survey data, it lacks the semantics and structure necessary to archive the data without loss of information. I think the biodiversity community would be better served devoting energy to harmonizing existing technologies such as OGC, OBOE, and BCO, not to mention the many databases for storing plot or survey data. The goal should be to preserve the data in the most informative manner possible.
There is a strong case for wanting to search across all evidence for occurrences, including surveys and point occurrences, so I can see possible demand for a tool that would extract occurrences from survey data to a DwC archive. However, I am very concerned that making a DwC archive the primary exchange format for survey or plot data commits us to a path of losing information from the start, for all but the simplest sampling schemas.
Ramona
Message: 1
Date: Thu, 21 Aug 2014 18:52:06 -0800
From: Matt Jones jones@nceas.ucsb.edu
Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
To: Éamonn Ó Tuama [GBIF] eotuama@gbif.org
Cc: TDWG Content Mailing List tdwg-content@lists.tdwg.org
Message-ID: <CAFSW8xkx7uRP9PC2g3= JT_VJanqujH8nPXoz8GXwh+JwKw5Ccw@mail.gmail.com>
This proposal is treading on ground that is quite similar to other observations and measurements standards for data exchange that are already mature, in particular:
- OGC Observations and Measurements (http://www.opengeospatial.org/standards/om)
- Extensible Observation Ontology (OBOE; https://semtools.ecoinformatics.org/oboe)
The former is a standard and broadly deployed, whereas the latter is part of a research program in the use of ontologies for measurements. Through collaboration between the two projects, they've been modified to be reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an OWL-DL serialization. They largely express the same measurements and sampling model once one gets beyond the terminology differences.
So, I'm wondering if it makes much sense to extend Darwin Core, which is at heart an Occurrence exchange syntax, into this measurements area that is well represented by these other existing specifications? I'm curious to hear why people would even want to do this. And if we do go down this path, won't we just end up with a new syntax that does essentially what O&M and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, Éamonn Ó Tuama [GBIF] eotuama@gbif.org wrote:
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event core, which is being used in the sense described by Anne and Rich. For any event, you can have a list of species in an Occurrence extension and, for each species, you can include quantity and quantityType, e.g., biomass, etc. The proposed term eventSeriesID was intended for linking together related events, although it now looks like parentEventID might be a better, more flexible term. The measurementOrFact extension is a good fit for capturing environmental information relating to an event. See, e.g., the Gialova Lagoon brackish water invertebrate test data set [1], where a set of 18 environmental variables, including temp, pH, Rdx, particulate organic matter, dissolved oxygen, salinity and chlorophyll-a, were measured for each sampling station-sampling period combination. An example mapping is:
| Id | measurementType | measurementValue | measurementUnit | measurementRemarks |
| --- | --- | --- | --- | --- |
| IA | Tmp (sed) | 21.5 | degree C | Tmp (sed): temperature at the bottom surface |
**Controlled vocabularies**
Ideally, the values for samplingUnit and quantityType would be selected from controlled vocabularies. This is, effectively, what we do by presenting a small list of values in a drop-down menu. The current values are what we derived for example data sets and discussion, but they can undoubtedly be extended and improved.
We capture "bucket" type measures through a combination of samplingEffort, samplingGeometry and samplingUnit. For example, a pitfall trap (in a point location) left out for 16 days might have samplingEffort: 16, samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a shore survey might have samplingEffort: 3, samplingGeometry: area and samplingUnit: m^2.
It would be very useful to see your compilation of scope, effort and completeness measures to see if we can express them in our model and/or if we need to reconsider our approach.
Éamonn
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] *On Behalf Of* Markus Döring
*Sent:* 20 August 2014 23:47
*To:* Robert Guralnick
*Cc:* TDWG Content Mailing List
*Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Rob,
this proposal is really for monitoring surveys, not to be confused with material samples like environmental or tissue samples, which have a distinct new dwc class MaterialSample.
We tend to overload the term sampling a lot, and it helps to treat material samples differently from pure observational "sampling". That is why the existing Event class was used as the core and classic Occurrence records as extensions. A classic example is a vegetation survey where each plot represents an Event record and each recorded species in that plot will be an Occurrence extension record with a given quantity. Darwin Core already offers individualCount to specify quantity, but it is a very specific way of measuring "abundance" restricted to only some use cases. Abiotic measurements about the plot (e.g. soil type, pH, temperature) can be published using the measurements or facts extension linked to the Event core.
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick Robert.Guralnick@colorado.edu wrote:
Anne -- I don't know the answers! These are questions for Eamonn. I would presume that a sample could be a jumble of species or even just water or soil samples, and biomass would refer to that sample - but maybe that isn't a use case being considered? The examples given in the longer document all link an event_id to species name and some measure of quantity for that species (to the species, not an individual specimen), so I assume that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen annethessen@gmail.com wrote:
Hi Rob I would like to respond to your item number 2. From my perspective, I deal with lots of published descriptions of taxa. The text might say something like "I saw species A in the Chesapeake Bay, the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The biomass range obviously corresponds to at least three different occurrences, but how to divide the biomass data? I would love to be able to have an *event* to attach it all to. There are almost two different levels of events - a sampling event and a "study event". The "study event" would correspond to the type of event I would like to use in the above example. It may not be ideal, but for the old literature that might be the best we can do. I have to admit that I don't know enough about trawl data to understand why an event core would be a problem. It seems that the trawl would be an event and each biomass measure (of each fish) would be attached to a separate occurrence which is attached to that event. Am I understanding this wrong? btw - I found a workaround for the example I gave, so it's not impossible to model with the current structure.... Anne
On 8/20/2014 1:16 PM, Robert Guralnick wrote:
Éamonn et al. --- Thanks for the clarifications. I think these help a ton but they raise a couple more questions for me.
- I am surprised that you plan to use the MeasurementOrFact extension in relation to the Event core, which seems like a novel (or perhaps awkward or unintended?) mechanism for capturing environmental data, but the same extension was not seen as relevant for describing samples? Can you explain more about the thinking there?
- There may be a subtle issue here extending "Event" to be more what you call a "Sampling Event Core". My read of this is that Darwin Core serves as a way to deal with point occurrences and Event reflects the context of a single capture event (whether a single observation, or a bulk sample capture). The changes recommended seem to dramatically extend and change that meaning? It's simply a question that I don't have an answer to, but is Darwin Core the right vehicle to start capturing repeated measures of biomass values from trawls? I don't have an answer but man, terms like quantityType (as a property of occurrence?) give me pause.
- Is Sampling Unit a controlled vocabulary? For another project, I have looked through - and captured scope, effort and completeness measures from - a large number of published biotic area inventories. The vast majority of these are measured in units like bucket hours, or trap nights. Is a "bucket" part of SamplingGeometry or Sampling Unit? I'd be happy to send along all the many examples of how biotic inventories of an area are completed and perhaps it might be good to see how those might be represented using the terms you are proposing?
Best, Rob
On Wed, Aug 20, 2014 at 10:16 AM, Richard Pyle deepreef@bishopmuseum.org wrote:
Same here - Events are central to the work that we do.
Aloha,
Rich
*From:* tdwg-content-bounces@lists.tdwg.org [mailto: tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Anne Thessen *Sent:* Wednesday, August 20, 2014 2:59 AM *To:* tdwg-content@lists.tdwg.org
*Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hello I would just like to comment on *event core*. I've been doing a lot of work translating published data into Darwin Core. During that process I've wished several times that I could use Event as core. I am happy to hear about that proposed change. It will make it easier to model the data I am working with. Anne
On 8/20/2014 7:04 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob,
Thank you for the feedback. I have tried to address the two main issues you raise below. At the outset, I would like to emphasise that much of this work is taking place in the context of the EU BON project which includes a task on developing/enhancing tools and standards for data sharing with a particular focus on the IPT for publishing sample-based data. So, we were constrained by the need to publish sample-based data sets in the Darwin Core Archive format and to demonstrate practical application using a working prototype. When the discussion on the TDWG list faded out, we took it to our EU BON partners whose requirements were essential input to further development. We recognise that these discussions took place away from TDWG (although the TDWG/EU BON contributors overlapped) and this is the reason we are presenting the outcomes here for further consideration.
**Event core**
As the SIGS report indicated, sample data can be modelled in Darwin Core Archives using either Occurrence or Event as core. This was the starting point for our evaluation but as things progressed the data wrangling pushed the model back towards the Event core. We actually went through the exercise of mapping multiple test datasets in an iterative process spanning several months' work. In the end, we found that using an Event core better matched the typical sample data we were dealing with, allowing use of a measurement-or-fact extension to be included for the efficient expression of environmental information associated with the event. The choice comes down to an Occurrence core or an Event core + Occurrence extension. In both cases, the true observation records are Occurrences. The big difference is what type the core has and therefore to which kind of records you can attach further facts and extra information with DwC-A extensions. Many sampling datasets have very rich information about the site and event, so it is very natural to hang facts from an Event core. When picking the Occurrence core those facts would have to be repeated for each and every occurrence record. Moreover, our approach doesn't stop anyone from using the Occurrence core if they so wish. This just provides a different option for datasets that better fit an Event core model.
I want to stress that we are not building a "specific IPT version" to support an Event core but, rather, we adapted the IPT so that it can be configured to support any generic "core + extension" format to enable its use for exploration of more data formats. This is part of the core codebase and there were no custom forks of the IPT for this work. Our view at GBIF is that if there are significant numbers of data publishers who are keen to adopt, promote and use a (any) format, and the tools can be configured to do so, then we should support it, and, if necessary, use a custom namespace.
**New terms around abundance**
Yes, the discussion on TDWG did fade out but it was clear that the term "abundance" as recommended by the SIGS report (along with abundanceAsPercent) was confusing many when we were looking for term(s) that reported quantitative measures of organisms in a sample. It also became clear we would need to be able to state the type of quantity being measured. An alternative suggestion for using the MeasurementsOrFact class was immediately shot down.
As some of our main use cases were coming from the EU BON project, discussion shifted to that forum and consensus formed about the currently proposed terms. It was within this group that the additional terms (samplingGeometry, samplingUnit, eventSeriesID) were proposed and where we began testing with sample data sets.
Best regards,
Éamonn
*From:* robgur@gmail.com [mailto:robgur@gmail.com] *On Behalf Of *Robert Guralnick *Sent:* 19 August 2014 16:56 *To:* Éamonn Ó Tuama [GBIF] *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hi Éamonn --- I am curious about the outcomes presented in the SIGS paper, in particular, this portion of the paper:
"Solutions without introducing an event core in Darwin Core Archives: During the review of the solutions for the uses cases, it became apparent that either model could be applied to every use case. The core and extensions bore a complementary relationship and between them could express all the required information. The core simply provided the central anchor in the star schema from which to join the additional information. Therefore, using the Occurrence core, well established in the GBIF network through uptake of the IPT, seemed more appropriate than inventing CollectingEvent as an additional core type."
That SIGS paper has John Wieczorek and you both as authors, including many luminaries across the biodiversity standards spectrum. Given the above, it's curious to see the EventCore come back again, along with a specific IPT version to support it.
So I see two issues, conflated, in this post you just made. One is the need for an EventCore at all, and the nature of relating Event and Occurrence/Material Sample. The second is the introduction of new terms, which seemingly have arrived after debate on similar terms - but framed around abundance - stalled a year ago. To my mind, these both require some further discussion, because I don't (necessarily) see TDWG community coherence around either issue?
Best, Rob
On Tue, Aug 19, 2014 at 6:11 AM, Éamonn Ó Tuama [GBIF] eotuama@gbif.org wrote:
Dear All,
GBIF is committed to exploring ways in which the IPT and Darwin Core Archive format can be extended for publishing sample-based data sets. In association with the EU BON project [1], a customised version of the IPT [2] has been deployed to test this using a special type of Darwin Core Archive in which the core is an "Event" with associated taxon occurrences in an "Occurrence" extension.
The Darwin Core vocabulary already provides a rich set of terms, many of them relevant for describing sample-based data. Synthesising several sources of input (GBIF organised workshop on sample data, May 2013 [3]; discussions on the TDWG mailing list in late 2013; internal discussion among EU BON project partners), five new terms relating to sample data were identified as essential. The complete model including these new terms is fully described with examples in the online document "Publishing sample data using the GBIF IPT" [4].
As a first step towards ratification, we would like to register the new terms in the DwC Google Code tracker [5] if there are no major objections on this list. The five terms are:
- *quantity*: the number or enumeration value of the quantityType (e.g., individuals, biomass, biovolume, BraunBlanquetScale) per samplingUnit or a percentage measure recorded for the sample.
- *quantityType*: the entity being referred to by quantity, e.g., individuals, biomass, %species, scale type.
- *samplingGeometry*: an indication of what kind of space was sampled; select from point, line, area or volume.
- *samplingUnit*: the unit of measurement used for reporting the quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3. It is combined with quantity and quantityType to provide the complete measurement, e.g., 9 individuals per day, 4 biomass-gm per metre^2.
- *eventSeriesID*: an identifier for a set of events that are associated in some way, e.g., a monitoring series; may be a globally unique identifier or an identifier specific to the series.
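As a rough illustration of how the proposed terms combine into a complete measurement, here is a minimal Python sketch. The class name `OccurrenceQuantity` and its `describe` method are hypothetical and are not part of Darwin Core or the IPT; only the three term names and the two example values come from the definitions above.

```python
from dataclasses import dataclass


@dataclass
class OccurrenceQuantity:
    """Hypothetical container for the proposed quantity-related terms."""
    quantity: float       # proposed term: quantity
    quantity_type: str    # proposed term: quantityType
    sampling_unit: str    # proposed term: samplingUnit

    def describe(self) -> str:
        # Combine the three terms into one human-readable measurement.
        return f"{self.quantity:g} {self.quantity_type} per {self.sampling_unit}"


# The two examples given in the samplingUnit definition.
per_day = OccurrenceQuantity(9, "individuals", "day")
per_area = OccurrenceQuantity(4, "biomass-gm", "metre^2")

print(per_day.describe())
print(per_area.describe())
```

Keeping the three values in separate fields, rather than one free-text string, is what would allow them to be checked against controlled vocabularies later.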
Best regards,
Éamonn
[1] http://eubon.eu
[3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
[4] http://links.gbif.org/sample_data_model
[5] https://code.google.com/p/darwincore/issues/list
*Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama@gbif.org),*
*Senior Programme Officer for Interoperability,*
*Global Biodiversity Information Facility Secretariat,*
*Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK*
*Phone: +45 3532 1494; Fax: +45 3532 1480*
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
--
Anne E. Thessen, Ph.D.
The Data Detektiv, Owner and Founder
Ronin Institute, Research Scholar
443.225.9185
Hi Ramona & Rob,
The Event proposal does not try to change the semantics of an Event, it just uses the existing Darwin Core Event "class" at the core in Darwin Core archives. The actual change proposed is simply adding 3 new terms to the Event "group" to better share information about sampling methods & efforts, extending the existing limited capabilities of Darwin Core which already has the terms dwc:samplingProtocol and dwc:samplingEffort. It also proposes 2 new terms for dealing with quantity of Occurrences, something that has been discussed since 2012 now, when I had proposed a new abundance term [2].
In general, the application of Darwin Core is not at all limited to specimens and observations. It is already used for sharing taxonomic datasets, and its definition and goal are broad. Let me cite some of the introduction to Darwin Core [1]:
What is the Darwin Core? The Darwin Core is body of standards. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. The Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information.
Motivation: The Darwin Core standard was originally conceived to facilitate the discovery, retrieval, and integration of information about modern biological specimens, their spatiotemporal occurrence, and their supporting evidence housed in collections (physical or digital). The Darwin Core today is broader in scope and more versatile. It is meant to provide a stable standard reference for sharing information on biological diversity. As a glossary of terms, the Darwin Core is meant to provide stable semantic definitions with the goal of being maximally reusable in a variety of contexts.
Markus
[1] http://rs.tdwg.org/dwc/index.htm [2] https://code.google.com/p/darwincore/issues/detail?id=142
-- Markus Döring Software Developer Global Biodiversity Information Facility (GBIF) mdoering@gbif.org http://www.gbif.org
On 27 Aug 2014, at 18:57, Ramona Walls rlwalls2008@gmail.com wrote:
I think it is important to consider the purpose of both Darwin Core and DwC archives in deciding whether or not to expand them, but we should use that consideration to address the question at hand, which is whether or not to add an Event core and additional properties to describe events.
Describing the exchange format before the semantics is the wrong way to go, given that we now have a framework for developing semantics. Expanding Darwin Core before we adequately model survey data is bound to lead to problems later, when we try to retro-fit the semantics to Darwin Core Event archives. This is exactly the problem we are running into now with Occurrence archives, and we have the opportunity to avoid it.
I suggest we first use existing ontologies to model survey data, then deal with if and how to exchange that information in DwC-A. This is what I was hinting at in my first email, but should have said more explicitly.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 9:14 AM, Robert Guralnick Robert.Guralnick@colorado.edu wrote:
It may be a sensible view for Darwin Core Archives and their intended use, but Tim's email suggests we should be putting the method of delivery ahead of the standard that delivers that content. If this was just about DwC-As, why not develop a survey extension that links each occurrence to information about the survey process using the existing star-schema methods we have in place? Why are we discussing adding terms to the Darwin Core or trying to fully reconfigure what we call an Event? That is what is on the table, not DwC-As and how we use them. Or am I missing something?
Best, Rob
On Wed, Aug 27, 2014 at 10:01 AM, Ramona Walls rlwalls2008@gmail.com wrote: Thanks, Tim, and yes, DwC-A as a view (but not necessarily the primary archive) of data seems like the right point of view.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 1:58 AM, Tim Robertson trobertson@gbif.org wrote: Hi Ramona,
Those are good points, and I’d like to come back to the original thinking behind the DwC-A.
It was designed and intended to be a simple way of exposing a complete view of a dataset, primarily for building sophisticated indexes, inventories and allowing basic analytics (e.g. GBIF.org being one sophisticated index). We found that the star schema provided the flexibility to do a lot, and with the bundled metadata (e.g. EML) was enough to trace provenance and allow users to determine if the dataset might be fit for various uses. In many cases this represents the complete (e.g. lossless) view of a dataset.
What we are discussing here are far richer datasets, where shoe-horning content into the star schema becomes lossy for some, although we’re finding other cases where it is indeed lossless. I believe we should be looking to harmonise ontologies / models etc as you mention but in parallel we should define one or more star schema views that can still be used for discovery / reporting / basic analytical purpose, and not long term archival of the dataset. The dataset would then have the canonical rich form and an additional DwC-A view. What I write here is applicable to all content types of course.
Please also note that many people put supplementary files in the DwC-A which are ignored by DwC-A readers but could be a way of keeping the richer view in the bundle. If one wishes, one can describe those supplementary files in the EML document.
Does this gel with the view of others as well?
Cheers, Tim
On 27 Aug 2014, at 02:55, Ramona Walls rlwalls2008@gmail.com wrote:
I think Matt hit the nail on the head. Although Darwin Core can be used to exchange survey data, it lacks the semantics and structure necessary to archive the data without loss of information. I think the biodiversity community would be better served devoting energy to harmonizing existing technologies such as OGC, OBOE, and BCO, not to mention the many databases for storing plot or survey data. The goal should be to preserve the data in the most informative manner possible.
There is a strong case for wanting to search across all evidence for occurrences, including surveys and point occurrences, so I can see possible demand for a tool that would extract occurrences from survey data to a DwC archive. However, I am very concerned that making a DwC archive the primary exchange format for survey or plot data commits us to a path of losing information from the start, for all but the simplest sampling schemas.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Fri, Aug 22, 2014 at 3:00 AM, tdwg-content-request@lists.tdwg.org wrote:
Today's Topics:
- Re: Darwin Core: proposed news terms for expressing sample data (Matt Jones)
- Re: Darwin Core: proposed news terms for expressing sample data (Donald Hobern [GBIF])
Message: 1
Date: Thu, 21 Aug 2014 18:52:06 -0800
From: Matt Jones jones@nceas.ucsb.edu
Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
To: Éamonn Ó Tuama [GBIF] eotuama@gbif.org
Cc: TDWG Content Mailing List tdwg-content@lists.tdwg.org
Message-ID: CAFSW8xkx7uRP9PC2g3=JT_VJanqujH8nPXoz8GXwh+JwKw5Ccw@mail.gmail.com
Content-Type: text/plain; charset="utf-8"
This proposal is treading on ground that is quite similar to other observations and measurements standards for data exchange that are already mature, in particular:
- OGC Observations and Measurements (http://www.opengeospatial.org/standards/om)
- Extensible Observation Ontology (OBOE; https://semtools.ecoinformatics.org/oboe)
The former is a standard and broadly deployed, whereas the latter is part of a research program in the use of ontologies for measurements. Through collaboration between the two projects, they've been modified to be reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an OWL-DL serialization. They largely express the same measurements and sampling model once one gets beyond the terminology differences.
So, I'm wondering if it makes much sense to extend Darwin Core, which is at heart an Occurrence exchange syntax, into this measurements area that is well represented by these other existing specifications? I'm curious to hear why people would even want to do this. And if we do go down this path, won't we just end up with a new syntax that does essentially what O&M and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, Éamonn Ó Tuama [GBIF] eotuama@gbif.org wrote:
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event core which is being used in the sense described by Anne and Rich. For any event, you can have a list of species in an Occurrence extension and for each species, you can include quantity and quantityType, e.g., biomass, etc. The proposed term eventSeriesID was intended for linking together related events, although it now looks like parentEventID might be a better, more flexible term. The measurementOrFact extension is a good fit for capturing environmental information relating to an event. See, e.g., the Gialova Lagoon brackish water invertebrate test data set [1] where a set of 18 environmental variables, including temp, pH, Rdx, particulate organic matter, dissolved oxygen, salinity, chlorophyll-a were measured for each sampling station-sampling period combination. An example mapping is:
Id | measurementType | measurementValue | measurementUnit | measurementRemarks
IA | Tmp (sed) | 21.5 | degree C | Tmp (sed): temperature at the bottom surface
**Controlled vocabularies**
Ideally, the values for samplingUnit and quantityType would be selected from controlled vocabularies. This is, effectively, what we do by presenting a small list of values in a drop-down menu. The current values are what we derived for example data sets and discussion but they can undoubtedly be extended and improved.
We capture "bucket" type measures through a combination of samplingEffort, samplingGeometry and samplingUnit. For example, a pitfall trap (in a point location) left out for 16 days might have samplingEffort: 16, samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a shore survey might have samplingEffort: 3, samplingGeometry: area and samplingUnit: m^2.
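The two examples above can be written out as records, as in the following minimal Python sketch. The helper `sampling_event` is hypothetical; the set of allowed geometries is taken from the proposed samplingGeometry definition (point, line, area or volume), and the use of a plain dictionary is only an assumption made for illustration.

```python
# Allowed values as listed in the proposed samplingGeometry definition.
ALLOWED_GEOMETRIES = {"point", "line", "area", "volume"}


def sampling_event(effort, geometry, unit):
    """Build one event record from the three sampling terms (hypothetical helper)."""
    if geometry not in ALLOWED_GEOMETRIES:
        raise ValueError(f"unknown samplingGeometry: {geometry!r}")
    return {
        "samplingEffort": effort,
        "samplingGeometry": geometry,
        "samplingUnit": unit,
    }


# A pitfall trap at a point location, left out for 16 days.
pitfall = sampling_event(16, "point", "day")

# Three m^2 quadrats in a shore survey.
quadrats = sampling_event(3, "area", "m^2")

print(pitfall)
print(quadrats)
```

A "bucket hours" or "trap nights" style measure would need the same three fields, which is one way to check whether an existing inventory fits the proposed model.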
It would be very useful to see your compilation of scope, effort and completeness measures to see if we can express them in our model and/or if we need to reconsider our approach.
Éamonn
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
*From:* tdwg-content-bounces@lists.tdwg.org [mailto: tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Markus Döring *Sent:* 20 August 2014 23:47 *To:* Robert Guralnick
*Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Rob,
this proposal is really for monitoring surveys, not to be confused with material samples like environmental or tissue samples, which have a distinct new dwc class, MaterialSample.
We tend to overload the term sampling a lot, and it helps to treat material samples differently from pure observational "sampling". That is why the existing Event class was used as the core and classic Occurrence records as extensions. A classic example is a vegetation survey where each plot represents an Event record and each recorded species in that plot will be an Occurrence extension record with a given quantity. Darwin Core already offers individualCount to specify quantity, but it is a very specific way of measuring "abundance" restricted to only some use cases. Abiotic measurements about the plot (e.g. soil type, pH, temperature) can be published using the measurements or facts extension linked to the Event core.
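A minimal sketch of the vegetation survey layout described above, using plain Python lists in place of the archive's delimited files. The plot identifier, species counts and abiotic values are made up for illustration, and the helper `rows_for_event` is hypothetical. `eventID` is assumed here to be the key that links extension rows to their core record; `quantity` and `quantityType` are the proposed terms.

```python
# Core: one Event record per plot (values are illustrative only).
events = [
    {"eventID": "plot-1", "samplingProtocol": "vegetation plot"},
]

# Occurrence extension: one row per species recorded in the plot.
occurrences = [
    {"eventID": "plot-1", "scientificName": "Quercus robur",
     "quantity": 12, "quantityType": "individuals"},
    {"eventID": "plot-1", "scientificName": "Fagus sylvatica",
     "quantity": 5, "quantityType": "individuals"},
]

# Measurements or facts extension: abiotic data attached to the plot itself.
measurements = [
    {"eventID": "plot-1", "measurementType": "pH", "measurementValue": "6.5"},
    {"eventID": "plot-1", "measurementType": "soil type", "measurementValue": "loam"},
]


def rows_for_event(event_id, rows):
    """Return the extension rows that point at the given core Event."""
    return [row for row in rows if row["eventID"] == event_id]


for event in events:
    species = rows_for_event(event["eventID"], occurrences)
    facts = rows_for_event(event["eventID"], measurements)
    print(event["eventID"], len(species), "occurrences,", len(facts), "facts")
```

Both extensions hang off the same core record, which is the star schema arrangement the thread refers to.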
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick Robert.Guralnick@colorado.edu wrote:
Anne -- I don't know the answers! These are questions for Eamonn. I would presume that a sample could be a jumble of species or even just water or soil samples, and biomass would refer to that sample - but maybe that isn't a use case being considered? The examples given in the longer document all link an event_id to species name and some measure of quantity for that species (to the species, not an individual specimen), so I assume that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen annethessen@gmail.com wrote:
Hi Rob I would like to respond to your item number 2. From my perspective, I deal with lots of published descriptions of taxa. The text might say something like "I saw species A in the Chesapeake Bay, the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The biomass range obviously corresponds to at least three different occurrences, but how to divide the biomass data? I would love to be able to have an *event* to attach it all to. There are almost two different levels of events - a sampling event and a "study event". The "study event" would correspond to the type of event I would like to use in the above example. It may not be ideal, but for the old literature that might be the best we can do. I have to admit that I don't know enough about trawl data to understand why an event core would be a problem. It seems that the trawl would be an event and each biomass measure (of each fish) would be attached to a separate occurrence which is attached to that event. Am I understanding this wrong? btw - I found a workaround for the example I gave, so it's not impossible to model with the current structure.... Anne
On 8/20/2014 1:16 PM, Robert Guralnick wrote:
Éamonn et al. --- Thanks for the clarifications. I think these help a ton but they raise a couple more questions for me.
- I am surprised that you plan to use the MeasurementOrFact extension in relation to the Event core, which seems like a novel (or perhaps awkward or unintended?) mechanism for capturing environmental data, but the same extension was not seen as relevant for describing samples? Can you explain more about the thinking there?
- There may be a subtle issue here extending "Event" to be more what you call a "Sampling Event Core". My read of this is that Darwin Core serves as a way to deal with point occurrences and Event reflects the context of a single capture event (whether a single observation, or a bulk sample capture). The changes recommended seem to dramatically extend and change that meaning? It's simply a question that I don't have an answer to, but is Darwin Core the right vehicle to start capturing repeated measures of biomass values from trawls? I don't have an answer but man, terms like quantityType (as a property of occurrence?) give me pause.
- Is Sampling Unit a controlled vocabulary? For another project, I have looked through - and captured scope, effort and completeness measures from - a large number of published biotic area inventories. The vast majority of these are measured in units like bucket hours, or trap nights. Is a "bucket" part of SamplingGeometry or Sampling Unit? I'd be happy to send along all the many examples of how biotic inventories of an area are completed and perhaps it might be good to see how those might be represented using the terms you are proposing?
Best, Rob
On Wed, Aug 20, 2014 at 10:16 AM, Richard Pyle deepreef@bishopmuseum.org wrote:
Same here - Events are central to the work that we do.
Aloha,
Rich
*From:* tdwg-content-bounces@lists.tdwg.org [mailto: tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Anne Thessen *Sent:* Wednesday, August 20, 2014 2:59 AM *To:* tdwg-content@lists.tdwg.org
*Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hello I would just like to comment on *event core*. I've been doing a lot of work translating published data into Darwin Core. During that process I've wished several times that I could use Event as core. I am happy to hear about that proposed change. It will make it easier to model the data I am working with. Anne
On 8/20/2014 7:04 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob,
Thank you for the feedback. I have tried to address the two main issues you raise below. At the outset, I would like to emphasise that much of this work is taking place in the context of the EU BON project which includes a task on developing/enhancing tools and standards for data sharing with a particular focus on the IPT for publishing sample-based data. So, we were constrained by the need to publish sample-based data sets in the Darwin Core Archive format and to demonstrate practical application using a working prototype. When the discussion on the TDWG list faded out, we took it to our EU BON partners whose requirements were essential input to further development. We recognise that these discussions took place away from TDWG (although the TDWG/EU BON contributors overlapped) and this is the reason we are presenting the outcomes here for further consideration.
**Event core**
As the SIGS report indicated, sample data can be modelled in Darwin Core Archives using either Occurrence or Event as core. This was the starting point for our evaluation but as things progressed the data wrangling pushed the model back towards the Event core. We actually went through the exercise of mapping multiple test datasets in an iterative process spanning several months' work. In the end, we found that using an Event core better matched the typical sample data we were dealing with, allowing use of a measurement-or-fact extension to be included for the efficient expression of environmental information associated with the event. The choice comes down to an Occurrence core or an Event core + Occurrence extension. In both cases, the true observation records are Occurrences. The big difference is what type the core has and therefore to which kind of records you can attach further facts and extra information with DwC-A extensions. Many sampling datasets have very rich information about the site and event, so it is very natural to hang facts from an Event core. When picking the Occurrence core those facts would have to be repeated for each and every occurrence record. Moreover, our approach doesn't stop anyone from using the Occurrence core if they so wish. This just provides a different option for datasets that better fit an Event core model.
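To make the repetition argument concrete, here is a back-of-the-envelope Python sketch that counts fact rows under each layout. The function `fact_rows` is hypothetical, it assumes every event has the same number of occurrences and facts, and the event and occurrence counts are invented; only the figure of 18 environmental variables echoes the Gialova Lagoon test data set mentioned elsewhere in this thread.

```python
def fact_rows(n_events: int, occurrences_per_event: int, facts_per_event: int) -> dict:
    """Count measurement-or-fact rows needed under each core type.

    Simplifying assumption: every event has the same number of
    occurrences and the same number of site/event facts.
    """
    return {
        # Event core: each fact is stored once, against its event.
        "event_core": n_events * facts_per_event,
        # Occurrence core: each fact is repeated for every occurrence.
        "occurrence_core": n_events * occurrences_per_event * facts_per_event,
    }


# Illustrative numbers: 10 events, 25 species each, 18 environmental variables.
rows = fact_rows(n_events=10, occurrences_per_event=25, facts_per_event=18)
print(rows)  # {'event_core': 180, 'occurrence_core': 4500}
```

The ratio between the two counts is simply the number of occurrences per event, so the saving grows with the species richness of each sample.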
I want to stress that we are not building a "specific IPT version" to support an Event core but, rather, we adapted the IPT so that it can be configured to support any generic "core + extension" format to enable its use for exploration of more data formats. This is part of the core codebase and there were no custom forks of the IPT for this work. Our view at GBIF is that if there are significant numbers of data publishers who are keen to adopt, promote and use a (any) format, and the tools can be configured to do so, then we should support it, and, if necessary, use a custom namespace.
**New terms around abundance**
Yes, the discussion on TDWG did fade out but it was clear that the term "abundance" as recommended by the SIGS report (along with abundanceAsPercent) was confusing many when we were looking for term(s) that reported quantitative measures of organisms in a sample. It also became clear we would need to be able to state the type of quantity being measured. An alternative suggestion for using the MeasurementsOrFact class was immediately shot down.
As some of our main use cases were coming from the EU BON project, discussion shifted to that forum and consensus formed about the currently proposed terms. It was within this group that the additional terms (samplingGeometry, samplingUnit, eventSeriesID) were proposed and where we began testing with sample data sets.
Best regards,
Éamonn
*From:* robgur@gmail.com *On Behalf Of* Robert Guralnick *Sent:* 19 August 2014 16:56 *To:* Éamonn Ó Tuama [GBIF] *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hi Éamonn --- I am curious about the outcomes presented in the SIGS paper, in particular, this portion of the paper:
"Solutions without introducing an event core in Darwin Core Archives: During the review of the solutions for the uses cases, it became apparent that either model could be applied to every use case. The core and extensions bore a complementary relationship and between them could express all the required information. The core simply provided the central anchor in the star schema from which to join the additional information. Therefore, using the Occurrence core, well established in the GBIF network through uptake of the IPT, seemed more appropriate than inventing CollectingEvent as an additional core type."
That SIGS paper has John Wieczorek and you both as authors, including many luminaries across the biodiversity standards spectrum. Given the above, it's curious to see the EventCore come back again, along with a specific IPT version to support it.
So I see two issues, conflated, in this post you just made. One is the need for an EventCore at all, and the nature of relating Event and Occurrence/Material Sample. The second is the introduction of new terms, which seemingly have arrived after debate on similar terms - but framed around abundance - stalled a year ago. To my mind, these both require some further discussion, because I don't (necessarily) see TDWG community coherence around either issue?
Best, Rob
On Tue, Aug 19, 2014 at 6:11 AM, Éamonn Ó Tuama [GBIF] eotuama@gbif.org wrote:
Dear All,
GBIF is committed to exploring ways in which the IPT and Darwin Core Archive format can be extended for publishing sample-based data sets. In association with the EU BON project [1], a customised version of the IPT [2] has been deployed to test this using a special type of Darwin Core Archive in which the core is an "Event" with associated taxon occurrences in an "Occurrence" extension.
The Darwin Core vocabulary already provides a rich set of terms, many of them relevant for describing sample-based data. Synthesising several sources of input (GBIF organised workshop on sample data, May 2013 [3]; discussions on the TDWG mailing list in late 2013; internal discussion among EU BON project partners), five new terms relating to sample data were identified as essential. The complete model including these new terms is fully described with examples in the online document "Publishing sample data using the GBIF IPT" [4].
As a first step towards ratification, we would like to register the new terms in the DwC Google Code tracker [5] if there are no major objections on this list. The five terms are:
*quantity*: the number or enumeration value of the quantityType (e.g., individuals, biomass, biovolume, BraunBlanquetScale) per samplingUnit or a percentage measure recorded for the sample.
*quantityType*: the entity being referred to by quantity, e.g., individuals, biomass, %species, scale type.
*samplingGeometry*: an indication of what kind of space was sampled; select from point, line, area or volume.
*samplingUnit*: the unit of measurement used for reporting the quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3. It is combined with quantity and quantityType to provide the complete measurement, e.g., 9 individuals per day, 4 biomass-gm per metre^2.
*eventSeriesID*: an identifier for a set of events that are associated in some way, e.g., a monitoring series; may be a global unique identifier or an identifier specific to the series.
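As a rough illustration of how the proposed terms combine, the following Python sketch builds the "complete measurement" string from quantity, quantityType and samplingUnit, and checks samplingGeometry against the four values listed above. The function names are hypothetical and are not part of Darwin Core or the IPT; only the term names and the example values come from the definitions.

```python
# The four samplingGeometry values named in the proposal.
ALLOWED_GEOMETRIES = {"point", "line", "area", "volume"}


def describe_quantity(quantity, quantity_type, sampling_unit=None):
    """Combine quantity, quantityType and samplingUnit into one readable measure.

    When no samplingUnit is given (e.g. a percentage recorded for the sample),
    only the quantity and its type are reported.
    """
    text = f"{quantity} {quantity_type}"
    if sampling_unit:
        return f"{text} per {sampling_unit}"
    return text


def check_geometry(value):
    """Return the value unchanged if it is one of the proposed geometries."""
    if value not in ALLOWED_GEOMETRIES:
        raise ValueError(f"samplingGeometry must be one of {sorted(ALLOWED_GEOMETRIES)}")
    return value


print(describe_quantity(9, "individuals", "day"))
print(describe_quantity(4, "biomass-gm", "metre^2"))
print(check_geometry("area"))
```

The two calls reproduce the examples given in the samplingUnit definition ("9 individuals per day", "4 biomass-gm per metre^2").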
Best regards,
Éamonn
[1] http://eubon.eu
[3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
[4] http://links.gbif.org/sample_data_model
[5] https://code.google.com/p/darwincore/issues/list
*Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama@gbif.org),*
*Senior Programme Officer for Interoperability,*
*Global Biodiversity Information Facility Secretariat,*
*Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK*
*Phone: +45 3532 1494; Fax: +45 3532 1480*
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
--
Anne E. Thessen, Ph.D.
The Data Detektiv, Owner and Founder
Ronin Institute, Research Scholar
443.225.9185
Thanks, Ramona and Rob.
I'd like to add a few points following on Markus's reply.
I think your pressing of the need for a robust semantic model for biodiversity sample/survey data is incontestable: we do need one, and it should enable rich data integration once it is defined and the tools and data standards to support it become available. However, GBIF is faced with the immediate task of making sample-based data discoverable and accessible using its current ecosystem of tools (IPT) and exchange standards (DwC; EML). Waiting for a functional, implementable semantic model and the tools and support services for it is just not an option for us right now.
We have already spent considerable time in analysing the merits of Occurrence core vs Event core and have opted for an Event core for reasons previously given. I don't believe we are trying to reconfigure Event (an action that occurs at a place and during a period of time) and, regardless of whether we use Occurrence or Event, the need for some additional terms arises (e.g., quantity, quantityType, samplingGeometry, samplingUnit). Once the BCO model is available for uptake, it should be possible to develop a mapping between it and the simple DwC sample model.
So GBIF's stance is that we need to take a two-pronged approach by exploring how the IPT and DwC-A can be adapted for publishing sample-based data in the near term while supporting the work of TDWG and groups such as the BCO in advancing biodiversity informatics. GBIF has already engaged in the work of the BCO and will continue to do so.
Éamonn
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Markus Döring Sent: 28 August 2014 12:44 To: Ramona Walls Cc: TDWG Content Mailing List Subject: Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6
Hi Ramona & Rob,
The Event proposal does not try to change the semantics of an Event, it just uses the existing Darwin Core Event "class" at the core in Darwin Core archives. The actual change proposed is simply adding 3 new terms to the Event "group" to better share information about sampling methods & efforts, extending the existing limited capabilities of Darwin Core which already has the terms dwc:samplingProtocol and dwc:samplingEffort. It also proposes 2 new terms for dealing with quantity of Occurrences, something that has been discussed since 2012 now, when I had proposed a new abundance term [2].
In general, the application of Darwin Core is not at all limited to specimens and observations. It is used for sharing taxonomic datasets already and its definition and goal are broad. Let me cite some of the introduction to Darwin Core [1]:
What is the Darwin Core? The Darwin Core is body of standards. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. The Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information.
Motivation: The Darwin Core standard was originally conceived to facilitate the discovery, retrieval, and integration of information about modern biological specimens, their spatiotemporal occurrence, and their supporting evidence housed in collections (physical or digital). The Darwin Core today is broader in scope and more versatile. It is meant to provide a stable standard reference for sharing information on biological diversity. As a glossary of terms, the Darwin Core is meant to provide stable semantic definitions with the goal of being maximally reusable in a variety of contexts.
Markus
[1] http://rs.tdwg.org/dwc/index.htm [2] https://code.google.com/p/darwincore/issues/detail?id=142
-- Markus Döring Software Developer Global Biodiversity Information Facility (GBIF) mdoering@gbif.org http://www.gbif.org
On 27 Aug 2014, at 18:57, Ramona Walls rlwalls2008@gmail.com wrote:
I think it is important to consider the purpose of both Darwin Core and DwC archives in deciding whether or not to expand them, but we should use that consideration to address the question at hand, which is whether or not to add an Event core and additional properties to describe events.
Describing the exchange format before the semantics is the wrong way to go, given that we now have a framework for developing semantics. Expanding Darwin Core before we adequately model survey data is bound to lead to problems later, when we try to retro-fit the semantics to Darwin Core Event archives. This is exactly the problem we are running into now with Occurrence archives, and we have the opportunity to avoid it.
I suggest we first use existing ontologies to model survey data, then deal with if and how to exchange that information in DwC-A. This is what I was hinting at in my first email, but should have said more explicitly.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 9:14 AM, Robert Guralnick Robert.Guralnick@colorado.edu wrote:
It may be a sensible view for Darwin Core Archives and their intended use, but Tim's email suggests we should be putting the method of delivery ahead of the standard that delivers that content. If this was just about DwC-As, why not develop a survey extension that links each occurrence to information about the survey process using the existing star-schema methods we have in place? Why are we discussing adding terms to the Darwin Core or trying to fully reconfigure what we call an Event? That is what is on the table, not DwC-As and how we use them. Or am I missing something?
Best, Rob
On Wed, Aug 27, 2014 at 10:01 AM, Ramona Walls rlwalls2008@gmail.com wrote:
Thanks, Tim, and yes, DwC-A as a view (but not necessarily the primary archive) of data seems like the right point of view.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 1:58 AM, Tim Robertson trobertson@gbif.org wrote:
Hi Ramona,
Those are good points, and I'd like to come back to the original thinking behind the DwC-A.
It was designed and intended to be a simple way of exposing a complete view of a dataset, primarily for building sophisticated indexes, inventories and allowing basic analytics (e.g. GBIF.org being one sophisticated index). We found that the star schema provided the flexibility to do a lot, and with the bundled metadata (e.g. EML) was enough to trace provenance and allow users to determine if the dataset might be fit for various uses. In many cases this represents the complete (e.g. lossless) view of a dataset.
What we are discussing here are far richer datasets, where shoe-horning content into the star schema becomes lossy for some, although we're finding other cases where it is indeed lossless. I believe we should be looking to harmonise ontologies / models etc as you mention but in parallel we should define one or more star schema views that can still be used for discovery / reporting / basic analytical purposes, and not long term archival of the dataset. The dataset would then have the canonical rich form and an additional DwC-A view. What I write here is applicable to all content types of course.
Please also note that many people put supplementary files in the DwC-A which are ignored by DwC-A readers but could be a way of keeping the richer view in the bundle. If you wish, you can describe those supplementary files in the EML document.
Does this gel with the view of others as well?
Cheers, Tim
On 27 Aug 2014, at 02:55, Ramona Walls rlwalls2008@gmail.com wrote:
I think Matt hit the nail on the head. Although Darwin Core can be used to exchange survey data, it lacks the semantics and structure necessary to archive the data without loss of information. I think the biodiversity community would be better served devoting energy to harmonizing existing technologies such as OGC, OBOE, and BCO, not to mention the many databases for storing plot or survey data. The goal should be to preserve the data in the most informative manner possible.
There is a strong case for wanting to search across all evidence for occurrences, including surveys and point occurrences, so I can see possible demand for a tool that would extract occurrences from survey data to a DwC archive. However, I am very concerned that making a DwC archive the primary exchange format for survey or plot data commits us to a path of losing information from the start, for all but the simplest sampling schemas.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
Message: 1 Date: Thu, 21 Aug 2014 18:52:06 -0800 From: Matt Jones jones@nceas.ucsb.edu Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data To: Éamonn Ó Tuama [GBIF] eotuama@gbif.org Cc: TDWG Content Mailing List tdwg-content@lists.tdwg.org
This proposal is treading on ground that is quite similar to other observations and measurements standards for data exchange that are already mature, in particular:
- OGC Observations and Measurements (http://www.opengeospatial.org/standards/om)
- Extensible Observation Ontology (OBOE; https://semtools.ecoinformatics.org/oboe)
The former is a standard and broadly deployed, whereas the latter is part of a research program in the use of ontologies for measurements. Through collaboration between the two projects, they've been modified to be reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an OWL-DL serialization. They largely express the same measurements and sampling model once one gets beyond the terminology differences.
So, I'm wondering if it makes much sense to extend Darwin Core, which is at heart an Occurrence exchange syntax, into this measurements area that is well represented by these other existing specifications? I'm curious to hear why people would even want to do this. And if we do go down this path, won't we just end up with a new syntax that does essentially what O&M and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event core which is being used in the sense described by Anne and Rich. For any event, you can have a list of species in an Occurrence extension and for each species, you can include quantity and quantityType, e.g., biomass, etc. The proposed term eventSeriesID was intended for linking together related events, although it now looks like parentEventID might be a better, more flexible term. The measurementOrFact extension is a good fit for capturing environmental information relating to an event. See, e.g., the Gialova Lagoon brackish water invertebrate test data set [1] where a set of 18 environmental variables, including temp, pH, Rdx, particulate organic matter, dissolved oxygen, salinity, chlorophyll-a were measured for each sampling station-sampling period combination. An example mapping is:
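| Id | measurementType | measurementValue | measurementUnit | measurementRemarks |
|----|-----------------|------------------|-----------------|--------------------|
| IA | Tmp (sed) | 21.5 | degree C | Tmp (sed): temperature at the bottom surface |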
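A mapping like this amounts to turning one wide row of station readings into one measurementOrFact row per variable. The Python sketch below shows that reshaping under stated assumptions: the helper names are hypothetical, the column is written as `id` rather than `Id`, and the pH reading is a made-up placeholder; only the temperature row comes from the example above.

```python
import csv
import io

# Column order follows the example mapping above.
FIELDS = [
    "id",
    "measurementType",
    "measurementValue",
    "measurementUnit",
    "measurementRemarks",
]


def to_measurement_rows(event_id, readings):
    """Turn {type: (value, unit, remarks)} into one row per measurement."""
    return [
        {
            "id": event_id,
            "measurementType": measurement_type,
            "measurementValue": str(value),
            "measurementUnit": unit,
            "measurementRemarks": remarks,
        }
        for measurement_type, (value, unit, remarks) in readings.items()
    ]


def to_tsv(rows):
    """Serialise the rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


readings = {
    # Taken from the example mapping in the message.
    "Tmp (sed)": (21.5, "degree C", "Tmp (sed): temperature at the bottom surface"),
    # Hypothetical placeholder reading, for illustration only.
    "pH": (7.9, "", "hypothetical value, not from the data set"),
}

rows = to_measurement_rows("IA", readings)
print(to_tsv(rows))
```

Each of the 18 environmental variables would contribute one such row per sampling station-sampling period combination, all keyed to the same event identifier.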
**Controlled vocabularies**
Ideally, the values for samplingUnit and quantityType would be selected from controlled vocabularies. This is, effectively, what we do by presenting a small list of values in a drop-down menu. The current values are what we derived for example data sets and discussion but they can undoubtedly be extended and improved.
We capture "bucket" type measures through a combination of samplingEffort, samplingGeometry and samplingUnit. For example, a pitfall trap (in a point location) left out for 16 days might have samplingEffort: 16, samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a shore survey might have samplingEffort: 3, samplingGeometry: area and samplingUnit: m^2.
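The two examples can be written down as small records, as in the Python sketch below. The `Sampling` class and the `per_unit_effort` helper are hypothetical names, and the catch counts are invented for illustration; dividing a count by the effort is one possible use of the triple, not something the proposal itself prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sampling:
    """The effort/geometry/unit triple used to describe a sampling event."""

    samplingEffort: float
    samplingGeometry: str
    samplingUnit: str


def per_unit_effort(count, sampling):
    """Normalise a raw count by the sampling effort (assumed usage)."""
    if sampling.samplingEffort <= 0:
        raise ValueError("samplingEffort must be positive")
    return count / sampling.samplingEffort


# The two examples from the message.
pitfall = Sampling(samplingEffort=16, samplingGeometry="point", samplingUnit="day")
quadrats = Sampling(samplingEffort=3, samplingGeometry="area", samplingUnit="m^2")

# Hypothetical counts: 48 individuals in the trap, 12 across the quadrats.
print(per_unit_effort(48, pitfall), "individuals per", pitfall.samplingUnit)
print(per_unit_effort(12, quadrats), "individuals per", quadrats.samplingUnit)
```

Compound units such as "bucket hours" or "trap nights" do not fit a single samplingUnit value in this sketch, which is the kind of case the controlled vocabulary would need to address.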
It would be very useful to see your compilation of scope, effort and completeness measures to see if we can express them in our model and/or if we need to reconsider our approach.
Éamonn
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
*From:* tdwg-content-bounces@lists.tdwg.org *On Behalf Of* Markus Döring *Sent:* 20 August 2014 23:47 *To:* Robert Guralnick *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Rob,
this proposal is really for monitoring surveys, not to be confused with material samples such as environmental or tissue samples, which have a distinct new dwc class, MaterialSample.
We tend to overload the term sampling a lot, and it helps to treat material samples differently from pure observational "sampling". That is why the existing Event class was used as the core and classic Occurrence records as extensions. A classic example is a vegetation survey where each plot represents an Event record and each recorded species in that plot will be an Occurrence extension record with a given quantity. Darwin Core already offers individualCount to specify quantity, but it is a very specific way of measuring "abundance" restricted to only some use cases. Abiotic measurements about the plot (e.g. soil type, pH, temperature) can be published using the measurements or facts extension linked to the Event core.
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick wrote:
Anne -- I don't know the answers! These are questions for Éamonn. I would presume that a sample could be a jumble of species or even just water or soil samples, and biomass would refer to that sample - but maybe that isn't a use case being considered? The examples given in the longer document all link an event_id to species name and some measure of quantity for that species (to the species, not an individual specimen), so I assume that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen annethessen@gmail.com wrote:
Hi Rob
I would like to respond to your item number 2.
From my perspective, I deal with lots of published descriptions of taxa. The text might say something like "I saw species A in the Chesapeake Bay, the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The biomass range obviously corresponds to at least three different occurrences, but how to divide the biomass data? I would love to be able to have an *event* to attach it all to. There are almost two different levels of events - a sampling event and a "study event". The "study event" would correspond to the type of event I would like to use in the above example. It may not be ideal, but for the old literature that might be the best we can do.
I have to admit that I don't know enough about trawl data to understand why an event core would be a problem. It seems that the trawl would be an event and each biomass measure (of each fish) would be attached to a separate occurrence which is attached to that event. Am I understanding this wrong?
btw - I found a workaround for the example I gave, so it's not impossible to model with the current structure....
Anne
On 8/20/2014 1:16 PM, Robert Guralnick wrote:
Éamonn et al. --- Thanks for the clarifications. I think these help a ton but it raises a couple more questions for me.
- I am surprised that you plan to use the MeasurementOrFact extension in relation to the Event core, which seems like a novel (or perhaps awkward or unintended?) mechanism for capturing environmental data, while the same extension was not seen as relevant for describing samples. Can you explain more about the thinking there?
- There may be a subtle issue here in extending "Event" to be more what you call a "Sampling Event Core". My read of this is that Darwin Core serves as a way to deal with point occurrences and Event reflects the context of a single capture event (whether a single observation, or a bulk sample capture). The changes recommended seem to dramatically extend and change that meaning? It's simply a question that I don't have an answer to, but is Darwin Core the right vehicle to start capturing repeated measures of biomass values from trawls? I don't have an answer but man, terms like quantityType (as a property of occurrence?) give me pause.
- Is Sampling Unit a controlled vocabulary? For another project, I have looked through - and captured scope, effort and completeness measures from - a large number of published biotic area inventories. The vast majority of these are measured in units like bucket hours, or trap nights. Is a "bucket" part of SamplingGeometry or Sampling Unit? I'd be happy to send along all the many examples of how biotic inventories of an area are completed and perhaps it might be good to see how those might be represented using the terms you are proposing?
Best, Rob
On Wed, Aug 20, 2014 at 10:16 AM, Richard Pyle wrote:
Same here - Events are central to the work that we do.
Aloha,
Rich
*From:* tdwg-content-bounces@lists.tdwg.org *On Behalf Of* Anne Thessen *Sent:* Wednesday, August 20, 2014 2:59 AM *To:* tdwg-content@lists.tdwg.org *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hello
I would just like to comment on *event core*. I've been doing a lot of work translating published data into Darwin Core. During that process I've wished several times that I could use Event as core. I am happy to hear about that proposed change. It will make it easier to model the data I am working with.
Anne
On 8/20/2014 7:04 AM, ?amonn ? Tuama [GBIF] wrote:
Hi Rob,
Thank you for the feedback. I have tried to address the two main issues you raise below. At the outset, I would like to emphasise that much of
this
work is taking place in the context of the EU BON project which
includes a
task on developing/enhancing tools and standards for data sharing with
a
particular focus on the IPT for publishing sample-based data. So, we
were
constrained by the need to publish sample-based data sets in the Darwin Core Archive format and to demonstrate practical application using a working prototype. When the discussion on the TDWG list faded out, we
took
it to our EU BON partners whose requirements were essential input to further development. We recognise that these discussions took place
away
from TDWG (although the TDWG/EU BON contributors overlapped) and this
is
the reason we are presenting the outcomes here for further
consideration.
**Event core**
As the SIGS report indicated, sample data can be modelled in Darwin
Core
Archives using either Occurrence or Event as core. This was the
starting
point for our evaluation but as things progressed the data wrangling
pushed
the model back towards the Event core. We actually went through the exercise of mapping multiple test datasets in an iterative process
spanning
several months' work. In the end, we found that using an Event core
better
matched the typical sample data we were dealing with, allowing use of a measurement-or-fact extension to be included for the efficient
expression
of environmental information associated with the event. The choice
comes
down to an Occurrence core or an Event core + Occurrence extension. In
both
cases, the true observation records are Occurrences. The big difference
is
what type the core has and therefore to which kind of records you can attach further facts and extra information with DwC-A extensions. Many sampling datasets have very rich information about the site and event,
so
it is very natural to hang facts from an Event core. When picking the Occurrence core those facts would have to be repeated for each and
every
occurrence record. Moreover, our approach doesn?t stop anyone from
using
the Occurrence core if they so wish. This just provides a different
option
for datasets that better fit an Event core model.
I want to stress that we are not building a ?specific IPT version? to support an Event core but, rather, we adapted the IPT so that it can be configured to support any generic ?core + extension? format to enable
its
use for exploration of more data formats. This is part of the core codebase and there were no custom forks of the IPT for this work. Our
view
at GBIF is that if there are significant numbers of data publishers who
are
keen to adopt, promote and use a (any) format, and the tools can be configured to do so, then we should support it, and, if necessary, use
a
custom namespace.
**New terms around abundance**
Yes, the discussion on TDWG did fade out but it was clear that the term ?abundance? as recommended by the SIGS report (along with abundanceAsPercent) was confusing many when we were looking for term(s) that reported quantitative measures of organisms in a sample. It also became clear we would need to be able to state the type of quantity
being
measured. An alternative suggestion for using the MeasurementsOrFact
class
was immediately shot down.
As some of our main use cases were coming from the EU BON project, discussion shifted to that forum and consensus formed about the
currently
proposed terms. It was within this group that the additional terms (samplingGeometry, samplingUnit, eventSeriesID) were proposed and where
we
began testing with sample data sets.
Best regards,
?amonn
*From:* robgur@gmail.com [mailto:robgur@gmail.com robgur@gmail.com]
*On
Behalf Of *Robert Guralnick *Sent:* 19 August 2014 16:56 *To:* ?amonn ? Tuama [GBIF] *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hi ?amonn --- I am curious about the outcomes presented in the SIGS paper, in particular, this portion of the paper:
"Solutions without introducing an event core in Darwin Core Archives: During the review of the solutions for the uses cases, it became
apparent
that either model could be applied to every use case. The core and extensions bore a complementary relationship and between them could
express
all the required information. The core simply provided the central
anchor
in the star schema from which to join the additional information. Therefore, using the Occurrence core, well established in the GBIF
network
through uptake of the IPT, seemed more appropriate than inventing CollectingEvent as an additional core type."
That SIGS paper has John Wieczorek and you both as authors,
including
many luminaries across the biodiversity standards spectrum. Given the above, its curious to see the EventCore come back again, along with a specific IPT version to support it.
So I see two issues, conflated, in this post you just made. One is
the need for an EventCore at all, and the nature of relating Event and Occurrence/Material Sample. The second is the introduction of new
terms,
which seemingly have arrived after debate on similar terms - but framed around abundance - stalled a year ago. To my mind, these both require
some
further discussion, because I don't (necessarily) see TDWG community coherence around either issue?
Best, Rob
On Tue, Aug 19, 2014 at 6:11 AM, ?amonn ? Tuama [GBIF]
wrote:
Dear All,
GBIF is committed to exploring ways in which the IPT and Darwin Core Archive format can be extended for publishing sample-based data sets.
In
association with the EU BON project [1], a customised version of the
IPT
[2] has been deployed to test this using a special type of Darwin Core Archive in which the core is an ?Event? with associated taxon
occurrences
in an ?Occurrence? extension.
The Darwin Core vocabulary already provides a rich set of terms with
many
relevant for describing sample-based data. Synthesising several sources
of
input (GBIF organised workshop on sample data, May 2013 [3],
discussions on
the TDWG mailing list in late 2013; internal discussion among EU BON project partners), five new terms relating to sample data were
identified
as essential. The complete model including these new terms are fully described with examples in the online document ?Publishing sample data using the GBIF IPT? [4].
As a first step towards ratification, we would like to register the new terms in the DwC Google Code tracker [5] if there are no major
objections
on this list. The five terms are:
*quantity*: the number or enumeration value of the quantityType (e.g., individuals, biomass, biovolume, BraunBlanquetScale) per samplingUnit or a percentage measure recorded for the sample.
*quantityType*: the entity being referred to by quantity, e.g., individuals, biomass, %species, scale type.
*samplingGeometry*: an indication of what kind of space was sampled; select from point, line, area or volume.
*samplingUnit*: the unit of measurement used for reporting the quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3. It is combined with quantity and quantityType to provide the complete measurement, e.g., 9 individuals per day, 4 biomass-gm per metre^2.
*eventSeriesID*: an identifier for a set of events that are associated in some way, e.g., a monitoring series; may be a global unique identifier or an identifier specific to the series.
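To make the samplingUnit definition concrete, here is a minimal Python sketch of how quantity, quantityType and samplingUnit might be combined into one readable measurement. The function name `complete_measurement` is hypothetical and not part of the proposal or of any Darwin Core tooling; the percentage case mentioned under quantity is deliberately not handled.

```python
def complete_measurement(quantity, quantity_type, sampling_unit):
    """Join the three proposed terms into one readable measurement string.

    Hypothetical helper for illustration only; it assumes quantity is
    already expressed per samplingUnit, as the term definition states.
    """
    return f"{quantity:g} {quantity_type} per {sampling_unit}"


# The two examples given in the samplingUnit definition.
print(complete_measurement(9, "individuals", "day"))     # 9 individuals per day
print(complete_measurement(4, "biomass-gm", "metre^2"))  # 4 biomass-gm per metre^2
```

Keeping the three values as separate fields and only joining them for display means each can still be checked against a controlled vocabulary on its own.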
Best regards,
Éamonn
[1] http://eubon.eu
[3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
[4] http://links.gbif.org/sample_data_model
[5] https://code.google.com/p/darwincore/issues/list
*Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama@gbif.org), *
*Senior Programme Officer for Interoperability, *
*Global Biodiversity Information Facility Secretariat, *
*Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK*
*Phone: +45 3532 1494; Fax: +45 3532 1480*
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
--
Anne E. Thessen, Ph.D.
The Data Detektiv, Owner and Founder
Ronin Institute, Research Scholar
443.225.9185
I see the rationale for enabling this in Darwin Core Archives and adding the new terms. However, back to what Matt Jones brought up: "won't we just end up with a new syntax that does essentially what O&M and OBOE do now?".
We should include explicit references to existing terms/definitions that encapsulate what we're talking about, e.g. in our MaterialSample proposal last year we linked to an existing term in OBI, which has a much richer description and context for MaterialSample than what we considered (https://code.google.com/p/darwincore/issues/detail?id=167).
Have we explored the possibility of doing this with OBOE? I'm not suggesting we adopt OBOE wholesale, but it seems like we have a good opportunity to enable better semantic linking with that effort.
John
On Thu, Aug 28, 2014 at 4:23 AM, Éamonn Ó Tuama [GBIF] eotuama@gbif.org wrote:
Thanks, Ramona and Rob.
I'd like to add a few points following on Markus's reply.
I think your pressing of the need for a robust semantic model for biodiversity sample/survey data is incontestable - we do need one and it should enable rich data integration once it is defined and the tools and data standards to support it become available. However, GBIF is faced with the immediate task of making sample-based data discoverable and accessible using its current ecosystem of tools (IPT) and exchange standards (DwC; EML). Waiting for a functional, implementable semantic model and the tools and support services for it is just not an option for us right now.
We have already spent considerable time analysing the merits of Occurrence core vs Event core and have opted for an Event core for reasons previously given. I don't believe we are trying to reconfigure Event ("an action that occurs at a place and during a period of time") and regardless of whether we use Occurrence or Event, the need for some additional terms arises (e.g., quantity, quantityType, samplingGeometry, samplingUnit). Once the BCO model is available for uptake, it should be possible to develop a mapping between it and the simple DwC sample model.
So GBIF's stance is that we need to take a two-pronged approach by exploring how the IPT and DwC-A can be adapted for publishing sample-based data in the near term while supporting the work of TDWG and groups such as the BCO in advancing biodiversity informatics. GBIF has already engaged in the work of the BCO and will continue to do so.
Éamonn
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Markus Döring Sent: 28 August 2014 12:44 To: Ramona Walls Cc: TDWG Content Mailing List Subject: Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6
Hi Ramona & Rob,
The Event proposal does not try to change the semantics of an Event, it just uses the existing Darwin Core Event "class" at the core in Darwin Core archives. The actual change proposed is simply adding 3 new terms to the Event "group" to better share information about sampling methods & efforts, extending the existing limited capabilities of Darwin Core which already has the terms dwc:samplingProtocol and dwc:samplingEffort. It also proposes 2 new terms for dealing with quantity of Occurrences, something that has been discussed since 2012 now, when I had proposed a new abundance term [2].
In general, the application of Darwin Core is not at all limited to specimens and observations. It is used for sharing taxonomic datasets already and its definition and goal are broad. Let me cite some of the introduction to Darwin Core [1]:
What is the Darwin Core? The Darwin Core is body of standards. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. The Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information.
Motivation: The Darwin Core standard was originally conceived to facilitate the discovery, retrieval, and integration of information about modern biological specimens, their spatiotemporal occurrence, and their supporting evidence housed in collections (physical or digital). The Darwin Core today is broader in scope and more versatile. It is meant to provide a stable standard reference for sharing information on biological diversity. As a glossary of terms, the Darwin Core is meant to provide stable semantic definitions with the goal of being maximally reusable in a variety of contexts.
Markus
[1] http://rs.tdwg.org/dwc/index.htm [2] https://code.google.com/p/darwincore/issues/detail?id=142
-- Markus Döring Software Developer Global Biodiversity Information Facility (GBIF) mdoering@gbif.org http://www.gbif.org
On 27 Aug 2014, at 18:57, Ramona Walls rlwalls2008@gmail.com wrote:
I think it is important to consider the purpose of both Darwin Core and DwC archives in deciding whether or not to expand them, but we should use that consideration to address the question at hand, which is whether or not to add an Event core and additional properties to describe events.
Describing the exchange format before the semantics is the wrong way to go, given that we now have a framework for developing semantics. Expanding Darwin Core before we adequately model survey data is bound to lead to problems later, when we try to retro-fit the semantics to Darwin Core Event archives. This is exactly the problem we are running into now with Occurrence archives, and we have the opportunity to avoid it.
I suggest we first use existing ontologies to model survey data, then deal with if and how to exchange that information in DwC-A. This is what I was hinting at in my first email, but should have said more explicitly.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 9:14 AM, Robert Guralnick Robert.Guralnick@colorado.edu wrote:
It may be a sensible view for Darwin Core Archives and their intended use, but Tim's email suggests we should be putting the method of delivery ahead of the standard that delivers that content. If this was just about DwC-As, why not develop a survey extension that links each occurrence to information about the survey process using the existing star-schema methods we have in place? Why are we discussing adding terms to the Darwin Core or trying to fully reconfigure what we call an Event? That is what is on the table, not DwC-As and how we use them. Or am I missing something?
Best, Rob
On Wed, Aug 27, 2014 at 10:01 AM, Ramona Walls rlwalls2008@gmail.com wrote:
Thanks, Tim, and yes, DwC-A as a view (but not necessarily the primary archive) of data seems like the right point of view.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 1:58 AM, Tim Robertson trobertson@gbif.org wrote:
Hi Ramona,
Those are good points, and I'd like to come back to the original thinking behind the DwC-A.
It was designed and intended to be a simple way of exposing a complete view of a dataset, primarily for building sophisticated indexes, inventories and allowing basic analytics (e.g. GBIF.org being one sophisticated index). We found that the star schema provided the flexibility to do a lot, and with the bundled metadata (e.g. EML) was enough to trace provenance and allow users to determine if the dataset might be fit for various uses. In many cases this represents the complete (e.g. lossless) view of a dataset.
What we are discussing here are far richer datasets, where shoe-horning content into the star schema becomes lossy for some, although we're finding other cases where it is indeed lossless. I believe we should be looking to harmonise ontologies / models etc. as you mention, but in parallel we should define one or more star schema views that can still be used for discovery / reporting / basic analytical purposes, and not long-term archival of the dataset. The dataset would then have the canonical rich form and an additional DwC-A view. What I write here is applicable to all content types of course.
Please also note that many people put supplementary files in the DwC-A which are ignored by DwC-A readers but could be a way of keeping the richer view in the bundle. If you wish, you can describe those supplementary files in the EML document.
Does this gel with the view of others as well?
Cheers, Tim
On 27 Aug 2014, at 02:55, Ramona Walls rlwalls2008@gmail.com wrote:
I think Matt hit the nail on the head. Although Darwin Core can be used to exchange survey data, it lacks the semantics and structure necessary to archive the data without loss of information. I think the biodiversity community would be better served devoting energy to harmonizing existing technologies such as OGC, OBOE, and BCO, not to mention the many databases for storing plot or survey data. The goal should be to preserve the data in the most informative manner possible.
There is a strong case for wanting to search across all evidence for occurrences, including surveys and point occurrences, so I can see possible demand for a tool that would extract occurrences from survey data to a DwC archive. However, I am very concerned that making a DwC archive the primary exchange format for survey or plot data commits us to a path of losing information from the start, for all but the simplest sampling schemas.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Fri, Aug 22, 2014 at 3:00 AM, tdwg-content-request@lists.tdwg.org wrote:
Today's Topics:
- Re: Darwin Core: proposed news terms for expressing sample data (Matt Jones)
- Re: Darwin Core: proposed news terms for expressing sample data (Donald Hobern [GBIF])
Message: 1 Date: Thu, 21 Aug 2014 18:52:06 -0800 From: Matt Jones jones@nceas.ucsb.edu Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data To: Éamonn Ó Tuama [GBIF] eotuama@gbif.org Cc: TDWG Content Mailing List tdwg-content@lists.tdwg.org Message-ID: CAFSW8xkx7uRP9PC2g3=JT_VJanqujH8nPXoz8GXwh+JwKw5Ccw@mail.gmail.com Content-Type: text/plain; charset="utf-8"
This proposal is treading on ground that is quite similar to other observations and measurements standards for data exchange that are already mature, in particular:
- OGC Observations and Measurements (http://www.opengeospatial.org/standards/om)
- Extensible Observation Ontology (OBOE; https://semtools.ecoinformatics.org/oboe)
The former is a standard and broadly deployed, whereas the latter is part of a research program in the use of ontologies for measurements. Through collaboration between the two projects, they've been modified to be reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an OWL-DL serialization. They largely express the same measurements and sampling model once one gets beyond the terminology differences.
So, I'm wondering if it makes much sense to extend Darwin Core, which is at heart an Occurrence exchange syntax, into this measurements area that is well represented by these other existing specifications? I'm curious to hear why people would even want to do this. And if we do go down this path, won't we just end up with a new syntax that does essentially what O&M and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event core, which is being used in the sense described by Anne and Rich. For any event, you can have a list of species in an Occurrence extension and for each species, you can include quantity and quantityType, e.g., biomass, etc. The proposed term eventSeriesID was intended for linking together related events, although it now looks like parentEventID might be a better, more flexible term. The measurementOrFact extension is a good fit for capturing environmental information relating to an event. See, e.g., the Gialova Lagoon brackish water invertebrate test data set [1] where a set of 18 environmental variables, including temp, pH, Rdx, particulate organic matter, dissolved oxygen, salinity, chlorophyll-a were measured for each sampling station-sampling period combination. An example mapping is:
Id | measurementType | measurementValue | measurementUnit | measurementRemarks
IA | Tmp (sed) | 21.5 | degree C | Tmp (sed): temperature at the bottom surface
**Controlled vocabularies**
Ideally, the values for samplingUnit and quantityType would be selected from controlled vocabularies. This is, effectively, what we do by presenting a small list of values in a drop-down menu. The current values are what we derived for example data sets and discussion but they can undoubtedly be extended and improved.
We capture "bucket" type measures through a combination of samplingEffort, samplingGeometry and samplingUnit. For example, a pitfall trap (in a point location) left out for 16 days might have samplingEffort: 16, samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a shore survey might have samplingEffort: 3, samplingGeometry: area and samplingUnit: m^2.
It would be very useful to see your compilation of scope, effort and completeness measures to see if we can express them in our model and/or if we need to reconsider our approach.
Éamonn
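As a rough illustration of the pitfall-trap and quadrat examples above, the three terms could be carried together as one small record. This is only a sketch: the class name `SamplingDescription` and its `describe` method are hypothetical, and the geometry check simply reuses the four values listed for samplingGeometry in the proposal.

```python
from dataclasses import dataclass

# Values the proposal lists for samplingGeometry.
GEOMETRIES = {"point", "line", "area", "volume"}


@dataclass(frozen=True)
class SamplingDescription:
    """Hypothetical holder for the three sampling terms discussed above."""
    sampling_effort: float   # samplingEffort, e.g. 16
    sampling_geometry: str   # samplingGeometry, e.g. "point"
    sampling_unit: str       # samplingUnit, e.g. "day"

    def __post_init__(self):
        # Reject geometries outside the proposed list.
        if self.sampling_geometry not in GEOMETRIES:
            raise ValueError(f"unknown samplingGeometry: {self.sampling_geometry}")

    def describe(self) -> str:
        return f"{self.sampling_effort:g} {self.sampling_unit} ({self.sampling_geometry})"


pitfall = SamplingDescription(16, "point", "day")   # trap left out for 16 days
quadrats = SamplingDescription(3, "area", "m^2")    # three m^2 quadrats
print(pitfall.describe())    # 16 day (point)
print(quadrats.describe())   # 3 m^2 (area)
```

Units such as "bucket hours" or "trap nights" would need either an extended unit vocabulary or a different split between the three fields, which is the open question raised in this thread.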
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Markus Döring *Sent:* 20 August 2014 23:47 *To:* Robert Guralnick *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Rob,
this proposal is really for monitoring surveys, not to be confused with material samples like environmental or tissue samples, which have a distinct new dwc class MaterialSample.
We tend to overload the term sampling a lot and it helps to treat material samples differently from pure observational "sampling". That is why the existing Event class was used as the core and classic Occurrence records as extensions. A classic example is a vegetation survey where each plot represents an Event record and each recorded species in that plot will be an Occurrence extension record with a given quantity. Darwin Core already offers individualCount to specify quantity, but it is a very specific way of measuring "abundance" restricted to only some use cases. Abiotic measurements about the plot (e.g. soil type, pH, temperature) can be published using the measurements or facts extension linked to the Event core.
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick wrote:
Anne -- I don't know the answers! These are questions for Eamonn. I would presume that a sample could be a jumble of species or even just water or soil samples, and biomass would refer to that sample - but maybe that isn't a use case being considered? The examples given in the longer document all link an event_id to species name and some measure of quantity for that species (to the species, not an individual specimen), so I assume that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen <annethessen@gmail.com> wrote:
Hi Rob,
I would like to respond to your item number 2. From my perspective, I deal with lots of published descriptions of taxa. The text might say something like "I saw species A in the Chesapeake Bay, the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The biomass range obviously corresponds to at least three different occurrences, but how to divide the biomass data? I would love to be able to have an *event* to attach it all to. There are almost two different levels of events - a sampling event and a "study event". The "study event" would correspond to the type of event I would like to use in the above example. It may not be ideal, but for the old literature that might be the best we can do.
I have to admit that I don't know enough about trawl data to understand why an event core would be a problem. It seems that the trawl would be an event and each biomass measure (of each fish) would be attached to a separate occurrence which is attached to that event. Am I understanding this wrong?
btw - I found a workaround for the example I gave, so it's not impossible to model with the current structure....
Anne
On 8/20/2014 1:16 PM, Robert Guralnick wrote:
Éamonn et al. --- Thanks for the clarifications. I think these help a ton but they raise a couple more questions for me.
- I am surprised that you plan to use the MeasurementOrFact extension in relation to the Event core, which seems like a novel (or perhaps awkward or unintended?) mechanism for capturing environmental data, while the same extension was not seen as relevant for describing samples. Can you explain more about the thinking there?
- There may be a subtle issue here in extending "Event" to be more what you call a "Sampling Event Core". My read of this is that Darwin Core serves as a way to deal with point occurrences and Event reflects the context of a single capture event (whether a single observation, or a bulk sample capture). The changes recommended seem to dramatically extend and change that meaning? It's simply a question that I don't have an answer to, but is Darwin Core the right vehicle to start capturing repeated measures of biomass values from trawls? I don't have an answer but man, terms like quantityType (as a property of occurrence?) give me pause.
- Is Sampling Unit a controlled vocabulary? For another project, I have looked through - and captured scope, effort and completeness measures from - a large number of published biotic area inventories. The vast majority of these are measured in units like bucket hours, or trap nights. Is a "bucket" part of SamplingGeometry or Sampling Unit? I'd be happy to send along all the many examples of how biotic inventories of an area are completed and perhaps it might be good to see how those might be represented using the terms you are proposing?
Best, Rob
On Wed, Aug 20, 2014 at 10:16 AM, Richard Pyle wrote:
Same here - Events are central to the work that we do.
Aloha,
Rich
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Anne Thessen *Sent:* Wednesday, August 20, 2014 2:59 AM *To:* tdwg-content@lists.tdwg.org *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hello,
I would just like to comment on *event core*. I've been doing a lot of work translating published data into Darwin Core. During that process I've wished several times that I could use Event as core. I am happy to hear about that proposed change. It will make it easier to model the data I am working with.
Anne
On 8/20/2014 7:04 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob,
Thank you for the feedback. I have tried to address the two main issues you raise below. At the outset, I would like to emphasise that much of this work is taking place in the context of the EU BON project which includes a task on developing/enhancing tools and standards for data sharing with a particular focus on the IPT for publishing sample-based data. So, we were constrained by the need to publish sample-based data sets in the Darwin Core Archive format and to demonstrate practical application using a working prototype. When the discussion on the TDWG list faded out, we took it to our EU BON partners whose requirements were essential input to further development. We recognise that these discussions took place away from TDWG (although the TDWG/EU BON contributors overlapped) and this is the reason we are presenting the outcomes here for further consideration.
**Event core**
As the SIGS report indicated, sample data can be modelled in Darwin Core Archives using either Occurrence or Event as core. This was the starting point for our evaluation but as things progressed the data wrangling pushed the model back towards the Event core. We actually went through the exercise of mapping multiple test datasets in an iterative process spanning several months' work. In the end, we found that using an Event core better matched the typical sample data we were dealing with, allowing a measurement-or-fact extension to be included for the efficient expression of environmental information associated with the event. The choice comes down to an Occurrence core or an Event core + Occurrence extension. In both cases, the true observation records are Occurrences. The big difference is what type the core has and therefore to which kind of records you can attach further facts and extra information with DwC-A extensions. Many sampling datasets have very rich information about the site and event, so it is very natural to hang facts from an Event core. When picking the Occurrence core those facts would have to be repeated for each and every occurrence record. Moreover, our approach doesn't stop anyone from using the Occurrence core if they so wish. This just provides a different option for datasets that better fit an Event core model.
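The trade-off between the two layouts can be sketched in a few lines of Python. This is an illustration under assumptions, not how the IPT or any DwC-A reader works: the rows are invented, the function name `flatten_to_occurrence_core` is hypothetical, and only the term names (eventID, samplingProtocol, eventDate, scientificName, quantity, quantityType) come from Darwin Core or the proposal.

```python
# Event core: one row per sampling event (invented values).
events = [
    {"eventID": "plot-1", "samplingProtocol": "quadrat", "eventDate": "2014-06-01"},
]

# Occurrence extension: one row per taxon recorded in an event,
# linked back to the core row by eventID.
occurrences = [
    {"eventID": "plot-1", "scientificName": "Species A",
     "quantity": 9, "quantityType": "individuals"},
    {"eventID": "plot-1", "scientificName": "Species B",
     "quantity": 4, "quantityType": "individuals"},
]


def flatten_to_occurrence_core(events, occurrences):
    """Copy each event's facts onto every occurrence linked to it.

    This mimics an Occurrence-core layout, where event-level facts
    have to be repeated on each occurrence record.
    """
    by_id = {event["eventID"]: event for event in events}
    return [{**by_id[occ["eventID"]], **occ} for occ in occurrences]


flat = flatten_to_occurrence_core(events, occurrences)
for row in flat:
    print(row["scientificName"], row["samplingProtocol"], row["eventDate"])
```

With the Event core, samplingProtocol and eventDate are stored once per event; in the flattened form they appear once per occurrence, which is the repetition described above.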
I want to stress that we are not building a "specific IPT version" to support an Event core but, rather, we adapted the IPT so that it can be configured to support any generic "core + extension" format to enable its use for exploration of more data formats. This is part of the core codebase and there were no custom forks of the IPT for this work. Our view at GBIF is that if there are significant numbers of data publishers who are keen to adopt, promote and use a (any) format, and the tools can be configured to do so, then we should support it, and, if necessary, use a custom namespace.
**New terms around abundance**
Yes, the discussion on TDWG did fade out but it was clear that the term "abundance" as recommended by the SIGS report (along with abundanceAsPercent) was confusing many when we were looking for term(s) that reported quantitative measures of organisms in a sample. It also became clear we would need to be able to state the type of quantity being measured. An alternative suggestion for using the MeasurementsOrFact class was immediately shot down.
As some of our main use cases were coming from the EU BON project, discussion shifted to that forum and consensus formed about the currently proposed terms. It was within this group that the additional terms (samplingGeometry, samplingUnit, eventSeriesID) were proposed and where we began testing with sample data sets.
Best regards,
Éamonn
*From:* robgur@gmail.com [mailto:robgur@gmail.com] *On Behalf Of *Robert Guralnick *Sent:* 19 August 2014 16:56 *To:* Éamonn Ó Tuama [GBIF] *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hi Éamonn --- I am curious about the outcomes presented in the SIGS paper, in particular, this portion of the paper:
"Solutions without introducing an event core in Darwin Core Archives: During the review of the solutions for the uses cases, it became apparent that either model could be applied to every use case. The core and extensions bore a complementary relationship and between them could express all the required information. The core simply provided the central anchor in the star schema from which to join the additional information. Therefore, using the Occurrence core, well established in the GBIF network through uptake of the IPT, seemed more appropriate than inventing CollectingEvent as an additional core type."
That SIGS paper has John Wieczorek and you both as authors, including many luminaries across the biodiversity standards spectrum. Given the above, it's curious to see the EventCore come back again, along with a specific IPT version to support it.
So I see two issues, conflated, in this post you just made. One is the need for an EventCore at all, and the nature of relating Event and Occurrence/Material Sample. The second is the introduction of new terms, which seemingly have arrived after debate on similar terms - but framed around abundance - stalled a year ago. To my mind, these both require some further discussion, because I don't (necessarily) see TDWG community coherence around either issue?
Best, Rob
On Tue, Aug 19, 2014 at 6:11 AM, Éamonn Ó Tuama [GBIF] wrote:
Dear All,
GBIF is committed to exploring ways in which the IPT and Darwin Core Archive format can be extended for publishing sample-based data sets. In association with the EU BON project [1], a customised version of the IPT [2] has been deployed to test this using a special type of Darwin Core Archive in which the core is an "Event" with associated taxon occurrences in an "Occurrence" extension.
The Darwin Core vocabulary already provides a rich set of terms, with many relevant for describing sample-based data. Synthesising several sources of input (GBIF organised workshop on sample data, May 2013 [3]; discussions on the TDWG mailing list in late 2013; internal discussion among EU BON project partners), five new terms relating to sample data were identified as essential. The complete model including these new terms is fully described with examples in the online document "Publishing sample data using the GBIF IPT" [4].
As a first step towards ratification, we would like to register the new terms in the DwC Google Code tracker [5] if there are no major objections on this list. The five terms are:
*quantity*: the number or enumeration value of the quantityType (e.g., individuals, biomass, biovolume, BraunBlanquetScale) per samplingUnit or a percentage measure recorded for the sample.
*quantityType*: the entity being referred to by quantity, e.g., individuals, biomass, %species, scale type.
*samplingGeometry*: an indication of what kind of space was sampled; select from point, line, area or volume.
*samplingUnit*: the unit of measurement used for reporting the quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3. It is combined with quantity and quantityType to provide the complete measurement, e.g., 9 individuals per day, 4 biomass-gm per metre^2.
*eventSeriesID*: an identifier for a set of events that are associated in some way, e.g., a monitoring series; may be a global unique identifier or an identifier specific to the series.
Best regards,
Éamonn
[1] http://eubon.eu
[3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
[4] http://links.gbif.org/sample_data_model
[5] https://code.google.com/p/darwincore/issues/list
*Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama@gbif.org), *
*Senior Programme Officer for Interoperability, *
*Global Biodiversity Information Facility Secretariat, *
*Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK*
*Phone: +45 3532 1494; Fax: +45 3532 1480*
Hi all --- Ok, I think the scope of the issue is quite clear. Let me summarize:
1) As Éamonn and the rest of GBIF have made quite clear, "GBIF is faced with the immediate task of making sample-based data discoverable and accessible using its current ecosystem of tools" given a funding mandate from EU-BON.
2) The solution for this problem is to develop an Event-core and to promote new terms to the Darwin Core to make this happen.
I will note a small inconsistency here: the current ecosystem of standards and tools is Darwin Core (as it stands) and publishing systems such as IPT. That ecosystem of tools includes mechanisms to extend Darwin Core where needed, via extensions. The *current* ecosystem of tools doesn't include new Cores or new DwC terms, does it?
So this leads in nicely to the contentious issue(s) and places where there seems to be discussion --- these have to do with the nature of the changes suggested and the scope of those changes, both in terms of an Event core and DwC term additions. Leaving aside the Event-core for now, the key questions about term additions to the Darwin Core that seem to be at heart here are:
1) Is the intent of the Darwin Core to model surveys, which usually involve multiple kinds and types of sampling over multiple sites using multiple methods?
2) Is the solution to invent new terms for the Darwin Core? If there are already terms from other efforts, wouldn't we work with those existing efforts to assure interoperability?
I appreciate the efforts of GBIF here fully, and am personally torn because on the one hand, I fully agree with the goal of extending Darwin Core to better represent richer biodiversity data. On the other hand, I worry about process here and how to make that happen in a way that isn't too hasty or locks us into just the opposite of what I think many of us want with regards to sharing data more broadly than within just one ecosystem of tools.
Best, Rob
On Thu, Aug 28, 2014 at 6:30 AM, John Deck jdeck@berkeley.edu wrote:
I see the rationale for enabling this in Darwin Core Archives and adding the new terms. However, back to what Matt Jones brought up: "won't we just end up with a new syntax that does essentially what O&M and OBOE do now?".
We should include explicit references to existing terms/definitions that encapsulate what we're talking about, e.g. in our MaterialSample proposal last year we linked to an existing term in OBI, which has a much richer description and context for MaterialSample than what we considered (https://code.google.com/p/darwincore/issues/detail?id=167)
Have we explored the possibility of doing this with OBOE? I'm not suggesting we adopt OBOE wholesale, but it seems like we have a good opportunity to enable better semantic linking with that effort.
John
On Thu, Aug 28, 2014 at 4:23 AM, Éamonn Ó Tuama [GBIF] eotuama@gbif.org wrote:
Thanks, Ramona and Rob.
I'd like to add a few points following on Markus's reply.
I think your pressing of the need for a robust semantic model for biodiversity sample/survey data is incontestable – we do need one and it should enable rich data integration once it is defined and the tools and data standards to support it become available. However, GBIF is faced with the immediate task of making sample-based data discoverable and accessible using its current ecosystem of tools (IPT) and exchange standards (DwC; EML). Waiting for a functional, implementable semantic model and the tools and support services for it is just not an option for us right now.
We have already spent considerable time analysing the merits of Occurrence core vs Event core and have opted for an Event core for reasons previously given. I don’t believe we are trying to reconfigure Event (“an action that occurs at a place and during a period of time”) and regardless of whether we use Occurrence or Event, the need for some additional terms arises (e.g., quantity, quantityType, samplingGeometry, samplingUnit). Once the BCO model is available for uptake, it should be possible to develop a mapping between it and the simple DwC sample model.
So GBIF’s stance is that we need to take a two-pronged approach by exploring how the IPT and DwC-A can be adapted for publishing sample-based data in the near term while supporting the work of TDWG and groups such as the BCO in advancing biodiversity informatics. GBIF has already engaged in the work of the BCO and will continue to do so.
Éamonn
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Markus Döring Sent: 28 August 2014 12:44 To: Ramona Walls Cc: TDWG Content Mailing List Subject: Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6
Hi Ramona & Rob,
The Event proposal does not try to change the semantics of an Event, it just uses the existing Darwin Core Event "class" at the core in Darwin Core archives. The actual change proposed is simply adding 3 new terms to the Event "group" to better share information about sampling methods & efforts, extending the existing limited capabilities of Darwin Core which already has the terms dwc:samplingProtocol and dwc:samplingEffort. It also proposes 2 new terms for dealing with quantity of Occurrences, something that has been discussed since 2012 now, when I had proposed a new abundance term [2].
In general, the application of Darwin Core is not at all limited to specimens and observations. It is used for sharing taxonomic datasets already and its definition and goal are broad. Let me cite some of the introduction to Darwin Core [1]:
What is the Darwin Core? The Darwin Core is body of standards. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. The Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information.
Motivation: The Darwin Core standard was originally conceived to facilitate the discovery, retrieval, and integration of information about modern biological specimens, their spatiotemporal occurrence, and their supporting evidence housed in collections (physical or digital). The Darwin Core today is broader in scope and more versatile. It is meant to provide a stable standard reference for sharing information on biological diversity. As a glossary of terms, the Darwin Core is meant to provide stable semantic definitions with the goal of being maximally reusable in a variety of contexts.
Markus
[1] http://rs.tdwg.org/dwc/index.htm [2] https://code.google.com/p/darwincore/issues/detail?id=142
-- Markus Döring Software Developer Global Biodiversity Information Facility (GBIF) mdoering@gbif.org http://www.gbif.org
On 27 Aug 2014, at 18:57, Ramona Walls rlwalls2008@gmail.com wrote:
I think it is important to consider the purpose of both Darwin Core and DwC archives in deciding whether or not to expand them, but we should use that consideration to address the question at hand, which is whether or not to add an Event core and additional properties to describe events.
Describing the exchange format before the semantics is the wrong way to go, given that we now have a framework for developing semantics. Expanding Darwin Core before we adequately model survey data is bound to lead to problems later, when we try to retrofit the semantics to Darwin Core Event archives. This is exactly the problem we are running into now with Occurrence archives, and we have the opportunity to avoid it.
I suggest we first use existing ontologies to model survey data, then deal with if and how to exchange that information in DwC-A. This is what I was hinting at in my first email, but should have said more explicitly.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 9:14 AM, Robert Guralnick Robert.Guralnick@colorado.edu wrote:
It may be a sensible view for Darwin Core Archives and their intended use, but Tim's email suggests we should be putting the method of delivery ahead of the standard that delivers that content. If this was just about DwC-As, why not develop a survey extension that links each occurrence to information about the survey process using the existing star-schema methods we have in place? Why are we discussing adding terms to the Darwin Core or trying to fully reconfigure what we call an Event? That is what is on the table, not DwC-As and how we use them. Or am I missing something?
Best, Rob
On Wed, Aug 27, 2014 at 10:01 AM, Ramona Walls rlwalls2008@gmail.com wrote:
Thanks, Tim, and yes, DwC-A as a view (but not necessarily the primary archive) of data seems like the right point of view.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 1:58 AM, Tim Robertson trobertson@gbif.org wrote:
Hi Ramona,
Those are good points, and I’d like to come back to the original thinking behind the DwC-A.
It was designed and intended to be a simple way of exposing a complete view of a dataset, primarily for building sophisticated indexes, inventories and allowing basic analytics (e.g. GBIF.org being one sophisticated index). We found that the star schema provided the flexibility to do a lot, and with the bundled metadata (e.g. EML) was enough to trace provenance and allow users to determine if the dataset might be fit for various uses. In many cases this represents the complete (e.g. lossless) view of a dataset.
What we are discussing here are far richer datasets, where shoe-horning content into the star schema becomes lossy for some, although we’re finding other cases where it is indeed lossless. I believe we should be looking to harmonise ontologies / models etc as you mention but in parallel we should define one or more star schema views that can still be used for discovery / reporting / basic analytical purposes, and not long term archival of the dataset. The dataset would then have the canonical rich form and an additional DwC-A view. What I write here is applicable to all content types of course.
Please also note that many people put supplementary files in the DwC-A which are ignored by DwC-A readers but could be a way of keeping the richer view in the bundle. If one wished, one could describe those supplementary files in the EML document.
Does this gel with the view of others as well?
Cheers, Tim
On 27 Aug 2014, at 02:55, Ramona Walls rlwalls2008@gmail.com wrote:
I think Matt hit the nail on the head. Although Darwin Core can be used to exchange survey data, it lacks the semantics and structure necessary to archive the data without loss of information. I think the biodiversity community would be better served devoting energy to harmonizing existing technologies such as OGC, OBOE, and BCO, not to mention the many databases for storing plot or survey data. The goal should be to preserve the data in the most informative manner possible.
There is a strong case for wanting to search across all evidence for occurrences, including surveys and point occurrences, so I can see possible demand for a tool that would extract occurrences from survey data to a DwC archive. However, I am very concerned that making a DwC archive the primary exchange format for survey or plot data commits us to a path of losing information from the start, for all but the simplest sampling schemas.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Fri, Aug 22, 2014 at 3:00 AM, tdwg-content-request@lists.tdwg.org wrote:
Message: 1 Date: Thu, 21 Aug 2014 18:52:06 -0800 From: Matt Jones jones@nceas.ucsb.edu Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data To: Éamonn Ó Tuama [GBIF] eotuama@gbif.org Cc: TDWG Content Mailing List tdwg-content@lists.tdwg.org Message-ID: <CAFSW8xkx7uRP9PC2g3=JT_VJanqujH8nPXoz8GXwh+JwKw5Ccw@mail.gmail.com> Content-Type: text/plain; charset="utf-8"
This proposal is treading on ground that is quite similar to other observations and measurements standards for data exchange that are already mature, in particular:
- OGC Observations and Measurements (http://www.opengeospatial.org/standards/om)
- Extensible Observation Ontology (OBOE; https://semtools.ecoinformatics.org/oboe)
The former is a standard and broadly deployed, whereas the latter is part of a research program in the use of ontologies for measurements. Through collaboration between the two projects, they've been modified to be reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an OWL-DL serialization. They largely express the same measurements and sampling model once one gets beyond the terminology differences.
So, I'm wondering if it makes much sense to extend Darwin Core, which is at heart an Occurrence exchange syntax, into this measurements area that is well represented by these other existing specifications? I'm curious to hear why people would even want to do this. And if we do go down this path, won't we just end up with a new syntax that does essentially what O&M and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event core which is being used in the sense described by Anne and Rich. For any event, you can have a list of species in an Occurrence extension and for each species, you can include quantity and quantityType, e.g., biomass, etc. The proposed term eventSeriesID was intended for linking together related events, although it now looks like parentEventID might be a better, more flexible term. The measurementOrFact extension is a good fit for capturing environmental information relating to an event. See, e.g., the Gialova Lagoon brackish water invertebrate test data set [1] where a set of 18 environmental variables, including temp, pH, Rdx, particulate organic matter, dissolved oxygen, salinity, chlorophyll-a were measured for each sampling station-sampling period combination. An example mapping is:
Id  measurementType  measurementValue  measurementUnit  measurementRemarks
IA  Tmp (sed)        21.5              degree C         Tmp (sed): temperature at the bottom surface
**Controlled vocabularies**
Ideally, the values for samplingUnit and quantityType would be selected from controlled vocabularies. This is, effectively, what we do by presenting a small list of values in a drop-down menu. The current values are what we derived for example data sets and discussion but they can undoubtedly be extended and improved.
We capture “bucket” type measures through a combination of samplingEffort, samplingGeometry and samplingUnit. For example, a pitfall trap (in a point location) left out for 16 days might have samplingEffort: 16, samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a shore survey might have samplingEffort: 3, samplingGeometry: area and samplingUnit: m^2.
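The two examples above (pitfall trap, shore quadrats) can be written down as a small data structure. This is only a sketch: the class name `SamplingDescription`, its field names and the `summary` helper are hypothetical and not part of Darwin Core or the IPT; the list of allowed geometries is taken from the samplingGeometry definition given in this thread.

```python
from dataclasses import dataclass

# Allowed values taken from the samplingGeometry definition in this thread.
ALLOWED_GEOMETRIES = {"point", "line", "area", "volume"}


@dataclass(frozen=True)
class SamplingDescription:
    """Hypothetical holder for samplingEffort, samplingGeometry and samplingUnit."""

    sampling_effort: float
    sampling_geometry: str
    sampling_unit: str

    def __post_init__(self):
        # Reject geometries outside the small vocabulary described in the proposal.
        if self.sampling_geometry not in ALLOWED_GEOMETRIES:
            raise ValueError(f"unknown samplingGeometry: {self.sampling_geometry!r}")

    def summary(self) -> str:
        return f"{self.sampling_effort:g} {self.sampling_unit} ({self.sampling_geometry})"


# The two examples from the message above.
pitfall_trap = SamplingDescription(16, "point", "day")
shore_quadrats = SamplingDescription(3, "area", "m^2")
print(pitfall_trap.summary())
print(shore_quadrats.summary())
```

Note that a value such as "bucket" would be rejected as a geometry in this sketch, which is one way of reading the question of where "bucket hours" belong.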
It would be very useful to see your compilation of scope, effort and completeness measures to see if we can express them in our model and/or if we need to reconsider our approach.
Éamonn
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] *On Behalf Of* Markus Döring *Sent:* 20 August 2014 23:47 *To:* Robert Guralnick *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Rob,
this proposal is really for monitoring surveys, not to be confused with material samples like environmental or tissue samples which have a distinct new dwc class MaterialSample.
We tend to overload the term sampling a lot and it helps to treat material samples differently from pure observational "sampling". That is why the existing Event class was used as the core and classic Occurrence records as extensions. A classic example is a vegetation survey where each plot represents an Event record and each recorded species in that plot will be an Occurrence extension record with a given quantity. Darwin Core already offers individualCount to specify quantity, but it is a very specific way of measuring "abundance" restricted to only some use cases. Abiotic measurements about the plot (e.g. soil type, pH, temperature) can be published using the measurements or facts extension linked to the Event core.
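The vegetation survey example can be sketched as three small tables joined on eventID. This is an illustration only: the rows are invented, the helper `attach_extensions` is hypothetical, and nothing here reflects how the IPT or any DwC-A reader is actually implemented.

```python
# Core: one Event record per vegetation plot (all values are invented).
events = [
    {"eventID": "plot-1", "eventDate": "2014-06-01"},
    {"eventID": "plot-2", "eventDate": "2014-06-02"},
]

# Occurrence extension: one row per species recorded in a plot.
occurrences = [
    {"eventID": "plot-1", "scientificName": "Quercus robur", "quantity": 12, "quantityType": "individuals"},
    {"eventID": "plot-1", "scientificName": "Fagus sylvatica", "quantity": 3, "quantityType": "individuals"},
    {"eventID": "plot-2", "scientificName": "Quercus robur", "quantity": 4, "quantityType": "individuals"},
]

# Measurement-or-fact extension: abiotic measurements about each plot.
measurements = [
    {"eventID": "plot-1", "measurementType": "pH", "measurementValue": "6.2"},
    {"eventID": "plot-2", "measurementType": "pH", "measurementValue": "5.8"},
]


def attach_extensions(core_rows, **extensions):
    """Group extension rows under the core record that shares their eventID."""
    by_event = {}
    for row in core_rows:
        by_event[row["eventID"]] = {**row, **{name: [] for name in extensions}}
    for name, rows in extensions.items():
        for row in rows:
            by_event[row["eventID"]][name].append(row)
    return by_event


survey = attach_extensions(events, occurrences=occurrences, measurements=measurements)
for event_id, record in survey.items():
    species = [occ["scientificName"] for occ in record["occurrences"]]
    print(event_id, species, record["measurements"])
```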
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick wrote:
Anne -- I don't know the answers! These are questions for Eamonn. I would presume that a sample could be a jumble of species or even just water or soil samples, and biomass would refer to that sample - but maybe that isn't a use case being considered? The examples given in the longer document all link an event_id to species name and some measure of quantity for that species (to the species, not an individual specimen), so I assume that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen <annethessen@gmail.com> wrote:
Hi Rob
I would like to respond to your item number 2. From my perspective, I deal with lots of published descriptions of taxa. The text might say something like "I saw species A in the Chesapeake Bay, the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The biomass range obviously corresponds to at least three different occurrences, but how to divide the biomass data? I would love to be able to have an *event* to attach it all to. There are almost two different levels of events - a sampling event and a "study event". The "study event" would correspond to the type of event I would like to use in the above example. It may not be ideal, but for the old literature that might be the best we can do.
I have to admit that I don't know enough about trawl data to understand why an event core would be a problem. It seems that the trawl would be an event and each biomass measure (of each fish) would be attached to a separate occurrence which is attached to that event. Am I understanding this wrong?
btw - I found a workaround for the example I gave, so it's not impossible to model with the current structure....
Anne
On 8/20/2014 1:16 PM, Robert Guralnick wrote:
Éamonn et al. --- Thanks for the clarifications. I think these help a ton but they raise a couple more questions for me.
- I am surprised that you plan to use the MeasurementOrFact extension in relation to the Event core, which seems like a novel (or perhaps awkward or unintended?) mechanism for capturing environmental data, but the same extension was not seen as relevant for describing samples? Can you explain more about the thinking there?
- There may be a subtle issue here extending "Event" to be more what you call a "Sampling Event Core". My read of this is that Darwin Core serves as a way to deal with point occurrences and Event reflects the context of a single capture event (whether a single observation, or a bulk sample capture). The changes recommended seem to dramatically extend and change that meaning? It's simply a question that I don't have an answer to, but is Darwin Core the right vehicle to start capturing repeated measures of biomass values from trawls? I don't have an answer but man, terms like quantityType (as a property of occurrence?) give me pause.
- Is Sampling Unit a controlled vocabulary? For another project, I have looked through - and captured scope, effort and completeness measures from - a large number of published biotic area inventories. The vast majority of these are measured in units like bucket hours, or trap nights. Is a "bucket" part of SamplingGeometry or Sampling Unit? I'd be happy to send along all the many examples of how biotic inventories of an area are completed and perhaps it might be good to see how those might be represented using the terms you are proposing?
Best, Rob
On Wed, Aug 20, 2014 at 10:16 AM, Richard Pyle wrote:
Same here - Events are central to the work that we do.
Aloha,
Rich
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] *On Behalf Of* Anne Thessen *Sent:* Wednesday, August 20, 2014 2:59 AM *To:* tdwg-content@lists.tdwg.org *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hello
I would just like to comment on *event core*. I've been doing a lot of work translating published data into Darwin Core. During that process I've wished several times that I could use Event as core. I am happy to hear about that proposed change. It will make it easier to model the data I am working with.
Anne
On 8/20/2014 7:04 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob,
Thank you for the feedback. I have tried to address the two main issues you raise below. At the outset, I would like to emphasise that much of this work is taking place in the context of the EU BON project which includes a task on developing/enhancing tools and standards for data sharing with a particular focus on the IPT for publishing sample-based data. So, we were constrained by the need to publish sample-based data sets in the Darwin Core Archive format and to demonstrate practical application using a working prototype. When the discussion on the TDWG list faded out, we took it to our EU BON partners whose requirements were essential input to further development. We recognise that these discussions took place away from TDWG (although the TDWG/EU BON contributors overlapped) and this is the reason we are presenting the outcomes here for further consideration.
**Event core**
As the SIGS report indicated, sample data can be modelled in Darwin Core Archives using either Occurrence or Event as core. This was the starting point for our evaluation but as things progressed the data wrangling pushed the model back towards the Event core. We actually went through the exercise of mapping multiple test datasets in an iterative process spanning several months' work. In the end, we found that using an Event core better matched the typical sample data we were dealing with, allowing a measurement-or-fact extension to be included for the efficient expression of environmental information associated with the event. The choice comes down to an Occurrence core or an Event core + Occurrence extension. In both cases, the true observation records are Occurrences. The big difference is what type the core has and therefore to which kind of records you can attach further facts and extra information with DwC-A extensions. Many sampling datasets have very rich information about the site and event, so it is very natural to hang facts from an Event core. When picking the Occurrence core those facts would have to be repeated for each and every occurrence record. Moreover, our approach doesn’t stop anyone from using the Occurrence core if they so wish. This just provides a different option for datasets that better fit an Event core model.
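The repetition cost mentioned above can be made concrete with a toy calculation. This is a sketch under stated assumptions: the event record, its field names other than eventID, and the helper `flatten_to_occurrence_core` are all hypothetical, and the species names are placeholders.

```python
def flatten_to_occurrence_core(event, occurrences):
    """Copy every event-level field onto each occurrence row."""
    return [{**event, **occ} for occ in occurrences]


# Hypothetical event with three event-level facts besides its identifier.
event = {"eventID": "station-A", "habitat": "brackish lagoon", "temperature": "21.5", "salinity": "35"}
occurrences = [
    {"eventID": "station-A", "occurrenceID": f"occ-{i}", "scientificName": name}
    for i, name in enumerate(["Species one", "Species two", "Species three"], start=1)
]

flat_rows = flatten_to_occurrence_core(event, occurrences)

# With an Event core each fact is stored once; in the flat layout it is
# stored once per occurrence, so every row after the first is a redundant copy.
event_fact_count = len(event) - 1  # everything except the eventID join key
extra_copies = event_fact_count * (len(flat_rows) - 1)
print(f"{event_fact_count} event facts, {extra_copies} redundant copies in the flat layout")
```

The redundancy grows with both the number of event-level facts and the number of occurrences per event, which is the situation described for the richer sampling datasets.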
I want to stress that we are not building a “specific IPT version” to support an Event core but, rather, we adapted the IPT so that it can be configured to support any generic “core + extension” format to enable its use for exploration of more data formats. This is part of the core codebase and there were no custom forks of the IPT for this work. Our view at GBIF is that if there are significant numbers of data publishers who are keen to adopt, promote and use a (any) format, and the tools can be configured to do so, then we should support it, and, if necessary, use a custom namespace.
**New terms around abundance**
Yes, the discussion on TDWG did fade out but it was clear that the term “abundance” as recommended by the SIGS report (along with abundanceAsPercent) was confusing many when we were looking for term(s) that reported quantitative measures of organisms in a sample. It also became clear we would need to be able to state the type of quantity being measured. An alternative suggestion for using the MeasurementsOrFact class was immediately shot down.
As some of our main use cases were coming from the EU BON project, discussion shifted to that forum and consensus formed about the currently proposed terms. It was within this group that the additional terms (samplingGeometry, samplingUnit, eventSeriesID) were proposed and where we began testing with sample data sets.
Best regards,
Éamonn
*From:* robgur@gmail.com [mailto:robgur@gmail.com] *On Behalf Of* Robert Guralnick *Sent:* 19 August 2014 16:56 *To:* Éamonn Ó Tuama [GBIF] *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hi Éamonn --- I am curious about the outcomes presented in the SIGS paper, in particular, this portion of the paper:
"Solutions without introducing an event core in Darwin Core Archives: During the review of the solutions for the uses cases, it became apparent that either model could be applied to every use case. The core and extensions bore a complementary relationship and between them could express all the required information. The core simply provided the central anchor in the star schema from which to join the additional information. Therefore, using the Occurrence core, well established in the GBIF network through uptake of the IPT, seemed more appropriate than inventing CollectingEvent as an additional core type."
That SIGS paper has John Wieczorek and you both as authors, including many luminaries across the biodiversity standards spectrum. Given the above, it's curious to see the EventCore come back again, along with a specific IPT version to support it.
So I see two issues, conflated, in this post you just made. One is the need for an EventCore at all, and the nature of relating Event and Occurrence/Material Sample. The second is the introduction of new terms, which seemingly have arrived after debate on similar terms - but framed around abundance - stalled a year ago. To my mind, these both require some further discussion, because I don't (necessarily) see TDWG community coherence around either issue?
Best, Rob
On Tue, Aug 19, 2014 at 6:11 AM, Éamonn Ó Tuama [GBIF] wrote:
Dear All,
GBIF is committed to exploring ways in which the IPT and Darwin Core Archive format can be extended for publishing sample-based data sets. In association with the EU BON project [1], a customised version of the IPT [2] has been deployed to test this using a special type of Darwin Core Archive in which the core is an “Event” with associated taxon occurrences in an “Occurrence” extension.
The Darwin Core vocabulary already provides a rich set of terms with many relevant for describing sample-based data. Synthesising several sources of input (GBIF organised workshop on sample data, May 2013 [3], discussions on the TDWG mailing list in late 2013; internal discussion among EU BON project partners), five new terms relating to sample data were identified as essential. The complete model including these new terms is fully described with examples in the online document “Publishing sample data using the GBIF IPT” [4].
As a first step towards ratification, we would like to register the new terms in the DwC Google Code tracker [5] if there are no major objections on this list. The five terms are:
*quantity*: the number or enumeration value of the quantityType (e.g., individuals, biomass, biovolume, BraunBlanquetScale) per samplingUnit or a percentage measure recorded for the sample.
*quantityType*: the entity being referred to by quantity, e.g., individuals, biomass, %species, scale type.
*samplingGeometry*: an indication of what kind of space was sampled; select from point, line, area or volume.
*samplingUnit*: the unit of measurement used for reporting the quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3. It is combined with quantity and quantityType to provide the complete measurement, e.g., 9 individuals per day, 4 biomass-gm per metre^2.
*eventSeriesID*: an identifier for a set of events that are associated in some way, e.g., a monitoring series; may be a global unique identifier or an identifier specific to the series.
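How quantity, quantityType and samplingUnit combine into a "complete measurement" can be shown with a few lines of code. This is a sketch only: the function `describe_quantity` and the record layout are hypothetical, and the two sample records simply reuse the examples in the samplingUnit definition above.

```python
def describe_quantity(record: dict) -> str:
    """Combine quantity, quantityType and (optionally) samplingUnit into one phrase."""
    text = f"{record['quantity']:g} {record['quantityType']}"
    sampling_unit = record.get("samplingUnit")
    if sampling_unit:
        text += f" per {sampling_unit}"
    return text


# The two examples given in the samplingUnit definition.
records = [
    {"quantity": 9, "quantityType": "individuals", "samplingUnit": "day"},
    {"quantity": 4, "quantityType": "biomass-gm", "samplingUnit": "metre^2"},
]
descriptions = [describe_quantity(record) for record in records]
for line in descriptions:
    print(line)
```

A record without samplingUnit, such as a percentage measure for the whole sample, is rendered without the "per ..." part.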
Best regards,
Éamonn
[1] http://eubon.eu
[3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
[4] http://links.gbif.org/sample_data_model
[5] https://code.google.com/p/darwincore/issues/list
Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama@gbif.org),
Senior Programme Officer for Interoperability,
Global Biodiversity Information Facility Secretariat,
Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK
Phone: +45 3532 1494; Fax: +45 3532 1480
--
Anne E. Thessen, Ph.D.
The Data Detektiv, Owner and Founder
Ronin Institute, Research Scholar
443.225.9185
Rob, GBIF is hard at work solving a problem of their own. It seems natural for them to do what works best for them, particularly dealing with the volume/scale they do. And, Markus, Tim, Eamonn and others have been deeply involved in Darwin Core standards implementation, inventing the DwCA and the IPT in the first place. But, in using DwCA to get actual work done for GBIF they have added a term here and there via their own extensions. See http://tools.gbif.org/dwca-validator/extensions.do. It’s a long list of Registered and Stable extensions. What about Audubon and Plinian Core? Aren’t they actually “extensions” to Darwin Core, particularly in the “universal biodiversity data” sense Darwin Core seems to be evolving to? Maybe we need a “SampleCore” or “EventCore” that organizes these sampling/event-related terms?
GBIF is creating a new “Event core” for DwCA to do what they need to do. But, I don’t understand the problem with this. Occurrence core, Taxon core, Event core – so what? The DwCA model allows anything to be the core (ignoring IPT) and it does not actually restrict the terms used to DwC only, certainly it already includes Dublin Core and EML. In fact, lots of DwCA exchanges are being used outside GBIF now with extended terms from Audubon Core, Plinian Core, and others. How is this bad? It shows how flexible the DwCA data exchange structure is. And, it is an exchange, not a data storage format. What’s the problem with an Event core?
As an aside, I have said many times that the most confusing thing to our community about DwCA is that it’s called Darwin Core Archive when the structure is in fact not limited to only Darwin Core terms. Any validatable XML term can be used – like http://rs.gbif.org/terms/1.0/verbatimLabel which is used in the How-To Guide Whales example. And I commonly hear people at conferences and meetings now referring to their plans to adopt “Darwin Core” and upon questioning discover they are talking about the DwCA format, not really knowing the details of it. So, the words “Darwin Core” have become ambiguous.
And I want to note that Darwin Core Archive is not a TDWG standard. It is a guideline for implementing the Darwin Core standard in text files. http://rs.tdwg.org/dwc/terms/guides/text/. This is another point of confusion. GBIF describes DwCA as “an internationally recognized biodiversity informatics data standard” http://www.gbif.org/resources/2551, but it’s not a TDWG standard, just a guideline. I think DwCA deserves to be made a TDWG data exchange standard on its own, separate from Darwin Core, and perhaps expanded in scope to embrace the other “Cores” and relabeled to “Biodiversity Data Archive” or something.
Chuck
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Robert Guralnick Sent: Thursday, August 28, 2014 10:58 AM To: John Deck Cc: TDWG Content Mailing List; Ramona Walls Subject: Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6
Hi all --- Ok, I think the scope of the issue is quite clear. Let me summarize: 1) As Éamonn and the rest of GBIF have made quite clear, "GBIF is faced with the immediate task of making sample-based data discoverable and accessible using its current ecosystem of tools" given a funding mandate from EU-BON. 2) The solution for this problem is to develop an Event-core and to promote new terms to the Darwin Core to make this happen. I will note a small inconsistency here: the current ecosystem of standards and tools is Darwin Core (as it stands) and publishing systems such as IPT. That ecosystem of tools includes mechanisms to extend Darwin Core where needed, via extensions. The current ecosystem of tools doesn't include new Cores or new DwC terms, does it?
So this leads in nicely to the contentious issue(s) and places where there seems to be discussion --- these have to do with the nature of the changes suggested and the scope of those changes, both in terms of an Event core and DwC term additions. Leaving aside the Event-core for now, the key questions about term additions to the Darwin Core that seem to be at the heart of this are: 1) Is the intent of the Darwin Core to model surveys, which usually involve multiple kinds and types of sampling over multiple sites using multiple methods? 2) Is the solution to invent new terms for the Darwin Core? If there are already terms from other efforts, wouldn't we work with those existing efforts to assure interoperability?
I appreciate the efforts of GBIF here fully, and am personally torn because on the one hand, I fully agree with the goal of extending Darwin Core to better represent richer biodiversity data. On the other hand, I worry about process here and how to make that happen in a way that isn't too hasty and doesn't lock us into just the opposite of what I think many of us want with regard to sharing data more broadly than within just one ecosystem of tools.
Best, Rob
On Thu, Aug 28, 2014 at 6:30 AM, John Deck <jdeck@berkeley.edu> wrote: I see the rationale for enabling this in Darwin Core Archives and adding the new terms. However, back to what Matt Jones brought up: "won't we just end up with a new syntax that does essentially what O&M and OBOE do now?".
We should include explicit references to existing terms/definitions that encapsulate what we're talking about, e.g. in our MaterialSample proposal last year we linked to an existing term in OBI, which has a much richer description and context for MaterialSample than what we considered (https://code.google.com/p/darwincore/issues/detail?id=167). Have we explored the possibility of doing this with OBOE? I'm not suggesting we adopt OBOE wholesale, but it seems like we have a good opportunity to enable better semantic linking with that effort.
John
On Thu, Aug 28, 2014 at 4:23 AM, Éamonn Ó Tuama [GBIF] <eotuama@gbif.org> wrote: Thanks, Ramona and Rob.
I'd like to add a few points following on Markus's reply.
I think your pressing of the need for a robust semantic model for biodiversity sample/survey data is incontestable – we do need one and it should enable rich data integration once it is defined and the tools and data standards to support it become available. However, GBIF is faced with the immediate task of making sample-based data discoverable and accessible using its current ecosystem of tools (IPT) and exchange standards (DwC; EML). Waiting for a functional, implementable semantic model and the tools and support services for it is just not an option for us right now.
We have already spent considerable time analysing the merits of Occurrence core vs Event core and have opted for an Event core for reasons previously given. I don’t believe we are trying to reconfigure Event (“an action that occurs at a place and during a period of time”) and regardless of whether we use Occurrence or Event, the need for some additional terms arises (e.g., quantity, quantityType, samplingGeometry, samplingUnit). Once the BCO model is available for uptake, it should be possible to develop a mapping between it and the simple DwC sample model.
So GBIF’s stance is that we need to take a two-pronged approach by exploring how the IPT and DwC-A can be adapted for publishing sample-based data in the near term while supporting the work of TDWG and groups such as the BCO in advancing biodiversity informatics. GBIF has already engaged in the work of the BCO and will continue to do so.
Éamonn
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Markus Döring Sent: 28 August 2014 12:44 To: Ramona Walls Cc: TDWG Content Mailing List Subject: Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6
Hi Ramona & Rob,
The Event proposal does not try to change the semantics of an Event, it just uses the existing Darwin Core Event "class" at the core in Darwin Core archives. The actual change proposed is simply adding 3 new terms to the Event "group" to better share information about sampling methods & efforts, extending the existing limited capabilities of Darwin Core which already has the terms dwc:samplingProtocol and dwc:samplingEffort. It also proposes 2 new terms for dealing with quantity of Occurrences, something that has been discussed since 2012 now, when I had proposed a new abundance term [2].
In general, the application of Darwin Core is not at all limited to specimens and observations. It is already used for sharing taxonomic datasets, and its definition and goal are broad. Let me cite some of the introduction to Darwin Core [1]:
What is the Darwin Core? The Darwin Core is body of standards. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. The Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information.
Motivation: The Darwin Core standard was originally conceived to facilitate the discovery, retrieval, and integration of information about modern biological specimens, their spatiotemporal occurrence, and their supporting evidence housed in collections (physical or digital). The Darwin Core today is broader in scope and more versatile. It is meant to provide a stable standard reference for sharing information on biological diversity. As a glossary of terms, the Darwin Core is meant to provide stable semantic definitions with the goal of being maximally reusable in a variety of contexts.
Markus
[1] http://rs.tdwg.org/dwc/index.htm [2] https://code.google.com/p/darwincore/issues/detail?id=142
-- Markus Döring Software Developer Global Biodiversity Information Facility (GBIF) mdoering@gbif.org http://www.gbif.org
On 27 Aug 2014, at 18:57, Ramona Walls <rlwalls2008@gmail.com> wrote:
I think it is important to consider the purpose of both Darwin Core and DwC archives in deciding whether or not to expand them, but we should use that consideration to address the question at hand, which is whether or not to add an Event core and additional properties to describe events.
Describing the exchange format before the semantics is the wrong way to go, given that we now have a framework for developing semantics. Expanding Darwin Core before we adequately model survey data is bound to lead to problems later, when we try to retrofit the semantics to Darwin Core Event archives. This is exactly the problem we are running into now with Occurrence archives, and we have the opportunity to avoid it.
I suggest we first use existing ontologies to model survey data, then deal with if and how to exchange that information in DwC-A. This is what I was hinting at in my first email, but should have said more explicitly.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 9:14 AM, Robert Guralnick <Robert.Guralnick@colorado.edu> wrote:
It may be a sensible view for Darwin Core Archives and their intended use, but Tim's email suggests we should be putting the method of delivery ahead of the standard that delivers that content. If this was just about DwC-As, why not develop a survey extension that links each occurrence to information about the survey process using the existing star-schema methods we have in place? Why are we discussing adding terms to the Darwin Core or trying to fully reconfigure what we call an Event? That is what is on the table, not DwC-As and how we use them. Or am I missing something?
Best, Rob
On Wed, Aug 27, 2014 at 10:01 AM, Ramona Walls <rlwalls2008@gmail.com> wrote:
Thanks, Tim, and yes, DwC-A as a view (but not necessarily the primary archive) of data seems like the right point of view.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 1:58 AM, Tim Robertson <trobertson@gbif.org> wrote:
Hi Ramona,
Those are good points, and I’d like to come back to the original thinking behind the DwC-A.
It was designed and intended to be a simple way of exposing a complete view of a dataset, primarily for building sophisticated indexes and inventories and for allowing basic analytics (e.g. GBIF.org being one sophisticated index). We found that the star schema provided the flexibility to do a lot, and with the bundled metadata (e.g. EML) was enough to trace provenance and allow users to determine if the dataset might be fit for various uses. In many cases this represents the complete (i.e. lossless) view of a dataset.
What we are discussing here are far richer datasets, where shoe-horning content into the star schema becomes lossy for some, although we’re finding other cases where it is indeed lossless. I believe we should be looking to harmonise ontologies / models etc. as you mention, but in parallel we should define one or more star schema views that can still be used for discovery / reporting / basic analytical purposes, and not for long-term archival of the dataset. The dataset would then have the canonical rich form and an additional DwC-A view. What I write here is applicable to all content types of course.
Please also note that many people put supplementary files in the DwC-A which are ignored by DwC-A readers but could be a way of keeping the richer view in the bundle. If one wished, one could describe those supplementary files in the EML document.
Does this gel with the view of others as well?
Cheers, Tim
On 27 Aug 2014, at 02:55, Ramona Walls <rlwalls2008@gmail.com> wrote:
I think Matt hit the nail on the head. Although Darwin Core can be used to exchange survey data, it lacks the semantics and structure necessary to archive the data without loss of information. I think the biodiversity community would be better served devoting energy to harmonizing existing technologies such as OGC, OBOE, and BCO, not to mention the many databases for storing plot or survey data. The goal should be to preserve the data in the most informative manner possible.
There is a strong case for wanting to search across all evidence for occurrences, including surveys and point occurrences, so I can see possible demand for a tool that would extract occurrences from survey data to a DwC archive. However, I am very concerned that making a DwC archive the primary exchange format for survey or plot data commits us to a path of losing information from the start, for all but the simplest sampling schemas.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Fri, Aug 22, 2014 at 3:00 AM, <tdwg-content-request@lists.tdwg.org> wrote:
Today's Topics:
- Re: Darwin Core: proposed news terms for expressing sample data (Matt Jones)
- Re: Darwin Core: proposed news terms for expressing sample data (Donald Hobern [GBIF])
Message: 1 Date: Thu, 21 Aug 2014 18:52:06 -0800 From: Matt Jones <jones@nceas.ucsb.edu> Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data To: Éamonn Ó Tuama [GBIF] <eotuama@gbif.org> Cc: TDWG Content Mailing List <tdwg-content@lists.tdwg.org> Message-ID: <CAFSW8xkx7uRP9PC2g3=JT_VJanqujH8nPXoz8GXwh+JwKw5Ccw@mail.gmail.com> Content-Type: text/plain; charset="utf-8"
This proposal is treading on ground that is quite similar to other observations and measurements standards for data exchange that are already mature, in particular:
- OGC Observations and Measurements (http://www.opengeospatial.org/standards/om)
- Extensible Observation Ontology (OBOE; https://semtools.ecoinformatics.org/oboe)
The former is a standard and broadly deployed, whereas the latter is part of a research program in the use of ontologies for measurements. Through collaboration between the two projects, they've been modified to be reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an OWL-DL serialization. They largely express the same measurements and sampling model once one gets beyond the terminology differences.
So, I'm wondering if it makes much sense to extend Darwin Core, which is at heart an Occurrence exchange syntax, into this measurements area that is well represented by these other existing specifications? I'm curious to hear why people would even want to do this. And if we do go down this path, won't we just end up with a new syntax that does essentially what O&M and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, Éamonn Ó Tuama [GBIF] <eotuama@gbif.org> wrote:
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event core which is being used in the sense described by Anne and Rich. For any event, you can have a list of species in an Occurrence extension and for each species, you can include quantity and quantityType, e.g., biomass, etc. The proposed term eventSeriesID was intended for linking together related events, although it now looks like parentEventID might be a better, more flexible term. The measurementOrFact extension is a good fit for capturing environmental information relating to an event. See, e.g., the Gialova Lagoon brackish water invertebrate test data set [1] where a set of 18 environmental variables, including temp, pH, Rdx, particulate organic matter, dissolved oxygen, salinity, chlorophyll-a were measured for each sampling station-sampling period combination. An example mapping is:
Id   measurementType   measurementValue   measurementUnit   measurementRemarks
IA   Tmp (sed)         21.5               degree C          Tmp (sed): temperature at the bottom surface
**Controlled vocabularies**
Ideally, the values for samplingUnit and quantityType would be selected from controlled vocabularies. This is, effectively, what we do by presenting a small list of values in a drop-down menu. The current values are what we derived for example data sets and discussion but they can undoubtedly be extended and improved.
We capture "bucket" type measures through a combination of samplingEffort, samplingGeometry and samplingUnit. For example, a pitfall trap (in a point location) left out for 16 days might have samplingEffort: 16, samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a shore survey might have samplingEffort: 3, samplingGeometry: area and samplingUnit: m^2.
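To make the combination concrete, the two examples above can be written down as records. This is only a minimal Python sketch, not part of the proposal: the `SamplingDescriptor` class and its `describe` method are hypothetical names invented for illustration. Only the three field names and the geometry values (point, line, area, volume) are taken from the proposed terms.

```python
from dataclasses import dataclass

# Geometry values named in the proposal: point, line, area or volume.
ALLOWED_GEOMETRIES = {"point", "line", "area", "volume"}


@dataclass(frozen=True)
class SamplingDescriptor:
    """Hypothetical container for the three proposed sampling terms."""
    samplingEffort: float
    samplingGeometry: str
    samplingUnit: str

    def __post_init__(self):
        # Reject geometries outside the small list given in the proposal.
        if self.samplingGeometry not in ALLOWED_GEOMETRIES:
            raise ValueError(f"unknown samplingGeometry: {self.samplingGeometry!r}")

    def describe(self) -> str:
        # ":g" drops a trailing ".0" when the effort is a whole number.
        return f"{self.samplingEffort:g} {self.samplingUnit} ({self.samplingGeometry})"


# The pitfall trap left out for 16 days, and the three m^2 quadrats.
pitfall = SamplingDescriptor(16, "point", "day")
quadrats = SamplingDescriptor(3, "area", "m^2")

print(pitfall.describe())
print(quadrats.describe())
```

A "bucket hours" or "trap nights" measure would need a samplingUnit value outside the current short list, which is where an agreed controlled vocabulary would matter.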
It would be very useful to see your compilation of scope, effort and completeness measures to see if we can express them in our model and/or if we need to reconsider our approach.
Éamonn
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] *On Behalf Of* Markus Döring *Sent:* 20 August 2014 23:47 *To:* Robert Guralnick *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Rob,
this proposal is really for monitoring surveys, not to be confused with material samples like environmental or tissue samples, which have a distinct new dwc class, MaterialSample.
We tend to overload the term sampling a lot and it helps to treat material samples differently from pure observational "sampling". That is why the existing Event class was used as the core and classic Occurrence records as extensions. A classic example is a vegetation survey where each plot represents an Event record and each recorded species in that plot will be an Occurrence extension record with a given quantity. Darwin Core already offers individualCount to specify quantity, but it is a very specific way of measuring "abundance" restricted to only some use cases. Abiotic measurements about the plot (e.g. soil type, pH, temperature) can be published using the measurements or facts extension linked to the Event core.
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick <Robert.Guralnick@colorado.edu> wrote:
Anne -- I don't know the answers! These are questions for Éamonn. I would presume that a sample could be a jumble of species or even just water or soil samples, and biomass would refer to that sample - but maybe that isn't a use case being considered? The examples given in the longer document all link an event_id to species name and some measure of quantity for that species (to the species, not an individual specimen), so I assume that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen <annethessen@gmail.com> wrote:
Hi Rob,
I would like to respond to your item number 2. From my perspective, I deal with lots of published descriptions of taxa. The text might say something like "I saw species A in the Chesapeake Bay, the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The biomass range obviously corresponds to at least three different occurrences, but how to divide the biomass data? I would love to be able to have an *event* to attach it all to. There are almost two different levels of events - a sampling event and a "study event". The "study event" would correspond to the type of event I would like to use in the above example. It may not be ideal, but for the old literature that might be the best we can do.
I have to admit that I don't know enough about trawl data to understand why an event core would be a problem. It seems that the trawl would be an event and each biomass measure (of each fish) would be attached to a separate occurrence which is attached to that event. Am I understanding this wrong?
btw - I found a workaround for the example I gave, so it's not impossible to model with the current structure....
Anne
On 8/20/2014 1:16 PM, Robert Guralnick wrote:
Éamonn et al. --- Thanks for the clarifications. I think these help a ton but they raise a couple more questions for me.
- I am surprised that you plan to use the MeasurementOrFact extension in relation to the Event core, which seems like a novel (or perhaps awkward or unintended?) mechanism for capturing environmental data, but the same extension was not seen as relevant for describing samples? Can you explain more about the thinking there?
- There may be a subtle issue here extending "Event" to be more what you call a "Sampling Event Core". My read of this is that Darwin Core serves as a way to deal with point occurrences and Event reflects the context of a single capture event (whether a single observation, or a bulk sample capture). The changes recommended seem to dramatically extend and change that meaning? It's simply a question that I don't have an answer to, but is Darwin Core the right vehicle to start capturing repeated measures of biomass values from trawls? I don't have an answer but man, terms like quantityType (as a property of occurrence?) give me pause.
- Is Sampling Unit a controlled vocabulary? For another project, I have looked through - and captured scope, effort and completeness measures from - a large number of published biotic area inventories. The vast majority of these are measured in units like bucket hours, or trap nights. Is a "bucket" part of SamplingGeometry or Sampling Unit? I'd be happy to send along all the many examples of how biotic inventories of an area are completed and perhaps it might be good to see how those might be represented using the terms you are proposing?
Best, Rob
On Wed, Aug 20, 2014 at 10:16 AM, Richard Pyle <deepreef@bishopmuseum.org> wrote:
Same here - Events are central to the work that we do.
Aloha,
Rich
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] *On Behalf Of* Anne Thessen *Sent:* Wednesday, August 20, 2014 2:59 AM *To:* tdwg-content@lists.tdwg.org *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hello,
I would just like to comment on *event core*. I've been doing a lot of work translating published data into Darwin Core. During that process I've wished several times that I could use Event as core. I am happy to hear about that proposed change. It will make it easier to model the data I am working with.
Anne
On 8/20/2014 7:04 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob,
Thank you for the feedback. I have tried to address the two main issues you raise below. At the outset, I would like to emphasise that much of this work is taking place in the context of the EU BON project which includes a task on developing/enhancing tools and standards for data sharing with a particular focus on the IPT for publishing sample-based data. So, we were constrained by the need to publish sample-based data sets in the Darwin Core Archive format and to demonstrate practical application using a working prototype. When the discussion on the TDWG list faded out, we took it to our EU BON partners whose requirements were essential input to further development. We recognise that these discussions took place away from TDWG (although the TDWG/EU BON contributors overlapped) and this is the reason we are presenting the outcomes here for further consideration.
**Event core**
As the SIGS report indicated, sample data can be modelled in Darwin Core Archives using either Occurrence or Event as core. This was the starting point for our evaluation but as things progressed the data wrangling pushed the model back towards the Event core. We actually went through the exercise of mapping multiple test datasets in an iterative process spanning several months' work. In the end, we found that using an Event core better matched the typical sample data we were dealing with, allowing use of a measurement-or-fact extension to be included for the efficient expression of environmental information associated with the event. The choice comes down to an Occurrence core or an Event core + Occurrence extension. In both cases, the true observation records are Occurrences. The big difference is what type the core has and therefore to which kind of records you can attach further facts and extra information with DwC-A extensions. Many sampling datasets have very rich information about the site and event, so it is very natural to hang facts from an Event core. When picking the Occurrence core those facts would have to be repeated for each and every occurrence record. Moreover, our approach doesn't stop anyone from using the Occurrence core if they so wish. This just provides a different option for datasets that better fit an Event core model.
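The difference between the two layouts can be illustrated with a few rows. The following is a rough Python sketch under stated assumptions: the rows, identifiers and values are invented for illustration, the helper `flatten_to_occurrence_core` is a hypothetical name, and `quantity`/`quantityType` are the proposed (not yet ratified) terms. It is not how the IPT or any DwC-A reader is implemented.

```python
# Event core: one row per sampling event (values are made up).
events = [
    {"eventID": "plot-1", "samplingProtocol": "quadrat", "eventDate": "2014-06-01"},
]

# Occurrence extension: one row per species recorded in the event.
occurrences = [
    {"eventID": "plot-1", "scientificName": "Species A",
     "quantity": 9, "quantityType": "individuals"},
    {"eventID": "plot-1", "scientificName": "Species B",
     "quantity": 4, "quantityType": "individuals"},
]

# Measurement-or-fact extension: stored once, attached to the event.
event_facts = [
    {"eventID": "plot-1", "measurementType": "pH", "measurementValue": "7.9"},
]


def flatten_to_occurrence_core(events, occurrences):
    """Copy each event's fields onto every occurrence that references it."""
    by_id = {e["eventID"]: e for e in events}
    return [{**by_id[o["eventID"]], **o} for o in occurrences]


flat = flatten_to_occurrence_core(events, occurrences)

# With an Occurrence core, every event-level fact is repeated per occurrence.
repeated_facts = [
    {**fact, "scientificName": occ["scientificName"]}
    for fact in event_facts
    for occ in occurrences
    if occ["eventID"] == fact["eventID"]
]

print(len(event_facts), "fact row(s) with an Event core")
print(len(repeated_facts), "fact row(s) with an Occurrence core")
```

With 18 environmental variables per event, as in the Gialova Lagoon example mentioned elsewhere in this thread, the repetition grows with the number of species recorded per event.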
I want to stress that we are not building a "specific IPT version" to support an Event core but, rather, we adapted the IPT so that it can be configured to support any generic "core + extension" format to enable its use for exploration of more data formats. This is part of the core codebase and there were no custom forks of the IPT for this work. Our view at GBIF is that if there are significant numbers of data publishers who are keen to adopt, promote and use a (any) format, and the tools can be configured to do so, then we should support it, and, if necessary, use a custom namespace.
**New terms around abundance**
Yes, the discussion on TDWG did fade out but it was clear that the term "abundance" as recommended by the SIGS report (along with abundanceAsPercent) was confusing many when we were looking for term(s) that reported quantitative measures of organisms in a sample. It also became clear we would need to be able to state the type of quantity being measured. An alternative suggestion for using the MeasurementsOrFact class was immediately shot down.
As some of our main use cases were coming from the EU BON project, discussion shifted to that forum and consensus formed about the currently proposed terms. It was within this group that the additional terms (samplingGeometry, samplingUnit, eventSeriesID) were proposed and where we began testing with sample data sets.
Best regards,
Éamonn
*From:* robgur@gmail.com [mailto:robgur@gmail.com] *On Behalf Of* Robert Guralnick *Sent:* 19 August 2014 16:56 *To:* Éamonn Ó Tuama [GBIF] *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hi Éamonn --- I am curious about the outcomes presented in the SIGS paper, in particular, this portion of the paper:
"Solutions without introducing an event core in Darwin Core Archives: During the review of the solutions for the uses cases, it became apparent that either model could be applied to every use case. The core and extensions bore a complementary relationship and between them could express all the required information. The core simply provided the central anchor in the star schema from which to join the additional information. Therefore, using the Occurrence core, well established in the GBIF network through uptake of the IPT, seemed more appropriate than inventing CollectingEvent as an additional core type."
That SIGS paper has John Wieczorek and you both as authors, along with many luminaries across the biodiversity standards spectrum. Given the above, it's curious to see the EventCore come back again, along with a specific IPT version to support it.
So I see two issues, conflated, in this post you just made. One is the need for an EventCore at all, and the nature of relating Event and Occurrence/Material Sample. The second is the introduction of new terms, which seemingly have arrived after debate on similar terms - but framed around abundance - stalled a year ago. To my mind, these both require some further discussion, because I don't (necessarily) see TDWG community coherence around either issue?
Best, Rob
On Tue, Aug 19, 2014 at 6:11 AM, Éamonn Ó Tuama [GBIF] <eotuama@gbif.org> wrote:
Dear All,
GBIF is committed to exploring ways in which the IPT and Darwin Core Archive format can be extended for publishing sample-based data sets. In association with the EU BON project [1], a customised version of the IPT [2] has been deployed to test this using a special type of Darwin Core Archive in which the core is an "Event" with associated taxon occurrences in an "Occurrence" extension.
The Darwin Core vocabulary already provides a rich set of terms, many of them relevant for describing sample-based data. Synthesising several sources of input (GBIF organised workshop on sample data, May 2013 [3]; discussions on the TDWG mailing list in late 2013; internal discussion among EU BON project partners), five new terms relating to sample data were identified as essential. The complete model including these new terms is fully described with examples in the online document "Publishing sample data using the GBIF IPT" [4].
As a first step towards ratification, we would like to register the new terms in the DwC Google Code tracker [5] if there are no major objections on this list. The five terms are:
*quantity*: the number or enumeration value of the quantityType (e.g., individuals, biomass, biovolume, BraunBlanquetScale) per samplingUnit or a percentage measure recorded for the sample.
*quantityType*: the entity being referred to by quantity, e.g., individuals, biomass, %species, scale type.
*samplingGeometry*: an indication of what kind of space was sampled; select from point, line, area or volume.
*samplingUnit*: the unit of measurement used for reporting the quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3. It is combined with quantity and quantityType to provide the complete measurement, e.g., 9 individuals per day, 4 biomass-gm per metre^2.
*eventSeriesID*: an identifier for a set of events that are associated in some way, e.g., a monitoring series; may be a global unique identifier or an identifier specific to the series.
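As a rough illustration of how these terms are meant to combine, the short Python sketch below assembles the "complete measurement" strings given in the definitions and groups events by their series. The function names `format_measurement` and `group_by_series` and the sample identifiers are hypothetical and exist only for this example; nothing here is prescribed by the proposal.

```python
from collections import defaultdict


def format_measurement(quantity, quantity_type, sampling_unit):
    """Combine quantity, quantityType and samplingUnit into one readable string."""
    return f"{quantity:g} {quantity_type} per {sampling_unit}"


def group_by_series(events):
    """Group eventIDs under their eventSeriesID; events without one are skipped."""
    series = defaultdict(list)
    for event in events:
        if event.get("eventSeriesID"):
            series[event["eventSeriesID"]].append(event["eventID"])
    return dict(series)


# The two examples from the samplingUnit definition.
print(format_measurement(9, "individuals", "day"))
print(format_measurement(4, "biomass-gm", "metre^2"))

# Invented events: two visits in one monitoring series and one stand-alone event.
monitoring = [
    {"eventID": "visit-1", "eventSeriesID": "series-A"},
    {"eventID": "visit-2", "eventSeriesID": "series-A"},
    {"eventID": "one-off"},
]
print(group_by_series(monitoring))
```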
Best regards,
Éamonn
[1] http://eubon.eu
[3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
[4] http://links.gbif.org/sample_data_model
[5] https://code.google.com/p/darwincore/issues/list
*Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama@gbif.org),*
*Senior Programme Officer for Interoperability,*
*Global Biodiversity Information Facility Secretariat,*
*Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK*
*Phone: +45 3532 1494; Fax: +45 3532 1480*
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
--
Anne E. Thessen, Ph.D.
The Data Detektiv, Owner and Founder
Ronin Institute, Research Scholar
443.225.9185
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.tdwg.org/pipermail/tdwg-content/attachments/20140821/4b338606/attachment-0001.html
Message: 2 Date: Fri, 22 Aug 2014 08:54:07 +0200 From: "Donald Hobern [GBIF]" <dhobern@gbif.org> Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data To: "'Matt Jones'" <jones@nceas.ucsb.edu>, 'Éamonn Ó Tuama [GBIF]' <eotuama@gbif.org> Cc: 'TDWG Content Mailing List' <tdwg-content@lists.tdwg.org> Message-ID: <003401cfbdd5$de52b640$9af822c0$@gbif.org> Content-Type: text/plain; charset="utf-8"
Hi Matt,
I?ll take the chance to make a few quick comments here, because I believe
that this work is of massive importance.
Clearly DwC should avoid trying to duplicate well-standardised models and
protocols. However at the same time, there is enormous value for producers and consumers of DwC to benefit from richer data on the events and methods associated with individual species occurrences. I have never seen DwC as purely an Occurrence exchange syntax. I see it (from GBIF's standpoint) more closely as a mechanism for diverse parties to pool the evidence they have for the occurrence of any species including associated information and/or actionable links to associated information. Users coming from this perspective certainly need (and are demanding) access to all the evidence that can be mobilized to serve as supporting evidence and they also need the ability to understand the significance of these records. Abundance measures, levels of effort, use of consistent methods and redetection of individual organisms are all part of this. DwC should be able to transmit as much data as publishers choose to share on such aspects as part of their publishing of DwC. Users
of DwC carrying out species modeling, threat assessment or community analyses will benefit from rapid ways to filter data for those which derive from standardized sampling events, to understand relative abundance within samples, etc. Many publishers of DwC are currently sharing stripped-down subsets of data and wish to give more information on these points. Users are certainly demanding it.
The challenge is finding the sweet spot, the achievable, non-destructive
overlap between DwC and the proper domain of models better designed to handle the representation of complex systems outside DwC's current domain. If this is done correctly, there should be paths that enable us to generate O&E (and maybe OBOE) compatible data from data that publishers only serve as augmented DwC.
I'll also note that this has been a prominent area of discussion now for
several years. Many of us believe strongly that this is one of the most important ways in which we need to close arbitrary gaps between data silos. It's a prominent part of the GBIF work programme for 2014-2016.
Very best wishes,
Donald
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Matt Jones
Sent: Friday, August 22, 2014 4:52 AM To: Éamonn Ó Tuama [GBIF] Cc: TDWG Content Mailing List Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
This proposal is treading on ground that is quite similar to other
observations and measurements standards for data exchange that are already mature, in particular:
- OGC Observations and Measurements
(http://www.opengeospatial.org/standards/om)
- Extensible Observation Ontology (OBOE;
https://semtools.ecoinformatics.org/oboe)
The former is a standard and broadly deployed, whereas the latter is part
of a research program in the use of ontologies for measurements. Through collaboration between the two projects, they've been modified to be reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an OWL-DL serialization. They largely express the same measurements and sampling model once one gets beyond the terminology differences.
So, I'm wondering if it makes much sense to extend Darwin Core, which is
at heart an Occurrence exchange syntax, into this measurements area that is well represented by these other existing specifications? I'm curious to hear why people would even want to do this. And if we do go down this path, won't we just end up with a new syntax that does essentially what O&M and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, Éamonn Ó Tuama [GBIF] <eotuama@gbif.org> wrote:
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event
core which is being used in the sense described by Anne and Rich. For any event, you can have a list of species in an Occurrence extension and for each species, you can include quantity and quantityType, e.g., biomass, etc. The proposed term eventSeriesID was intended for linking together related events, although it now looks like parentEventID might be a better, more flexible term. The measurementOrFact extension is a good fit for capturing environmental information relating to an event. See, e.g., the Gialova Lagoon brackish water invertebrate test data set [1] where a set of 18 environmental variables, including temp, pH, Rdx, particulate organic matter, dissolved oxygen, salinity, chlorophyll-a were measured for each sampling station-sampling period combination. An example mapping is:
Id   measurementType   measurementValue   measurementUnit   measurementRemarks
IA   Tmp (sed)         21.5               degree C          Tmp (sed): temperature at the bottom surface
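A minimal sketch of how a mapping like the one above could be written out as a delimited measurementOrFact file. The column names and the single row come from the example; the helper name write_measurements, the tab delimiter, and the reading of "Id" as the identifier of the core Event record are assumptions made for illustration, not something the message specifies.

```python
import csv
import io

# Column names taken from the example mapping; "Id" is assumed to carry
# the identifier of the Event core record the measurement belongs to.
MOF_COLUMNS = [
    "Id",
    "measurementType",
    "measurementValue",
    "measurementUnit",
    "measurementRemarks",
]


def write_measurements(rows):
    """Serialize measurement rows as tab-delimited text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=MOF_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# The one row shown in the example mapping.
rows = [
    {
        "Id": "IA",
        "measurementType": "Tmp (sed)",
        "measurementValue": "21.5",
        "measurementUnit": "degree C",
        "measurementRemarks": "Tmp (sed): temperature at the bottom surface",
    }
]

mof_text = write_measurements(rows)
```

Each further environmental variable for the same station and period would become another row with the same Id, which is what lets the extension hang off a single Event record.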
*Controlled vocabularies*
Ideally, the values for samplingUnit and quantityType would be selected
from controlled vocabularies. This is, effectively, what we do by presenting a small list of values in a drop-down menu. The current values are what we derived for example data sets and discussion but they can undoubtedly be extended and improved.
We capture "bucket" type measures through a combination of
samplingEffort, samplingGeometry and samplingUnit. For example, a pitfall trap (in a point location) left out for 16 days might have samplingEffort: 16, samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a shore survey might have samplingEffort: 3, samplingGeometry: area and samplingUnit: m^2.
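The two examples above can be sketched as small records. This is only an illustration: the class name SamplingDescription and its summary method are hypothetical, and the list of allowed geometries is taken from the proposed samplingGeometry definition (point, line, area or volume).

```python
from dataclasses import dataclass

# Values listed in the proposed samplingGeometry definition.
ALLOWED_GEOMETRIES = {"point", "line", "area", "volume"}


@dataclass(frozen=True)
class SamplingDescription:
    """Hypothetical container for the three proposed sampling terms."""

    sampling_effort: float
    sampling_geometry: str
    sampling_unit: str

    def __post_init__(self):
        # Reject geometries outside the small controlled list.
        if self.sampling_geometry not in ALLOWED_GEOMETRIES:
            raise ValueError(f"unknown samplingGeometry: {self.sampling_geometry}")

    def summary(self):
        # Render e.g. "16 day (point)" for quick inspection.
        return f"{self.sampling_effort:g} {self.sampling_unit} ({self.sampling_geometry})"


# Pitfall trap at a point location, left out for 16 days.
pitfall = SamplingDescription(16, "point", "day")

# Three m^2 quadrats in a shore survey.
quadrats = SamplingDescription(3, "area", "m^2")
```

A unit such as "trap night" or "bucket hour" would only need a new samplingUnit value here, which is one way to read the question of whether the vocabulary should be controlled.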
It would be very useful to see your compilation of scope, effort and
completeness measures to see if we can express them in our model and/or if we need to reconsider our approach.
Éamonn
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Markus Döring
Sent: 20 August 2014 23:47 To: Robert Guralnick Cc: TDWG Content Mailing List Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Rob,
this proposal is really for monitoring surveys, not to be confused with
material samples like environmental or tissue samples which have a distinct new dwc class MaterialSample.
We tend to overload the term sampling a lot and it helps to treat
material samples differently from pure observational "sampling". That is why the existing Event class was used as the core and classic Occurrence records as extensions. A classic example is a vegetation survey where each plot represents an Event record and each recorded species in that plot will be an Occurrence extension record with a given quantity. Darwin Core already offers individualCount to specify quantity, but it is a very specific way of measuring "abundance" restricted to only some use cases. Abiotic measurements about the plot (e.g. soil type, pH, temperature) can be published using the measurements or facts extension linked to the Event core.
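A rough sketch of the star-schema idea in the vegetation survey example: one Event record per plot, with Occurrence rows and measurement rows joined to it by a shared identifier. All identifiers, names and values below are invented for illustration, and the helper functions group_by_event and assemble are hypothetical.

```python
from collections import defaultdict

# Invented example data: each plot is one Event core record.
events = [
    {"eventID": "plot-1", "eventDate": "2014-06-01"},
    {"eventID": "plot-2", "eventDate": "2014-06-02"},
]

# Occurrence extension rows: one per species recorded in a plot.
occurrences = [
    {"eventID": "plot-1", "scientificName": "Species A", "quantity": 12, "quantityType": "individuals"},
    {"eventID": "plot-1", "scientificName": "Species B", "quantity": 3, "quantityType": "individuals"},
    {"eventID": "plot-2", "scientificName": "Species A", "quantity": 7, "quantityType": "individuals"},
]

# Measurement rows: abiotic facts about the plot, stated once per Event.
measurements = [
    {"eventID": "plot-1", "measurementType": "pH", "measurementValue": "6.4"},
    {"eventID": "plot-2", "measurementType": "pH", "measurementValue": "5.9"},
]


def group_by_event(rows):
    """Index extension rows by the eventID of their core record."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["eventID"]].append(row)
    return grouped


def assemble(events, occurrences, measurements):
    """Attach extension rows to each Event core record."""
    occ = group_by_event(occurrences)
    mof = group_by_event(measurements)
    return {
        event["eventID"]: {
            "event": event,
            "occurrences": occ.get(event["eventID"], []),
            "measurements": mof.get(event["eventID"], []),
        }
        for event in events
    }


survey = assemble(events, occurrences, measurements)
```

With an Occurrence core instead, the pH value for plot-1 would have to be repeated on both of its occurrence rows; here it is stated once against the Event.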
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick <Robert.Guralnick@colorado.edu> wrote:
Anne -- I don't know the answers! These are questions for Eamonn. I
would presume that a sample could be a jumble of species or even just water or soil samples, and biomass would refer to that sample - but maybe that isn't a use case being considered? The examples given in the longer document all link an event_id to species name and some measure of quantity for that species (to the species, not an individual specimen), so I assume that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen <annethessen@gmail.com> wrote:
Hi Rob I would like to respond to your item number 2.
From my perspective, I deal with lots of published descriptions of taxa.
The text might say something like "I saw species A in the Chesapeake Bay, the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The biomass range obviously corresponds to at least three different occurrences, but how to divide the biomass data? I would love to be able to have an *event* to attach it all to. There are almost two different levels of events - a sampling event and a "study event". The "study event" would correspond to the type of event I would like to use in the above example. It may not be ideal, but for the old literature that might be the best we can do.
I have to admit that I don't know enough about trawl data to understand
why an event core would be a problem. It seems that the trawl would be an event and each biomass measure (of each fish) would be attached to a separate occurrence which is attached to that event. Am I understanding this wrong?
btw - I found a workaround for the example I gave, so it's not impossible
to model with the current structure....
Anne
On 8/20/2014 1:16 PM, Robert Guralnick wrote:
Éamonn et al. --- Thanks for the clarifications. I think these help a
ton but it raises a couple more questions for me.
- I am surprised that you plan to use the MeasurementOrFact extension
in relation to the Event core, which seems like a novel (or perhaps awkward or unintended?) mechanism for capturing environmental data, while the same extension was not seen as relevant for describing samples. Can you explain more about the thinking there?
- There may be a subtle issue here extending "Event" to be more what
you call a "Sampling Event Core". My read of this is that Darwin Core serves as a way to deal with point occurrences and Event reflects the context of a single capture event (whether a single observation, or a bulk sample capture). The changes recommended seem to dramatically extend and change that meaning? It's simply a question that I don't have an answer to, but is Darwin Core the right vehicle to start capturing repeated measures of biomass values from trawls? I don't have an answer, but man, terms like quantityType (as a property of occurrence?) give me pause.
- Is Sampling Unit a controlled vocabulary? For another project, I have
looked through - and captured scope, effort and completeness measures from - a large number of published biotic area inventories. The vast majority of these are measured in units like bucket hours, or trap nights. Is a "bucket" part of SamplingGeometry or Sampling Unit? I'd be happy to send along all the many examples of how biotic inventories of an area are completed and perhaps it might be good to see how those might be represented using the terms you are proposing?
Best, Rob
On Wed, Aug 20, 2014 at 10:16 AM, Richard Pyle <deepreef@bishopmuseum.org> wrote:
Same here – Events are central to the work that we do.
Aloha,
Rich
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Anne Thessen
Sent: Wednesday, August 20, 2014 2:59 AM To: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hello I would just like to comment on *event core*. I've been doing a lot of work translating published data into Darwin
Core. During that process I've wished several times that I could use Event as core. I am happy to hear about that proposed change. It will make it easier to model the data I am working with.
Anne
On 8/20/2014 7:04 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob,
Thank you for the feedback. I have tried to address the two main issues
you raise below. At the outset, I would like to emphasise that much of this work is taking place in the context of the EU BON project which includes a task on developing/enhancing tools and standards for data sharing with a particular focus on the IPT for publishing sample-based data. So, we were constrained by the need to publish sample-based data sets in the Darwin Core Archive format and to demonstrate practical application using a working prototype. When the discussion on the TDWG list faded out, we took it to our EU BON partners whose requirements were essential input to further development. We recognise that these discussions took place away from TDWG (although the TDWG/EU BON contributors overlapped) and this is the reason we are presenting the outcomes here for further consideration.
*Event core*
As the SIGS report indicated, sample data can be modelled in Darwin Core
Archives using either Occurrence or Event as core. This was the starting point for our evaluation but as things progressed the data wrangling pushed the model back towards the Event core. We actually went through the exercise of mapping multiple test datasets in an iterative process spanning several months' work. In the end, we found that using an Event core better matched the typical sample data we were dealing with, allowing use of a measurement-or-fact extension to be included for the efficient expression of environmental information associated with the event. The choice comes down to an Occurrence core or an Event core + Occurrence extension. In both cases, the true observation records are Occurrences. The big difference is what type the core has and therefore to which kind of records you can attach further facts and extra information with DwC-A extensions. Many sampling datasets have very rich information about the site and event, so it is very natural to hang facts
from an Event core. When picking the Occurrence core those facts would have to be repeated for each and every occurrence record. Moreover, our approach doesn't stop anyone from using the Occurrence core if they so wish. This just provides a different option for datasets that better fit an Event core model.
I want to stress that we are not building a "specific IPT version" to
support an Event core but, rather, we adapted the IPT so that it can be configured to support any generic "core + extension" format to enable its use for exploration of more data formats. This is part of the core codebase and there were no custom forks of the IPT for this work. Our view at GBIF is that if there are significant numbers of data publishers who are keen to adopt, promote and use a (any) format, and the tools can be configured to do so, then we should support it, and, if necessary, use a custom namespace.
*New terms around abundance*
Yes, the discussion on TDWG did fade out but it was clear that the term
"abundance" as recommended by the SIGS report (along with abundanceAsPercent) was confusing many when we were looking for term(s) that reported quantitative measures of organisms in a sample. It also became clear we would need to be able to state the type of quantity being measured. An alternative suggestion for using the MeasurementsOrFact class was immediately shot down.
As some of our main use cases were coming from the EU BON project,
discussion shifted to that forum and consensus formed about the currently proposed terms. It was within this group that the additional terms (samplingGeometry, samplingUnit, eventSeriesID) were proposed and where we began testing with sample data sets.
Best regards,
Éamonn
From: robgur@gmail.com [mailto:robgur@gmail.com] On Behalf Of Robert Guralnick
Sent: 19 August 2014 16:56 To: Éamonn Ó Tuama [GBIF] Cc: TDWG Content Mailing List Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hi Éamonn --- I am curious about the outcomes presented in the SIGS
paper, in particular, this portion of the paper:
"Solutions without introducing an event core in Darwin Core Archives:
During the review of the solutions for the uses cases, it became apparent that either model could be applied to every use case. The core and extensions bore a complementary relationship and between them could express all the required information. The core simply provided the central anchor in the star schema from which to join the additional information. Therefore, using the Occurrence core, well established in the GBIF network through uptake of the IPT, seemed more appropriate than inventing CollectingEvent as an additional core type."
That SIGS paper has John Wieczorek and you both as authors, along with
many luminaries across the biodiversity standards spectrum. Given the above, it's curious to see the EventCore come back again, along with a specific IPT version to support it.
So I see two issues, conflated, in this post you just made. One is
the need for an EventCore at all, and the nature of relating Event and Occurrence/Material Sample. The second is the introduction of new terms, which seemingly have arrived after debate on similar terms - but framed around abundance - stalled a year ago. To my mind, these both require some further discussion, because I don't (necessarily) see TDWG community coherence around either issue?
Best, Rob
On Tue, Aug 19, 2014 at 6:11 AM, Éamonn Ó Tuama [GBIF] <eotuama@gbif.org> wrote:
Dear All,
GBIF is committed to exploring ways in which the IPT and Darwin Core
Archive format can be extended for publishing sample-based data sets. In association with the EU BON project [1], a customised version of the IPT [2] has been deployed to test this using a special type of Darwin Core Archive in which the core is an "Event" with associated taxon occurrences in an "Occurrence" extension.
The Darwin Core vocabulary already provides a rich set of terms with many
relevant for describing sample-based data. Synthesising several sources of input (GBIF organised workshop on sample data, May 2013 [3]; discussions on the TDWG mailing list in late 2013; internal discussion among EU BON project partners), five new terms relating to sample data were identified as essential. The complete model including these new terms is fully described with examples in the online document "Publishing sample data using the GBIF IPT" [4].
As a first step towards ratification, we would like to register the new
terms in the DwC Google Code tracker [5] if there are no major objections on this list. The five terms are:
quantity: the number or enumeration value of the quantityType (e.g., individuals, biomass, biovolume, BraunBlanquetScale) per samplingUnit or a percentage measure recorded for the sample.
quantityType: the entity being referred to by quantity, e.g., individuals, biomass, %species, scale type.
samplingGeometry: an indication of what kind of space was sampled; select from point, line, area or volume.
samplingUnit: the unit of measurement used for reporting the quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3. It is combined with quantity and quantityType to provide the complete measurement, e.g., 9 individuals per day, 4 biomass-gm per metre^2.
eventSeriesID: an identifier for a set of events that are associated in some way, e.g., a monitoring series; may be a global unique identifier or an identifier specific to the series.
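As a small illustration of how quantity, quantityType and samplingUnit are meant to combine into a complete measurement, the sketch below renders the two examples given in the samplingUnit definition. The function name describe_quantity is hypothetical, and treating a missing samplingUnit as "no per-unit part" is an assumption, not something the proposal states.

```python
def describe_quantity(quantity, quantity_type, sampling_unit=None):
    """Combine the three proposed terms into one readable measurement."""
    if sampling_unit is None:
        # Assumption: without a samplingUnit, report the bare quantity.
        return f"{quantity:g} {quantity_type}"
    return f"{quantity:g} {quantity_type} per {sampling_unit}"


# The two examples from the samplingUnit definition.
per_day = describe_quantity(9, "individuals", "day")
per_area = describe_quantity(4, "biomass-gm", "metre^2")
```

Keeping the three values in separate fields, and only combining them for display, leaves the numeric quantity available for filtering and aggregation.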
Best regards,
Éamonn
[1] http://eubon.eu
[2] http://eubon-ipt.gbif.org
[3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
[4] http://links.gbif.org/sample_data_model
[5] https://code.google.com/p/darwincore/issues/list
Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama@gbif.org),
Senior Programme Officer for Interoperability,
Global Biodiversity Information Facility Secretariat,
Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK
Phone: +45 3532 1494; Fax: +45 3532 1480
End of tdwg-content Digest, Vol 63, Issue 6
-- John Deck (541) 321-0689
A couple of points of clarification on what Chuck said:
1. Audubon Core is more than an extension of Darwin Core - it's a vocabulary that is a ratified TDWG technical specification standard (despite what [1] says - nobody has managed to change its status from "Draft" to "Current" on the website). So it stands on the same level as Darwin Core even though its terms describe different kinds of things than DwC. All ratified TDWG standards have gone through peer review, public comment, and approval by the TDWG Executive. That is not the case with Plinian Core or GBIF extensions.
2. The Darwin Core text guide that Chuck mentions IS actually part of the Darwin Core Standard. TDWG Standards are composed of normative ("Type 1") documents that define the standard and non-normative ("Type 2") documents that "explain and justify" the standard. For Darwin Core, there is a single normative RDF document that defines the standard. It can be viewed at https://code.google.com/p/darwincore/source/browse/trunk/rdf/dwctermshistory... . There are 67 non-normative documents, which include the entire Darwin Core website (including the text guide page). How do I know that these documents comprise the standard? Because they are what I get when I download and unzip the standard from http://www.tdwg.org/standards/450/ . It is one of TDWG's deficiencies that it is not a simple matter to know what actually comprises a standard and to view clearly-identified components of a standard without having to download a zip archive.
I'm not sure that I can articulate the difference between the Darwin Core Text Guide component of the standard and a "Darwin Core Archive". It might be correct to say that Darwin Core Archives are sets of files that are compliant with the Darwin Core Text Guide.
Steve
[1] http://www.tdwg.org/standards/
Chuck Miller wrote:
Rob,
GBIF is hard at work solving a problem for them. It seems natural for them to do what works best for them, particularly dealing with the volume/scale they do. And, Markus, Tim, Eamonn and others have been deeply involved in Darwin Core standards implementation, inventing the DwCA and the IPT in the first place. But, in using DwCA to get actual work done for GBIF they have added a term here and there via their own extensions. See http://tools.gbif.org/dwca-validator/extensions.do. It’s a long list of Registered and Stable extensions. What about Audubon and Plinian Core? Aren’t they actually “extensions” to Darwin Core, particularly in the “universal biodiversity data” sense Darwin Core seems to be evolving to? Maybe we need a “SampleCore” or “EventCore” that organizes these sampling/event-related terms?
GBIF is creating a new “Event core” for DwCA to do what they need to do. But, I don’t understand the problem with this. Occurrence core, Taxon core, Event core – so what? The DwCA model allows anything to be the core (ignoring IPT) and it does not actually restrict the terms used to DwC only, certainly it already includes Dublin Core and EML. In fact, lots of DwCA exchanges are being used outside GBIF now with extended terms from Audubon Core, Plinian Core, and others. How is this bad? It shows how flexible the DwCA data exchange structure is. And, it is an exchange, not a data storage format. What’s the problem with an Event core?
As an aside, I have said many times that the most confusing thing to our community about DwCA is that it’s called _Darwin Core_ Archive when the structure is in fact not limited to only Darwin Core terms. Any validatable XML term can be used – like http://rs.gbif.org/terms/1.0/verbatimLabel which is used in the How-To Guide Whales example. And I commonly hear people at conferences and meetings now referring to their plans to adopt “Darwin Core” and upon questioning discover they are talking about the DwCA format, not really knowing the details of it. So, the words “Darwin Core” have become ambiguous.
And I want to note that Darwin Core Archive is not a TDWG standard. It is a guideline for implementing the Darwin Core standard in text files. http://rs.tdwg.org/dwc/terms/guides/text/. This is another point of confusion. GBIF describes DwCA as “an internationally recognized biodiversity informatics data standard” http://www.gbif.org/resources/2551, but it’s not a TDWG standard, just a guideline. I think DwCA deserves to be made a TDWG data exchange standard on its own, separate from Darwin Core, and perhaps expanded in scope to embrace the other “Cores” and relabeled to “Biodiversity Data Archive” or something.
Chuck
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Robert Guralnick *Sent:* Thursday, August 28, 2014 10:58 AM *To:* John Deck *Cc:* TDWG Content Mailing List; Ramona Walls *Subject:* Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6
Hi all --- Ok, I think the scope of the issue is quite clear. Let me summarize: 1) As Éamonn and the rest of GBIF have made quite clear, "GBIF is faced with the immediate task of making sample-based data discoverable and accessible using its current ecosystem of tools" given a funding mandate from EU-BON. 2) The solution for this problem is to develop an Event-core and to promote new terms to the Darwin Core to make this happen. I will note a small inconsistency here: the current ecosystem of standards and tools is Darwin Core (as it stands) and publishing systems such as IPT. That ecosystem of tools includes mechanisms to extend Darwin Core where needed, via extensions. The /current/ ecosystem of tools doesn't include new Cores or new DwC terms, does it?
So this leads in nicely to the contentious issue(s) and the places where discussion seems to be needed --- these have to do with the nature of the changes suggested and the scope of those changes, both in terms of an Event core and DwC term additions. Leaving aside the Event-core for now, the key questions about term additions to the Darwin Core that seem to be at the heart of this are: 1) Is the intent of the Darwin Core to model surveys, which usually involve multiple kinds and types of sampling over multiple sites using multiple methods? 2) Is the solution to invent new terms for the Darwin Core? If there are already terms from other efforts, wouldn't we work with those existing efforts to assure interoperability?
I appreciate the efforts of GBIF here fully, and am personally torn because on the one hand, I fully agree with the goal of extending Darwin Core to better represent richer biodiversity data. On the other hand, I worry about process here and how to make that happen in a way that isn't too hasty or locks us into just the opposite of what I think many of us want with regards to sharing data more broadly than within just one ecosystem of tools.
Best, Rob
On Thu, Aug 28, 2014 at 6:30 AM, John Deck <jdeck@berkeley.edu> wrote:
I see the rationale for enabling this in Darwin Core Archives and adding the new terms. However, back to what Matt Jones brought up: "won't we just end up with a new syntax that does essentially what O&M and OBOE do now?".
We should include explicit references to existing terms/definitions that encapsulate what we're talking about, e.g. in our MaterialSample proposal last year we linked to an existing term in OBI, which has a much richer description and context for MaterialSample than what we considered (https://code.google.com/p/darwincore/issues/detail?id=167).
Have we explored the possibility of doing this with OBOE? I'm not suggesting we adopt OBOE wholesale, but it seems like we have a good opportunity to enable better semantic linking with that effort.
John
On Thu, Aug 28, 2014 at 4:23 AM, Éamonn Ó Tuama [GBIF] <eotuama@gbif.org> wrote:
Thanks, Ramona and Rob.
I'd like to add a few points following on Markus's reply.
I think your pressing of the need for a robust semantic model for biodiversity sample/survey data is incontestable – we do need one and it should enable rich data integration once it is defined and the tools and data standards to support it become available. However, GBIF is faced with the immediate task of making sample-based data discoverable and accessible using its current ecosystem of tools (IPT) and exchange standards (DwC; EML). Waiting for a functional, implementable semantic model and the tools and support services for it is just not an option for us right now.
We have already spent considerable time in analysing the merits of Occurrence core vs Event core and have opted for an Event core for reasons previously given. I don’t believe we are trying to reconfigure Event (“an action that occurs at a place and during a period of time”) and regardless of whether we use Occurrence or Event, the need for some additional terms arises (e.g., quantity, quantityType, samplingGeometry, samplingUnit). Once the BCO model is available for uptake, it should be possible to develop a mapping between it and the simple DwC sample model.
So GBIF’s stance is that we need to take a two-pronged approach by exploring how the IPT and DwC-A can be adapted for publishing sample-based data in the near term while supporting the work of TDWG and groups such as the BCO in advancing biodiversity informatics. GBIF has already engaged in the work of the BCO and will continue to do so.
Éamonn
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Markus Döring Sent: 28 August 2014 12:44 To: Ramona Walls Cc: TDWG Content Mailing List
Subject: Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6
Hi Ramona & Rob,
The Event proposal does not try to change the semantics of an Event, it just uses the existing Darwin Core Event "class" at the core in Darwin Core archives. The actual change proposed is simply adding 3 new terms to the Event "group" to better share information about sampling methods & efforts, extending the existing limited capabilities of Darwin Core which already has the terms dwc:samplingProtocol and dwc:samplingEffort. It also proposes 2 new terms for dealing with quantity of Occurrences, something that has been discussed since 2012 now, when I had proposed a new abundance term [2].
In general, the application of Darwin Core is not at all limited to specimens and observations. It is already used for sharing taxonomic datasets, and its definition and goal are broad. Let me cite some of the introduction to Darwin Core [1]:
What is the Darwin Core? The Darwin Core is body of standards. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. The Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information.
Motivation: The Darwin Core standard was originally conceived to facilitate the discovery, retrieval, and integration of information about modern biological specimens, their spatiotemporal occurrence, and their supporting evidence housed in collections (physical or digital). The Darwin Core today is broader in scope and more versatile. It is meant to provide a stable standard reference for sharing information on biological diversity. As a glossary of terms, the Darwin Core is meant to provide stable semantic definitions with the goal of being maximally reusable in a variety of contexts.
Markus
[1] http://rs.tdwg.org/dwc/index.htm [2] https://code.google.com/p/darwincore/issues/detail?id=142
-- Markus Döring Software Developer Global Biodiversity Information Facility (GBIF) mdoering@gbif.org http://www.gbif.org
On 27 Aug 2014, at 18:57, Ramona Walls <rlwalls2008@gmail.com> wrote:
I think it is important to consider the purpose of both Darwin Core and DwC archives in deciding whether or not to expand them, but we should use that consideration to address the question at hand, which is whether or not to add an Event core and additional properties to describe events.
Describing the exchange format before the semantics is the wrong way to go, given that we now have a framework for developing semantics. Expanding Darwin Core before we adequately model survey data is bound to lead to problems later, when we try to retrofit the semantics to Darwin Core Event archives. This is exactly the problem we are running into now with Occurrence archives, and we have the opportunity to avoid it.
I suggest we first use existing ontologies to model survey data, then deal with if and how to exchange that information in DwC-A. This is what I was hinting at in my first email, but I should have said it more explicitly.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 9:14 AM, Robert Guralnick <Robert.Guralnick@colorado.edu> wrote:
It may be a sensible view for Darwin Core Archives and their intended use, but Tim's email suggests we should be putting the method of delivery ahead of the standard that delivers that content. If this was just about DwC-As, why not develop a survey extension that links each occurrence to information about the survey process using the existing star-schema methods we have in place? Why are we discussing adding terms to the Darwin Core or trying to fully reconfigure what we call an Event? That is what is on the table, not DwC-As and how we use them. Or am I missing something?
Best, Rob
On Wed, Aug 27, 2014 at 10:01 AM, Ramona Walls <rlwalls2008@gmail.com> wrote:
Thanks, Tim, and yes, DwC-A as a view (but not necessarily the primary archive) of data seems like the right point of view.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 1:58 AM, Tim Robertson <trobertson@gbif.org> wrote:
Hi Ramona,
Those are good points, and I’d like to come back to the original thinking behind the DwC-A.
It was designed and intended to be a simple way of exposing a complete view of a dataset, primarily for building sophisticated indexes, inventories and allowing basic analytics (e.g. GBIF.org being one sophisticated index). We found that the star schema provided the flexibility to do a lot, and with the bundled metadata (e.g. EML) was enough to trace provenance and allow users to determine if the dataset might be fit for various uses. In many cases this represents the complete (i.e. lossless) view of a dataset.
What we are discussing here are far richer datasets, where shoe-horning content into the star schema becomes lossy for some, although we’re finding other cases where it is indeed lossless. I believe we should be looking to harmonise ontologies / models etc. as you mention, but in parallel we should define one or more star schema views that can still be used for discovery / reporting / basic analytical purposes, and not for long-term archival of the dataset. The dataset would then have the canonical rich form and an additional DwC-A view. What I write here is applicable to all content types of course.
Please also note that many people put supplementary files in the DwC-A which are ignored by DwC-A readers but could be a way of keeping the richer view in the bundle. If one wished, one could describe those supplementary files in the EML document.
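To illustrate the supplementary-file idea, here is a minimal sketch in Python, assuming a zipped archive whose meta.xml lists the data files a reader should load. The meta.xml below is deliberately simplified (a real descriptor also maps columns to terms), and the file names, the `supplementary/full_survey.sqlite` entry and all three functions are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the Darwin Core text guide descriptor (meta.xml).
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"

# Simplified descriptor: real ones also declare id/coreid and field mappings.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core rowType="http://rs.tdwg.org/dwc/terms/Event">
    <files><location>event.txt</location></files>
  </core>
  <extension rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
  </extension>
</archive>"""


def build_archive():
    """Build an in-memory archive with one extra, undeclared file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("meta.xml", META_XML)
        archive.writestr("eml.xml", "<eml/>")
        archive.writestr("event.txt", "eventID\nplot-1\n")
        archive.writestr("occurrence.txt", "eventID\toccurrenceID\nplot-1\tocc-1\n")
        # Hypothetical richer form of the dataset, carried along in the bundle.
        archive.writestr("supplementary/full_survey.sqlite", b"placeholder")
    return buffer.getvalue()


def declared_data_files(archive_bytes):
    """Return the data files that meta.xml tells a reader to load."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        root = ET.fromstring(archive.read("meta.xml"))
        return sorted(el.text for el in root.iter(DWC_TEXT_NS + "location"))


def undeclared_files(archive_bytes):
    """Return files in the bundle that a descriptor-driven reader would skip."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        root = ET.fromstring(archive.read("meta.xml"))
        known = {el.text for el in root.iter(DWC_TEXT_NS + "location")}
        known |= {"meta.xml", root.get("metadata")}
        return sorted(name for name in archive.namelist() if name not in known)


print(declared_data_files(build_archive()))
print(undeclared_files(build_archive()))
```

A reader that only follows the descriptor never touches the extra file, which is why it would have to be described in the EML to be discoverable.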
Does this gel with the view of others as well?
Cheers, Tim
On 27 Aug 2014, at 02:55, Ramona Walls <rlwalls2008@gmail.com> wrote:
I think Matt hit the nail on the head. Although Darwin Core can be used to exchange survey data, it lacks the semantics and structure necessary to archive the data without loss of information. I think the biodiversity community would be better served devoting energy to harmonizing existing technologies such as OGC, OBOE, and BCO, not to mention the many databases for storing plot or survey data. The goal should be to preserve the data in the most informative manner possible.
There is a strong case for wanting to search across all evidence for occurrences, including surveys and point occurrences, so I can see possible demand for a tool that would extract occurrences from survey data to a DwC archive. However, I am very concerned that making a DwC archive the primary exchange format for survey or plot data commits us to a path of losing information from the start, for all but the simplest sampling schemas.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Fri, Aug 22, 2014 at 3:00 AM, <tdwg-content-request@lists.tdwg.org> wrote:
Today's Topics:
- Re: Darwin Core: proposed news terms for expressing sample data (Matt Jones)
- Re: Darwin Core: proposed news terms for expressing sample data (Donald Hobern [GBIF])
Message: 1 Date: Thu, 21 Aug 2014 18:52:06 -0800 From: Matt Jones <jones@nceas.ucsb.edu> Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data To: Éamonn Ó Tuama [GBIF] <eotuama@gbif.org> Cc: TDWG Content Mailing List <tdwg-content@lists.tdwg.org> Message-ID: <CAFSW8xkx7uRP9PC2g3=JT_VJanqujH8nPXoz8GXwh+JwKw5Ccw@mail.gmail.com> Content-Type: text/plain; charset="utf-8"
This proposal is treading on ground that is quite similar to other observations and measurements standards for data exchange that are already mature, in particular:
- OGC Observations and Measurements (http://www.opengeospatial.org/standards/om)
- Extensible Observation Ontology (OBOE; https://semtools.ecoinformatics.org/oboe)
The former is a standard and broadly deployed, whereas the latter is part of a research program in the use of ontologies for measurements. Through collaboration between the two projects, they've been modified to be reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an OWL-DL serialization. They largely express the same measurements and sampling model once one gets beyond the terminology differences.
So, I'm wondering if it makes much sense to extend Darwin Core, which is at heart an Occurrence exchange syntax, into this measurements area that is well represented by these other existing specifications? I'm curious to hear why people would even want to do this. And if we do go down this path, won't we just end up with a new syntax that does essentially what O&M and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, Éamonn Ó Tuama [GBIF] <eotuama@gbif.org> wrote:
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event core which is being used in the sense described by Anne and Rich. For any event, you can have a list of species in an Occurrence extension and for each species, you can include quantity and quantityType, e.g., biomass, etc. The proposed term eventSeriesID was intended for linking together related events, although it now looks like parentEventID might be a better, more flexible term. The measurementOrFact extension is a good fit for capturing environmental information relating to an event. See, e.g., the Gialova Lagoon brackish water invertebrate test data set [1] where a set of 18 environmental variables, including temp, pH, Rdx, particulate organic matter, dissolved oxygen, salinity, chlorophyll-a were measured for each sampling station-sampling period combination. An example mapping is:
| Id | measurementType | measurementValue | measurementUnit | measurementRemarks |
| IA | Tmp (sed) | 21.5 | degree C | Tmp (sed): temperature at the bottom surface |
**Controlled vocabularies**
Ideally, the values for samplingUnit and quantityType would be selected from controlled vocabularies. This is, effectively, what we do by presenting a small list of values in a drop-down menu. The current values are what we derived for example data sets and discussion but they can undoubtedly be extended and improved.
We capture “bucket” type measures through a combination of samplingEffort, samplingGeometry and samplingUnit. For example, a pitfall trap (in a point location) left out for 16 days might have samplingEffort: 16, samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a shore survey might have samplingEffort: 3, samplingGeometry: area and samplingUnit: m^2.
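As a rough illustration of that three-term combination, the two examples above could be held as records like the following. This is only a sketch: the `SamplingDescription` class and its `describe` helper are hypothetical names, and only the term names, the four geometry values and the example values come from the proposal.

```python
from dataclasses import dataclass

# Geometry values listed in the proposal for samplingGeometry.
ALLOWED_GEOMETRIES = {"point", "line", "area", "volume"}


# Hypothetical container for the three terms; not part of Darwin Core or the IPT.
@dataclass(frozen=True)
class SamplingDescription:
    samplingEffort: float
    samplingGeometry: str
    samplingUnit: str

    def __post_init__(self):
        # Reject geometries outside the small list given in the proposal.
        if self.samplingGeometry not in ALLOWED_GEOMETRIES:
            raise ValueError(f"unknown samplingGeometry: {self.samplingGeometry}")

    def describe(self) -> str:
        # Render effort and unit together with the geometry in brackets.
        return f"{self.samplingEffort:g} {self.samplingUnit} ({self.samplingGeometry})"


pitfall_trap = SamplingDescription(16, "point", "day")
shore_quadrats = SamplingDescription(3, "area", "m^2")

print(pitfall_trap.describe())
print(shore_quadrats.describe())
```

Whether "bucket" belongs in the geometry or the unit is exactly the open question in this thread; the sketch simply rejects it as a geometry because it is not in the current list.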
It would be very useful to see your compilation of scope, effort and completeness measures to see if we can express them in our model and/or if we need to reconsider our approach.
Éamonn
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] *On Behalf Of* Markus Döring *Sent:* 20 August 2014 23:47 *To:* Robert Guralnick *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Rob,
this proposal is really for monitoring surveys, not to be confused with material samples like environmental or tissue samples, which have a distinct new dwc class, MaterialSample.
We tend to overload the term sampling a lot, and it helps to treat material samples differently from pure observational "sampling". That is why the existing Event class was used as the core and classic Occurrence records as extensions. A classic example is a vegetation survey where each plot represents an Event record and each recorded species in that plot will be an Occurrence extension record with a given quantity. Darwin Core already offers individualCount to specify quantity, but it is a very specific way of measuring "abundance" restricted to only some use cases. Abiotic measurements about the plot (e.g. soil type, pH, temperature) can be published using the measurements or facts extension linked to the Event core.
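To make the vegetation-survey layout concrete, here is a minimal sketch of the star schema, assuming tab-separated files keyed on eventID. All identifiers, species and values are invented for illustration, `to_tsv` and `rows_for_event` are hypothetical helpers, and quantity and quantityType are the proposed terms rather than ratified ones.

```python
import csv
import io

# Core: one Event record per plot (made-up example values).
events = [
    {"eventID": "plot-1", "eventDate": "2014-06-01", "samplingProtocol": "vegetation plot"},
]

# Occurrence extension: one record per species recorded in the plot.
occurrences = [
    {"eventID": "plot-1", "occurrenceID": "occ-1", "scientificName": "Quercus robur",
     "quantity": "12", "quantityType": "individuals"},
    {"eventID": "plot-1", "occurrenceID": "occ-2", "scientificName": "Fagus sylvatica",
     "quantity": "30", "quantityType": "individuals"},
]

# Measurements-or-facts extension: abiotic facts attached to the plot itself.
measurements = [
    {"eventID": "plot-1", "measurementType": "pH", "measurementValue": "6.5"},
]


def to_tsv(rows):
    """Serialise a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def rows_for_event(event_id, extension_rows):
    """Join extension rows back to the core through the shared eventID."""
    return [row for row in extension_rows if row["eventID"] == event_id]


print(to_tsv(events))
print(to_tsv(rows_for_event("plot-1", occurrences)))
print(to_tsv(rows_for_event("plot-1", measurements)))
```

The pH value is stored once against the plot and reached from any occurrence through eventID, rather than being copied onto each species record.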
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick <Robert.Guralnick@colorado.edu> wrote:
Anne -- I don't know the answers! These are questions for Éamonn. I would presume that a sample could be a jumble of species or even just water or soil samples, and biomass would refer to that sample - but maybe that isn't a use case being considered? The examples given in the longer document all link an event_id to species name and some measure of quantity for that species (to the species, not an individual specimen), so I assume that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen <annethessen@gmail.com> wrote:
Hi Rob,
I would like to respond to your item number 2. From my perspective, I deal with lots of published descriptions of taxa. The text might say something like "I saw species A in the Chesapeake Bay, the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The biomass range obviously corresponds to at least three different occurrences, but how to divide the biomass data? I would love to be able to have an *event* to attach it all to. There are almost two different levels of events - a sampling event and a "study event". The "study event" would correspond to the type of event I would like to use in the above example. It may not be ideal, but for the old literature that might be the best we can do.
I have to admit that I don't know enough about trawl data to understand why an event core would be a problem. It seems that the trawl would be an event and each biomass measure (of each fish) would be attached to a separate occurrence which is attached to that event. Am I understanding this wrong?
btw - I found a workaround for the example I gave, so it's not impossible to model with the current structure....
Anne
On 8/20/2014 1:16 PM, Robert Guralnick wrote:
Éamonn et al. --- Thanks for the clarifications. I think these help a ton but they raise a couple more questions for me.
- I am surprised that you plan to use the MeasurementOrFact extension in relation to the Event core, which seems like a novel (or perhaps awkward or unintended?) mechanism for capturing environmental data, but the same extension was not seen as relevant for describing samples? Can you explain more about the thinking there?
- There may be a subtle issue here in extending "Event" to be more what you call a "Sampling Event Core". My read of this is that Darwin Core serves as a way to deal with point occurrences and Event reflects the context of a single capture event (whether a single observation, or a bulk sample capture). The changes recommended seem to dramatically extend and change that meaning? It's simply a question that I don't have an answer to, but is Darwin Core the right vehicle to start capturing repeated measures of biomass values from trawls? I don't have an answer but man, terms like quantityType (as a property of occurrence?) give me pause.
- Is Sampling Unit a controlled vocabulary? For another project, I have looked through - and captured scope, effort and completeness measures from - a large number of published biotic area inventories. The vast majority of these are measured in units like bucket hours, or trap nights. Is a "bucket" part of SamplingGeometry or Sampling Unit? I'd be happy to send along all the many examples of how biotic inventories of an area are completed and perhaps it might be good to see how those might be represented using the terms you are proposing?
Best, Rob
On Wed, Aug 20, 2014 at 10:16 AM, Richard Pyle <deepreef@bishopmuseum.org> wrote:
Same here - Events are central to the work that we do.
Aloha,
Rich
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] *On Behalf Of* Anne Thessen *Sent:* Wednesday, August 20, 2014 2:59 AM *To:* tdwg-content@lists.tdwg.org *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hello,
I would just like to comment on *event core*. I've been doing a lot of work translating published data into Darwin Core. During that process I've wished several times that I could use Event as core. I am happy to hear about that proposed change. It will make it easier to model the data I am working with.
Anne
On 8/20/2014 7:04 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob,
Thank you for the feedback. I have tried to address the two main issues you raise below. At the outset, I would like to emphasise that much of this work is taking place in the context of the EU BON project which includes a task on developing/enhancing tools and standards for data sharing with a particular focus on the IPT for publishing sample-based data. So, we were constrained by the need to publish sample-based data sets in the Darwin Core Archive format and to demonstrate practical application using a working prototype. When the discussion on the TDWG list faded out, we took it to our EU BON partners whose requirements were essential input to further development. We recognise that these discussions took place away from TDWG (although the TDWG/EU BON contributors overlapped) and this is the reason we are presenting the outcomes here for further consideration.
**Event core**
As the SIGS report indicated, sample data can be modelled in Darwin Core Archives using either Occurrence or Event as core. This was the starting point for our evaluation but as things progressed the data wrangling pushed the model back towards the Event core. We actually went through the exercise of mapping multiple test datasets in an iterative process spanning several months' work. In the end, we found that using an Event core better matched the typical sample data we were dealing with, allowing a measurement-or-fact extension to be included for the efficient expression of environmental information associated with the event.
The choice comes down to an Occurrence core or an Event core + Occurrence extension. In both cases, the true observation records are Occurrences. The big difference is what type the core has and therefore to which kind of records you can attach further facts and extra information with DwC-A extensions. Many sampling datasets have very rich information about the site and event, so it is very natural to hang facts from an Event core. When picking the Occurrence core those facts would have to be repeated for each and every occurrence record. Moreover, our approach doesn’t stop anyone from using the Occurrence core if they so wish. This just provides a different option for datasets that better fit an Event core model.
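The repetition argument can be put in numbers with a back-of-the-envelope sketch. The figures below (one event, 40 species, 18 facts per event) are hypothetical, and both function names are invented for illustration.

```python
def fact_rows_event_core(n_events, facts_per_event):
    """With an Event core, each fact is stored once per event."""
    return n_events * facts_per_event


def fact_rows_occurrence_core(n_events, occurrences_per_event, facts_per_event):
    """With an Occurrence core, event-level facts repeat on every occurrence."""
    return n_events * occurrences_per_event * facts_per_event


# Hypothetical survey: 1 sampling event, 40 recorded species, 18 event-level facts.
print(fact_rows_event_core(1, 18))
print(fact_rows_occurrence_core(1, 40, 18))
```

Under those assumed numbers the Occurrence core carries forty copies of every event-level fact, which is the redundancy the Event core avoids.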
I want to stress that we are not building a “specific IPT version” to support an Event core but, rather, we adapted the IPT so that it can be configured to support any generic “core + extension” format to enable its use for exploration of more data formats. This is part of the core codebase and there were no custom forks of the IPT for this work. Our view at GBIF is that if there are significant numbers of data publishers who are keen to adopt, promote and use a (any) format, and the tools can be configured to do so, then we should support it, and, if necessary, use a custom namespace.
**New terms around abundance**
Yes, the discussion on TDWG did fade out but it was clear that the term “abundance” as recommended by the SIGS report (along with abundanceAsPercent) was confusing many when we were looking for term(s) that reported quantitative measures of organisms in a sample. It also became clear we would need to be able to state the type of quantity being measured. An alternative suggestion for using the MeasurementsOrFact class was immediately shot down.
As some of our main use cases were coming from the EU BON project, discussion shifted to that forum and consensus formed about the currently proposed terms. It was within this group that the additional terms (samplingGeometry, samplingUnit, eventSeriesID) were proposed and where we began testing with sample data sets.
Best regards,
Éamonn
*From:* robgur@gmail.com [mailto:robgur@gmail.com] *On Behalf Of* Robert Guralnick *Sent:* 19 August 2014 16:56 *To:* Éamonn Ó Tuama [GBIF] *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hi Éamonn --- I am curious about the outcomes presented in the SIGS paper, in particular, this portion of the paper:
"Solutions without introducing an event core in Darwin Core Archives: During the review of the solutions for the uses cases, it became apparent that either model could be applied to every use case. The core and extensions bore a complementary relationship and between them could express all the required information. The core simply provided the central anchor in the star schema from which to join the additional information. Therefore, using the Occurrence core, well established in the GBIF network through uptake of the IPT, seemed more appropriate than inventing CollectingEvent as an additional core type."
That SIGS paper has John Wieczorek and you both as authors, along with many luminaries across the biodiversity standards spectrum. Given the above, it's curious to see the EventCore come back again, along with a specific IPT version to support it.
So I see two issues, conflated, in this post you just made. One is the need for an EventCore at all, and the nature of relating Event and Occurrence/Material Sample. The second is the introduction of new terms, which seemingly have arrived after debate on similar terms - but framed around abundance - stalled a year ago. To my mind, these both require some further discussion, because I don't (necessarily) see TDWG community coherence around either issue?
Best, Rob
On Tue, Aug 19, 2014 at 6:11 AM, Éamonn Ó Tuama [GBIF] <eotuama@gbif.org> wrote:
Dear All,
GBIF is committed to exploring ways in which the IPT and Darwin Core Archive format can be extended for publishing sample-based data sets. In association with the EU BON project [1], a customised version of the IPT [2] has been deployed to test this using a special type of Darwin Core Archive in which the core is an “Event” with associated taxon occurrences in an “Occurrence” extension.
The Darwin Core vocabulary already provides a rich set of terms, many of them relevant for describing sample-based data. Synthesising several sources of input (GBIF organised workshop on sample data, May 2013 [3]; discussions on the TDWG mailing list in late 2013; internal discussion among EU BON project partners), five new terms relating to sample data were identified as essential. The complete model including these new terms is fully described with examples in the online document “Publishing sample data using the GBIF IPT” [4].
As a first step towards ratification, we would like to register the new terms in the DwC Google Code tracker [5] if there are no major objections on this list. The five terms are:
*quantity*: the number or enumeration value of the quantityType (e.g., individuals, biomass, biovolume, BraunBlanquetScale) per samplingUnit or a percentage measure recorded for the sample.
*quantityType*: the entity being referred to by quantity, e.g., individuals, biomass, %species, scale type.
*samplingGeometry*: an indication of what kind of space was sampled; select from point, line, area or volume.
*samplingUnit*: the unit of measurement used for reporting the quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3. It is combined with quantity and quantityType to provide the complete measurement, e.g., 9 individuals per day, 4 biomass-gm per metre^2.
*eventSeriesID*: an identifier for a set of events that are associated in some way, e.g., a monitoring series; may be a global unique identifier or an identifier specific to the series.
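As a small sketch of how quantity, quantityType and samplingUnit are meant to combine into a complete measurement, the two examples given under samplingUnit can be reproduced as follows. The function name `complete_measurement` is hypothetical; only the term names and example values come from the proposal.

```python
def complete_measurement(quantity, quantity_type, sampling_unit):
    """Combine the three proposed terms into one readable measurement string."""
    # ":g" drops a trailing ".0" so 9.0 and 9 both render as "9".
    return f"{quantity:g} {quantity_type} per {sampling_unit}"


print(complete_measurement(9, "individuals", "day"))
print(complete_measurement(4, "biomass-gm", "metre^2"))
```

Note that a percentage measure recorded for the whole sample would not fit this "per unit" pattern and would need separate handling.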
Best regards,
Éamonn
[1] http://eubon.eu
[3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
[4] http://links.gbif.org/sample_data_model
[5] https://code.google.com/p/darwincore/issues/list
*Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama@gbif.org),*
*Senior Programme Officer for Interoperability,*
*Global Biodiversity Information Facility Secretariat,*
*Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK*
*Phone: +45 3532 1494; Fax: +45 3532 1480*
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
--
Anne E. Thessen, Ph.D.
The Data Detektiv, Owner and Founder
Ronin Institute, Research Scholar
443.225.9185
A problem with this meme, leading e.g. to "Audubon Core is an 'extension' of Darwin Core", is that some of the words in play have both formal and informal meanings, and too often, very necessary quotation marks and capitalization are omitted by hasty composition. This leads to conversations with sentences like
<A> is an extension of Darwin Core
Consider the <A> Darwin Core Extension
Blah, blah, blah Darwin Core Extension
Among the cases of this are "Darwin Core is an extension of Darwin Core", "Consider the Darwin Core Darwin Core Extension", etc.
The problem is not just with the silly cases. I think none of the architects of Audubon Core would agree that "Audubon Core is an extension of Darwin Core" just because about half (but not all!) of the Darwin Core terms are part of Audubon Core. (A posting of Steve Baskauf around 3:46 EDT makes this same(?) point.)
When a Knowledge Representation term has both a technically defined use and an informal use, discussions are often at cross-purposes, with some participants meaning one thing, but some readers understanding the other.
The meme should not be blamed on Chuck's post. It's widespread in TDWG postings, no doubt including a few of mine here and there. :-) In the current thread, I've tried to read the postings as they fly by to see if I'm supposed to read them carefully. I've never succeeded in that, so I retreat to a position that as a system architect I have to deal with whatever gets produced anyway. This is a little counterproductive, because occasionally I have something to contribute that could at least warn about risks known to some specialists but not to all the parties to the conversation.
This post should have subject: The Sigh extension to Darwin Core
Sigh.
Bob
On Thu, Aug 28, 2014 at 2:16 PM, Chuck Miller Chuck.Miller@mobot.org wrote:
Rob,
GBIF is hard at work solving a problem for them. It seems natural for them to do what works best for them, particularly dealing with the volume/scale they do. And Markus, Tim, Éamonn and others have been deeply involved in Darwin Core standards implementation, inventing the DwCA and the IPT in the first place. But in using DwCA to get actual work done for GBIF they have added a term here and there via their own extensions. See http://tools.gbif.org/dwca-validator/extensions.do. It’s a long list of Registered and Stable extensions. What about Audubon and Plinian Core? Aren’t they actually “extensions” to Darwin Core, particularly in the “universal biodiversity data” sense Darwin Core seems to be evolving towards? Maybe we need a “SampleCore” or “EventCore” that organizes these sampling/event-related terms?
GBIF is creating a new “Event core” for DwCA to do what they need to do. But, I don’t understand the problem with this. Occurrence core, Taxon core, Event core – so what? The DwCA model allows anything to be the core (ignoring IPT) and it does not actually restrict the terms used to DwC only, certainly it already includes Dublin Core and EML. In fact, lots of DwCA exchanges are being used outside GBIF now with extended terms from Audubon Core, Plinian Core, and others. How is this bad? It shows how flexible the DwCA data exchange structure is. And, it is an exchange, not a data storage format. What’s the problem with an Event core?
As an aside, I have said many times that the most confusing thing to our community about DwCA is that it’s called *Darwin Core* Archive when the structure is in fact not limited to only Darwin Core terms. Any validatable XML term can be used – like http://rs.gbif.org/terms/1.0/verbatimLabel which is used in the How-To Guide Whales example. And I commonly hear people at conferences and meetings now referring to their plans to adopt “Darwin Core” and upon questioning discover they are talking about the DwCA format, not really knowing the details of it. So, the words “Darwin Core” have become ambiguous.
And I want to note that Darwin Core Archive is not a TDWG standard. It is a guideline for implementing the Darwin Core standard in text files. http://rs.tdwg.org/dwc/terms/guides/text/. This is another point of confusion. GBIF describes DwCA as “an internationally recognized biodiversity informatics data standard” http://www.gbif.org/resources/2551, but it’s not a TDWG standard, just a guideline. I think DwCA deserves to be made a TDWG data exchange standard on its own, separate from Darwin Core, and perhaps expanded in scope to embrace the other “Cores” and relabeled to “Biodiversity Data Archive” or something.
Chuck
*From:* tdwg-content-bounces@lists.tdwg.org [mailto: tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Robert Guralnick *Sent:* Thursday, August 28, 2014 10:58 AM *To:* John Deck *Cc:* TDWG Content Mailing List; Ramona Walls
*Subject:* Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6
Hi all --- Ok, I think the scope of the issue is quite clear. Let me summarize: 1) As Éamonn and the rest of GBIF has made quite clear, "GBIF is faced with the immediate task of making sample-based data discoverable and accessible using its current ecosystem of tools" given a funding mandate from EU-BON. 2) The solution for this problem is to develop an Event-core and to promote new terms to the Darwin Core to make this happen. I will note a small inconsistency here: the current ecosystem standards and tools of is Darwin Core (as it stands) and publishing systems such as IPT. That ecosystem of tools includes mechanisms to extend Darwin Core where needed, via extensions. The *current* ecosystem of tools doesn't include new Cores or new DwC terms, does it?
So this leads in nicely to the contentious issue(s) and places where there seems to be discussion --- these have to do with the nature of the changes suggested and the scope of those changes, both in terms of an Event core and DwC term additions. Leaving aside the Event-core for now, the key questions about term additions to the Darwin Core that seem to be at the heart of this are: 1) Is the intent of the Darwin Core to model surveys, which usually involve multiple kinds and types of sampling over multiple sites using multiple methods? 2) Is the solution to invent new terms for the Darwin Core? If there are already terms from other efforts, wouldn't we work with those existing efforts to assure interoperability?
I appreciate the efforts of GBIF here fully, and am personally torn because on the one hand, I fully agree with the goal of extending Darwin Core to better represent richer biodiversity data. On the other hand, I worry about process here and how to make that happen in a way that isn't too hasty or locks us into just the opposite of what I think many of us want with regards to sharing data more broadly than within just one ecosystem of tools.
Best, Rob
On Thu, Aug 28, 2014 at 6:30 AM, John Deck jdeck@berkeley.edu wrote:
I see the rationale for enabling this in Darwin Core Archives and adding the new terms. However, back to what Matt Jones brought up: "won't we just end up with a new syntax that does essentially what O&M and OBOE do now?".
We should include explicit references to existing terms/definitions that encapsulate what we're talking about, e.g. in our MaterialSample proposal last year we linked to an existing term in OBI, which has a much richer description and context for MaterialSample than what we considered ( https://code.google.com/p/darwincore/issues/detail?id=167)
Have we explored the possibility of doing this with OBOE? I'm not suggesting we adopt OBOE wholesale, but it seems like we have a good opportunity to enable better semantic linking with that effort.
John
On Thu, Aug 28, 2014 at 4:23 AM, Éamonn Ó Tuama [GBIF] eotuama@gbif.org wrote:
Thanks, Ramona and Rob.
I'd like to add a few points following on Markus's reply.
I think your pressing of the need for a robust semantic model for biodiversity sample/survey data is incontestable – we do need one and it should enable rich data integration once it is defined and the tools and data standards to support it become available. However, GBIF is faced with the immediate task of making sample-based data discoverable and accessible using its current ecosystem of tools (IPT) and exchange standards (DwC; EML). Waiting for a functional, implementable semantic model and the tools and support services for it is just not an option for us right now.
We have already spent considerable time analysing the merits of Occurrence core vs Event core and have opted for an Event core for reasons previously given. I don’t believe we are trying to reconfigure Event (“an action that occurs at a place and during a period of time”) and regardless of whether we use Occurrence or Event, the need for some additional terms arises (e.g., quantity, quantityType, samplingGeometry, samplingUnit). Once the BCO model is available for uptake, it should be possible to develop a mapping between it and the simple DwC sample model.
So GBIF’s stance is that we need to take a two-pronged approach by exploring how the IPT and DwC-A can be adapted for publishing sample-based data in the near term while supporting the work of TDWG and groups such as the BCO in advancing biodiversity informatics. GBIF has already engaged in the work of the BCO and will continue to do so.
Éamonn
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Markus Döring Sent: 28 August 2014 12:44 To: Ramona Walls Cc: TDWG Content Mailing List
Subject: Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6
Hi Ramona & Rob,
The Event proposal does not try to change the semantics of an Event, it just uses the existing Darwin Core Event "class" at the core in Darwin Core archives. The actual change proposed is simply adding 3 new terms to the Event "group" to better share information about sampling methods & efforts, extending the existing limited capabilities of Darwin Core which already has the terms dwc:samplingProtocol and dwc:samplingEffort. It also proposes 2 new terms for dealing with quantity of Occurrences, something that has been discussed since 2012 now, when I had proposed a new abundance term [2].
In general, the application of Darwin Core is not at all limited to specimens and observations. It is already used for sharing taxonomic datasets, and its definition and goal are broad. Let me cite some of the introduction to Darwin Core [1]:
What is the Darwin Core? The Darwin Core is body of standards. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. The Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information.
Motivation: The Darwin Core standard was originally conceived to facilitate the discovery, retrieval, and integration of information about modern biological specimens, their spatiotemporal occurrence, and their supporting evidence housed in collections (physical or digital). The Darwin Core today is broader in scope and more versatile. It is meant to provide a stable standard reference for sharing information on biological diversity. As a glossary of terms, the Darwin Core is meant to provide stable semantic definitions with the goal of being maximally reusable in a variety of contexts.
Markus
[1] http://rs.tdwg.org/dwc/index.htm [2] https://code.google.com/p/darwincore/issues/detail?id=142
-- Markus Döring Software Developer Global Biodiversity Information Facility (GBIF) mdoering@gbif.org http://www.gbif.org
On 27 Aug 2014, at 18:57, Ramona Walls rlwalls2008@gmail.com wrote:
I think it is important to consider the purpose of both Darwin Core and DwC archives in deciding whether or not to expand them, but we should use that consideration to address the question at hand, which is whether or not to add an Event core and additional properties to describe events.
Describing the exchange format before the semantics is the wrong way to go, given that we now have a framework for developing semantics. Expanding Darwin Core before we adequately model survey data is bound to lead to problems later, when we try to retrofit the semantics to Darwin Core Event archives. This is exactly the problem we are running into now with Occurrence archives, and we have the opportunity to avoid it.
I suggest we first use existing ontologies to model survey data, then deal with if and how to exchange that information in DwC-A. This is what I was hinting at in my first email, but should have said more explicitly.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 9:14 AM, Robert Guralnick Robert.Guralnick@colorado.edu wrote:
It may be a sensible view for Darwin Core Archives and their intended use, but Tim's email suggests we should be putting the method of delivery ahead of the standard that delivers that content. If this was just about DwC-As, why not develop a survey extension that links each occurrence to information about the survey process using the existing star-schema methods we have in place? Why are we discussing adding terms to the Darwin Core or trying to fully reconfigure what we call an Event? That is what is on the table, not DwC-As and how we use them. Or am I missing something?
Best, Rob
On Wed, Aug 27, 2014 at 10:01 AM, Ramona Walls rlwalls2008@gmail.com wrote:
Thanks, Tim, and yes, DwC-A as a view (but not necessarily the primary archive) of data seems like the right point of view.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 1:58 AM, Tim Robertson trobertson@gbif.org wrote:
Hi Ramona,
Those are good points, and I’d like to come back to the original thinking behind the DwC-A.
It was designed and intended to be a simple way of exposing a complete view of a dataset, primarily for building sophisticated indexes, inventories and allowing basic analytics (e.g. GBIF.org being one sophisticated index). We found that the star schema provided the flexibility to do a lot, and with the bundled metadata (e.g. EML) was enough to trace provenance and allow users to determine if the dataset might be fit for various uses. In many cases this represents the complete (e.g. lossless) view of a dataset.
What we are discussing here are far richer datasets, where shoe-horning content into the star schema becomes lossy for some, although we’re finding other cases where it is indeed lossless. I believe we should be looking to harmonise ontologies / models etc as you mention but in parallel we should define one or more star schema views that can still be used for discovery / reporting / basic analytical purposes, and not for long-term archival of the dataset. The dataset would then have the canonical rich form and an additional DwC-A view. What I write here is applicable to all content types of course.
Please also note that many people put supplementary files in the DwC-A which are ignored by DwC-A readers but could be a way of keeping the richer view in the bundle. If one wished, one could describe those supplementary files in the EML document.
Does this gel with the view of others as well?
Cheers, Tim
On 27 Aug 2014, at 02:55, Ramona Walls rlwalls2008@gmail.com wrote:
I think Matt hit the nail on the head. Although Darwin Core can be used to exchange survey data, it lacks the semantics and structure necessary to archive the data without loss of information. I think the biodiversity community would be better served devoting energy to harmonizing existing technologies such as OGC, OBOE, and BCO, not to mention the many databases for storing plot or survey data. The goal should be to preserve the data in the most informative manner possible.
There is a strong case for wanting to search across all evidence for occurrences, including surveys and point occurrences, so I can see possible demand for a tool that would extract occurrences from survey data to a DwC archive. However, I am very concerned that making a DwC archive the primary exchange format for survey or plot data commits us to a path of losing information from the start, for all but the simplest sampling schemas.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Fri, Aug 22, 2014 at 3:00 AM, tdwg-content-request@lists.tdwg.org wrote:
Message: 1 Date: Thu, 21 Aug 2014 18:52:06 -0800 From: Matt Jones jones@nceas.ucsb.edu Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data To: Éamonn Ó Tuama [GBIF] eotuama@gbif.org Cc: TDWG Content Mailing List tdwg-content@lists.tdwg.org Message-ID: <CAFSW8xkx7uRP9PC2g3=JT_VJanqujH8nPXoz8GXwh+JwKw5Ccw@mail.gmail.com> Content-Type: text/plain; charset="utf-8"
This proposal is treading on ground that is quite similar to other observations and measurements standards for data exchange that are already mature, in particular:
- OGC Observations and Measurements (http://www.opengeospatial.org/standards/om)
- Extensible Observation Ontology (OBOE; https://semtools.ecoinformatics.org/oboe)
The former is a standard and broadly deployed, whereas the latter is part of a research program in the use of ontologies for measurements. Through collaboration between the two projects, they've been modified to be reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an OWL-DL serialization. They largely express the same measurements and sampling model once one gets beyond the terminology differences.
So, I'm wondering if it makes much sense to extend Darwin Core, which is at heart an Occurrence exchange syntax, into this measurements area that is well represented by these other existing specifications? I'm curious to hear why people would even want to do this. And if we do go down this path, won't we just end up with a new syntax that does essentially what O&M and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event core which is being used in the sense described by Anne and Rich. For any event, you can have a list of species in an Occurrence extension and for each species, you can include quantity and quantityType, e.g., biomass, etc. The proposed term eventSeriesID was intended for linking together related events, although it now looks like parentEventID might be a better, more flexible term. The measurementOrFact extension is a good fit for capturing environmental information relating to an event. See, e.g., the Gialova Lagoon brackish water invertebrate test data set [1] where a set of 18 environmental variables, including temp, pH, Rdx, particulate organic matter, dissolved oxygen, salinity, chlorophyll-a were measured for each sampling station-sampling period combination. An example mapping is:
| Id | measurementType | measurementValue | measurementUnit | measurementRemarks |
| IA | Tmp (sed) | 21.5 | degree C | Tmp (sed): temperature at the bottom surface |
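A minimal sketch of how the mapping above might look as one row of a measurementOrFact extension file. The column names and values are taken from the example; the tab-delimited layout, the lower-case `id` column (assumed here to point at the core Event record) and the helper name `write_measurements` are illustrative assumptions, not the IPT's actual output.

```python
import csv
import io

# Column names from the example mapping above.
FIELDS = ["id", "measurementType", "measurementValue",
          "measurementUnit", "measurementRemarks"]


def write_measurements(rows):
    """Serialise measurement rows as tab-delimited text (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


row = {
    "id": "IA",  # assumed to identify the core Event this measurement hangs from
    "measurementType": "Tmp (sed)",
    "measurementValue": "21.5",
    "measurementUnit": "degree C",
    "measurementRemarks": "Tmp (sed): temperature at the bottom surface",
}

text = write_measurements([row])
print(text)
```

Each further environmental variable for the same station and period would simply be another row with the same `id`.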
**Controlled vocabularies**
Ideally, the values for samplingUnit and quantityType would be selected from controlled vocabularies. This is, effectively, what we do by presenting a small list of values in a drop-down menu. The current values are what we derived for example data sets and discussion but they can undoubtedly be extended and improved.
We capture “bucket” type measures through a combination of samplingEffort, samplingGeometry and samplingUnit. For example, a pitfall trap (in a point location) left out for 16 days might have samplingEffort: 16, samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a shore survey might have samplingEffort: 3, samplingGeometry: area and samplingUnit: m^2.
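The two examples above could be held as simple records along the following lines. This is only a sketch: the class name `SamplingDescription`, its field names and the `label` method are made up for illustration, and the geometry check assumes the four values named in the proposal (point, line, area, volume).

```python
from dataclasses import dataclass

# Geometry values listed in the proposal.
GEOMETRIES = {"point", "line", "area", "volume"}


@dataclass(frozen=True)
class SamplingDescription:
    """Hypothetical container for the three sampling terms."""
    sampling_effort: float
    sampling_geometry: str
    sampling_unit: str

    def __post_init__(self):
        # Reject anything outside the small controlled list.
        if self.sampling_geometry not in GEOMETRIES:
            raise ValueError(f"unknown samplingGeometry: {self.sampling_geometry}")

    def label(self):
        """Human-readable summary, e.g. '16 day (point)'."""
        return f"{self.sampling_effort:g} {self.sampling_unit} ({self.sampling_geometry})"


# Pitfall trap at a point location, left out for 16 days.
pitfall = SamplingDescription(16, "point", "day")
# Three m^2 quadrats in a shore survey.
quadrats = SamplingDescription(3, "area", "m^2")

print(pitfall.label())
print(quadrats.label())
```

A value such as "bucket" would be rejected as a geometry here, which is one way of forcing the question of where such measures belong.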
It would be very useful to see your compilation of scope, effort and completeness measures to see if we can express them in our model and/or if we need to reconsider our approach.
Éamonn
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
*From:* tdwg-content-bounces@lists.tdwg.org [mailto: tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Markus Döring *Sent:* 20 August 2014 23:47 *To:* Robert Guralnick
*Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Rob,
this proposal is really for monitoring surveys, not to be confused with material samples like environmental or tissue samples, which have a distinct new dwc class, MaterialSample.
We tend to overload the term sampling a lot and it helps to treat material samples differently from pure observational "sampling". That is why the existing Event class was used as the core and classic Occurrence records as extensions. A classic example is a vegetation survey where each plot represents an Event record and each recorded species in that plot will be an Occurrence extension record with a given quantity. Darwin Core already offers individualCount to specify quantity, but it is a very specific way of measuring "abundance" restricted to only some use cases. Abiotic measurements about the plot (e.g. soil type, pH, temperature) can be published using the measurements or facts extension linked to the Event core.
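The vegetation survey example can be sketched as three small tables joined on the event identifier. All plot identifiers, species and values below are invented for illustration, and `records_for` is a hypothetical helper; only the term names (eventID, quantity, quantityType, measurementType, measurementValue) come from Darwin Core or the proposal.

```python
# Core: one Event record per plot (made-up identifiers).
events = [
    {"eventID": "plot-1", "samplingProtocol": "vegetation plot"},
    {"eventID": "plot-2", "samplingProtocol": "vegetation plot"},
]

# Occurrence extension: one record per species recorded in a plot.
occurrences = [
    {"eventID": "plot-1", "scientificName": "Quercus robur",
     "quantity": 12, "quantityType": "individuals"},
    {"eventID": "plot-1", "scientificName": "Fagus sylvatica",
     "quantity": 30, "quantityType": "individuals"},
    {"eventID": "plot-2", "scientificName": "Quercus robur",
     "quantity": 4, "quantityType": "individuals"},
]

# Measurement-or-fact extension: abiotic measurements about the plot.
measurements = [
    {"eventID": "plot-1", "measurementType": "pH", "measurementValue": 6.4},
    {"eventID": "plot-2", "measurementType": "pH", "measurementValue": 5.9},
]


def records_for(event_id, rows):
    """Return the extension rows that point at the given core event."""
    return [r for r in rows if r["eventID"] == event_id]


plot1_species = [r["scientificName"] for r in records_for("plot-1", occurrences)]
plot1_facts = records_for("plot-1", measurements)
print(plot1_species, plot1_facts)
```

Both extensions hang off the same core record, which is the star-schema shape a Darwin Core Archive expects.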
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick wrote:
Anne -- I don't know the answers! These are questions for Eamonn. I would presume that a sample could be a jumble of species or even just water or soil samples, and biomass would refer to that sample - but maybe that isn't a use case being considered? The examples given in the longer document all link an event_id to species name and some measure of quantity for that species (to the species, not an individual specimen), so I assume that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen <annethessen@gmail.com> wrote:
Hi Rob,
I would like to respond to your item number 2. From my perspective, I deal with lots of published descriptions of taxa. The text might say something like "I saw species A in the Chesapeake Bay, the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The biomass range obviously corresponds to at least three different occurrences, but how to divide the biomass data? I would love to be able to have an *event* to attach it all to. There are almost two different levels of events - a sampling event and a "study event". The "study event" would correspond to the type of event I would like to use in the above example. It may not be ideal, but for the old literature that might be the best we can do.
I have to admit that I don't know enough about trawl data to understand why an event core would be a problem. It seems that the trawl would be an event and each biomass measure (of each fish) would be attached to a separate occurrence which is attached to that event. Am I understanding this wrong?
btw - I found a workaround for the example I gave, so it's not impossible to model with the current structure....
Anne
On 8/20/2014 1:16 PM, Robert Guralnick wrote:
Éamonn et al. --- Thanks for the clarifications. I think these help a ton but it raises a couple more questions for me.
- I am surprised that you plan to use the MeasurementOrFact extension in relation to the Event core, which seems like a novel (or perhaps awkward or unintended?) mechanism for capturing environmental data, but the same extension was not seen as relevant for describing samples? Can you explain more about the thinking there?
- There may be a subtle issue here extending "Event" to be more what you call a "Sampling Event Core". My read of this is that Darwin Core serves as a way to deal with point occurrences and Event reflects the context of a single capture event (whether a single observation, or a bulk sample capture). The changes recommended seem to dramatically extend and change that meaning? It's simply a question that I don't have an answer to, but is Darwin Core the right vehicle to start capturing repeated measures of biomass values from trawls? I don't have an answer but man, terms like quantityType (as a property of occurrence?) give me pause.
- Is Sampling Unit a controlled vocabulary? For another project, I have looked through, and captured scope, effort and completeness measures from, a large number of published biotic area inventories. The vast majority of these are measured in units like bucket hours, or trap nights. Is a "bucket" part of SamplingGeometry or Sampling Unit? I'd be happy to send along all the many examples of how biotic inventories of an area are completed and perhaps it might be good to see how those might be represented using the terms you are proposing?
Best, Rob
On Wed, Aug 20, 2014 at 10:16 AM, Richard Pyle wrote:
Same here - Events are central to the work that we do.
Aloha,
Rich
*From:* tdwg-content-bounces@lists.tdwg.org [mailto: tdwg-content-bounces@lists.tdwg.org] *On Behalf Of *Anne Thessen *Sent:* Wednesday, August 20, 2014 2:59 AM *To:* tdwg-content@lists.tdwg.org
*Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hello,
I would just like to comment on *event core*. I've been doing a lot of work translating published data into Darwin Core. During that process I've wished several times that I could use Event as core. I am happy to hear about that proposed change. It will make it easier to model the data I am working with.
Anne
On 8/20/2014 7:04 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob,
Thank you for the feedback. I have tried to address the two main issues you raise below. At the outset, I would like to emphasise that much of this work is taking place in the context of the EU BON project which includes a task on developing/enhancing tools and standards for data sharing with a particular focus on the IPT for publishing sample-based data. So, we were constrained by the need to publish sample-based data sets in the Darwin Core Archive format and to demonstrate practical application using a working prototype. When the discussion on the TDWG list faded out, we took it to our EU BON partners whose requirements were essential input to further development. We recognise that these discussions took place away from TDWG (although the TDWG/EU BON contributors overlapped) and this is the reason we are presenting the outcomes here for further consideration.
**Event core**
As the SIGS report indicated, sample data can be modelled in Darwin Core Archives using either Occurrence or Event as core. This was the starting point for our evaluation but as things progressed the data wrangling pushed the model back towards the Event core. We actually went through the exercise of mapping multiple test datasets in an iterative process spanning several months' work. In the end, we found that using an Event core better matched the typical sample data we were dealing with, allowing a measurement-or-fact extension to be included for the efficient expression of environmental information associated with the event. The choice comes down to an Occurrence core or an Event core + Occurrence extension. In both cases, the true observation records are Occurrences. The big difference is what type the core has and therefore to which kind of records you can attach further facts and extra information with DwC-A extensions. Many sampling datasets have very rich information about the site and event, so it is very natural to hang facts from an Event core. When picking the Occurrence core those facts would have to be repeated for each and every occurrence record. Moreover, our approach doesn’t stop anyone from using the Occurrence core if they so wish. This just provides a different option for datasets that better fit an Event core model.
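To make the repetition argument concrete, here is a small sketch that flattens one event and its occurrences into Occurrence-core rows and counts how many event-level values end up being stored. The records are invented and `flatten_to_occurrence_core` is a hypothetical helper, not part of the IPT or any DwC-A library.

```python
def flatten_to_occurrence_core(event, occurrences):
    """Copy every event-level field onto each occurrence row."""
    return [{**event, **occ} for occ in occurrences]


# Made-up event-level facts and occurrences, for illustration only.
event = {
    "eventID": "E1",
    "eventDate": "2014-08-20",
    "samplingProtocol": "pitfall trap",
    "locality": "hypothetical site",
}
occurrences = [
    {"occurrenceID": f"O{i}", "scientificName": f"Species {i}"}
    for i in range(1, 4)
]

flat = flatten_to_occurrence_core(event, occurrences)

# Event core: the event-level values are stored once.
cells_event_core = len(event)
# Occurrence core: the same values are repeated on every occurrence row.
cells_occurrence_core = len(event) * len(flat)

print(cells_event_core, cells_occurrence_core)
```

With an Event core the occurrence rows would still carry the eventID as a link, so the saving is in the remaining event-level fields, and it grows with the number of occurrences per event.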
I want to stress that we are not building a “specific IPT version” to support an Event core but, rather, we adapted the IPT so that it can be configured to support any generic “core + extension” format to enable its use for exploration of more data formats. This is part of the core codebase and there were no custom forks of the IPT for this work. Our view at GBIF is that if there are significant numbers of data publishers who are keen to adopt, promote and use a (any) format, and the tools can be configured to do so, then we should support it, and, if necessary, use a custom namespace.
**New terms around abundance**
Yes, the discussion on TDWG did fade out but it was clear that the term “abundance” as recommended by the SIGS report (along with abundanceAsPercent) was confusing many when we were looking for term(s) that reported quantitative measures of organisms in a sample. It also became clear we would need to be able to state the type of quantity being measured. An alternative suggestion for using the MeasurementOrFact class was immediately shot down.
As some of our main use cases were coming from the EU BON project, discussion shifted to that forum and consensus formed about the currently proposed terms. It was within this group that the additional terms (samplingGeometry, samplingUnit, eventSeriesID) were proposed and where we began testing with sample data sets.
Best regards,
Éamonn
*From:* robgur@gmail.com [mailto:robgur@gmail.com] *On Behalf Of *Robert Guralnick *Sent:* 19 August 2014 16:56 *To:* Éamonn Ó Tuama [GBIF] *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hi Éamonn --- I am curious about the outcomes presented in the SIGS paper, in particular, this portion of the paper:
"Solutions without introducing an event core in Darwin Core Archives: During the review of the solutions for the uses cases, it became apparent that either model could be applied to every use case. The core and extensions bore a complementary relationship and between them could express all the required information. The core simply provided the central anchor in the star schema from which to join the additional information. Therefore, using the Occurrence core, well established in the GBIF network through uptake of the IPT, seemed more appropriate than inventing CollectingEvent as an additional core type."
That SIGS paper has John Wieczorek and you both as authors, along with many luminaries across the biodiversity standards spectrum. Given the above, it's curious to see the EventCore come back again, along with a specific IPT version to support it.
So I see two issues, conflated, in this post you just made. One is the need for an EventCore at all, and the nature of relating Event and Occurrence/Material Sample. The second is the introduction of new terms, which seemingly have arrived after debate on similar terms - but framed around abundance - stalled a year ago. To my mind, these both require some further discussion, because I don't (necessarily) see TDWG community coherence around either issue?
Best, Rob
On Tue, Aug 19, 2014 at 6:11 AM, Éamonn Ó Tuama [GBIF] wrote:
Dear All,
GBIF is committed to exploring ways in which the IPT and Darwin Core Archive format can be extended for publishing sample-based data sets. In association with the EU BON project [1], a customised version of the IPT [2] has been deployed to test this using a special type of Darwin Core Archive in which the core is an “Event” with associated taxon occurrences in an “Occurrence” extension.
The Darwin Core vocabulary already provides a rich set of terms, many of them relevant for describing sample-based data. Synthesising several sources of input (GBIF organised workshop on sample data, May 2013 [3]; discussions on the TDWG mailing list in late 2013; internal discussion among EU BON project partners), five new terms relating to sample data were identified as essential. The complete model including these new terms is fully described with examples in the online document “Publishing sample data using the GBIF IPT” [4].
As a first step towards ratification, we would like to register the new terms in the DwC Google Code tracker [5] if there are no major objections on this list. The five terms are:
*quantity*: the number or enumeration value of the quantityType (e.g., individuals, biomass, biovolume, BraunBlanquetScale) per samplingUnit or a percentage measure recorded for the sample.
*quantityType*: the entity being referred to by quantity, e.g., individuals, biomass, %species, scale type.
*samplingGeometry*: an indication of what kind of space was sampled; select from point, line, area or volume.
*samplingUnit*: the unit of measurement used for reporting the quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3. It is combined with quantity and quantityType to provide the complete measurement, e.g., 9 individuals per day, 4 biomass-gm per metre^2.
*eventSeriesID*: an identifier for a set of events that are associated in some way, e.g., a monitoring series; may be a global unique identifier or an identifier specific to the series.
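As a rough illustration of how quantity, quantityType and samplingUnit combine into the complete measurement, the sketch below reproduces the two examples given in the samplingUnit definition. The function name `describe_quantity` and the "per" phrasing are assumptions made for this example only.

```python
def describe_quantity(quantity, quantity_type, sampling_unit):
    """Combine the three proposed terms into one readable measurement."""
    return f"{quantity:g} {quantity_type} per {sampling_unit}"


# The two examples from the samplingUnit definition.
per_day = describe_quantity(9, "individuals", "day")
per_area = describe_quantity(4, "biomass-gm", "metre^2")

print(per_day)
print(per_area)
```

None of the three values is interpretable on its own, which is why the terms are proposed as a set.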
Best regards,
Éamonn
[1] http://eubon.eu
[3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
[4] http://links.gbif.org/sample_data_model
[5] https://code.google.com/p/darwincore/issues/list
Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama@gbif.org),
Senior Programme Officer for Interoperability,
Global Biodiversity Information Facility Secretariat,
Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK
Phone: +45 3532 1494; Fax: +45 3532 1480
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
--
Anne E. Thessen, Ph.D.
The Data Detektiv, Owner and Founder
Ronin Institute, Research Scholar
443.225.9185
Bob, I think the words "Darwin Core" have escaped the orbit of the TDWG standard and all its careful definitions. I've been in conferences where I was the only one with knowledge of TDWG but the term Darwin Core was being thrown around as if it were the single standard for all biodiversity data. I guess we've done a great job of branding Darwin Core albeit with some loss of clarity. So I agree with your observation Bob.
Chuck
On Aug 28, 2014, at 4:00 PM, "Bob Morris" <morris.bob@gmail.com> wrote:
A problem with this meme, leading e.g. to "Audubon Core is an 'extension' of Darwin Core" is that some of the words in play have both formal and informal meanings, and too often, very necessary quotation marks and capitalization are omitted by hasty composition. This leads to conversations with sentences like
<A> is an extension of Darwin Core
Consider the <A> Darwin Core Extension
Blah, blah, blah Darwin Core Extension
Among the cases of this are "Darwin Core is an extension of Darwin Core", "Consider the Darwin Core Darwin Core Extension", etc.
The problem is not just with the silly cases. I think none of the architects of Audubon Core would agree that "Audubon Core is an extension of Darwin Core" just because about half (but not all!) of the Darwin Core terms are part of Audubon Core. (A posting of Steve Baskauf around 3:46 EDT makes this same(?) point.)
When a Knowledge Representation term has both a technically defined use and an informal use, discussions are often at cross-purposes, with some participants meaning one thing, but some readers understanding the other.
The meme should not be blamed on Chuck's post. It's widespread in TDWG postings, no doubt including a few of mine here and there. :-) In the current thread, I've tried to read the postings as they fly by to see if I'm supposed to read them carefully. I've never succeeded in that, so I retreat to a position that as a system architect I have to deal with whatever gets produced anyway. This is a little counterproductive, because occasionally I have something to contribute that could at least warn about risks known to some specialists but not to all the parties to the conversation.
This post should have subject: The Sigh extension to Darwin Core
Sigh.
Bob
On Thu, Aug 28, 2014 at 2:16 PM, Chuck Miller <Chuck.Miller@mobot.org> wrote:
Rob, GBIF is hard at work solving a problem for them. It seems natural for them to do what works best for them, particularly dealing with the volume/scale they do. And, Markus, Tim, Eamonn and others have been deeply involved in Darwin Core standards implementation, inventing the DwCA and the IPT in the first place. But, in using DwCA to get actual work done for GBIF they have added a term here and there via their own extensions. See http://tools.gbif.org/dwca-validator/extensions.do. It’s a long list of Registered and Stable extensions. What about Audubon and Plinian Core? Aren’t they actually “extensions” to Darwin Core, particularly in the “universal biodiversity data” sense Darwin Core seems to be evolving to? Maybe we need a “SampleCore” or “EventCore” that organizes these sampling/event-related terms?
GBIF is creating a new “Event core” for DwCA to do what they need to do. But, I don’t understand the problem with this. Occurrence core, Taxon core, Event core – so what? The DwCA model allows anything to be the core (ignoring IPT) and it does not actually restrict the terms used to DwC only, certainly it already includes Dublin Core and EML. In fact, lots of DwCA exchanges are being used outside GBIF now with extended terms from Audubon Core, Plinian Core, and others. How is this bad? It shows how flexible the DwCA data exchange structure is. And, it is an exchange, not a data storage format. What’s the problem with an Event core?
As an aside, I have said many times that the most confusing thing to our community about DwCA is that it’s called Darwin Core Archive when the structure is in fact not limited to only Darwin Core terms. Any validatable XML term can be used – like http://rs.gbif.org/terms/1.0/verbatimLabel which is used in the How-To Guide Whales example. And I commonly hear people at conferences and meetings now referring to their plans to adopt “Darwin Core” and upon questioning discover they are talking about the DwCA format, not really knowing the details of it. So, the words “Darwin Core” have become ambiguous.
And I want to note that Darwin Core Archive is not a TDWG standard. It is a guideline for implementing the Darwin Core standard in text files. http://rs.tdwg.org/dwc/terms/guides/text/. This is another point of confusion. GBIF describes DwCA as “an internationally recognized biodiversity informatics data standard” http://www.gbif.org/resources/2551, but it’s not a TDWG standard, just a guideline. I think DwCA deserves to be made a TDWG data exchange standard on its own, separate from Darwin Core, and perhaps expanded in scope to embrace the other “Cores” and relabeled to “Biodiversity Data Archive” or something.
Chuck
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Robert Guralnick Sent: Thursday, August 28, 2014 10:58 AM To: John Deck Cc: TDWG Content Mailing List; Ramona Walls
Subject: Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6
Hi all --- Ok, I think the scope of the issue is quite clear. Let me summarize: 1) As Éamonn and the rest of GBIF have made quite clear, "GBIF is faced with the immediate task of making sample-based data discoverable and accessible using its current ecosystem of tools" given a funding mandate from EU-BON. 2) The solution for this problem is to develop an Event-core and to promote new terms to the Darwin Core to make this happen. I will note a small inconsistency here: the current ecosystem of standards and tools is Darwin Core (as it stands) and publishing systems such as IPT. That ecosystem of tools includes mechanisms to extend Darwin Core where needed, via extensions. The current ecosystem of tools doesn't include new Cores or new DwC terms, does it?
So this leads in nicely to the contentious issue(s) and places where there seems to be discussion --- these have to do with the nature of the changes suggested and the scope of those changes, both in terms of an Event core and DwC term additions. Leaving aside the Event-core for now, the key questions about term additions to the Darwin Core that seem to be at heart here are: 1) Is the intent of the Darwin Core to model surveys, which usually involve multiple kinds and types of sampling over multiple sites using multiple methods? 2) Is the solution to invent new terms for the Darwin Core? If there are already terms from other efforts, wouldn't we work with those existing efforts to assure interoperability?
I appreciate the efforts of GBIF here fully, and am personally torn because on the one hand, I fully agree with the goal of extending Darwin Core to better represent richer biodiversity data. On the other hand, I worry about process here and how to make that happen in a way that isn't too hasty or locks us into just the opposite of what I think many of us want with regards to sharing data more broadly than within just one ecosystem of tools.
Best, Rob
On Thu, Aug 28, 2014 at 6:30 AM, John Deck <jdeck@berkeley.edu> wrote:
I see the rationale for enabling this in Darwin Core Archives and adding the new terms. However, back to what Matt Jones brought up: "won't we just end up with a new syntax that does essentially what O&M and OBOE do now?".
We should include explicit references to existing terms/definitions that encapsulate what we're talking about, e.g. in our MaterialSample proposal last year we linked to an existing term in OBI, which has a much richer description and context for MaterialSample than what we considered (https://code.google.com/p/darwincore/issues/detail?id=167). Have we explored the possibility of doing this with OBOE? I'm not suggesting we adopt OBOE wholesale, but it seems like we have a good opportunity to enable better semantic linking with that effort.
John
On Thu, Aug 28, 2014 at 4:23 AM, Éamonn Ó Tuama [GBIF] <eotuama@gbif.org> wrote:
Thanks, Ramona and Rob.
I'd like to add a few points following on Markus's reply.
I think your pressing of the need for a robust semantic model for biodiversity sample/survey data is incontestable – we do need one and it should enable rich data integration once it is defined and the tools and data standards to support it become available. However, GBIF is faced with the immediate task of making sample-based data discoverable and accessible using its current ecosystem of tools (IPT) and exchange standards (DwC; EML). Waiting for a functional, implementable semantic model and the tools and support services for it is just not an option for us right now.
We have already spent considerable time analysing the merits of Occurrence core vs Event core and have opted for an Event core for reasons previously given. I don’t believe we are trying to reconfigure Event (“an action that occurs at a place and during a period of time”) and regardless of whether we use Occurrence or Event, the need for some additional terms arises (e.g., quantity, quantityType, samplingGeometry, samplingUnit). Once the BCO model is available for uptake, it should be possible to develop a mapping between it and the simple DwC sample model.
So GBIF’s stance is that we need to take a two-pronged approach by exploring how the IPT and DwC-A can be adapted for publishing sample-based data in the near term while supporting the work of TDWG and groups such as the BCO in advancing biodiversity informatics. GBIF has already engaged in the work of the BCO and will continue to do so.
Éamonn
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Markus Döring Sent: 28 August 2014 12:44 To: Ramona Walls Cc: TDWG Content Mailing List Subject: Re: [tdwg-content] tdwg-content Digest, Vol 63, Issue 6
Hi Ramona & Rob,
The Event proposal does not try to change the semantics of an Event, it just uses the existing Darwin Core Event "class" at the core in Darwin Core archives. The actual change proposed is simply adding 3 new terms to the Event "group" to better share information about sampling methods & efforts, extending the existing limited capabilities of Darwin Core which already has the terms dwc:samplingProtocol and dwc:samplingEffort. It also proposes 2 new terms for dealing with quantity of Occurrences, something that has been discussed since 2012 now, when I had proposed a new abundance term [2].
In general, application of Darwin Core is not at all limited to specimens and observations. It is used for sharing taxonomic datasets already and its definition and goal are broad. Let me cite some of the introduction to Darwin Core [1]:
What is the Darwin Core? The Darwin Core is body of standards. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. The Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information.
Motivation: The Darwin Core standard was originally conceived to facilitate the discovery, retrieval, and integration of information about modern biological specimens, their spatiotemporal occurrence, and their supporting evidence housed in collections (physical or digital). The Darwin Core today is broader in scope and more versatile. It is meant to provide a stable standard reference for sharing information on biological diversity. As a glossary of terms, the Darwin Core is meant to provide stable semantic definitions with the goal of being maximally reusable in a variety of contexts.
Markus
[1] http://rs.tdwg.org/dwc/index.htm [2] https://code.google.com/p/darwincore/issues/detail?id=142
-- Markus Döring Software Developer Global Biodiversity Information Facility (GBIF) mdoering@gbif.org http://www.gbif.org
On 27 Aug 2014, at 18:57, Ramona Walls <rlwalls2008@gmail.com> wrote:
I think it is important to consider the purpose of both Darwin Core and DwC archives in deciding whether or not to expand them, but we should use that consideration to address the question at hand, which is whether or not to add an Event core and additional properties to describe events.
Describing the exchange format before the semantics is the wrong way to go, given that we now have a framework for developing semantics. Expanding Darwin Core before we adequately model survey data is bound to lead to problems later, when we try to retro-fit the semantics to Darwin Core Event archives. This is exactly the problem we are running into now with Occurrence archives, and we have the opportunity to avoid it.
I suggest we first use existing ontologies to model survey data, then deal with if and how to exchange that information in DwC-A. This is what I was hinting at in my first email, but should have said more explicitly.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 9:14 AM, Robert Guralnick <Robert.Guralnick@colorado.edu> wrote:
It may be a sensible view for Darwin Core Archives and their intended use, but Tim's email suggests we should be putting the method of delivery ahead of the standard that delivers that content. If this was just about DwC-As, why not develop a survey extension that links each occurrence to information about the survey process using the existing star-schema methods we have in place? Why are we discussing adding terms to the Darwin Core or trying to fully reconfigure what we call an Event? That is what is on the table, not DwC-As and how we use them. Or am I missing something?
Best, Rob
On Wed, Aug 27, 2014 at 10:01 AM, Ramona Walls <rlwalls2008@gmail.com> wrote:
Thanks, Tim, and yes, DwC-A as a view (but not necessarily the primary archive) of data seems like the right point of view.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
On Wed, Aug 27, 2014 at 1:58 AM, Tim Robertson <trobertson@gbif.org> wrote:
Hi Ramona,
Those are good points, and I’d like to come back to the original thinking behind the DwC-A.
It was designed and intended to be a simple way of exposing a complete view of a dataset, primarily for building sophisticated indexes, inventories and allowing basic analytics (e.g. GBIF.org being one sophisticated index). We found that the star schema provided the flexibility to do a lot, and with the bundled metadata (e.g. EML) was enough to trace provenance and allow users to determine if the dataset might be fit for various uses. In many cases this represents the complete (i.e. lossless) view of a dataset.
What we are discussing here are far richer datasets, where shoe-horning content into the star schema becomes lossy for some, although we’re finding other cases where it is indeed lossless. I believe we should be looking to harmonise ontologies / models etc. as you mention, but in parallel we should define one or more star schema views that can still be used for discovery / reporting / basic analytical purposes, and not long-term archival of the dataset. The dataset would then have the canonical rich form and an additional DwC-A view. What I write here is applicable to all content types of course.
Please also note that many people put supplementary files in the DwC-A which are ignored by DwC-A readers but could be a way of keeping the richer view in the bundle. If one wished, one could describe those supplementary files in the EML document.
Does this gel with the view of others as well?
Cheers, Tim
On 27 Aug 2014, at 02:55, Ramona Walls <rlwalls2008@gmail.com> wrote:
I think Matt hit the nail on the head. Although Darwin Core can be used to exchange survey data, it lacks the semantics and structure necessary to archive the data without loss of information. I think the biodiversity community would be better served devoting energy to harmonizing existing technologies such as OGC, OBOE, and BCO, not to mention the many databases for storing plot or survey data. The goal should be to preserve the data in the most informative manner possible.
There is a strong case for wanting to search across all evidence for occurrences, including surveys and point occurrences, so I can see possible demand for a tool that would extract occurrences from survey data to a DwC archive. However, I am very concerned that making a DwC archive the primary exchange format for survey or plot data commits us to a path of losing information from the start, for all but the simplest sampling schemas.
Ramona
Ramona L. Walls, Ph.D. Scientific Analyst, The iPlant Collaborative, University of Arizona Research Associate, Bio5 Institute, University of Arizona Laboratory Research Associate, New York Botanical Garden
Message: 1 Date: Thu, 21 Aug 2014 18:52:06 -0800 From: Matt Jones <jones@nceas.ucsb.edu> Subject: Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data To: Éamonn Ó Tuama [GBIF] <eotuama@gbif.org> Cc: TDWG Content Mailing List <tdwg-content@lists.tdwg.org> Message-ID: <CAFSW8xkx7uRP9PC2g3=JT_VJanqujH8nPXoz8GXwh+JwKw5Ccw@mail.gmail.com> Content-Type: text/plain; charset="utf-8"
This proposal is treading on ground that is quite similar to other observations and measurements standards for data exchange that are already mature, in particular:
- OGC Observations and Measurements (http://www.opengeospatial.org/standards/om)
- Extensible Observation Ontology (OBOE; https://semtools.ecoinformatics.org/oboe)
The former is a standard and broadly deployed, whereas the latter is part of a research program in the use of ontologies for measurements. Through collaboration between the two projects, they've been modified to be reasonably isomorphic, but O&M uses an XML serialization while OBOE uses an OWL-DL serialization. They largely express the same measurements and sampling model once one gets beyond the terminology differences.
So, I'm wondering if it makes much sense to extend Darwin Core, which is at heart an Occurrence exchange syntax, into this measurements area that is well represented by these other existing specifications? I'm curious to hear why people would even want to do this. And if we do go down this path, won't we just end up with a new syntax that does essentially what O&M and OBOE do now?
Matt
On Thu, Aug 21, 2014 at 12:22 AM, Éamonn Ó Tuama [GBIF] <eotuama@gbif.org> wrote:
Hi Rob, Anne, Rich,
I think Markus has answered your question as to why we opted for an Event core which is being used in the sense described by Anne and Rich. For any event, you can have a list of species in an Occurrence extension and for each species, you can include quantity and quantityType, e.g., biomass, etc. The proposed term eventSeriesID was intended for linking together related events, although it now looks like parentEventID might be a better, more flexible term. The measurementOrFact extension is a good fit for capturing environmental information relating to an event. See, e.g., the Gialova Lagoon brackish water invertebrate test data set [1] where a set of 18 environmental variables, including temp, pH, Rdx, particulate organic matter, dissolved oxygen, salinity, chlorophyll-a were measured for each sampling station-sampling period combination. An example mapping is:
Id | measurementType | measurementValue | measurementUnit | measurementRemarks
IA | Tmp (sed) | 21.5 | degree C | Tmp (sed): temperature at the bottom surface
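A minimal sketch of the mapping above, in Python, may make the reshaping concrete: one event's environmental readings become one measurementOrFact row per variable. The station ID `IA` and the sediment temperature come from the example row; the helper name `to_measurement_or_fact` and the input layout are assumptions made for illustration and are not part of the IPT or any Darwin Core tooling.

```python
# Hypothetical helper: reshape one event's environmental readings into
# measurementOrFact rows (one row per measured variable).

def to_measurement_or_fact(event_id, readings):
    """readings maps a measurement type to a (value, unit, remarks) tuple."""
    rows = []
    for measurement_type, (value, unit, remarks) in readings.items():
        rows.append({
            "id": event_id,  # links the fact back to the Event core record
            "measurementType": measurement_type,
            "measurementValue": value,
            "measurementUnit": unit,
            "measurementRemarks": remarks,
        })
    return rows


rows = to_measurement_or_fact(
    "IA",
    {
        "Tmp (sed)": (
            21.5,
            "degree C",
            "Tmp (sed): temperature at the bottom surface",
        ),
    },
)
```

Each further variable measured at the station (pH, salinity, and so on) would simply add another entry to `readings` and therefore another row.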
**Controlled vocabularies**
Ideally, the values for samplingUnit and quantityType would be selected from controlled vocabularies. This is, effectively, what we do by presenting a small list of values in a drop-down menu. The current values are what we derived for example data sets and discussion but they can undoubtedly be extended and improved.
We capture “bucket” type measures through a combination of samplingEffort, samplingGeometry and samplingUnit. For example, a pitfall trap (in a point location) left out for 16 days might have samplingEffort: 16, samplingGeometry: point and samplingUnit: day. Three m^2 quadrats in a shore survey might have samplingEffort: 3, samplingGeometry: area and samplingUnit: m^2.
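The two examples can be written down as records, as in the following Python sketch. The class name `SamplingDescription` and its validation are assumptions for illustration only; the allowed geometry values (point, line, area, volume) are the ones listed in the samplingGeometry term proposal in this thread.

```python
from dataclasses import dataclass

# Geometry values as listed in the samplingGeometry proposal.
SAMPLING_GEOMETRIES = {"point", "line", "area", "volume"}


@dataclass(frozen=True)
class SamplingDescription:
    """Hypothetical record combining the three sampling terms."""
    sampling_effort: float
    sampling_geometry: str
    sampling_unit: str

    def __post_init__(self):
        # Reject geometries outside the proposed list.
        if self.sampling_geometry not in SAMPLING_GEOMETRIES:
            raise ValueError(
                f"unknown samplingGeometry: {self.sampling_geometry!r}"
            )

    def summary(self):
        return (
            f"{self.sampling_effort:g} {self.sampling_unit} "
            f"({self.sampling_geometry})"
        )


# Pitfall trap at a point location, left out for 16 days.
pitfall = SamplingDescription(16, "point", "day")
# Three m^2 quadrats in a shore survey.
quadrats = SamplingDescription(3, "area", "m^2")
```

Under this sketch a "bucket" would have to be expressed through the unit rather than the geometry, since it is not one of the four geometry values.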
It would be very useful to see your compilation of scope, effort and completeness measures to see if we can express them in our model and/or if we need to reconsider our approach.
Éamonn
[1] http://eubon-ipt.gbif.org/resource.do?r=ionian-brackish-lagoon
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] *On Behalf Of* Markus Döring *Sent:* 20 August 2014 23:47 *To:* Robert Guralnick *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Rob,
this proposal is for monitoring surveys really, not to be confused with material samples like environmental or tissue samples, which have a distinct new dwc class MaterialSample.
We tend to overload the term sampling a lot and it helps to treat material samples differently from pure observational "sampling". That is why the existing Event class was used as the core and classic Occurrence records as extensions. A classic example is a vegetation survey where each plot represents an Event record and each recorded species in that plot will be an Occurrence extension record with a given quantity. Darwin Core already offers individualCount to specify quantity, but it is a very specific way of measuring "abundance" restricted to only some use cases. Abiotic measurements about the plot (e.g. soil type, pH, temperature) can be published using the measurements or facts extension linked to the Event core.
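As a rough sketch of the vegetation-plot layout described above, the Python below keeps one Event core row and joins Occurrence and measurement rows to it by `eventID`. All identifiers, taxa and values are invented for illustration, and the `assemble` helper is hypothetical rather than part of any DwC-A reader.

```python
# Hypothetical rows; IDs, taxa and values are invented for illustration.
events = [{"eventID": "plot-1", "samplingProtocol": "vegetation plot"}]

occurrences = [
    {"eventID": "plot-1", "scientificName": "Quercus robur",
     "quantity": 4, "quantityType": "individuals"},
    {"eventID": "plot-1", "scientificName": "Fagus sylvatica",
     "quantity": 2, "quantityType": "individuals"},
    {"eventID": "plot-2", "scientificName": "Betula pendula",
     "quantity": 7, "quantityType": "individuals"},
]

measurements = [
    {"eventID": "plot-1", "measurementType": "pH", "measurementValue": 6.5},
]


def rows_for(event_id, extension_rows):
    """Select the extension rows that point at one core record."""
    return [row for row in extension_rows if row["eventID"] == event_id]


def assemble(event, occurrence_rows, measurement_rows):
    """Attach extension rows to their Event core record (star schema join)."""
    return {
        **event,
        "occurrences": rows_for(event["eventID"], occurrence_rows),
        "measurements": rows_for(event["eventID"], measurement_rows),
    }


plot = assemble(events[0], occurrences, measurements)
```

Here the plot-level pH is stored once, on the event, while each species keeps its own quantity in the Occurrence extension.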
Markus
On 20 Aug 2014, at 20:08, Robert Guralnick <Robert.Guralnick@colorado.edu> wrote:
Anne -- I don't know the answers! These are questions for Eamonn. I would presume that a sample could be a jumble of species or even just water or soil samples, and biomass would refer to that sample - but maybe that isn't a use case being considered? The examples given in the longer document all link an event_id to species name and some measure of quantity for that species (to the species, not an individual specimen), so I assume that is the prevailing (or only) case?
Best, Rob
On Wed, Aug 20, 2014 at 11:56 AM, Anne Thessen <annethessen@gmail.com> wrote:
Hi Rob
I would like to respond to your item number 2.
From my perspective, I deal with lots of published descriptions of taxa. The text might say something like "I saw species A in the Chesapeake Bay, the Adriatic Sea and the Indian Ocean and the biomass is 5 - 9 grams". The biomass range obviously corresponds to at least three different occurrences, but how to divide the biomass data? I would love to be able to have an *event* to attach it all to. There are almost two different levels of events - a sampling event and a "study event". The "study event" would correspond to the type of event I would like to use in the above example. It may not be ideal, but for the old literature that might be the best we can do.
I have to admit that I don't know enough about trawl data to understand why an event core would be a problem. It seems that the trawl would be an event and each biomass measure (of each fish) would be attached to a separate occurrence which is attached to that event. Am I understanding this wrong?
btw - I found a workaround for the example I gave, so it's not impossible to model with the current structure....
Anne
On 8/20/2014 1:16 PM, Robert Guralnick wrote:
Éamonn et al. --- Thanks for the clarifications. I think these help a ton but they raise a couple more questions for me.
- I am surprised that you plan to use the MeasurementOrFact extension in relation to the Event core, which seems like a novel (or perhaps awkward or unintended?) mechanism for capturing environmental data, but the same extension was not seen as relevant for describing samples? Can you explain more about the thinking there?
- There may be a subtle issue here extending "Event" to be more what you call a "Sampling Event Core". My read of this is that Darwin Core serves as a way to deal with point occurrences and Event reflects the context of a single capture event (whether a single observation, or a bulk sample capture). The changes recommended seem to dramatically extend and change that meaning? It's simply a question that I don't have an answer to, but is Darwin Core the right vehicle to start capturing repeated measures of biomass values from trawls? I don't have an answer but man, terms like quantityType (as a property of occurrence?) give me pause.
- Is Sampling Unit a controlled vocabulary? For another project, I have looked through - and captured scope, effort and completeness measures from - a large number of published biotic area inventories. The vast majority of these are measured in units like bucket hours, or trap nights. Is a "bucket" part of SamplingGeometry or Sampling Unit? I'd be happy to send along all the many examples of how biotic inventories of an area are completed and perhaps it might be good to see how those might be represented using the terms you are proposing?
Best, Rob
On Wed, Aug 20, 2014 at 10:16 AM, Richard Pyle <deepreef@bishopmuseum.org> wrote:
Same here – Events are central to the work that we do.
Aloha,
Rich
*From:* tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] *On Behalf Of* Anne Thessen *Sent:* Wednesday, August 20, 2014 2:59 AM *To:* tdwg-content@lists.tdwg.org *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hello
I would just like to comment on *event core*. I've been doing a lot of work translating published data into Darwin Core. During that process I've wished several times that I could use Event as core. I am happy to hear about that proposed change. It will make it easier to model the data I am working with.
Anne
On 8/20/2014 7:04 AM, Éamonn Ó Tuama [GBIF] wrote:
Hi Rob,
Thank you for the feedback. I have tried to address the two main issues you raise below. At the outset, I would like to emphasise that much of this work is taking place in the context of the EU BON project which includes a task on developing/enhancing tools and standards for data sharing with a particular focus on the IPT for publishing sample-based data. So, we were constrained by the need to publish sample-based data sets in the Darwin Core Archive format and to demonstrate practical application using a working prototype. When the discussion on the TDWG list faded out, we took it to our EU BON partners whose requirements were essential input to further development. We recognise that these discussions took place away from TDWG (although the TDWG/EU BON contributors overlapped) and this is the reason we are presenting the outcomes here for further consideration.
**Event core**
As the SIGS report indicated, sample data can be modelled in Darwin Core Archives using either Occurrence or Event as core. This was the starting point for our evaluation but as things progressed the data wrangling pushed the model back towards the Event core. We actually went through the exercise of mapping multiple test datasets in an iterative process spanning several months' work. In the end, we found that using an Event core better matched the typical sample data we were dealing with, allowing use of a measurement-or-fact extension to be included for the efficient expression of environmental information associated with the event. The choice comes down to an Occurrence core or an Event core + Occurrence extension. In both cases, the true observation records are Occurrences. The big difference is what type the core has and therefore to which kind of records you can attach further facts and extra information with DwC-A extensions. Many sampling datasets have very rich information about the site and event, so it is very natural to hang facts from an Event core. When picking the Occurrence core those facts would have to be repeated for each and every occurrence record. Moreover, our approach doesn’t stop anyone from using the Occurrence core if they so wish. This just provides a different option for datasets that better fit an Event core model.
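The repetition cost can be put in numbers with a small Python sketch. The function `fact_rows_needed` is hypothetical; the 18 variables are the figure cited for the Gialova Lagoon data set elsewhere in this thread, and the 25 occurrences per event are an assumed value chosen only for illustration.

```python
def fact_rows_needed(n_occurrences, n_event_facts, core):
    """Count fact rows needed to describe one sampling event.

    With an Event core each event-level fact is stored once; with an
    Occurrence core it must be repeated for every occurrence record.
    """
    if core == "event":
        return n_event_facts
    if core == "occurrence":
        return n_occurrences * n_event_facts
    raise ValueError(f"unknown core type: {core!r}")


# 18 environmental variables, and an assumed 25 occurrences in the event.
with_event_core = fact_rows_needed(25, 18, "event")
with_occurrence_core = fact_rows_needed(25, 18, "occurrence")
```

Under these assumptions the Event core needs 18 fact rows per event while the Occurrence core needs 450, and the gap grows with the number of taxa recorded.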
I want to stress that we are not building a “specific IPT version” to support an Event core but, rather, we adapted the IPT so that it can be configured to support any generic “core + extension” format to enable its use for exploration of more data formats. This is part of the core codebase and there were no custom forks of the IPT for this work. Our view at GBIF is that if there are significant numbers of data publishers who are keen to adopt, promote and use a (any) format, and the tools can be configured to do so, then we should support it, and, if necessary, use a custom namespace.
**New terms around abundance**
Yes, the discussion on TDWG did fade out but it was clear that the term “abundance” as recommended by the SIGS report (along with abundanceAsPercent) was confusing many when we were looking for term(s) that reported quantitative measures of organisms in a sample. It also became clear we would need to be able to state the type of quantity being measured. An alternative suggestion for using the MeasurementsOrFact class was immediately shot down.
As some of our main use cases were coming from the EU BON project, discussion shifted to that forum and consensus formed about the currently proposed terms. It was within this group that the additional terms (samplingGeometry, samplingUnit, eventSeriesID) were proposed and where we began testing with sample data sets.
Best regards,
Éamonn
*From:* robgur@gmail.com [mailto:robgur@gmail.com] *On Behalf Of* Robert Guralnick *Sent:* 19 August 2014 16:56 *To:* Éamonn Ó Tuama [GBIF] *Cc:* TDWG Content Mailing List *Subject:* Re: [tdwg-content] Darwin Core: proposed news terms for expressing sample data
Hi Éamonn --- I am curious about the outcomes presented in the SIGS paper, in particular, this portion of the paper:
"Solutions without introducing an event core in Darwin Core Archives: During the review of the solutions for the uses cases, it became apparent that either model could be applied to every use case. The core and extensions bore a complementary relationship and between them could express all the required information. The core simply provided the central anchor in the star schema from which to join the additional information. Therefore, using the Occurrence core, well established in the GBIF network through uptake of the IPT, seemed more appropriate than inventing CollectingEvent as an additional core type."
That SIGS paper has John Wieczorek and you both as authors, along with many luminaries across the biodiversity standards spectrum. Given the above, it's curious to see the EventCore come back again, along with a specific IPT version to support it.
So I see two issues, conflated, in this post you just made. One is the need for an EventCore at all, and the nature of relating Event and Occurrence/Material Sample. The second is the introduction of new terms, which seemingly have arrived after debate on similar terms - but framed around abundance - stalled a year ago. To my mind, these both require some further discussion, because I don't (necessarily) see TDWG community coherence around either issue?
Best, Rob
On Tue, Aug 19, 2014 at 6:11 AM, ?amonn ? Tuama [GBIF]
<eotuama@gbif.orgmailto:eotuama@gbif.org>
wrote:
Dear All,
GBIF is committed to exploring ways in which the IPT and Darwin Core Archive format can be extended for publishing sample-based data sets.
In
association with the EU BON project [1], a customised version of the
IPT
[2] has been deployed to test this using a special type of Darwin Core Archive in which the core is an ?Event? with associated taxon
occurrences
in an ?Occurrence? extension.
The Darwin Core vocabulary already provides a rich set of terms with
many
relevant for describing sample-based data. Synthesising several sources
of
input (GBIF organised workshop on sample data, May 2013 [3],
discussions on
the TDWG mailing list in late 2013; internal discussion among EU BON project partners), five new terms relating to sample data were
identified
as essential. The complete model including these new terms are fully described with examples in the online document ?Publishing sample data using the GBIF IPT? [4].
As a first step towards ratification, we would like to register the new terms in the DwC Google Code tracker [5] if there are no major
objections
on this list. The five terms are:
*quantity*: the number or enumeration value of the quantityType (e.g., individuals, biomass, biovolume, BraunBlanquetScale) per samplingUnit or a percentage measure recorded for the sample.
*quantityType*: the entity being referred to by quantity, e.g., individuals, biomass, %species, scale type.
*samplingGeometry*: an indication of what kind of space was sampled; select from point, line, area or volume.
*samplingUnit*: the unit of measurement used for reporting the quantity in the sample, e.g., minute, hour, day, metre, metre^2, metre^3. It is combined with quantity and quantityType to provide the complete measurement, e.g., 9 individuals per day, 4 biomass-gm per metre^2.
*eventSeriesID*: an identifier for a set of events that are associated in some way, e.g., a monitoring series; may be a global unique identifier or an identifier specific to the series.
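To make the combination rule concrete, here is a minimal sketch of how quantity, quantityType and samplingUnit might be joined into a complete measurement. The SampleRecord class, its field names and its measurement method are hypothetical and exist only for this illustration; the geometry check simply mirrors the four values listed for samplingGeometry.

```python
from dataclasses import dataclass
from typing import Optional

# The four values the proposal lists for samplingGeometry.
ALLOWED_GEOMETRIES = {"point", "line", "area", "volume"}


@dataclass
class SampleRecord:
    """Hypothetical container for the five proposed terms."""
    quantity: float
    quantity_type: str
    sampling_unit: Optional[str] = None
    sampling_geometry: Optional[str] = None
    event_series_id: Optional[str] = None

    def __post_init__(self):
        # Reject geometries outside the proposed controlled list.
        if self.sampling_geometry is not None and self.sampling_geometry not in ALLOWED_GEOMETRIES:
            raise ValueError(f"unknown samplingGeometry: {self.sampling_geometry}")

    def measurement(self) -> str:
        """Combine quantity, quantityType and samplingUnit into one phrase."""
        value = f"{self.quantity:g} {self.quantity_type}"
        if self.sampling_unit:
            return f"{value} per {self.sampling_unit}"
        return value


print(SampleRecord(9, "individuals", "day").measurement())
print(SampleRecord(4, "biomass-gm", "metre^2", sampling_geometry="area").measurement())
```

The two calls reproduce the examples given for samplingUnit: "9 individuals per day" and "4 biomass-gm per metre^2".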
Best regards,
Éamonn
[1] http://eubon.eu
[3] http://www.standardsingenomics.org/index.php/sigen/article/view/sigs.4898640
[4] http://links.gbif.org/sample_data_model
[5] https://code.google.com/p/darwincore/issues/list
*Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama@gbif.org), *
*Senior Programme Officer for Interoperability, *
*Global Biodiversity Information Facility Secretariat, *
*Universitetsparken 15, DK-2100, Copenhagen Ø, DENMARK*
*Phone: +45 3532 1494; Fax: +45 3532 1480*
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
--
Anne E. Thessen, Ph.D.
The Data Detektiv, Owner and Founder
Ronin Institute, Research Scholar
443.225.9185
participants (9)
- Bob Morris
- Chuck Miller
- John Deck
- Markus Döring
- Ramona Walls
- Robert Guralnick
- Steve Baskauf
- Tim Robertson
- Éamonn Ó Tuama [GBIF]