[tdwg-tapir] DwC extensions

Thu May 22 22:15:13 CEST 2008

I got sidetracked on this some days ago, but in light of the recent star
schema discussions on the original caching thread, I feel the time is right
again to start this new discussion.

Tradition has DwC discussions on this Tapir mailing list. I'm starting this
new thread based on Markus' recent posting (below) about an Identification
extension to DwC. I'm motivated to pull together the time and energy to
finally push the pending DwC through the standards process, with a goal of
having that whole process finished by the TDWG Meeting this year.

I've been thinking about how to conduct the Request for Comment required to
move the standard forward. I propose to put together a survey with Survey
Monkey or something similar to actually test for reasonable consensus. Any
comments or suggestions about this idea are welcome.

However, I see benefits to having further discussion about some key issues
before doing that, as I believe we now have enough accumulated experience to
make some good decisions that will affect the design and guidelines for
further development of the Darwin core and extensions.

In the past, most Darwin Core discussions have revolved around whether to
include a particular concept, and where. I think it will be much more useful
to concentrate on a few key issues at a higher level, resolve them at that
level, then make any necessary changes to the schemas based on the agreed
guiding principles. If the principles are clear and simple and we can reach
a consensus on them easily, this work should be quick to complete. Here are
some seed questions and recommendations to facilitate the resolution of this
next step in the process.

1) Is species occurrence in nature and in collections the right scope for
the Core?
2) Should the general philosophy of the Core be inclusive or minimalist?
What are the characteristics of a concept that allow it to be in the Core?
What are the characteristics of a concept that allow it to be added to an
existing extension?
3) What are the defining characteristics of a group of related concepts that
justify the creation of a new extension? Should extensions be based on
abstract conceptual groupings/objects (events,
identifications/determinations, places)? Or on special interests (paleo,
curation, interaction)? Or on the stability of the concepts (core contains
the proven stable concepts, extensions are more volatile)?
4) Should there be elements in the Core and extensions to hold GUIDs linking
them to instances of related classes of objects, such as an occurrence to a
TaxonConceptGUID, or an occurrence to a CoreGatheringGUID? Should every
extension have a non-mandatory GUID allowing for the external resolution of
the object?
5) What should the Darwin Tapir application schema look like?
6) Is it the right approach to have restrictions on content at the concept
definition level? Where should the line be drawn? Arguments have been raised
in the past about the DwC and extensions' content with respect to
being restrictive versus open to incorrect content. For example, DayOfYear
in the current DwC 1.4 (http://rs.tdwg.org/dwc/tdwg_dw_core.xsd) is typed as
a dwc:dayOfYearDataType, which is defined in
http://rs.tdwg.org/dwc/tdwg_basetypes.xsd as:

<xs:simpleType name="dayOfYearDataType">
  <xs:restriction base="xs:integer">
    <xs:minInclusive value="1" />
    <xs:maxInclusive value="366" />
  </xs:restriction>
</xs:simpleType>
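
To make concrete what this restriction accepts and rejects, here is a
minimal Python sketch of the same check. The function name is hypothetical
and this is only an approximation of what a schema validator would do for
this one type, not a replacement for XSD validation.

```python
import re

# Lexical form of xs:integer: an optional sign followed by one or more digits.
_XS_INTEGER = re.compile(r"[+-]?[0-9]+")


def is_valid_day_of_year(text: str) -> bool:
    """Return True if text would satisfy dwc:dayOfYearDataType (1..366)."""
    lexical = text.strip()  # surrounding whitespace is collapsed for xs:integer
    if not _XS_INTEGER.fullmatch(lexical):
        return False
    return 1 <= int(lexical) <= 366
```

Under this type a value such as "0", "367" or "12.5" makes the whole record
invalid at the schema level, which is the behaviour question 6 asks about.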

John's two-cent opinions:
1) Yes
2) Tough question. Inclusive, but with a well-defined path from testing to
inclusion. I see no merit in minimalism for its own sake. Candidate concepts
should be in scope (the discovery or retrieval of occurrence information),
have a demonstrated audience, and be stable following testing. New and
untested concepts can go into test extensions and application schemas that
import the core, other extensions and the test extension.
3) Alignment with the Core Ontology (
http://wiki.tdwg.org/twiki/bin/view/TAG/CoreOntology) is a good guiding
principle for the design of extensions. To me this suggests, for example,
that an Identification Extension is appropriate as a model for the
CoreIdentification. The tough bit is to decide which objects constitute
extensions. All of them? Higher-level ones? Ones that are likely to have
services built around them? For example, should there be an extension for a
CorePlace (Geospatial extension) or a CoreGathering (Geospatial extension
with event information)? Another tough bit is to decide if objects that can
have a one-to-many relationship with the Core should have Status concepts.
For example, given an Identification Extension to the Core, should that
extension have an IdentificationStatus concept in which to label a uniquely
"accepted" identification? Sorry, more questions than answers here.
The same recommendations for inclusion of concepts in the core (above) apply
to extensions - that they should be tested and stable.
4) Tough question. We don't seem to be completely prepared from the
implementation perspective to apply GUIDs to occurrences, let alone other
objects whose nature may change over time. Is there a convincing argument
one way or another? In the absence of an argument in favor, I guess the
default response is "No new GUID concepts".
5) The existing Darwin Record Application Schema for Tapir (
http://rs.tdwg.org/dwc/tdwg_dw_record_tapir.xsd) is a good model. It is
working well in practice so far. The concepts will have to change to
accommodate any changes in the Core or extensions, but the structure and
method of composition of the application schema seem sound.
6) No, it isn't the right approach to overly restrict content at the concept
definition level, for the simple reason that doing so removes the need for,
and the value of, applications or services built on top of the distributed
networks (or caches built from them) that help collections validate or do
error detection on their data. Losing those services would mean losing a
great incentive to participate. Besides, application schemas can be built
from the existing concept definitions and may further restrict them for
specialized purposes.
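
As an illustration of point 6, here is a rough Python sketch of the kind of
error detection a service on top of the network could offer when the schema
itself stays permissive. The function name and the record keys (DayOfYear,
EarliestDateCollected) are assumptions made for the example; the point is
only that a cross-field check like this cannot be expressed as a
restriction on a single concept.

```python
import calendar
from datetime import date


def check_record(record: dict) -> list:
    """Return human-readable warnings for one record instead of rejecting it."""
    raw_day = record.get("DayOfYear")
    raw_date = record.get("EarliestDateCollected")
    if raw_day is None or raw_date is None:
        return []  # nothing to cross-check
    try:
        day = int(raw_day)
    except (TypeError, ValueError):
        return [f"DayOfYear {raw_day!r} is not an integer"]
    try:
        collected = date.fromisoformat(raw_date)
    except (TypeError, ValueError):
        return [f"EarliestDateCollected {raw_date!r} is not an ISO date"]

    warnings = []
    days_in_year = 366 if calendar.isleap(collected.year) else 365
    if not 1 <= day <= days_in_year:
        warnings.append(
            f"DayOfYear {day} is outside 1..{days_in_year} for {collected.year}"
        )
    elif collected.timetuple().tm_yday != day:
        warnings.append(f"DayOfYear {day} does not match {raw_date}")
    return warnings
```

A record with DayOfYear 366 and a collecting date in 2007 passes the schema
restriction above but is flagged here, and the provider gets a report back
rather than a rejected record.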

Anything short of a flamethrower in response is welcome.

---------- Forwarded message ----------
From: Markus Döring <mdoering at gbif.org>
Date: Fri, May 16, 2008 at 1:29 AM
Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?
To: Renato De Giovanni <renato at cria.org.br>
Cc: tdwg-tapir at lists.tdwg.org

Renato,

<snip>

I have created an identification extension for Darwin Core that
holds the historical list of identification events and their outcomes.
This is a YAML section of the metafile describing the columns for this
extension through fully qualified concepts à la TAPIR:

identification:
  - http://rs.tdwg.org/dwc/dwcore/ScientificName
  - http://rs.tdwg.org/dwc/dwcore/AuthorYearOfScientificName
  - http://rs.tdwg.org/dwc/dwcore/Family
  - http://rs.tdwg.org/dwc/dwcore/IdentificationQualifier
  - http://rs.tdwg.org/dwc/curatorial/DateIdentified
  - http://rs.tdwg.org/dwc/curatorial/IdentifiedBy

When creating this I realised that pretty much all the concepts I was
interested in already existed in Darwin Core or the curatorial
extension. Wouldn't it be wise to reuse those concepts? Or are they
strictly tied to the idea of a current identification and therefore
can't be used for historical ones? This is probably more of a Darwin
Core question than a TAPIR one, but we are all on this list anyway ...

The XML in that case would look something like this:

<record uri="http://mygarden.com/specimen/plants/54321-423-43-54-6-3-24-44">
  <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
  ...
  <ident:record>
    <dwc:ScientificName>Aster alpinus</dwc:ScientificName>
    <dwc:AuthorYearOfScientificName>L.</dwc:AuthorYearOfScientificName>
    <dwc:Family>Asteraceae</dwc:Family>
    <cur:DateIdentified>1913-03-12</cur:DateIdentified>
    <cur:IdentifiedBy>Karl Marx</cur:IdentifiedBy>
  </ident:record>
  <ident:record>
    <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
    <dwc:AuthorYearOfScientificName>Novopokr.</dwc:AuthorYearOfScientificName>
    <dwc:Family>Asteraceae</dwc:Family>
    <cur:DateIdentified>2003-09-07</cur:DateIdentified>
    <cur:IdentifiedBy>Keith Richards</cur:IdentifiedBy>
  </ident:record>
</record>

Markus

</snip>
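
For anyone who wants to experiment with the identification records in the
forwarded message, here is a small Python sketch that reads them and picks
the most recent identification by DateIdentified. The dwc and cur namespace
URIs are inferred from the concept URIs in the YAML list, the ident
namespace is purely hypothetical, and the sample is a trimmed copy of the
example rather than real output from any provider.

```python
import xml.etree.ElementTree as ET

DWC = "http://rs.tdwg.org/dwc/dwcore/"      # assumed from the concept URIs
CUR = "http://rs.tdwg.org/dwc/curatorial/"  # assumed from the concept URIs
IDENT = "http://example.org/ident/"         # hypothetical extension namespace

SAMPLE = f"""
<record xmlns:dwc="{DWC}" xmlns:cur="{CUR}" xmlns:ident="{IDENT}">
  <ident:record>
    <dwc:ScientificName>Aster alpinus</dwc:ScientificName>
    <cur:DateIdentified>1913-03-12</cur:DateIdentified>
    <cur:IdentifiedBy>Karl Marx</cur:IdentifiedBy>
  </ident:record>
  <ident:record>
    <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
    <cur:DateIdentified>2003-09-07</cur:DateIdentified>
    <cur:IdentifiedBy>Keith Richards</cur:IdentifiedBy>
  </ident:record>
</record>
""".strip()


def identifications(xml_text):
    """Return one dict per ident:record child of the root record element."""
    root = ET.fromstring(xml_text)
    rows = []
    for rec in root.findall(f"{{{IDENT}}}record"):
        rows.append({
            "ScientificName": rec.findtext(f"{{{DWC}}}ScientificName"),
            "DateIdentified": rec.findtext(f"{{{CUR}}}DateIdentified"),
            "IdentifiedBy": rec.findtext(f"{{{CUR}}}IdentifiedBy"),
        })
    return rows


def latest_identification(xml_text):
    """Pick the dated identification with the greatest ISO date, or None."""
    dated = [r for r in identifications(xml_text) if r["DateIdentified"]]
    # ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
    return max(dated, key=lambda r: r["DateIdentified"], default=None)
```

Choosing the latest date is only one possible way to derive a "current"
identification; whether an explicit IdentificationStatus concept is needed
instead is exactly the open question raised under point 3.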