[tdwg-content] Treatise on Occurrence, tokens, and basisOfRecord

Mon Oct 25 21:26:37 CEST 2010

Hi Cam,

> What I then want to ask is, 1. do the terms for 
> clearly defining the bounds of the Occurrence already exist?  
> There exist terms for spatial uncertainty: 
> dwc:coordinateUncertaintyInMeters, and coarse ones for 
> temporal bounds: 
> startDayOfYear + endDayOfYear, but not for temporal 
> uncertainty, or spatial bounds (but see Pete's 
> http://lod.taxonconcept.org/ontology/dwc_area.owl).  

The question of temporal uncertainty is an excellent one, and after years of
struggling, I have no "elegant" solution ("elegant" in this case being a
high degree of capturing what we want to capture, with a minimal set of
attributes in a simple structure).  The problem is that given a range
represented by startDate and endDate (how I handle it in my database), the
interpretation may be any of the following:

- Event occurred at a singular (imprecisely known) point between startDate
and endDate
- Event occurred at multiple points between startDate and endDate
- Event occurred continuously beginning startDate and ending endDate

To overcome this, I added two additional fields:

verbatimDate: Used for historical datasets to record the verbatim date
information (useful for things like "Summer 1984")
dateRemarks: Some sort of text description of what is meant by the startDate
and endDate values

Of course, neither of these date qualifiers is of much use in the semantic
context.  So what we may need is some sort of controlled vocabulary for
"dateRangeQualifier" or something, that indicates how to interpret a date
range given only startDate and endDate.
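
As a rough illustration of the controlled-vocabulary idea, here is a minimal
Python sketch.  The class names, the three vocabulary values, and the two
helper methods are all hypothetical (nothing here is an existing DwC term
except by coincidence), and the dates chosen for "Summer 1984" are an
arbitrary assumption.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class DateRangeQualifier(Enum):
    """Hypothetical vocabulary: how to read a start/end date pair."""
    SINGLE_POINT = "singlePoint"        # one imprecisely known moment in the range
    MULTIPLE_POINTS = "multiplePoints"  # several separate moments in the range
    CONTINUOUS = "continuous"           # ongoing from the start to the end


@dataclass
class DateRange:
    start_date: date
    end_date: date
    qualifier: DateRangeQualifier
    verbatim_date: Optional[str] = None  # e.g. "Summer 1984"
    date_remarks: Optional[str] = None   # free-text explanation

    def may_include(self, day: date) -> bool:
        """True if the event could have been happening on this day."""
        return self.start_date <= day <= self.end_date

    def must_include(self, day: date) -> bool:
        """True only if the event was certainly happening on this day."""
        return (self.qualifier is DateRangeQualifier.CONTINUOUS
                and self.may_include(day))


# Assumed bounds for "Summer 1984"; the real bounds depend on the dataset.
summer = DateRange(date(1984, 6, 1), date(1984, 8, 31),
                   DateRangeQualifier.SINGLE_POINT,
                   verbatim_date="Summer 1984")
```

With the qualifier in place, a consumer can tell "possibly on this day" apart
from "definitely on this day" without parsing the remarks field.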

Another solution would be something analogous to my approach for handling
the second part of your question: spatial bounds.

In my universe of datasets, I have lat/lon coordinates that fall into one of
several types:

- single point with uncertainty
- two points representing a transect line (commonly used for plankton tows)
- two points representing two corners of a bounding box
- series of multiple points representing a non-straight line (e.g., a river,
road, or a non-straight survey path)
- series of multiple points representing a polygon

In addition to the fact that there are 1...n points to represent the
bounding of a place, there may be multiple re-interpretations of a point or
set of points, when retroactively georeferencing locality data.

So...the model I came up with looks something like this (using my same
ASCII-art notation as before)

Location--<CoordinateSet--<Coordinate
                |
        CoordinateSetType

A Coordinate minimally consists of a decimalLatitude, decimalLongitude, and
Sequence.

CoordinateSetType is a controlled vocabulary that defines the five types
listed above (point, transect, boundingBox, line, polygon).

Each CoordinateSet consists of 1, 2, or >2 Coordinates, depending on the
CoordinateSetType.

Attached to "CoordinateSet" is all the MaNIS-style metadata for who/when/how
the coordinate was derived.

The reason for the 1:M Location:CoordinateSet is to allow for multiple
interpretations of retroactively established coordinates (e.g., following
the MaNIS protocol).

Whether or not things like Datum and Uncertainty are attached to
CoordinateSet or Coordinate depends on how much flexibility you need for
capturing heterogeneous Datum or Uncertainty values within a particular set
(e.g., if certain nodes on a polygon are more precise than other nodes).  I
would definitely put Datum on CoordinateSet, and probably also put
uncertainty there as well (which assumes that the same datum applies to each
point in a set, and also that uncertainty is consistent for each point in a
set).
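
The Location / CoordinateSet / Coordinate structure described above could be
sketched in Python roughly as follows.  This is only one possible reading of
the model: the attribute names, the "georeferenced_by" field standing in for
the MaNIS-style metadata, and the validation rule are assumptions, and Datum
and uncertainty are placed on the set rather than on each point.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class CoordinateSetType(Enum):
    POINT = "point"
    TRANSECT = "transect"
    BOUNDING_BOX = "boundingBox"
    LINE = "line"
    POLYGON = "polygon"


@dataclass
class Coordinate:
    decimal_latitude: float
    decimal_longitude: float
    sequence: int  # position of this point within its set


@dataclass
class CoordinateSet:
    set_type: CoordinateSetType
    coordinates: List[Coordinate]
    datum: Optional[str] = None                    # one datum for the whole set
    uncertainty_in_meters: Optional[float] = None  # one value for the whole set
    georeferenced_by: Optional[str] = None         # hypothetical who/when/how field

    def __post_init__(self) -> None:
        n = len(self.coordinates)
        if self.set_type is CoordinateSetType.POINT:
            ok = n == 1
        elif self.set_type in (CoordinateSetType.TRANSECT,
                               CoordinateSetType.BOUNDING_BOX):
            ok = n == 2
        else:  # line or polygon
            ok = n > 2
        if not ok:
            raise ValueError(f"{self.set_type.value} cannot have {n} coordinate(s)")
        # Keep the points in their declared order without touching the caller's list.
        self.coordinates = sorted(self.coordinates, key=lambda c: c.sequence)


@dataclass
class Location:
    locality: str
    # One-to-many: each set is a separate interpretation of the same place.
    coordinate_sets: List[CoordinateSet] = field(default_factory=list)
```

If per-node precision were needed, the datum and uncertainty attributes would
move down to Coordinate instead.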

This pretty much covers all spatial bounding protocols, and it's not too
difficult to derive a "point" coordinate from any of the other four, for
purposes of "dumbing the data down" to fit into DwC.  There are some
problems, but they are not important enough to go into now.
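
One naive way to do that "dumbing down" is sketched below in Python.  It is
an assumption, not the method used in any particular database: it takes the
arithmetic mean of the latitudes and longitudes as the point, and widens the
uncertainty to the great-circle distance from that point to the farthest
vertex.  The function names are hypothetical, and the approach ignores
details such as sets that cross the antimeridian and the true centroid of an
irregular polygon.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters


def haversine_m(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance in meters between two (lat, lon) pairs in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def to_point(coords: List[Tuple[float, float]],
             uncertainty_m: float = 0.0) -> Tuple[float, float, float]:
    """Collapse a set of (lat, lon) pairs to (lat, lon, uncertainty in meters)."""
    lat = sum(c[0] for c in coords) / len(coords)
    lon = sum(c[1] for c in coords) / len(coords)
    # Distance to the farthest vertex, plus the set's own stated uncertainty.
    spread = max(haversine_m((lat, lon), c) for c in coords)
    return lat, lon, spread + uncertainty_m


# A two-point transect collapses to its midpoint with a radius of half its length.
lat, lon, radius = to_point([(21.0, -158.0), (21.0, -157.0)])
```

The resulting triple maps onto decimalLatitude, decimalLongitude, and
dwc:coordinateUncertaintyInMeters.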

But the point is, you could conceivably handle dates using a similar
structure -- but the cases where parsing the precise date information is
important are so few that we probably don't need a semantic structure for
it, and can simply capture it in a human-readable dateRemarks.

Now....obviously this is all in data modelling space, not necessarily
DwC-space.  But I think it is useful to discuss how databases capture this
kind of information at the source when trying to figure out how best to
simplify it for a content exchange protocol.

> Also, 2. 
> if there was a consensus for moving to the `explicit token' 
> model, should the space-time bounds of the Occurrence still 
> be contained in an associated (often blank) Event, or 
> accepted as properties of the Occurrence itself (e.g., 
> occurrenceDate, occurrenceDuration, occurrenceLocation, 
> occurrenceRadius)?  I would support the latter.

I would support the former.  I'm not sure I understand why Event is "often
blank".  If there is any space-time information, then Event is not blank.
In the context of the data I manage, it makes much more sense (in a DwC
context) to capture Events and Locations as distinct classes, than
representing multiple tokens for the same Occurrence.  Even if we don't
establish an individual class, dwc:individualID within the Occurrence class
allows us to deal with both the "same-organism-at-multiple-events"
situation, and the "multiple-tokens-for-same-organism-at-same-event"
situation.
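
A small Python sketch of how a shared dwc:individualID could cover both
situations.  The records and identifier values are invented for
illustration, and the helper name is hypothetical; only the idea of grouping
Occurrence records by individualID and then by Event comes from the
paragraph above.

```python
from collections import defaultdict
from typing import Dict, List

# Invented example records: three Occurrences that refer to one organism.
records = [
    {"occurrenceID": "occ-1", "individualID": "ind-A", "eventID": "evt-1"},
    {"occurrenceID": "occ-2", "individualID": "ind-A", "eventID": "evt-1"},
    {"occurrenceID": "occ-3", "individualID": "ind-A", "eventID": "evt-2"},
]


def occurrences_by_individual(rows: List[Dict[str, str]]) -> Dict[str, Dict[str, List[str]]]:
    """Group occurrenceIDs by individualID, then by eventID."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["individualID"]][row["eventID"]].append(row["occurrenceID"])
    # Convert to plain dicts so the result compares and prints cleanly.
    return {ind: dict(events) for ind, events in grouped.items()}


grouped = occurrences_by_individual(records)
```

Here "ind-A" appears under two Events (same organism at multiple events),
and "evt-1" holds two Occurrences (multiple tokens at the same event).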

> Finally, 3. if there was a consensus for moving to the 
> `explicit token' 
> model, and a human observation was a token-less Occurrence, 
> would we best specify who made the observation with 
> dwc:recordedBy and what the observation was with 
> dwc:occurrenceRemarks, or would it be better to create a 
> second new token (along with `Physical specimens') that was 
> an explicit Observation class, that would link explicitly to, 
> say, an external observational ontology (i.e., OBOE)?  The 
> issue of GUIDs for non-physical observations comes up, but 
> this could still be solved in various ways.

I would favor at the very least a "place-holder" or "implied" token for a
human observation.  It's functionally analogous to the situation where a
photo was taken, but then accidentally destroyed or lost.  The only
difference between an image and a memory is that the image is generally more
durable, and is more easily and precisely conveyed from person to person.

> Stepping back from the details for a moment, and reading some 
> of the replies to Steve's post that have come in, I am 
> wondering how many readers are thinking, ``the need for a 
> semantic web standard for biodiversity information might be 
> better achieved by a deep fork of Darwin Core, adopting new 
> Classes and explicit domains and ranges for each term, to 
> create a `Darwin SW,' rather than by an effort to evolve 
> Darwin Core itself.''  I'm sure the question of forking 
> Darwin Core has come up before, and I'm sure the discussion 
> was passionate!

To the extent that I understand both DwC and the semantic web, this seems to
me to be the most parsimonious approach.

Aloha,
Rich