Hi Cam,
What I then want to ask is, 1. do the terms for clearly defining the bounds of the Occurrence already exist? There exist terms for spatial uncertainty: dwc:coordinateUncertaintyInMeters, and coarse ones for temporal bounds: startDayOfYear + endDayOfYear, but not for temporal uncertainty, or spatial bounds (but see Pete's http://lod.taxonconcept.org/ontology/dwc_area.owl).
The question of temporal uncertainty is an excellent one, and after years of struggling, I have no "elegant" solution ("elegant" in this case meaning that it captures as much as possible of what we want to capture, with a minimal set of attributes in a simple structure). The problem is that, given a range represented by startDate and endDate (how I handle it in my database), the interpretation may be any of the following:
- Event occurred at a single (imprecisely known) point between startDate and endDate
- Event occurred at multiple points between startDate and endDate
- Event occurred continuously, beginning at startDate and ending at endDate
To overcome this, I added two additional fields:
verbatimDate: Used for historical datasets to record the verbatim date information (useful for things like "Summer 1984")
dateRemarks: Some sort of text description of what is meant by the startDate and endDate values
Of course, neither of these date qualifiers is of much use in the semantic context. So what we may need is some sort of controlled vocabulary for "dateRangeQualifier" or something, that indicates how to interpret a date range given only startDate and endDate.
Another solution would be something analogous to my approach for handling the second part of your question: spatial bounds.
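To make the idea concrete, here is a rough Python sketch of what a "dateRangeQualifier" vocabulary could look like next to the four date fields. The enum values, class name, and method are purely hypothetical (they are not DwC terms); the three values simply mirror the three interpretations listed above.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class DateRangeQualifier(Enum):
    """Hypothetical controlled vocabulary: how to read startDate..endDate."""
    SINGLE_POINT = "singlePoint"        # one imprecisely known moment in the range
    MULTIPLE_POINTS = "multiplePoints"  # several moments within the range
    CONTINUOUS = "continuous"           # ongoing from startDate through endDate


@dataclass
class EventDateRange:
    """Illustrative record only; field names follow the ones used in this message."""
    start_date: date
    end_date: date
    qualifier: DateRangeQualifier
    verbatim_date: Optional[str] = None  # e.g. "Summer 1984"
    date_remarks: Optional[str] = None   # free-text explanation for humans

    def __post_init__(self) -> None:
        if self.end_date < self.start_date:
            raise ValueError("endDate must not precede startDate")

    def span_days(self) -> int:
        """Length of the range in days, counting both endpoints."""
        return (self.end_date - self.start_date).days + 1


# Example: the start and end dates chosen for "Summer 1984" are arbitrary.
summer = EventDateRange(
    start_date=date(1984, 6, 21),
    end_date=date(1984, 9, 22),
    qualifier=DateRangeQualifier.SINGLE_POINT,
    verbatim_date="Summer 1984",
)
print(summer.qualifier.value, summer.span_days())
```

With a qualifier like this, a consumer can tell whether the span is a measure of uncertainty (singlePoint) or of duration (continuous), which the two dates alone cannot express.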
In my universe of datasets, I have lat/lon coordinates that fall into one of several types:
- single point with uncertainty
- two points representing a transect line (commonly used for plankton tows)
- two points representing two corners of a bounding box
- series of multiple points representing a non-straight line (e.g., a river, road, or a non-straight survey path)
- series of multiple points representing a polygon
In addition to the fact that there are 1...n points to represent the bounding of a place, there may be multiple re-interpretations of a point or set of points, when retroactively georeferencing locality data.
So...the model I came up with looks something like this (using my same ASCII-art notation as before)
Location--<CoordinateSet--<Coordinate | CoordinateSetType
A Coordinate minimally consists of a decimalLatitude, decimalLongitude, and Sequence.
CoordinateSetType is a controlled vocabulary that defines the five types listed above (point, transect, boundingBox, line, polygon).
Each CoordinateSet consists of 1, 2, or >2 Coordinates, depending on the CoordinateSetType.
Attached to "CoordinateSet" is all the MaNIS-style metadata for who/when/how the coordinate was derived.
The reason for the 1:M Location:CoordinateSet is to allow for multiple interpretations of retroactively established coordinates (e.g., following the MaNIS protocol).
Whether things like Datum and Uncertainty are attached to CoordinateSet or Coordinate depends on how much flexibility you need for capturing heterogeneous Datum or Uncertainty values within a particular set (e.g., if certain nodes on a polygon are more precise than other nodes). I would definitely put Datum on CoordinateSet, and probably put uncertainty there as well (which assumes that the same datum applies to each point in a set, and also that uncertainty is consistent for each point in a set).
This pretty much covers all spatial bounding protocols, and it's not too difficult to derive a "point" coordinate from any of the other four, for purposes of "dumbing the data down" to fit into DwC. There are some problems, but they are not important enough to go into now.
But the point is, you could conceivably handle dates using a similar structure -- but the cases where parsing the precise date information is important are so few that we probably don't need a semantic structure for it, and can simply capture it in a human-readable dateRemarks.
Now....obviously this is all in data modelling space, not necessarily DwC-space. But I think it is useful to discuss how databases capture this kind of information at the source when trying to figure out how best to simplify it for a content exchange protocol.
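For anyone who finds code easier to read than my ASCII art, the Location / CoordinateSet / Coordinate structure could be sketched in Python roughly as follows. This is only an illustration under my own assumptions: the class and attribute names are invented for the example, and the representative_point method is a deliberately naive stand-in (a plain average of the vertices) for whatever derivation a real system would use. It ignores the antimeridian and is not a true centroid.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class CoordinateSetType(Enum):
    POINT = "point"
    TRANSECT = "transect"
    BOUNDING_BOX = "boundingBox"
    LINE = "line"
    POLYGON = "polygon"


@dataclass
class Coordinate:
    decimal_latitude: float
    decimal_longitude: float
    sequence: int  # position of this vertex within its set


@dataclass
class CoordinateSet:
    set_type: CoordinateSetType
    coordinates: List[Coordinate]
    # Datum and uncertainty live on the set, i.e. they are assumed to be
    # the same for every point in it.
    geodetic_datum: Optional[str] = None
    coordinate_uncertainty_in_meters: Optional[float] = None
    georeferenced_by: Optional[str] = None  # stand-in for the who/when/how metadata

    def __post_init__(self) -> None:
        n = len(self.coordinates)
        two_point = (CoordinateSetType.TRANSECT, CoordinateSetType.BOUNDING_BOX)
        if self.set_type is CoordinateSetType.POINT and n != 1:
            raise ValueError("a point needs exactly 1 coordinate")
        if self.set_type in two_point and n != 2:
            raise ValueError("a transect or bounding box needs exactly 2 coordinates")
        if self.set_type in (CoordinateSetType.LINE, CoordinateSetType.POLYGON) and n < 3:
            raise ValueError("a line or polygon needs more than 2 coordinates")
        self.coordinates = sorted(self.coordinates, key=lambda c: c.sequence)

    def representative_point(self) -> Tuple[float, float]:
        """Naive 'dumbed-down' point: the average of all vertices."""
        n = len(self.coordinates)
        lat = sum(c.decimal_latitude for c in self.coordinates) / n
        lon = sum(c.decimal_longitude for c in self.coordinates) / n
        return lat, lon


@dataclass
class Location:
    locality: str
    # 1:M, so that several interpretations of the same place can coexist
    coordinate_sets: List[CoordinateSet] = field(default_factory=list)


# Example with made-up values: a plankton tow recorded as a transect.
tow = CoordinateSet(
    set_type=CoordinateSetType.TRANSECT,
    coordinates=[Coordinate(12.0, 24.0, 2), Coordinate(10.0, 20.0, 1)],
    geodetic_datum="WGS84",
)
site = Location(locality="hypothetical tow site", coordinate_sets=[tow])
print(tow.representative_point())
```

Moving Datum or Uncertainty down to Coordinate would just mean moving those two attributes to the Coordinate class, at the cost of having to reconcile mixed values whenever a single point is derived.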
Also, 2. if there was a consensus for moving to the `explicit token' model, should the space-time bounds of the Occurrence still be contained in an associated (often blank) Event, or accepted as properties of the Occurrence itself (e.g., occurrenceDate, occurrenceDuration, occurrenceLocation, occurrenceRadius)? I would support the latter.
I would support the former. I'm not sure I understand why Event is "often blank". If there is any space-time information, then Event is not blank. In the context of the data I manage, it makes much more sense (in a DwC context) to capture Events and Locations as distinct classes, than representing multiple tokens for the same Occurrence. Even if we don't establish an individual class, dwc:individualID within the Occurrence class allows us to deal with both the "same-organism-at-multiple-events" situation, and the "multiple-tokens-for-same-organism-at-same-event" situation.
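As a small illustration of that last point, here is a Python sketch with entirely made-up records, showing how individualID alone lets you recover both situations from flat Occurrence records. The identifier values and the two helper functions are hypothetical.

```python
from collections import defaultdict

# Made-up Occurrence records; only the three identifiers matter here.
occurrences = [
    {"occurrenceID": "occ-1", "individualID": "ind-A", "eventID": "evt-1"},
    {"occurrenceID": "occ-2", "individualID": "ind-A", "eventID": "evt-2"},
    {"occurrenceID": "occ-3", "individualID": "ind-B", "eventID": "evt-3"},
    {"occurrenceID": "occ-4", "individualID": "ind-B", "eventID": "evt-3"},
]


def events_by_individual(records):
    """Same organism at multiple events: individualID -> set of eventIDs."""
    events = defaultdict(set)
    for rec in records:
        events[rec["individualID"]].add(rec["eventID"])
    return dict(events)


def tokens_by_individual_event(records):
    """Multiple tokens for one organism at one event:
    (individualID, eventID) -> list of occurrenceIDs."""
    tokens = defaultdict(list)
    for rec in records:
        tokens[(rec["individualID"], rec["eventID"])].append(rec["occurrenceID"])
    return dict(tokens)


print(events_by_individual(occurrences))
print(tokens_by_individual_event(occurrences))
```

Here ind-A is the same organism seen at two Events, while ind-B has two tokens from a single Event.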
Finally, 3. if there was a consensus for moving to the `explicit token' model, and a human observation was a token-less Occurrence, would we best specify who made the observation with dwc:recordedBy and what the observation was with dwc:occurrenceRemarks, or would it be better to create a second new token (along with `Physical specimens') that was an explicit Observation class, that would link explicitly to, say, an external observational ontology (i.e., OBOE)? The issue of GUIDs for non-physical observations comes up, but this could still be solved in various ways.
I would favor at the very least a "place-holder" or "implied" token for a human observation. It's functionally analogous to the situation where a photo was taken, but then accidentally destroyed or lost. The only difference between an image and a memory is that the image is generally more durable, and is more easily and precisely conveyed from person to person.
Stepping back from the details for a moment, and reading some of the replies to Steve's post that have come in, I am wondering how many readers are thinking, ``the need for a semantic web standard for biodiversity information might be better achieved by a deep fork of Darwin Core, adopting new Classes and explicit domains and ranges for each term, to create a `Darwin SW,' rather than by an effort to evolve Darwin Core itself.'' I'm sure the question of forking Darwin Core has come up before, and I'm sure the discussion was passionate!
To the extent that I understand both DwC and the semantic web, this seems to me to be the most parsimonious approach.
Aloha, Rich