Re: [tdwg-content] practical details of recording a determination What is an Occurrence?

19 Oct 2010

Thanks, Steve.

The diagram looks about right, except for the arrow heads as you noted. If
there's a way you can replace the arrows with some sort of 1:Many line
notation, that would be better.  As you have it now, the arrowhead is on the
"one" side; but I think it's more intuitive to have a "crows-foot" sort of
symbol on the "many" side.  I can send an example of what I mean.  Not a big
deal.

Yeah, I originally had it as eventDate, but then switched to eventTime.  If
Date can include time (and Time is assumed not to include date), then using
eventDate is fine.
...
In principle, I agree with this diagram to the left of taxonNameUsage
completely.  
(I still need clarification about a few things on the right end.)
Yes, that's a failing on my part to get better documentation out there for
GNUB (from which TaxonNameUsage, and all of the other "Usage" terms in DWC
come).  I hope to correct this by the end of the year.
...
My main reason 
for using determination as a term rather than identification 
is because it is not ambiguous to refer to the person doing 
the identifying as the determiner, whereas referring to that 
person as the "identifier" creates confusion between that 
person and the identifying string for resources (as in 
"persistent identifier").
Ah!  Got it.  Makes sense.
...
So if we agree that determination, annotation, and identification 
all mean the same thing (namely an instance of the dwc:Identification 
class), I'm happy to just use the term "identification".  
For the person doing it, I guess dwc:identifiedBy would be the 
best term although it's a bit awkward in regular speech so I may
slip and still say "determiner".
Either way.  Now that you put it in that context, I'm also happy to go with
"Determination" and "Determiner".

But I would avoid "Annotation".  That word has a much more general meaning,
and we'll likely be hearing more and more about it (in the more general
sense) as several big-ish projects are working on Annotations (in general)
right now.
...
Although I agree in principle that there can be many occurrences 
at an Event and many events at a Location, I think there are two 
practical reasons why it may be better to assign separate 
eventDate and Location metadata to each Occurrence.
Hmmm...not sure I follow.  Are you saying that a new Event record (ID)
should be created for every Occurrence record, and that a new Location
record (ID) should be created for every Event record?  If so, then it's
going to be very difficult to convince me of this.  I don't think that our
database is unusual in having many (sometimes hundreds) of Occurrences at
the same Event (e.g., a large fish poison station), and many (again,
sometimes hundreds) of Events at the same Location.
...
The first is that it makes the database structure simpler. 
As Markus has already noted, we really would prefer for 
the database to be as "flat" as possible.
Which database?  Are we talking about DwCA?  If so, I understand the
rationale for flattening out content to make it easier to batch-package
records and ship them around.  But if we're talking about actual database
implementations at the content provider end, I think I'm not alone in
wanting to stick with a more normalized approach.  Besides, what's the point
of even defining the different DwC classes, each with their own ID, if we're
just going to flatten them all out anyway (as per old DwC)?
...
When I look at the terms listed in the DwC 
term page (http://rs.tdwg.org/dwc/terms/index.htm) 
under Event, the most important one that I see which 
everyone should be providing is eventDate.  The 
rest I would pretty much consider optional and as 
a shortcut Rich's diagram could be collapsed to 
make them direct properties of the Occurrence.
Yes, they could be collapsed to Occurrence -- in the same way that
properties of "Individual" are currently collapsed to Occurrence.  But after
pleading your case to normalize "Individual" as its own separate class, I'm
kinda surprised to see you arguing in favor of collapsing the Event class
into Occurrence.
...
The second reason involves the practical matter of defining a 
Location.  I will note that my thinking about this has been 
deeply influenced by a previous discussion on the topic from 
2008-2009 summarized at 
http://www.sernec.org/files/summary-of-discussion.pdf on p.78-84.  
I don't think most people will want to wade through all of that 
text, so I'll just sum it up here.  Somebody (I think it might 
have been Debbie Paul at Morphbank) suggested to me that we 
really have an intrinsically globally unique identifier for 
Location.  It's the combination of dwc:decimalLatitude and 
dwc:decimalLongitude along with dwc:coordinateUncertaintyInMeters 
to establish precision and dwc:geodeticDatum to establish 
the reference system.
Yes....sort of.  Doesn't help for localities defined as bounding boxes,
polygons or lines (e.g., transects, as we often have for data from plankton
tows) -- but it certainly does serve as a handy "natural key" of sorts for
point localities.  The problem is that so much of our existing content is not
reliably georeferenced yet.  Thus, we need all those other terms to
accommodate various location descriptors, which will eventually allow us to
after-the-fact georeference the localities.  Also, many after-the-fact
georeferenced points are interpretations.  Keeping the descriptors around
can allow someone else to come up with a better/more precise
lat/long/uncertainty interpretation.  Also, errors are abundant
(particularly in failing to represent decimal degrees with negatives).
Having the descriptors allows us to catch such errors much more quickly.
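
To make the "natural key" and sign-error points concrete, here is a minimal
Python sketch. The function names and the hemisphere "hints" are hypothetical
(they are not part of Darwin Core); only the four term names come from the
discussion above. It assumes point localities only, and that a textual
descriptor has already told us which hemisphere to expect.

```python
from typing import List, Optional, Tuple


def point_locality_key(decimal_latitude: float,
                       decimal_longitude: float,
                       coordinate_uncertainty_in_meters: float,
                       geodetic_datum: str = "WGS84") -> Tuple[float, float, float, str]:
    """Hypothetical natural key for a point locality (not boxes, polygons or lines)."""
    return (round(decimal_latitude, 6),
            round(decimal_longitude, 6),
            float(coordinate_uncertainty_in_meters),
            geodetic_datum.upper())


def sign_problems(decimal_latitude: float,
                  decimal_longitude: float,
                  expected_ns: Optional[str] = None,
                  expected_ew: Optional[str] = None) -> List[str]:
    """Compare coordinate signs with hemisphere hints taken from text descriptors."""
    problems = []
    if expected_ns == "S" and decimal_latitude > 0:
        problems.append("decimalLatitude is positive but descriptors say S")
    if expected_ns == "N" and decimal_latitude < 0:
        problems.append("decimalLatitude is negative but descriptors say N")
    if expected_ew == "W" and decimal_longitude > 0:
        problems.append("decimalLongitude is positive but descriptors say W")
    if expected_ew == "E" and decimal_longitude < 0:
        problems.append("decimalLongitude is negative but descriptors say E")
    return problems


# A Hawaiian record entered without the minus sign on longitude:
key = point_locality_key(21.3, 157.8, 100)
issues = sign_problems(21.3, 157.8, expected_ns="N", expected_ew="W")
```

Without the descriptor-derived hint there is nothing to check the sign
against, which is the practical argument for keeping the descriptors around.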
...
(If we like geo:lat and geo:long, then the reference system 
is implied and we are down to three terms to unambiguously 
define a Location and its uncertainty.  For the benefits of 
humans, a Locality description is probably also beneficial.  
Also, elevation and depth might be provided, although at 
least in theory elevation could be calculated with a 
sufficiently good digital elevation model).
Well, that depends on the extent of coordinateUncertaintyInMeters.  Original
data often have fairly precise elevations, but imprecise lat/long.  Thus,
one can often narrow down the likely location more precisely than a circle
described by a point/radius.  In other words,
Lat+Long+coordinateUncertaintyInMeters may describe a circle that includes a
range of elevations, and thus an elevation cannot be reliably calculated.
Moreover, a lot of modelling use-cases will want as precise of an elevation
as possible.
...
I will grant that we don't have this information for a lot 
of old records, but based on the massive efforts to 
geolocate specimens, I would say it's pretty clear that 
this is what we would like to have if we could get it.
Sure -- but we can't really ignore the massive numbers of non-georeferenced
data points that already exist. And even when they are georeferenced, we'll
still want to keep the original location descriptors.
...
I certainly hope that there aren't any serious collectors, 
observers, and live organism photographers who aren't 
by this point trying to record this information as they 
establish new Occurrence records.  If you look at all of 
the Location terms on the dwc list, most of the other 
terms are either concessions to the fact that we don't 
have what we want (e.g. the "verbatim" terms), things 
we could generate using a computer program if we were 
clever (like stateProvince, county, etc. - I know at 
least Mike Giddens has succeeded in doing this), ways of 
indicating how we got lat and long from old records 
(e.g. georeferenceSources), or methods to define larger 
scale Locations that aren't points (e.g. footprintWKT). 
I think it is safe to say that in the future (if not 
now already), many or most Events associated with 
Occurrences will have an associated button click (on 
a GPS receiver, camera phone, or GPS enabled camera) that 
will automatically generate dwc:eventDate, dwc:decimalLatitude, 
dwc:decimalLongitude (with geodeticDatum=WGS84) and maybe 
coordinateUncertaintyInMeters.  Thus designing a system 
that requires that these time/space snapshots be grouped 
together into artificial "Locations" is really 
counterproductive when those data are now generated and 
can be associated with Occurrences automatically.
OK, I see where you're coming from.  But I guess my response is that we're a
LONG way from:

- A world where most existing Occurrence content is well georeferenced;
- A world where reliable services allow me to easily/automatically query
records on, say, "Northwestern Hawaiian Islands", based on GIS polygon
querying;
- A world where content holders are not going to want to share all the
textual locality descriptors with their records.

Besides, we're going to always want to maintain the ability to define
locations as bounding boxes, polygons, and lines; not just point-radius.

Moreover, there are many cases where a single location is re-used for many
different events.  The same tree monitored continuously over years.  The
same field station revisited year after year. The same transect repeated
every month/season/year to monitor populations & presence/absence. LTER
data.  Christmas Count data.  In other words, it's common to have multiple
Events stacked at the same locality; and we won't want to always limit
ourselves to defining that locality using point/radius.

So I still see the advantage of keeping Location as a separate class,
maintaining those extra "human-friendly" descriptor terms, and
conceptualizing 1:M Location:Events.
...
I don't know if Greg Riccardi of Morphbank is following 
this thread or not.  If so he may want to comment on this 
issue based on practical experience at Morphbank.  When 
the Morphbank system was set up, it required the creation 
of a separate Location record which was assigned a unique 
Morphbank identifier.  Specimens were then linked to this 
Location.  What ended up happening was that each Specimen 
having GPS metadata ended up being assigned to its own 
separate Location even if it was 20 meters from another 
specimen.  In effect, each Occurrence record ended up 
having its own decimalLatitude/decimalLongitude record 
anyway.  So the system ended up being more complicated 
than necessary.
Yes -- that's definitely a trend of modern collecting data with GPS....a
tendency towards fewer and fewer instances of Events per instance of
Location -- to the point where many records are now 1:1.  However, that
doesn't change the fact that an enormous volume of content currently is, and
always will be, best structured as many Events per location, and many
Occurrences per Event. I think DwC needs to accommodate that content.  It's
easy to store 1:1 records in a structure designed to accommodate 1:M.  But
it's a lot messier to generate 1:M content in a structure designed for 1:1.
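
As a rough illustration of that last point, the following Python/sqlite3
sketch stores both kinds of content in one 1:M structure. The table names,
column names and sample rows are made up for the example; they are not a
proposed DwC schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (
    location_id INTEGER PRIMARY KEY,
    locality    TEXT
);
CREATE TABLE event (
    event_id    INTEGER PRIMARY KEY,
    location_id INTEGER NOT NULL REFERENCES location(location_id),
    event_date  TEXT
);
CREATE TABLE occurrence (
    occurrence_id  INTEGER PRIMARY KEY,
    event_id       INTEGER NOT NULL REFERENCES event(event_id),
    catalog_number TEXT
);
""")

# Location 1 is reused by two Events; Location 2 is a 1:1:1 GPS-style record.
conn.executemany("INSERT INTO location VALUES (?, ?)",
                 [(1, "fish poison station"), (2, "GPS point A")])
conn.executemany("INSERT INTO event VALUES (?, ?, ?)",
                 [(10, 1, "2009-05-01"), (11, 1, "2010-05-01"), (20, 2, "2010-06-15")])
conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?)",
                 [(100, 10, "X-1"), (101, 10, "X-2"), (102, 11, "X-3"), (200, 20, "Y-1")])

# Events and Occurrences per Location; the 1:1 case is just a row with counts of 1.
rows = conn.execute("""
SELECT l.locality, COUNT(DISTINCT e.event_id), COUNT(o.occurrence_id)
FROM location l
JOIN event e ON e.location_id = l.location_id
JOIN occurrence o ON o.event_id = e.event_id
GROUP BY l.location_id
ORDER BY l.location_id
""").fetchall()
```

The 1:1 record needs no special handling here; it simply happens to have one
Event and one Occurrence.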
...
As I said, I agree in principle with the left side of 
Rich's diagram.  Taking the practical considerations I 
just mentioned into account, I would simplify the diagram as
http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif
I think that's perfectly fine as a simplified structure for data exchange,
such as DwCA and other applications that aggregate content and/or provide
value-added or indexing services on aggregated content.  But for original
source data, I think it would be unwise to advocate such a simplified
structure to anything but small ad-hoc project-based systems.  It's
incredibly easy to flatten out the more normalized form into the more
flattened form; not so easy to go the other way around.
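
The asymmetry can be sketched in a few lines of Python. All identifiers and
sample values below are hypothetical. Flattening is a simple lookup by ID;
going back from flat rows means guessing which rows shared a Location, and
one inconsistent entry is enough to split a single Location in two.

```python
# Normalized form: each class has its own ID.
locations = {"L1": {"locality": "Field Station"}}
events = {
    "E1": {"locationID": "L1", "eventDate": "2009-07-01"},
    "E2": {"locationID": "L1", "eventDate": "2010-07-01"},
}
occurrences = [
    {"occurrenceID": "O1", "eventID": "E1"},
    {"occurrenceID": "O2", "eventID": "E1"},
    {"occurrenceID": "O3", "eventID": "E2"},
]


def flatten(occurrences, events, locations):
    """Normalized -> flat: follow the IDs and copy the properties down."""
    flat = []
    for occ in occurrences:
        event = events[occ["eventID"]]
        location = locations[event["locationID"]]
        flat.append({
            "occurrenceID": occ["occurrenceID"],
            "eventDate": event["eventDate"],
            "locality": location["locality"],
        })
    return flat


def guess_locations(flat_rows):
    """Flat -> normalized: with no IDs, distinct strings are the only clue."""
    return sorted({row["locality"] for row in flat_rows})


flat = flatten(occurrences, events, locations)

# Simulate the kind of inconsistent data entry a flat table invites.
flat[2]["locality"] = "Field Stn."
recovered = guess_locations(flat)  # two "locations" where there was one
```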

I also still find it interesting that you are quite content to flatten
Locality and Event class terms into Occurrence, while simultaneously wanting
to normalize Individual as a new class, separate from Occurrence.  I'm not
saying that establishing an Individual class is a bad idea -- in principle I
support it.  But I am curious as to why you think it important to push for
normalization on the Individual side of an Occurrence, but push for
de-normalization on the Event side of an Occurrence.
...
Superficially, it looks more complicated, but I've gotten 
rid of several "one to many" relationships and enthroned 
Occurrence at its accustomed place in the center of the 
universe (or at least the center of the left side of the 
diagram).  I don't have any philosophical objections to 
people structuring their data according to Rich's original
diagram and the existing Darwin Core terms certainly make
it possible to do so (well except for the Individual thing). 
However, I submit that many people will find it simpler
(and easier to use tools like Darwin Core Archives) if 
they use the flatter structure that I have in the revised 
diagram.
That may be true.  But I've spent more than two decades taking
over-simplified, flattened database structures and transforming them into
more normalized structures, because the flattened structures consistently
limited my ability to ask novel questions of the database, and also
encouraged inconsistency of data entry practices.  The small price I pay for
increased normalization has yielded ample return in more "powerful" datasets
(i.e., more flexibility in how I can frame and/or analyze the data).
...
I will save my questions about the right side of Rich's 
diagram for later.
That would be best answered through documentation of GNUB, which I will be
working on intensively over the next two months.

Aloha,
Rich

Richard Pyle