[tdwg-content] practical details of recording a determination What is an Occurrence?

Tue Oct 19 19:12:18 CEST 2010

Thanks, Steve.

The diagram looks about right, except for the arrow heads as you noted. If
there's a way you can replace the arrows with some sort of 1:Many line
notation, that would be better.  As you have it now, the arrowhead is on the
"one" side; but I think it's more intuitive to have a "crows-foot" sort of
symbol on the "many" side.  I can send an example of what I mean.  Not a big
deal.

Yeah, I originally had it as eventDate, but then switched to eventTime.  If
Date can include time (and Time is assumed not to include date), then using
eventDate is fine.

> In principle, I agree with this diagram to the left of taxonNameUsage
> completely.  
> (I still need clarification about a few things on the right end.)  

Yes, that's a failing on my part to get better documentation out there for
GNUB (from which TaxonNameUsage, and all of the other "Usage" terms in DwC
come).  I hope to correct this by the end of the year.

> My main reason 
> for using determination as a term rather than identification 
> is because it is not ambiguous to refer to the person doing 
> the identifying as the determiner, whereas referring to that 
> person as the "identifier" creates confusion between that 
> person and the identifying string for resources (as in 
> "persistent identifier").  

Ah!  Got it.  Makes sense. 

> So if we agree that determination, annotation, and identification 
> all mean the same thing (namely an instance of the dwc:Identification 
> class), I'm happy to just use the term "identification".  
> For the person doing it, I guess dwc:identifiedBy would be the 
> best term although it's a bit awkward in regular speech so I may
> slip and still say "determiner".  

Either way.  Now that you put it in that context, I'm also happy to go with
"Determination" and "Determiner".

But I would avoid "Annotation".  That word has a much more general meaning,
and we'll likely be hearing more and more about it (in the more general
sense) as several big-ish projects are working on Annotations (in general)
right now.  

> Although I agree in principle that there can be many occurrences 
> at an Event and many events at a Location, I think there are two 
> practical reasons why it may be better to assign separate 
> eventDate and Location metadata to each Occurrence.  

Hmmm...not sure I follow.  Are you saying that a new Event record (ID)
should be created for every Occurrence record, and that a new Location
record (ID) should be created for every Event record?  If so, then it's
going to be very difficult to convince me of this.  I don't think that our
database is unusual in having many (sometimes hundreds) of Occurrences at
the same Event (e.g., a large fish poison station), and many (again,
sometimes hundreds) of Events at the same Location.

> The first is that it makes the database structure simpler. 
> As Markus has already noted, we really would prefer for 
> the database to be as "flat" as possible.  

Which database?  Are we talking about DwCA?  If so, I understand the
rationale for flattening out content to make it easier to batch-package
records and ship them around.  But if we're talking about actual database
implementations at the content provider end, I think I'm not alone in
wanting to stick with a more normalized approach.  Besides, what's the point
of even defining the different DwC classes, each with their own ID, if we're
just going to flatten them all out anyway (as per old DwC)?

> When I look at the terms listed in the DwC 
> term page (http://rs.tdwg.org/dwc/terms/index.htm) 
> under Event, the most important one that I see which 
> everyone should be providing is eventDate.  The 
> rest I would pretty much consider optional and as 
> a shortcut Rich's diagram could be collapsed to 
> make them direct properties of the Occurrence.  

Yes, they could be collapsed to Occurrence -- in the same way that
properties of "Individual" are currently collapsed to Occurrence.  But after
pleading your case to normalize "Individual" as its own separate class, I'm
kinda surprised to see you arguing in favor of collapsing the Event class
into Occurrence.

> The second reason involves the practical matter of defining a 
> Location.  I will note that my thinking about this has been 
> deeply influenced by a previous discussion on the topic from 
> 2008-2009 summarized at 
> http://www.sernec.org/files/summary-of-discussion.pdf on p.78-84.  
> I don't think most people will want to wade through all of that 
> text, so I'll just sum it up here.  Somebody (I think it might 
> have been Debbie Paul at Morphbank) suggested to me that we 
> really have an intrinsically globally unique identifier for 
> Location.  It's the combination of dwc:decimalLatitude and 
> dwc:decimalLongitude along with dwc:coordinateUncertaintyInMeters 
> to establish precision and dwc:geodeticDatum to establish 
> the reference system.  

Yes....sort of.  Doesn't help for localities defined as bounded boxes,
polygons or lines (e.g., transects, as we often have for data from plankton
tows) -- but it certainly does serve as a handy "natural key" of sorts for
point localities.  The problem is that so much of our existing content is not
reliably georeferenced yet.  Thus, we need all those other terms to
accommodate various location descriptors, which will eventually allow us to
after-the-fact georeference the localities.  Also, many after-the-fact
georeferenced points are interpretations.  Keeping the descriptors around
can allow someone else to come up with a better/more precise
lat/long/uncertainty interpretation.  Also, errors are abundant
(particularly in failing to represent decimal degrees with negatives).
Having the descriptors allows us to catch such errors much more quickly.
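
To make the "natural key" idea concrete, here is a minimal Python sketch.  It
is only an illustration, not anything DwC prescribes: the names PointLocation,
location_key and longitude_sign_suspect, and the rounding precision, are
assumptions made up for this example.  Only the four DwC term names come from
the discussion above.  It also shows the kind of sign error a textual
descriptor lets you catch.

```python
from typing import NamedTuple


class PointLocation(NamedTuple):
    """Hypothetical container; the field names mirror the DwC terms."""
    decimalLatitude: float
    decimalLongitude: float
    coordinateUncertaintyInMeters: float
    geodeticDatum: str


def location_key(loc: PointLocation, places: int = 5) -> tuple:
    """Build a natural key for a point locality.

    Rounding to `places` decimals is an assumption, so that trivially
    different float representations map to the same key.
    """
    return (
        round(loc.decimalLatitude, places),
        round(loc.decimalLongitude, places),
        round(loc.coordinateUncertaintyInMeters, 1),
        loc.geodeticDatum.strip().upper(),
    )


def longitude_sign_suspect(loc: PointLocation, expect_western: bool) -> bool:
    """Flag a record whose longitude sign disagrees with the hemisphere
    implied by its textual descriptors (e.g. a Hawaiian island name)."""
    return (loc.decimalLongitude < 0) != expect_western


good = PointLocation(21.3, -157.85, 100.0, "WGS84")
bad = PointLocation(21.3, 157.85, 100.0, "WGS84")  # minus sign dropped
```

The two records differ only in the sign of the longitude, yet they produce
different keys.  Without a descriptor saying the locality is in the western
hemisphere, nothing in the coordinates alone marks the second one as wrong.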

> (If we like geo:lat and geo:long, then the reference system 
> is implied and we are down to three terms to unambiguously 
> define a Location and its uncertainty.  For the benefits of 
> humans, a Locality description is probably also beneficial.  
> Also, elevation and depth might be provided, although at 
> least in theory elevation could be calculated with a 
> sufficiently good digital elevation model).  

Well, that depends on the extent of coordinateUncertaintyInMeters.  Original
data often have fairly precise elevations, but imprecise lat/long.  Thus,
one can often narrow down the likely location more precisely than a circle
described by a point/radius.  In other words,
Lat+Long+coordinateUncertaintyInMeters may describe a circle that includes a
range of elevations, and thus an elevation cannot be reliably calculated.
Moreover, a lot of modelling use-cases will want as precise of an elevation
as possible.

> I will grant that we don't have this information for a lot 
> of old records, but based on the massive efforts to 
> geolocate specimens, I would say it's pretty clear that 
> this is what we would like to have if we could get it.  

Sure -- but we can't really ignore the massive numbers of non-georeferenced
datapoints that already exist.  And even when they are georeferenced, we'll
still want to keep the original location descriptors.

> I certainly hope that there aren't any serious collectors, 
> observers, and live organism photographers who aren't 
> by this point trying to record this information as they 
> establish new Occurrence records.  If you look at all of 
> the Location terms on the dwc list, most of the other 
> terms are either concessions to the fact that we don't 
> have what we want (e.g. the "verbatim" terms), things 
> we could generate using a computer program if we were 
> clever (like stateProvince, county, etc. - I know at 
> least Mike Giddens has succeeded in doing this), ways of 
> indicating how we got lat and long from old records 
> (e.g. georeferenceSources), or methods to define larger 
> scale Locations that aren't points (e.g. footprintWKT). 
> I think it is safe to say that in the future (if not 
> now already), many or most Events associated with 
> Occurrences will have an associated button click (on 
> a GPS receiver, camera phone, or GPS enabled camera) that 
> will automatically generate dwc:eventDate, dwc:decimalLatitude, 
> dwc:decimalLongitude (with geodeticDatum=WGS84) and maybe 
> coordinateUncertaintyInMeters.  Thus designing a system 
> that requires that these time/space snapshots be grouped 
> together into artificial "Locations" is really 
> counterproductive when those data are now generated and 
> can be associated with Occurrences automatically.  

OK, I see where you're coming from.  But I guess my response is that we're a
LONG way from:

- A world where most existing Occurrence content is well georeferenced;
- A world where reliable services allow me to easily/automatically query
records on, say, "Northwestern Hawaiian Islands", based on GIS polygon
querying;
- A world where content holders are not going to want to share all the
textual locality descriptors with their records.

Besides, we're going to always want to maintain the ability to define
locations as bounded boxes, polygons, and lines; not just point-radius.

Moreover, there are many cases where a single location is re-used for many
different events.  The same tree monitored continuously over years.  The
same field station re-visited year after year. The same transect repeated
every month/season/year to monitor populations & presence/absence. LTER
data.  Christmas Count data.  In other words, it's common to have multiple
Events stacked at the same locality; and we won't want to always limit
ourselves to defining that locality using point/radius.

So I still see the advantage of keeping Location as a separate class,
maintaining those extra "human-friendly" descriptor terms, and
conceptualizing 1:M Location:Events.

> I don't know if Greg Riccardi of Morphbank is following 
> this thread or not.  If so he may want to comment on this 
> issue based on practical experience at Morphbank.  When 
> the Morphbank system was set up, it required the creation 
> of a separate Location record which was assigned a unique 
> Morphbank identifier.  Specimens were then linked to this 
> Location.  What ended up happening was that each Specimen 
> having GPS metadata ended up being assigned to its own 
> separate Location even if it was 20 meters from another 
> specimen.  In effect, each Occurrence record ended up 
> having its own decimalLatitude/decimalLongitude record 
> anyway.  So the system ended up being more complicated 
> than necessary.

Yes -- that's definitely a trend of modern collecting data with GPS....a
tendency towards fewer and fewer instances of Events per instance of
Location -- to the point where many records are now 1:1.  However, that
doesn't change the fact that an enormous volume of content currently is, and
always will be, best structured as many Events per location, and many
Occurrences per Event. I think DwC needs to accommodate that content.  It's
easy to store 1:1 records in a structure designed to accommodate 1:M.  But
it's a lot messier to generate 1:M content in a structure designed for 1:1.

> As I said, I agree in principle with the left side of 
> Rich's diagram.  Taking the practical considerations I 
> just mentioned into account, I would simplify the diagram as
> http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif

I think that's perfectly fine as a simplified structure for data exchange,
such as DwCA and other applications that aggregate content and/or provide
value-added or indexing services on aggregated content.  But for original
source data, I think it would be unwise to advocate such a simplified
structure to anything but small ad-hoc project-based systems.  It's
incredibly easy to flatten out the more normalized form into the more
flattened form; not so easy to go the other way around.
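
As a rough illustration of why the normalized-to-flat direction is the easy
one, here is a minimal Python sketch.  The class layout and the flatten
function are assumptions made up for this example (only the ID and term names
echo DwC), and the sample records are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    locationID: str
    locality: str


@dataclass(frozen=True)
class Event:
    eventID: str
    locationID: str  # many Events may point at one Location
    eventDate: str


@dataclass(frozen=True)
class Occurrence:
    occurrenceID: str
    eventID: str  # many Occurrences may point at one Event


def flatten(locations, events, occurrences):
    """Join the three normalized tables into one flat row per Occurrence."""
    loc_by_id = {loc.locationID: loc for loc in locations}
    evt_by_id = {evt.eventID: evt for evt in events}
    rows = []
    for occ in occurrences:
        evt = evt_by_id[occ.eventID]
        loc = loc_by_id[evt.locationID]
        rows.append({
            "occurrenceID": occ.occurrenceID,
            "eventDate": evt.eventDate,
            "locality": loc.locality,
        })
    return rows


# Invented sample data: one Location, two Events, three Occurrences.
locations = [Location("loc1", "Field station A")]
events = [
    Event("evt1", "loc1", "2009-06-01"),
    Event("evt2", "loc1", "2010-06-01"),
]
occurrences = [
    Occurrence("occ1", "evt1"),
    Occurrence("occ2", "evt1"),
    Occurrence("occ3", "evt2"),
]
flat_rows = flatten(locations, events, occurrences)
```

Flattening is just two lookups per Occurrence.  Going the other way means
deciding which repeated eventDate and locality values in the flat rows really
refer to the same Event or Location, and any inconsistency in how they were
typed gets in the way of that.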

I also still find it interesting that you are quite content to flatten
Locality and Event class terms into Occurrence, while simultaneously wanting
to normalize Individual as a new class, separate from Occurrence.  I'm not
saying that establishing an Individual class is a bad idea -- in principle I
support it.  But I am curious as to why you think it important to push for
normalization on the Individual side of an Occurrence, but push for
de-normalization on the Event side of an Occurrence.

> Superficially, it looks more complicated, but I've gotten 
> rid of several "one to many" relationships and enthroned 
> Occurrence at its accustomed place in the center of the 
> universe (or at least the center of the left side of the 
> diagram).  I don't have any philosophical objections to 
> people structuring their data according to Rich's original
> diagram and the existing Darwin Core terms certainly make
> it possible to do so (well except for the Individual thing). 
> However, I submit that many people will find it simpler
> (and easier to use tools like Darwin Core Archives) if 
> they use the flatter structure that I have in the revised 
> diagram.

That may be true.  But I've spent more than two decades taking
over-simplified, flattened database structures and transforming them into
more normalized structures, because the flattened structures consistently
limited my ability to ask novel questions of the database, and also
encouraged inconsistency of data entry practices.  The small price I pay for
increased normalization has yielded ample return in more "powerful" datasets
(i.e., more flexibility in how I can frame and/or analyze the data).

> I will save my questions about the right side of Rich's 
> diagram for later.

That would be best answered through documentation of GNUB, which I will be
working on intensively over the next two months.

Aloha,
Rich