Re: [tdwg-content] practical details of recording a determination What is an Occurrence?

20 Oct 2010

      Specific responses inline
...
Hmmm...not sure I follow.  Are you saying that a new Event record (ID)
should be created for every Occurrence record, and that a new Location
record (ID) should be created for every Event record?  If so, then it's
going to be very difficult to convince me of this.  I don't think that our
database is unusual in having many (sometimes hundreds) of Occurrences at
the same Event (e.g., a large fish poison station), and many (again,
sometimes hundreds) of Events at the same Location.
I think I answered this (in a sense) in the other email that I sent a 
little while ago.  In principle, if every record of an Occurrence has a 
time that is discernibly different from the times of other Occurrences 
(i.e. because the time was recorded to the nearest second by a machine), 
then yes, I consider it to be a different event.  To force people to 
lump such Occurrences together into one Event (say comprising a day or 
some other time interval larger than a second) is essentially throwing 
away data that we already have about the Occurrences.  I have already 
found it useful to know exactly what time of day a flower was opened 
rather than closed or the order in which I took several images. 

As for creating a new Event record ID for all of those one-second 
events, I would submit the same solution that I gave in the previous 
post.  In my database, I don't have separate Event records for the times 
when I took live plant images (which I am rightly or wrongly calling 
Occurrences).  I just have a "flat" table where each image has an 
eventTime (the time recorded automatically in the EXIF data for the 
image).  If I needed an RDF structure that was 
compatible with the kind of structure used by people who have many 
Occurrences associated with a single event (e.g. your fish kill), then for 
the image http://bioimages.vanderbilt.edu/baskauf/57755 I would 
automatically create an event identifier for its creation as 
http://bioimages.vanderbilt.edu/baskauf/57755#event .  There!  I have a 
perfectly valid GUID that represents the event without any additional 
burden on my record-keeping system since I don't have to keep any 
additional records about that GUID beyond the ones I'm already keeping 
for the image.  I'm not doing this now, but I suppose I should if there 
is a consensus that things are related to each other in the way shown in 
your diagram.  The same approach could be taken for Location.  People 
who care about having unrelated identifiers for their Events and 
Locations (because of one-to-many relationships) would be welcome to do 
so, but I wouldn't need to in my internal database.
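
A minimal sketch of the derived-identifier idea described above, in Python. The function name derived_uri and the "#location" fragment are assumptions made for illustration; only the "#event" form appears in the message itself.

```python
from urllib.parse import urldefrag


def derived_uri(record_uri: str, fragment: str) -> str:
    """Build an identifier for a related resource by appending a fragment
    to the URI of the record that is already being kept."""
    # Drop any fragment already present so the result has exactly one.
    base, _ = urldefrag(record_uri)
    return f"{base}#{fragment}"


image_uri = "http://bioimages.vanderbilt.edu/baskauf/57755"
event_uri = derived_uri(image_uri, "event")
# "#location" is a hypothetical fragment name, not one used in the message.
location_uri = derived_uri(image_uri, "location")
print(event_uri)
print(location_uri)
```

Because the derived identifiers are computed from the image URI, nothing beyond the existing image record has to be stored.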
...
Which database?  Are we talking about DwCA?  If so, I understand the
rationale for flattening out content to make it easier to batch-package
records and ship them around.  But if we're talking about actual database
implementations at the content provider end, I think I'm not alone in
wanting to stick with a more normalized approach.  Besides, what's the point
of even defining the different DwC classes, each with their own ID, if we're
just going to flatten them all out anyway (as per old DwC)?
Well, yes I had DwCA (Darwin Core Archives) specifically in mind.  But 
it could be any other "shipping format" or local database format.  I 
think we need to draw the distinction between having a way that people 
can understand the meaning of metadata records and fields that we are 
shipping to them (e.g. DwCA), and describing the properties and 
connections between resources (e.g. in RDF).  I think this may be what 
Pete means when he says that we need two "kinds" of DwC.  The first use 
is pretty much "ready to go".  I don't think we know how close we are to 
being able to use the existing DwC terms for describing relationships 
until we have some more conversation of the sort we are having now as 
well as conversation about which existing terms can be used (perhaps in 
ways that weren't originally intended) to express the relationships 
needed in RDF.  I feel like I hear Pete saying that we are still lacking 
a lot of the predicates we need, while I (and maybe Cam) feel that we 
are most of the way there.

For clarification, when I argue that the class Individual should exist 
in Darwin Core, it's not because I'm insisting that all users must have 
an Individual table in their database.  What I want is for people to be 
ABLE to have an Individual table in their database (if they need it) and 
have others understand what it means and how the entities described in 
that table are related conceptually to other things like Identifications 
and Occurrences.  If all of the records in their database have only one 
occurrence per individual, they don't "need" to keep track of Individuals.
...
Yes, they could be collapsed to Occurrence -- in the same way that
properties of "Individual" are currently collapsed to Occurrence.  But after
pleading your case to normalize "Individual" as its own separate class, I'm
kinda surprised to see you arguing in favor of collapsing the Event class
into Occurrence.
I confess my crime.  In penance, I freely confess that the Event class 
exists and that people should use it in their databases if it helps them 
cluster Occurrences.  In addition, I confess that I should probably 
acknowledge the existence of Events and their relationship to other 
Darwin Core classes when I write RDF.  Guilty as charged!
...
...
Yes....sort of.  Doesn't help for localities defined as bounded boxes,
polygons or lines (e.g., transects, as we often have for data from plankton
tows) -- but it certainly does serve as a handy "natural key" of sorts for
point localities.  The problem is that so much of our existing content is not
reliably georeferenced yet.  Thus, we need all those other terms to
accommodate various location descriptors, which will eventually allow us to
after-the-fact georeference the localities.  Also, many after-the-fact
georeferenced points are interpretations.  Keeping the descriptors around
can allow someone else to come up with a better/more precise
lat/long/uncertainty interpretation.  Also, errors are abundant
(particularly in failing to represent decimal degrees with negatives).
Having the descriptors allows us to catch such errors much more quickly.
Here I will confess the crime of ignorance.  I'm still trying to 
understand the need for and uses of a number of the Darwin Core 
dcterms:Location class terms.  I guess I need to spend some more time 
reading the Guide to Best Practices for Georeferencing (I was going to 
include the link here, but the link at 
http://www.biogeomancer.org/library.html is broken). 
Sure -- but we can't really ignore the massive numbers of non-georeferenced
datapoints that already exist. And even when they are georeferenced, we'll
still want to keep the original location descriptors.
Agree on this and all other points I deleted here.
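
The point quoted above about using retained descriptors to catch missing negative signs can be sketched roughly as follows, in Python. The lookup table, its two entries, and the function name are assumptions for illustration only; a real check would need a far more complete gazetteer and would have to handle countries that span both hemispheres.

```python
# Hypothetical lookup: expected sign of the decimal longitude for countries
# that lie entirely in one hemisphere (illustrative entries only).
EXPECTED_LON_SIGN = {"Brazil": -1, "Japan": 1}


def longitude_sign_suspect(country: str, decimal_longitude: float) -> bool:
    """Return True when the longitude's sign contradicts the country
    descriptor that was kept alongside the georeference."""
    expected = EXPECTED_LON_SIGN.get(country)
    if expected is None or decimal_longitude == 0:
        # No basis for a judgement, so do not flag the record.
        return False
    actual = 1 if decimal_longitude > 0 else -1
    return actual != expected


print(longitude_sign_suspect("Brazil", 47.93))   # positive longitude for Brazil is flagged
print(longitude_sign_suspect("Brazil", -47.93))  # consistent with the descriptor
```

Without the original country descriptor in the record, there would be nothing to compare the coordinate against.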
...
OK, I see where you're coming from.  But I guess my response is that we're a
LONG way from:
- A world where most existing Occurrence content is well georeferenced;
[text omitted for brevity]
...
So I still see the advantage of keeping Location as a separate class,
maintaining those extra "human-friendly" descriptor terms, and
conceptualizing 1:M Location:Events.
I'm totally convinced.  Location and Event belong where you had them in 
the diagram. 

[more text listed below]
...
I also still find it interesting that you are quite content to flatten
Locality and Event class terms into Occurrence, while simultaneously wanting
to normalize Individual as a new class, separate from Occurrence.  I'm not
saying that establishing an Individual class is a bad idea -- in principle I
support it.  But I am curious as to why you think it important to push for
normalization on the Individual side of an Occurrence, but push for
de-normalization on the Event side of an Occurrence.
Again, I plead guilty as charged.  The left side of the chart should be 
allowed to not be flat.  However, I maintain my stance that many (most?) 
new Occurrence records will in the future have their own individual 
latitude/longitude/elevation or depth/time.  Those "atomized event 
points" could easily be aggregated by software into larger scale events 
and locations by some simple rules about the timespan for events and 
geographic bounds for locations.  Those larger scale events and 
locations could be used to ask the kinds of questions you describe 
below.  As long as you aren't requiring me to do this kind of 
aggregation BEFORE I create my records (and hence requiring me to lose 
the data that my GPS has collected for me automatically), I'm happy to 
allow others to define events and locations on larger scales and with 
1:M relationships.
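
A rough sketch, in Python, of how software might aggregate per-record times and coordinates into larger events after the fact. The Occurrence fields, the one-hour span, and the 0.01-degree bound are all assumptions chosen for illustration; the message only says that "some simple rules" about timespan and geographic bounds would be used.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Occurrence:
    occurrence_id: str
    time: datetime
    lat: float
    lon: float


def aggregate_events(occurrences, max_span=timedelta(hours=1), max_deg=0.01):
    """Group occurrences into events: a record joins the current event if it
    falls within max_span and max_deg of that event's first record."""
    groups = []
    for occ in sorted(occurrences, key=lambda o: o.time):
        if groups:
            first = groups[-1][0]
            if (occ.time - first.time <= max_span
                    and abs(occ.lat - first.lat) <= max_deg
                    and abs(occ.lon - first.lon) <= max_deg):
                groups[-1].append(occ)
                continue
        # Otherwise this record starts a new aggregated event.
        groups.append([occ])
    return groups


records = [
    Occurrence("a", datetime(2010, 10, 20, 9, 0, 0), 36.1447, -86.8027),
    Occurrence("b", datetime(2010, 10, 20, 9, 0, 5), 36.1448, -86.8027),
    Occurrence("c", datetime(2010, 10, 20, 14, 30, 0), 36.3000, -86.6000),
]
groups = aggregate_events(records)
print([[o.occurrence_id for o in g] for g in groups])
```

The per-second times and GPS coordinates stay on each record; the grouping is computed afterwards and can be redone with different thresholds.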
...
That may be true.  But I've spent more than two decades taking
over-simplified, flattened database structures and transforming them into
more normalized structures, because the flattened structures consistently
limited my ability to ask novel questions of the database, and also
encouraged inconsistency of data entry practices.  The small price I pay for
increased normalization has yielded ample return in more "powerful" datasets
(i.e., more flexibility in how I can frame and/or analyze the data).
As I said in my earlier email, I'm encouraged by the consistency in the 
way I hear people talking about the relationships among the DwC 
classes.  I was a bit afraid when we started this thread that I would 
turn out to have some kind of fringe ideas.  Now what I'm seeing is a 
lot of variation on how people choose to "collapse" the basic model to 
meet their individual needs, but not a lot of disagreement about what 
the basic model is.

Thanks for your great feedback and for challenging my statements.  I 
need that!

Steve

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu
