[tdwg-content] practical details of recording a determination What is an Occurrence?
Steve Baskauf
steve.baskauf at vanderbilt.edu
Wed Oct 20 22:06:52 CEST 2010
Specific responses inline
> Hmmm...not sure I follow. Are you saying that a new Event record (ID)
> should be created for every Occurrence record, and that a new Location
> record (ID) should be created for every Event record? If so, then it's
> going to be very difficult to convince me of this. I don't think that our
> database is unusual in having many (sometimes hundreds) of Occurrences at
> the same Event (e.g., a large fish poison station), and many (again,
> sometimes hundreds) of Events at the same Location.
>
I think I answered this (in a sense) in the other email that I sent a
little while ago. In principle, if every record of an Occurrence has a
time that is discernibly different from the times of other Occurrences
(i.e. because the time was recorded to the nearest second by a machine),
then yes, I consider it to be a different event. To force people to
lump such Occurrences together into one Event (say comprising a day or
some other time interval larger than a second) is essentially throwing
away data that we already have about the Occurrences. I have already
found it useful to know exactly what time of day a flower was opened
rather than closed or the order in which I took several images.
As for creating a new Event record ID for all of those one-second
events, I would submit the same solution that I gave in the previous
post. In my database, I don't have separate Event records for the times
when I took live plant images (which I am rightly or wrongly calling
Occurrences). I just have a "flat" table where each image has an
eventTime (the time recorded automatically in the EXIF data for the
image). If I needed an RDF structure that was compatible with the kind
of structure used by people who have many Occurrences associated with a
single event (e.g. your fish kill), then for the image
http://bioimages.vanderbilt.edu/baskauf/57755 I would automatically
create an identifier for the event of its creation as
http://bioimages.vanderbilt.edu/baskauf/57755#event . There! I have a
perfectly valid GUID that represents the event without any additional
burden on my record-keeping system since I don't have to keep any
additional records about that GUID beyond the ones I'm already keeping
for the image. I'm not doing this now, but I suppose I should if there
is a consensus that things are related to each other in the way shown in
your diagram. The same approach could be taken for Location. People
who care about having unrelated identifiers for their Events and
Locations (because of one-to-many relationships) would be welcome to do
so, but I wouldn't need to in my internal database.
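
A minimal sketch of this derived-identifier idea in Python. The
function name derived_identifier is hypothetical, and the "#location"
fragment is an assumption (only "#event" is named above); the point is
just that the identifier is computed from the image URI at export time,
so no extra record has to be stored.

```python
def derived_identifier(image_uri: str, kind: str) -> str:
    """Build an identifier by appending a fragment to an image URI.

    Nothing new is stored: the result is derived on demand from the
    identifier that the image record already has.
    """
    if "#" in image_uri:
        raise ValueError("image identifier already carries a fragment")
    return f"{image_uri}#{kind}"


image = "http://bioimages.vanderbilt.edu/baskauf/57755"
event_id = derived_identifier(image, "event")
# Assumption: "location" is an illustrative fragment name, not one in use.
location_id = derived_identifier(image, "location")
print(event_id)
print(location_id)
```

Someone with one-to-many relationships would instead mint independent
identifiers for Events and Locations; both kinds can coexist in the
same RDF graph.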
> Which database? Are we talking about DwCA? If so, I understand the
> rationale for flattening out content to make it easier to batch-package
> records and ship them around. But if we're talking about actual database
> implementations at the content provider end, I think I'm not alone in
> wanting to stick with a more normalized approach. Besides, what's the point
> of even defining the different DwC classes, each with their own ID, if we're
> just going to flatten them all out anyway (as per old DwC)?
>
Well, yes I had DwCA (Darwin Core Archives) specifically in mind. But
it could be any other "shipping format" or local database format. I
think we need to draw the distinction between having a way that people
can understand the meaning of metadata records and fields that we are
shipping to them (e.g. DwCA), and describing the properties and
connections between resources (e.g. in RDF). I think this may be what
Pete means when he says that we need two "kinds" of DwC. The first use
is pretty much "ready to go". I don't think we know how close we are to
being able to use the existing DwC terms for describing relationships
until we have some more conversation of the sort we are having now as
well as conversation about which existing terms can be used (perhaps in
ways that weren't originally intended) to express the relationships
needed in RDF. I feel like I hear Pete saying that we are still lacking
a lot of the predicates we need, while I (and maybe Cam) feel that we
are most of the way there.
For clarification, when I argue that the class Individual should exist
in Darwin Core, it's not because I'm insisting that all users must have
an Individual table in their database. What I want is for people to be
ABLE to have an Individual table in their database (if they need it) and
have others understand what it means and how the entities described in
that table are related conceptually to other things like Identifications
and Occurrences. If all of the records in their database have only one
occurrence per individual, they don't "need" to keep track of Individuals.
> Yes, they could be collapsed to Occurrence -- in the same way that
> properties of "Individual" are currently collapsed to Occurrence. But after
> pleading your case to normalize "Individual" as its own separate class, I'm
> kinda surprised to see you arguing in favor of collapsing the Event class
> into Occurrence.
>
I confess my crime. In penance, I freely confess that the Event class
exists and that people should use it in their databases if it helps them
cluster Occurrences. In addition, I confess that I should probably
acknowledge the existence of Events and their relationship to other
Darwin Core classes when I write RDF. Guilty as charged!
>> Yes....sort of. Doesn't help for localities defined as bounded boxes,
>> polygons or lines (e.g., transects, as we often have for data from plankton
>> tows) -- but it certainly does serve as a handy "natural key" of sorts for
>> point localities. The problem is that so much of our existing content is not
>> reliably georeferenced yet. Thus, we need all those other terms to
>> accommodate various location descriptors, which will eventually allow us to
>> after-the-fact georeference the localities. Also, many after-the-fact
>> georeferenced points are interpretations. Keeping the descriptors around
>> can allow someone else to come up with a better/more precise
>> lat/long/uncertainty interpretation. Also, errors are abundant
>> (particularly in failing to represent decimal degrees with negatives).
>> Having the descriptors allows us to catch such errors much more quickly.
>>
Here I will confess the crime of ignorance. I'm still trying to
understand the need for and uses of a number of the Darwin Core
dcterms:Location class terms. I guess I need to spend some more time
reading the Guide to Best Practices for Georeferencing (I was going to
include the link here, but the link at
http://www.biogeomancer.org/library.html is broken).
> Sure -- but we can't really ignore the massive numbers of non-georeferenced
> datapoints that already exist. And even when they are georeferenced, we'll
> still want to keep the original location descriptors.
>
>
Agree on this and all other points I deleted here.
> OK, I see where you're coming from. But I guess my response is that we're a
> LONG way from:
>
> - A world where most existing Occurrence content is well georeferenced;
>
[text omitted for brevity]
> So I still see the advantage of keeping Location as a separate class,
> maintaining those extra "human-friendly" descriptor terms, and
> conceptualizing 1:M Location:Events.
>
I'm totally convinced. Location and Event belong where you had them in
the diagram.
[more text listed below]
>
> I also still find it interesting that you are quite content to flatten
> Locality and Event class terms into Occurrence, while simultaneously wanting
> to normalize Individual as a new class, separate from Occurrence. I'm not
> saying that establishing an Individual class is a bad idea -- in principle I
> support it. But I am curious as to why you think it important to push for
> normalization on the Individual side of an Occurrence, but push for
> de-normalization on the Event side of an Occurrence.
>
Again, I plead guilty as charged. The left side of the chart should be
allowed to not be flat. However, I maintain my stance that many (most?)
new Occurrence records will in the future have their own individual
latitude/longitude/elevation or depth/time. Those "atomized event
points" could easily be aggregated by software into larger scale events
and locations by some simple rules about the timespan for events and
geographic bounds for locations. Those larger scale events and
locations could be used to ask the kinds of questions you describe
below. As long as you aren't requiring me to do this kind of
aggregation BEFORE I create my records (and hence requiring me to lose
the data that my GPS has collected for me automatically), I'm happy to
allow others to define events and locations on larger scales and with
1:M relationships.
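
One possible form of such "simple rules", sketched in Python. The
function name aggregate_events, the one-hour gap, and the 0.01 degree
tolerance are all illustrative assumptions, not values proposed in this
thread. The per-record time and coordinates are left untouched;
grouping happens after the fact.

```python
from datetime import datetime, timedelta


def aggregate_events(records, max_gap=timedelta(hours=1), max_deg=0.01):
    """Group (timestamp, lat, lon) records into larger-scale events.

    A record joins the current group if it is within max_gap of the
    previous record in time and within max_deg degrees (lat and lon)
    of the first record in that group. Otherwise it starts a new group.
    """
    groups = []
    for rec in sorted(records, key=lambda r: r[0]):
        if groups:
            first = groups[-1][0]
            prev = groups[-1][-1]
            close_in_time = rec[0] - prev[0] <= max_gap
            close_in_space = (abs(rec[1] - first[1]) <= max_deg
                              and abs(rec[2] - first[2]) <= max_deg)
            if close_in_time and close_in_space:
                groups[-1].append(rec)
                continue
        groups.append([rec])
    return groups


# Made-up sample points: two images seconds apart, one hours later.
records = [
    (datetime(2010, 10, 20, 9, 0, 5), 36.1447, -86.8027),
    (datetime(2010, 10, 20, 9, 0, 47), 36.1448, -86.8026),
    (datetime(2010, 10, 20, 15, 30, 0), 36.1447, -86.8027),
]
events = aggregate_events(records)
print(len(events))
```

Because the thresholds are parameters, different users could derive
different larger-scale Events and Locations from the same atomized
records.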
>
> That may be true. But I've spent more than two decades taking
> over-simplified, flattened database structures and transforming them into
> more normalized structures, because the flattened structures consistently
> limited my ability to ask novel questions of the database, and also
> encouraged inconsistency of data entry practices. The small price I pay for
> increased normalization has yielded ample return in more "powerful" datasets
> (i.e., more flexibility in how I can frame and/or analyze the data).
>
As I said in my earlier email, I'm encouraged by the consistency in the
way I hear people talking about the relationships among the DwC
classes. I was a bit afraid when we started this thread that I would
turn out to have some kind of fringe ideas. Now what I'm seeing is a
lot of variation in how people choose to "collapse" the basic model to
meet their individual needs, but not a lot of disagreement about what
the basic model is.
Thanks for your great feedback and for challenging my statements. I
need that!
Steve
--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.
delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu