Specific responses inline
Hmmm...not sure I follow. Are you saying that a new Event record (ID)
should be created for every Occurrence record, and that a new Location
record (ID) should be created for every Event record? If so, then it's
going to be very difficult to convince me of this. I don't think that our
database is unusual in having many (sometimes hundreds) of Occurrences at
the same Event (e.g., a large fish poison station), and many (again,
sometimes hundreds) of Events at the same Location.
I think I answered this (in a sense) in the other email that I sent a
little while ago. In principle, if every record of an Occurrence has a
time that is discernibly different from the times of other Occurrences
(i.e. because the time was recorded to the nearest second by a
machine), then yes, I consider it to be a different event. To force
people to lump such Occurrences together into one Event (say comprising
a day or some other time interval larger than a second) is essentially
throwing away data that we already have about the Occurrences. I have
already found it useful to know exactly what time of day a flower was
opened rather than closed or the order in which I took several images.
As for creating a new Event record ID for all of those one-second
events, I would submit the same solution that I gave in the previous
post. In my database, I don't have separate Event records for the
times when I took live plant images (which I am rightly or wrongly
calling Occurrences). I just have a "flat" table where each image has
an eventTime (the time recorded automatically in the EXIF data for the
image). If I needed an RDF structure that was
compatible with the kind of structure used by people who have many
Occurrences associated with a single Event (e.g. your fish kill), then for
the image http://bioimages.vanderbilt.edu/baskauf/57755 I would
automatically create an event identifier for its creation as
http://bioimages.vanderbilt.edu/baskauf/57755#event . There! I have a
perfectly valid GUID that represents the event without any additional
burden on my record-keeping system since I don't have to keep any
additional records about that GUID beyond the ones I'm already keeping
for the image. I'm not doing this now, but I suppose I should if there
is a consensus that things are related to each other in the way shown
in your diagram. The same approach could be taken for Location.
People who care about having unrelated identifiers for their Events and
Locations (because of one-to-many relationships) would be welcome to do
so, but I wouldn't need to in my internal database.
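
A minimal sketch of the identifier scheme described above, assuming the
image URI is already stable. The function name derived_identifier and the
"#location" fragment are hypothetical; only the "#event" form appears in
the text above.

```python
def derived_identifier(resource_uri: str, kind: str) -> str:
    """Build an identifier for a related entity by appending a fragment
    (e.g. '#event') to an existing resource URI."""
    # Drop any fragment already present so the result has exactly one.
    base = resource_uri.split("#", 1)[0]
    return f"{base}#{kind}"


image_uri = "http://bioimages.vanderbilt.edu/baskauf/57755"

# The event identifier is computed from the image URI, never stored.
event_uri = derived_identifier(image_uri, "event")

# Assumption: the same pattern applied to Location, with a made-up fragment.
location_uri = derived_identifier(image_uri, "location")

print(event_uri)
print(location_uri)
```

Because the identifiers are derived on demand, nothing extra has to be
kept in the flat table; this only works while each image has exactly one
Event and one Location.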
Which database? Are we talking about DwCA? If so, I understand the
rationale for flattening out content to make it easier to batch-package
records and ship them around. But if we're talking about actual database
implementations at the content provider end, I think I'm not alone in
wanting to stick with a more normalized approach. Besides, what's the point
of even defining the different DwC classes, each with their own ID, if we're
just going to flatten them all out anyway (as per old DwC)?
Well, yes I had DwCA (Darwin Core Archives) specifically in mind. But
it could be any other "shipping format" or local database format. I
think we need to draw the distinction between having a way that people
can understand the meaning of metadata records and fields that we are
shipping to them (e.g. DwCA), and describing the properties and
connections between resources (e.g. in RDF). I think this may be what
Pete means when he says that we need two "kinds" of DwC. The first use
is pretty much "ready to go". I don't think we know how close we are
to being able to use the existing DwC terms for describing
relationships until we have some more conversation of the sort we are
having now as well as conversation about which existing terms can be
used (perhaps in ways that weren't originally intended) to express the
relationships needed in RDF. I feel like I hear Pete saying that we
are still lacking a lot of the predicates we need, while I (and maybe
Cam) feel that we are most of the way there.
For clarification, when I argue that the class Individual should exist
in Darwin Core, it's not because I'm insisting that all users must have
an Individual table in their database. What I want is for people to be
ABLE to have an Individual table in their database (if they need it)
and have others understand what it means and how the entities described
in that table are related conceptually to other things like
Identifications and Occurrences. If all of the records in their
database have only one occurrence per individual, they don't "need" to
keep track of Individuals.
Yes, they could be collapsed to Occurrence -- in the same way that
properties of "Individual" are currently collapsed to Occurrence. But after
pleading your case to normalize "Individual" as its own separate class, I'm
kinda surprised to see you arguing in favor of collapsing the Event class
into Occurrence.
I confess my crime. In penance, I freely confess that the Event class
exists and that people should use it in their databases if it helps
them cluster Occurrences. In addition, I confess that I should
probably acknowledge the existence of Events and their relationship to
other Darwin Core classes when I write RDF. Guilty as charged!
Yes....sort of. Doesn't help for localities defined as bounding boxes,
polygons or lines (e.g., transects, as we often have for data from plankton
tows) -- but it certainly does serve as a handy "natural key" of sorts for
point localities. The problem is that so much of our existing content is not
reliably georeferenced yet. Thus, we need all those other terms to
accommodate various location descriptors, which will eventually allow us to
after-the-fact georeference the localities. Also, many after-the-fact
georeferenced points are interpretations. Keeping the descriptors around
can allow someone else to come up with a better/more precise
lat/long/uncertainty interpretation. Also, errors are abundant
(particularly in failing to represent decimal degrees with negatives).
Having the descriptors allows us to catch such errors much more quickly.
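
As a rough illustration of how retained descriptors can expose sign
errors, the sketch below compares a coordinate pair against a bounding box
implied by a text descriptor. The function name, the box, and the sample
coordinates are all assumptions made up for this example; the box is only
a coarse approximation of the contiguous United States.

```python
def diagnose_sign_error(lat, lon, box):
    """Check a point against box = (min_lat, max_lat, min_lon, max_lon).

    If the point falls outside the box but a sign flip would put it
    inside, report which sign is probably wrong.
    """
    def inside(la, lo):
        return box[0] <= la <= box[1] and box[2] <= lo <= box[3]

    if inside(lat, lon):
        return "ok"
    if inside(lat, -lon):
        return "longitude sign likely wrong"
    if inside(-lat, lon):
        return "latitude sign likely wrong"
    if inside(-lat, -lon):
        return "both signs likely wrong"
    return "outside expected area"


# Hypothetical, approximate box for a descriptor such as "United States".
ROUGH_US_BOX = (24.0, 50.0, -125.0, -66.0)

# Made-up record: the longitude was entered without its minus sign.
print(diagnose_sign_error(36.14, 86.80, ROUGH_US_BOX))
```

A check like this is only possible when the original descriptor is still
stored alongside the coordinates.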
Here I will confess the crime of ignorance. I'm still trying to
understand the need for and uses of a number of the Darwin Core
dcterms:Location class terms. I guess I need to spend some more time
reading the Guide to Best Practices for Georeferencing (I was going to
include the link here, but the link at
http://www.biogeomancer.org/library.html is broken).
Sure -- but we can't really ignore the massive numbers of non-georeferenced
data points that already exist. And even when they are georeferenced, we'll
still want to keep the original location descriptors.
Agree on this and all other points I deleted here.
OK, I see where you're coming from. But I guess my response is that we're a
LONG way from:
- A world where most existing Occurrence content is well georeferenced;
[text omitted for brevity]
So I still see the advantage of keeping Location as a separate class,
maintaining those extra "human-friendly" descriptor terms, and
conceptualizing 1:M Location:Events.
I'm totally convinced. Location and Event belong where you had them in
the diagram.
[more text listed below]
I also still find it interesting that you are quite content to flatten
Locality and Event class terms into Occurrence, while simultaneously wanting
to normalize Individual as a new class, separate from Occurrence. I'm not
saying that establishing an Individual class is a bad idea -- in principle I
support it. But I am curious as to why you think it important to push for
normalization on the Individual side of an Occurrence, but push for
de-normalization on the Event side of an Occurrence.
Again, I plead guilty as charged. The left side of the chart should be
allowed to not be flat. However, I maintain my stance that many
(most?) new Occurrence records will in the future have their own
individual latitude/longitude/elevation or depth/time. Those "atomized
event points" could easily be aggregated by software into larger scale
events and locations by some simple rules about the timespan for events
and geographic bounds for locations. Those larger scale events and
locations could be used to ask the kinds of questions you describe
below. As long as you aren't requiring me to do this kind of
aggregation BEFORE I create my records (and hence requiring me to lose
the data that my GPS has collected for me automatically), I'm happy to
allow others to define events and locations on larger scales and with
1:M relationships.
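
The aggregation described above could be sketched roughly as follows. This
is one possible rule set, not an agreed method: the class name, the
thresholds (one hour, 0.01 degrees), and the sample records are all
assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class OccurrencePoint:
    occurrence_id: str
    event_time: datetime
    latitude: float
    longitude: float


def aggregate_events(points, max_gap=timedelta(hours=1), max_degrees=0.01):
    """Group per-Occurrence points into larger-scale events.

    A point joins the current event if it follows the previous point by
    no more than max_gap and lies within max_degrees (latitude and
    longitude) of the first point in that event. Otherwise it starts a
    new event.
    """
    events = []
    for p in sorted(points, key=lambda pt: pt.event_time):
        if events:
            current = events[-1]
            first, last = current[0], current[-1]
            close_in_time = p.event_time - last.event_time <= max_gap
            close_in_space = (
                abs(p.latitude - first.latitude) <= max_degrees
                and abs(p.longitude - first.longitude) <= max_degrees
            )
            if close_in_time and close_in_space:
                current.append(p)
                continue
        events.append([p])
    return events


# Made-up records: two images seconds apart, a third the next day.
points = [
    OccurrencePoint("a", datetime(2010, 5, 1, 9, 0, 5), 36.1440, -86.8030),
    OccurrencePoint("b", datetime(2010, 5, 1, 9, 0, 42), 36.1441, -86.8031),
    OccurrencePoint("c", datetime(2010, 5, 2, 14, 30, 0), 36.1440, -86.8030),
]
events = aggregate_events(points)
print([[p.occurrence_id for p in e] for e in events])
```

The per-second times and per-point coordinates stay on each record, so the
grouping can be redone later with different thresholds without losing any
of the original data.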
That may be true. But I've spent more than two decades taking
over-simplified, flattened database structures and transforming them into
more normalized structures, because the flattened structures consistently
limited my ability to ask novel questions of the database, and also
encouraged inconsistency of data entry practices. The small price I pay for
increased normalization has yielded ample return in more "powerful" datasets
(i.e., more flexibility in how I can frame and/or analyze the data).
As I said in my earlier email, I'm encouraged by the consistency in the
way I hear people talking about the relationships among the DwC
classes. I was a bit afraid when we started this thread that I would
turn out to have some kind of fringe ideas. Now what I'm seeing is a
lot of variation on how people choose to "collapse" the basic model to
meet their individual needs, but not a lot of disagreement about what
the basic model is.
Thanks for your great feedback and for challenging my statements. I
need that!
Steve
--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.
delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu