Specific responses inline
Hmmm...not sure I follow. Are you saying that a new Event record (ID) should be created for every Occurrence record, and that a new Location record (ID) should be created for every Event record? If so, then it's going to be very difficult to convince me of this. I don't think that our database is unusual in having many (sometimes hundreds) of Occurrences at the same Event (e.g., a large fish poison station), and many (again, sometimes hundreds) of Events at the same Location.
I think I answered this (in a sense) in the other email that I sent a little while ago. In principle, if every record of an Occurrence has a time that is discernibly different from the times of other Occurrences (i.e. because the time was recorded to the nearest second by a machine), then yes, I consider it to be a different event. To force people to lump such Occurrences together into one Event (say comprising a day or some other time interval larger than a second) is essentially throwing away data that we already have about the Occurrences. I have already found it useful to know exactly what time of day a flower was opened rather than closed or the order in which I took several images.
As for creating a new Event record ID for all of those one-second events, I would submit the same solution that I gave in the previous post. In my database, I don't have separate Event records for the times when I took live plant images (which I am rightly or wrongly calling Occurrences). I just have a "flat" table where each image has an eventTime (the time recorded automatically in the EXIF data for the image). If I needed an RDF structure that was compatible with the kind of structure used by people who have many Occurrences associated with a single event (e.g. your fish kill), then for the image http://bioimages.vanderbilt.edu/baskauf/57755 I would automatically create an event identifier for its creation as http://bioimages.vanderbilt.edu/baskauf/57755#event . There! I have a perfectly valid GUID that represents the event without any additional burden on my record-keeping system, since I don't have to keep any additional records about that GUID beyond the ones I'm already keeping for the image. I'm not doing this now, but I suppose I should if there is a consensus that things are related to each other in the way shown in your diagram. The same approach could be taken for Location. People who care about having unrelated identifiers for their Events and Locations (because of one-to-many relationships) would be welcome to do so, but I wouldn't need to in my internal database.
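To make the identifier idea concrete, here is a minimal Python sketch of deriving the event identifier from the image identifier. The function names are hypothetical, and the "#location" fragment is only my assumption by analogy with "#event"; nothing like this is running in my system now.

```python
def derived_iri(resource_iri: str, fragment: str) -> str:
    """Build an identifier by attaching a fragment to an existing IRI.

    Any fragment already present on the input is dropped first, so the
    result always has exactly one '#'.
    """
    base = resource_iri.split("#", 1)[0]
    return f"{base}#{fragment}"


def event_iri(image_iri: str) -> str:
    # "#event" is the fragment used in the example in the text above.
    return derived_iri(image_iri, "event")


def location_iri(image_iri: str) -> str:
    # "#location" is an assumed fragment, chosen by analogy with "#event".
    return derived_iri(image_iri, "location")


image = "http://bioimages.vanderbilt.edu/baskauf/57755"
print(event_iri(image))
print(location_iri(image))
```

Because the identifier is computed from the image identifier, the flat table needs no extra column or record to support it.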
Which database? Are we talking about DwCA? If so, I understand the rationale for flattening out content to make it easier to batch-package records and ship them around. But if we're talking about actual database implementations at the content provider end, I think I'm not alone in wanting to stick with a more normalized approach. Besides, what's the point of even defining the different DwC classes, each with their own ID, if we're just going to flatten them all out anyway (as per old DwC)?
Well, yes I had DwCA (Darwin Core Archives) specifically in mind. But it could be any other "shipping format" or local database format. I think we need to draw the distinction between having a way that people can understand the meaning of metadata records and fields that we are shipping to them (e.g. DwCA), and describing the properties and connections between resources (e.g. in RDF). I think this may be what Pete means when he says that we need two "kinds" of DwC. The first use is pretty much "ready to go". I don't think we know how close we are to being able to use the existing DwC terms for describing relationships until we have some more conversation of the sort we are having now as well as conversation about which existing terms can be used (perhaps in ways that weren't originally intended) to express the relationships needed in RDF. I feel like I hear Pete saying that we are still lacking a lot of the predicates we need, while I (and maybe Cam) feel that we are most of the way there.
For clarification, when I argue that the class Individual should exist in Darwin Core, it's not because I'm insisting that all users must have an Individual table in their database. What I want is for people to be ABLE to have an Individual table in their database (if they need it) and have others understand what it means and how the entities described in that table are related conceptually to other things like Identifications and Occurrences. If all of the records in their database have only one occurrence per individual, they don't "need" to keep track of Individuals.
Yes, they could be collapsed to Occurrence -- in the same way that properties of "Individual" are currently collapsed to Occurrence. But after pleading your case to normalize "Individual" as its own separate class, I'm kinda surprised to see you arguing in favor of collapsing the Event class into Occurrence.
I confess my crime. In penance, I freely confess that the Event class exists and that people should use it in their databases if it helps them cluster Occurrences. In addition, I confess that I should probably acknowledge the existence of Events and their relationship to other Darwin Core classes when I write RDF. Guilty as charged!
Yes....sort of. Doesn't help for localities defined as bounding boxes, polygons or lines (e.g., transects, as we often have for data from plankton tows) -- but it certainly does serve as a handy "natural key" of sorts for point localities. The problem is that so much of our existing content is not reliably georeferenced yet. Thus, we need all those other terms to accommodate various location descriptors, which will eventually allow us to after-the-fact georeference the localities. Also, many after-the-fact georeferenced points are interpretations. Keeping the descriptors around can allow someone else to come up with a better/more precise lat/long/uncertainty interpretation. Also, errors are abundant (particularly in failing to represent decimal degrees with negatives). Having the descriptors allows us to catch such errors much more quickly.
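As a rough illustration of how retained descriptors can expose a missing negative sign, the Python sketch below compares a decimal coordinate against the hemisphere letter in its verbatim text. The function name and the sample strings are hypothetical, and the sketch assumes the verbatim value carries a standalone N, S, E or W.

```python
import re

# A hemisphere letter that is not part of a longer word such as "degrees".
_HEMISPHERE = re.compile(r"(?<![A-Za-z])([NSEWnsew])(?![A-Za-z])")


def sign_mismatch(decimal_value: float, verbatim: str) -> bool:
    """Return True when the sign of a decimal coordinate disagrees with the
    hemisphere letter found in the verbatim coordinate text.

    Returns False when there is no hemisphere letter or the value is zero,
    because there is then nothing to compare.
    """
    match = _HEMISPHERE.search(verbatim)
    if match is None or decimal_value == 0:
        return False
    expect_negative = match.group(1).upper() in ("S", "W")
    return (decimal_value < 0) != expect_negative


# A western longitude stored without its negative sign is flagged.
print(sign_mismatch(86.8, "86.8 W"))
# The correctly signed value is not.
print(sign_mismatch(-86.8, "86.8 W"))
```

A check like this is only possible if the original descriptor travels with the record alongside the interpreted decimal value.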
Here I will confess the crime of ignorance. I'm still trying to understand the need for and uses of a number of the Darwin Core dcterms:Location class terms. I guess I need to spend some more time reading the Guide to Best Practices for Georeferencing (I was going to include the link here, but the link at http://www.biogeomancer.org/library.html is broken).
Sure -- but we can't really ignore the massive numbers of non-georeferenced data points that already exist. And even when they are georeferenced, we'll still want to keep the original location descriptors.
Agree on this and all other points I deleted here.
OK, I see where you're coming from. But I guess my response is that we're a LONG way from:
- A world where most existing Occurrence content is well georeferenced;
[text omitted for brevity]
So I still see the advantage of keeping Location as a separate class, maintaining those extra "human-friendly" descriptor terms, and conceptualizing 1:M Location:Events.
I'm totally convinced. Location and Event belong where you had them in the diagram.
[more text listed below]
I also still find it interesting that you are quite content to flatten Locality and Event class terms into Occurrence, while simultaneously wanting to normalize Individual as a new class, separate from Occurrence. I'm not saying that establishing an Individual class is a bad idea -- in principle I support it. But I am curious as to why you think it important to push for normalization on the Individual side of an Occurrence, but push for de-normalization on the Event side of an Occurrence.
Again, I plead guilty as charged. The left side of the chart should be allowed to not be flat. However, I maintain my stance that many (most?) new Occurrence records will in the future have their own individual latitude/longitude/elevation or depth/time. Those "atomized event points" could easily be aggregated by software into larger scale events and locations by some simple rules about the timespan for events and geographic bounds for locations. Those larger scale events and locations could be used to ask the kinds of questions you describe below. As long as you aren't requiring me to do this kind of aggregation BEFORE I create my records (and hence requiring me to lose the data that my GPS has collected for me automatically), I'm happy to allow others to define events and locations on larger scales and with 1:M relationships.
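Here is a minimal Python sketch of the sort of after-the-fact aggregation I have in mind. Everything in it is an assumption for illustration: the Point class, the one-hour gap, and the 0.01-degree bound are hypothetical choices, not rules anyone has agreed on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Point:
    """One atomized Occurrence with its own time and coordinates."""
    occurrence_id: str
    time: datetime
    lat: float
    lon: float


def aggregate(points, max_gap=timedelta(hours=1), max_deg=0.01):
    """Group points into larger-scale events.

    A point joins the current group when it follows the previous point
    within max_gap and lies within max_deg (in both latitude and
    longitude) of the first point in the group. Otherwise it starts a
    new group. The original points are kept unchanged inside each group.
    """
    groups = []
    for p in sorted(points, key=lambda pt: pt.time):
        if groups:
            current = groups[-1]
            anchor, previous = current[0], current[-1]
            if (p.time - previous.time <= max_gap
                    and abs(p.lat - anchor.lat) <= max_deg
                    and abs(p.lon - anchor.lon) <= max_deg):
                current.append(p)
                continue
        groups.append([p])
    return groups


points = [
    Point("img-1", datetime(2010, 5, 1, 9, 0, 5), 36.1400, -86.8000),
    Point("img-2", datetime(2010, 5, 1, 9, 3, 40), 36.1402, -86.8001),
    Point("img-3", datetime(2010, 5, 2, 14, 0, 0), 36.1400, -86.8000),
]
for group in aggregate(points):
    print([p.occurrence_id for p in group])
```

The point of the design is that the grouping is computed from the fine-grained records, so different users can apply different thresholds to the same data without any of the original times or coordinates being discarded.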
That may be true. But I've spent more than two decades taking over-simplified, flattened database structures and transforming them into more normalized structures, because the flattened structures consistently limited my ability to ask novel questions of the database, and also encouraged inconsistency of data entry practices. The small price I pay for increased normalization has yielded ample return in more "powerful" datasets (i.e., more flexibility in how I can frame and/or analyze the data).
As I said in my earlier email, I'm encouraged by the consistency in the way I hear people talking about the relationships among the DwC classes. I was a bit afraid when we started this thread that I would turn out to have some kind of fringe ideas. Now what I'm seeing is a lot of variation on how people choose to "collapse" the basic model to meet their individual needs, but not a lot of disagreement about what the basic model is.
Thanks for your great feedback and for challenging my statements. I need that!
Steve