Thanks, Steve.
The diagram looks about right, except for the arrow heads as you noted. If there's a way you can replace the arrows with some sort of 1:Many line notation, that would be better. As you have it now, the arrowhead is on the "one" side; but I think it's more intuitive to have a "crows-foot" sort of symbol on the "many" side. I can send an example of what I mean. Not a big deal.
Yeah, I originally had it as eventDate, but then switched to eventTime. If Date can include time (and Time is assumed not to include date), then using eventDate is fine.
In principle, I agree with this diagram to the left of taxonNameUsage completely.
(I still need clarification about a few things on the right end.)
Yes, that's a failing on my part to get better documentation out there for GNUB (from which TaxonNameUsage, and all of the other "Usage" terms in DWC come). I hope to correct this by the end of the year.
My main reason for using determination as a term rather than identification is because it is not ambiguous to refer to the person doing the identifying as the determiner, whereas referring to that person as the "identifier" creates confusion between that person and the identifying string for resources (as in "persistent identifier").
Ah! Got it. Makes sense.
So if we agree that determination, annotation, and identification all mean the same thing (namely an instance of the dwc:Identification class), I'm happy to just use the term "identification". For the person doing it, I guess dwc:identifiedBy would be the best term although it's a bit awkward in regular speech so I may slip and still say "determiner".
Either way. Now that you put it in that context, I'm also happy to go with "Determination" and "Determiner".
But I would avoid "Annotation". That word has a much more general meaning, and we'll likely be hearing more and more about it (in the more general sense) as several big-ish projects are working on Annotations (in general) right now.
Although I agree in principle that there can be many occurrences at an Event and many events at a Location, I think there are two practical reasons why it may be better to assign separate eventDate and Location metadata to each Occurrence.
Hmmm...not sure I follow. Are you saying that a new Event record (ID) should be created for every Occurrence record, and that a new Location record (ID) should be created for every Event record? If so, then it's going to be very difficult to convince me of this. I don't think that our database is unusual in having many (sometimes hundreds) of Occurrences at the same Event (e.g., a large fish poison station), and many (again, sometimes hundreds) of Events at the same Location.
The first is that it makes the database structure simpler. As Markus has already noted, we really would prefer for the database to be as "flat" as possible.
Which database? Are we talking about DwCA? If so, I understand the rationale for flattening out content to make it easier to batch-package records and ship them around. But if we're talking about actual database implementations at the content provider end, I think I'm not alone in wanting to stick with a more normalized approach. Besides, what's the point of even defining the different DwC classes, each with their own ID, if we're just going to flatten them all out anyway (as per old DwC)?
When I look at the terms listed in the DwC term page (http://rs.tdwg.org/dwc/terms/index.htm) under Event, the most important one that I see which everyone should be providing is eventDate. The rest I would pretty much consider optional and as a shortcut Rich's diagram could be collapsed to make them direct properties of the Occurrence.
Yes, they could be collapsed to Occurrence -- in the same way that properties of "Individual" are currently collapsed to Occurrence. But after pleading your case to normalize "Individual" as its own separate class, I'm kinda surprised to see you arguing in favor of collapsing the Event class into Occurrence.
The second reason involves the practical matter of defining a Location. I will note that my thinking about this has been deeply influenced by a previous discussion on the topic from 2008-2009 summarized at http://www.sernec.org/files/summary-of-discussion.pdf on p.78-84. I don't think most people will want to wade through all of that text, so I'll just sum it up here. Somebody (I think it might have been Debbie Paul at Morphbank) suggested to me that we really have an intrinsically globally unique identifier for Location. It's the combination of dwc:decimalLatitude and dwc:decimalLongitude along with dwc:coordinateUncertaintyInMeters to establish precision and dwc:geodeticDatum to establish the reference system.
Yes....sort of. Doesn't help for localities defined as bounded boxes, polygons or lines (e.g., transects, as we often have for data from plankton tows) -- but it certainly does serve as a handy "natural key" of sorts for point localities. The problem is that so much of our existing content is not reliably georeferenced yet. Thus, we need all those other terms to accommodate various location descriptors, which will eventually allow us to after-the-fact georeference the localities. Also, many after-the-fact georeferenced points are interpretations. Keeping the descriptors around can allow someone else to come up with a better/more precise lat/long/uncertainty interpretation. Also, errors are abundant (particularly in failing to represent decimal degrees with negatives). Having the descriptors allows us to catch such errors much more quickly.
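To make the "natural key" idea concrete for point localities, here is a minimal Python sketch. The function name, the rounding to five decimal places, and the upper-casing of the datum are assumptions made for illustration only; nothing here is defined by Darwin Core beyond the four term names. Records that lack any of the four terms get no key at all, which is the case where the textual descriptors remain essential.

```python
from typing import NamedTuple, Optional


class PointLocationKey(NamedTuple):
    """Hypothetical natural key for a point-radius locality."""
    decimal_latitude: float
    decimal_longitude: float
    coordinate_uncertainty_in_meters: float
    geodetic_datum: str


REQUIRED_TERMS = (
    "decimalLatitude",
    "decimalLongitude",
    "coordinateUncertaintyInMeters",
    "geodeticDatum",
)


def point_location_key(record: dict, places: int = 5) -> Optional[PointLocationKey]:
    """Return a key for a georeferenced point record, or None if not georeferenced."""
    if any(record.get(term) in (None, "") for term in REQUIRED_TERMS):
        # Not (fully) georeferenced: only the textual descriptors identify the place.
        return None
    lat = float(record["decimalLatitude"])
    lon = float(record["decimalLongitude"])
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinates out of range: {lat}, {lon}")
    return PointLocationKey(
        round(lat, places),  # assumed precision; pick whatever suits the data
        round(lon, places),
        float(record["coordinateUncertaintyInMeters"]),
        str(record["geodeticDatum"]).strip().upper(),
    )


georeferenced = {
    "decimalLatitude": "21.306940",
    "decimalLongitude": "-157.858330",
    "coordinateUncertaintyInMeters": "30",
    "geodeticDatum": "wgs84",
}
descriptor_only = {"locality": "hypothetical reef site, no coordinates recorded"}

print(point_location_key(georeferenced))
print(point_location_key(descriptor_only))
```

Note that a longitude entered without its negative sign still falls in the valid range, so it would pass this check and simply produce a different key. A key like this cannot catch that kind of error on its own; the locality descriptors can.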
(If we like geo:lat and geo:long, then the reference system is implied and we are down to three terms to unambiguously define a Location and its uncertainty. For the benefits of humans, a Locality description is probably also beneficial. Also, elevation and depth might be provided, although at least in theory elevation could be calculated with a sufficiently good digital elevation model).
Well, that depends on the extent of coordinateUncertaintyInMeters. Original data often have fairly precise elevations, but imprecise lat/long. Thus, one can often narrow down the likely location more precisely than a circle described by a point/radius. In other words, Lat+Long+coordinateUncertaintyInMeters may describe a circle that includes a range of elevations, and thus an elevation cannot be reliably calculated. Moreover, a lot of modelling use-cases will want as precise of an elevation as possible.
I will grant that we don't have this information for a lot of old records, but based on the massive efforts to geolocate specimens, I would say it's pretty clear that this is what we would like to have if we could get it.
Sure -- but we can't really ignore the massive numbers of non-georeferenced data points that already exist. And even when they are georeferenced, we'll still want to keep the original location descriptors.
I certainly hope that there aren't any serious collectors, observers, and live organism photographers who aren't by this point trying to record this information as they establish new Occurrence records. If you look at all of the Location terms on the dwc list, most of the other terms are either concessions to the fact that we don't have what we want (e.g. the "verbatim" terms), things we could generate using a computer program if we were clever (like stateProvince, county, etc. - I know at least Mike Giddens has succeeded in doing this), ways of indicating how we got lat and long from old records (e.g. georeferenceSources), or methods to define larger scale Locations that aren't points (e.g. footprintWKT). I think it is safe to say that in the future (if not now already), many or most Events associated with Occurrences will have an associated button click (on a GPS receiver, camera phone, or GPS enabled camera) that will automatically generate dwc:eventDate, dwc:decimalLatitude, dwc:decimalLongitude (with geodeticDatum=WGS84) and maybe coordinateUncertaintyInMeters. Thus designing a system that requires that these time/space snapshots be grouped together into artificial "Locations" is really counterproductive when those data are now generated and can be associated with Occurrences automatically.
OK, I see where you're coming from. But I guess my response is that we're a LONG way from:
- A world where most existing Occurrence content is well georeferenced;
- A world where reliable services allow me to easily/automatically query records on, say, "Northwestern Hawaiian Islands", based on GIS polygon querying;
- A world where content holders are not going to want to share all the textual locality descriptors with their records.
Besides, we're going to always want to maintain the ability to define locations as bounded boxes, polygons, and lines; not just point-radius.
Moreover, there are many cases where a single location is re-used for many different events. The same tree monitored continuously over years. The same field station revisited year after year. The same transect repeated every month/season/year to monitor populations & presence/absence. LTER data. Christmas Count data. In other words, it's common to have multiple Events stacked at the same locality; and we won't want to always limit ourselves to defining that locality using point/radius.
So I still see the advantage of keeping Location as a separate class, maintaining those extra "human-friendly" descriptor terms, and conceptualizing 1:M Location:Events.
I don't know if Greg Riccardi of Morphbank is following this thread or not. If so he may want to comment on this issue based on practical experience at Morphbank. When the Morphbank system was set up, it required the creation of a separate Location record which was assigned a unique Morphbank identifier. Specimens were then linked to this Location. What ended up happening was that each Specimen having GPS metadata ended up being assigned to its own separate Location even if it was 20 meters from another specimen. In effect, each Occurrence record ended up having its own decimalLatitude/decimalLongitude record anyway. So the system ended up being more complicated than necessary.
Yes -- that's definitely a trend of modern collecting data with GPS....a tendency towards fewer and fewer instances of Events per instance of Location -- to the point where many records are now 1:1. However, that doesn't change the fact that an enormous volume of content currently is, and always will be, best structured as many Events per location, and many Occurrences per Event. I think DwC needs to accommodate that content. It's easy to store 1:1 records in a structure designed to accommodate 1:M. But it's a lot messier to generate 1:M content in a structure designed for 1:1.
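A small sketch of that last point, using Python's built-in sqlite3 module. The table layout, column names, and sample IDs are all hypothetical (this is not the GNUB schema or any DwC-mandated structure); the only thing it is meant to show is that one 1:M Location:Event:Occurrence structure holds both a heavily re-used station and a one-off GPS record without any special handling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE location (
    location_id TEXT PRIMARY KEY,
    locality    TEXT
);
CREATE TABLE event (
    event_id    TEXT PRIMARY KEY,
    location_id TEXT NOT NULL REFERENCES location(location_id),
    event_date  TEXT
);
CREATE TABLE occurrence (
    occurrence_id TEXT PRIMARY KEY,
    event_id      TEXT NOT NULL REFERENCES event(event_id)
);
""")

# One re-used station (many Events, many Occurrences) and one GPS "button click" (1:1:1).
conn.executemany("INSERT INTO location VALUES (?, ?)", [
    ("loc-reef", "hypothetical reef station"),
    ("loc-gps", "hypothetical single GPS fix"),
])
conn.executemany("INSERT INTO event VALUES (?, ?, ?)", [
    ("ev-1", "loc-reef", "2009-05-01"),
    ("ev-2", "loc-reef", "2010-05-01"),
    ("ev-3", "loc-gps", "2010-06-15"),
])
conn.executemany("INSERT INTO occurrence VALUES (?, ?)", [
    ("occ-1", "ev-1"), ("occ-2", "ev-1"), ("occ-3", "ev-1"),
    ("occ-4", "ev-2"),
    ("occ-5", "ev-3"),
])

# Count Events and Occurrences per Location.
counts = conn.execute("""
    SELECT l.location_id,
           COUNT(DISTINCT e.event_id),
           COUNT(o.occurrence_id)
    FROM location l
    JOIN event e ON e.location_id = l.location_id
    JOIN occurrence o ON o.event_id = e.event_id
    GROUP BY l.location_id
    ORDER BY l.location_id
""").fetchall()

for location_id, n_events, n_occurrences in counts:
    print(location_id, n_events, n_occurrences)
```

The 1:1 case is just a Location with one Event and one Occurrence. Going the other way, a structure with one embedded location per Occurrence would have to repeat the reef station's details on every row and rely on those copies staying consistent.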
As I said, I agree in principle with the left side of Rich's diagram. Taking the practical considerations I just mentioned into account, I would simplify the diagram as http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif
I think that's perfectly fine as a simplified structure for data exchange, such as DwCA and other applications that aggregate content and/or provide value-added or indexing services on aggregated content. But for original source data, I think it would be unwise to advocate such a simplified structure to anything but small ad-hoc project-based systems. It's incredibly easy to flatten out the more normalized form into the more flattened form; not so easy to go the other way around.
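As a rough illustration of why the normalized-to-flat direction is the easy one, here is a minimal Python sketch. The record contents and the `flatten` helper are hypothetical; the keys merely borrow DwC-style term names. Flattening is a pair of lookups per Occurrence.

```python
# Hypothetical normalized source data, keyed by ID.
locations = {
    "loc-1": {
        "locationID": "loc-1",
        "locality": "hypothetical reef station",
        "decimalLatitude": 21.3,
        "decimalLongitude": -157.8,
    },
}
events = {
    "ev-1": {"eventID": "ev-1", "locationID": "loc-1", "eventDate": "2010-06-01"},
    "ev-2": {"eventID": "ev-2", "locationID": "loc-1", "eventDate": "2010-07-01"},
}
occurrences = [
    {"occurrenceID": "occ-1", "eventID": "ev-1"},
    {"occurrenceID": "occ-2", "eventID": "ev-1"},
    {"occurrenceID": "occ-3", "eventID": "ev-2"},
]


def flatten(occurrences, events, locations):
    """Copy Event and Location properties onto each Occurrence row."""
    rows = []
    for occ in occurrences:
        event = events[occ["eventID"]]
        location = locations[event["locationID"]]
        # Later dicts win on shared keys; the shared ID keys carry identical values.
        rows.append({**location, **event, **occ})
    return rows


flat_rows = flatten(occurrences, events, locations)
for row in flat_rows:
    print(row["occurrenceID"], row["eventDate"], row["locality"])
```

Reversing this from flat rows that lack the IDs would mean deciding which repeated locality strings and dates really refer to the same Location and Event, and that is a judgment call rather than a lookup.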
I also still find it interesting that you are quite content to flatten Locality and Event class terms into Occurrence, while simultaneously wanting to normalize Individual as a new class, separate from Occurrence. I'm not saying that establishing an Individual class is a bad idea -- in principle I support it. But I am curious as to why you think it important to push for normalization on the Individual side of an Occurrence, but push for de-normalization on the Event side of an Occurrence.
Superficially, it looks more complicated, but I've gotten rid of several "one to many" relationships and enthroned Occurrence at its accustomed place in the center of the universe (or at least the center of the left side of the diagram). I don't have any philosophical objections to people structuring their data according to Rich's original diagram and the existing Darwin Core terms certainly make it possible to do so (well except for the Individual thing). However, I submit that many people will find it simpler (and easier to use tools like Darwin Core Archives) if they use the flatter structure that I have in the revised diagram.
That may be true. But I've spent more than two decades taking over-simplified, flattened database structures and transforming them into more normalized structures, because the flattened structures consistently limited my ability to ask novel questions of the database, and also encouraged inconsistency of data entry practices. The small price I pay for increased normalization has yielded ample return in more "powerful" datasets (i.e., more flexibility in how I can frame and/or analyze the data).
I will save my questions about the right side of Rich's diagram for later.
That would be best answered through documentation of GNUB, which I will be working on intensively over the next two months.
Aloha, Rich