Background for the Individual class proposal. 1. Denormalization of models and correspondence to the ASC model
This is part 1 of three messages that attempt to summarize the issues that we have been discussing over the last month and to suggest a solution and a way forward. If you zone out when you get emails longer than three lines, please erase the messages and go on with your life. Unfortunately this is a complicated topic and I'm trying to lay out the issues in the simplest and most straightforward way that I can.

The first email (this one) describes how a fully normalized model of Darwin Core can arise from modifying the ASC model to meet articulated needs of the Darwin Core constituency. The second email will describe why we need to come to a consensus on this and the criteria that I think should be considered before a decision is reached. The third email discusses the issue that Rich has raised as to whether the proposed Individual class should have a rather narrow scope (as I have advocated) or if it should be broadened to include other functions. I have separated this material into three emails because they are really separate but related issues and may each spawn threads relating to the particular issue.

----------------------------------------------------------------------

To try to get a better understanding of the issues we have been discussing, I went back to the Association of Systematics Collections (ASC) information that Stan posted at http://wiki.tdwg.org/twiki/bin/view/TAG/HistoricalDocuments - in particular, the chart http://wiki.tdwg.org/twiki/bin/viewfile/TAG/HistoricalDocuments?rev=1;filena...
I have cut out a section of that chart that will fit on one screen and have created several images that have various models involving Darwin Core classes pasted at the top. Each subsequent Darwin Core class model is more normalized than the previous one. Below each model I show how that denormalization maps to the ASC model.
The first diagram is the ASC model itself: http://bioimages.vanderbilt.edu/pages/asc-model.jpg

There are several differences in names between ASC and DwC. dwc:Location corresponds to Locality in ASC, dwc:Event corresponds to Collecting Event in ASC, dwc:Identification corresponds to Determination in ASC, and Collecting Unit in ASC corresponds to a subset of what I have been calling the "token" (evidence) that is limited to organisms, their pieces, and their conglomerations. One may quibble about exact correspondence, but I think that fundamentally those things are congruent.

In the ASC model, the lines with crow's feet correspond to one-to-many relationships, with the foot at the "many" end. In my diagram a triangle does the same thing, with the point of the triangle representing the "one" end. As you can see, the subset of the ASC model shown here can be summarized in simplified form using DwC classes (excluding for the time being the parts of the model that fall into the DwC Taxon class).

The ASC model reflects the "museum" perspective: in many or most cases the whole organism is collected, or if only part of the organism is collected (e.g. tree branch) the organism is rarely re-visited for additional collections. So this model is denormalized (flattened) to the extent that it doesn't allow for multiple types of tokens per organism and resampling of the organism over time.
The second diagram represents Darwin Core at the time it became a standard in 2009. http://bioimages.vanderbilt.edu/pages/darwin-core-model.jpg The difference from the previous diagram is the creation of the Occurrence class. This class recognizes the needs of the observation community because it allows one to connect Events to Determinations directly without forcing them to be associated with a physical object (token). This modification was beneficial because terms describing the act of documenting the presence of a taxon during an Event are shared between observations and specimen collection. This model presupposes that there is no more than one token per Occurrence. dwc:basisOfRecord is used to describe the nature of that one token. Terms for handling tokens other than specimens are not well developed.
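The 2009 model described above could be sketched roughly as Python dataclasses. This is only a minimal sketch under assumptions: the class and field names are hypothetical and not taken from the standard, the direction of each link is one reading of the diagram, and the records and basisOfRecord values are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Location:
    location_id: str


@dataclass
class Event:
    event_id: str
    location: Location   # many Events can share one Location
    event_date: str      # a single ISO 8601 string


@dataclass
class Identification:
    scientific_name: str


@dataclass
class Occurrence:
    occurrence_id: str
    event: Event           # many Occurrences can share one Event
    basis_of_record: str   # describes the nature of the one token
    identifications: List[Identification] = field(default_factory=list)


# Hypothetical example records.
pond = Location("loc-1")
survey = Event("evt-1", pond, "2009-06-15")

# An observation: no physical object (token) is required.
sighting = Occurrence("occ-1", survey, "HumanObservation",
                      [Identification("Ardea herodias")])

# A collected specimen from the same Event uses the same class.
specimen = Occurrence("occ-2", survey, "PreservedSpecimen",
                      [Identification("Rana clamitans")])
```

Note that basis_of_record is a single string here, which is where the assumption of no more than one token per Occurrence shows up.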
The third diagram is a slight modification of the second and is what I've been calling the "explicit token" model: http://bioimages.vanderbilt.edu/pages/dwc-explicit-token-model.jpg The only difference between it and the previous model is that there is now recognition that the token is a separate thing from the Occurrence. Types of tokens other than specimens (such as images and sounds) are recognized explicitly as means of documenting Occurrences. The lines connecting Occurrence to tokens have "crow's feet" on the token side, allowing that there may be one to many tokens that act as evidence for a single Occurrence. When I complain that basisOfRecord "doesn't work", it is with this model in mind. In this model, there is not one single "basis" (token) for a record - under this model there would need to be the possibility to have multiple basisOfRecord values for an Occurrence, which I don't really think is supported currently in DwC.
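To make the basisOfRecord problem concrete, here is a minimal sketch of the explicit token model, assuming hypothetical class names, a hypothetical `kind` label for each token, and a helper function `bases_of_record` that is not part of Darwin Core.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    token_id: str
    kind: str   # hypothetical labels, e.g. "specimen", "image", "sound"


@dataclass
class Occurrence:
    occurrence_id: str
    # One Occurrence can be documented by many tokens.
    tokens: List[Token] = field(default_factory=list)


def bases_of_record(occurrence: Occurrence) -> List[str]:
    """Return the distinct kinds of evidence documenting an Occurrence."""
    return sorted({token.kind for token in occurrence.tokens})


# A hypothetical bird documented by two photos and a recording.
bird = Occurrence("occ-1", [
    Token("tok-1", "image"),
    Token("tok-2", "sound"),
    Token("tok-3", "image"),
])

print(bases_of_record(bird))   # prints ['image', 'sound']
```

Once tokens are separate records, the "basis" of an Occurrence is naturally a list rather than a single value.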
The fourth diagram, which I call the "full model", adds one more component to the explicit token model: http://bioimages.vanderbilt.edu/pages/full-model.jpg This model is what I consider to be the fully normalized version of Darwin Core (excluding the Taxon parts). This model introduces the Individual class exactly as I have defined it in my proposed term addition: as a node that connects Occurrences to Identifications (a.k.a. Determinations). This is not really an addition to the existing Darwin Core standard because the term individualID already exists in the Occurrence class. My proposal simply gives a name to the thing that is the object of individualID - in fact my original justification for the term addition says exactly that. The fundamental purpose that Individual serves is to accommodate the "crow's foot" on the Occurrence side of the line that connects Individual to Occurrence, i.e. to allow re-sampling over time and space. That is all. The line going to Identification/Determination has to be connected somewhere and it makes sense to connect it to Individual rather than Occurrence since the resampled entity is not going to change its identity from one sampling to another.
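The role of Individual as a connecting node could be sketched as follows. This is a minimal illustration under assumptions: the class layout, the field names, the helper `names_for`, and the example tree are all hypothetical, and the link from Occurrence to Event is reduced to a date string to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Identification:
    scientific_name: str


@dataclass
class Occurrence:
    occurrence_id: str
    event_date: str   # stands in for the link to an Event


@dataclass
class Individual:
    individual_id: str   # the thing that individualID refers to
    # One Individual, many Occurrences: re-sampling over time and space.
    occurrences: List[Occurrence] = field(default_factory=list)
    # Identifications attach to the Individual, not to each Occurrence.
    identifications: List[Identification] = field(default_factory=list)


def names_for(individual: Individual, occurrence_id: str) -> List[str]:
    """Look up the names for an Occurrence through its Individual."""
    if any(o.occurrence_id == occurrence_id for o in individual.occurrences):
        return [i.scientific_name for i in individual.identifications]
    return []


# A hypothetical tree that is documented on two different dates.
tree = Individual("ind-1")
tree.identifications.append(Identification("Quercus alba"))
tree.occurrences.append(Occurrence("occ-1", "2009-05-01"))
tree.occurrences.append(Occurrence("occ-2", "2010-09-12"))
```

Both Occurrences resolve to the same Identification because it is recorded once, on the Individual.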
I have done one more thing in this model to make it more normalized. It's a spin-off from Paul Murray's post http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001771.html which got me thinking: if we were to treat time in the same way we are treating Locations and other entities, a fully normalized model would have a class for Time. Time can have varying degrees of specificity (just like Location and Taxon), and there is a one-to-many relationship between Time and Event (i.e. there can be many Events going on at different Locations at a given Time, just like there can be many Events at different Times at a given Location). We almost always denormalize the Time class out of our models because in most cases it can be represented as a single ISO 8601 string. But as Paul points out, Time can be a complicated thing that one might want to model in a more sophisticated way than a single string. I'm not suggesting that we should do this in Darwin Core if nobody needs it, but the point is that it COULD be done. There probably already is a class for Time defined by somebody else (does anyone know about this?).
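As a minimal sketch of what a separate Time class might look like, assuming a hypothetical design in which a Time is just a date range (the class, its fields, and the `as_iso_8601` method are illustrative only and do not come from any existing vocabulary):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Time:
    earliest: str   # ISO 8601 date
    latest: str     # ISO 8601 date; equal to earliest for a single day

    def as_iso_8601(self) -> str:
        """Collapse back to the single string used in the flattened form."""
        if self.earliest == self.latest:
            return self.earliest
        # ISO 8601 writes an interval as start/end.
        return f"{self.earliest}/{self.latest}"


@dataclass
class Event:
    event_id: str
    location_id: str
    time: Time   # many Events can point to one Time


# Two hypothetical Events at different Locations sharing one Time.
october = Time("2010-10-01", "2010-10-31")
event_a = Event("evt-1", "loc-1", october)
event_b = Event("evt-2", "loc-2", october)
```

The `as_iso_8601` method shows the usual denormalization: when a Time is simple enough, it collapses to one string stored directly on the Event.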
In summary, the fully normalized model that I have presented seems to be consistent with almost all of the discussion that has taken place on the list recently. Although the ASC model is "more normalized" than this in some parts, I haven't heard many of the participants in the discussion advocating for a general Darwin Core model that is more complex than what I've presented in the last link. Obviously, individuals (humans) could add many more classes of things in their own personal models, but I think the classes in this last model can accommodate nearly all of the resources people have said that they want to describe using Darwin Core.
End of part 1
Steve Baskauf