[tdwg-content] Background for the Individual class proposal. 1. Denormalization of models and correspondence to the ASC model

Sat Nov 13 17:22:02 CET 2010

This is part 1 of three messages that attempt to summarize the issues 
that we have been discussing over the last month and to suggest a 
solution and a way forward.  If you zone out when you get emails longer 
than three lines, please erase the messages and go on with your life.  
Unfortunately, this is a complicated topic and I'm trying to lay out the 
issues in the simplest and most straightforward way that I can.  The 
first email (this one) describes how a fully normalized model of Darwin 
Core can arise from modifying the ASC model to meet articulated needs of 
the Darwin Core constituency.  The second email will describe why we 
need to come to a consensus on this and the criteria that I think should 
be considered before a decision is reached.  The third email will discuss 
the issue that Rich has raised as to whether the proposed Individual 
class should have a rather narrow scope (as I have advocated) or if it 
should be broadened to include other functions.  I have separated this 
material into three emails because they are really separate but related 
issues and may each spawn threads relating to the particular issue. 
----------------------------------------------------------------------
To try to get a better understanding of the issues we have been 
discussing, I went back to the Association of Systematics Collections 
(ASC) information that Stan posted at 
http://wiki.tdwg.org/twiki/bin/view/TAG/HistoricalDocuments - in 
particular, the chart
http://wiki.tdwg.org/twiki/bin/viewfile/TAG/HistoricalDocuments?rev=1;filename=Ascfig2.pdf

I have cut out a section of that chart that will fit on one screen and 
have created several images that have various models involving Darwin 
Core classes pasted at the top.  Each subsequent Darwin Core class model 
is more normalized than the previous one.  Below each model I show how 
it maps to the ASC model. 

The first diagram is the ASC model itself:
http://bioimages.vanderbilt.edu/pages/asc-model.jpg
There are several differences in names between ASC and DwC.  
dwc:Location corresponds to Locality in ASC, dwc:Event corresponds to 
Collecting Event in ASC, dwc:Identification corresponds to Determination 
in ASC, and Collecting Unit in ASC corresponds to the subset of what I 
have been calling the "token" (evidence) that is limited to organisms, 
their pieces, and their conglomerations.  One may quibble about exact 
correspondence, but I think that fundamentally those things are 
congruent.  In the ASC model, the lines with crow's feet correspond to 
one-to-many relationships, with the foot at the "many" end.  In my 
diagram a triangle does the same thing with the point of the triangle 
representing the "one" end.  As you can see, the subset of the ASC model 
shown here can be summarized in simplified form using DwC classes 
(excluding for the time being the parts of the model that fall into the 
DwC Taxon class).  The ASC model reflects the "museum" perspective: in 
many or most cases the whole organism is collected, or if only part of 
the organism is collected (e.g. tree branch) the organism is rarely 
re-visited for additional collections.  So this model is denormalized 
(flattened) to the extent that it doesn't allow for multiple types of 
tokens per organism and resampling of the organism over time.

The second diagram represents Darwin Core at the time it became a 
standard in 2009.
http://bioimages.vanderbilt.edu/pages/darwin-core-model.jpg
The difference from the previous diagram is the creation of the 
Occurrence class.  This class recognizes the needs of the observation 
community because it allows one to connect Events to Determinations 
directly without forcing them to be associated with a physical object 
(token).  This modification was beneficial because terms describing the 
act of documenting the presence of a taxon during an Event are shared 
between observations and specimen collection.  This model presupposes 
that there is no more than one token per Occurrence. dwc:basisOfRecord 
is used to describe the nature of that one token.  Terms for handling 
tokens other than specimens are not well developed.
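
A minimal Python sketch of this model may help.  The class and field
names are illustrative assumptions (only the basisOfRecord idea comes
from Darwin Core), and the records are made up.  The point is that an
Occurrence links an Event to Identifications whether or not a physical
object exists, and that it carries exactly one basis_of_record value.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Location:
    locality: str


@dataclass
class Event:
    location: Location
    event_date: str  # ISO 8601 string


@dataclass
class Identification:
    scientific_name: str


@dataclass
class Occurrence:
    event: Event
    # Exactly one value describes the nature of the (at most one) token.
    basis_of_record: str
    identifications: List[Identification] = field(default_factory=list)
    # Hypothetical field: set only when a physical object backs the record.
    catalog_number: Optional[str] = None


site = Location("Hypothetical woodland plot")
visit = Event(site, "2010-06-01")

# A collected specimen and a plain observation share the same Event.
specimen = Occurrence(visit, "PreservedSpecimen",
                      [Identification("Acer rubrum")], catalog_number="X-001")
sighting = Occurrence(visit, "HumanObservation",
                      [Identification("Cardinalis cardinalis")])
```

Here the observation needs no token at all, which is what the
Occurrence class buys the observation community.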

The third diagram is a slight modification of the second and is what 
I've been calling the "explicit token" model:
http://bioimages.vanderbilt.edu/pages/dwc-explicit-token-model.jpg
The only difference between it and the previous model is that there is 
now recognition that the token is a separate thing from the Occurrence.  
Types of tokens other than specimens (such as images and sounds) are 
recognized explicitly as means of documenting Occurrences.  The lines 
connecting Occurrence to tokens have "crow's feet" on the token side, 
allowing that there may be one to many tokens that act as evidence for a 
single Occurrence.  When I complain that basisOfRecord "doesn't work", 
it is with this model in mind.  In this model, there is no single 
"basis" (token) for a record; an Occurrence would need to be able to 
carry multiple basisOfRecord values, which I don't think DwC currently 
supports. 
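
The basisOfRecord problem can be sketched in a few lines of Python.
This is only an illustration under assumed names: Token, its fields,
and the identifiers below are hypothetical and not Darwin Core terms.
Once an Occurrence holds a list of tokens, asking for "the" basis of
the record returns a list rather than a single value.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    kind: str        # e.g. "PreservedSpecimen", "StillImage", "Sound"
    identifier: str  # hypothetical local identifier for the evidence


@dataclass
class Occurrence:
    occurrence_id: str
    # The tokens that act as evidence for this Occurrence.
    tokens: List[Token] = field(default_factory=list)


def basis_of_record_values(occurrence: Occurrence) -> List[str]:
    """Return the distinct token kinds, sorted, that document an Occurrence."""
    return sorted({token.kind for token in occurrence.tokens})


occ = Occurrence("occ-1", [
    Token("PreservedSpecimen", "sheet-17"),
    Token("StillImage", "img-0042"),
    Token("StillImage", "img-0043"),
])
print(basis_of_record_values(occ))  # ['PreservedSpecimen', 'StillImage']
```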

The fourth diagram, which I call the "full model", adds one more 
component to the explicit token model:
http://bioimages.vanderbilt.edu/pages/full-model.jpg
This model is what I consider to be the fully normalized version of 
Darwin Core (excluding the Taxon parts).  This model introduces the 
Individual class exactly as I have defined it in my proposed term 
addition: as a node that connects Occurrences to Identifications (a.k.a. 
Determinations).  This is not really an addition to the existing Darwin 
Core standard because the term individualID already exists in the 
Occurrence class.  My proposal simply gives a name to the thing that is 
the object of individualID - in fact my original justification for the 
term addition says exactly that.  The fundamental purpose that 
Individual serves is to accommodate the "crow's foot" on the Occurrence 
side of the line that connects Individual to Occurrence, i.e. to allow 
re-sampling over time and space.  That is all. The line going to 
Identification/Determination has to be connected somewhere and it makes 
sense to connect it to Individual rather than Occurrence since the 
resampled entity is not going to change its identity from one sampling 
to another. 
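
As a rough sketch of the role Individual plays here, consider the
Python below.  The class layout, field names, and sample records are
assumptions made for illustration; only the idea of an individualID
shared by several Occurrences comes from Darwin Core.  An
Identification is attached once to the Individual and is reached from
every Occurrence of that Individual.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Identification:
    scientific_name: str
    identified_by: str


@dataclass
class Occurrence:
    occurrence_id: str
    event_date: str  # ISO 8601 string


@dataclass
class Individual:
    individual_id: str
    # One Individual, many Occurrences: re-sampling over time and space.
    occurrences: List[Occurrence] = field(default_factory=list)
    # Identifications hang off the Individual, not off each Occurrence.
    identifications: List[Identification] = field(default_factory=list)


tree = Individual("tree-1")
tree.occurrences.append(Occurrence("occ-1", "2009-05-02"))
tree.occurrences.append(Occurrence("occ-2", "2010-10-15"))
tree.identifications.append(
    Identification("Quercus alba", "hypothetical determiner"))

# Every Occurrence reaches the same Identification through its Individual.
names_by_occurrence = {
    occ.occurrence_id: [i.scientific_name for i in tree.identifications]
    for occ in tree.occurrences
}
```

The identification is stored once even though the tree was sampled
twice, which is the practical payoff of connecting the line to
Individual rather than to Occurrence.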

I have done one more thing in this model to make it more normalized.  
It's a spin-off from Paul Murray's post 
http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001771.html
which got me thinking: if we were to treat time in the same way that 
we are treating Locations and other entities, a fully normalized model 
would have a class for Time.  Time can have varying degrees of 
specificity (just like Location and Taxon), and there is a one-to-many 
relationship between Time and Event (i.e. there can be many Events going 
on at different Locations at a given Time, just like there can be many 
Events at different Times at a given Location).  We almost always 
denormalize the Time class out of our models because in most cases it 
can be represented as a single ISO 8601 string.  But as Paul points out, 
Time can be a complicated thing that one might want to model in a more 
sophisticated way than a single string.  I'm not suggesting that we 
should do this in Darwin Core if nobody needs it, but the point is that 
it COULD be done.  There probably already is a class for Time defined by 
somebody else (does anyone know about this?).
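
To show what a separate Time class might look like, here is a small
Python sketch.  Everything in it is an assumption for illustration:
TimeSpan, span_from_iso, and the sample Events are hypothetical names,
and the parser handles only the calendar-date forms YYYY, YYYY-MM, and
YYYY-MM-DD rather than all of ISO 8601.  It shows two things: a date
string of reduced precision stands for a span of days, and many Events
can point at one Time.

```python
import calendar
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TimeSpan:
    start: date  # first day covered
    end: date    # last day covered (inclusive)


def span_from_iso(text: str) -> TimeSpan:
    """Turn 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD' into the span of days it covers."""
    parts = [int(p) for p in text.split("-")]
    if len(parts) == 1:
        (year,) = parts
        return TimeSpan(date(year, 1, 1), date(year, 12, 31))
    if len(parts) == 2:
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]
        return TimeSpan(date(year, month, 1), date(year, month, last_day))
    if len(parts) == 3:
        day = date(*parts)
        return TimeSpan(day, day)
    raise ValueError(f"unsupported date form: {text!r}")


@dataclass
class Event:
    locality: str
    time: TimeSpan


# One Time, many Events at different Locations.
november = span_from_iso("2010-11")
events = [Event("Site A", november), Event("Site B", november)]
```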

In summary, the fully normalized model that I have presented seems to be 
consistent with almost all of the discussion that has taken place on the 
list recently.  Although the ASC model is "more normalized" than this in 
some parts, I haven't heard many of the participants in the discussion 
advocating for a general Darwin Core model that is more complex than 
what I've presented in the last link.  Obviously, individuals (humans) 
could add many more classes of things in their own personal models, but 
I think the classes in this last model can accommodate nearly all of the 
resources people have said that they want to describe using Darwin Core.

End of part 1

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu