[tdwg-content] Background for the Individual class proposal. 1. Denormalization of models and correspondence to the ASC model
Steve Baskauf
steve.baskauf at vanderbilt.edu
Sat Nov 13 17:22:02 CET 2010
This is part 1 of three messages that attempt to summarize the issues
that we have been discussing over the last month and to suggest a
solution and a way forward. If you zone out when you get emails longer
than three lines, please erase the messages and go on with your life.
Unfortunately this is a complicated topic and I'm trying to lay out the
issues in the simplest and most straightforward way that I can. The
first email (this one) describes how a fully normalized model of Darwin
Core can arise from modifying the ASC model to meet articulated needs of
the Darwin Core constituency. The second email will describe why we
need to come to a consensus on this and the criteria that I think should
be considered before a decision is reached. The third email will discuss
the issue that Rich has raised as to whether the proposed Individual
class should have a rather narrow scope (as I have advocated) or if it
should be broadened to include other functions. I have separated this
material into three emails because they are really separate but related
issues and may each spawn threads relating to the particular issue.
----------------------------------------------------------------------
To try to get a better understanding of the issues we have been
discussing, I went back to the Association of Systematics Collections
(ASC) information that Stan posted at
http://wiki.tdwg.org/twiki/bin/view/TAG/HistoricalDocuments - in
particular, the chart
http://wiki.tdwg.org/twiki/bin/viewfile/TAG/HistoricalDocuments?rev=1;filename=Ascfig2.pdf
I have cut out a section of that chart that will fit on one screen and
have created several images that have various models involving Darwin
Core classes pasted at the top. Each subsequent Darwin Core class model
is more normalized than the previous one. Below each model I show how
that denormalization maps to the ASC model.
The first diagram is the ASC model itself
http://bioimages.vanderbilt.edu/pages/asc-model.jpg
There are several differences in names between ASC and DwC.
dwc:Location corresponds to Locality in ASC, dwc:Event corresponds to
Collecting Event in ASC, dwc:Identification corresponds to Determination
in ASC, and Collecting Unit in ASC corresponds to a subset of what I
have been calling the "token" (evidence) that is limited to organisms,
their pieces, and their conglomerations. One may quibble about exact
correspondence, but I think that fundamentally those things are
congruent. In the ASC model, the lines with crow's feet correspond to
one-to-many relationships, with the foot at the "many" end. In my
diagram a triangle does the same thing with the point of the triangle
representing the "one" end. As you can see, the subset of the ASC model
shown here can be summarized in simplified form using DwC classes
(excluding for the time being the parts of the model that fall into the
DwC Taxon class). The ASC model reflects the "museum" perspective: in
many or most cases the whole organism is collected, or if only part of
the organism is collected (e.g. tree branch) the organism is rarely
re-visited for additional collections. So this model is denormalized
(flattened) to the extent that it doesn't allow for multiple types of
tokens per organism and resampling of the organism over time.
The second diagram represents Darwin Core at the time it became a
standard in 2009.
http://bioimages.vanderbilt.edu/pages/darwin-core-model.jpg
The difference from the previous diagram is the creation of the
Occurrence class. This class recognizes the needs of the observation
community because it allows one to connect Events to Determinations
directly without forcing them to be associated with a physical object
(token). This modification was beneficial because terms describing the
act of documenting the presence of a taxon during an Event are shared
between observations and specimen collection. This model presupposes
that there is no more than one token per Occurrence. dwc:basisOfRecord
is used to describe the nature of that one token. Terms for handling
tokens other than specimens are not well developed.
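To make the shape of this second model concrete, here is a minimal sketch in Python. It is only an illustration of the relationships described above, not Darwin Core syntax: apart from basisOfRecord and its two example values, the class names, field names, and ID values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    event_id: str
    location_id: str   # many Events can point at one Location
    event_date: str    # time flattened to a single ISO 8601 string


@dataclass
class Identification:
    taxon_name: str
    identified_by: str


@dataclass
class Occurrence:
    occurrence_id: str
    event: Event
    # Exactly one value: this model assumes at most one token per Occurrence.
    basis_of_record: str
    identifications: List[Identification] = field(default_factory=list)


event = Event("evt-1", "loc-1", "2010-06-15")

# A collected specimen and a bare observation share the same Event and the
# same Occurrence terms; only basis_of_record differs.
specimen = Occurrence("occ-1", event, "PreservedSpecimen")
sighting = Occurrence("occ-2", event, "HumanObservation")
sighting.identifications.append(Identification("Quercus alba", "observer A"))
```

The single string-valued basis_of_record field is where the "no more than one token per Occurrence" assumption shows up.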
The third diagram is a slight modification of the second and is what
I've been calling the "explicit token" model:
http://bioimages.vanderbilt.edu/pages/dwc-explicit-token-model.jpg
The only difference between it and the previous model is that there is
now recognition that the token is a separate thing from the Occurrence.
Types of tokens other than specimens (such as images and sounds) are
recognized explicitly as means of documenting Occurrences. The lines
connecting Occurrence to tokens have "crow's feet" on the token side,
allowing that there may be one to many tokens that act as evidence for a
single Occurrence. When I complain that basisOfRecord "doesn't work",
it is with this model in mind. In this model, there is not one single
"basis" (token) for a record - under this model there would need to be
the possibility to have multiple basisOfRecord values for an Occurrence,
which I don't really think is supported currently in DwC.
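A rough sketch of the explicit token model, again in Python and again with hypothetical class and field names, shows why a single basisOfRecord value stops being sufficient once an Occurrence can have several tokens. The token type strings are made up for the example and are not a controlled vocabulary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    token_id: str
    token_type: str   # illustrative values: "specimen", "image", "sound"


@dataclass
class Occurrence:
    occurrence_id: str
    # The crow's foot on the token side: zero, one, or many tokens.
    tokens: List[Token] = field(default_factory=list)

    def bases_of_record(self) -> List[str]:
        """Return the distinct token types documenting this Occurrence, sorted."""
        return sorted({t.token_type for t in self.tokens})


occ = Occurrence("occ-1")
occ.tokens.append(Token("tok-1", "specimen"))
occ.tokens.append(Token("tok-2", "image"))
occ.tokens.append(Token("tok-3", "image"))

print(occ.bases_of_record())  # ['image', 'specimen']
```

An Occurrence documented by a specimen and two images has two "bases", so the nature of the evidence is better carried by each token than by one field on the Occurrence.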
The fourth diagram, which I call the "full model", adds one more
component to the explicit token model:
http://bioimages.vanderbilt.edu/pages/full-model.jpg
This model is what I consider to be the fully normalized version of
Darwin Core (excluding the Taxon parts). This model introduces the
Individual class exactly as I have defined it in my proposed term
addition: as a node that connects Occurrences to Identifications (a.k.a.
Determinations). This is not really an addition to the existing Darwin
Core standard because the term individualID already exists in the
Occurrence class. My proposal simply gives a name to the thing that is
the object of individualID - in fact my original justification for the
term addition says exactly that. The fundamental purpose that
Individual serves is to accommodate the "crow's foot" on the Occurrence
side of the line that connects Individual to Occurrence, i.e. to allow
re-sampling over time and space. That is all. The line going to
Identification/Determination has to be connected somewhere and it makes
sense to connect it to Individual rather than Occurrence since the
resampled entity is not going to change its identity from one sampling
to another.
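The role of Individual as a connecting node can be sketched the same way. This is a minimal illustration under assumed names: only individualID corresponds to an existing Darwin Core term, and the helper function identifications_for is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Identification:
    taxon_name: str


@dataclass
class Occurrence:
    occurrence_id: str
    event_date: str


@dataclass
class Individual:
    individual_id: str   # the thing that individualID points to
    # Crow's foot on the Occurrence side: the same organism re-sampled.
    occurrences: List[Occurrence] = field(default_factory=list)
    # Identifications hang off the Individual, not off each Occurrence.
    identifications: List[Identification] = field(default_factory=list)


def identifications_for(occurrence_id: str,
                        individuals: List[Individual]) -> List[Identification]:
    """Find the Identifications that apply to an Occurrence via its Individual."""
    for ind in individuals:
        if any(o.occurrence_id == occurrence_id for o in ind.occurrences):
            return ind.identifications
    return []


tree = Individual("ind-1")
tree.occurrences.append(Occurrence("occ-1", "2009-05-01"))
tree.occurrences.append(Occurrence("occ-2", "2010-05-01"))
tree.identifications.append(Identification("Acer rubrum"))
```

Both occurrences of the tree resolve to the same Identification, so a later re-determination only has to be recorded once.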
I have done one more thing in this model to make it more normalized.
It's a spin-off from Paul Murray's post
http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001771.html
which got me thinking that if we were to treat time in the same way
we are treating Locations and other entities, a fully normalized model
would have a class for Time, since time can have varying degrees of
specificity (just like Location and Taxon) and there is a one-to-many
relationship between Time and Event (i.e. there can be many Events going
on at different Locations at a given Time, just like there can be many
Events at different Times at a given Location). We almost always
denormalize the Time class out of our models because in most cases it
can be represented as a single ISO 8601 string. But as Paul points out,
Time can be a complicated thing that one might want to model in a more
sophisticated way than a single string. I'm not suggesting that we
should do this in Darwin Core if nobody needs it, but the point is that
it COULD be done. There probably already is a class for Time defined by
somebody else (does anyone know about this?).
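To illustrate what a separate Time class might look like, here is a minimal sketch in Python. The TimeSpan class, its fields, and its contains method are all assumptions made for the example; they are not part of Darwin Core or of any existing time vocabulary.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TimeSpan:
    """A span of days; a coarse span (a year) can contain a precise one (a day)."""
    start: date
    end: date   # inclusive

    def contains(self, other: "TimeSpan") -> bool:
        return self.start <= other.start and other.end <= self.end


@dataclass
class Event:
    event_id: str
    location_id: str
    time: TimeSpan   # many Events can share one TimeSpan


year_2010 = TimeSpan(date(2010, 1, 1), date(2010, 12, 31))
one_day = TimeSpan(date(2010, 6, 15), date(2010, 6, 15))

# Two Events at different Locations at the same Time.
e1 = Event("evt-1", "loc-1", one_day)
e2 = Event("evt-2", "loc-2", one_day)
```

Varying specificity then becomes a containment question between spans, in the same way that a county is contained in a state for Location.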
In summary, the fully normalized model that I have presented seems to be
consistent with almost all of the discussion that has taken place on the
list recently. Although the ASC model is "more normalized" than this in
some parts, I haven't heard many of the participants in the discussion
advocating for a general Darwin Core model that is more complex than
what I've presented in the last link. Obviously, individuals (humans)
could add many more classes of things in their own personal models, but
I think the classes in this last model can accommodate nearly all of the
resources people have said that they want to describe using Darwin Core.
End of part 1
--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.
delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu