[tdwg-content] Treatise on Occurrence, tokens, and basisOfRecord

Richard Pyle deepreef at bishopmuseum.org
Sun Oct 24 03:39:25 CEST 2010


Hi Steve,

Many thanks for taking the time to carefully articulate all of this.  Though
your post is long, I think it is clear and well-written, and perhaps a good
genesis for a web or wiki page somewhere, to add to the collective
documentation of DwC-space.

A couple of comments, as I read through what you wrote:

> However, this decision does not excuse us from thinking 
> carefully about whether a term can be appropriately applied 
> to a resource that is a member of some class (e.g. should 
> we say that a digital photograph has a scientific name?).  
> Placing a term within a class is a suggestion that the 
> term would appropriately be applied as a property of an 
> instance of a class.

I'm a bit unsure where this notion of "digital photograph has a scientific
name" comes into play.  My best guess is that if the basisOfRecord for an
Occurrence record is "Digital Image" (not actually listed among the examples
at http://rs.tdwg.org/dwc/terms/#basisOfRecord), perhaps a consumer of such
a record will misinterpret it as though the Image *is* the item that Occurs
(and hence has a scientific name).  But I think of basisOfRecord as the
"basis" of our *belief* (aka "evidence") that the Occurrence was real.  That
is, the Occurrence is always understood to refer to an organism (though the
documentation doesn't say this explicitly), and the "basis" of the
Occurrence is the reason we have for believing that the organism occurred at
a place and time.  Again, my interpretation of this may be wrong, and it's
confounded by our tendency to shortcut information. For example, often in
our community the statement "an individual organism that was documented to
occur at a place and time was identified by someone as belonging to a taxon
concept that is best represented by this scientific name" is truncated to
"Occurrence has scientific name". Because we tend to do this, I can easily
understand taking it one step further and reducing the statement "an individual
organism that was documented *by a digital image* to occur at a place and
time was identified by someone as belonging to a taxon concept that is best
represented by this scientific name" to "Digital Image has scientific name".

So, maybe we should try to avoid these short-cut representations of our data
as much as possible?
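
To illustrate what the un-shortcut version looks like, here's a quick
Python sketch (the record shapes are my own invention, not anything
mandated by DwC) of the full chain of assertions behind "Occurrence has
scientific name":

===================
# The full chain behind the shortcut "Occurrence has scientific name".
# Record shapes are illustrative only, not prescribed by DwC.
occurrence = {
    "occurrenceID": "occ-1",
    "basisOfRecord": "HumanObservation",  # evidence the Occurrence was real
    "eventID": "evt-1",                   # the place and time
}
identification = {
    "identificationID": "ident-1",
    "occurrenceID": "occ-1",  # someone identified the organism that occurred...
    "taxonID": "taxon-1",     # ...as belonging to this taxon concept
}
taxon = {
    "taxonID": "taxon-1",
    "scientificName": "Homo sapiens Linnaeus 1758",  # the best-fitting name
}

def shortcut_scientific_name(occ):
    # The shortcut collapses Occurrence -> Identification -> Taxon into a
    # single pretend-property of the Occurrence (or, worse, of the Image).
    assert identification["occurrenceID"] == occ["occurrenceID"]
    assert taxon["taxonID"] == identification["taxonID"]
    return taxon["scientificName"]

print(shortcut_scientific_name(occurrence))
===================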

> 3. When users want to "flatten" and simplify their 
> databases, they tend to eliminate one-to-many (1:M) 
> relationships in favor of one-to-one (1:1) 
> relationships.  The result of that is differences 
> like we saw in 
> 
> http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif 
> (which allows 1:M relationships between Occurrences and
>  Events and between Events and Locations) and
> 
> http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif 
> (which "atomizes" every Occurrence by considering it 
> to have its own separate eventTime and Location information).  

Another way to look at the difference between diagram1 and diagram2 is that
the latter is simply a flattened (aka "denormalized") version of the first.
I don't think the latter is really a more "atomized" version, because if
eventTime and Location are recorded with such precision, then they could
easily be represented in the structure of diagram1 -- except that the 1:M
links would tend to shift in abundance towards more 1:1 (there is no problem
representing a 1:M relationship where most instances actually are 1:1).  But
what really makes diagram2 different is that it is likely to include
replication of identical Event property values over multiple records in cases
where there really is a 1:M relationship between Event and Occurrence.  No
harm there -- it's just a bit denormalized.  Denormalization is fine for
mechanisms to transmit bulk content around -- especially if the records were
generated from more normalized original data structures at the source.
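
To make that concrete, here's a quick Python sketch (IDs and values
invented) of the flattening step itself:

===================
# Flattening diagram1-style (normalized) data into diagram2-style rows.
events = {"evt-1": {"eventDate": "2010-10-24", "locationID": "loc-1"}}
locations = {"loc-1": {"country": "Germany", "decimalLatitude": 52.453016}}

# Two Occurrences share one Event -- a genuine 1:M relationship.
occurrences = [
    {"occurrenceID": "occ-1", "eventID": "evt-1"},
    {"occurrenceID": "occ-2", "eventID": "evt-1"},
]

flattened = []
for occ in occurrences:
    event = events[occ["eventID"]]
    location = locations[event["locationID"]]
    row = dict(occ)       # the Occurrence's own properties
    row.update(event)     # replicate Event values onto each row
    row.update(location)  # replicate Location values likewise
    del row["eventID"], row["locationID"]  # flat rows drop the links
    flattened.append(row)

# Both rows now carry identical eventDate/country values -- exactly the
# replication that distinguishes diagram2 from diagram1.
for row in flattened:
    print(row)
===================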

> A. There is nothing intrinsically "right" or "wrong" 
> about either of these approaches, because they each have 
> their own advantages.  The 1:M approach is more efficient, 
> but results in a more complicated database, while the 1:1 
> approach results in a simpler database but may require 
> repeating some or many term values in the records.  

Exactly.  Perhaps I misunderstood your point about "atomized".

> C. This collapsing of the diagram is also the reason 
> for some disagreement about whether a term belongs in 
> a certain class or not.  In the example above, 1:1 
> people would say that eventDate is a property of an 
> Occurrence, while 1:M people would say that eventDate 
> is a property of an Event.   

That's not quite how I would characterize it.  I would say, if you establish
an Event class at all, then eventDate is clearly a property of that class.  I
think the real question is "Do we need an Event class?"  If yes, then
eventDate belongs to it.  If no, then we "collapse" eventDate to Occurrence.

By the way, when I say "Do we need an Event class?", I mean it at two
levels.  At one level, the question is: "Is it useful to establish it within
DwC?"  At another level, it's "How shall I structure my pacakge of data
using DwC terms?"  My understanding (which is crude), is that even with
Event class defined in DwC, I still have the choice of representing my
Occurrence data as:

1) "Normalized":
===================
occurrenceID: 1234
eventID: 9876
identificationID: 7654
individualCount: 4
recordedBy: "J. Smith"
===================
eventID: 9876
locationID: 4567
eventDate: 2010-10-24
eventTime: 02:13:00
===================
locationID: 4567
decimalLatitude:  52.453016
decimalLongitude: 13.309418
geodeticDatum: "WGS84"
country: "Germany"
locality: "Botanischer Garten Und Botanisches Museum Berlin-Dahlem"
===================
identificationID: 7654
taxonID: 2345
identifiedBy: "Richard Pyle"
dateIdentified: 2010-10-24
===================
taxonID: 2345
scientificName: "Homo sapiens Linnaeus 1758"
namePublishedIn: "Linnaeus, C. 1758. Systema Naturae...."
nameAccordingTo: "Linnaeus, C. 1758. Systema Naturae...."
===================


2) "Flattened":
===================
occurrenceID: 1234
individualCount: 4
recordedBy: "J. Smith"
eventDate: 2010-10-24
eventTime: 02:13:00
decimalLatitude:  52.453016
decimalLongitude: 13.309418
geodeticDatum: "WGS84"
country: "Germany"
locality: "Botanischer Garten Und Botanisches Museum Berlin-Dahlem"
identifiedBy: "Richard Pyle"
dateIdentified: 2010-10-24
scientificName: "Homo sapiens Linnaeus 1758"
namePublishedIn: "Linnaeus, C. 1758. Systema Naturae...."
nameAccordingTo: "Linnaeus, C. 1758. Systema Naturae...."
===================

In my understanding, both of these would be legitimate implementations of
the DwC terms.  The difference is that in the first case, the content is
normalized such that the values of the different properties (sorry, Bob --
not sure of the correct word here) are inherited through the various
"[class]ID" links; whereas in the "flattened" version, the properties
are represented directly on the Occurrence instance.

The advantage of the first is that the atomized and ID'd class instances can
be reused for multiple occurrences, whereas the advantage of the second is
that it greatly simplifies the content structure.

> 4. I would propose that the "right" relationship diagram is not 
> necessarily one that caters to a certain "right" philosophical point 
> of view.  Rather, the "right" diagram is the one that allows users to 
> define the relationships that they need for the organization of their 
> metadata in the simplest manner, and which provides the most clarity 
> about what resources of various kinds are, and how they are connected.  

Agreed.  But another component to "rightness" is the extent to which users
want to re-use content from the various class instances.  For example, it's
incredibly easy to convert the Normalized version to the flattened version.
But it's not always so easy to parse the flattened version back to the
normalized one.  You can always do it by creating unique values of the
provided terms for each class, but this can be potentially misleading and
artificial -- especially if there were more properties for each of the
individual class entities that were not included with the packaged data.
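
Here's a quick Python sketch (data invented) of why that reverse
direction is fragile: grouping flat rows by identical Event values
manufactures artificial Event instances, and two genuinely distinct
Events that happen to share the same recorded values get conflated:

===================
# Re-normalizing flattened rows by grouping on Event-valued columns.
# The pitfall: value-equality is a poor proxy for instance-identity.
flat_rows = [
    {"occurrenceID": "occ-1", "eventDate": "2010-10-24", "locality": "Dahlem"},
    {"occurrenceID": "occ-2", "eventDate": "2010-10-24", "locality": "Dahlem"},
    # occ-3 came from a *different* collecting Event that merely shares
    # the same (coarsely recorded) date and locality:
    {"occurrenceID": "occ-3", "eventDate": "2010-10-24", "locality": "Dahlem"},
]

events = {}      # (eventDate, locality) -> manufactured eventID
normalized = []
for row in flat_rows:
    key = (row["eventDate"], row["locality"])
    if key not in events:
        events[key] = "evt-%d" % (len(events) + 1)  # artificial identifier
    normalized.append({"occurrenceID": row["occurrenceID"],
                       "eventID": events[key]})

# All three Occurrences collapse onto a single manufactured Event, even
# though occ-3 really belonged to a separate Event at the source.
print(events)
print(normalized)
===================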

> There also seemed to be a consensus that an observation was simply an 
> Occurrence that did not have an associated token.  

Well...technically the "token" in this case is a pattern of neurons in the
observer's brain that constitute a memory....but that may be a bit abstract.

> http://bioimages.vanderbilt.edu/pages/token-assumed.gif which 
> I will refer to as the "assumed token" model and 
>
> http://bioimages.vanderbilt.edu/pages/token-explicit.gif which 
> I will refer to as the "explicit token" model.  

Nice -- and without reading another word of your message, I'm going to take
a chance and say that I conceptually agree with your "token-explicit"
diagram. The hard part is (as with the case of class:individual) deciding
whether this level of "normalization" is valuable for DwC purposes.  

> I believe that historically the assumed token model has been 
> the one which most people have had in mind.  

Actually, I've always envisioned it as you have in your token-explicit
version (and have said as much at various meetings to discuss DwC, going
back to 1.0).  In fact, I remember discussing this exact issue with Stan
Blum long before DwC existed (he was the first to suggest to me the term
"evidence" in this context -- which I think is functionally equivalent to
your "token"). However, I've conceeded that this level of normalization
would probably be too much for the intended purpose of the DwC terms.  But
I'll keep an open mind on that.

> Before the new DwC standard, we had specimens and we had 
> observations.  In order to avoid redundancies in terms for 
> those two types of "things", a combined "thing" called 
> "Occurrence" was created.  An Occurrence that was an 
> observation didn't have a token and an Occurrence that 
> was a specimen had a physical or living specimen as its 
> token.  

My rationalization of it in the early days (pre-DwC) was that *everything*
was effectively an observation, and beyond that, the only question was a
matter of evidence.  In my earliest models, I categorized "evidence" into
"Specimen", "Image", "Literature Report", and "Unvouchered Observation" (I
was using the word "voucher" in the general sense, as in the verb "to vouch"
-- not in the more specific sense for our community, which implies "Specimen
preserved in Museum").  My read on the history of DwC is that it was
initially established as a means to aggregate and/or share Specimen data
amongst Museums (hence its Specimen-centric nature).  Later, the
Specimen/Observation dichotomy was introduced so that DwC content could
support more sophisticated and complete representations of the occurrence of
organisms in place and time, because there was much more information than
what existed as specimens in Museums.  In my mind, the "Observation" side
was effectively a collapsing of my "Image", "Literature Report" and
"Unvouchered Observation" -- which I was OK with in the context of the time.
Because at the time, the vast majority of content available in computer
databases came from museum specimen databases, and from observational
databases (largely in the bird realm).

So...I see the current iteration of DwC as another step in the evolution of
moving from "sharing and aggregating specimen data among museums" to
"documenting biodiversity in nature".  It's not all the way into the fully
normalized representation of biodiversity data, but it's far enough that it
is a nice compromise between practical and effective for the majority of the
user constituency. In my mind, the next logical step in this evolutionary
trajectory would be to recognize "Individual" as a class (which DwC is
apready primed for, via individualID).

> It is not clear how one is supposed to handle the actual metadata 
> for the image that serves as the token.  

That seems to me to be in the domain of the TDWG MRTG group
(http://www.tdwg.org/standards/638/).

> Unlike specimens where the token's metadata terms are placed in the 
> Occurrence class, I guess in the case of an image one is supposed 
> to use associatedMedia to link the so-called MachineObservation to 
> the image record.  If DNA were extracted, one would link the 
> sequence to the Occurrence using associatedSequences (although 
> it's not clear to me what the basisOfRecord for that would be - 
> "TookATissueSample"?).  But what does one do for other kinds of 
> tokens, like seeds or tissue samples - create terms like 
> associatedSeed and associatedTissueSample?  

In my mind, things like seeds, tissue samples, and DNA sequences are simply
different kinds of specimens (just like dried skeletons vs. botanical
pressed sheets vs. whole organisms in jars of alcohol vs. prepared skins,
etc.)  They may have certain properties specific to each subclass of
specimen, but fundamentally I think it's fair to treat them as specimens.
DNA sequences are a bit different, of course, because they are not the
"stuff" of an organism, but rather an indirect representation of the
"stuff".  In my mind, that difference justifies associatedSequences, where
we don't have associatedSeeds, associatedTeeth, associatedSkins,
associatedSkeletons, etc.
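
If I were to sketch that view in code (Python; none of these classes
exist in DwC -- this is purely illustrative), it might look like:

===================
# Evidence ("token") types as a small class hierarchy.  Illustrative
# only; DwC defines no such classes.
class Token:
    """Evidence that an organism occurred at a place and time."""
    def __init__(self, token_id):
        self.token_id = token_id

class Specimen(Token):
    """Physical 'stuff' of the organism: skins, skeletons, pressed
    sheets, whole organisms in jars of alcohol, seeds, tissue samples."""
    def __init__(self, token_id, preparation):
        Token.__init__(self, token_id)
        self.preparation = preparation  # e.g. "seed", "tissue sample"

class Image(Token):
    """A rendering of the organism, not the organism itself."""

class Sequence(Token):
    """Not the 'stuff' of the organism but an indirect representation of
    it -- which is why associatedSequences earns its own term."""

seed = Specimen("tok-1", preparation="seed")
tissue = Specimen("tok-2", preparation="tissue sample")
# Both are fundamentally specimens, despite subclass-specific properties.
print(isinstance(seed, Specimen), isinstance(tissue, Specimen))
===================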

> However, if I'm going to make this admission, I demand that 
> the other guilty parties also confess, namely people who want 
> to assert that Occurrences have properties that actually are 
> properties of specimens.  

I'll plead innocent of those charges, as I always understood the
representation of Specimen-properties as applied to Occurrence instances as
just another compromise of "flat" vs. "normalized", in much the same way
that applying properties of Locations to Events, and Locations+Events to
Occurrences, would likewise be compromises in the interest of
simplification.

> If we accept the explicit token model, then as a biodiversity 
> informatics resource type "observation" will have to disappear 
> into a puff of nothingness 

Not necessarily.  See my comment earlier about patterns of neurons in a
human brain that constitute a memory.  Just as a digital image rendered on a
hard disk requires certain machinery to convert into photons that strike our
retinas (i.e., a computer and monitor), so too does a memory require such
machinery (e.g., the brain itself, transmission of sound waves via vocal
cords, sound waves striking ear drums, etc.)  This may sound weird, but I'm
being serious: a human memory is, fundamentally, every bit as much of a
"token" as a specimen or a digital image.  It's just considerably less
accessible and well-resolved.

> Realistically, I can't see this kind of separation ever happening, 
> given the amount of trouble it's been just to get a few people 
> to admit that Individuals exist.  

I don't think the issue was ever in convincing people that Individuals exist
-- that much, I think, was clear to everyone (as proof: see
dwc:individualID).  The issue was always more about where the current DwC
should lie on the scale of highly flattened (e.g., DwC 1.0) to highly
normalized (e.g., ABCD and CDM).  It's necessarily a compromise between
modelling the information "as it really is", vs. modelling the information
in a way that's both accessible to the majority to content providers, and
useful to the majority of contnent consumers.  I think we both understand
what the trade-offs are in either direction. The question is, what is the
"sweet spot" for the majority of our community at this time in history?

I would venture that at the time DwC 1.0 was developed, it hit the sweet
spot reasonably well.  As more content holders develop increasingly
sophisticated DBMSs for their content, and as the user community delves into
increasingly sophisticated analyses of the data, the "sweet spot" will shift
from the flattened end of the scale to the normalized end of the scale. And,
I would hope, DwC will evolve accordingly.

> It is just too hard to get motion to happen in the TDWG community.  

People make the same complaint about another organization that I'm involved
with (ICZN).  But here's the thing: as in the case of nomenclature,
stability in itself can be a very important thing.  If DwC changed every six
months, then by the time people developed software apps to work with it,
those apps would already be obsolete.  If someone writes code that consumes
DwC content as expressed in the current version of DwC, then that code may
break if people start providing content with class:individual and
class:token content.  If our community is going to move forward
successfully, I think standards like DwC need to evolve in a punctuated way,
rather than a gradualist way (same goes for the Codes of nomenclature). That
is, a bit of inertia in the system is probably a good thing.

> OK, I've now gone on for eight pages of text explaining the 
> rationale behind the question.  So I'll return to the basic 
> question: is the consensus for modeling the relationship 
> between an Occurrence and associated token(s) the assumed 
> token model:
>
>	http://bioimages.vanderbilt.edu/pages/token-assumed.gif
>
>	or the explicit token model:
>
>	http://bioimages.vanderbilt.edu/pages/token-explicit.gif
>
>	?  

Here's how I would answer:  When modelling my own databases, tracking my own
content, I would *definitely* go with the token-explicit model (and indeed
already have, for a long time now).

But when deciding on a community data exchange standard (i.e., DwC),
compromise between flat and normalized is still a necessity, and as such,
the answer in terms of modifying DwC needs to take into account the form of
the bulk of the existing content, the needs of the bulk of the existing
users/consumers, and the virtues of stability of Standards in a world where
software app development time stretches for months or years.

Maybe the answer to this is to treat different versions of DwC as
concurrent, rather than serial.  That is, as long as the next most
sophisticated version can easily be "collapsed" to all previous versions
(aka, backward compatibility), then maybe we just need a clear mechanism for
consuming applications to indicate the desired DwC version. That way, apps
developed to work with v2.1 can indicate to a provider that is capable of
producing v3.6 content that they want it in v2.1 format.  Assuming we
maintain backward compatibility (i.e., the more-normalized version can be
easily collapsed to the more-flattened version), then it should be a very
simple matter for the content provider to stream the same content in v2.1
format.
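
A rough sketch of what that negotiation might look like on the provider
side (Python; the version labels and record shapes are invented for
illustration):

===================
# A provider serving the same content at two hypothetical DwC "versions":
# a normalized form and a flattened (collapsed) form.
NORMALIZED = {
    "occurrence": {"occurrenceID": "occ-1", "eventID": "evt-1"},
    "event": {"eventID": "evt-1", "eventDate": "2010-10-24"},
}

def collapse(normalized):
    # Backward compatibility: fold Event properties into the Occurrence
    # record and drop the link.
    flat = dict(normalized["occurrence"])
    flat.update(normalized["event"])
    del flat["eventID"]
    return flat

def serve(requested_version):
    # An app built for the flattened "v2.1" asks for it explicitly; one
    # built for the normalized "v3.6" gets the richer structure.
    if requested_version == "v2.1":
        return collapse(NORMALIZED)
    return NORMALIZED

print(serve("v2.1"))  # flattened record
print(serve("v3.6"))  # normalized records
===================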

But now I'm dabbling in areas that are WAY outside my scope of expertise...

Anyway...I would reiterate that I, for one, appreciate that you took the
time to write all this down (took me over 3 hours to read & respond -- so
obviously I care! -- of course, I'm waiting for a taxi to go to the airport,
so really not much else for me to do right now).  If I didn't reply to parts
of your message, it was either because I agreed with you and had nothing to
elaborate or expound upon, or I didn't really understand (e.g., all the RDF
stuff).

Aloha,
Rich




