[tdwg-content] Treatise on Occurrence, tokens, and basisOfRecord

Sun Oct 24 06:31:46 CEST 2010

Rich,
Thanks for taking the time to read the whole thing.  Based on the first 
series of comments you made, it seems as though we are in agreement on 
most points.  I think that what I wrote was (as I had anticipated) 
somewhat less clear due to my use of (or failure to use) some appropriate 
terms to describe what I was talking about.  For example, when I said 
"atomized" I probably should have said something like "fine-grained" and 
correct use of the term "normalized" would have helped.  Some other 
comments inline:

Richard Pyle wrote:
>> I believe that historically the assumed token model has been
>> the one which most people have had in mind.
>>     
>
> Actually, I've always envisioned it as you have in your token-explicit
> version (and have said as much at various meetings to discuss DwC, going
> back to 1.0).  In fact, I remember discussing this exact issue with Stan
> Blum long before DwC existed (he was the first to suggest to me the term
> "evidence" in this context -- which I think is functionally equivalent to
> your "token"). However, I've conceded that this level of normalization
> would probably be too much for the intended purpose of the DwC terms.  But
> I'll keep an open mind on that.
>
>   
>> Before the new DwC standard, we had specimens and we had
>> observations.  In order to avoid redundancies in terms for
>> those two types of "things", a combined "thing" called
>> "Occurrence" was created.  An Occurrence that was an
>> observation didn't have a token and an Occurrence that
>> was a specimen had a physical or living specimen as its
>> token.
>>     
>
> My rationalization of it in the early days (pre-DwC) was that *everything*
> was effectively an observation, and beyond that, the only question was a
> matter of evidence.  In my earliest models, I categorized "evidence" into
> "Specimen", "Image", "Literature Report", and "Unvouchered Observation" (I
> was using the word "voucher" in the general sense, as in the verb "to vouch"
> -- not in the more specific sense for our community, which implies "Specimen
> preserved in Museum").  My read on the history of DwC is that it was
> initially established as a means to aggregate and/or share Specimen data
> amongst Museums (hence its Specimen-centric nature).  Later, the
> Specimen/Observation dichotomy was introduced so that DwC content could provide
> more sophisticated and complete representations of the occurrence of
> organisms in place and time, because there was much more information than
> what existed as specimens in Museums.  In my mind, the "Observation" side
> was effectively a collapsing of my "Image", "Literature Report" and
> "Unvouchered Observation" -- which I was OK with in the context of the time.
> Because at the time, the vast majority of content available in computer
> databases came from museum specimen databases, and from observational
> databases (largely in the bird realm).
>   
Well, I'm not surprised that the ideas that I'm trying to put down in 
words and diagrams predate my entry into this arena a year and a half 
ago.  What is a bit frustrating to me is that ideas like these aren't 
laid out in an easy-to-understand fashion and placed in easy-to-find 
places.  I have spent much of the last year and a half trying to 
understand how the whole TDWG/DwC universe is supposed to fit together.  
I think that the idea of having the Google Code site where there are 
explanations and examples for the various DwC terms is the kind of thing 
we need.  Unfortunately, most of the terms do not yet have entries 
there.  Perhaps I'm just impatient.  If it turns out that any of the 
summaries that I've written here accurately reflect any kind of 
consensus, then maybe someone could "clean them up" (i.e. use correct 
technical terms after giving definitions of what they mean) and paste 
them somewhere where people can find them.  That would prevent another 
person 10 years from now re-articulating the same ideas a third time.  
I'm particularly thinking of the summary diagram 
http://bioimages.vanderbilt.edu/pages/token-explicit.gif along with an 
explanation of how people use the more normalized and more flattened 
versions of it.  We already do have quite lucid examples in the Simple 
Darwin Core (flattened) and Darwin Core XML guide (normalized), but some 
sort of overview of the big picture might be helpful.  If an RDF guide 
ever gets off the ground, that would be another example of how the 
relationships assumed in DwC are expressed in a very explicit way. 
> So...I see the current iteration of DwC as another step in the evolution of
> moving from "sharing and aggregating specimen data among museums" to
> "documenting biodiversity in nature".  It's not all the way into the fully
> normalized representation of biodiversity data, but it's far enough that it
> is a nice compromise between practical and effective for the majority of the
> user constituency. In my mind, the next logical step in this evolutionary
> trajectory would be to recognize "Individual" as a class (which DwC is
> already primed for, via individualID).
>   
I think I understand the message that you are trying to convey above and 
in your later comments about creating new versions of DwC (or new 
evolutionary states of DwC) that don't break the previous ones.  I think 
that is one reason why the process of examining and clearly articulating 
the community consensus on what Darwin Core terms and classes "mean" and 
how they are connected to each other is so important before we embark on 
implementing GUIDs and RDF.  Pete has suggested that we may need a 
second version of DwC in order to make it work in the Linked Open Data 
world and he's probably right.  I'm not sure that the existing 
vocabulary has all of the terms we need to do that.  However, if we are 
going to "evolve" Darwin Core so that it will work in the LOD world, I 
hope that we do it in such a way that we maintain the same "meaning" of 
things as Darwin Core 1.0.  I think that is the way to maintain the 
kind of "stability" that you described below. 
>> Unlike specimens where the token's metadata terms are placed in the
>> Occurrence class, I guess in the case of an image one is supposed
>> to use associatedMedia to link the so-called MachineObservation to
>> the image record.  If DNA were extracted, one would link the
>> sequence to the Occurrence using associatedSequences (although
>> it's not clear to me what the basisOfRecord for that would be -
>> "TookATissueSample"?).  But what does one do for other kinds of
>> tokens, like seeds or tissue samples - create terms like
>> associatedSeed and associatedTissueSample?
>>     
>
> In my mind, things like seeds, tissue samples, and DNA sequences are simply
> different kinds of specimens (just like dried skeletons vs. botanical
> pressed sheets vs. whole organisms in jars of alcohol vs. prepared skins,
> etc.)  They may have certain properties specific to each subclass of
> specimen, but fundamentally I think it's fair to treat them as specimens.
> DNA sequences are a bit different, of course, because they are not the
> "stuff" of an organism, but rather an indirect representation of the
> "stuff".  In my mind, that difference justifies associatedSequences, where
> we don't have associatedSeeds, associatedTeeth, associatedSkins,
> associatedSkeletons, etc.
>   
Your point is well taken in that we don't need a proliferation of types 
of associated tokens.  We need as many different token "types" as we 
have coherent sets of metadata terms.  One of the points of typing 
resources is to let potential users know what kinds of metadata 
properties (terms) they can reasonably expect to receive about that 
resource.  If one will receive the same set of properties about two 
kinds of resources (e.g. skins and skeletons), there is no reason to 
type them differently.  The point that I was trying to get at 
(eventually) was that it was inconsistent to say that images need to be 
referenced as associatedMedia and sequences needed to be referenced as 
associatedSequences, and yet not say that specimens needed to be 
referenced as "associatedSpecimens".  I actually think that based on 
Roger's explanation of "to subclass or not" 
(http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot), it makes more 
sense to talk about using a generic "hasToken" or "tokenID" along with 
"tagging" the token using rdf:type (as I suggested toward the end of my 
"treatise") rather than a bunch of associatedXXXX terms. 

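A minimal sketch of the generic link described above, assuming a single
"hasToken" relationship plus a type tag on each token (the role an RDF type
assertion would play).  The class names, field names, identifiers, and type
values below are hypothetical illustrations, not Darwin Core terms.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    token_id: str    # hypothetical identifier for one piece of evidence
    token_type: str  # hypothetical type tag, e.g. "StillImage"


@dataclass
class Occurrence:
    occurrence_id: str
    # One generic link to every token, in place of associatedMedia,
    # associatedSequences, and a hypothetical "associatedSpecimens".
    has_token: List[Token] = field(default_factory=list)

    def tokens_of_type(self, token_type: str) -> List[Token]:
        """Return the linked tokens that carry the given type tag."""
        return [t for t in self.has_token if t.token_type == token_type]


occ = Occurrence(
    "occ-1",
    [
        Token("spec-1", "PreservedSpecimen"),
        Token("img-1", "StillImage"),
        Token("seq-1", "Sequence"),
    ],
)
image_ids = [t.token_id for t in occ.tokens_of_type("StillImage")]
```

With this shape, a new kind of token needs only a new type value, not a new
associatedXXXX term.
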
>   
>> If we accept the explicit token model, then as a biodiversity
>> informatics resource type "observation" will have to disappear
>> into a puff of nothingness
>>     
>
> Not necessarily.  See my comment earlier about patterns on neurons in a
> human brain that constitute a memory.  Just as a digital image rendered on a
> hard disk requires certain machinery to convert into photons that strike our
> retinas (i.e., a computer and monitor), so too does a memory require such
> machinery (e.g., the brain itself, transmission of sound waves via vocal
> cords, sound waves striking ear drums, etc.)  This may sound weird, but I'm
> being serious: a human memory is, fundamentally, every bit as much of a
> "token" as a specimen or a digital image.  It's just considerably less
> accessible and well-resolved.
>   
I guess I'm thinking about this in terms of a token being something to 
which we can assign an identifier and retrieve a representation (a la 
representational state transfer).  Although I don't deny the existence 
of memory patterns in neurons that are associated with a 
HumanObservation, there isn't any way that we can receive a 
representation of that memory directly.  If the person draws a sketch of 
what he/she remembers, then we have a media item that we can convert 
into a digital form and transmit through the Internet (a token).  If the 
person types up notes, then we have a text document (a token that can 
also be delivered as a digital file or scan of a typewritten page).  On 
the other hand, if the person simply records the values of recordedBy, 
eventDate, and Location terms, then we have only Occurrence metadata (no 
token).  If someone claims "basisOfRecord=HumanObservation" and has no 
token of any kind, then what is there that is deliverable other than the 
basic Occurrence metadata?  That's why I'm claiming that 
basisOfRecord=HumanObservation simply corresponds to an Occurrence 
record with no token.
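
The claim above can be sketched as a rule that derives a basisOfRecord value
from whatever tokens exist, with an empty token list giving HumanObservation.
This is only an illustration under stated assumptions: the function name, the
token type tags, and the priority order of the mapping are hypothetical and
are not defined by Darwin Core.

```python
from typing import List, Optional

# Assumed priority order: the first token type found decides the value.
# Both the order and the mapping are illustrative, not part of Darwin Core.
TOKEN_TYPE_TO_BASIS = [
    ("PreservedSpecimen", "PreservedSpecimen"),
    ("LivingSpecimen", "LivingSpecimen"),
    ("StillImage", "MachineObservation"),
]


def implied_basis_of_record(token_types: List[str]) -> Optional[str]:
    """Return the basisOfRecord value implied by an Occurrence's token types."""
    if not token_types:
        # No deliverable token: only Occurrence metadata remains.
        return "HumanObservation"
    for token_type, basis in TOKEN_TYPE_TO_BASIS:
        if token_type in token_types:
            return basis
    # Token types outside the assumed mapping get no value here.
    return None
```

For example, implied_basis_of_record([]) gives "HumanObservation", while a
list containing "StillImage" alone gives "MachineObservation".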
>   
>> Realistically, I can't see this kind of separation ever happening,
>> given the amount of trouble it's been just to get a few people
>> to admit that Individuals exist.
>>     
>
> I don't think the issue was ever in convincing people that Individuals exist
> -- that much, I think, was clear to everyone (as proof: see
> dwc:individualID).  The issue was always more about where the current DwC
> should lie on the scale of highly flattened (e.g., DwC 1.0) to highly
> normalized (e.g., ABCD and CDM).  It's necessarily a compromise between
> modelling the information "as it really is", vs. modelling the information
> in a way that's both accessible to the majority of content providers, and
> useful to the majority of content consumers.  I think we both understand
> what the trade-offs are in either direction. The question is, what is the
> "sweet spot" for the majority of our community at this time in history?
>
> I would venture that at the time DwC 1.0 was developed, that hit the sweet
> spot reasonably well.  As more content holders develop increasingly
> sophisticated DBMS for their content, and as the user community delves into
> increasingly sophisticated analyses of the data, the "sweet spot" will shift
> from the flattened end of the scale to the normalized end of the scale. And,
> I would hope, DwC will evolve accordingly.
>
>   
>> It is just too hard to get motion to happen in the TDWG community.
>>     
>
> People make the same complaint about another organization that I'm involved
> with (ICZN).  But here's the thing: as in the case of nomenclature,
> stability in itself can be a very important thing.  If DwC changed every six
> months, then by the time people developed software apps to work with it,
> those apps would already be obsolete.  If someone writes code that consumes
> DwC content as expressed in the current version of DwC, then that code may
> break if people start providing content with class:individual and
> class:token content.  If our community is going to move forward
> successfully, I think standards like DwC need to evolve in a punctuated way,
> rather than a gradualist way (same goes for the Codes of nomenclature). That
> is, a bit of inertia in the system is probably a good thing.
>
>   
>> OK, I've now gone on for eight pages of text explaining the
>> rationale behind the question.  So I'll return to the basic
>> question: is the consensus for modeling the relationship
>> between an Occurrence and associated token(s) the assumed
>> token model:
>>
>>       http://bioimages.vanderbilt.edu/pages/token-assumed.gif
>>
>>       or the explicit token model:
>>
>>       http://bioimages.vanderbilt.edu/pages/token-explicit.gif
>>
>>       ?
>>     
>
> Here's how I would answer:  When modelling my own databases, tracking my own
> content, I would *definitely* (and indeed already have, for a long time now)
> go with the token-explicit.
>
> But when deciding on a community data exchange standard (i.e., DwC),
> compromise between flat and normalized is still a necessity, and as such,
> the answer in terms of modifying DwC needs to take into account the form of
> the bulk of the existing content, the needs of the bulk of the existing
> users/consumers, and the virtues of stability of Standards in a world where
> software app development time stretches for months or years.
>
> Maybe the answer to this is to treat different versions of DwC as
> concurrent, rather than serial.  That is, as long as the next most
> sophisticated version can easily be "collapsed" to all previous versions
> (aka, backward compatibility), then maybe we just need a clear mechanism for
> consuming applications to indicate desired DwC version. That way, apps
> developed to work with v2.1 can indicate to a provider that is capable of
> producing v3.6 content, that they want it in v2.1 format.  Assuming we
> maintain backward compatibility (i.e., the more-normalized version can be
> easily collapsed to the more flattened version), then it should be a very
> simple matter for the content provider to stream the same content in v2.1
> format.
>   
Yes, I agree about this concept.  I think that what I'm really 
advocating for is that we agree on what the most normalized model is 
that will connect all of the existing Darwin Core classes and terms.  In 
that sense, when I'm asking for Individual to be accepted as a class, 
I'm not arguing for a "new" thing, I'm arguing for a clarification of 
what we mean when we use the existing term dwc:individualID.  When I'm 
asking for terms to facilitate a logically consistent way to connect 
Occurrences with their tokens, I'm also not really asking for an 
expansion of Darwin Core, I'm asking for a more consistent model than 
"subclassing" by using associatedMedia and associatedSequences but not 
using "associatedSpecimens".  I think that this is important because if 
we don't agree on these things, we are going to have a royal mess on our 
hands if we start trying to develop an RDF guide for Darwin 
Core.  As an eternal optimist, I think that describing a fully 
normalized model that can be translated into RDF can be achieved with 
only a few minor additions to the existing terms as opposed to requiring 
a complete new version.  If we really need to completely rewrite Darwin 
Core for RDF I don't have any delusions that it will be accomplished 
before I retire. 
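
As a rough sketch of what "collapsing" a more normalized record to a flatter
one could look like, assume each token is a small record with a "tokenID" and
a "type", and that only image and sequence tokens have a matching flat term.
The function name, the type tags, the identifiers, and the " | " list
separator are assumptions made for illustration.

```python
from typing import Dict, List

# Assumed mapping from token type tags to existing flat Darwin Core terms.
FLAT_TERM_FOR_TYPE = {
    "StillImage": "associatedMedia",
    "Sequence": "associatedSequences",
}


def collapse(occurrence: Dict[str, str], tokens: List[Dict[str, str]]) -> Dict[str, str]:
    """Collapse one normalized occurrence and its tokens into one flat record."""
    flat = dict(occurrence)  # copy so the normalized input is left untouched
    for token in tokens:
        term = FLAT_TERM_FOR_TYPE.get(token["type"])
        if term is None:
            continue  # no flat term for this token type; its link is dropped
        if term in flat:
            flat[term] += " | " + token["tokenID"]
        else:
            flat[term] = token["tokenID"]
    return flat


record = collapse(
    {"occurrenceID": "occ-1", "recordedBy": "A. Collector"},
    [
        {"tokenID": "img-1", "type": "StillImage"},
        {"tokenID": "img-2", "type": "StillImage"},
        {"tokenID": "seq-1", "type": "Sequence"},
        {"tokenID": "seed-1", "type": "Seed"},
    ],
)
```

The seed token has no flat counterpart in this sketch and is lost in the
collapse, which is the kind of gap a consistent token model would have to
address.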
> But now I'm dabbling in areas that are WAY outside my scope of expertise...
>
> Anyway...I would reiterate that I, for one, appreciate that you took the
> time to write all this down (took me over 3 hours to read & respond -- so
> obviously I care! -- of course, I'm waiting for a taxi to go to the airport,
> so really not much else for me to do right now).  If I didn't reply to parts
> of your message, it was either because I agreed with you and had nothing to
> elaborate or expound upon, or I didn't really understand (e.g., all the rdf
> stuff).
>   
Again, thanks for taking the time to read and comment.

Steve

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu
