[tdwg-content] Treatise on Occurrence, tokens, and basisOfRecord

Steve Baskauf steve.baskauf at vanderbilt.edu
Wed Oct 27 12:42:17 CEST 2010

Please note that in various examples, I have incorrectly placed rdf:type 
in the namespace rdfs: (http://www.w3.org/2000/01/rdf-schema#) rather 
than rdf: (http://www.w3.org/1999/02/22-rdf-syntax-ns#).  Thanks to Bob 
for pointing out this serious error. 
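
To make the difference concrete, here is a minimal Python sketch (standard library only) that expands a prefixed name into a full IRI using the two namespace IRIs quoted above.  The names PREFIXES and expand are my own illustrations and do not come from any RDF library.

```python
# Namespace IRIs exactly as given in the correction above.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Hypothetical prefix table for this sketch.
PREFIXES = {"rdf": RDF, "rdfs": RDFS}


def expand(qname: str) -> str:
    """Expand a prefixed name such as 'rdf:type' into a full IRI."""
    prefix, local = qname.split(":", 1)
    return PREFIXES[prefix] + local


print(expand("rdf:type"))   # the intended term
print(expand("rdfs:type"))  # a different IRI, not a term defined by RDF Schema
```

The two prefixed names expand to different IRIs, so software that compares IRIs will not treat a misplaced "rdfs:type" as rdf:type.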

Also, the ACS model information is very cool.  I wish I'd seen it a long 
time ago.  I especially like the giant relationship chart.  Thanks Stan 
and Rich.

Steve Baskauf wrote:
> Rich,
> Thanks for taking the time to read the whole thing.  Based on the 
> first series of comments you made, it seems as though we are in 
> agreement on most points.  I think that what I wrote was (as I had 
> anticipated) somewhat less clear due to my use (or failure to use) 
> some appropriate terms to describe what I was talking about.  For 
> example, when I said "atomized" I probably should have said something 
> like "fine-grained" and correct use of the term "normalized" would 
> have helped.  Some other comments inline:
> Richard Pyle wrote:
>>> I believe that historically the assumed token model has been
>>> the one which most people have had in mind.
>> Actually, I've always envisioned it as you have in your token-explicit
>> version (and have said as much at various meetings to discuss DwC, going
>> back to 1.0).  In fact, I remember discussing this exact issue with Stan
>> Blum long before DwC existed (he was the first to suggest to me the term
>> "evidence" in this context -- which I think is functionally equivalent to
>> your "token"). However, I've conceded that this level of normalization
>> would probably be too much for the intended purpose of the DwC terms.  But
>> I'll keep an open mind on that.
>>> Before the new DwC standard, we had specimens and we had
>>> observations.  In order to avoid redundancies in terms for
>>> those two types of "things", a combined "thing" called
>>> "Occurrence" was created.  An Occurrence that was an
>>> observation didn't have a token and an Occurrence that
>>> was a specimen had a physical or living specimen as its
>>> token.
>> My rationalization of it in the early days (pre-DwC) was that *everything*
>> was effectively an observation, and beyond that, the only question was a
>> matter of evidence.  In my earliest models, I categorized "evidence" into
>> "Specimen", "Image", "Literature Report", and "Unvouchered Observation" (I
>> was using the word "voucher" in the general sense, as in the verb "to vouch"
>> -- not in the more specific sense for our community, which implies "Specimen
>> preserved in Museum").  My read on the history of DwC is that it was
>> initially established as a means to aggregate and/or share Specimen data
>> amongst Museums (hence its Specimen-centric nature).  Later, the
>> Specimen/Observation dichotomy was introduced so that DwC content could give
>> more sophisticated and complete representations of the occurrence of
>> organisms in place and time, because there was much more information than
>> what existed as specimens in Museums.  In my mind, the "Observation" side
>> was effectively a collapsing of my "Image", "Literature Report" and
>> "Unvouchered Observation" -- which I was OK with in the context of the time.
>> Because at the time, the vast majority of content available in computer
>> databases came from museum specimen databases, and from observational
>> databases (largely in the bird realm).
> Well, I'm not surprised that the ideas that I'm trying to put down in 
> words and diagrams predate my entry into this arena a year and a half 
> ago.  What is a bit frustrating to me is that ideas like these aren't 
> laid out in an easy-to-understand fashion and placed in easy-to-find 
> places.  I have spent much of that last year and a half trying to 
> understand how the whole TDWG/DwC universe is supposed to fit 
> together.  I think that the idea of having the Google Code site where 
> there are explanations and examples for the various DwC terms is the 
> kind of thing we need.  Unfortunately, most of the terms do not yet 
> have entries there.  Perhaps I'm just impatient.  If it turns out that 
> any of the summaries that I've written here accurately reflect any 
> kind of consensus, then maybe someone could "clean them up" (i.e. use 
> correct technical terms after giving definitions of what they mean) 
> and paste them somewhere where people can find them.  That would 
> prevent another person 10 years from now re-articulating the same 
> ideas a third time.  I'm particularly thinking of the summary diagram 
> http://bioimages.vanderbilt.edu/pages/token-explicit.gif along with an 
> explanation of how people use the more normalized and more flattened 
> versions of it.  We already do have quite lucid examples in the Simple 
> Darwin Core (flattened) and Darwin Core XML guide (normalized), but 
> some sort of overview of the big picture might be helpful.  If an RDF 
> guide ever gets off the ground, that would be another example of how 
> the relationships assumed in DwC are expressed in a very explicit way. 
>> So...I see the current iteration of DwC as another step in the evolution of
>> moving from "sharing and aggregating specimen data among museums" to
>> "documenting biodiversity in nature".  It's not all the way into the fully
>> normalized representation of biodiversity data, but it's far enough that it
>> is a nice compromise between practical and effective for the majority of the
>> user constituency. In my mind, the next logical step in this evolutionary
>> trajectory would be to recognize "Individual" as a class (which DwC is
>> already primed for, via individualID).
> I think I understand the message that you are trying to convey above 
> and in your later comments about creating new versions of DwC (or new 
> evolutionary states of DwC) that don't break the previous ones.  I 
> think that is one reason why the process of examining and clearly 
> articulating the community consensus on what Darwin Core terms and 
> classes "mean" and how they are connected to each other is so 
> important before we embark on implementing GUIDs and RDF.  Pete has 
> suggested that we may need a second version of DwC in order to make it 
> work in the Linked Open Data world and he's probably right.  I'm not 
> sure that the existing vocabulary has all of the terms we need to do 
> that.  However, if we are going to "evolve" Darwin Core so that it 
> will work in the LOD world, I hope that we do it in such a way that we 
> maintain the same "meaning" of things as Darwin Core 1.0.  I think 
> that is the way to maintain the kind of "stability" that you described 
> below. 
>>> Unlike specimens where the token's metadata terms are placed in the
>>> Occurrence class, I guess in the case of an image one is supposed
>>> to use associatedMedia to link the so-called MachineObservation to
>>> the image record.  If DNA were extracted, one would link the
>>> sequence to the Occurrence using associatedSequences (although
>>> it's not clear to me what the basisOfRecord for that would be -
>>> "TookATissueSample"?).  But what does one do for other kinds of
>>> tokens, like seeds or tissue samples - create terms like
>>> associatedSeed and associatedTissueSample?
>> In my mind, things like seeds, tissue samples, and DNA sequences are simply
>> different kinds of specimens (just like dried skeletons vs. botanical
>> pressed sheets vs. whole organisms in jars of alcohol vs. prepared skins,
>> etc.)  They may have certain properties specific to each subclass of
>> specimen, but fundamentally I think it's fair to treat them as specimens.
>> DNA sequences are a bit different, of course, because they are not the
>> "stuff" of an organism, but rather an indirect representation of the
>> "stuff".  In my mind, that difference justifies associatedSequences, where
>> we don't have associatedSeeds, associatedTeeth, associatedSkins,
>> associatedSkeletons, etc.
> Your point is well taken in that we don't need a proliferation of 
> types of associated tokens.  We need as many different token "types" 
> as we have coherent sets of metadata terms.  One of the points of 
> typing resources is to let potential users know what kinds of metadata 
> properties (terms) they can reasonably expect to receive about that 
> resource.  If one will receive the same set of properties about two 
> kinds of resources (e.g. skins and skeletons), there is no reason to 
> type them differently.  The point that I was trying to get at 
> (eventually) was that it was inconsistent to say that images need to 
> be referenced as associatedMedia and sequences needed to be referenced 
> as associatedSequences, and yet not say that specimens needed to be 
> referenced as "associatedSpecimens".  I actually think that based on 
> Roger's explanation of "to subclass or not" 
> (http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot), it makes more 
> sense to talk about using a generic "hasToken" or "tokenID" along with 
> "tagging" the token using rdf:type (as I suggested toward the end of 
> my "treatise") rather than a bunch of associatedXXXX terms. 
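
As a rough illustration of the generic-link idea in the paragraph above, here is a minimal Python sketch.  It assumes one "hasToken" link per Occurrence and a type tag on each token.  The class names, field names, identifiers, and type labels are all hypothetical and are not Darwin Core terms.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    """A piece of evidence; every name in this sketch is hypothetical."""
    token_id: str
    rdf_type: str  # the "tag", e.g. "Specimen", "StillImage", "Sequence"


@dataclass
class Occurrence:
    occurrence_id: str
    # One generic link replaces a family of associatedXXXX terms.
    has_token: List[Token] = field(default_factory=list)

    def tokens_of_type(self, rdf_type: str) -> List[Token]:
        """Select tokens by their type tag rather than by a dedicated term."""
        return [t for t in self.has_token if t.rdf_type == rdf_type]


occ = Occurrence("occ-1", [
    Token("specimen-1", "Specimen"),
    Token("image-1", "StillImage"),
    Token("sequence-1", "Sequence"),
])
print([t.token_id for t in occ.tokens_of_type("StillImage")])
```

With this shape, adding a new kind of token (a seed or a tissue sample, say) needs only a new type label and no new linking term.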
>>> If we accept the explicit token model, then as a biodiversity
>>> informatics resource type "observation" will have to disappear
>>> into a puff of nothingness
>> Not necessarily.  See my comment earlier about patterns on neurons in a
>> human brain that constitute a memory.  Just as a digital image rendered on a
>> hard disk requires certain machinery to convert into photons that strike our
>> retinas (i.e., a computer and monitor), so too does a memory require such
>> machinery (e.g., the brain itself, transmission of sound waves via vocal
>> cords, sound waves striking eardrums, etc.)  This may sound weird, but I'm
>> being serious: a human memory is, fundamentally, every bit as much of a
>> "token" as a specimen or a digital image.  It's just considerably less
>> accessible and well-resolved.
> I guess I'm thinking about this in terms of a token being something to 
> which we can assign an identifier and retrieve a representation (a la 
> representational state transfer).  Although I don't deny the existence 
> of memory patterns in neurons that are associated with a 
> HumanObservation, there isn't any way that we can receive a 
> representation of that memory directly.  If the person draws a sketch 
> of what he/she remembers, then we have a media item that we can 
> convert into a digital form and transmit through the Internet (a 
> token).  If the person types up notes, then we have a text document  
> (a token that can also be delivered as a digital file or scan of 
> typewritten page).  On the other hand, if the person simply records 
> the values of recordedBy, eventDate, and Location terms, then we have 
> only Occurrence metadata (no token).  If someone claims 
> "basisOfRecord=HumanObservation" and has no token of any kind, then 
> what is there that is deliverable other than the basic Occurrence 
> metadata?  That's why I'm claiming that basisOfRecord=HumanObservation 
> simply corresponds to an Occurrence record with no token.
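
The claim in the paragraph above can be sketched in a few lines of Python.  This is an illustration only: the "tokens" key, the token fields, and the sample values are hypothetical, while recordedBy and eventDate are the terms named in the text.

```python
def is_tokenless(record: dict) -> bool:
    """True when a record carries Occurrence metadata but no token."""
    return not record.get("tokens")


# Hypothetical sample records for this sketch.
sighting = {
    "recordedBy": "A. Observer",
    "eventDate": "2010-10-27",
    "tokens": [],  # nothing deliverable beyond the metadata itself
}
sketched = dict(
    sighting,
    tokens=[{"tokenID": "sketch-1", "type": "StillImage"}],
)

for record in (sighting, sketched):
    label = "HumanObservation (no token)" if is_tokenless(record) else "has token(s)"
    print(label)
```

Under this reading, basisOfRecord=HumanObservation carries no information beyond the fact that the token list is empty.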
>>> Realistically, I can't see this kind of separation ever happening,
>>> given the amount of trouble it's been just to get a few people
>>> to admit that Individuals exist.
>> I don't think the issue was ever in convincing people that Individuals exist
>> -- that much, I think, was clear to everyone (as proof: see
>> dwc:individualID).  The issue was always more about where the current DwC
>> should lie on the scale of highly flattened (e.g., DwC 1.0) to highly
>> normalized (e.g., ABCD and CDM).  It's necessarily a compromise between
>> modelling the information "as it really is", vs. modelling the information
>> in a way that's both accessible to the majority of content providers, and
>> useful to the majority of content consumers.  I think we both understand
>> what the trade-offs are in either direction. The question is, what is the
>> "sweet spot" for the majority of our community at this time in history?
>> I would venture that at the time DwC 1.0 was developed, that hit the sweet
>> spot reasonably well.  As more content holders develop increasingly
>> sophisticated DBMS for their content, and as the user community delves into
>> increasingly sophisticated analyses of the data, the "sweet spot" will shift
>> from the flattened end of the scale to the normalized end of the scale. And,
>> I would hope, DwC will evolve accordingly.
>>> It is just too hard to get motion to happen in the TDWG community.
>> People make the same complaint about another organization that I'm involved
>> with (ICZN).  But here's the thing: as in the case of nomenclature,
>> stability in itself can be a very important thing.  If DwC changed every six
>> months, then by the time people developed software apps to work with it,
>> those apps would already be obsolete.  If someone writes code that consumes
>> DwC content as expressed in the current version of DwC, then that code may
>> break if people start providing content with class:individual and
>> class:token content.  If our community is going to move forward
>> successfully, I think standards like DwC need to evolve in a punctuated way,
>> rather than a gradualist way (same goes for the Codes of nomenclature). That
>> is, a bit of inertia in the system is probably a good thing.
>>> OK, I've now gone on for eight pages of text explaining the
>>> rationale behind the question.  So I'll return to the basic
>>> question: is the consensus for modeling the relationship
>>> between an Occurrence and associated token(s) the assumed
>>> token model:
>>>       http://bioimages.vanderbilt.edu/pages/token-assumed.gif
>>>       or the explicit token model:
>>>       http://bioimages.vanderbilt.edu/pages/token-explicit.gif
>>>       ?
>> Here's how I would answer:  When modelling my own databases, tracking my own
>> content, I would *definitely* (and indeed already have, for a long time now)
>> go with the token-explicit.
>> But when deciding on a community data exchange standard (i.e., DwC),
>> compromise between flat and normalized is still a necessity, and as such,
>> the answer in terms of modifying DwC needs to take into account the form of
>> the bulk of the existing content, the needs of the bulk of the existing
>> users/consumers, and the virtues of stability of Standards in a world where
>> software app development time stretches for months or years.
>> Maybe the answer to this is to treat different versions of DwC as
>> concurrent, rather than serial.  That is, as long as the next most
>> sophisticated version can easily be "collapsed" to all previous versions
>> (aka, backward compatibility), then maybe we just need a clear mechanism for
>> consuming applications to indicate desired DwC version. That way, apps
>> developed to work with v2.1 can indicate to a provider that is capable of
>> producing v3.6 content, that they want it in v2.1 format.  Assuming we
>> maintain backward compatibility (i.e., the more-normalized version can be
>> easily collapsed to the more flattened version), then it should be a very
>> simple matter for the content provider to stream the same content in v2.1
>> format.
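
The "collapsing" described above might look something like the following Python sketch.  It assumes a normalized record holds a nested individual and a list of typed tokens, and it folds them into flat terms.  The record layout, the type labels, the mapping table, and the " | " separator are assumptions of this sketch.  Only individualID, associatedMedia, and associatedSequences are terms mentioned in this thread.

```python
# Hypothetical mapping from a token's type tag to the flat term that holds it.
FLAT_TERM_FOR_TYPE = {
    "StillImage": "associatedMedia",
    "Sequence": "associatedSequences",
}


def collapse(normalized: dict) -> dict:
    """Collapse a normalized record into a flat dictionary of terms."""
    # Copy every simple field; nested parts are handled separately below.
    flat = {k: v for k, v in normalized.items()
            if k not in ("individual", "tokens")}

    individual = normalized.get("individual")
    if individual:
        flat["individualID"] = individual["individualID"]

    # Group token identifiers under the flat term for their type.
    # Token types with no flat term are simply dropped in this sketch.
    grouped = {}
    for token in normalized.get("tokens", []):
        term = FLAT_TERM_FOR_TYPE.get(token["type"])
        if term is not None:
            grouped.setdefault(term, []).append(token["tokenID"])
    for term, ids in grouped.items():
        flat[term] = " | ".join(ids)
    return flat


example = {
    "occurrenceID": "occ-1",
    "individual": {"individualID": "ind-1"},
    "tokens": [
        {"tokenID": "image-1", "type": "StillImage"},
        {"tokenID": "image-2", "type": "StillImage"},
        {"tokenID": "sequence-1", "type": "Sequence"},
    ],
}
print(collapse(example))
```

The collapse goes one way only: the flat output no longer records which properties belonged to which token, so the normalized form has to remain the master copy.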
> Yes, I agree about this concept.  I think that what I'm really 
> advocating for is that we agree on what the most normalized model is 
> that will connect all of the existing Darwin Core classes and terms.  
> In that sense, when I'm asking for Individual to be accepted as a 
> class, I'm not arguing for a "new" thing, I'm arguing for a 
> clarification of what we mean when we use the existing term 
> dwc:individualID.  When I'm asking for terms to facilitate a logically 
> consistent way to connect Occurrences with their tokens, I'm also not 
> really asking for an expansion of Darwin Core, I'm asking for a more 
> consistent model than "subclassing" by using associatedMedia and 
> associatedSequences but not using "associatedSpecimens".  I think that 
> this is important because if we don't agree on these things, we are 
> going to have a royal mess on our hands if we start trying to 
> develop an RDF guide for Darwin Core.  As an eternal optimist, I think 
> that describing a fully normalized model that can be translated into 
> RDF can be achieved with only a few minor additions to the existing 
> terms as opposed to requiring a complete new version.  If we really 
> need to completely rewrite Darwin Core for RDF I don't have any 
> delusions that it will be accomplished before I retire. 
>> But now I'm dabbling in areas that are WAY outside my scope of expertise...
>> Anyway...I would reiterate that I, for one, appreciate that you took the
>> time to write all this down (took me over 3 hours to read & respond -- so
>> obviously I care! -- of course, I'm waiting for a taxi to go to the airport,
>> so really not much else for me to do right now).  If I didn't reply to parts
>> of your message, it was either because I agreed with you and had nothing to
>> elaborate or expound upon, or I didn't really understand (e.g., all the rdf
>> stuff).
> Again, thanks for taking the time to read and comment.
> Steve
> -- 
> Steven J. Baskauf, Ph.D., Senior Lecturer
> Vanderbilt University Dept. of Biological Sciences
> postal mail address:
> VU Station B 351634
> Nashville, TN  37235-1634,  U.S.A.
> delivery address:
> 2125 Stevenson Center
> 1161 21st Ave., S.
> Nashville, TN 37235
> office: 2128 Stevenson Center
> phone: (615) 343-4582,  fax: (615) 343-6707
> http://bioimages.vanderbilt.edu
