[tdwg-content] Occurrences, Organisms, and CollectionObjects: a review

Dag Endresen (GBIF) dendresen at gbif.org
Tue Sep 13 12:17:52 CEST 2011

 Hi Markus,

 I believe that the discussion here originates from the view that the 
 "CollectionObject"/"Sample" is a different thing from the "Organism" - 
 and that there can be a relationship between CollectionObjects/Samples 
 and Organisms that could be difficult to describe if these things are 
 identified as the same thing (occurrenceID). Do you think that the 
 "Occurrence" would be seen as a thing different from the proposed 
 CollectionObject/Sample and Organism - or as a super-class that would 
 include CollectionObjects/Samples and Organisms? Would the semantics of 
 Occurrence change?

 I fully share your view that the Darwin Core Archive (DwC-A) would not 
 be suited to share the full complex relationship between entities - even 
 if persistent identifiers were used. However, if we start to describe 
 and include core types other than taxon and occurrence, then perhaps 
 the DwC-A could be a useful way to provide a 
 simple list of these entities? This could perhaps provide easier 
 indexing and discovery of these new entities?
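
 As a rough illustration of what such a "simple list" could look like, 
 here is a minimal Python sketch. It assumes that each flat occurrence 
 row carries identifiers for the other entities; the column names 
 organismID and collectionObjectID, the sample rows, and the helper 
 list_entities are all hypothetical and used only for illustration.

```python
from collections import defaultdict

# Hypothetical flat occurrence rows. The organismID and collectionObjectID
# columns are assumptions made for this sketch, not agreed Darwin Core terms.
flat_rows = [
    {"occurrenceID": "occ-1", "organismID": "org-1",
     "collectionObjectID": "obj-1", "scientificName": "Quercus alba"},
    {"occurrenceID": "occ-2", "organismID": "org-1",
     "collectionObjectID": "obj-2", "scientificName": "Quercus alba"},
    {"occurrenceID": "occ-3", "organismID": "org-2",
     "collectionObjectID": "obj-3", "scientificName": "Acer rubrum"},
]


def list_entities(rows, id_field):
    """Return one entry per distinct identifier in id_field,
    mapped to the occurrenceIDs of the rows that mention it."""
    index = defaultdict(list)
    for row in rows:
        index[row[id_field]].append(row["occurrenceID"])
    return dict(index)


# One simple list per entity type, each of which could be indexed separately.
organisms = list_entities(flat_rows, "organismID")
collection_objects = list_entities(flat_rows, "collectionObjectID")

for organism_id, occurrence_ids in organisms.items():
    print(organism_id, occurrence_ids)
```

 Each resulting list only says which entities exist and which 
 occurrences mention them; it does not try to carry the full set of 
 relationships between the entities.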


 On Tue, 13 Sep 2011 10:03:00 +0200, Markus Döring (GBIF) wrote:
> I have to say that the change in semantics to the Occurrence class
> makes me a bit nervous.
> Can someone try to help fight my fears?
> DarwinCore has no versioning of namespaces, so there is no way for a
> consumer to detect whether it's an old-style Occurrence or a new one. I
> am currently parsing various RSS feeds, and even though it's a mess
> having to parse 10 different styles, I am glad that at least the
> designers made sure they all have their own namespace! Also, removing
> or renaming terms might cause serious problems. Would discrete versions
> of dwc with their own namespace hurt?
> Another observation relates to dwc archives and their star schema. As
> an index to data that has been flattened, there is no problem with more
> classes and core row types, but if you want it as a way to transfer
> complete normalized data, it will not work. But that never really was
> the intention, and I simply wanted to stress that fact.
> Markus
> On Sep 9, 2011, at 4:52 PM, Steve Baskauf wrote:
>> Richard Pyle wrote:
>>> I'm also wondering if we necessarily need to "break" the traditional
>>> view of the "Occurrence" class in order to implement Organism and
>>> CollectionObject. As long as we keep in mind that DwC is a vocabulary
>>> of terms focused on representing an exchange standard (rather than a
>>> full-blown Ontology), perhaps Occurrence records can continue to be
>>> represented in the traditional way as "flat" content, but the
>>> Organism and CollectionObject classes allow us to present data in a
>>> somewhat more "normalized" way in those circumstances that call for
>>> it (e.g. tracking individuals or groups over time [Organism], or
>>> managing fossil rocks with multiple taxa [CollectionObject] -- to
>>> name just two).
>> I've been thinking about this issue of "backward compatibility" with
>> respect to Occurrences if the CollectionObject/Sample/Token/whatever
>> class is adopted.  I really don't think it is going to be as big of a
>> deal as people are making it out to be.
>> It seems to me that the main problems arise in two areas: when one
>> wants to be clear about typing and when one wants to express
>> relationships in a system where it is possible to do so through
>> semantics (like RDF).  In that kind of circumstance, it's bad (oh
>> yeah, I forgot - the term is "naughty") to say something like
>> resourceA hasOccurrence resourceB
>> when resourceB isn't actually an Occurrence.  "Wrong" typing also
>> happens all the time because the classes don't exist (yet) to do the
>> typing correctly.  As a case in point, in the Morphbank system, I have
>> multiple images of the same tree.  In that system the tree is typed as
>> a "specimen".  That is totally wrong because the tree isn't a
>> specimen, but what else is it going to be typed as?  There isn't (yet)
>> an appropriate class to put it in.
>> Although these two problems (wrong typing and using a term with the
>> wrong kind of object, which are actually different manifestations of
>> the same class-based problem) are naughty, realistically very few
>> people are actually using a system that is "semantic-aware" (e.g.
>> serving and consuming RDF), so right now making those mistakes doesn't
>> really "break" anything.  Most data providers are using traditional
>> databases or even Excel spreadsheets where the DwC terms are just
>> column headings with no real "meaning" other than what the data
>> managers intend for them to mean.  So if a manager has a table where
>> each line contains a record for a specimen and has a column heading
>> for a column entitled "dwc:catalogNumber", there isn't really anything
>> other than an idea in the manager's head that the catalogNumber is a
>> property of a specimen or Occurrence or CollectionObject.  If each
>> line in the database table is "flat" such that one specimen=one
>> CollectionObject=one Occurrence, all that is required to make
>> catalogNumber be a property of a CollectionObject instead of an
>> Occurrence is a different way of thinking in the manager's mind,
>> because there are really no semantics embedded in the table.  We are
>> already doing this kind of mental gymnastics with existing classes
>> like dwc:Identification.  If our hypothetical database manager has a
>> column heading that says "dwc:identifiedBy" in the specimen table,
>> that is really a property of dwc:Identification, not dwc:Occurrence,
>> but again that is a distinction that is only going to be made in the
>> manager's mind.  Making the distinction really only becomes an issue
>> when the database stops being "flat" for a particular relationship,
>> e.g. if the database wants to allow multiple Identifications per
>> specimen record.  Then the database structure must be changed
>> accordingly to accommodate that "normalization".
>> What we have here at the present moment is a situation where data
>> providers don't have any way to have anything but "flat" records where
>> 1 specimen=1 Occurrence=1 Organism.  By adding the Organism and
>> CollectionObject classes, we allow people who need or want to have
>> less "flat" (=more "normalized") databases to have something to call
>> the entities that are represented by the new tables they create to
>> handle 1:many relationships instead of 1:1 relationships.  Anybody who
>> only cares about 1:1 relationships really doesn't need to worry about
>> the fact that the new class exists, just as people currently don't
>> have to worry about the Identification class if they only allow one
>> Identification per specimen in their database.
>> So I guess what I'm saying is that if a database manager has a table
>> labeled Occurrence, they really don't have to freak out if we now tell
>> them that their table actually should be labeled CollectionObject as
>> long as there is only one CollectionObject per Occurrence.  They
>> didn't freak out before when we told them that they should call their
>> table "Occurrence" instead of "Observation" or "Specimen" in 2009, did
>> they?
>> I think what I'm saying here is what Rich was trying to say in the
>> paragraph I quoted, but I'm not sure.
>> Steve
>> --
>> Steven J. Baskauf, Ph.D., Senior Lecturer
>> Vanderbilt University Dept. of Biological Sciences
>> postal mail address:
>> VU Station B 351634
>> Nashville, TN  37235-1634,  U.S.A.
>> delivery address:
>> 2125 Stevenson Center
>> 1161 21st Ave., S.
>> Nashville, TN 37235
>> office: 2128 Stevenson Center
>> phone: (615) 343-4582,  fax: (615) 343-6707
>> http://bioimages.vanderbilt.edu
>> _______________________________________________
>> tdwg-content mailing list
>> tdwg-content at lists.tdwg.org
>> http://lists.tdwg.org/mailman/listinfo/tdwg-content
