[tdwg-content] Occurrences, Organisms, and CollectionObjects: a review

"Markus Döring (GBIF)" mdoering at gbif.org
Tue Sep 13 10:03:00 CEST 2011

I have to say that the change in semantics to the Occurrence class makes me a bit nervous.
Can someone help me fight my fears?

DarwinCore has no versioning of namespaces, so there is no way for a consumer to detect whether it's an old-style Occurrence or a new one. I am currently parsing various RSS feeds, and even though it's a mess having to parse 10 different styles, I am glad that at least the designers made sure they all have their own namespace! Also, removing or renaming terms might cause serious problems. Would discrete versions of dwc with their own namespaces hurt?
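
To make the point concrete, here is a rough sketch of how a consumer could tell the two styles apart if each version had its own namespace. The two namespace URIs and the function name occurrence_style are made up for illustration; DarwinCore defines no such versioned namespaces today.

```python
import xml.etree.ElementTree as ET

# Hypothetical versioned namespaces (assumption): DarwinCore does not define these.
KNOWN_VERSIONS = {
    "http://example.org/dwc/2009/": "old-style Occurrence",
    "http://example.org/dwc/2011/": "new-style Occurrence",
}


def occurrence_style(xml_text):
    """Return the Occurrence style implied by the root element's namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree exposes qualified tags as "{namespace}localname".
    if not root.tag.startswith("{"):
        return "unknown"
    namespace, _, local_name = root.tag[1:].partition("}")
    if local_name != "Occurrence":
        return "unknown"
    return KNOWN_VERSIONS.get(namespace, "unknown")


# Two invented records that differ only in their namespace.
old_record = '<dwc:Occurrence xmlns:dwc="http://example.org/dwc/2009/"/>'
new_record = '<dwc:Occurrence xmlns:dwc="http://example.org/dwc/2011/"/>'

print(occurrence_style(old_record))
print(occurrence_style(new_record))
```

With a single unversioned namespace, both records would carry the same URI and a lookup like this would have nothing to distinguish them by.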

Another observation relates to dwc archives and their star schema. As an index to data that has been flattened, there is no problem with more classes and core row types, but as a way to transfer complete normalized data it will not work. That never really was the intention, though, and I simply wanted to stress that fact.
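
A minimal sketch of what the star schema amounts to, with invented rows and a hypothetical helper join_star: one core table, plus extension rows that each point back at exactly one core row.

```python
from collections import defaultdict

# Core rows keyed by a core id; all identifiers and values are invented.
core_rows = {
    "occ-1": {"catalogNumber": "A100"},
    "occ-2": {"catalogNumber": "A101"},
}

# Extension rows: each one references a single core row through "coreid".
identification_rows = [
    {"coreid": "occ-1", "scientificName": "Quercus alba"},
    {"coreid": "occ-1", "scientificName": "Quercus stellata"},
    {"coreid": "occ-2", "scientificName": "Acer rubrum"},
]


def join_star(core, extension):
    """Attach extension rows to the core rows they point at."""
    grouped = defaultdict(list)
    for row in extension:
        # Drop the link column; keep the remaining fields of the extension row.
        fields = {key: value for key, value in row.items() if key != "coreid"}
        grouped[row["coreid"]].append(fields)
    return {
        core_id: {**fields, "identifications": grouped.get(core_id, [])}
        for core_id, fields in core.items()
    }


joined = join_star(core_rows, identification_rows)
print(joined["occ-1"])
```

Every link runs from an extension row to the core, so a relationship between two non-core tables has nowhere to live in this layout. That is the limit I mean for fully normalized data.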


On Sep 9, 2011, at 4:52 PM, Steve Baskauf wrote:

> Richard Pyle wrote:
>> I'm also wondering if we necessarily need to "break" the traditional view of
>> the "Occurrence" class in order to implement Organism and CollectionObject.
>> As long as we keep in mind that DwC is a vocabulary of terms focused on
>> representing an exchange standard (rather than a full-blown Ontology),
>> perhaps Occurrence records can continue to be represented in the traditional
>> way as "flat" content, but the Organism and CollectionObject classes allow
>> us to present data in a somewhat more "normalized" way in those
>> circumstances that call for it (e.g. tracking individuals or groups over
>> time [Organism], or managing fossil rocks with multiple taxa
>> [CollectionObject] -- to name just two).
> I've been thinking about this issue of "backward compatibility" with 
> respect to Occurrences if the CollectionObject/Sample/Token/whatever 
> class is adopted.  I really don't think it is going to be as big of a 
> deal as people are making it out to be. 
> It seems to me that the main problems arise in two areas: when one wants 
> to be clear about typing and when one wants to express relationships in 
> a system where it is possible to do through semantics (like RDF).  In 
> that kind of circumstance, it's bad (oh yeh, I forgot - the term is 
> "naughty") to say  something like
> resourceA hasOccurrence resourceB
> when resourceB isn't actually an Occurrence.   "Wrong" typing also 
> happens all the time because the classes don't exist (yet) to do the 
> typing correctly.  As a case in point, in the Morphbank system, I have 
> multiple images of the same tree.  In that system the tree is typed as a 
> "specimen".  That is totally wrong because the tree isn't a specimen, 
> but what else is it going to be typed as?  There isn't (yet) an 
> appropriate class to put it in. 
> Although these two problems (wrong typing and using a term with the 
> wrong kind of object, which are actually different manifestations of the 
> same class-based problem) are naughty, realistically very few people are 
> actually using a system that is "semantic-aware" (e.g. serving and 
> consuming RDF) so right now making those mistakes doesn't really "break" 
> anything.  Most data providers are using traditional databases or even 
> Excel spreadsheets where the DwC terms are just column headings with no 
> real "meaning" other than what the data managers intend for them to 
> mean.  So if a manager has a table where each line contains a record for 
> a specimen and has a column heading for a column entitled 
> "dwc:catalogNumber", there isn't really anything other than an idea in 
> the manager's head that the catalogNumber is a property of a specimen or 
> Occurrence or CollectionObject.  If each line in the database table is 
> "flat" such that one specimen=one CollectionObject=one Occurrence, all 
> that is required to make catalogNumber be a property of a 
> CollectionObject instead of an Occurrence is a different way of thinking 
> in the manager's mind because there are really no semantics embedded in 
> the table.  We are already doing this kind of mental gymnastics with 
> existing classes like dwc:Identification .  If our hypothetical database 
> manager has a column heading that says "dwc:identifiedBy" in the 
> specimen table, that is really a property of dwc:Identification, not 
> dwc:Occurrence but again that is a distinction that is only going to be 
> made in the manager's mind.  Making the distinction really only becomes 
> an issue when the database stops being "flat" for a particular 
> relationship, e.g. if the database wants to allow multiple 
> Identifications per specimen record.  Then the database structure must 
> be changed accordingly to accommodate that "normalization".
> What we have here at the present moment is a situation where data 
> providers don't have any way to have anything but "flat" records where 1 
> specimen=1 Occurrence=1 Organism.  By adding the Organism and 
> CollectionObject classes, we allow people who need or want to have less 
> "flat" (=more "normalized") databases to have something to call the 
> entities that are represented by the new tables they create to handle 
> 1:many relationships instead of 1:1 relationships.  Anybody who only 
> cares about 1:1 relationships really doesn't need to worry about the 
> fact that the new class exists, just as people currently don't have to 
> worry about the Identification class if they only allow one 
> Identification per specimen in their database.
> So I guess what I'm saying is that if a database manager has a table 
> labeled Occurrence, they really don't have to freak out if we now tell 
> them that their table actually should be labeled CollectionObject as 
> long as there is only one CollectionObject per Occurrence.  They didn't 
> freak out before when we told them that they should call their table 
> "Occurrence" instead of "Observation" or "Specimen" in 2009, did they?
> I think what I'm saying here is what Rich was trying to say in the 
> paragraph I quoted, but I'm not sure.
> Steve
> -- 
> Steven J. Baskauf, Ph.D., Senior Lecturer
> Vanderbilt University Dept. of Biological Sciences
> postal mail address:
> VU Station B 351634
> Nashville, TN  37235-1634,  U.S.A.
> delivery address:
> 2125 Stevenson Center
> 1161 21st Ave., S.
> Nashville, TN 37235
> office: 2128 Stevenson Center
> phone: (615) 343-4582,  fax: (615) 343-6707
> http://bioimages.vanderbilt.edu