[tdwg-content] Occurrences, Organisms, and CollectionObjects: a review
Éamonn Ó Tuama (GBIF)
eotuama at gbif.org
Tue Sep 13 15:19:32 CEST 2011
It would be good to hear from someone who is familiar with the work going on in the Observations Task Group and could explain how a generic model for observations/measurements (e.g. OBOE) might help sort out these issues. It seems to me that we are trying to build, in an ad hoc manner, an increasingly complex model on top of DwC, which is really just a glossary of terms. That does not seem like a good approach - but I'm no modeller :-)
Éamonn
-----Original Message-----
From: Dag Endresen (GBIF) [mailto:dendresen at gbif.org]
Sent: 13 September 2011 12:18
To: "Markus Döring (GBIF)"
Cc: tdwg-content at lists.tdwg.org; Éamonn Ó Tuama
Subject: Re: [tdwg-content] Occurrences, Organisms, and CollectionObjects: a review
Hi Markus,
I believe that the discussion here originates from the view that the
"CollectionObject"/"Sample" is a different thing from the "Organism" -
and that there can be a relationship between CollectionObjects/Samples
and Organisms that could be difficult to describe if these things are
identified as the same thing (occurrenceID). Do you think that the
"Occurrence" would be seen as a thing different from the proposed
CollectionObject/Sample and Organism - or as a super-class that would
include CollectionObjects/Samples and Organisms? Would the semantics of
Occurrence change?
I fully share your view that the Darwin Core Archive (DwC-A) would not
be suited to share the full complex relationship between entities - even
if persistent identifiers were used. However, if we start to describe
and include things (core types) other than just taxa and
occurrences, then perhaps the DwC-A could be a useful way to provide a
simple list of these entities? This could perhaps provide easier
indexing and discovery of these new entities?
Dag
On Tue, 13 Sep 2011 10:03:00 +0200, Markus Döring (GBIF) wrote:
> I have to say that the change in semantics to the Occurrence class
> makes me a bit nervous.
> Can someone help allay my fears?
>
> DarwinCore has no versioning of namespaces, so there is no way for a
> consumer to detect whether it's an old-style Occurrence or a new one.
> I am currently parsing various RSS feeds, and even though it's a mess
> having to parse 10 different styles, I am glad that at least the
> designers made sure they all have their own namespace! Also, removing
> or renaming terms might cause serious problems. Would discrete
> versions of dwc with their own namespace hurt?
>
> Another observation relates to dwc archives and their star schema.
> As an index to data that has been flattened, there is no problem with
> more classes and core row types, but if you want it as a way to
> transfer complete normalized data it will not work. But that never
> really was the intention, and I simply wanted to stress that fact.
>
> Markus
>
>
>
> On Sep 9, 2011, at 4:52 PM, Steve Baskauf wrote:
>
>> Richard Pyle wrote:
>>> I'm also wondering if we necessarily need to "break" the
>>> traditional view of the "Occurrence" class in order to implement
>>> Organism and CollectionObject. As long as we keep in mind that DwC
>>> is a vocabulary of terms focused on representing an exchange
>>> standard (rather than a full-blown Ontology), perhaps Occurrence
>>> records can continue to be represented in the traditional way as
>>> "flat" content, but the Organism and CollectionObject classes allow
>>> us to present data in a somewhat more "normalized" way in those
>>> circumstances that call for it (e.g. tracking individuals or groups
>>> over time [Organism], or managing fossil rocks with multiple taxa
>>> [CollectionObject] -- to name just two).
>>>
>> I've been thinking about this issue of "backward compatibility" with
>> respect to Occurrences if the CollectionObject/Sample/Token/whatever
>> class is adopted. I really don't think it is going to be as big of a
>> deal as people are making it out to be.
>>
>> It seems to me that the main problems arise in two areas: when one
>> wants to be clear about typing and when one wants to express
>> relationships in a system where it is possible to do so through
>> semantics (like RDF). In that kind of circumstance, it's bad (oh
>> yeah, I forgot - the term is "naughty") to say something like
>> resourceA hasOccurrence resourceB
>> when resourceB isn't actually an Occurrence. "Wrong" typing also
>> happens all the time because the classes don't exist (yet) to do the
>> typing correctly. As a case in point, in the Morphbank system, I
>> have multiple images of the same tree. In that system the tree is
>> typed as a "specimen". That is totally wrong because the tree isn't
>> a specimen, but what else is it going to be typed as? There isn't
>> (yet) an appropriate class to put it in.
>>
>> Although these two problems (wrong typing and using a term with the
>> wrong kind of object, which are actually different manifestations of
>> the same class-based problem) are naughty, realistically very few
>> people are actually using a system that is "semantic-aware" (e.g.
>> serving and consuming RDF), so right now making those mistakes
>> doesn't really "break" anything. Most data providers are using
>> traditional databases or even Excel spreadsheets where the DwC terms
>> are just column headings with no real "meaning" other than what the
>> data managers intend for them to mean. So if a manager has a table
>> where each line contains a record for a specimen and has a column
>> entitled "dwc:catalogNumber", there isn't really anything other than
>> an idea in the manager's head that the catalogNumber is a property
>> of a specimen or Occurrence or CollectionObject. If each line in the
>> database table is "flat" such that one specimen=one
>> CollectionObject=one Occurrence, all that is required to make
>> catalogNumber be a property of a CollectionObject instead of an
>> Occurrence is a different way of thinking in the manager's mind,
>> because there are really no semantics embedded in the table. We are
>> already doing this kind of mental gymnastics with existing classes
>> like dwc:Identification. If our hypothetical database manager has a
>> column heading that says "dwc:identifiedBy" in the specimen table,
>> that is really a property of dwc:Identification, not dwc:Occurrence,
>> but again that is a distinction that is only going to be made in the
>> manager's mind. Making the distinction really only becomes an issue
>> when the database stops being "flat" for a particular relationship,
>> e.g. if the database wants to allow multiple Identifications per
>> specimen record. Then the database structure must be changed
>> accordingly to accommodate that "normalization".
>>
>> What we have here at the present moment is a situation where data
>> providers don't have any way to have anything but "flat" records
>> where 1 specimen=1 Occurrence=1 Organism. By adding the Organism and
>> CollectionObject classes, we allow people who need or want to have
>> less "flat" (=more "normalized") databases to have something to call
>> the entities that are represented by the new tables they create to
>> handle 1:many relationships instead of 1:1 relationships. Anybody
>> who only cares about 1:1 relationships really doesn't need to worry
>> about the fact that the new class exists, just as people currently
>> don't have to worry about the Identification class if they only
>> allow one Identification per specimen in their database.
>>
>> So I guess what I'm saying is that if a database manager has a table
>> labeled Occurrence, they really don't have to freak out if we now
>> tell them that their table actually should be labeled
>> CollectionObject as long as there is only one CollectionObject per
>> Occurrence. They didn't freak out before when we told them that they
>> should call their table "Occurrence" instead of "Observation" or
>> "Specimen" in 2009, did they?
>>
>> I think what I'm saying here is what Rich was trying to say in the
>> paragraph I quoted, but I'm not sure.
>>
>> Steve
>>
>> --
>> Steven J. Baskauf, Ph.D., Senior Lecturer
>> Vanderbilt University Dept. of Biological Sciences
>>
>> postal mail address:
>> VU Station B 351634
>> Nashville, TN 37235-1634, U.S.A.
>>
>> delivery address:
>> 2125 Stevenson Center
>> 1161 21st Ave., S.
>> Nashville, TN 37235
>>
>> office: 2128 Stevenson Center
>> phone: (615) 343-4582, fax: (615) 343-6707
>> http://bioimages.vanderbilt.edu
>>
>>
>> _______________________________________________
>> tdwg-content mailing list
>> tdwg-content at lists.tdwg.org
>> http://lists.tdwg.org/mailman/listinfo/tdwg-content
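
Editor's illustration of the "flat" versus "normalized" point made in the quoted message: a minimal sketch, not anything proposed in the thread. The record values, the normalize and identifications_for helpers, and the choice to carry dwc:scientificName along with the identification are all assumptions made for the example. Only the term names (occurrenceID, catalogNumber, scientificName, identifiedBy) come from Darwin Core.

```python
# Hypothetical flat table: one specimen = one Occurrence = one Identification.
flat_rows = [
    {"dwc:occurrenceID": "occ-1", "dwc:catalogNumber": "A-001",
     "dwc:scientificName": "Quercus alba", "dwc:identifiedBy": "J. Smith"},
    {"dwc:occurrenceID": "occ-2", "dwc:catalogNumber": "A-002",
     "dwc:scientificName": "Acer rubrum", "dwc:identifiedBy": "J. Smith"},
]

# Assumption for this sketch: these columns describe the identification
# event rather than the occurrence itself.
IDENTIFICATION_TERMS = {"dwc:scientificName", "dwc:identifiedBy"}


def normalize(rows):
    """Split flat rows into an occurrence table and an identification table."""
    occurrences, identifications = [], []
    for row in rows:
        occurrences.append(
            {k: v for k, v in row.items() if k not in IDENTIFICATION_TERMS}
        )
        ident = {k: v for k, v in row.items() if k in IDENTIFICATION_TERMS}
        # Foreign key back to the occurrence the identification refers to.
        ident["dwc:occurrenceID"] = row["dwc:occurrenceID"]
        identifications.append(ident)
    return occurrences, identifications


def identifications_for(occurrence_id, identifications):
    """Return every identification recorded for one occurrence."""
    return [i for i in identifications
            if i["dwc:occurrenceID"] == occurrence_id]


occurrences, identifications = normalize(flat_rows)

# Once the tables are split, a second opinion on occ-1 is just another row;
# the flat layout had no place to put it.
identifications.append(
    {"dwc:occurrenceID": "occ-1", "dwc:scientificName": "Quercus stellata",
     "dwc:identifiedBy": "A. Jones"}
)
```

As long as every occurrence has exactly one identification, the two layouts carry the same information and the flat table can stay as it is. The split only matters once a 1:many relationship appears.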