[tdwg-content] Occurrences, Organisms, and CollectionObjects: a review

John Wieczorek tuco at berkeley.edu
Wed Sep 14 03:43:59 CEST 2011

On Tue, Sep 13, 2011 at 11:15 AM, Richard Pyle
<deepreef at bishopmuseum.org> wrote:
> Ever since DwC transitioned from a "Federated Schema" to a "Vocabulary",
> I've never been entirely clear on what sorts of alterations would break
> backward-compatibility, and which are easily handled.  I've heard various
> statements from people with much more understanding than I on the
> implications of a "Vocabulary" that the classes are really intended as rough
> clusters of terms, and it's the definition of terms that matter.  Have I
> misunderstood this?

Well, that depends partially on what technology you have invested in.
If you only need to share records of Simple Darwin Core, whether in
text files or in XML documents that are valid to the Simple Darwin
Core XML schema, then the classes really don't have much of a function
at all, except maybe as a convenience. For example, people might find
it convenient to talk about the Location terms in Simple Darwin Core
as a subset of terms that are grouped together and pertain to
Locations. But in the record, the Location class does not appear. IPT
uses the label for the Class to create a list of terms in its mapping
user interface to help users quickly find relevant terms among the
many, many terms that Darwin Core supports.
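
As a rough illustration of that point, here is a minimal Python sketch. The
term-to-class table is a small hand-written excerpt, the record values are
made up, and the helper name group_by_class is hypothetical; none of it
comes from the IPT itself. It shows a flat record in which no class appears,
with the class used only to group terms for display.

```python
from collections import defaultdict

# Hand-written excerpt of which class each term is organized in.
# Illustrative only; not generated from the standard.
TERM_CLASS = {
    "occurrenceID": "Occurrence",
    "catalogNumber": "Occurrence",
    "country": "Location",
    "decimalLatitude": "Location",
    "decimalLongitude": "Location",
    "scientificName": "Taxon",
}

# A Simple Darwin Core record is flat: terms and values, no classes anywhere.
# All values here are invented for the example.
record = {
    "occurrenceID": "urn:example:occ:1",
    "catalogNumber": "12345",
    "country": "United States",
    "decimalLatitude": "37.87",
    "decimalLongitude": "-122.26",
    "scientificName": "Puma concolor",
}


def group_by_class(rec):
    """Group a flat record's terms by class, purely as a display convenience."""
    groups = defaultdict(list)
    for term in rec:
        groups[TERM_CLASS.get(term, "Unclassified")].append(term)
    return dict(groups)


groups = group_by_class(record)
```

The grouping lives entirely outside the record, so dropping or changing the
lookup table leaves the shared data untouched.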

Once you get into more relational structures, Classes may take on a
more active role. For example, using the IPT's capacity to represent a
star schema in text files, one might have all of the Identification
information in a single file and relate that to a core record of the
Occurrence type and thereby support many Identifications for a single
Occurrence. The classes aren't explicit in this example, but they are
"understood" by the humans who attach the identifiers in the
Identifications file that relate to the core record. The same sort of
structural understanding might be made more explicit in a database
structure that tries to mimic the recommended
dwcattributes:organizedInClass, or in an XML Schema that explicitly
uses the Classes as containers for properties.
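
The star-schema case can be sketched in Python as follows. The two small
CSV texts, their column choices, and the function names are assumptions made
up for illustration, not an actual archive layout produced by the IPT. The
point is only that the link between files is an identifier column, while
the classes stay implicit.

```python
import csv
import io
from collections import defaultdict

# Core file: one row per Occurrence (invented data).
CORE = """occurrenceID,catalogNumber
occ-1,12345
occ-2,67890
"""

# Extension file: many Identification rows may point at one core row.
# The first column is assumed to carry the core record's identifier.
IDENTIFICATIONS = """occurrenceID,scientificName,identifiedBy,dateIdentified
occ-1,Puma concolor,A. Smith,1999-05-01
occ-1,Felis concolor,B. Jones,1950-03-12
occ-2,Lynx rufus,A. Smith,2001-07-30
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def attach_identifications(core_rows, ident_rows):
    """Map each core occurrenceID to the list of its Identification rows."""
    by_core = defaultdict(list)
    for row in ident_rows:
        by_core[row["occurrenceID"]].append(row)
    return {
        core["occurrenceID"]: by_core.get(core["occurrenceID"], [])
        for core in core_rows
    }


linked = attach_identifications(read_rows(CORE), read_rows(IDENTIFICATIONS))
```

Nothing in either file names the Occurrence or Identification class; a
human decided which terms go in which file and which column ties them
together.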

In the semantic world, the Classes might be used to make world views
(plural on purpose) that reflect an understanding of how the higher
level concepts ought to relate to each other when applied to a
particular problem.

All of these are valid uses of Darwin Core.

> The point being: The only way we are threatening to
> "break" DwC is by moving terms from the Occurrence class to two other new
> classes.  Does that mean we are no longer allowed to represent those terms
> as properties of a record with an OccurrenceID?

You weren't ever really allowed to do so in Darwin Core in the strict
sense of assigning a domain to a property. This is because, for
example, just because something has a dwc:scientificName doesn't mean
it's a Taxon.
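
To see why an assigned domain would matter, here is a small Python sketch
of the RDFS domain rule applied to plain tuples. The triples, the ex:
prefix, and above all the domain declaration are hypothetical; Darwin Core
makes no such declaration, which is the point.

```python
# Triples as (subject, predicate, object) tuples; all names are illustrative.
triples = {
    ("ex:occ-1", "dwc:scientificName", "Puma concolor"),
    ("ex:occ-1", "dwc:catalogNumber", "12345"),
}

# HYPOTHETICAL domain declaration. Darwin Core does not assert this.
hypothetical_domains = {"dwc:scientificName": "dwc:Taxon"}


def infer_types(triples, domains):
    """Apply the RDFS domain rule:
    (s p o) and (p rdfs:domain C) entail (s rdf:type C)."""
    return {
        (s, "rdf:type", domains[p])
        for (s, p, _o) in triples
        if p in domains
    }


# With the hypothetical domain, the occurrence record is inferred to be a Taxon.
inferred = infer_types(triples, hypothetical_domains)

# With no domains declared, nothing is inferred about the record's type.
no_domains = infer_types(triples, {})
```

With the domain declared, a reasoner would conclude that the record itself
is a Taxon, which is not what anyone sharing an occurrence record intends.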

> The tiny part of my brain
> that "gets" ontology wants to believe that backward compatibility of what
> would be the new DwC:Occurrence would be maintained with what is the
> existing DwC:Occurrence *only* if the new classes ("Organism" and
> "CollectionObject") are regarded as subclasses of Occurrence.

They really are distinct, and subclassing them from Occurrence would
not give them any properties. Backward compatibility is much more
affected by what you did with Occurrence before and what you will
do with Occurrence, Organism, and CollectionObject instead. To give an
example, the IPT could be modified to allow Organism and
CollectionObject to be core types instead of or in addition to
Occurrence. That would require re-engineering. The IPT could just as
easily ignore these distinctions and still pump out perfectly good
Darwin Core Archives about Taxa and Occurrences. It is really about what
you want to do with it.

> But the
> slightly less tiny (but still tiny) part of my brain that "gets" information
> modeling doesn't think that's the right way to represent the new classes.
> Which tiny part of my brain is right? (I'm guessing neither...) Does it even
> matter?

I think it matters to those who want to make Darwin Core more capable of
being used in ways that require distinctions that it cannot currently
represent. It might be useful to think of a counterexample to help
people feel more comfortable with the proposed changes. If you have a
monitoring network of camera traps, does it really bother you that
Darwin Core has this GeologicalContext class? You don't use it. Should
it matter that there are classes for Organisms and the persistent
evidence of them? It should matter if you can do something interesting
with that, otherwise you can ignore the distinction between them and
think of the properties as Occurrence-related.

> Obviously, we want a stable DwC.  But we also want a DwC that meets our
> needs.  Clearly, there are needs that are not being met by the existing DwC.
> The first question is, are those needs important enough to consider
> destabilizing DwC (by introducing two new classes, and shuffling some terms
> from one existing class to the new classes)?  The second question is: what
> are the real costs/consequences of the "destabilization".  In my mind, the
> answer to the first question is increasingly obvious ("yes").  But I don't
> have a good feel for the answer to the second question.

The answer to the second one is going to depend on how Darwin Core is
being used. It might be good to get some anecdotes about how the
proposed changes are going to "break" anything currently in existence.

> Aloha,
> Rich
> P.S. Greg: I live on the other side of the world from *everyone*, yet that
> hasn't prevented me from getting my words in... :-)
>> -----Original Message-----
>> From: tdwg-content-bounces at lists.tdwg.org
>> [mailto:tdwg-content-bounces at lists.tdwg.org] On Behalf Of "Markus Döring (GBIF)"
>> Sent: Tuesday, September 13, 2011 6:59 AM
>> To: Steve Baskauf
>> Cc: tdwg-content at lists.tdwg.org; "Éamonn Ó Tuama (GBIF)"
>> Subject: Re: [tdwg-content] Occurrences, Organisms, and CollectionObjects:
>> a review
>> Hi Steve,
>> I agree it is a good thing to be more clear about what an occurrence
>> actually is and I wouldn't disagree with the proposed 3 classes. Still
>> there is a drastic change in semantics of an existing term Occurrence and
>> I would feel more comfortable if we can tell those different usages apart.
>> If that's via a namespace based versioning of (all?) darwin core terms,
>> through the use of a different term name or something else I don't know.
>> Don't you think this is an issue? Would you also change an owl ontology
>> class definition in the same way and wouldn't that be harmful to existing
>> instances?
>> Markus
>> > With regards to Markus' concern about whether people will be able to
>> > know whether somebody is talking about a "new-style" Occurrence or an
>> > "old" Occurrence, I would assert that the "old" Occurrence didn't really
>> > have a clear meaning.  If you review the summary of the discussion on
>> > Occurrence, you can see that it was used to mean at least three different
>> > kinds of "things" by different people.  What John is actually doing with
>> > his proposal is to add clarity about what an Occurrence is where it didn't
>> > exist before.  I think that is a good thing.  If, by the "old" kind of
>> > Occurrence people are meaning that Occurrence is a fancier name for
>> > PreservedSpecimen (which I believe is how some people in the museum
>> > community are thinking of it), then I would say that such a
>> > characterization is incorrect (based on the apparent consensus) and that
>> > clarifying the incorrectness of that view is a really good thing.
>> >
>> > Steve
>> >
>> > Éamonn Ó Tuama (GBIF) wrote:
>> >> It would be good to hear from someone who is familiar with the work
>> >> going on in the Observations Task Group and could explain how a
>> >> generic model for observations/measurements (e.g. OBOE) might help
>> >> sort out these issues. It seems to me that we are trying to build in
>> >> an ad-hoc manner an increasingly complex model on top of DwC which is
>> >> really just a glossary of terms. That does not seem like a good
>> >> approach - but I'm no modeller :-) _Éamonn
>> >>
>> >> -----Original Message-----
>> >> From: Dag Endresen (GBIF) [mailto:dendresen at gbif.org]
>> >> Sent: 13 September 2011 12:18
>> >> To: "Markus Döring (GBIF)"
>> >> Cc: tdwg-content at lists.tdwg.org; Éamonn Ó Tuama
>> >> Subject: Re: [tdwg-content] Occurrences, Organisms, and
>> >> CollectionObjects: a review
>> >>
>> >>  Hi Markus,
>> >>
>> >>  I believe that the discussion here originates from the view that the
>> >> "CollectionObject"/"Sample" is a different thing from the "Organism"
>> >> -  and that there can be a relationship between
>> >> CollectionObjects/Samples  and Organisms that could be difficult to
>> >> describe if these things are  identified as the same thing
>> >> (occurrenceID). Do you think that the  "Occurrence" would be seen as
>> >> a thing different from the proposed  CollectionObject/Sample and
>> >> Organism - or as a super-class that would  include
>> >> CollectionObjects/Samples and Organisms? Would the semantics of
>> >> Occurrence change?
>> >>
>> >>  I fully share your view that the Darwin Core Archive (DwC-A) would
>> >> not  be suited to share the full complex relationship between
>> >> entities - even  if persistent identifiers would be used. However if
>> >> we start to describe  and include other things (core types) than only
>> >> the taxon and  occurrences then perhaps the DwC-A could be a useful
>> >> way to provide a  simple list of these entities? This could perhaps
>> >> provide easier  indexing and discovery of these new entities?
>> >>
>> >>  Dag
>> >>
>> >>
>> >>
>> >>  On Tue, 13 Sep 2011 10:03:00 +0200, Markus Döring (GBIF) wrote:
>> >>
>> >>
>> >>> I have to say that the change in semantics to the Occurrence class
>> >>> makes me a bit nervous.
>> >>> Can someone try to help fight my fears?
>> >>>
>> >>> DarwinCore has no versioning of namespaces, so there is no way for a
>> >>> consumer to detect if it's an old style Occurrence or a new one. I am
>> >>> currently parsing various RSS feeds and even though it's a mess
>> >>> having to parse 10 different styles I am glad that at least the
>> >>> designers made sure they all have their own namespace! Also removing
>> >>> or renaming terms might cause serious problems. Would discrete
>> >>> versions of dwc with their own namespace hurt?
>> >>>
>> >>> Another observation relates to dwc archives and its star schema. As
>> >>> an index to data that has been flattened there is no problem with
>> >>> more classes and core row types, but if you want it as a way to
>> >>> transfer complete normalized data it will not work. But that never
>> >>> really was the intention and I simply wanted to stress that fact.
>> >>>
>> >>> Markus
>> >>>
>> >>>
>> >>>
>> >>> On Sep 9, 2011, at 4:52 PM, Steve Baskauf wrote:
>> >>>
>> >>>
>> >>>
>> >>>> Richard Pyle wrote:
>> >>>>
>> >>>>
>> >>>>> I'm also wondering if we necessarily need to "break" the
>> >>>>> traditional view of the "Occurrence" class in order to implement
>> >>>>> Organism and CollectionObject.
>> >>>>> As long as we keep in mind that DwC is a vocabulary of terms
>> >>>>> focused on representing an exchange standard (rather than a
>> >>>>> full-blown Ontology), perhaps Occurrence records can continue to
>> >>>>> be represented in the traditional way as "flat" content, but the
>> >>>>> Organism and CollectionObject classes allow us to present data in
>> >>>>> a somewhat more "normalized" way in those circumstances that call
>> >>>>> for it (e.g. tracking individuals or groups over time [Organism],
>> >>>>> or managing fossil rocks with multiple taxa [CollectionObject] --
>> >>>>> to name just two).
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>> I've been thinking about this issue of "backward compatibility"
>> >>>> with respect to Occurrences if the CollectionObject/Sample/Token/whatever
>> >>>> class is adopted.  I really don't think it is going to be as big of
>> >>>> a deal as people are making it out to be.
>> >>>>
>> >>>> It seems to me that the main problems arise in two areas: when one
>> >>>> wants to be clear about typing and when one wants to express
>> >>>> relationships in a system where it is possible to do through
>> >>>> semantics (like RDF).  In that kind of circumstance, it's bad (oh
>> >>>> yeh, I forgot - the term is "naughty") to say something like
>> >>>> resourceA hasOccurrence resourceB
>> >>>> when resourceB isn't actually an Occurrence.   "Wrong" typing also
>> >>>> happens all the time because the classes don't exist (yet) to do
>> >>>> the typing correctly.  As a case in point, in the Morphbank system,
>> >>>> I have multiple images of the same tree.  In that system the tree
>> >>>> is typed as a "specimen".  That is totally wrong because the tree
>> >>>> isn't a specimen, but what else is it going to be typed as?  There
>> >>>> isn't (yet) an appropriate class to put it in.
>> >>>>
>> >>>> Although these two problems (wrong typing and using a term with the
>> >>>> wrong kind of object, which are actually different manifestations of
>> >>>> the same class-based problem) are naughty, realistically very few
>> >>>> people are actually using a system that is "semantic-aware" (e.g.
>> >>>> serving and consuming RDF) so right now making those mistakes
>> >>>> doesn't really "break" anything.  Most data providers are using
>> >>>> traditional databases or
>> >>>> even Excel spreadsheets where the DwC terms are just column
>> >>>> headings with no real "meaning" other than what the data managers
>> >>>> intend for them to mean.  So if a manager has a table where each
>> >>>> line contains a record for a specimen and has a column heading for
>> >>>> a column entitled "dwc:catalogNumber", there isn't really anything
>> >>>> other than an idea in the manager's head that the catalogNumber is
>> >>>> a property of a specimen or Occurrence or CollectionObject.  If each
>> >>>> line in the database table is "flat" such that one specimen=one
>> >>>> CollectionObject=one Occurrence, all that is required to make
>> >>>> catalogNumber be a property of a CollectionObject instead of an
>> >>>> Occurrence is a different way of thinking in the manager's mind
>> >>>> because there are really no semantics embedded in the table.  We
>> >>>> are already doing this kind of mental gymnastics with existing
>> >>>> classes like dwc:Identification .  If our hypothetical database
>> >>>> manager has a column heading that says "dwc:identifiedBy" in the
>> >>>> specimen table, that is really a property of dwc:Identification,
>> >>>> not dwc:Occurrence but again that is a distinction that is only
>> >>>> going to be made in the manager's mind.  Making the distinction
>> >>>> really only becomes an issue when the database stops being "flat"
>> >>>> for a particular relationship, e.g. if the database wants to allow
>> >>>> multiple Identifications per specimen record.  Then the database
>> >>>> structure must be changed accordingly to accommodate that
>> >>>> "normalization".
>> >>>>
>> >>>> What we have here at the present moment is a situation where data
>> >>>> providers don't have any way to have anything but "flat" records
>> >>>> where 1 specimen=1 Occurrence=1 Organism.  By adding the Organism and
>> >>>> CollectionObject classes, we allow people who need or want to have
>> >>>> less "flat" (=more "normalized") databases to have something to
>> >>>> call the entities that are represented by the new tables they
>> >>>> create to handle 1:many relationships instead of 1:1 relationships.
>> >>>> Anybody who only cares about 1:1 relationships really doesn't need
>> >>>> to worry about the fact that the new class exists, just as people
>> >>>> currently don't have to worry about the Identification class if
>> >>>> they only allow one Identification per specimen in their database.
>> >>>>
>> >>>> So I guess what I'm saying is that if a database manager has a
>> >>>> table labeled Occurrence, they really don't have to freak out if we
>> >>>> now tell them that their table actually should be labeled
>> >>>> CollectionObject as long as there is only one CollectionObject per
>> >>>> Occurrence.  They didn't freak out before when we told them that
>> >>>> they should call their table "Occurrence" instead of "Observation"
>> >>>> or "Specimen" in 2009, did they?
>> >>>>
>> >>>> I think what I'm saying here is what Rich was trying to say in the
>> >>>> paragraph I quoted, but I'm not sure.
>> >>>>
>> >>>> Steve
>> >>>>
>> >>>> --
>> >>>> Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University
>> >>>> Dept. of Biological Sciences
>> >>>>
>> >>>> postal mail address:
>> >>>> VU Station B 351634
>> >>>> Nashville, TN  37235-1634,  U.S.A.
>> >>>>
>> >>>> delivery address:
>> >>>> 2125 Stevenson Center
>> >>>> 1161 21st Ave., S.
>> >>>> Nashville, TN 37235
>> >>>>
>> >>>> office: 2128 Stevenson Center
>> >>>> phone: (615) 343-4582,  fax: (615) 343-6707
>> >>>>
>> >>>> http://bioimages.vanderbilt.edu
>> >>>>
>> >>>>
>> >>>>
>> >>>> _______________________________________________
>> >>>> tdwg-content mailing list
>> >>>>
>> >>>> tdwg-content at lists.tdwg.org
>> >>>> http://lists.tdwg.org/mailman/listinfo/tdwg-content
>> >>>>
>> >>>>
>> >>>>
>> >>
>> >>
>> >>
>> >
>> > --
>> > Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept.
>> > of Biological Sciences
>> >
>> > postal mail address:
>> > VU Station B 351634
>> > Nashville, TN  37235-1634,  U.S.A.
>> >
>> > delivery address:
>> > 2125 Stevenson Center
>> > 1161 21st Ave., S.
>> > Nashville, TN 37235
>> >
>> > office: 2128 Stevenson Center
>> > phone: (615) 343-4582,  fax: (615) 343-6707
>> >
>> > http://bioimages.vanderbilt.edu
