Re: [tdwg-content] "Wrong" RDF, was Re: What I learned at the TechnoBioBlitz

14 Oct 2010

Speaking strictly from ignorance rather than wisdom here, I don't believe
there is one right way to use the standard, though I agree that there are
innumerable wrong ways to do so. It's this basic unease that makes me
intuitively shy of expressing "A [single] TDWG Ontology".

What if we try a slightly different world view from the one you propose
centered on the Individual? Namely, let the Occurrence stand as "evidence
that a taxon occurred at a place and time." That is to say, we may or may
not care about the concept of an individual in our thinking and our data
capture. In this view, the Occurrence remains the central concept, and the
rest of the data highlights the evidence. Hence, a skull in a collection
(and the information gathered about the collection event) is the evidence
that a taxon occurred at a place and time.  Similarly, a digital image of an
identifiable individual from a camera trap is the evidence that a taxon
occurred at a place and time. A fossil having myriad individuals is evidence
that taxa occurred at a place and time based on a GeologicalContext.
In plain English, which we could express as RDF with an appropriate set of
predicates, we would always have the same pattern to describe Occurrences
from the Occurrence-centric world view, namely

the Occurrence O gives evidence that Taxon T determined based on
Identification criteria I occurred at Location L within GeologicalContext G
during the Event E based on evidence captured in properties of the
Occurrence and distinguishable in the type of evidence as recorded in the
dcterms:type and/or the dwc:basisOfRecord.

I don't see anything "wrong" with this formulation, as all of the predicates
appropriately associate subjects and objects.
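
A minimal sketch of that pattern, written as plain subject-predicate-object
tuples in Python rather than in any RDF serialization. The "ex:" predicate
names and identifiers are hypothetical placeholders that I made up for
illustration; only dwc:basisOfRecord is an actual Darwin Core term.

```python
# Sketch of the Occurrence-centric pattern as subject-predicate-object triples.
# All "ex:" names are hypothetical placeholders, not Darwin Core terms.

def occurrence_triples(occurrence, taxon, identification, location,
                       geological_context, event, basis_of_record):
    """Return triples that all share the Occurrence as their subject."""
    return [
        (occurrence, "ex:givesEvidenceOf", taxon),
        (occurrence, "ex:determinedBy", identification),
        (occurrence, "ex:occurredAt", location),
        (occurrence, "ex:withinGeologicalContext", geological_context),
        (occurrence, "ex:duringEvent", event),
        (occurrence, "dwc:basisOfRecord", basis_of_record),
    ]

# A skull in a collection, described with the same pattern as any other evidence.
skull = occurrence_triples(
    "ex:occurrence1", "ex:taxonT", "ex:identificationI", "ex:locationL",
    "ex:geologicalContextG", "ex:eventE", "PreservedSpecimen",
)

for triple in skull:
    print(triple)
```

Every statement hangs off the Occurrence, so a camera-trap image or a fossil
would differ only in the values passed in, not in the shape of the description.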

In other words, what is special about the Individual-centric view (or any
other view) except the way one wants to think about and express the
relationships (predicates) or formulates the questions?

On Wed, Oct 13, 2010 at 7:07 PM, Steve Baskauf <steve.baskauf@vanderbilt.edu
...
wrote:
...
I was just ready to leave work when I wrote this and since then I'm
feeling like I should clarify just what I mean by "wrong" ways of using
RDF.  I recognize that TDWG encourages flexibility in the ways that
standards such as DwC are used.  As such, it doesn't usually define "right"
and "wrong" ways of using the standards.  What I mean by calling some uses
"wrong" is not intended to discourage the creative use of DwC terms in RDF.
What I mean is that one must be careful to make sure that RDF statements
mean what is intended.  Here is an example.  The Dublin Core term
dcterms:language means "the language of the resource".  On multiple
occasions, I've seen this term used in RDF as a property of a resource whose
metadata is written in a certain language.  This is "wrong" because the
subject of the statement is the resource itself, not the resource's
metadata.  The need for this kind of clarity is apparent in the case of
media.  For example, if we are providing metadata in English that describes
a nature film which has audio in German, the correct statement is that
[film] dcterms:language "de", NOT [film] dcterms:language "en".  This
problem is handled appropriately in the MRTG schema by creating the
(required) term mrtg:metadataLanguage.   The correct statement would be
[film] mrtg:metadataLanguage "en" .  (I'm using "[film]" in lieu of a URI
identifier for the film.)  If, however, we were writing RDF to describe the
metadata itself rather than the film, then it would be appropriate to say
[film's metadata] dcterms:language "en" .  In straight XML, we might get
away with semantic sloppiness if the senders and receivers of the XML
"understand" what the intended subject is of the term dcterms:language.  But
in RDF, we have to assume that the receiver of the RDF is a "stupid"
computer which only infers exactly what is said and not what we MEANT to
say.
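The film example can be sketched as plain triples in Python.  This is only an
illustration under stated assumptions: "ex:film" and "ex:filmMetadata" are
hypothetical stand-ins for real URIs, and the lookup helper is mine, not part
of any RDF library.

```python
# Hypothetical identifiers standing in for real URIs.
film = "ex:film"
film_metadata = "ex:filmMetadata"

triples = {
    (film, "dcterms:language", "de"),            # the film's audio is German
    (film, "mrtg:metadataLanguage", "en"),       # its metadata is in English
    (film_metadata, "dcterms:language", "en"),   # statement about the metadata itself
}

def objects(subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# A literal-minded consumer only sees what was actually asserted.
print(objects(film, "dcterms:language"))           # {'de'}
print(objects(film_metadata, "dcterms:language"))  # {'en'}
```

Asking for the language of the film returns "de" and nothing else, which is
exactly what a "stupid" receiver would conclude from these statements.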
I believe that this is a very important point that all parties need to keep
in mind before we happily march off creating RDF templates for the general
public to use.  In particular, I have some serious problems with the way
that people are associating properties with instances of the dwc:Occurrence
class.  I believe that these "wrong" ways originate with the historical
roots of Darwin Core as a means to describe specimens.  I will illustrate
what I mean.  In many cases, a specimen is created by killing an organism
and gluing it to a piece of paper (if it's a plant) or putting it in a jar
(if it's an animal).  It is natural to ask the question "what kind of
species is the specimen?".  We can look at the specimen and make a statement
like [specimen] dwc:scientificName "Drosophila melanogaster" and it pretty
much makes sense.  However, in the new Darwin Core standard, we have a
broader category of "things" (a.k.a. resources) that we call Occurrences,
which include specimens but also observations and probably
all kinds of things like images, DNA samples, and a whole lot of other
things.  If we try to apply the same kind of statement to other kinds of
Occurrences besides specimens we immediately run into problems.  If we say
that [digital image] dwc:scientificName "Drosophila melanogaster" we are
making a nonsensical statement.  The digital image can have properties like
its photographer, its format, its pixel dimensions, etc. but the image
itself does not have a scientific name.  The scientific name is a property
of the thing that was photographed.  It makes even less sense if we are
talking about observations.  An observation is a situation where somebody
observes an organism.  The observation can have properties like the
observer, the location, etc.  However, if we say [observation]
dwc:scientificName "Drosophila melanogaster" we are saying that the act of
observing has a scientific name.  That is an incorrect statement.  So the
general statement [Occurrence] dwc:scientificName "Drosophila melanogaster"
does not make sense when applied to all possible types of Occurrences.
Rather, the organism that we are observing is the thing that has a
scientific name.
In all of the examples above, the correct statement is [individual
organism] dwc:scientificName "Drosophila melanogaster".  The specimen is an
occurrence of the individual organism.  The image is an occurrence of the
individual organism.  The observation is an occurrence of the individual
organism.  These statements may seem odd because we are used to thinking of
an Occurrence being an occurrence of the "species" but it's not really.  The
image is not an image of the Drosophila species concept nor is it an image
of the string "Drosophila melanogaster".  The image is an image of an
individual fruit fly.  The individual fruit fly is a representative of the
taxon, the image and the observation are not.
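A rough sketch of that arrangement in Python, using plain tuples as triples.
The "ex:" identifiers, the ex:occurrenceOf predicate, and the
ex:photographer predicate are hypothetical names chosen for illustration;
they are not Darwin Core terms.

```python
# "ex:" identifiers and predicates are hypothetical placeholders.
triples = [
    ("ex:fly42", "dwc:scientificName", "Drosophila melanogaster"),
    ("ex:specimen1", "ex:occurrenceOf", "ex:fly42"),
    ("ex:image1", "ex:occurrenceOf", "ex:fly42"),
    ("ex:observation1", "ex:occurrenceOf", "ex:fly42"),
    ("ex:image1", "ex:photographer", "A. Photographer"),
]

def value(subject, predicate):
    """Return the first object for (subject, predicate), or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None

def scientific_name_for(occurrence):
    """Follow the occurrence to its individual, then read the name there."""
    individual = value(occurrence, "ex:occurrenceOf")
    if individual is None:
        return None
    return value(individual, "dwc:scientificName")

print(scientific_name_for("ex:image1"))          # Drosophila melanogaster
print(value("ex:image1", "dwc:scientificName"))  # None
```

The image keeps its own properties (such as the photographer), while the name
is stated once on the individual and reached through the link.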
This point becomes more clear if we look at a situation where several types
of occurrence records are collected from the same individual.  Let's say
that we capture a bird, photograph it, collect a feather from it, collect a
DNA sample and band it and let it go.  Later somebody sees the band and
reports that as an observation.  How do we connect all of these things?  Do
we create an identifier for the specimen (the feather) and then say that the
image and the DNA sample came from it?  That would be wrong.  We could take
an image of the feather, but that would be a different thing from an image
of the bird.  We didn't get the DNA sample from the feather, we got it via a
blood sample from the bird.  The band observation is not an observation of
the feather, or the image or the DNA sample.  It's an observation of the
bird which was never any kind of specimen living or dead.  The bird is an
individual organism and that's what we need to call it.  Right now we don't
have anything in Darwin Core that can be used to rdf:type the bird, which
is why I proposed Individual as a Darwin Core class.
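The bird example can be sketched in Python as a handful of records that each
point at the thing they were derived from.  All identifiers here are
hypothetical, and the record layout is an assumption made for illustration,
not a Darwin Core Archive structure.

```python
from collections import defaultdict

# Hypothetical records: (occurrence id, kind of evidence, what it was taken from).
records = [
    ("ex:image7", "image", "ex:bird1"),
    ("ex:feather3", "specimen", "ex:bird1"),
    ("ex:dna9", "DNA sample", "ex:bird1"),
    ("ex:bandSighting2", "observation", "ex:bird1"),
    # An image of the feather is a different thing from an image of the bird.
    ("ex:featherImage4", "image", "ex:feather3"),
]

def group_by_source(rows):
    """Group occurrence ids by the resource they document."""
    grouped = defaultdict(list)
    for occurrence, _kind, source in rows:
        grouped[source].append(occurrence)
    return dict(grouped)

by_source = group_by_source(records)
print(by_source["ex:bird1"])
print(by_source["ex:feather3"])
```

With the individual as the shared reference, the image, feather, DNA sample,
and band sighting are all connected without pretending that any one of them
came from another.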
I could say these things more clearly in RDF, but because many
members of the audience of this message aren't familiar with RDF/XML they
would probably zone out and the point would be lost.  The point is that we
need to have identifiable classes of "resources" (the technical name for
"things" like physical artifacts, concepts, and electronic representations)
for all of the things that we need to describe and inter-relate in the
Darwin Core world.  Right now, we are missing one of the important pieces
that we need, which is a class for the Individual.  If we are satisfied with
creating an RDF model that only works for specimens and one-time
observations, then we probably don't need Individual as a Darwin Core
class.  On the other hand, if TDWG and GBIF are really serious about
creating a system (Darwin Core and RDF based on it) that can handle other
types of Occurrences like multiple images of live organisms, observations of
the same organism over time, and multiple types of Occurrences collected
from the same organism, then this capability should be built into the system
from the start.  When I got back from the TDWG meeting, I was all excited
about trying to use Darwin Core Archives with my live plant image
collection.  However, it quickly became evident that it could not work
because Occurrences were at the center of the diagram rather than
Individuals.  So unless something changes, we are already embarking on the
process of locking out these other Occurrence types.
I hate to sound like a broken record (do we have those any more?), but read
my paper on this subject.  It explains the rationale better than this email,
has nice diagrams, and gives RDF examples to illustrate everything (
https://journals.ku.edu/index.php/jbi/article/view/3664).  If somebody has
a better idea of how to develop an internally consistent system that can
handle the problems I've raised here that DOESN'T involve Individuals (i.e.
other "right"[=semantically accurate] ways to express properties and
relationships among Identifications, Taxa, diverse types of Occurrences,
etc.) I'd like to hear what it is.  Or perhaps as Stan has suggested, there
needs to be a task group that can hash out alternative views.  But let's
have the discussion before we post models and suggest people use them.
Steve

John Wieczorek