I was just about to leave work when I wrote this, and since then I've
felt that I should clarify just what I mean by "wrong" ways of using
RDF. I recognize that TDWG encourages flexibility in the ways that
standards such as DwC are used. As such, it doesn't usually define
"right" and "wrong" ways of using the standards. Calling some uses
"wrong" is not intended to discourage the creative use of DwC terms in
RDF. What I mean is that one must be careful to
make sure that RDF statements mean what is intended. Here is an
example. The Dublin Core term dcterms:language means "the language of
the resource". On multiple occasions, I've seen this term attached in
RDF to a resource in order to state the language in which that
resource's metadata is written. This is "wrong" because the subject of
the statement is the resource itself, not the resource's metadata. The
need for this kind
of clarity is apparent in the case of media. For example, if we are
providing metadata in English that describes a nature film which has
audio in German, the correct statement is that [film] dcterms:language
"de", NOT [film] dcterms:language "en". This problem is handled
appropriately in the MRTG schema by creating the (required) term
mrtg:metadataLanguage. The correct statement would be [film]
mrtg:metadataLanguage "en" . (I'm using "[film]" in lieu of a URI
identifier for the film.) If, however, we were writing RDF to describe
the metadata itself rather than the film, then it would be appropriate
to say [film's metadata] dcterms:language "en" . In straight XML, we
might get away with semantic sloppiness if the senders and receivers of
the XML "understand" what the intended subject is of the term
dcterms:language. But in RDF, we have to assume that the receiver of
the RDF is a "stupid" computer which only infers exactly what is said
and not what we MEANT to say.
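To make the distinction concrete, here is a minimal sketch in Python
that treats each RDF statement as a plain (subject, predicate, object)
tuple. It is only an illustration, not real RDF tooling: "[film]" and
"[film's metadata]" are the same placeholders used above in lieu of
URIs, and the helper name objects() is my own invention.

```python
# Each RDF statement is modeled as a (subject, predicate, object) tuple.
# The bracketed strings are placeholders standing in for real URIs.
FILM = "[film]"
FILM_METADATA = "[film's metadata]"

triples = [
    (FILM, "dcterms:language", "de"),           # the film's audio is German
    (FILM, "mrtg:metadataLanguage", "en"),      # its metadata is in English
    (FILM_METADATA, "dcterms:language", "en"),  # statement about the metadata
]


def objects(triples, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# A consumer only sees what was literally asserted about each subject.
film_language = objects(triples, FILM, "dcterms:language")
metadata_language = objects(triples, FILM_METADATA, "dcterms:language")
print(film_language)
print(metadata_language)
```

The same predicate gives different answers depending on the subject,
which is exactly why the subject has to be chosen carefully.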
I believe that this is a very important point that all parties need to
keep in mind before we happily march off creating RDF templates for the
general public to use. In particular, I have some serious problems
with the way that people are associating properties with instances of
the dwc:Occurrence class. I believe that these "wrong" ways originate
with the historical roots of Darwin Core as a means to describe
specimens. I will illustrate what I mean. In many cases, a specimen
is created by killing an organism and gluing it to a piece of paper (if
it's a plant) or putting it in a jar (if it's an animal). It is
natural to ask the question "what kind of species is the specimen?".
We can look at the specimen and make a statement like [specimen]
dwc:scientificName "Drosophila melanogaster" and it pretty much makes
sense. However, in the new Darwin Core standard, we have a broader
category of "things" (a.k.a. resources) that we call Occurrences, which
includes specimens but also observations and probably all kinds of
other things, such as images and DNA samples. If we try to apply the
same kind of statement to other kinds
of Occurrences besides specimens we immediately run into problems. If
we say that [digital image] dwc:scientificName "Drosophila
melanogaster" we are making a nonsensical statement. The digital image
can have properties like its photographer, its format, its pixel
dimensions, etc. but the image itself does not have a scientific name.
The scientific name is a property of the thing that was photographed.
It makes even less sense if we are talking about observations. An
observation is a situation where somebody observes an organism. The
observation can have properties like the observer, the location, etc.
However, if we say [observation] dwc:scientificName "Drosophila
melanogaster" we are saying that the act of observing has a scientific
name. That is an incorrect statement. So the general statement
[Occurrence] dwc:scientificName "Drosophila melanogaster" does not make
sense when applied to all possible types of Occurrences. Rather, the
organism that we are observing is the thing that has a scientific
name.
In all of the examples above, the correct statement is [individual
organism] dwc:scientificName "Drosophila melanogaster". The specimen
is an occurrence of the individual organism. The image is an
occurrence of the individual organism. The observation is an
occurrence of the individual organism. These statements may seem odd
because we are used to thinking of an Occurrence being an occurrence of
the "species" but it's not really. The image is not an image of the
Drosophila species concept nor is it an image of the string "Drosophila
melanogaster". The image is an image of an individual fruit fly. The
individual fruit fly is a representative of the taxon, the image and
the observation are not.
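As a rough sketch of this arrangement, again using Python tuples as
stand-ins for RDF statements: the name is asserted once, on the
individual organism, and each Occurrence merely points at that
individual. Note that "ex:occurrenceOf" is a hypothetical linking
property that I made up for this illustration; it is not a Darwin Core
term, and the function name is likewise just an assumption.

```python
# Statements as (subject, predicate, object) tuples; bracketed strings
# are placeholders for URIs. "ex:occurrenceOf" is a HYPOTHETICAL
# property invented for this sketch, not part of Darwin Core.
INDIVIDUAL = "[individual organism]"

triples = [
    (INDIVIDUAL, "dwc:scientificName", "Drosophila melanogaster"),
    ("[specimen]", "ex:occurrenceOf", INDIVIDUAL),
    ("[digital image]", "ex:occurrenceOf", INDIVIDUAL),
    ("[observation]", "ex:occurrenceOf", INDIVIDUAL),
]


def scientific_names_for(occurrence, triples):
    """Follow the occurrence to its individual(s), then read the name there."""
    individuals = {
        o for s, p, o in triples
        if s == occurrence and p == "ex:occurrenceOf"
    }
    return [
        o for s, p, o in triples
        if s in individuals and p == "dwc:scientificName"
    ]


for occurrence in ("[specimen]", "[digital image]", "[observation]"):
    print(occurrence, scientific_names_for(occurrence, triples))
```

Nothing here claims that the image or the act of observing has a
scientific name, yet the name can still be found from any of them by
following one link.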
This point becomes clearer if we look at a situation where several
types of occurrence records are collected from the same individual.
Let's say that we capture a bird, photograph it, collect a feather from
it, collect a DNA sample, band it, and let it go. Later somebody
sees the band and reports that as an observation. How do we connect
all of these things? Do we create an identifier for the specimen (the
feather) and then say that the image and the DNA sample came from it?
That would be wrong. We could take an image of the feather, but that
would be a different thing from an image of the bird. We didn't get
the DNA sample from the feather, we got it via a blood sample from the
bird. The band observation is not an observation of the feather, or
the image, or the DNA sample. It's an observation of the bird, which was
never any kind of specimen, living or dead. The bird is an individual
organism and that's what we need to call it. Right now we don't have
anything in Darwin Core that can be used as the rdf:type of the bird,
which is why I proposed Individual as a Darwin Core class.
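The bird example might be sketched like this, with statements again
written as Python tuples. Both "ex:Individual" (standing in for the
proposed class) and "ex:occurrenceOf" (a linking property) are
hypothetical names used only for illustration, and the bracketed
strings are placeholders for URIs.

```python
from collections import defaultdict

# "ex:Individual" and "ex:occurrenceOf" are HYPOTHETICAL names for this
# sketch; neither is an existing Darwin Core term.
BIRD = "[bird]"

triples = [
    (BIRD, "rdf:type", "ex:Individual"),
    ("[image of bird]", "ex:occurrenceOf", BIRD),
    ("[feather]", "ex:occurrenceOf", BIRD),
    ("[DNA sample]", "ex:occurrenceOf", BIRD),
    ("[band observation]", "ex:occurrenceOf", BIRD),
]


def occurrences_by_individual(triples):
    """Group occurrence subjects under the individual they point to."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        if p == "ex:occurrenceOf":
            grouped[o].append(s)
    return dict(grouped)


grouped = occurrences_by_individual(triples)
print(grouped[BIRD])
```

Every record hangs off the bird itself, so none of them has to be
described as derived from the feather.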
I could say these things more clearly in RDF, but because many
members of the audience of this message aren't familiar with RDF/XML
they would probably zone out and the point would be lost. The point is
that we need to have identifiable classes of "resources" (the technical
name for "things" like physical artifacts, concepts, and electronic
representations) for all of the things that we need to describe
and inter-relate in the Darwin Core world. Right now, we are missing
one of the important pieces that we need, which is a class for the
Individual. If we are satisfied with creating an RDF model that only
works for specimens and one-time observations, then we probably don't
need Individual as a Darwin Core class. On the other hand, if TDWG and
GBIF are really serious about creating a system (Darwin Core and RDF
based on it) that can handle other types of Occurrences like multiple
images of live organisms, observations of the same organism over time,
and multiple types of Occurrences collected from the same organism,
then this capability should be built into the system from the start.
When I got back from the TDWG meeting, I was all excited about trying
to use Darwin Core Archives with my live plant image collection.
However, it quickly became evident that it could not work because
Occurrences were at the center of the diagram rather than Individuals.
So unless something changes, we are already embarking on the process of
locking out these other Occurrence types.
I hate to sound like a broken record (do we have those any more?), but
read my paper on this subject. It explains the rationale better than
this email, has nice diagrams, and gives RDF examples to illustrate
everything (https://journals.ku.edu/index.php/jbi/article/view/3664).
If somebody has a better idea of how to develop an internally
consistent system that can handle the problems I've raised here that
DOESN'T involve Individuals (i.e. other "right"[=semantically accurate]
ways to express properties and relationships among Identifications,
Taxa, diverse types of Occurrences, etc.) I'd like to hear what it is.
Or perhaps as Stan has suggested, there needs to be a task group that
can hash out alternative views. But let's have the discussion before
we post models and suggest people use them.
Steve