Re: [tdwg-content] "Wrong" RDF, was Re: What I learned at the TechnoBioBlitz

17 Oct 2010

Hi Steve,

First of all, thank you for taking the time to carefully articulate your
perspective on this.  As someone who has written more than a few "epic"
posts to these lists, I am personally grateful for the careful explanation.
I know many people don't have time to read long messages, but I think that
the lack of explicit descriptions of things has led to confusion, which, in
the long run, has cost us all even more time.

For most of your email, I found myself nodding in agreement.  I'm only
commenting below on a few passages that caught my attention. I don't know
how much your later posts supersede this one, but here are my comments on
your first.
...
From your first post of Oct 15:
...
In the case of putting "dots on a map" to show the distribution of a 
species, the case is simple if the occurrences are specimens where the 
whole dead organism is collected.  It is not so simple with other 
types of occurrences.  Let me illustrate with an example.  There is 
currently precisely one known individual of Crataegus harbisonii in 
nature.  I have given this individual the URI
http://bioimages.vanderbilt.edu/ind-baskauf/70905 .  I have 
approximately 62 images of that individual at 
http://bioimages.vanderbilt.edu/ind-baskauf/70905.htm and 
http://www.cas.vanderbilt.edu/bioimages/species/crha2.htm .
Each one of these images represents an occurrence in that I pressed 
the shutter on my camera at different times for each one.
Yes, technically, you could represent these 62 images as 62 separate
Occurrence records (rather, as 62 separate events). But assuming the 62
shutter releases were all within a reasonable period of time (e.g., the same
day), it would be perfectly appropriate to collapse these 62 events into a
single event, which spanned in time from the first shutter-release to the
last.  The reason you would do that is that very, very few end-users would
gain much wisdom about nature knowing the precise points in time that the
organism occurred at that place.  Most would be quite happy to infer that it
also existed at the same place in-between the individual shutter-releases.
Thus, representing a single event (anchored to the Occurrence) as a range of
time from first shutter-release to last, would be adequate for essentially
all use-cases.

I realize that each image carries with it the metadata for its moment of
capture, but of course that is metadata that applies to the image (evidence
of the occurrence); not the occurrence itself (which can be safely flattened
to a single occurrence).
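
As a minimal sketch of that flattening (Python; the timestamps are made up,
and the Event class and collapse_to_event function are hypothetical names,
not anything defined by DwC), the per-image capture times stay with the
images while the event keeps only the first-to-last span:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """A single event spanning a window of time (hypothetical class)."""
    earliest: datetime
    latest: datetime


def collapse_to_event(capture_times):
    """Collapse many capture timestamps into one Event, first to last."""
    times = sorted(capture_times)
    if not times:
        raise ValueError("need at least one capture time")
    return Event(earliest=times[0], latest=times[-1])


# Invented shutter-release times for three of the images.
shots = [
    datetime(2010, 5, 1, 11, 15, 0),
    datetime(2010, 5, 1, 9, 0, 0),
    datetime(2010, 5, 1, 9, 0, 30),
]

event = collapse_to_event(shots)

# dwc:eventDate can carry an ISO 8601 interval ("start/end").
event_date = f"{event.earliest.isoformat()}/{event.latest.isoformat()}"
```

The 62 images would then all point at the one occurrence, each keeping its
own capture time as image metadata.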

For analogous reasons, many natural history collections aggregate multiple
specimens of the same taxon collected at the same event into a single "lot",
which gets a single catalog number, and is represented as a single
occurrence.  If I collect 100 individuals of [what I identify as] the same
fish species at the same poison station, I will establish one specimen
record, with individualCount set to 100, and represent it as a single
Occurrence (even though technically, the 100 fish were captured at slightly
different times, and thus I could technically generate a different
Event instance for each, and thus have 100 different occurrence records).
...
Ron Lance has collected tissue from this tree for grafting purposes 
and now has an occurrence with basisOfRecord="LivingSpecimen" in his 
arboretum in North Carolina.  Andrea Bishop of the Tennessee Dept of 
Environment and Conservation has seeds collected from the tree - I'd 
call the collection of those seeds an occurrence record.
I would probably do the same, assuming they weren't collected at the same
time (plus or minus a few hours to a few days) as the tissue sample.  But in
either case, the tree at the place and time is the occurrence; the tissue
sample and the seeds are just evidence of the occurrence.  If either the
time is substantially different (certainly possible!), or the location is
different (not likely, it being a tree), then I would see justification for
treating it as two separate occurrence records, so we know that the tree was
at that place at those two (reasonably separate) times.  And, of course,
we'd want to join the two Occurrences via individualID.
...
So my
question to Marcus and others at GBIF is: how many dots will you put 
on your map for this tree?
My answer: one for each distinctly different moment in time that the tree
was confirmed to be in that place (by whatever evidence). How does one
define "distinctly different"? Well, I suspect that would be best judged by
a biologist who could determine whether there is meaningful knowledge to be
gained by knowing this tree was at this place at multiple points in time
within the span of an hour, vs. multiple points in time within the span of
a day, or a month, or a year. Is diel variation important to document? Is
lunar-cycle variation important to document? Is seasonal variation important
to document?  The answers to these questions would inform the decision of
how many of these "potentially unique Occurrences" should be aggregated into
a single occurrence, which would be represented with less precision in time
and/or place.
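
To make the "distinctly different" idea concrete, here is a rough sketch
(Python; the visit times are invented and group_by_gap is a hypothetical
helper, not a DwC or GBIF function). The biologist's judgment enters as the
max_gap threshold: confirmations closer together than that are aggregated
into one occurrence, and the number of groups is the number of dots.

```python
from datetime import datetime, timedelta


def group_by_gap(times, max_gap):
    """Group timestamps, starting a new group whenever the gap to the
    previous timestamp exceeds max_gap."""
    groups = []
    for t in sorted(times):
        if groups and t - groups[-1][-1] <= max_gap:
            groups[-1].append(t)
        else:
            groups.append([t])
    return groups


# Invented confirmation times for one tree at one place.
visits = [
    datetime(2009, 4, 10, 9, 0),
    datetime(2009, 4, 10, 9, 5),
    datetime(2009, 7, 22, 14, 0),
    datetime(2009, 10, 3, 16, 30),
]

# Seasonal variation matters: anything more than a day apart is distinct.
by_day = group_by_gap(visits, timedelta(days=1))

# Seasonal variation does not matter: aggregate anything within a year.
by_year = group_by_gap(visits, timedelta(days=365))
```

With these made-up visits, the one-day threshold gives three occurrences and
the one-year threshold gives one.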
...
I anticipate that one response
to this question will be to call each imaging bout one "observation" 
having a number of dwc:associatedMedia
references.   That
collapses the number of occurrence records considerably, but not down 
to one.
It could be one.  What is the time span of a "bout", compared to the
time-span of the collection of all bouts combined?  How important is it to
resolve the time component of each bout individually, vs. aggregate them
into a single bout, represented by a broader (less precise) window of time?
...
I took images of that tree on at least three separate instances over 
the course of a year and Ron collected his graft tissue years before 
that.
Sounds to me like four occurrence records, presuming there is value in
distinguishing the occurrence of the tree at the place at different times of
the year.
...
There is simply no way to reduce the number of occurrences for this 
tree to one, nor should we want to.
There is a way -- if you don't care about seasonal variation, you simply
define the Event as a span of time covering all four visits to the tree.
You probably wouldn't want to do that; but you certainly could.
...
A possible use
of multiple occurrence records (i.e. my first point above) of this 
sort might be to establish how long individuals of Crataegus 
harbisonii live and each occurrence record (whether separated by years 
or by the seconds between shutter clicks) is a part of the record that 
we should be able to (and want
to) preserve.  Another use would be to track a non-sessile organism 
(e.g. a whale) in both time and space.  In that case, the record on a 
map for an individual would be some kind of curve rather than a dot.  
But in any case, recognizing the existence of an entity that I'm 
calling an Individual facilitates these broader uses of occurrence 
data and it's really hard for me to see how that is going to happen if 
we ONLY have occurrences as separate entities.
Response Markus?
How does GBIF deal with whale tracks or multiple banded bird 
observations for a single bird?
If I understand your overall point correctly, it is something like this:

Individuals potentially span multiple Occurrences.  DwC uses individualID to
link these multiple Occurrences together. However, there is no class for
individualID, and hence no way to apply additional dwc metadata to the
object represented by the individualID, other than through one or more
occurrence records.

Is that about right?  I mean, we certainly can provide an individualID as a
component of dwc:occurrence; but we have no way with dwc of assigning
metadata specific to that individual, except through a series of occurrence
instances.
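
In other words, something like the following sketch (Python; plain dicts
stand in for flat DwC records, the individualID for the tree is the URI from
your example, and the occurrenceIDs, dates, and the second individualID are
invented). The individual exists only as a shared key across occurrence
records; there is nowhere to hang metadata about the individual itself.

```python
from collections import defaultdict

# URI taken from the example above; everything else here is made up.
TREE = "http://bioimages.vanderbilt.edu/ind-baskauf/70905"

occurrences = [
    {"occurrenceID": "occ-graft", "individualID": TREE, "eventDate": "2005"},
    {"occurrenceID": "occ-spring", "individualID": TREE, "eventDate": "2009-04-10"},
    {"occurrenceID": "occ-autumn", "individualID": TREE, "eventDate": "2009-10-03"},
    {"occurrenceID": "occ-other", "individualID": "urn:example:other-tree",
     "eventDate": "2009-10-03"},
]


def occurrences_by_individual(records):
    """Link occurrence records that share the same individualID."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["individualID"]].append(record["occurrenceID"])
    return dict(grouped)


linked = occurrences_by_individual(occurrences)
```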
...
(In the oversimplified
examples I gave earlier, I applied a scientific name directly to an 
individual.  In actual practice, I relate individuals to 
identifications and then relate the identifications to
taxa.)
Good to know!  This is, in my opinion, the right way to do it.
...
Again, to illustrate with a real-life example, when Bruce Kirchoff was 
developing his Woody Plants of the Southeastern US learning software, 
he asked a taxonomist to go through the images of mine that he was 
using for the project to verify that they were identified correctly.  
My old website just threw together all images of a particular species 
onto one page without regard to the individuals from which they 
originated (e.g.
http://www.cas.vanderbilt.edu/bioimages/species/sarar3.htm
and
http://www.cas.vanderbilt.edu/bioimages/species/soam3.htm).  
It turns out that I had carelessly misidentified a vegetative Sambucus 
racemosa ssp.
racemosa individual as Sorbus americana.  The taxonomist asked me 
which of the various bark, twig, leaf, etc. images were from the same 
plant and the only way I could find out was through the laborious 
process of looking for images with similar time/date values and my 
hand written field notes.  It was a nightmare finding all of the 
particular image records that needed to have their identifications 
fixed and then correcting them.  On my new website (e.g.
http://bioimages.vanderbilt.edu/metadata.htm, then click on Quercus 
chrysolepis), the images are connected to the individual from which 
they originated.  If I discover by looking at a particularly 
informative image that I have misidentified the individual, I only 
need to add an updated determination (i.e. identification) to that 
individual's record and automatically all images from that individual 
are displayed with the correct name and are placed on the correct 
species page.
Well...I would counter that you could achieve the same thing by representing
things via Occurrence, and then cross-linking those Occurrences that
represent the same individual by using the shared individualID. The only
thing you don't have is metadata specific to the individual anchored directly
to the individualID.  Instead, dwc denormalizes this a bit and aggregates
those individual-specific metadata to other classes (Occurrence,
Identification, etc.)
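
A rough sketch of what I mean (Python; the identifier "ind-1", the image
file names, and the dates are all invented, and current_name is a
hypothetical helper). Identifications are keyed by the shared individualID,
so one new determination changes the name shown for every image record that
carries that ID, without an Individual class ever being declared:

```python
# Determinations keyed by individualID: (dateIdentified, scientificName).
identifications = {
    "ind-1": [("2008-06-01", "Sorbus americana")],
}

# Image records cross-linked only by the shared individualID.
images = [
    {"image": "bark.jpg", "individualID": "ind-1"},
    {"image": "twig.jpg", "individualID": "ind-1"},
    {"image": "leaf.jpg", "individualID": "ind-1"},
]


def current_name(individual_id):
    """Return the name from the most recent determination.

    ISO 8601 date strings sort chronologically, so max() picks the latest.
    """
    return max(identifications[individual_id])[1]


names_before = [current_name(img["individualID"]) for img in images]

# One corrected determination, added once for the individual.
identifications["ind-1"].append(
    ("2010-09-15", "Sambucus racemosa ssp. racemosa")
)

names_after = [current_name(img["individualID"]) for img in images]
```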

It's not that I disagree that an "individual" is a useful class in the realm
of biodiversity informatics.  I also think there are a couple of important
entities in taxon name/concept space that warrant their own classes.
However, as John W. and Markus have both emphasized, DwC (necessarily)
represents a compromise between a proper ontological mapping of the
information classes, and a practical vehicle for information exchange
amongst holders of biodiversity datasets. I find this constraining
sometimes, but it helps when I remind myself of the following: DwC is a
mechanism for exchanging data & metadata, not a database model or schema.
As such, I think it covers most (but not all) of the need, with a reasonable
(as opposed to normalized) set of classes and terms.  If you have metadata
for an individual, you can resolve (err...dereference) that metadata by
providing an appropriate individualID via dwc occurrence records.

So, to re-state -- I don't disagree with your premise that there logically
ought to be a class for individual; I'm just not sure it is necessary for
DwC at this time.

Having said that, and being a database-nerd with a tendency to
hyper-normalize data models, I am mostly playing devil's advocate in my
message here.  In truth, I share your view that there should be (should have
been) a class for Individual, and Occurrence should have simply been the
union of an Individual and an Event. So....I reserve the right to stop
playing Devil's Advocate, and join you in your efforts to make the case for
an "individual" class. My only concern is that we may have different
perspectives on how to scope "individual".  In my mind, a better term would
be "organism", rather than "individual"; because in my mind, once you allow
a single coral head (as opposed to the individual polyps) to be represented
as an "individual", you've just allowed for multiple "individuals" -- which
opens the door to ever broader circumscriptions of multiple individuals
(colony-->small group-->herd/school/flock-->population-->taxon concept).
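
For what it's worth, the "union of an Individual and an Event" model I have
in mind looks roughly like this (Python dataclasses; every class and field
name here is hypothetical and none of this is part of DwC as it stands, and
the dates and locality are invented):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Organism:
    """The thing that was there (what has been called Individual)."""
    individual_id: str
    remarks: str = ""


@dataclass(frozen=True)
class Event:
    """A place and a window of time."""
    event_date: str
    locality: str


@dataclass(frozen=True)
class Occurrence:
    """Nothing more than an Organism joined to an Event."""
    organism: Organism
    event: Event


tree = Organism("http://bioimages.vanderbilt.edu/ind-baskauf/70905")
spring = Occurrence(tree, Event("2009-04-10", "hypothetical locality"))
autumn = Occurrence(tree, Event("2009-10-03", "hypothetical locality"))

# Both occurrences refer to the very same organism object.
same_organism = spring.organism is autumn.organism
```

Metadata about the organism lives in one place, and each Occurrence only
adds the when and where.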
...
I recognize that many "specimen-based" organizations aren't really 
going to care one whit about this.  That's fine.  In their databases 
and personal XML schemas they can ignore Individuals as it is their 
prerogative.
Actually, no we can't.  It is often the case that a "lot" of 10 specimens is
later determined to contain more than one taxon.  In such cases, we *do*
need to identify individuals, so we can separate the lot accordingly.  I
know it's not exactly the same as the examples you give, but fundamentally
it's the same basic information flow: aggregated/abstract occurrence needs
more precise recognition of individual organism.
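
A sketch of that lot-splitting flow (Python; the catalog number, the taxon
names "Taxon A" and "Taxon B", the "-1"/"-2" suffix scheme, and the
split_lot function are all invented for illustration, and real collections
will have their own numbering conventions):

```python
def split_lot(lot, counts_by_taxon):
    """Split one lot record into one record per taxon.

    counts_by_taxon maps each newly determined name to the number of
    individuals in the lot that belong to it.
    """
    if sum(counts_by_taxon.values()) != lot["individualCount"]:
        raise ValueError("taxon counts must add up to the lot's individualCount")
    return [
        {
            **lot,
            "catalogNumber": f'{lot["catalogNumber"]}-{i}',
            "scientificName": name,
            "individualCount": count,
        }
        for i, (name, count) in enumerate(sorted(counts_by_taxon.items()), start=1)
    ]


# A lot of 10 originally identified as a single taxon.
lot = {"catalogNumber": "12345", "scientificName": "Taxon A", "individualCount": 10}

# Later determination: 7 are Taxon A, 3 are Taxon B.
new_lots = split_lot(lot, {"Taxon A": 7, "Taxon B": 3})
```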
...
But when we build
RDF templates, I believe strongly that for the benefit of those of us 
who care about the broader applications of occurrences those templates 
should use individuals to connect (one or more) occurrences and (one 
or
more) identifications.  For those with a technical bent, you can see 
how I have done this for an herbarium specimen by looking at the page 
source RDF of the example 
http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0429.rdf
.  For those of a non-technical bent, just look at the webpage that 
shows up when you click on the link.  It looks just like any other web 
page for a specimen and you don't even have to know that the 
underlying RDF supports using Individuals as a grouping mechanism.
I guess my real question is: is the case for individual as a class because
you cannot represent certain information within DwC (via individualID) at
all? Or that you could do so more elegantly if individual was broken out as
a class?  I certainly agree that it would be more elegant to treat
individual as a class; I'm just not convinced that the increase in elegance
justifies the increase in complexity/normalization of DwC.
...
In summary, I think we need Individual as a DwC class to enable 
understandable rdfs:typing of records of individuals and to create a 
context in which instances of individuals can be placed (i.e. people 
would assign and use identifiers for individuals when they document 
occurrences).  These instances (and their assigned URI GUIDSs) would 
allow for "connecting"
identifications and occurrences in a more meaningful way.  I am not 
suggesting that the occurrence be dethroned as the center of 
biodiversity records.  Assuming that the xxxxID terms end up being 
moved out of the various classes and into the record-level terms area 
as was suggested recently, I think that there are really only about 
two terms that should be put into a new Individual class: the other 
new term I have proposed
(individualRemarks) and establishmentMeans (but that is the topic of 
another email).  It may seem odd to suggest adding a class that has 
very few terms in it, but if you follow my reasoning above you will 
hopefully understand why I have done so.
OK, I guess I should have read this paragraph first -- it would have saved
me a lot of typing above.  But the words are already typed above, and I
don't have time to go figure out which ones are no longer necessary, so I'm
leaving it as written (Sorry!).  Anyway, the way you frame it here certainly
clarifies things in my mind, and nudges me closer to joining your crusade
for establishment of an individual class.  But I'm not yet sure we fully
agree on the scope of what an "individual" is.  The most intriguing passage
in your entire email (for me) is this one:
...
I think that there are really only about two terms that should be put 
into a new Individual class: the other new term I have proposed
(individualRemarks) and establishmentMeans (but that is the topic of 
another email)
By including establishmentMeans in your attributes of individual, you've
piqued my interest in reading your "another email".... :-)
...
I hope that the discussion (and criticism!) will continue.  
Again, I'm interested in hearing alternatives.
I'll find time later to read and contemplate your other emails.  For now, I'd
be interested in whether my comments in this email are useful, or am I just
misunderstanding your basic point.

Aloha,
Rich

Richard Pyle