[tdwg-content] "Wrong" RDF, was Re: What I learned at the TechnoBioBlitz

Sun Oct 17 11:37:27 CEST 2010

Hi Steve,

First of all, thank you for taking the time to carefully articulate your
perspective on this.  As someone who has written more than a few "epic"
posts to these lists, I am personally grateful for the careful explanation.
I know many people don't have time to read long messages, but I think that
the lack of explicit descriptions of things has led to confusion, which, in
the long run, has cost us all even more time.

For much of your email, I found myself nodding in agreement.  I'm only
commenting below on a few passages that caught my attention. I don't know
how much your later posts supersede this one, but here are my comments on
your first.

From your first post of Oct 15:

> In the case of putting "dots on a map" to show the distribution of a 
> species, the case is simple if the occurrences are specimens where the 
> whole dead organism is collected.  It is not so simple with other 
> types of occurrences.  Let me illustrate with an example.  There is 
> currently precisely one known individual of Crataegus harbisonii in 
> nature.  I have given this individual the URI
> http://bioimages.vanderbilt.edu/ind-baskauf/70905 .  I have 
> approximately 62 images of that individual at 
> http://bioimages.vanderbilt.edu/ind-baskauf/70905.htm and 
> http://www.cas.vanderbilt.edu/bioimages/species/crha2.htm .
> Each one of these images represents an occurrence in that I pressed 
> the shutter on my camera at different times for each one.

Yes, technically, you could represent these 62 images as 62 separate
Occurrence records (rather, as 62 separate events). But assuming the 62
shutter releases were all within a reasonable period of time (e.g., the same
day), it would be perfectly appropriate to collapse these 62 events into a
single event, which spanned in time from the first shutter-release to the
last.  The reason you would do that is that very, very few end-users would
gain much wisdom about nature knowing the precise points in time that the
organism occurred at that place.  Most would be quite happy to infer that it
also existed at the same place in-between the individual shutter-releases.
Thus, representing a single event (anchored to the Occurrence) as a range of
time from first shutter-release to last, would be adequate for essentially
all use-cases.

I realize that each image carries with it the metadata for its moment of
capture, but of course that is metadata that applies to the image (evidence
of the occurrence), not the occurrence itself (which can be safely flattened
to a single occurrence).
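
As a rough sketch of that flattening (Python; the capture times are made up,
and expressing the result as an ISO 8601 start/end interval in the style of
an eventDate value is an assumption, not something prescribed above):

```python
from datetime import datetime

def collapse_to_event(timestamps):
    """Collapse capture times into one ISO 8601 interval (start/end)."""
    if not timestamps:
        raise ValueError("need at least one timestamp")
    start, end = min(timestamps), max(timestamps)
    return f"{start.isoformat()}/{end.isoformat()}"

# Hypothetical shutter-release times for three of the images.
shots = [
    datetime(2010, 5, 1, 10, 15, 0),
    datetime(2010, 5, 1, 10, 2, 30),
    datetime(2010, 5, 1, 10, 40, 5),
]

# One event spanning first to last shutter release.
event_date = collapse_to_event(shots)
print(event_date)
```

The per-image capture times are not lost; they simply stay with the image
records rather than multiplying the occurrence.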

For analogous reasons, many natural history collections aggregate multiple
specimens of the same taxon collected at the same event into a single "lot",
which gets a single catalog number, and is represented as a single
occurrence.  If I collect 100 individuals of [what I identify as] the same
fish species at the same poison station, I will establish one specimen
record, with individualCount set to 100, and represent it as a single
Occurrence (even though technically, the 100 fish were captured at slightly
different times, and thus I could technically generate a different
Event instance for each, and thus have 100 different occurrence records).

> Ron Lance has collected tissue from this tree for grafting purposes 
> and now has an occurrence with basisOfRecord="LivingSpecimen" in his 
> arboretum in North Carolina.  Andrea Bishop of the Tennessee Dept of 
> Environment and Conservation has seeds collected from the tree - I'd 
> call the collection of those seeds an occurrence record.

I would probably do the same, assuming they weren't collected at the same
time (plus or minus a few hours to a few days) as the tissue sample.  But in
either case, the tree at the place and time is the occurrence; the tissue
sample and the seeds are just evidence of the occurrence.  If either the
time is substantially different (certainly possible!), or the location is
different (not likely, it being a tree), then I would see justification for
treating it as two separate occurrence records, so we know that the tree was
at that place at those two (reasonably separate) times.  And, of course,
we'd want to join the two Occurrences via individualID.

> So my
> question to Marcus and others at GBIF is: how many dots will you put 
> on your map for this tree?

My answer: one for each distinctly different moment in time that the tree
was confirmed to be in that place (by whatever evidence). How does one
define "distinctly different"? Well, I suspect that would be best judged by
a biologist who could determine whether there is meaningful knowledge to be
gained by knowing this tree was at this place at multiple points in time
within the span of an hour, vs. multiple points in time within the span of
a day, or a month, or a year. Is diel variation important to document? Is
lunar-cycle variation important to document? Is seasonal variation important
to document?  The answers to these questions would inform the decision of
how many of these "potentially unique Occurrences" should be aggregated into
a single occurrence, which would be represented with less precision of time
and/or place.
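
One way to picture that decision is as a gap threshold chosen by the
biologist.  The sketch below (Python) is only an illustration: the visit
times are invented, and the function name and the "maximum gap" rule are
assumptions, one of several reasonable ways to define "distinctly different".

```python
from datetime import datetime, timedelta

def group_by_gap(timestamps, max_gap):
    """Group timestamps; a new group starts when the gap exceeds max_gap."""
    groups = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][-1] <= max_gap:
            groups[-1].append(ts)   # close enough: same aggregated occurrence
        else:
            groups.append([ts])     # distinctly different: new occurrence
    return groups

# Hypothetical visits to the same tree over one year.
visits = [
    datetime(2009, 4, 10, 9, 0),
    datetime(2009, 4, 10, 9, 20),
    datetime(2009, 8, 2, 14, 0),
    datetime(2009, 8, 2, 14, 5),
    datetime(2009, 11, 15, 11, 0),
]

by_day = group_by_gap(visits, timedelta(days=1))     # seasonal detail kept
by_year = group_by_gap(visits, timedelta(days=365))  # everything in one span
print(len(by_day), len(by_year))
```

With a one-day threshold the five visits become three dots; with a one-year
threshold they become one.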

> I anticipate that one response
> to this question will be to call each imaging bout one "observation" 
> having a number of dwc:associatedMedia
> references.   That
> collapses the number of occurrence records considerably, but not down 
> to one.

It could be one.  What is the time span of a "bout", compared to the
time-span of the collection of all bouts combined?  How important is it to
resolve the time component of each bout individually, vs. aggregate them
into a single bout, represented by a broader (less precise) window of time?

> I took images of that tree on at least three separate instances over 
> the course of a year and Ron collected his graft tissue years before 
> that.

Sounds to me like four occurrence records, presuming there is value in
distinguishing the occurrence of the tree at the place at different times of
the year.

> There is simply no way to reduce the number of occurrences for this 
> tree to one, nor should we want to.

There is a way -- if you don't care about seasonal variation, you simply
define the Event as a span of time covering all four visits to the tree.
You probably wouldn't want to do that; but you certainly could.

> A possible use
> of multiple occurrence records (i.e. my first point above) of this 
> sort might be to establish how long individuals of Crataegus 
> harbisonii live and each occurrence record (whether separated by years 
> or by the seconds between shutter clicks) is a part of the record that 
> we should be able to (and want
> to) preserve.  Another use would be to track a non-sessile organism 
> (e.g. a whale) in both time and space.  In that case, the record on a 
> map for an individual would be some kind of curve rather than a dot.  
> But in any case, recognizing the existence of an entity that I'm 
> calling an Individual facilitates these broader uses of occurrence 
> data and it's really hard for me to see how that is going to happen if 
> we ONLY have occurrences as separate entities.
> Response Markus?
> How does GBIF deal with whale tracks or multiple banded bird 
> observations for a single bird?

If I understand your overall point correctly, it is something like this:

Individuals potentially span multiple Occurrences.  DwC uses individualID to
link these multiple Occurrences together. However, there is no class for
individualID, and hence no way to apply additional dwc metadata to the
object represented by the individualID, other than through one or more
occurrence records.

Is that about right?  I mean, we certainly can provide an individualID as a
component of dwc:occurrence; but we have no way with dwc of assigning
metadata specific to that individual, except through a series of occurrence
instances.
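
To make the current situation concrete, here is a small sketch (Python) of
occurrences linked only by a shared individualID.  The occurrenceID values,
the dates, and the second individualID are invented; only the tree's URI
comes from the post being quoted.

```python
from collections import defaultdict

TREE = "http://bioimages.vanderbilt.edu/ind-baskauf/70905"

# Flat occurrence records; identifiers and dates are hypothetical.
records = [
    {"occurrenceID": "occ-images-spring", "individualID": TREE,
     "eventDate": "2009-04-10"},
    {"occurrenceID": "occ-images-autumn", "individualID": TREE,
     "eventDate": "2009-11-15"},
    {"occurrenceID": "occ-other", "individualID": "urn:example:ind-2",
     "eventDate": "2009-06-01"},
]

def link_by_individual(records):
    """Map each individualID to the occurrenceIDs that share it."""
    linked = defaultdict(list)
    for rec in records:
        linked[rec["individualID"]].append(rec["occurrenceID"])
    return dict(linked)

linked = link_by_individual(records)
print(linked[TREE])
```

The individual exists here only as a key; anything we want to say about it
has to ride along on one of the occurrence records.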

> (In the oversimplified
> examples I gave earlier, I applied a scientific name directly to an 
> individual.  In actual practice, I relate individuals to 
> identifications and then relate the identifications to
> taxa.)

Good to know!  This is, in my opinion, the right way to do it.

> Again, to illustrate with a real-life example, when Bruce Kirchoff was 
> developing his Woody Plants of the Southeastern US learning software, 
> he asked a taxonomist to go through the images of mine that he was 
> using for the project to verify that they were identified correctly.  
> My old website just threw together all images of a particular species 
> onto one page without regard to the individuals from which they 
> originated (e.g.
> http://www.cas.vanderbilt.edu/bioimages/species/sarar3.htm
> and
> http://www.cas.vanderbilt.edu/bioimages/species/soam3.htm).  
> It turns out that I had carelessly misidentified a vegetative Sambucus 
> racemosa ssp.
> racemosa individual as Sorbus americana.  The taxonomist asked me 
> which of the various bark, twig, leaf, etc. images were from the same 
> plant and the only way I could find out was through the laborious 
> process of looking for images with similar time/date values and my 
> hand written field notes.  It was a nightmare finding all of the 
> particular image records that needed to have their identifications 
> fixed and then correcting them.  On my new website (e.g.
> http://bioimages.vanderbilt.edu/metadata.htm, then click on Quercus 
> chrysolepis), the images are connected to the individual from which 
> they originated.  If I discover by looking at a particularly 
> informative image that I have misidentified the individual, I only 
> need to add an updated determination (i.e. identification) to that 
> individual's record and automatically all images from that individual 
> are displayed with the correct name and are placed on the correct 
> species page.

Well...I would counter that you could achieve the same thing by representing
things via Occurrence, and then cross-linking those Occurrences that
represent the same individual by using the shared individualID. The only
thing you don't have is metadata specific to the individual anchored directly
to the individualID.  Instead, dwc denormalizes this a bit and aggregates
those individual-specific metadata to other classes (Occurrence,
Identification, etc.)
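
A minimal sketch of that cross-linking (Python).  All identifiers and dates
are hypothetical, and "latest dateIdentified wins" is an assumed rule for
choosing the current determination, not something DwC itself dictates.

```python
# Identifications keyed by individualID (dates are invented).
identifications = [
    {"individualID": "ind-1", "scientificName": "Sorbus americana",
     "dateIdentified": "2005-06-01"},
    {"individualID": "ind-1", "scientificName": "Sambucus racemosa",
     "dateIdentified": "2010-03-15"},
]

# Image occurrences that share the same individualID.
images = [
    {"occurrenceID": "img-bark", "individualID": "ind-1"},
    {"occurrenceID": "img-leaf", "individualID": "ind-1"},
]

def current_names(identifications):
    """Return the most recent scientificName for each individualID."""
    latest = {}
    for ident in identifications:
        key = ident["individualID"]
        # ISO 8601 date strings sort chronologically as plain strings.
        if key not in latest or ident["dateIdentified"] > latest[key]["dateIdentified"]:
            latest[key] = ident
    return {key: ident["scientificName"] for key, ident in latest.items()}

names = current_names(identifications)
labelled = [(img["occurrenceID"], names[img["individualID"]]) for img in images]
print(labelled)
```

Adding one new determination for the individual relabels every image that
carries its individualID, without touching the image records themselves.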

It's not that I disagree that an "individual" is a useful class in the realm
of biodiversity informatics.  I also think there are a couple of important
entities in taxon name/concept space that warrant their own classes.
However, as John W. and Markus have both emphasized, DwC (necessarily)
represents a compromise between a proper ontological mapping of the
information classes, and a practical vehicle for information exchange
amongst holders of biodiversity datasets. I find this constraining
sometimes, but it helps when I remind myself of the following: DwC is a
mechanism for exchanging data & metadata, not a database model or schema.
As such, I think it covers most (but not all) of the need, with a reasonable
(as opposed to normalized) set of classes and terms.  If you have metadata
for an individual, you can resolve (err...dereference) that metadata by
providing an appropriate individualID via dwc occurrence records.

So, to re-state -- I don't disagree with your premise that there logically
ought to be a class for individual; I'm just not sure it is necessary for
DwC at this time.

Having said that, and being a database-nerd with a tendency to
hyper-normalize data models, I am mostly playing devil's advocate in my
message here.  In truth, I share your view that there should be (should have
been) a class for Individual, and Occurrence should have simply been the
union of an Individual and an Event. So....I reserve the right to stop
playing Devil's Advocate, and join you in your efforts to make the case for
an "individual" class. My only concern is that we may have different
perspectives on how to scope "individual".  In my mind, a better term would
be "organism", rather than "individual"; because in my mind, once you allow
a single coral head (as opposed to the individual polyps) to be represented
as an "individual", you've just allowed for multiple "individuals" -- which
opens the door to ever broader circumscriptions of multiple individuals
(colony-->small group-->herd/school/flock-->population-->taxon concept).
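
For the database-minded, the "Occurrence as the union of an Individual and
an Event" idea might be sketched like this (Python).  The class and field
names are illustrative assumptions, not DwC terms, and the dates and
locality are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Individual:
    individual_id: str
    remarks: str = ""       # metadata anchored directly to the individual

@dataclass(frozen=True)
class Event:
    event_date: str         # a single date or a start/end interval
    locality: str

@dataclass(frozen=True)
class Occurrence:
    individual: Individual  # who/what was there
    event: Event            # where and when

tree = Individual("http://bioimages.vanderbilt.edu/ind-baskauf/70905")
spring = Event("2009-04-10", "hypothetical locality")
autumn = Event("2009-11-15", "hypothetical locality")

# Two occurrences of one individual: same Individual, different Events.
occurrences = [Occurrence(tree, spring), Occurrence(tree, autumn)]
print(occurrences[0].individual is occurrences[1].individual)
```

How broadly "Individual" may be circumscribed (polyp, coral head, colony,
herd) is exactly the scoping question this model leaves open.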

> I recognize that many "specimen-based" organizations aren't really 
> going to care one whit about this.  That's fine.  In their databases 
> and personal XML schemas they can ignore Individuals as it is their 
> prerogative.

Actually, no we can't.  Often is the case that a "lot" of 10 specimens is
later determined to contain more than one taxon.  In such cases, we *do*
need to identify individuals, so we can separate the lot accordingly.  I
know it's not exactly the same as the examples you give, but fundamentally
it's the same basic information flow: aggregated/abstract occurrence needs
more precise recognition of individual organism.
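
A sketch of that lot-splitting flow (Python).  The catalog number, taxon
names, and the "-1", "-2" suffix convention are all hypothetical; real
collections will have their own renumbering rules.

```python
from collections import Counter

def split_lot(lot, determinations):
    """Split one lot record into one record per taxon found on re-examination.

    `determinations` holds one name per specimen in the lot.
    """
    if len(determinations) != lot["individualCount"]:
        raise ValueError("need one determination per specimen in the lot")
    counts = Counter(determinations)
    return [
        {**lot,
         "catalogNumber": f'{lot["catalogNumber"]}-{i}',  # assumed convention
         "scientificName": name,
         "individualCount": n}
        for i, (name, n) in enumerate(sorted(counts.items()), start=1)
    ]

# A lot of 10 originally recorded as a single taxon.
lot = {"catalogNumber": "LOT-0001", "scientificName": "Genus alpha",
       "individualCount": 10}

# Re-examination finds that three specimens belong to a second taxon.
determinations = ["Genus alpha"] * 7 + ["Genus beta"] * 3

parts = split_lot(lot, determinations)
print([(p["catalogNumber"], p["scientificName"], p["individualCount"]) for p in parts])
```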

> But when we build
> RDF templates, I believe strongly that for the benefit of those of us 
> who care about the broader applications of occurrences those templates 
> should use individuals to connect (one or more) occurrences and (one 
> or
> more) identifications.  For those with a technical bent, you can see 
> how I have done this for an herbarium specimen by looking at the page 
> source RDF of the example 
> http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0429.rdf
> .  For those of a non-technical bent, just look at the webpage that 
> shows up when you click on the link.  It looks just like any other web 
> page for a specimen and you don't even have to know that the 
> underlying RDF supports using Individuals as a grouping mechanism.

I guess my real question is: is the case for individual as a class because
you cannot represent certain information within DwC (via individualID) at
all? Or that you could do so more elegantly if individual was broken out as
a class?  I certainly agree that it would be more elegant to treat
individual as a class; I'm just not convinced that the increase in elegance
justifies the increase in complexity/normalization of DwC.

> In summary, I think we need Individual as a DwC class to enable 
> understandable rdfs:typing of records of individuals and to create a 
> context in which instances of individuals can be placed (i.e. people 
> would assign and use identifiers for individuals when they document 
> occurrences).  These instances (and their assigned URI GUIDSs) would 
> allow for "connecting"
> identifications and occurrences in a more meaningful way.  I am not 
> suggesting that the occurrence be dethroned as the center of 
> biodiversity records.  Assuming that the xxxxID terms end up being 
> moved out of the various classes and into the record-level terms area 
> as was suggested recently, I think that there are really only about 
> two terms that should be put into a new Individual class: the other 
> new term I have proposed
> (individualRemarks) and establishmentMeans (but that is the topic of 
> another email).  It may seem odd to suggest a adding a class that has 
> very few terms in it, but if you follow my reasoning above you will 
> hopefully understand why I have done so.

OK, I guess I should have read this paragraph first -- it would have saved
me a lot of typing above.  But the words are already typed above, and I
don't have time to go figure out which ones are no longer necessary, so I'm
leaving it as written (Sorry!).  Anyway, the way you frame it here certainly
clarifies things in my mind, and nudges me closer to joining your crusade
for establishment of an individual class.  But I'm not yet sure we fully
agree on the scope of what an "individual" is.  The most intriguing passage
in your entire email (for me) is this one:

> I think that there are really only about two terms that should be put 
> into a new Individual class: the other new term I have proposed
> (individualRemarks) and establishmentMeans (but that is the topic of 
> another email)

By including establishmentMeans in your attributes of individual, you've
piqued my interest in reading your "another email".... :-)

> I hope that the discussion (and criticism!) will continue.  
> Again, I'm interested in hearing alternatives.

I'll find time later to read and contemplate your other emails.  For now, I'd
be interested in whether my comments in this email are useful, or whether I
am just misunderstanding your basic point.

Aloha,
Rich
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BiSciCol-Summary.pdf
Type: application/pdf
Size: 77964 bytes
Desc: not available
Url : http://lists.tdwg.org/pipermail/tdwg-content/attachments/20101016/0381b42f/attachment-0001.pdf