[tdwg-content] "Wrong" RDF, was Re: What I learned at the TechnoBioBlitz

Thu Oct 14 17:43:23 CEST 2010

Thanks for the various replies.  I'm going to try to respond to several 
of them in this one.  I realize that these lengthy replies may overwhelm 
some readers.  However, I will beg your collective indulgence because 
I've got a proposal on the table for adding Individual as a Darwin Core 
class.  It appears that the submission process is moving forward and you 
can consider this as the "pleading of my case" for why that addition is 
desirable (and, in my opinion, necessary).

One point which I think has permeated the Darwin Core discussions since 
I've started following them is that DwC is designed to facilitate many 
uses.  Although somebody might use Occurrence records to make dots on a 
distribution map, somebody else might be using the same records to track 
the movement of the individual organism as it swims around the sea.  
Somebody else may just be using the location and time metadata to 
demonstrate that the photo that they took places the organism in a 
reasonable location for the species they assert they have photographed.  
Another person may be using the location and time metadata to indicate 
that two species co-occurred at the same location at the same time.  
Darwin Core will be functioning well when it allows occurrence records 
to do any of these things or possibly all of these things at the same 
time.  The case that I'll try to make here is that Darwin Core mostly 
allows these things, but lack of an Individual class is making it 
difficult to do some of them.  I will illustrate with a couple examples.

The first one is the problem of tracking an individual over time.  As 
Rich correctly points out, the "new" Darwin Core standard has the term 
dwc:individualID which is designed to facilitate exactly this kind of 
thing.  In a previous thread when we discussed the appropriate use of 
the xxxxID terms, I believe that there was a consensus that using them 
as "idrefs" (I can't remember the technical database term for this, I 
mean when an item in a record points to the identifier of another 
record) was appropriate.  In a flat "table-based" database system, you 
would just have a table of records (i.e. rows) for some kind of "thing" 
with a column heading of  "xxxxID".  You would place the identifier for 
the related other thing in that column.  In the case of 
dwc:individualID, the rows would be occurrence records and the entry in 
the individualID column would be the identifier for the individual.  In 
RDF, you would make statements asserting the relationship between the 
thing and the other thing.  For example, if you wanted to say that a 
dwc:Identification asserted that something was a particular dwc:Taxon, 
you could make the statement in RDF that [identification] dwc:taxonID 
[taxon], where [identification] and [taxon] are instances of those two 
classes that have been assigned some kind of (hopefully globally unique) 
identifiers.  In the case of asserting that a number of occurrence 
records track the same individual over time, in RDF, I would for each 
occurrence make the statement [occurrence] dwc:individualID 
[individual].  That's great and I can (and do) do that with Darwin Core 
as it exists.  The problem that I face is that in RDF any time that one 
makes a statement about a resource (I'm switching to that term because 
"thing" is too vague) using an identifier for it (in the form of a URI), 
the identifier must dereference (resolve? sorry Bob!) to produce 
metadata about the resource.  So when I assign a URI to an individual 
organism a semantic client should be able to retrieve information about 
the individual.  One of the fundamental pieces of information that a 
client should (according to the TDWG GUID applicability statement) be 
given about a resource is what type of thing the resource is.  This is 
called the "rdf:type" of the resource.  The TDWG Applicability 
statement (recommendation 11) says that resources identified by a GUID 
"should be typed using the TDWG ontology or other well-known 
vocabularies".  I hate to be cynical about this, but I don't have 
confidence that the TDWG ontology will be ready to use in my lifetime.  
The only "well-known" vocabulary that I know of that will work for this 
purpose at the moment is Darwin Core and the Darwin Core classes are 
just right for typing all of the kinds of resources I want to talk about 
(occurrences, taxon, identifications, etc.) EXCEPT for Individuals.  I 
think that dwc:individualID is the only one of the xxxxID terms that 
refers to a type of thing that doesn't have a class defined for it, 
hence my request to add Individual as a class.  At the TDWG meeting, 
somebody (Roger maybe?) commented that there isn't anything that would 
stop me from creating my own URI for an Individual class.  That is 
absolutely true and I already did that 
(http://bioimages.vanderbilt.edu/rdf/terms#Individual), but that doesn't 
make my term "well known".  I want Individual to be a class in Darwin 
Core so that people other than me know what it means.  There is no way 
that I can currently follow the "rules" for GUIDs and RDF on this, and 
anybody in the future who uses dwc:individualID in RDF is going to face 
this same problem (i.e. anyone who wants to track individuals over time).
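
To make the statements above concrete, here is a minimal sketch in Python that uses plain tuples as stand-ins for RDF triples.  The individual's URI and the locally minted Individual class URI are the ones given in this message; the occurrence URIs and the helper function names are hypothetical and exist only for illustration.  The typing statement uses the standard rdf:type property.

```python
# Plain tuples stand in for RDF triples: (subject, predicate, object).
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DWC = "http://rs.tdwg.org/dwc/terms/"

# URIs taken from this message.
INDIVIDUAL = "http://bioimages.vanderbilt.edu/ind-baskauf/70905"
LOCAL_INDIVIDUAL_CLASS = "http://bioimages.vanderbilt.edu/rdf/terms#Individual"

# Hypothetical occurrence identifiers, invented for illustration only.
occurrences = [
    "http://example.org/occurrence/1",
    "http://example.org/occurrence/2",
    "http://example.org/occurrence/3",
]

triples = []
for occ in occurrences:
    # Each occurrence can be typed with an existing Darwin Core class ...
    triples.append((occ, RDF_TYPE, DWC + "Occurrence"))
    # ... and points to the individual it documents.
    triples.append((occ, DWC + "individualID", INDIVIDUAL))

# The individual itself can only be typed with a locally minted class.
triples.append((INDIVIDUAL, RDF_TYPE, LOCAL_INDIVIDUAL_CLASS))


def types_of(resource, graph):
    """Return every class asserted for a resource via rdf:type."""
    return [o for s, p, o in graph if s == resource and p == RDF_TYPE]


def occurrences_of(individual, graph):
    """Return every subject that points to the individual via dwc:individualID."""
    return [s for s, p, o in graph
            if p == DWC + "individualID" and o == individual]


def is_typed_with_dwc(resource, graph):
    """True if at least one asserted type lives in the Darwin Core namespace."""
    return any(t.startswith(DWC) for t in types_of(resource, graph))


print(occurrences_of(INDIVIDUAL, triples))          # the three occurrences
print(is_typed_with_dwc(occurrences[0], triples))   # True
print(is_typed_with_dwc(INDIVIDUAL, triples))       # False
```

The last line is the gap being described: the occurrences can be typed from a well-known vocabulary, while the individual they all point to cannot.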

In the case of putting "dots on a map" to show the distribution of a 
species, the case is simple if the occurrences are specimens where the 
whole dead organism is collected.  It is not so simple with other types 
of occurrences.  Let me illustrate with an example.  There is currently 
precisely one known individual of Crataegus harbisonii in nature.  I 
have given this individual the URI 
http://bioimages.vanderbilt.edu/ind-baskauf/70905 .  I have 
approximately 62 images of that individual at 
http://bioimages.vanderbilt.edu/ind-baskauf/70905.htm and 
http://www.cas.vanderbilt.edu/bioimages/species/crha2.htm .  Each one of 
these images represents an occurrence in that I pressed the shutter on 
my camera at different times for each one.  Ron Lance has collected 
tissue from this tree for grafting purposes and now has an occurrence 
with basisOfRecord="LivingSpecimen" in his arboretum in North Carolina.  
Andrea Bishop of the Tennessee Dept of Environment and Conservation has 
seeds collected from the tree - I'd call the collection of those seeds 
an occurrence record.  I'm pretty sure that there are one or more 
specimens from this tree in herbaria (although I'm not sure where).  So 
my question to Markus and others at GBIF is: how many dots will you put 
on your map for this tree?  65 (one for each occurrence) or 1 (one for 
each individual)?  I think the answer should be one, but it isn't clear 
to me how a data aggregator is going to achieve the goal of having one 
dot per individual if the basic unit of "dot creation" is an occurrence 
rather than an individual.  At the present moment, this question seems 
like a moot point because most records in big databases like GBIF are 
based on one specimen (or observation) per record of an individual, but 
that won't necessarily be the case in the future if people take multiple 
live organism images, perhaps also at the same time they collect a 
physical specimen.  I anticipate that one response to this question will 
be to call each imaging bout one "observation" having a number of 
dwc:associatedMedia references.   That collapses the number of 
occurrence records considerably, but not down to one.  I took images of 
that tree on at least three separate instances over the course of a year 
and Ron collected his graft tissue years before that.  There is simply 
no way to reduce the number of occurrences for this tree to one, nor 
should we want to.  A possible use of multiple occurrence records (i.e. 
my first point above) of this sort might be to establish how long 
individuals of Crataegus harbisonii live and each occurrence record 
(whether separated by years or by the seconds between shutter clicks) is 
a part of the record that we should be able to (and want to) preserve.  
Another use would be to track a non-sessile organism (e.g. a whale) in 
both time and space.  In that case, the record on a map for an 
individual would be some kind of curve rather than a dot.  But in any 
case, recognizing the existence of an entity that I'm calling an 
Individual facilitates these broader uses of occurrence data and it's 
really hard for me to see how that is going to happen if we ONLY have 
occurrences as separate entities.  Response Markus?  How does GBIF deal 
with whale tracks or multiple banded bird observations for a single bird?
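
The difference between the two counting rules can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not a description of how GBIF or any other aggregator builds its maps: the record identifiers and the "kind" labels are hypothetical, and the counts follow the approximate figures given above (about 62 images plus a graft, a seed collection, and one herbarium specimen).

```python
# The individual's URI is the one given in this message.
TREE = "http://bioimages.vanderbilt.edu/ind-baskauf/70905"

# Hypothetical occurrence records: (occurrence_id, individual_id, kind).
records = [("image-%d" % i, TREE, "image") for i in range(1, 63)]  # 62 images
records += [
    ("graft-1", TREE, "living specimen"),
    ("seeds-1", TREE, "seed collection"),
    ("herbarium-1", TREE, "preserved specimen"),
]


def dots_per_occurrence(recs):
    """One dot for every occurrence record."""
    return len(recs)


def dots_per_individual(recs):
    """One dot for every distinct individual identifier."""
    return len({individual_id for _, individual_id, _ in recs})


print(dots_per_occurrence(records))   # 65
print(dots_per_individual(records))   # 1
```

The second rule is only possible when each occurrence record carries an identifier for the individual it documents.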

The third compelling reason for recognizing the existence of Individuals 
as a resource type is that it is the best way to maintain the linkage 
between multiple occurrences of the same individual and 
identifications.  (In the oversimplified examples I gave earlier, I 
applied a scientific name directly to an individual.  In actual 
practice, I relate individuals to identifications and then relate the 
identifications to taxa.)  Again, to illustrate with a real-life 
example, when Bruce Kirchoff was developing his Woody Plants of the 
Southeastern US learning software, he asked a taxonomist to go through 
the images of mine that he was using for the project to verify that they 
were identified correctly.  My old website just threw together all 
images of a particular species onto one page without regard to the 
individuals from which they originated (e.g. 
http://www.cas.vanderbilt.edu/bioimages/species/sarar3.htm and 
http://www.cas.vanderbilt.edu/bioimages/species/soam3.htm).  It turns 
out that I had carelessly misidentified a vegetative Sambucus racemosa 
ssp. racemosa individual as Sorbus americana.  The taxonomist asked me 
which of the various bark, twig, leaf, etc. images were from the same 
plant and the only way I could find out was through the laborious 
process of looking for images with similar time/date values and my hand 
written field notes.  It was a nightmare finding all of the particular 
image records that needed to have their identifications fixed and then 
correcting them.  On my new website (e.g. 
http://bioimages.vanderbilt.edu/metadata.htm, then click on Quercus 
chrysolepis), the images are connected to the individual from which they 
originated.  If I discover by looking at a particularly informative 
image that I have misidentified the individual, I only need to add an 
updated determination (i.e. identification) to that individual's record 
and automatically all images from that individual are displayed with the 
correct name and are placed on the correct species page.  Now imagine a 
situation that is larger and even more complicated than this (think a 
Bioblitz).  Herbarium curators and live plant photographers are working 
together to document the flora of an area.  Multiple images and multiple 
specimens may be collected from the same individual.  The images may go 
one place and the specimens may go to several herbaria (if "duplicates" 
are distributed).  It's possible that people might come back to the same 
individual later to photograph or collect fruit having initially seen 
flowers.  Suppose on down the line a taxonomist looks at one of the 
specimen duplicates and realizes that the initial identification was 
wrong (or maybe just wants to assert an alternative opinion about the 
identity).  If the record is based on that individual, then all that is 
required is for the annotating taxonomist to add a determination (i.e. 
dwc:Identification) to the Individual's record and poof! all images and 
duplicate specimens have that opinion associated with them.  In 
contrast, if all of these separate occurrence records are not tied 
together via the Individual, and if each individual occurrence record 
has its own determination, nobody is ever going to track down 
and correct every one.  Granted, the scenario that I've suggested is 
contingent on the existence of a large scale database that can connect 
metadata across institutions, but exactly that kind of thing is what 
projects like the US Virtual Herbarium and our Live Plants Imaging group 
are trying to create.  Let's enable this by making it possible within 
Darwin Core to have a record structure that is Individual-based.
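
A minimal sketch of this Individual-based record structure, in Python.  The image file names, the individual identifiers, and the function names are hypothetical; the two names are the ones from the misidentification described above.  Each image points to an individual, determinations are attached to the individual, and adding a determination never overwrites an earlier one.

```python
# Each image points to the individual it was taken from (hypothetical ids).
image_to_individual = {
    "bark.jpg": "ind-A",
    "twig.jpg": "ind-A",
    "leaf.jpg": "ind-A",
    "flower.jpg": "ind-B",
}

# Determinations are attached to individuals, oldest first, newest last.
determinations = {
    "ind-A": ["Sorbus americana"],
    "ind-B": ["Sorbus americana"],
}


def current_name(image):
    """Name shown for an image: the latest determination of its individual."""
    return determinations[image_to_individual[image]][-1]


def add_determination(individual, name):
    """Record a new opinion without discarding earlier ones."""
    determinations[individual].append(name)


def images_named(name):
    """All images that are currently displayed under the given name."""
    return sorted(img for img in image_to_individual if current_name(img) == name)


# One correction on the individual updates every image taken from it.
add_determination("ind-A", "Sambucus racemosa ssp. racemosa")

print(images_named("Sambucus racemosa ssp. racemosa"))
print(images_named("Sorbus americana"))
```

If each image carried its own determination instead, the same correction would have to be repeated once per image, which is the situation described for the old website.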

I recognize that many "specimen-based" organizations aren't really going 
to care one whit about this.  That's fine.  In their databases and 
personal XML schemas they can ignore Individuals as it is their 
prerogative.  But when we build RDF templates, I believe strongly that 
for the benefit of those of us who care about the broader applications 
of occurrences those templates should use individuals to connect (one or 
more) occurrences and (one or more) identifications.  For those with a 
technical bent, you can see how I have done this for an herbarium 
specimen by looking at the page source RDF of the example 
http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0429.rdf .  For 
those of a non-technical bent, just look at the webpage that shows up 
when you click on the link.  It looks just like any other web page for a 
specimen and you don't even have to know that the underlying RDF 
supports using Individuals as a grouping mechanism.

In summary, I think we need Individual as a DwC class to enable 
understandable typing (via rdf:type) of records of individuals and to create a 
context in which instances of individuals can be placed (i.e. people 
would assign and use identifiers for individuals when they document 
occurrences).  These instances (and their assigned URI GUIDs) would 
allow for "connecting" identifications and occurrences in a more 
meaningful way.  I am not suggesting that the occurrence be dethroned as 
the center of biodiversity records.  Assuming that the xxxxID terms end 
up being moved out of the various classes and into the record-level 
terms area as was suggested recently, I think that there are really only 
about two terms that should be put into a new Individual class: the 
other new term I have proposed (individualRemarks) and 
establishmentMeans (but that is the topic of another email).  It may 
seem odd to suggest adding a class that has very few terms in it, but 
if you follow my reasoning above you will hopefully understand why I 
have done so. 

I hope that the discussion (and criticism!) will continue.  Again, I'm 
interested in hearing alternatives.
Steve

Richard Pyle wrote:
>> In many cases, a specimen is created by killing an organism and gluing it
>> to a piece of paper (if it's a plant) or putting it in a jar (if it's an
>> animal).  It is natural to ask the question "what kind of species is the
>> specimen?".  We can look at the specimen and make a statement like
>> [specimen] dwc:scientificName "Drosophila melanogaster" and it pretty
>> much makes sense.  However, in the new Darwin Core standard, we have a
>> broader category of "things" (a.k.a. resources) that we call Occurrences
>> which include specimens but which also includes observations and probably
>> all kinds of things like images, DNA samples, and a whole lot of other
>> things.  If we try to apply the same kind of statement to other kinds of
>> Occurrences besides specimens we immediately run into problems.  If we
>> say that [digital image] dwc:scientificName "Drosophila melanogaster" we
>> are making a nonsensical statement.  The digital image can have
>> properties like its photographer, its format, its pixel dimensions, etc.
>> but the image itself does not have a scientific name.  The scientific
>> name is a property of the thing that was photographed.  It makes even
>> less sense if we are talking about observations.  An observation is a
>> situation where somebody observes an organism.  The observation can have
>> properties like the observer, the location, etc.  However, if we say
>> [observation] dwc:scientificName "Drosophila melanogaster" we are saying
>> that that act of observing has a scientific name.  That is an incorrect
>> statement.  So the general statement [Occurrence] dwc:scientificName
>> "Drosophila melanogaster" does not make sense when applied to all
>> possible types of Occurrences.  Rather, the organism that we are
>> observing is the thing that has a scientific name.
>
> OK, I admit that I have not been following this list as closely as I should
> have -- especially during the latter half of 2009.  But I have to
> ask....seriously....is this the level of misunderstanding that still exists
> in our community?
>
> Perhaps I'm the idiot here, but it has *always* been my understanding that
> the "thing" (I hesitate to use the word "basis") of an Occurrence instance
> is *always* the organism (or set of organisms, or impression of an organism
> in the case of fossils).  If the organisms were captured and preserved in a
> Museum, then we call it a specimen.  If the organisms were only witnessed
> and not captured, we call it an observation.  Everything else (including the
> physical specimen) is just layers of evidence to support the existence and
> taxonomic identification of the organism within the Occurrence.  When
> photons reflected off the outer surface of an organism find their way
> through a lens and onto some mechanism for recording said photons (either a
> human retina and neurons in the brain, or sheet of celluloid, or digital
> image sensor and memory stick), it's still the organism that the photons
> reflected off of, which represents the "thing" of the Occurrence to which
> metadata apply. Same goes for vocalizations transmitted through pressure
> waves in the air onto some recording device (ear/brain, or microphone/tape).
>
> So while it's certainly true that a media object such as a 35mm slide or
> digital image file does not itself have a scientificName (then again, some
> of my old Kodachromes have enough mold on them that they might....), said
> media objects are *not* the Occurrence itself -- they merely represent
> evidence of the occurrence.  Even a specimen in a jar is not the Occurrence
> itself.  The Occurrence occurred when the specimen was captured (e.g., 400
> feet deep on a coral reef).  A specimen in a jar on a shelf in a Museum is
> no longer the "Occurrence"; it is the evidence of the Occurrence.
>
> When I assign a GUID to an Occurrence record that lacks a voucher (i.e., an
> "Observation"), I'm certainly not trying to identify the act of observation;
> I'm identifying the organism that was observed, at the time and place that
> it was observed.
>
> For what it's worth, if I only have a still or video image of an organism
> (e.g., http://www.youtube.com/watch?v=GVTd11q3Ppc; taken by Rob Whitton, who
> some of you met at TDWG this year), and didn't collect the specimen, I
> create an Observation record, and link the image to it as associatedMedia.
> I would never assign a taxon name to the video clip -- only to the "content
> item" of the video that represents an organism, serving as the basis of an
> Occurrence record.
>
>> The specimen is an occurrence of the individual organism.
>> The image is an occurrence of the individual organism.
>> The observation is an occurrence of the individual organism.
>
> I would say in all three cases that the presence of an organism at a place
> and time was the Occurrence.  Specimens, images, and reported observations
> are merely the evidence that the occurrence existed (and to varying degrees,
> can also allow for subsequent interpretations of taxonomic identification).
>
>> These statements may seem odd because we are used to
>> thinking of an Occurrence being an occurrence of the
>> "species" but it's not really.
>
> I completely agree.  The occurrence was the organism at a place and time.
> The "species" is merely the taxon concept that someone identified the
> organism as belonging to.  The scientificName is merely the label that
> someone applied to the taxon concept.  In other words, the scientificName is
> really a property of the Taxon Concept, and the Taxon Concept is the subject
> of an identification event, and the identification event was applied to the
> organism, which itself represents the basis of an Occurrence.  But very few
> people go to the trouble of creating that full chain of relationships, so as
> a short-hand, the scientificName is often treated as a direct property of
> the occurrence (collected or observed organism).  I think this short-hand is
> perfectly fine in the context of DwC, but only as long as people understand
> the implied chain of linked entities.  If we start to forget what's really
> going on, then we run into trouble.
>
> Which, I guess, was the whole point of Steve's post.
>
> What concerns me, though, is that we're not (yet?) already beyond this.
>
>> This point becomes more clear if we look at a situation where several
>> types of occurrence records are collected from the same individual.
>> Let's say that we capture a bird, photograph it, collect a feather from
>> it, collect a DNA sample and band it and let it go.  Later somebody sees
>> the band and reports that as an observation.
>> How do we connect all of these things?
>
> Two Occurrences:  The first one when it was captured, photographed, and
> relieved of a feather. The second when it was observed at a later date.
>
>> Do we create an identifier for the specimen (the feather)
>> and then say that the image and the DNA sample came from it?
>
> We create an identifier for the first Occurrence, capture the
> specimen-relevant metadata of the preserved feather, and track the DNA
> sample via associatedSequences.
>
>> That would be wrong.  We could take an image of the feather,
>> but that would be a different thing from an image of the bird.
>
> It's certainly different from an image of the whole Bird, but that doesn't
> preclude us from including both bird and feather images among
> associatedMedia for the first Occurrence.
>
>> We didn't get the DNA sample from the feather, we got it
>> via a blood sample from the bird.
>
> I don't see that as a problem, because the feather is only the evidence of
> the bird at the place and time (i.e., the first Occurrence). Thus, the
> sequence can still be included as part of the associatedSequences for the
> first Occurrence.
>
>> The band observation is not an observation of the feather,
>> or the image or the DNA sample.  It's an observation of
>> the bird which was never any kind of specimen living or dead.
>> The bird is an individual organism and that's what we need to call it.
>
> Agreed -- it forms the basis for the second Occurrence record (later date).
> The two Occurrence records can be cross referenced, either via a shared
> individualID, or via associatedOccurrences.
>
>> Right now we don't have anything in Darwin Core that can
>> be used to rdf:type the bird, which is why I proposed Individual
>> as a Darwin Core class.
>
> As someone else alluded to earlier in this thread, there are near-infinite
> ways that we can slice & cluster biodiversity data. I think there are some
> cases where "individual" makes a lot of sense as a class (banded birds,
> managed organisms in zoos and curated gardens, whale and shark observation
> datasets, plant monitoring projects, etc.). But I think the notion of
> "Occurrence" makes more sense at this point in biodiversity informatics
> history, because the vast majority of datasets can be organized in this way
> relatively painlessly, and because the majority of questions being asked of
> these data revolve around presence of organisms identified to taxon concepts
> occurring at place and time.
>
>> I could say these things more clearly in RDF, but because
>> many members of the audience of this message
>> aren't familiar with RDF/XML they would probably zone
>> out and the point would be lost.
>
> Myself among them.  Thank you for presenting it in the less-efficient
> English Prose form.
>
>> The point is that we need to have identifiable classes of "resources"
>> (the technical name for "things" like physical artifacts, concepts,
>> and electronic representations) for all of the things that we
>> need to describe and inter-relate in the Darwin Core world.
>> Right now, we are missing one of the important pieces that we need,
>> which is a class for the Individual.  If we are satisfied with creating
>> an RDF model that only works for specimens and one-time observations,
>> then we probably don't need Individual as a Darwin Core class.  On the
>> other hand, if TDWG and GBIF are really serious about creating a
>> system (Darwin Core and RDF based on it) that can handle other types
>> of Occurrences like multiple images of live organisms, observations
>> of the same organism over time, and multiple types of Occurrences
>> collected from the same organism, then this capability should be built
>> into the system from the start.  When I got back from the TDWG meeting,
>> I was all excited about trying to use Darwin Core Archives with my
>> live plant image collection.  However, it quickly became evident
>> that it could not work because Occurrences were at the center of the
>> diagram rather than Individuals.  So unless something changes, we
>> are already embarking on the process of locking out these other
>> Occurrence types.
>
> Well...I certainly agree with you that we need *clear* documentation on what
> these classes are intended to represent.  I had *thought* it was clear that
> an Occurrence was as I have outlined above.  But like I said, I'm perfectly
> willing to accept that I'm the idiot in this case, and am completely out of
> phase with the rest of the community.
>
> As to whether or not we need to define a class for Individual, I'm not so
> sure that's entirely necessary.  I guess DwC is already primed for it
> (http://rs.tdwg.org/dwc/terms/index.htm#individualID) -- but I'm not sure
> what properties would apply to such a class that are not already covered in
> DwC.  Probably the next iteration of DwC would move some of the
> properties of the Occurrence class (catalogNumber, individualCount,
> preparations, disposition, associatedSequences, previousIdentifications)
> over to the Individual Class, at which point the Occurrence becomes the
> intersection of an Individual and an Event.
>
> But let me ask: how would you scope "Individual"? (see my previous rants on
> this list in recent days)  Would it be restricted to a particular individual
> organism? Or, would it be extended to include specified groups of organisms
> (as dwc:individualID already does)? What about populations?  Taxon Concepts?
>
>> I hate to sound like a broken record (do we have those any more?),
>> but read my paper on this subject.
>
> I've gotten through the first few pages, and intend to finish soon.  But
> it's much more fun to write emails about this stuff..... :-)
>
> Aloha,
> Rich
>
> Richard L. Pyle, PhD
> Database Coordinator for Natural Sciences
> Associate Zoologist in Ichthyology
> Dive Safety Officer
> Department of Natural Sciences, Bishop Museum
> 1525 Bernice St., Honolulu, HI 96817
> Ph: (808)848-4115, Fax: (808)847-8252
> email: deepreef at bishopmuseum.org
> http://hbs.bishopmuseum.org/staff/pylerichard.html

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu
