[tdwg-content] dwc:associatedOccurrences

Wed Aug 25 03:14:32 CEST 2010

As I was stuck in traffic this morning I was thinking about my response 
to Bob's comments.  In retrospect, I should have simply said that 
indicating that specimens are duplicates by giving their 
dwc:individualID properties the same URI is not just one option; 
it is the semantically correct thing to do. 

Assume that we are assembling a database of RDF triples about taxonomic 
names and their authors.  We discover URI#1 whose metadata asserts that 
a foaf:Person has the rdfs:label "L.". If we know of URI#2 whose 
metadata asserts that a foaf:Person has the rdfs:label "Carl Linnaeus", 
it would be correct to assert that URI#2 is owl:sameAs URI#1.  Anyone 
who was aware of this assertion through knowledge of our database and 
who trusted the veracity of the assertion would then know that both URIs 
referred to the same person because the labels "L." and "Carl Linnaeus" 
actually refer to the same person. 
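
As a minimal sketch of this reasoning (plain Python, no RDF library), the
triples can be held as tuples and owl:sameAs followed in both directions.
The example.org URIs are hypothetical stand-ins for URI#1 and URI#2, and
labels_for is an illustrative helper, not part of any standard API.

```python
# Hypothetical URIs standing in for "URI#1" and "URI#2" in the text.
URI_1 = "http://example.org/person/1"
URI_2 = "http://example.org/person/2"

triples = {
    (URI_1, "rdf:type", "foaf:Person"),
    (URI_1, "rdfs:label", "L."),
    (URI_2, "rdf:type", "foaf:Person"),
    (URI_2, "rdfs:label", "Carl Linnaeus"),
    # The assertion our database adds: both URIs denote one person.
    (URI_2, "owl:sameAs", URI_1),
}


def labels_for(uri, triples):
    """Collect rdfs:label values for a URI and everything it is owl:sameAs."""
    same = {uri}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p != "owl:sameAs":
                continue
            # owl:sameAs is symmetric, so follow it in both directions.
            if s in same and o not in same:
                same.add(o)
                changed = True
            elif o in same and s not in same:
                same.add(s)
                changed = True
    return sorted(o for s, p, o in triples if p == "rdfs:label" and s in same)


print(labels_for(URI_1, triples))
```

A client that trusts the owl:sameAs statement sees both labels when it
asks about either URI; a client that does not trust it simply ignores
that one triple.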

In contrast, assume we are assembling a database about specimens at 
various institutions.  We discover URI#3 for a dwc:Occurrence of 
dwc:basisOfRecord="PreservedSpecimen".  We realize by some means that 
this specimen is a "duplicate" of a dwc:Occurrence of 
dwc:basisOfRecord="PreservedSpecimen" having URI#4 and located in 
another institution.  Despite the colloquial use of the word 
"duplicate", it would not be correct to assert that URI#4 is owl:sameAs 
URI#3 because the resources represented by those two URIs are NOT the 
same thing.  They are different pieces of dead tissue in different jars 
or pasted to different pieces of paper.  If we think of what curators 
mean by "duplicate" it has exactly the meaning that both duplicates were 
collected from either the same individual organism (in the case of a 
large organism like a tree) or from the same small population of 
organisms (all members of the same species) such as a clump of grass, 
ants in the same colony, etc.  Given the definition of dwc:individualID 
as referring to "an individual or named group of individual 
organisms",  assigning the two Occurrences the same value for their 
dwc:individualID property semantically describes "duplicates" exactly.  
Curators might not like this way of describing "duplicates" because it's 
not the way they are used to talking about them, but in the Linked Data 
world it is our job to correctly describe relationships using existing 
predicates.  We don't make up a new term if there is already one 
available that will do the job.
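
A rough sketch of how "duplicates" fall out of shared dwc:individualID
values, assuming records are simple Python dicts keyed by Darwin Core
term names. The URIs (stand-ins for URI#3 and URI#4 plus one unrelated
sheet) and the duplicate_sets helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical records; all URIs are made up for illustration.
occurrences = [
    {"occurrenceID": "http://example.org/herbariumA/sheet/3",
     "basisOfRecord": "PreservedSpecimen",
     "individualID": "http://example.org/individual/tree-17"},
    {"occurrenceID": "http://example.org/herbariumB/sheet/4",
     "basisOfRecord": "PreservedSpecimen",
     "individualID": "http://example.org/individual/tree-17"},
    {"occurrenceID": "http://example.org/herbariumB/sheet/5",
     "basisOfRecord": "PreservedSpecimen",
     "individualID": "http://example.org/individual/grass-clump-2"},
]


def duplicate_sets(records):
    """Group occurrenceIDs by individualID; keep groups with 2+ members."""
    by_individual = defaultdict(list)
    for rec in records:
        by_individual[rec["individualID"]].append(rec["occurrenceID"])
    return {ind: ids for ind, ids in by_individual.items() if len(ids) > 1}


for individual, sheets in duplicate_sets(occurrences).items():
    print(individual, sheets)
```

The two sheets keep their own distinct occurrence URIs; nothing here
says they are owl:sameAs each other, only that they document the same
individual.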

This relationship may seem more apparent if one considers an example 
that involves dwc:Occurrences of a different type.  I enjoyed looking at 
http://www.whaleshark.org/ yesterday.  In this library, users report 
dwc:Occurrences of the proposed type 
dwc:basisOfRecord="DigitalStillImage".  The database assigns identifiers 
(not URIs but they could be someday) to individual whale sharks and 
associates the dwc:Occurrences with the Individuals (i.e. the equivalent 
of providing a value for dwc:individualID for the dwc:Occurrence).  By 
using pattern recognition software, the project matches spot patterns 
and with luck can reach the conclusion that an Individual represented in 
a particular dwc:Occurrence is the same as the Individual documented by 
another dwc:Occurrence.  It is abundantly clear in this circumstance 
that the correct thing to do is to assert that the Individual 
represented in the first dwc:Occurrence is owl:sameAs the Individual 
represented in the second dwc:Occurrence if the Individuals had 
previously been assigned different identifiers, or just to assign the 
second dwc:Occurrence the same value for dwc:individualID as the first 
dwc:Occurrence if a second identifier hadn't already been assigned to 
the Individual. 
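
The two cases just described can be sketched as follows. This is only
an illustration under assumptions: occurrences are dicts, the first one
already has an individualID, and a set of identifier pairs stands in
for owl:sameAs statements between Individuals. The identifiers and the
record_match function are hypothetical and are not how whaleshark.org
actually stores its data.

```python
def record_match(first, second, same_as):
    """Record that two occurrences document the same Individual.

    first, second: dicts; first must already have an "individualID".
    same_as: set of (id, id) pairs standing in for owl:sameAs statements.
    """
    first_id = first["individualID"]
    second_id = second.get("individualID")
    if second_id is None:
        # No second identifier exists yet: just reuse the first one.
        second["individualID"] = first_id
    elif second_id != first_id:
        # Two identifiers were minted for one shark: link them instead
        # of rewriting either record.
        same_as.add((first_id, second_id))


same_as = set()
photo_1 = {"occurrenceID": "photo-1", "individualID": "shark-A"}
photo_2 = {"occurrenceID": "photo-2"}
photo_3 = {"occurrenceID": "photo-3", "individualID": "shark-B"}

record_match(photo_1, photo_2, same_as)  # assigns "shark-A" to photo_2
record_match(photo_1, photo_3, same_as)  # records shark-A sameAs shark-B
print(photo_2["individualID"], same_as)
```

Note that the sameAs link is made between the Individuals, never
between the occurrences (the photographs) themselves.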

The point here is that from a semantic point of view, there is no 
difference in what is being done in the case of linking duplicate 
specimens in different herbaria and in linking images that were taken of 
the same whale shark.  In both cases, two dwc:Occurrences are related in 
a certain way because they have the same value for dwc:individualID.  
Multiple observations/mark recapture might be used to establish where 
individuals move or how they behave, while recognition of duplicate 
specimens might be used to update identifications, track relationships 
among herbaria, or anything else we want.  We do not have to imply 
some particular 
fitness of use when we assert that relationship and we should not invent 
terms for a particular fitness of use when we have generic terms that 
already describe the relationship. 
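
One way to picture this: the relationship is stored once, generically,
and any particular use is expressed as a filter at query time. The
sketch below assumes dict records with Darwin Core term names as keys;
the URIs and the sharing_individual helper are hypothetical.

```python
TREE = "http://example.org/individual/tree-17"
SHARK = "http://example.org/individual/shark-A"

# Hypothetical records of mixed basisOfRecord values.
records = [
    {"occurrenceID": "http://example.org/sheet/3",
     "basisOfRecord": "PreservedSpecimen", "individualID": TREE},
    {"occurrenceID": "http://example.org/sheet/4",
     "basisOfRecord": "PreservedSpecimen", "individualID": TREE},
    {"occurrenceID": "http://example.org/image/9",
     "basisOfRecord": "DigitalStillImage", "individualID": TREE},
    {"occurrenceID": "http://example.org/image/10",
     "basisOfRecord": "DigitalStillImage", "individualID": SHARK},
]


def sharing_individual(records, individual_id, basis_of_record=None):
    """Return occurrenceIDs of records with the given individualID.

    basis_of_record optionally narrows the result for one specific use;
    the stored relationship itself carries no such restriction.
    """
    return [
        r["occurrenceID"]
        for r in records
        if r["individualID"] == individual_id
        and (basis_of_record is None or r["basisOfRecord"] == basis_of_record)
    ]


# Every occurrence of the tree, whatever its type.
print(sharing_individual(records, TREE))
# Only the herbarium sheets, e.g. when tracking duplicate specimens.
print(sharing_individual(records, TREE, "PreservedSpecimen"))
```

The same function serves the herbarium case and the image case; only
the optional filter differs.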

Thus I say that it is wrong to invent some other term to represent a 
relationship that can be clearly and unambiguously expressed using 
existing terms.  One of the beauties of the Darwin Core standard is that 
it simplifies the vocabulary needed to express equivalent relationships 
by having a generic class (dwc:Occurrence) that can represent kinds of 
things that are distinguished by typing them with values for 
dwc:basisOfRecord such as PreservedSpecimen, HumanObservation, or 
DigitalStillImage.  So let's not move backwards by proposing to invent 
some new terms that will only apply to herbarium specimens.

Steve

Bob Morris wrote:
> Good idea, but it suffers from the same fate as might
> associatedOccurrences  (not previously mentioned because I was after
> some clarification in principle, with the herbarium duplicate sheets
> only one current case of interest): I need to follow whatever the
> community practice is of regarding a sheet as part of a duplicate set
> distributed by the original collector.  I'm told by the people at the
> Harvard University Herbaria that "duplicate" usually, but not always,
> means from the same organism and same collection event---occasionally
> people used to put several organisms on the same sheet, raising the
> possibility that they are not even the same taxon. Worse,  the
> different parts of the same organism might be catalogued as separate
> specimens. In this case, an assertion that they are from the same
> individual might be true and understandable, but the utility of that
> assertion depends on your purpose. Consider a use case in which one
> set of traditional duplicates all have a determination that is out of
> date, but another specimen---say your acorn collected later from the
> same tree---has a current determination.  For purposes of notifying
> duplicate holders that a new determination has been made to the
> original, the later acorn may not be interesting. This means that for
> this use, a distributed query of the form "find all records with the
> same dwc:individualID" is not as useful as "find all records with the
> same dwc:eventID".
>
> Also, as Mark writes, it doesn't address any other associatedOccurrences.
>
> More generally, we are working on annotations of data records.
> Probably the real issue here is that associatedOccurrences is an
> assertion about organisms, and we are making assertions about
> occurrence data.
>
> On Mon, Aug 23, 2010 at 3:07 PM, Steve Baskauf
> <steve.baskauf at vanderbilt.edu> wrote:
>   
>> Bob,
>> It seems to me that the most semantically clear way to indicate in a
>> machine-readable way that two herbarium sheets are duplicates would be to
>> assert that they have the same dwc:individualID.  individualID is defined as
>> "An identifier for an individual or named group of individual organisms
>> represented in the Occurrence" so asserting that two occurrences represent
>> the same individual or named group of individual organisms pretty much
>> exactly describes what duplicate specimens are.  I use this same approach to
>> indicate that
>> http://bioimages.vanderbilt.edu/baskauf/67307
>> is an image of an acorn from the same tree:
>> http://bioimages.vanderbilt.edu/ind-baskauf/67304
>> as the bark image
>> http://bioimages.vanderbilt.edu/baskauf/67312
>> I won't say more here as I have written more extensively on this approach in
>> Biodiversity Informatics 7:17-44
>> (https://journals.ku.edu/index.php/jbi/article/view/3664).  You can also
>> look at the RDF associated with those GUIDs to see what I mean.  Solving
>> this problem is also one of the reasons I have proposed adding the class
>> Individual to DwC (i.e. so that the individuals that are the object of
>> dwc:individualID can be rdf:type'd using a well-known vocabulary and
>> therefore be "understood" by linked data clients).
>>
>> Steve
>>
>> Bob Morris wrote:
>>
>> http://rs.tdwg.org/dwc/terms/index.htm#associatedOccurrences   carries
>> this description:
>>
>> associatedOccurrences
>> Identifier:	http://rs.tdwg.org/dwc/terms/associatedOccurrences
>> Class:	http://rs.tdwg.org/dwc/terms/Occurrence
>> Definition:	A list (concatenated and separated) of identifiers of
>> other Occurrence records and their associations to this Occurrence.
>> Comment:	Example: "sibling of FMNH:Mammal:1234; sibling of
>> FMNH:Mammal:1235". For discussion see
>> http://code.google.com/p/darwincore/wiki/Occurrence
>> Details:	associatedOccurrences
>>
>> My questions:
>> a.  Are the names of the associations, and/or the syntax of the value
>> meant to be community defined?
>> b. If no to a., where are those definitions? If yes, have any
>> communities defined any names and syntax? I am especially interested
>> in "duplicate of" in the case of herbarium sheets.
>> c. (May share an answer with b.) Is there any use being made by anyone
>> in which associatedOccurrences is designed to have machine-readable
>> values?  If yes, where?
>>
>> Thanks
>> Bob
>>
>> --
>> Steven J. Baskauf, Ph.D., Senior Lecturer
>> Vanderbilt University Dept. of Biological Sciences
>>
>> postal mail address:
>> VU Station B 351634
>> Nashville, TN  37235-1634,  U.S.A.
>>
>> delivery address:
>> 2125 Stevenson Center
>> 1161 21st Ave., S.
>> Nashville, TN 37235
>>
>> office: 2128 Stevenson Center
>> phone: (615) 343-4582,  fax: (615) 343-6707
>> http://bioimages.vanderbilt.edu
>>

