[tdwg-content] What is an Occurrence? [followup to "Wrong" RDF and What I learned... threads]

Fri Oct 15 07:50:09 CEST 2010

After the flurry of emails recently, I had an opportunity to carefully 
read all the way through the threads again, followed by enforced "think 
time" during my long commute.  I was actually pretty cheerful after that 
because I think that in essence, most of the conversation about what 
constitutes an Occurrence really boils down to the same thing.  So I 
have sat down and tried to summarize what seems to me to be a consensus 
about Occurrences.  To follow my points, please refer to the diagram at:
http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif

Consensus on relationships
1. The fundamental definition of an Occurrence involves evidence that a 
representative of a taxon occurred at a place and time. 
Note 1.A: For clarity, I have modified John's statement in his last 
email by replacing "taxon" with "representative of a taxon".  I'm 
considering a taxon to be an abstract concept that is applied to 
individuals or groups of organisms. 
Note 1.B: This definition is far more useful than the official 
definition of the class Occurrence ("The category of information 
pertaining to evidence of an occurrence..."), which is essentially circular. 
Note 1.C: This statement is extremely broad because the evidence could 
be of many sorts, the representative could range from a single 
individual to all organisms on the earth, the taxon could be anyone's 
definition at any taxonomic level, the place could range from a GPS 
point with uncertainty of less than 10 meters to the entire planet 
Earth, and the time could range from a shutter click of less than one 
second to 3.4 billion years.
2. The diagram is an attempt to summarize in pictorial form statements 
and relationships that have been described in the thread.  The taxon 
representative is recorded as existing at a particular time and place 
(the arrow) and the result is an Occurrence record.  That Occurrence 
record exists as metadata which may be associated with a token that can 
be used to voucher the fact that the taxon representative existed.  That 
token may be the organism itself (or a living part of it as in a twig 
for grafting), all or part of the organism in preserved form, an 
electronic representation such as an image or sound recording, and other 
kinds of things like tissue or DNA samples.  There may also be no token 
at all, in which case we call the Occurrence record an observation.  
Based on direct observation of the taxon representative, examination of 
one or more tokens, or both, some determiner asserts that a taxon 
concept applies to the taxon representative and as a result a scientific 
name can be used to "identify" the taxon representative.  (There may be 
a lot of other complicated stuff above the Identification box, but that 
will have to be filled in by the taxonomists.) 
Note 2.A: I have mapped onto this diagram the letters that John used in 
his last email to refer to entities that are involved in an Occurrence 
(T, E, L, O, and G).  I will beg the forgiveness of fossil people 
because I don't really know how the geological context fits in.  I'm 
assuming that it is a way of asserting time and location on a much 
broader scale than we do for extant organisms. 
Note 2.B: I have put a dotted line around the part of the diagram that I 
think includes all the things that people might consider part of the 
Occurrence itself.  I have left out "T" and the other parts related to 
identification because it seems to me that you can have an occurrence 
that you document which does not yet (and perhaps never will) have an 
identification.  The Occurrence still asserts that a taxon 
representative existed at a time and place; we just don't yet know what 
the taxon is. 
3. The red lines indicate the relationships that connect the various 
entities (I'm going to go ahead and call them resources).  Consistent 
with popular opinion, the Occurrence record is the center of the 
universe and most things are connected to it.
Note 3.A: I am sticking to my guns and refuse to connect the 
Identification directly to the Occurrence.  It is the taxon 
representative that is being identified, not the occurrence.  One can 
assert another sort of relationship between the identification and the 
occurrence if one wants to say that one consulted the occurrence 
metadata and token in order to decide about the identification, but it 
is not correct to say that the Identification identifies either the 
Occurrence metadata or the token (as Rich pointed out). 
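
One way to make the relationships in points 1-3 concrete is a small Python 
sketch.  It is only an illustration built from the text above, not from the 
diagram itself: the class and field names (Individual, Token, consulted, 
and so on) and all identifier values are assumptions chosen for the 
example, not Darwin Core definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Individual:
    """Taxon representative: one organism or a group assumed to share a taxon."""
    individual_id: str


@dataclass
class Token:
    """Voucher evidence, e.g. a specimen, image, sound recording, or DNA sample."""
    kind: str


@dataclass
class Occurrence:
    """Record that a taxon representative existed at a time and place."""
    occurrence_id: str
    individual: Individual
    event_id: str      # time (E)
    location_id: str   # place (L)
    tokens: List[Token] = field(default_factory=list)

    def is_observation(self) -> bool:
        # With no token at all, the record is an observation.
        return not self.tokens


@dataclass
class Identification:
    """A determiner's assertion that a taxon concept applies to an Individual."""
    identification_id: str
    individual: Individual   # identifies the representative, not the Occurrence
    taxon_concept_id: str
    scientific_name: str
    determiner: str
    # Occurrences whose metadata or tokens were examined (a separate relationship).
    consulted: List[Occurrence] = field(default_factory=list)


# Placeholder data: one organism, documented twice, identified once.
tree = Individual("ind-1")
photo = Occurrence("occ-1", tree, "evt-1", "loc-1", [Token("image")])
sighting = Occurrence("occ-2", tree, "evt-2", "loc-1")
det = Identification("ident-1", tree, "tc-1", "Quercus alba",
                     "A. Determiner", consulted=[photo])
```

Note that an Occurrence can exist with no Identification at all (Note 2.B), 
and that the Identification points at the Individual while only 
"consulting" the Occurrence (Note 3.A).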

OK, so that's step one - defining what is related to what.  If anyone 
disagrees with these relationships, please clarify or create your own 
diagram.

Complicating circumstances/caveats
1. It is noted and recognized that some users will not care to include 
all of these relationships in their models.  In the interest of 
simplification or "flattening" the relationships, they may wish to 
collapse some parts of this diagram (e.g. incorporate time and location 
metadata within the Occurrence metadata rather than considering them 
separate resources, applying scientific names directly to the taxon 
representatives without defining a taxon concept or recording the 
determination metadata, connecting identifications directly to the 
occurrence, etc.).  This doesn't mean that the relationships don't 
exist; it just means that some users don't care about them.
2. It is recognized that different users will be interested in or able 
to specify the various resources to differing degrees of precision.  
Examples: A photographer might record times to the nearest second, while 
a collector may only be interested in noting the date on which a specimen 
was collected.  A location may be specified to the precision of a GPS 
reading or be defined as some geographic or political subdivision.  The 
taxon representative may be an individual organism, a flock or clump, or 
some larger aggregation of taxon representatives.
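
The "flattening" described in caveat 1 can be sketched in Python as 
follows.  This is a hypothetical illustration: the flatten function and 
the sample values are made up, and the dictionary keys are borrowed from 
Darwin Core term names only for readability.

```python
def flatten(occurrence, events, locations, identifications):
    """Collapse linked resources into one flat row for a single occurrence."""
    row = {"occurrenceID": occurrence["occurrenceID"]}
    # Fold the separate Event and Location resources into the occurrence row.
    row.update(events[occurrence["eventID"]])
    row.update(locations[occurrence["locationID"]])
    # Attach the name directly; the determiner and taxon concept are dropped.
    matches = [i for i in identifications
               if i["individualID"] == occurrence["individualID"]]
    if matches:
        row["scientificName"] = matches[-1]["scientificName"]
    return row


# Placeholder data for illustration only.
events = {"evt-1": {"eventDate": "2010-10-15"}}
locations = {"loc-1": {"decimalLatitude": 36.14, "decimalLongitude": -86.80}}
occ = {"occurrenceID": "occ-1", "eventID": "evt-1",
       "locationID": "loc-1", "individualID": "ind-1"}
idents = [{"individualID": "ind-1", "scientificName": "Quercus alba",
           "identifiedBy": "A. Determiner"}]

row = flatten(occ, events, locations, idents)
```

The flat row is easier to handle, but the information about who made the 
identification and which individual it applied to is no longer 
recoverable from it, which is the trade-off the caveat describes.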

That's step two.  If I've missed any complications, please point them out.

My opinions about the implications of this diagram
1. The circle I've labeled as "taxon representative" is the resource 
type that I'm proposing to be represented by the class Individual.  You 
will note that in both the definition of dwc:individualID ("An 
identifier for an individual or named group of individual organisms...") 
and the proposed class definition ("The category of information 
pertaining to an individual or named group of individual organisms 
represented in an Occurrence"), groups of individual organisms are 
included.  Thus John's example of a fossil having myriad individuals, or 
Richard's examples of thousands of plankton, a large school of fish, 
herd of wildebeest, or flock of birds, could all be categorized as 
"Individual" under this definition if 
there is a reasonable expectation that all of the individuals in the 
group are members of the same taxon.  Perhaps there is a better name for 
this resource, but since dwc:individualID was already extant, I chose 
Individual as the class name for consistency with the pattern 
established with other classes and their associated xxxxID terms. 
2. Although in Note 1.C I have taken the ranges of the various 
resources to their logical extremes (as was done previously in the 
thread), I think that as a practical matter we can adopt guidelines to 
set reasonable values for the "normal" ranges of the resources.  One 
such guideline might be that we suggest a range that can accommodate 
about 95% of the user needs within the community (this came from Rich's 
comment about satisfying 95% of the user need with an establishmentMeans 
controlled vocabulary).  For example, it was suggested that the range 
for the location of an Occurrence could span the entire planet Earth.  
True enough, but virtually nobody would find such a span useful.  95% of 
users would probably find a range between a GPS reading with 10 meter 
precision and the extent of a county or province useful for recording 
the location of an Occurrence.  I can suggest similar "useful" ranges: 
one second to one day for an event time (excluding fossils), one 
individual organism to the number of organisms that would fit within a 
50 meter radius for an "individual", and taxon identified to family for 
plants and maybe mammals, genus for birds, and order for insects.  So 
framing the definition of an Occurrence in these terms it would be 
something like: "An occurrence involves evidence (consisting of a 
physical token, electronic record, or personal observation) that a 
representative (ranging from a single individual to the number that 
would fit on a football field) of a taxon (hopefully identified to some 
lower taxonomic level) occurred at a place (determined to a precision 
between that of a GPS reading and the size of a county/province) and 
time (spanning one second to one day)."  A few people might object to 
this level of restrictiveness, but I would guess that it would make 95% 
of us happy.
3. Apart from the "missing" class Individual, every resource type on 
this diagram except the "token" and the scientific name has a Darwin 
Core class.  Every resource type on the diagram except the "token" 
has a dwc:xxxxID term that can be used to refer to a GUID for the 
resource.  The implication of this is that any resource on this diagram 
except for the token and taxon representative (i.e. Individual) is ready 
to be represented in RDF by Darwin Core terms in the sense that the 
relationships (red lines) can be represented by the xxxxID terms and 
that the resources can be typed (via rdf:type) using Darwin Core classes.  
(Lacking a class for the scientific name doesn't seem like a big deal to 
me since the scientific name can be a string literal - but then I'm not 
a taxonomist.) 
4. OK, I've avoided it as long as I can, so I'm going to confess now to 
the RDF-phobes.  The red lines and shapes are something pretty close to 
an RDF graph.  What that means is that if the community can agree that 
this diagram correctly represents the relationships among the kinds of 
biodiversity resources that we care about, then the matter of providing 
guidelines on how to represent Darwin Core in RDF suddenly gets a lot 
simpler.  Just convert the "picture" of the RDF graph into XML format 
and we have a template.  Alright, that's an oversimplification, but I 
think it is essentially true because the most difficult part of 
achieving a consensus on RDF representations is to decide how we connect 
the resource types, not on the literals that we hang onto resources as 
properties. 
5. While I'm beating the RDF drum again, the importance of my opinion 
number 2 can be extended into the GUID adoption process.  In my comments 
to Kevin about the Beginner's Guide to Persistent Identifiers, I think I 
commented on the question of how one decides whether a GUID needs to be 
assigned to something or not.  I believe that the answer to that 
question boils down to this: we need a GUID for any resource that will 
be referenced by more than one other resource.  Do we need to be able to 
assign a GUID to Taxon concepts?  Yes, because it is likely that many 
identifications will want to reference a particular taxon concept.  Do 
we need to be able to assign a GUID to an Event?  Maybe or maybe not.  
If every occurrence has its own separate time recorded, then no GUID is 
needed because the time is just a part of every separate occurrence 
record.  If the event is defined to be a time range that represents a 
collecting trip, then there may be many Occurrences that are associated 
with that trip and all of them could reference the GUID for that event 
rather than repeating the event information for every Occurrence.  The 
point here is that every shape (class of resources) on this diagram at 
least has the POTENTIAL to be a node connecting multiple resources and 
therefore should have the capability of being assigned a GUID, having 
its own RDF record, and being appropriately typed (presumably by a DwC 
class).  So this is a final technical argument for why we need to have 
the DwC class Individual.  Whether people ultimately choose to 
assign GUIDs to particular resource types is their own choice, 
but they need to at least be ABLE to if they need that resource to serve 
as a node given the structure of their metadata. 
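
The rule of thumb in opinion 5 (a resource needs a GUID when more than 
one other resource references it) can be sketched with plain Python 
tuples standing in for RDF triples.  This is a rough illustration, not an 
RDF implementation: no RDF library is used, the needs_guid function is 
hypothetical, the identifiers are invented, and using the xxxxID terms as 
the linking predicates is an assumption that follows opinion 3.

```python
# (subject, predicate, object) triples; all identifiers are placeholders.
triples = [
    ("occ-1", "dwc:eventID", "evt-trip"),       # two occurrences share one
    ("occ-2", "dwc:eventID", "evt-trip"),       # collecting-trip event
    ("occ-1", "dwc:locationID", "loc-1"),
    ("occ-2", "dwc:locationID", "loc-2"),
    ("occ-1", "dwc:individualID", "ind-1"),
    ("occ-2", "dwc:individualID", "ind-1"),
    ("ident-1", "dwc:individualID", "ind-1"),   # identification -> individual
    ("ident-1", "dwc:taxonConceptID", "tc-1"),
]


def needs_guid(triples):
    """Return the objects referenced by more than one distinct subject."""
    referrers = {}
    for subject, _predicate, obj in triples:
        referrers.setdefault(obj, set()).add(subject)
    return sorted(obj for obj, subjects in referrers.items()
                  if len(subjects) > 1)


shared = needs_guid(triples)
```

In this toy data set the shared event and the Individual act as nodes, 
while each location is referenced only once and could be folded into its 
occurrence record.  With many identifications pointing at the same taxon 
concept, tc-1 would cross the threshold as well.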

We need to clarify how the "token" thing fits in, but I'm stopping there 
for now.  I would very much appreciate responses indicating that:

A. you agree with the diagram and connections (and consider this 
definition and diagram a consensus)
B. you disagree with the diagram (and articulate why)
C. you provide an alternative diagram or explanation of the 
relationships among the classes related to Occurrences.

Thanks for your patience with another tome.
Steve

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu