On Fri, Oct 15, 2010 at 10:45 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:

As a background to this post, I want to reference a post by Bob called "SubclassOrNot". I discovered this page on an early foray into the TDWG website labyrinth and it has been very influential on my thinking since then. The idea Bob discusses is central to what I'm writing below so if you haven't read it you might want to do so first. You can probably skip the "OWL Inference" section and still get the point which is described in the first two sections of his post. The URL for the page is http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot .

To preface what I'm going to say below, I want to put Darwin Core Occurrences in the context of what Bob wrote. In my mind, one of the hallmarks of the Darwin Core standard and one thing that makes it a great improvement over previous versions is that the decision was made to use what Bob called the "has a" approach rather than the "is a" approach. In particular, the Darwin Core standard has a single class called dwc:Occurrence rather than subclasses called "Specimen", "Observation", and other possible things. The way that we differentiate among different kinds of Occurrences is by using the DwC types which are the controlled values for the term dwc:basisOfRecord. Thus we say an Occurrence "has a" basisOfRecord=PreservedSpecimen rather than saying it "is a" PreservedSpecimen. We say an Occurrence "has a" basisOfRecord=HumanObservation rather than saying it "is a" HumanObservation. This approach has greatly reduced the number of different terms in the standard since we don't have to have separate "ObservedBy" and "CollectedBy" terms, but rather can just have a single "RecordedBy" term that applies to both specimens and observations. The same thing applies to many other things, like eventDate rather than DateCollected and DateObserved, locality rather than collectionLocality and observationLocality, etc. With the ratification of Darwin Core, this decision is now a fait accompli and not a subject of discussion or something optional for users of the standard. It also seems clear that new terms can be added to the DwC types as necessary, and these would then be valid controlled values for basisOfRecord.
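
As a rough illustration of the "has a" pattern described above, here is a minimal Python sketch. It is not part of the Darwin Core standard: the class layout, the validation step, and the sample values (collector names, date, locality) are all invented for illustration, and the set of controlled values lists only the ones named in this post.

```python
from dataclasses import dataclass, fields

# Controlled values for basisOfRecord that are mentioned in this post.
# The real DwC type vocabulary may contain others (assumption).
BASIS_OF_RECORD = {"PreservedSpecimen", "LivingSpecimen", "HumanObservation"}


@dataclass
class Occurrence:
    """One class for every kind of Occurrence (the "has a" approach)."""
    basisOfRecord: str  # the tag that says what kind of Occurrence this is
    recordedBy: str     # one term instead of CollectedBy / ObservedBy
    eventDate: str      # one term instead of DateCollected / DateObserved
    locality: str       # one term instead of collection/observation locality

    def __post_init__(self):
        # Reject tags that are not in the controlled vocabulary.
        if self.basisOfRecord not in BASIS_OF_RECORD:
            raise ValueError(f"unknown basisOfRecord: {self.basisOfRecord}")


# Invented sample records: a specimen and an observation use the same terms.
specimen = Occurrence("PreservedSpecimen", "A. Collector", "2010-10-15", "Nashville")
sighting = Occurrence("HumanObservation", "B. Observer", "2010-10-15", "Nashville")

# Both records expose exactly the same field names; only the tag differs.
field_names = [f.name for f in fields(Occurrence)]
```

With subclasses ("is a"), each subclass would tend to grow its own terms; here a new kind of Occurrence only needs a new entry in the controlled vocabulary.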

Since the adoption of the DwC standard, the approach to Occurrences has been what I would describe as "I know an Occurrence when I see one". I consider this a pretty sloppy practice, and as I indicated in my post last night, I think there is enough consensus about what an Occurrence is that we can come up with a better definition than "an occurrence is the category of information pertaining to evidence of an occurrence...". Another part of what I would characterize as sloppiness is the lack of a clear definition of what exactly basisOfRecord means. When I wrote my attempt at summarizing consensus last night, I dodged the question about what I called the "token". This "thing" has been called various names. In the previous discussion on the list, it was sometimes called "the evidence" of the occurrence. In the past I have called it "a representation" - however, I now think the term "token" is better because "representation" has a different technical meaning in the context of content negotiation. When we type an Occurrence by saying it has a basisOfRecord=PreservedSpecimen, we are saying that this Occurrence has as supporting evidence, or as a "token" if you prefer, all or part of the dead remains of the organism (i.e. what I'm calling "the Individual") that was being documented by the Occurrence. When we type an Occurrence by saying it has a basisOfRecord=LivingSpecimen, we are saying that this Occurrence has as a "token" the entire organism that was being documented (or some vegetative part of the live organism that was propagated). When we type an Occurrence by saying it has a basisOfRecord=HumanObservation, we are saying that the Occurrence has no supporting evidence other than the reputation of the observer to accurately record the metadata about the Occurrence. In other words, we "tag" an instance of a core class (to use Bob's words), Occurrence, by telling a metadata consumer what kind of token we are using as evidence of the Occurrence.
A fundamental part of creating a clear definition of what an Occurrence is, is to define exactly what we are including in the concept of Occurrence. One possibility is to (1) say that the two boxes at the right side of the diagram at http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif are fused and that both the Occurrence metadata and its associated token are what we consider to be "the Occurrence". Another approach (2) would be to say that the actual Occurrence as an entity is only the metadata part and that the token is a separate thing. A third approach is to say (3) that everything within the blue dotted lines is considered a part of the Occurrence (i.e. the metadata, the token, the event, and the locality). I don't think in an absolute sense, any one of these approaches is "right". The problem is that these approaches are used inconsistently, sometimes even by the same person, depending on the basisOfRecord. Differences in ways of thinking about this issue are part of why people aren't understanding the way other people are approaching the structuring of metadata. I have tried to consistently take the approach (1) that the two boxes on the right are fused, i.e. that the Occurrence metadata and the token should both be considered part of the entity that we call "an Occurrence". I think this is why Rich was confused in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001666.html when I said that it was "wrong" to assert that a scientific name is a property of an Occurrence - obviously it is silly to say that the token (photons on a film, sound patterns in a digital file) has a scientific name. Yet that is exactly what people do routinely when the token is a branch cut off a tree and glued to a piece of paper. They say that they are "identifying a specimen". What I am asking (actually demanding) is that the TDWG community get its act together and come to some consistency on this.
If we are going to take the approach (2), then we need to take specimens off their pedestal and treat them like we do any other token that we are using as evidence that an Occurrence happened. If we are going to do what was suggested for the BioBlitz in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001603.html, i.e. to call Occurrences "observations" and then link the tokens to them by associatedMedia, ResourceRelationship, or some other means (approach 2) then do it consistently for every kind of token, including specimens, and don't single out media tokens for punishment.
I have in a sense "thrown down the gauntlet" on this issue by proposing that DigitalStillImage be added as a DwC type and as a controlled value for basisOfRecord (http://code.google.com/p/darwincore/issues/detail?id=68). I know what some people are going to say in response to this proposal. "Why do you need to have 'DigitalStillImage' as a value for basisOfRecord when you can just say that the resource's dcterms:type=StillImage?" The answer goes back to Bob's point. If we are going to go the "has a" path (which we already have in DwC for Occurrences) rather than subclassing everything, then we need to provide an appropriate value for the "tag" for any type of resource that a reasonable number of users will want to use as a token. I think it is clear from this and other Bioblitzes, my work in Bioimages, the whale tracking project, and many other examples, that there are plenty of people who are already using DigitalStillImages as tokens and we all need a controlled value to use for basisOfRecord.
The other thing that we accomplish when we type an Occurrence by its basisOfRecord is to tell a consumer what kind of metadata to expect to get about the token in addition to the generic metadata that is provided for all Occurrences. Thus for a LivingSpecimen we expect to be told what zoo, botanical garden, bacterial collection, etc. contains the specimen. For a PreservedSpecimen we expect to be told the preparation type, the location of the repository, etc. For a DigitalStillImage we expect to be told the file type, accessURL, etc. Simply providing a value for dcterms:type=StillImage doesn't indicate whether the image is a physical one (i.e. on film) or a digital one. It is also unreasonable to expect a client to have to be checking two different terms (basisOfRecord and dcterms:type) to find out what they could learn from one (basisOfRecord). Of course it would be advisable to provide a value for dcterms:type as well for clients outside the biodiversity community who may not "understand" what basisOfRecord means.
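
To make the "what metadata should a consumer expect" idea concrete, here is a small Python sketch. The mapping and the field names (holdingInstitution, preparationType, repositoryLocation, fileType, accessURL) are hypothetical labels loosely based on the examples in the paragraph above, not actual Darwin Core terms, and DigitalStillImage is only a proposed value.

```python
# Hypothetical mapping from basisOfRecord to the token metadata that a
# consumer might expect. Field names are illustrative, not DwC terms.
EXPECTED_TOKEN_METADATA = {
    "LivingSpecimen": ["holdingInstitution"],
    "PreservedSpecimen": ["preparationType", "repositoryLocation"],
    "DigitalStillImage": ["fileType", "accessURL"],  # proposed value (issue 68)
    "HumanObservation": [],                          # no token, nothing extra
}


def missing_token_metadata(record: dict) -> list:
    """Return the expected token fields that are absent from the record."""
    expected = EXPECTED_TOKEN_METADATA.get(record.get("basisOfRecord"), [])
    return [name for name in expected if name not in record]


# Invented sample record: an image-backed Occurrence that lacks an accessURL.
image_record = {"basisOfRecord": "DigitalStillImage", "fileType": "image/jpeg"}
missing = missing_token_metadata(image_record)
```

A client only has to look at one term, basisOfRecord, to know which extra fields to ask for.
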
I hate to keep bringing my posts back to the RDF issue, but thinking about how one would write RDF forces clear thinking about how metadata should be structured. If we intend to separate tokens as entities from their associated Occurrence metadata, i.e. approach (2), then we open up a whole other can of worms. To associate the occurrence resources (i.e. the metadata) with the "different" resource (i.e. the token), we will probably have to be able to create URIs for the tokens and separate RDF metadata blocks which will have to be rdf:type'd. What are we going to use for that rdf:type - create another Darwin Core class? I simply don't think that is a complicated road that we want to travel. It would be far easier to just say that every Occurrence has a one-to-one relationship with its token (which could be "the empty set" for observations). This would not work for people who want to hang multiple tokens on a single observation event, but I think that itself is a bad idea because it makes it even harder to have "flat" occurrence datasets. Just say that every time we collect a different token (or make an observation that has no token), it is a new Occurrence record. Realistically, a single collector can't actually take a picture of a plant at the same time he or she collects it for a specimen anyway. Those really should be considered two different events because they happen at different times.
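
The one-token-per-Occurrence rule suggested above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the "token" and "eventTime" keys, and all identifiers and timestamps are invented, and DigitalStillImage is the proposed rather than an accepted basisOfRecord value.

```python
def flat_occurrences(individual_id, recorded_by, captures):
    """Emit one flat Occurrence record per token.

    captures is a list of (basisOfRecord, token_reference, event_time)
    tuples; token_reference is None for an observation with no token.
    """
    return [
        {
            "individualID": individual_id,
            "recordedBy": recorded_by,
            "basisOfRecord": basis,
            "token": token,
            "eventTime": when,
        }
        for basis, token, when in captures
    ]


# Invented example: the same plant is photographed, then collected a few
# minutes later, then simply seen again with no token at all.
records = flat_occurrences(
    "ind-1",
    "A. Collector",
    [
        ("DigitalStillImage", "photo-001.jpg", "2010-10-15T10:00:00"),
        ("PreservedSpecimen", "sheet-001", "2010-10-15T10:05:00"),
        ("HumanObservation", None, "2010-10-16T09:00:00"),
    ],
)
```

Each record stays flat (no nested list of tokens), and the records remain linked through the shared individualID.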

OK, enough said. Consider this my defense of my proposal "issue 68" to add DigitalStillImage. I would urge the powers that be to respond to the issues that I've raised here before having any kind of "vote" (or whatever is ultimately going to happen when there is an up or down decision about the proposal).

Steve

Steve Baskauf wrote:

After the flurry of emails recently, I had an opportunity to carefully
read all the way through the threads again, followed by enforced "think
time" during my long commute. I was actually pretty cheerful after that
because I think that in essence, most of the conversation about what
constitutes an Occurrence really boils down to the same thing. So I
have sat down and tried to summarize what seems to me to be a consensus
about Occurrences. To follow my points, please refer to the diagram at:
http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif

Consensus on relationships
1. The fundamental definition of an Occurrence involves evidence that a
representative of a taxon occurred at a place and time.
Note 1.A: For clarity, I have modified John's statement in his last
email by replacing "taxon" with "representative of a taxon". I'm
considering a taxon to be an abstract concept that is applied to
individuals or groups of organisms.
Note 1.B: This definition is far more useful than the official
definition of the class Occurrence "The category of information
pertaining to evidence of an occurrence..." which is essentially circular.
Note 1.C: This statement is extremely broad because the evidence could
be of many sorts, the representative could range from a single
individual to all organisms on the earth, the taxon could be anyone's
definition at any taxonomic level, the place could range from a GPS
point with uncertainty of less than 10 meters to the entire planet
earth, and the time could range from a shutter click of less than one
second to 3.4 billion years.
2. The diagram is an attempt to summarize in pictorial form statements
and relationships that have been described in the thread. The taxon
representative is recorded as existing at a particular time and place
(the arrow) and the result is an Occurrence record. That Occurrence
record exists as metadata which may be associated with a token that can
be used to voucher the fact that the taxon representative existed. That
token may be the organism itself (or a living part of it as in a twig
for grafting), all or part of the organism in preserved form, an
electronic representation such as an image or sound recording, and other
kinds of things like tissue or DNA samples. There may also be no token
at all, in which case we call the Occurrence record an observation.
Based on direct observation of the taxon representative, examination of
one or more tokens, or both, some determiner asserts that a taxon
concept applies to the taxon representative and as a result a scientific
name can be used to "identify" the taxon representative. (There may be
a lot of other complicated stuff above the Identification box, but that
will have to be filled in by the taxonomists.)
Note 2.A: I have mapped onto this diagram the letters that John used in
his last email to refer to entities that are involved in an Occurrence
(T, E, L, O, and G). I will beg the forgiveness of fossil people
because I don't really know how the geological context fits in. I'm
assuming that it is a way of asserting time and location on a much
broader scale than we do for extant organisms.
Note 2.B: I have put a dotted line around the part of the diagram that I
think includes all the things that people might consider part of the
Occurrence itself. I have left out "T" and the other parts related to
identification because it seems to me that you can have an occurrence
that you document which does not yet (and perhaps never will) have an
identification. The Occurrence still asserts that a taxon
representative existed at a time and place; we just don't yet know what
the taxon is.
3. The red lines indicate the relationships that connect the various
entities (I'm going to go ahead and call them resources). Consistent
with popular opinion, the Occurrence record is the center of the
universe and most things are connected to it.
Note 3.A: I am sticking to my guns and refuse to connect the
Identification directly to the Occurrence. It is the taxon
representative that is being identified, not the occurrence. One can
assert another sort of relationship between the identification and the
occurrence if one wants to say that one consulted the occurrence
metadata and token in order to decide about the identification, but it
is not correct to say that the Identification identifies either the
Occurrence metadata or the token (as Rich pointed out).
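
The relationships in points 1 to 3 can be written down as a small data model. The following Python sketch is one possible reading of the diagram, not an agreed schema: the class layout is an assumption, the identifiers and the scientific name are invented sample values, and the taxon concept layer above Identification is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Individual:            # the "taxon representative"
    individualID: str


@dataclass
class Event:
    eventID: str
    eventDate: str


@dataclass
class Location:
    locationID: str
    locality: str


@dataclass
class Occurrence:
    occurrenceID: str
    individual: Individual
    event: Event
    location: Location
    token: Optional[str] = None   # None: an observation with no token


@dataclass
class Identification:
    identificationID: str
    individual: Individual        # points at the Individual, not the Occurrence
    scientificName: str


def names_for(occurrence, identifications):
    """Scientific names reached through the Occurrence's Individual."""
    return [i.scientificName for i in identifications
            if i.individual is occurrence.individual]


tree = Individual("ind-1")
photo = Occurrence("occ-1", tree, Event("ev-1", "2010-10-14"),
                   Location("loc-1", "roadside"), token="photo-001.jpg")
sighting = Occurrence("occ-2", tree, Event("ev-2", "2010-10-15"),
                      Location("loc-1", "roadside"))
idents = [Identification("id-1", tree, "Quercus alba")]
```

Because the Identification hangs on the Individual, a single determination applies to both Occurrences, and an Occurrence with no Identification at all is still a complete record.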

OK, so that's step one - defining what is related to what. If anyone
disagrees with these relationships, please clarify or create your own
diagram.

Complicating circumstances/caveats
1. It is noted and recognized that some users will not care to include
all of these relationships in their models. In the interest of
simplification or "flattening" the relationships, they may wish to
collapse some parts of this diagram (e.g. incorporate time and location
metadata within the Occurrence metadata rather than considering them
separate resources, applying scientific names directly to the taxon
representatives without defining a taxon concept or recording the
determination metadata, connecting identifications directly to the
occurrence, etc.). This doesn't mean that the relationships don't
exist, it just means that some users don't care about them.
2. It is recognized that different users will be interested in or able
to specify the various resources to differing degrees of precision.
Examples: A photographer might record times to the nearest second, a
collector may only be interested in noting the date on which a specimen
was collected. A location may be specified to the precision of a GPS
reading or be defined as some geographic or political subdivision. The
taxon representative may be an individual organism, a flock or clump, or
some larger aggregation of taxon representatives.
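
The "flattening" mentioned in caveat 1 can be illustrated with a short Python sketch. The nested record layout and the key names "event" and "location" are assumptions made for this example; the sample values are invented.

```python
def flatten(occurrence: dict) -> dict:
    """Collapse nested event and location resources into one flat record."""
    flat = {key: value for key, value in occurrence.items()
            if key not in ("event", "location")}
    flat.update(occurrence.get("event", {}))
    flat.update(occurrence.get("location", {}))
    return flat


# Invented nested record in which Event and Location are separate resources.
nested = {
    "occurrenceID": "occ-1",
    "basisOfRecord": "HumanObservation",
    "event": {"eventDate": "2010-10-15"},
    "location": {"locality": "roadside", "county": "Davidson"},
}
flat = flatten(nested)
```

The flat record carries the same information; it simply no longer says that the event and location could be shared with other Occurrences.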

That's step two. If I've missed any complications, please point them out.

My opinions about the implications of this diagram
1. The circle I've labeled as "taxon representative" is the resource
type that I'm proposing to be represented by the class Individual. You
will note that in both the definition of dwc:individualID ("An
identifier for an individual or named group of individual organisms...")
and the proposed class definition ("The category of information
pertaining to an individual or named group of individual organisms
represented in an Occurrence"), groups of individual organisms are
included. Thus John's example of a fossil having myriad individuals, or
Richard's examples of thousands of plankton, a large school of fish,
herd of wildebeest, flock of birds, could all be categorized as
"Individual" under this definition if
there is a reasonable expectation that all of the individuals in the
group are members of the same taxon. Perhaps there is a better name for
this resource, but since dwc:individualID was already extant, I chose
Individual as the class name for consistency with the pattern
established with other classes and their associated xxxxID terms.
2. Although in Note 1.C I have taken the ranges of the various
resources to their logical extreme (as was done previously in the
thread), I think that as a practical matter we can adopt guidelines to
set reasonable values for the "normal" ranges of the resources. One
such guideline might be that we suggest a range that can accommodate
about 95% of the user needs within the community (this came from Rich's
comment about satisfying 95% of the user need with an establishmentMeans
controlled vocabulary). For example, it was suggested that the range
for the location of an Occurrence could span the entire planet Earth.
True enough, but virtually nobody would find such a span useful. 95% of
users would probably find a range between a GPS reading with 10 meter
precision and the extent of a county or province useful for recording
the location of an Occurrence. I can suggest similar "useful" ranges:
one second to one day for an event time (excluding fossils), one
individual organism to the number of organisms that would fit within a
50 meter radius for an "individual", and taxon identified to family for
plants and maybe mammals, genus for birds, and order for insects. So
framing the definition of an Occurrence in these terms it would be
something like: "An occurrence involves evidence (consisting of a
physical token, electronic record, or personal observation) that a
representative (ranging from a single individual to the number that
would fit on a football field) of a taxon (hopefully identified to some
lower taxonomic level) occurred at a place (determined to a precision
between that of a GPS reading and the size of a county/province) and
time (spanning one second to one day)." A few people might object to
this level of restrictiveness, but I would guess that it would make 95%
of us happy.
3. Apart from the "missing" class Individual, the "token", and the
scientific name, every resource type on this diagram has a
Darwin Core class. Every resource type on the diagram except for "token"
has a dwc:xxxxID term that can be used to refer to a GUID for the
resource. The implication of this is that any resource on this diagram
except for the token and taxon representative (i.e. Individual) is ready
to be represented in RDF by Darwin Core terms in the sense that the
relationships (red lines) can be represented by the xxxxID terms and
that the resources can be rdf:type'd using Darwin Core classes.
(Lacking a class for the scientific name doesn't seem like a big deal to
me since the scientific name can be a string literal - but then I'm not
a taxonomist.)
4. OK, I've avoided it as long as I can, so I'm going to confess now to
the RDF-phobes. The red lines and shapes are something pretty close to
an RDF graph. What that means is that if the community can agree that
this diagram correctly represents the relationships among the kinds of
biodiversity resources that we care about, then the matter of providing
guidelines on how to represent Darwin Core in RDF suddenly gets a lot
simpler. Just convert the "picture" of the RDF graph into XML format
and we have a template. Alright, that's an oversimplification, but I
think it is essentially true because the most difficult part of
achieving a consensus on RDF representations is to decide how we connect
the resource types, not on the literals that we hang onto resources as
properties.
5. While I'm beating the RDF drum again, the importance of my opinion
number 2 can be extended into the GUID adoption process. In my comments
to Kevin about the Beginner's Guide to Persistent Identifiers, I think I
commented on the question of how one decides whether a GUID needs to be
assigned to something or not. I believe that the answer to that
question boils down to this: we need a GUID for any resource that will
be referenced by more than one other resource. Do we need to be able to
assign a GUID to Taxon concepts? Yes, because it is likely that many
identifications will want to reference a particular taxon concept. Do
we need to be able to assign a GUID to an Event? Maybe or maybe not.
If every occurrence has its own separate time recorded, then no GUID is
needed because the time is just a part of every separate occurrence
record. If the event is defined to be a time range that represents a
collecting trip, then there may be many Occurrences that are associated
with that trip and all of them could reference the GUID for that event
rather than repeating the event information for every Occurrence. The
point here is that every shape (class of resources) on this diagram at
least has the POTENTIAL to be a node connecting multiple resources and
therefore should have the capability of being assigned a GUID, having
its own RDF record, and being appropriately typed (presumably by a DwC
class). So this is a final technical argument for why we need to have
the DwC class Individual. Whether or not people ultimately choose to
assign GUIDs to particular resource types or not is their own choice,
but they need to at least be ABLE to if they need that resource to serve
as a node given the structure of their metadata.
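
Opinions 4 and 5 can be combined in a short Python sketch: the red lines become (subject, predicate, object) triples built from the xxxxID terms, and the rule of thumb "a resource referenced by more than one other resource needs a GUID" becomes a simple count. This uses plain tuples rather than a real RDF library, and every identifier below is invented.

```python
# (subject, predicate, object) triples standing in for the red lines.
# All identifiers are invented for illustration.
triples = [
    ("occ-1", "dwc:eventID", "event-1"),
    ("occ-2", "dwc:eventID", "event-1"),       # same collecting trip
    ("occ-1", "dwc:locationID", "loc-1"),
    ("occ-2", "dwc:locationID", "loc-2"),
    ("occ-1", "dwc:individualID", "ind-1"),
    ("occ-2", "dwc:individualID", "ind-1"),    # same organism documented twice
]


def needs_guid(triples):
    """Resources that are referenced by more than one distinct subject."""
    referrers = {}
    for subject, _predicate, obj in triples:
        referrers.setdefault(obj, set()).add(subject)
    return sorted(obj for obj, subjects in referrers.items()
                  if len(subjects) > 1)


shared = needs_guid(triples)   # ['event-1', 'ind-1']
```

In this invented dataset the event and the individual act as nodes shared by two Occurrences, while each location is used only once and could just as well be folded into its Occurrence record.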

We need to clarify how the "token" thing fits in, but I'm stopping there
for now. I would very much appreciate responses indicating that:

A. you agree with the diagram and connections (and consider this
definition and diagram a consensus)
B. you disagree with the diagram (and articulate why)
C. you provide an alternative diagram or explanation of the
relationships among the classes related to Occurrences.

Thanks for your patience with another tome.
Steve

--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu

_______________________________________________
tdwg-content mailing list
tdwg-content@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content

--
----------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
TaxonConcept Knowledge Base / GeoSpecies Knowledge Base
About the GeoSpecies Knowledge Base
------------------------------------------------------------