[tdwg-content] Treatise on Occurrence, tokens, and basisOfRecord
Steve Baskauf
steve.baskauf at vanderbilt.edu
Sun Oct 24 01:02:56 CEST 2010
I have been dreading trying to write this post which I have promised (or
threatened depending on if you have enjoyed or been annoyed by the
previous lengthy thread) for some time. I have dreaded it because this
is a complicated subject and not one that is amenable to terse
messages. However, after the previous conversation with Rich et al., I
feel for the first time that I have the questions (not answers!) clearly
in my mind. So rather than starting off rambling about LivingSpecimens
and establishmentMeans as I had planned, I'm going to start by laying
down several principles that have come into clarity in my mind after the
previous conversation and the attempt to map things out in a diagram. I
will apologize in advance for failure to use the correct database or IT
technical terms when I'm in unfamiliar territory. Until there is a
consensus about how we deal with the "tokens" we use to document
Occurrences, I'm not sure that what I have to say about those other
topics will make sense.
PRINCIPLES (derived from earlier discussion)
1. We have a number of kinds of "things" (which I will henceforth refer
to as "resources") that are useful for describing and organizing
metadata that we collect in our attempts to document biodiversity. For
many of these types of resources, we have defined classes to categorize
the terms that can be used to describe the properties of resources that
are instances of that class. Describing the class helps us to
understand the type of resources that constitute instances of that class.
2. A conscious decision was made to avoid formally defining rdfs:domain
for Darwin Core terms. This decision was made to provide flexibility in
the way the terms can be used and to avoid the situation where semantic
clients would draw incorrect or silly conclusions about what kind of
things resources are. However, this decision does not excuse us from
thinking carefully about whether a term can be appropriately applied to
a resource that is a member of some class (e.g. should we say that a
digital photograph has a scientific name?). Placing a term within a
class is a suggestion that the term would appropriately be applied as a
property of an instance of a class.
3. When users want to "flatten" and simplify their databases, they tend
to eliminate one-to-many (1:M) relationships in favor of one-to-one
(1:1) relationships. The result is differences like those we saw in
http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif (which allows
1:M relationships between Occurrences and Events and between Events and
Locations) and
http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif (which
"atomizes" every Occurrence by considering it to have its own separate
eventTime and Location information).
A. There is nothing intrinsically "right" or "wrong" about either of
these approaches, because they each have their own advantages. The 1:M
approach is more efficient, but results in a more complicated database,
while the 1:1 approach results in a simpler database but may require
repeating some or many term values in the records.
B. The choices that users make in these situations are the cause of much
of the disagreement about whether a certain class should exist or not
since the people taking the 1:1 approach "collapse" the relationship
diagram and eliminate classes they don't need while people who take the
1:M approach need instances of the class to act as nodes to connect
their "many" resources to some other thing.
C. This collapsing of the diagram is also the reason for some
disagreement about whether a term belongs in a certain class or not. In
the example above, 1:1 people would say that eventDate is a property of
an Occurrence, while 1:M people would say that eventDate is a property
of an Event.
D. The choice of users on this issue influences their decision about
whether or not to create resources that are instances of classes and
hence to assign them identifiers. If users take the 1:M approach, they
need identifiers for resources that are acting as connecting nodes so
that they can make reference to that resource in the metadata of the
many things they are connecting to it. If users take the 1:1 approach,
they probably will skip creating explicit resources (and their
corresponding identifiers) for resources of the class that they are
"collapsing" out of the diagram.
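
As a rough illustration of the two approaches in Principle 3, here is a minimal Python sketch. The term names (occurrenceID, eventID, locationID, eventDate, locality) are Darwin Core terms, but the identifiers and values such as "occ1" and "Example Ridge" are invented for the example, and the flatten function is only one assumed way of collapsing the diagram.

```python
# 1:M layout: Events and Locations are separate resources with their own
# identifiers, so many Occurrences can point at one Event.
# All identifiers and values below are invented sample data.
locations = {"loc1": {"locality": "Example Ridge"}}
events = {"evt1": {"eventDate": "2010-10-01", "locationID": "loc1"}}
occurrences = [
    {"occurrenceID": "occ1", "eventID": "evt1"},
    {"occurrenceID": "occ2", "eventID": "evt1"},
]


def flatten(occurrences, events, locations):
    """Collapse the 1:M diagram into one self-contained row per Occurrence."""
    rows = []
    for occ in occurrences:
        event = events[occ["eventID"]]
        location = locations[event["locationID"]]
        # The Event and Location identifiers disappear; their property
        # values are repeated in every row that needs them.
        rows.append({
            "occurrenceID": occ["occurrenceID"],
            "eventDate": event["eventDate"],
            "locality": location["locality"],
        })
    return rows


flat_rows = flatten(occurrences, events, locations)
for row in flat_rows:
    print(row)
```

In the flattened rows eventDate looks like a property of the Occurrence, which is the disagreement described in C, and the repeated values are the cost mentioned in A.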
4. I would propose that the "right" relationship diagram is not
necessarily one that caters to a certain "right" philosophical point of
view. Rather, the "right" diagram is the one that allows users to
define the relationships that they need for the organization of their
metadata in the simplest manner, and which provides the most clarity
about what resources of various kinds are, and how they are connected.
A. "Right" as I have defined it above depends on how broadly
the relationship diagram is intended to apply. An individual person or
organization with limited interests may have a relationship diagram that
is simpler than the diagram shown at
http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif or might choose
to add classes for other things that are of personal interest to them. An
organization focused on different issues or with broader
interests might opt for many more or different classes that would be
connected to those shown in the diagram.
B. Given what I just said in A, what is "right" for Darwin Core is going
to be defined by the needs of the Darwin Core constituency. At the TDWG
meeting, John Wieczorek made a statement which I will paraphrase as "in
order for a term to make it into Darwin Core, at least two people had to
want it". I'm not sure to what extent he was joking about this, but it
makes the point that one must consider community needs before saying
that a certain part of the "diagram" is necessary. I think that the
reason that Rich and I were so quickly able to come to a consensus on
the organization of the left side of the diagram is because he realized
that there was a significant part of the DwC constituency that needed a
way to group occurrences (i.e. needed Individuals) and I realized that
there was a significant part of the constituency that needed to group
multiple Events at a Locality and multiple Occurrences at an Event. So
in evaluating alternative conceptual systems for organizing resources,
the question has to be asked to what extent an alternative allows
broad segments of the DwC constituency to organize their metadata in an
efficient and conceptually sensible way. If one alternative is more
broadly applicable and conceptually clear than another, then that
alternative is better regardless of the philosophical underpinnings of
the argument.
5. The last point is one that has run as an undercurrent through various
TDWG threads but which may not have been explicitly stated in this
particular thread. That is that there should be a separation between
what a resource IS and what we want to use a resource FOR. To use
technical terms, we need to separate the "type" of a resource from its
fitness of use. A digital image IS a digital image. It might be used
FOR documenting that an organism was at a particular location at a
particular time, but it could be used to illustrate a character, as a
part of a visual key, as media for an educational presentation, as art,
and probably many other things that aren't popping into my mind at the
moment. I believe that much of the confusion about "what is an
Occurrence" comes from a failure to make this distinction.
THE ISSUE OF THE TOKEN
Earlier in the thread of "What is an Occurrence", there was a general
consensus that an Occurrence often had a "thing" that was associated
with it that served as evidence that a taxon representative (i.e.
Individual) occurred at a particular Location at a particular time. In
my Biodiversity Informatics paper, I called this thing a
"representation", but I now believe that "token" is a better term and
will use it hereafter. There also seemed to be a consensus that an
observation was simply an Occurrence that did not have an associated
token. (This is with the understanding that observation is being
narrowly defined as a type of Occurrence, with a definable time and
location, as opposed to what I called the "checklist" definition which
indicated that some undefined taxon representative was present in some
defined geographical area at an indefinite time.) In one of my earlier
posts, I pleaded for somebody to tell me whether there was an assumption
that the token was considered a part of the Occurrence or whether it was
a separate thing. I did not get any responses, which I'm construing to
mean that people weren't sure about this. At present, I have a
clearer idea of the general principles I outlined above, and also have
the "Rich" diagram for modeling relationships, so I'm going to again
pose this question, but in what I hope is a clearer way. I have re-made
the earlier diagram as Rich suggested, using triangles rather than
arrows. The wide side of the triangle is the "many" side of the
relationship and the point is the "one" side. As before, I'm deferring
on the right side of the diagram (to the right of Identification) to the
taxonomists for now, so let's keep that out of the discussion for the
moment. I have also clarified the diagram by coloring in the actual DwC
classes to distinguish them from selected terms that fall within those
classes (non-colored boxes) and which can be used as properties of
resources that are instances of the class. The two alternatives that
I'm discussing are:
http://bioimages.vanderbilt.edu/pages/token-assumed.gif which I will
refer to as the "assumed token" model and
http://bioimages.vanderbilt.edu/pages/token-explicit.gif which I will
refer to as the "explicit token" model.
I believe that historically the assumed token model has been the one
which most people have had in mind. Before the new DwC standard, we had
specimens and we had observations. In order to avoid redundancies in
terms for those two types of "things", a combined "thing" called
"Occurrence" was created. An Occurrence that was an observation didn't
have a token and an Occurrence that was a specimen had a physical or
living specimen as its token. That's all pretty simple and sensible and
we see evidence of this kind of thinking in the descriptions given at
http://rs.tdwg.org/dwc/terms/index.htm . A record for an Occurrence has
a thing called its dwc:basisOfRecord that presumably describes the kind
of token (if any). So if the token were a preserved specimen, we would
say that [Occurrence] basisOfRecord [PreservedSpecimen]. If there were
no token we would say [Occurrence] basisOfRecord [HumanObservation] or
[Occurrence] basisOfRecord [MachineObservation]. Referring back to the
assumed token diagram, in the case of a specimen there is no explicit
reference to the specimen as a separate entity. The terms related to
the specimen, such as preparations and disposition are just plopped into
the Occurrence class which implies that they are properties of the
Occurrence itself.
There seems to be a general consensus that other kinds of tokens can be
used to document an Occurrence. However, the way that the current
Darwin Core terms are designed and placed within classes is very
inconsistent in how it handles types of tokens other than
specimens. According to the instructions at the top of
http://rs.tdwg.org/dwc/terms/index.htm, a camera trap bird sighting
should have [Occurrence] basisOfRecord [MachineObservation]. It is not
clear how one is supposed to handle the actual metadata for the image
that serves as the token. Unlike specimens where the token's metadata
terms are placed in the Occurrence class, I guess in the case of an
image one is supposed to use associatedMedia to link the so-called
MachineObservation to the image record. If DNA were extracted, one
would link the sequence to the Occurrence using associatedSequences
(although it's not clear to me what the basisOfRecord for that would be
- "TookATissueSample"?). But what does one do for other kinds of
tokens, like seeds or tissue samples - create terms like associatedSeed
and associatedTissueSample? I think that the ResourceRelationship terms
were supposed to handle this problem, but I have yet to see an example
of exactly how this was supposed to work.
As an attempt to resolve this confusion in my mind, I wrote the
Biodiversity Informatics paper that I've promoted frequently on this
list (https://journals.ku.edu/index.php/jbi/article/view/3664). In that
paper, I take the basic assumed token model and broaden it in an attempt
to make the assumed token model work for all kinds of tokens. Because I
assumed that each occurrence has a single token, I "collapsed the
diagram" and connected the properties of the token directly to the
Occurrence resource (as was modeled when specimen properties were placed
within the Occurrence class). If there were several tokens for a given
Individual, I "flattened" the records by creating a separate Occurrence
resource for each token. The model was generalized further by allowing
secondary Occurrence records where the token was not derived directly
from the organism but rather derived from a primary Occurrence record.
This covers complicated circumstances such as those found in a botanical
garden, where a seed or cutting might be collected from a tree with subsequent
generation of a LivingSpecimen, which might have a PreservedSpecimen
collected from it and a DigitalStillImage taken of the preserved
specimen. You can see examples of the complex types of situations I
tried to handle at
http://bioimages.vanderbilt.edu/pages/conceptual-scheme-insect.gif and
http://bioimages.vanderbilt.edu/pages/conceptual-scheme-botanical.gif
I created my own terms (like sernec:derivativeOccurrence and
sernec:derivedFrom) to describe the connections among the individual and
the various layers of Occurrences.
Does this system work? Yes, but there are a number of problems
associated with it. The first problem is related to Principle 4 above.
In order for this system to work, there needs to be a consensus in the
DwC community about several things. One is that each Occurrence must
have only one token. If we are going to "type" Occurrences by their
basisOfRecord (and the acceptable values for basisOfRecord are
officially DwC types, see
http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm), then an
Occurrence can't have two values for basisOfRecord. It is clear from
the discussion we've had that people would like to consider a single
Occurrence to be able to have multiple tokens as documentation. The
second problem is that there needs to be a consensus that a secondary
Occurrence can exist at all (i.e. can you call the image of a specimen
"an Occurrence"?). It is clear to me from the discussion that when
people are thinking about what an Occurrence means, they have in mind
the documentation of the time and place of the Individual in its
environment. In a previous communication, John Wieczorek clarified that
terms describing Occurrences like recordedBy and eventDate should only
apply to primary occurrences and that it would not be appropriate to use
them as properties of what I'm calling a secondary occurrence (such as
the image of a specimen). So I dealt with this by creating a
distinction between Occurrences that document the distribution of a
taxon (using the term sernec:documentsDistribution) and those that
don't. This is something like the old validDistributionFlag, but I
defined documentsDistribution specifically as having a value of "true"
only for Occurrences that were derived directly from the Individual
(gray arrows in the two diagrams from the paper).
But I think that the worst "crime" of the system I suggested is
violation of Principle 5 above. By asserting an unvarying 1:1
relationship between the Occurrence and its token and by collapsing my
relationship diagram to not explicitly include a resource that is the
token itself, I am confusing the USE of an Occurrence (to demonstrate
that a representative of a taxon was present at a particular Location at
a particular time) with what the token IS (a dead organism in a jar or
glued to paper, an electronic representation of photon patterns, a
series of characters representing a nucleotide sequence). So I'm
charging myself with this "crime", pleading guilty, and accepting my
sentence, which is to admit that the system I suggested in the
Biodiversity Informatics paper is "wrong" based on the principles I
outlined above. What this amounts to is an acceptance of the
"rightness" of the explicit token model (in the sense that I defined
"right" in Principle 4 above).
However, if I'm going to make this admission, I demand that the other
guilty parties also confess, namely people who want to assert that
Occurrences have properties that actually are properties of specimens.
If we are going to have a system that actually works, we can't straddle
the fence and say that the assumed token model is correct for specimens
and that the explicit token model is correct for every other kind of
token. If we accept the explicit token model, then the specimen will have
to come off of its throne and be a token like all of the other ways
that we provide evidence that an Occurrence happened. If we accept the
explicit token model, then as a biodiversity informatics resource type
"observation" will have to disappear into a puff of nothingness just
like the "luminiferous ether", "centrifugal force", and other kinds of
things that we thought we needed to have to explain things but which
turned out to be unnecessary when we figured out more basic
explanations. A human observation will simply be an Occurrence that
doesn't have a token (which is what I've heard some people say all
along). If we allow the Occurrence/token relationship to be a
one-to-many relationship rather than one-to-one, then HumanObservation
is just the one-to-zero case of the more general one-to-many. For those
of you who like the idea of a "machine observation", that is just an
Occurrence with a token that is whatever type of resource that the
machine produces (electronic data file, image of the organism, image of
a graph, or whatever).
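
To make the one-to-zero idea concrete, here is a minimal Python sketch of an Occurrence that carries a list of tokens. The class names, field names, and identifiers are assumptions made up for the illustration; they are not Darwin Core terms.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    # Hypothetical fields: an identifier and the kind of evidence it is.
    token_id: str
    token_type: str  # e.g. "PreservedSpecimen" or "StillImage"


@dataclass
class Occurrence:
    occurrence_id: str
    # Zero, one, or many tokens may document the same Occurrence.
    tokens: List[Token] = field(default_factory=list)

    def is_observation(self) -> bool:
        # An "observation" is just the case where no token exists.
        return len(self.tokens) == 0


# Invented sample data.
sighting = Occurrence("occ-A")
collected = Occurrence(
    "occ-B",
    [Token("tok-1", "PreservedSpecimen"), Token("tok-2", "StillImage")],
)

print(sighting.is_observation())
print(collected.is_observation())
```

Under this sketch no separate "observation" type is needed. Whether an Occurrence is an observation falls out of whether its token list is empty.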
ADVANTAGES OF RECOGNIZING TOKENS EXPLICITLY
If we accept the explicit token model over the assumed token model, a
number of problems get solved. Just as was the case with Events, people
who want to flatten things out by having only one token per Occurrence
can do so. For example, if I want to atomize things by defining my
Occurrence to have taken place during an Event that lasted only the one
second within which my camera shutter clicked, I can do that and have
only a single token associated with that Occurrence. On the other hand,
if others want to define their Occurrence as taking place over the time
over which they photographed, collected a leaf tissue sample, and then
collected a branch of a tree for an herbarium specimen, then they can do
that and associate all of those tokens (one or more images, the tissue
sample, and the preserved specimen) with the single Occurrence.
Another important benefit will come down the line when we actually try
to develop RDF templates. Right now it is not exactly clear (at least
to me) how properties should be divided up among resources that are
being described in the RDF. Based on the assumed token model, I have
been including the metadata for the token within the container element
for the Occurrence. This leads to some of the kind of odd assertions
that people have been objecting to, such as
[Occurrence] dcterms:rights ["(c) 2002 Steven J. Baskauf"] or
[Occurrence] preparations ["skin"].
In the explicit token model, dividing metadata up appropriately among
separate Occurrence and token resources makes more sense, e.g.
[Occurrence] recordedBy ["Joe Curator"]
[image] dcterms:rights ["(c) 2002 Steven J. Baskauf"]
[specimen] preparations ["skin"]
If we wanted to be really explicit about this, we probably should have a
separate class for PhysicalSpecimens and separate the terms that
describe specimens from those that describe Occurrences in general.
There might be some difficulty in doing this because there are some
terms that might be hard to decide about, like catalogNumber. I don't
really think the catalogNumber is a property of the Occurrence, because
it makes more sense to me to say
[specimen] catalogNumber ["12345"] than
[Occurrence] catalogNumber ["12345"]
Realistically, I can't see this kind of separation ever happening, given
the amount of trouble it's been just to get a few people to admit that
Individuals exist. It is just too hard to get motion to happen in the
TDWG community. As a practical matter, people who "compress" the system
(which we admit happens and make concession to in Principle 3) by having
record tables where a single row contains the metadata for both the
Occurrence and the token (i.e. treat it as a 1:1 relationship) will
simply have a column heading for catalogNumber and not care whether the
catalogNumber applies to the Occurrence or the token. It's the people
who want to do the more complicated stuff like simultaneously keep track
of multiple tokens per Occurrence (like several images, a sound
recording, and a specimen), people who want to write RDF, or people who
want to merge databases containing many types of tokens who will have to
pay attention to this distinction. Physical specimens would really be
the only kind of class we would have to create because there already is
a rich vocabulary for media items that is separate from DwC (i.e. the
MRTG schema) and there are probably also vocabularies for stuff like
tissue samples and DNA sequences (although I'm not familiar with them).
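
The division of metadata between the Occurrence and its tokens can be sketched in a few lines of Python. The mapping of terms to resources below is my assumption for the sake of the example (including the choice to treat catalogNumber as a specimen property, and to default unknown terms to the Occurrence); it is not something Darwin Core defines.

```python
# Assumed assignment of each term to the kind of resource it describes.
TERM_OWNER = {
    "recordedBy": "occurrence",
    "preparations": "specimen",
    "catalogNumber": "specimen",
    "dcterms:rights": "image",
}


def split_row(flat_row):
    """Divide a flattened record into one property dict per resource."""
    resources = {}
    for term, value in flat_row.items():
        # Assumption: terms not listed above stay with the Occurrence.
        owner = TERM_OWNER.get(term, "occurrence")
        resources.setdefault(owner, {})[term] = value
    return resources


# A "compressed" row of the kind described above, with invented pairing
# of the example values used earlier in this post.
flat_row = {
    "recordedBy": "Joe Curator",
    "preparations": "skin",
    "catalogNumber": "12345",
    "dcterms:rights": "(c) 2002 Steven J. Baskauf",
}

resources = split_row(flat_row)
for name, properties in resources.items():
    print(name, properties)
```

People who keep the compressed table never need to run anything like split_row. It only matters to someone who has to pull the Occurrence and token resources apart, for example to write RDF.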
TYPING
Bob has warned us about the dangers of asserting that a term always
applies to a certain type of resource by asserting that the term has an
rdfs:domain . However, we should not avoid attempting to assert that a
resource is itself of a certain type. Describing the "type" of a
resource is an important part of letting potential users assess the
possible fitness of use of that resource. For example, you can collect
DNA from a preserved specimen but not from an image. You can include an
image in a print journal article but not a sound recording. You can
build a range map from Occurrences, but not from DNA samples. In
RDF, one of the basic properties that should be described about every
resource is its rdf:type . In the generic Linked Data world, you can
pretty much use anything that you want as an rdf:type . If you decide
to use something obscure, then the danger is that nobody else will have
any idea what kind of thing you are describing. The Draft TDWG GUID
Applicability Statement recommendation 11 says that "Objects in the
biodiversity informatics domain that are identified by a GUID should be
typed using the TDWG ontology or other well-known vocabularies in
accordance with the TDWG common architecture." So in our community, we
can't just type resources any way we want. But exactly how we SHOULD
type things isn't clear. There isn't any functioning TDWG ontology at
the moment. I have found it useful to use the DwC class as the
rdf:type in my attempts to write RDF. That works pretty well for
things that have DwC classes. But if we follow the explicit token
model, we need to have some consensus on what we will use as the
rdf:type for the tokens. At this point it looks to me like it would
make sense to have the convention that for tokens one uses either a
dcterms:type or a Darwin Core type (i.e. one of the types listed at
http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm, although as I
already noted, there is no need for HumanObservation in the case of
describing a token because human observations don't have tokens). There
isn't any sort of "collision" here of the sort that happened right after
the adoption of the Darwin Core Standard when we tried to merge the
Dublin and Darwin Core types (see
http://www.keytonature.eu/wiki/MRTGv08_Type_term_inconsistent_with_DwC and
http://lists.tdwg.org/pipermail/tdwg-content/2009-October/000301.html
with many following responses for the gruesome details) since rdf:type
doesn't demand any particular type vocabulary. I'm not entirely happy
with this approach because for digital still images the logical type
would be dctype:StillImage, which doesn't give any indication as to
whether the image is film or digital, but I guess at this point in the
21st century most consuming applications will probably just assume
digital anyway.
So (assuming that Individuals become a DwC class) I guess I don't really
see that there is any problem in using the current Darwin Core classes
to indicate the rdf:type of every kind of resource that we would be
reasonably likely to assign GUIDs to EXCEPT for tokens. Typing of
tokens could be done using a combination of Darwin Core and Dublin Core
types. What I'm left scratching my head about is basisOfRecord. When I
subscribed to the assumed token model (i.e. when I wrote the
Biodiversity Informatics paper), I thought I knew what basisOfRecord
meant. It meant the kind of token that backed up an Occurrence. So
when I wrote RDF for a specimen (as in
http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf) I used the
"hand grenade" approach to typing. I lobbed every kind of "typing" that
I knew of at the Occurrence record for a specimen:
[Occurrence] rdf:type [dwc:Occurrence]
[Occurrence] dwc:basisOfRecord [dwctype:PreservedSpecimen]
and
[Occurrence] dcterms:type [dctype:PhysicalObject]
Under the explicit token model, I would just use
[Occurrence] rdf:type [dwc:Occurrence]
for the Occurrence and
[specimen] rdf:type [dwctype:PreservedSpecimen]
for the specimen itself. If I also took an image at the same time and
wanted to say that it was part of the same Occurrence as the specimen, I
would use
[image] rdf:type [dctype:StillImage]
Under the explicit token model, I really can't see any use for
dwc:basisOfRecord . Despite the resolution of the "train wreck"
involving dcterms:type that we narrowly avoided after the adoption of
Darwin Core, the definition still says "the specific nature of the data
record - a subtype of the dcterms:type." I think this is clearly wrong
because I think we established that it was NOT a subtype of dcterms:type
in that discussion that I referenced above. So what is
basisOfRecord??? What is "the data record" of which we are describing
the nature? If it's the Occurrence, then I think the consensus that I'm
hearing in the discussion is that an Occurrence data record shouldn't
have as its type any of the dwctype terms except for
dwctype:Occurrence. So what are all of the other terms like
PreservedSpecimen for???
Under the explicit token model, what we really need is NOT
basisOfRecord. What we need is some term like "dwc:tokenID" if you like
the Darwin Core IDREF style or if you prefer the style of the Linked
Data community "dwc:hasToken". In both cases, the object of the term
would be an identifier for the token that's associated with a subject
Occurrence. This term could be applied from zero (for observations) to
many times to an Occurrence. People who want to flatten everything out
will just ignore this term and cram all their metadata for the
Occurrence, token, Event, and Location onto one line in their metadata
table. People who are going to use any kind of one-to-many
relationships at all will have to figure out how to handle that anyway
and won't be daunted by having more than one dwc:tokenID per
Occurrence. In the spirit of the complicated resource relationship
diagrams from my paper, one could link primary tokens (like specimens)
to secondary tokens (like specimen images) by using dwc:tokenID as
well. Any kind of token (primary, secondary, tertiary, ad infinitum)
could be linked to the occurrence that it supports with dwc:occurrenceID.
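
Here is a minimal Python sketch of how such a linking term might be used, with triples held as plain tuples rather than in a real RDF library. Note that "dwc:hasToken" is only the term proposed above and does not exist in Darwin Core, and all of the resource identifiers are invented. The typing property is written rdf:type, which is the standard RDF name for it.

```python
# Each triple is (subject, predicate, object).
# "dwc:hasToken" is the hypothetical term proposed in this post.
triples = {
    ("occ1", "rdf:type", "dwc:Occurrence"),
    ("occ1", "dwc:hasToken", "specimen1"),
    ("occ1", "dwc:hasToken", "image1"),
    ("specimen1", "rdf:type", "dwctype:PreservedSpecimen"),
    # An image of the specimen: a secondary token hanging off a primary one.
    ("specimen1", "dwc:hasToken", "image2"),
    ("image1", "rdf:type", "dctype:StillImage"),
    ("image2", "rdf:type", "dctype:StillImage"),
    # An Occurrence with no token at all, i.e. an observation.
    ("occ2", "rdf:type", "dwc:Occurrence"),
}


def objects(subject, predicate):
    """Return the sorted objects of all triples matching subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def all_tokens(resource):
    """Follow dwc:hasToken links to collect primary, secondary, ... tokens."""
    found = []
    queue = objects(resource, "dwc:hasToken")
    while queue:
        token = queue.pop(0)
        if token not in found:
            found.append(token)
            queue.extend(objects(token, "dwc:hasToken"))
    return found


print(all_tokens("occ1"))
print(all_tokens("occ2"))
```

The same term is applied zero times to occ2, twice to occ1, and once more to link the specimen to its image, so nothing special is needed for observations or for secondary tokens.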
WHAT DOES THIS DEMAND OF US?
OK, I've now gone on for eight pages of text explaining the rationale
behind the question. So I'll return to the basic question: is the
consensus for modeling the relationship between an Occurrence and
associated token(s) the assumed token model:
http://bioimages.vanderbilt.edu/pages/token-assumed.gif
or the explicit token model:
http://bioimages.vanderbilt.edu/pages/token-explicit.gif
?
If we accept the assumed token model with all of its warts, then for
consistency's sake, we must create dwctype terms for each of the types
of tokens that people would reasonably want to use as evidence for
Occurrences (and my proposal for adding DigitalStillImage as a Darwin
Core type stands). We must also resign ourselves to assigning a
separate occurrence to each token that users want to use to document the
presence of a taxon at a time and place. We also must accept having
goofy-sounding statements like
[Occurrence] dcterms:rights ["(c) 2002 Steven J. Baskauf"]
If we accept the explicit token model, then we need to either dump
basisOfRecord or come up with some rational explanation for what it
actually means (and my proposal to add DigitalStillImage as a Darwin
Core type becomes irrelevant). We also need to create some kind of term
like dwc:tokenID that will allow connections to be made between
Occurrence records and their tokens. For people who want to flatten out
their Occurrence records and put the tokens together with the Occurrence
(i.e. "compress the diagram" to get rid of the token resource), and who
feel some need to indicate the type of the token that they are using,
let them use any appropriate term from the Dublin Core or Darwin Core
types as a value for rdf:type.
Until we make one of these choices or the other and "fix" Darwin Core to
work in a consistent way, we are just going to continue to misunderstand
each other because each person will just "know an Occurrence when they
see it".
In the interest of space, I am going to defer on explaining my opinions
about LivingSpecimen and establishmentMeans. Those explanations are
contingent on the conclusion that we reach on this issue.
Steve
--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.
delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu