I have been dreading trying to write this post, which I have promised (or threatened, depending on whether you have enjoyed or been annoyed by the previous lengthy thread) for some time. I have dreaded it because this is a complicated subject and not one that is amenable to terse messages. However, after the previous conversation with Rich et al., I feel for the first time that I have the questions (not answers!) clearly in my mind. So rather than starting off rambling about LivingSpecimens and establishmentMeans as I had planned, I'm going to start by laying down several principles that have come into clarity in my mind after the previous conversation and the attempt to map things out in a diagram. Until there is a consensus about how we deal with the "tokens" we use to document Occurrences, I'm not sure that what I have to say about those other topics will make sense. I apologize in advance for any failure to use the correct database or IT technical terms when I'm in unfamiliar territory.
PRINCIPLES (derived from earlier discussion)
1. We have a number of kinds of "things" (which I will henceforth refer to as "resources") that are useful for describing and organizing metadata that we collect in our attempts to document biodiversity. For many of these types of resources, we have defined classes to categorize the terms that can be used to describe the properties of resources that are instances of that class. Describing the class helps us to understand the type of resources that constitute instances of that class.
2. A conscious decision was made to avoid formally defining rdfs:domain for Darwin Core terms. This decision was made to provide flexibility in the way the terms can be used and to avoid the situation where semantic clients would draw incorrect or silly conclusions about what kind of things resources are. However, this decision does not excuse us from thinking carefully about whether a term can be appropriately applied to a resource that is a member of some class (e.g. should we say that a digital photograph has a scientific name?). Placing a term within a class is a suggestion that the term would appropriately be applied as a property of an instance of a class.
3. When users want to "flatten" and simplify their databases, they tend to eliminate one-to-many (1:M) relationships in favor of one-to-one (1:1) relationships. The result of that is differences like we saw in
http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif (which allows 1:M relationships between Occurrences and Events and between Events and Locations) and
http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif (which "atomizes" every Occurrence by considering it to have its own separate eventTime and Location information).
A. There is nothing intrinsically "right" or "wrong" about either of these approaches, because they each have their own advantages. The 1:M approach is more efficient, but results in a more complicated database, while the 1:1 approach results in a simpler database but may require repeating some or many term values in the records.
B. The choices that users make in these situations are the cause of much of the disagreement about whether a certain class should exist or not, since the people taking the 1:1 approach "collapse" the relationship diagram and eliminate classes they don't need, while people who take the 1:M approach need instances of the class to act as nodes to connect their "many" resources to some other thing.
C. This collapsing of the diagram is also the reason for some disagreement about whether a term belongs in a certain class or not. In the example above, 1:1 people would say that eventDate is a property of an Occurrence, while 1:M people would say that eventDate is a property of an Event.
D. The choice of users on this issue influences their decision about whether or not to create resources that are instances of classes and hence to assign them identifiers. If users take the 1:M approach, they need identifiers for resources that are acting as connecting nodes so that they can make reference to that resource in the metadata of the many things they are connecting to it. If users take the 1:1 approach, they probably will skip creating explicit resources (and their corresponding identifiers) for resources of the class that they are "collapsing" out of the diagram.
4. I would propose that the "right" relationship diagram is not necessarily one that caters to a certain "right" philosophical point of view. Rather, the "right" diagram is the one that allows users to define the relationships that they need for the organization of their metadata in the simplest manner, and which provides the most clarity about what resources of various kinds are, and how they are connected.
A. "Right" as I have defined it above depends on how broadly the relationship diagram is intended to apply. An individual person or organization with limited interests may have a relationship diagram that is simpler than the diagram shown at http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif or might choose to add classes for other things that are of personal interest. An organization focused on different issues or with broader interests might opt for many more or different classes that would be connected to those shown in the diagram.
B. Given what I just said in A, what is "right" for Darwin Core is going to be defined by the needs of the Darwin Core constituency. At the TDWG meeting, John Wieczorek made a statement which I will paraphrase as "in order for a term to make it into Darwin Core, at least two people had to want it". I'm not sure to what extent he was joking about this, but it makes the point that one must consider community needs before saying that a certain part of the "diagram" is necessary. I think that the reason that Rich and I were so quickly able to come to a consensus on the organization of the left side of the diagram is because he realized that there was a significant part of the DwC constituency that needed a way to group occurrences (i.e. needed Individuals) and I realized that there was a significant part of the constituency that needed to group multiple Events at a Locality and multiple Occurrences at an Event. So in evaluating alternative conceptual systems for organizing resources, the question has to be asked as to the extent that an alternative allows broad segments of the DwC constituency to organize their metadata in an efficient and conceptually sensible way. If one alternative is more broadly applicable and conceptually clear than another, then that alternative is better regardless of the philosophical underpinnings of the argument.
5. The last point is one that has run as an undercurrent through various TDWG threads but which may not have been explicitly stated in this particular thread. That is that there should be a separation between what a resource IS and what we want to use a resource FOR. To use technical terms, we need to separate the "type" of a resource from its fitness of use. A digital image IS a digital image. It might be used FOR documenting that an organism was at a particular location at a particular time, but it could be used to illustrate a character, as a part of a visual key, as media for an educational presentation, as art, and probably many other things that aren't popping into my mind at the moment. I believe that much of the confusion about "what is an Occurrence" comes from a failure to make this distinction.
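The two approaches contrasted in Principle 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers, the locality name, the date, and the flatten() helper are all made up, and only the term names (occurrenceID, eventID, locationID, eventDate, locality) come from Darwin Core.

```python
# Hypothetical records illustrating Principle 3; identifiers and values are made up.

# 1:M form: Events and Locations are explicit resources with their own
# identifiers, so many Occurrences can point at one Event.
locations = {"loc1": {"locality": "Example Woods"}}
events = {"evt1": {"eventDate": "2010-10-20", "locationID": "loc1"}}
occurrences = [
    {"occurrenceID": "occ1", "eventID": "evt1"},
    {"occurrenceID": "occ2", "eventID": "evt1"},
]


def flatten(occurrence, events, locations):
    """Collapse the Event and Location nodes into a single 1:1 row."""
    event = events[occurrence["eventID"]]
    location = locations[event["locationID"]]
    return {
        "occurrenceID": occurrence["occurrenceID"],
        "eventDate": event["eventDate"],
        "locality": location["locality"],
    }


# 1:1 form: one row per Occurrence, with the Event and Location values repeated.
flat_rows = [flatten(o, events, locations) for o in occurrences]
for row in flat_rows:
    print(row)
```

In the flattened rows the eventID and locationID disappear (nothing needs to refer to those nodes any more), and eventDate ends up looking like a property of the Occurrence, which is the disagreement described in 3C and 3D.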
THE ISSUE OF THE TOKEN
Earlier in the thread of "What is an Occurrence", there was a general consensus that an Occurrence often had a "thing" that was associated with it that served as evidence that a taxon representative (i.e. Individual) occurred at a particular Location at a particular time. In my Biodiversity Informatics paper, I called this thing a "representation", but I now believe that "token" is a better term and will use it hereafter. There also seemed to be a consensus that an observation was simply an Occurrence that did not have an associated token. (This is with the understanding that observation is being narrowly defined as a type of Occurrence, with a definable time and location, as opposed to what I called the "checklist" definition which indicated that some undefined taxon representative was present in some defined geographical area at an indefinite time.) In one of my earlier posts, I pleaded for somebody to tell me whether there was an assumption that the token was considered a part of the Occurrence or whether it was a separate thing. I did not get any responses, which I'm construing to mean that people weren't sure about this. At the present, I now have a clearer idea of the general principles I outlined above, and also have the "Rich" diagram for modeling relationships, so I'm going to again pose this question, but in what I hope is a clearer way. I have re-made the earlier diagram as Rich suggested, using triangles rather than arrows. The wide side of the triangle is the "many" side of the relationship and the point is the "one" side. As before, I'm deferring on the right side of the diagram (to the right of Identification) to the taxonomists for now, so let's keep that out of the discussion for the moment. I have also clarified the diagram by coloring in the actual DwC classes to distinguish them from selected terms that fall within those classes (non-colored boxes) and which can be used as properties of resources that are instances of the class. 
The two alternatives that I'm discussing are:
http://bioimages.vanderbilt.edu/pages/token-assumed.gif which I will refer to as the "assumed token" model and
http://bioimages.vanderbilt.edu/pages/token-explicit.gif which I will refer to as the "explicit token" model.
I believe that historically the assumed token model has been the one which most people have had in mind. Before the new DwC standard, we had specimens and we had observations. In order to avoid redundancies in terms for those two types of "things", a combined "thing" called "Occurrence" was created. An Occurrence that was an observation didn't have a token and an Occurrence that was a specimen had a physical or living specimen as its token. That's all pretty simple and sensible and we see evidence of this kind of thinking in the descriptions given at http://rs.tdwg.org/dwc/terms/index.htm . A record for an Occurrence has a thing called its dwc:basisOfRecord that presumably describes the kind of token (if any). So if the token were a preserved specimen, we would say that [Occurrence] basisOfRecord [PreservedSpecimen]. If there were no token we would say [Occurrence] basisOfRecord [HumanObservation] or [Occurrence] basisOfRecord [MachineObservation]. Referring back to the assumed token diagram, in the case of a specimen there is no explicit reference to the specimen as a separate entity. The terms related to the specimen, such as preparations and disposition, are just plopped into the Occurrence class, which implies that they are properties of the Occurrence itself.
There seems to be a general consensus that other kinds of tokens can be used to document an Occurrence. However, the way that the current Darwin Core terms are designed and placed within classes is very inconsistent in how it handles types of tokens other than specimens. According to the instructions at the top of http://rs.tdwg.org/dwc/terms/index.htm, a camera trap bird sighting should have [Occurrence] basisOfRecord [MachineObservation]. It is not clear how one is supposed to handle the actual metadata for the image that serves as the token. Unlike specimens, where the token's metadata terms are placed in the Occurrence class, I guess in the case of an image one is supposed to use associatedMedia to link the so-called MachineObservation to the image record. If DNA were extracted, one would link the sequence to the Occurrence using associatedSequences (although it's not clear to me what the basisOfRecord for that would be - "TookATissueSample"?). But what does one do for other kinds of tokens, like seeds or tissue samples - create terms like associatedSeed and associatedTissueSample? I think that the ResourceRelationship terms were supposed to handle this problem, but I have yet to see an example of exactly how this was supposed to work.
As an attempt to resolve this confusion in my mind, I wrote the Biodiversity Informatics paper that I've promoted frequently on this list (https://journals.ku.edu/index.php/jbi/article/view/3664). In that paper, I take the basic assumed token model and broaden it in an attempt to make it work for all kinds of tokens. Because I assumed that each Occurrence has a single token, I "collapsed the diagram" and connected the properties of the token directly to the Occurrence resource (as was modeled when specimen properties were placed within the Occurrence class). If there were several tokens for a given Individual, I "flattened" the records by creating a separate Occurrence resource for each token. The model was generalized further by allowing secondary Occurrence records where the token was not derived directly from the organism but rather derived from a primary Occurrence record. This was meant to handle complicated circumstances such as those found in a botanical garden, where a seed or cutting might be collected from a tree with subsequent generation of a LivingSpecimen, which might have a PreservedSpecimen collected from it and a DigitalStillImage taken of the preserved specimen. You can see examples of the complex types of situations I tried to handle at
http://bioimages.vanderbilt.edu/pages/conceptual-scheme-insect.gif and
http://bioimages.vanderbilt.edu/pages/conceptual-scheme-botanical.gif
I created my own terms (like sernec:derivativeOccurrence and sernec:derivedFrom) to describe the connections among the individual and the various layers of Occurrences.
Does this system work? Yes, but there are a number of problems associated with it. The first problem is related to Principle 4 above. In order for this system to work, there needs to be a consensus in the DwC community about several things. One is that each Occurrence must have only one token. If we are going to "type" Occurrences by their basisOfRecord (and the acceptable values for basisOfRecord are officially DwC types, see http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm), then an Occurrence can't have two values for basisOfRecord. It is clear from the discussion we've had that people would like to consider a single Occurrence to be able to have multiple tokens as documentation. The second problem is that there needs to be a consensus that a secondary Occurrence can exist at all (i.e. can you call the image of a specimen "an Occurrence"?). It is clear to me from the discussion that when people are thinking about what an Occurrence means, they have in mind the documentation of the time and place of the Individual in its environment. In a previous communication, John Wieczorek clarified that terms describing Occurrences like recordedBy and eventDate should only apply to primary occurrences and that it would not be appropriate to use them as properties of what I'm calling a secondary occurrence (such as the image of a specimen). So I dealt with this by creating a distinction between Occurrences that document the distribution of a taxon (using the term sernec:documentsDistribution) and those that don't. This is something like the old validDistributionFlag, but I defined documentsDistribution specifically as having a value of "true" only for Occurrences that were derived directly from the Individual (gray arrows in the two diagrams from the paper).
But I think that the worst "crime" of the system I suggested is violation of Principle 5 above. By asserting an unvarying 1:1 relationship between the Occurrence and its token and by collapsing my relationship diagram to not explicitly include a resource that is the token itself, I am confusing the USE of an Occurrence (to demonstrate that a representative of a taxon was present at a particular Location at a particular time) with what the token IS (a dead organism in a jar or glued to paper, an electronic representation of photon patterns, a series of characters representing a nucleotide sequence). So I'm charging myself with this "crime", pleading guilty, and accepting my sentence, which is to admit that the system I suggested in the Biodiversity Informatics paper is "wrong" based on the principles I outlined above. What this amounts to is an acceptance of the "rightness" of the explicit token model (in the sense that I defined "right" in Principle 4 above).
However, if I'm going to make this admission, I demand that the other guilty parties also confess, namely people who want to assert that Occurrences have properties that actually are properties of specimens. If we are going to have a system that actually works, we can't straddle the fence and say that the assumed token model is correct for specimens and that the explicit token model is correct for every other kind of token. If we accept the explicit token model, then the specimen will have to come off of its throne and be a token like all of the other ways that we provide evidence that an Occurrence happened. If we accept the explicit token model, then as a biodiversity informatics resource type "observation" will have to disappear into a puff of nothingness just like the "luminiferous ether", "centrifugal force", and other kinds of things that we thought we needed to have to explain things but which turned out to be unnecessary when we figured out more basic explanations. A human observation will simply be an Occurrence that doesn't have a token (which is what I've heard some people say all along). If we allow the Occurrence/token relationship to be a one-to-many relationship rather than one-to-one, then HumanObservation is just the one-to-zero case of the more general one-to-many. For those of you who like the idea of a "machine observation", that is just an Occurrence with a token that is whatever type of resource the machine produces (electronic data file, image of the organism, image of a graph, or whatever).
ADVANTAGES OF RECOGNIZING TOKENS EXPLICITLY
If we accept the explicit token model over the assumed token model, a number of problems get solved. Just as was the case with Events, people who want to flatten things out by having only one token per Occurrence can do so. For example, if I want to atomize things by defining my Occurrence to have taken place during an Event that lasted only the one second within which my camera shutter clicked, I can do that and have only a single token associated with that Occurrence. On the other hand, if others want to define their Occurrence as taking place over the time over which they photographed, collected a leaf tissue sample, and then collected a branch of a tree for an herbarium specimen, then they can do that and associate all of those tokens (one or more images, the tissue sample, and the preserved specimen) with the single Occurrence.
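To make the zero-to-many idea concrete, here is a minimal Python sketch of the explicit token model. The Token and Occurrence classes, their field names, and every identifier are hypothetical and invented for this illustration; nothing here is an existing Darwin Core structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    # Hypothetical class: a piece of evidence (specimen, image, ...) for an Occurrence.
    token_id: str
    kind: str  # informal label for what the token IS


@dataclass
class Occurrence:
    # Hypothetical class: an Individual at a Location at a time.
    occurrence_id: str
    recorded_by: str
    tokens: List[Token] = field(default_factory=list)

    def is_observation(self) -> bool:
        """An observation is simply an Occurrence with no token."""
        return len(self.tokens) == 0


# Zero tokens: what has been called a human observation.
sighting = Occurrence("occ1", "Joe Curator")
# One token: the "atomized" case of a single shutter click.
snapshot = Occurrence("occ2", "Joe Curator", [Token("img1", "StillImage")])
# Many tokens: images, a tissue sample, and a specimen from one visit to a tree.
tree_visit = Occurrence("occ3", "Joe Curator", [
    Token("img2", "StillImage"),
    Token("tissue1", "tissue sample"),
    Token("spec1", "PreservedSpecimen"),
])

for occ in (sighting, snapshot, tree_visit):
    print(occ.occurrence_id, len(occ.tokens), occ.is_observation())
```

The point of the sketch is that no separate "observation" type is needed: whether an Occurrence is an observation falls out of the token count.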
Another important benefit will come down the line when we actually try to develop RDF templates. Right now it is not exactly clear (at least to me) how properties should be divided up among resources that are being described in the RDF. Based on the assumed token model, I have been including the metadata for the token within the container element for the Occurrence. This leads to some of the kind of odd assertions that people have been objecting to, such as
[Occurrence] dcterms:rights ["(c) 2002 Steven J. Baskauf"] or
[Occurrence] preparations ["skin"].
In the explicit token model, dividing metadata up appropriately among separate Occurrence and token resources makes more sense, e.g.
[Occurrence] recordedBy ["Joe Curator"]
[image] dcterms:rights ["(c) 2002 Steven J. Baskauf"]
[specimen] preparations ["skin"]
If we wanted to be really explicit about this, we probably should have a separate class for PhysicalSpecimens and separate the terms that describe specimens from those that describe Occurrences in general. There might be some difficulty in doing this because there are some terms that might be hard to decide about, like catalogNumber. I don't really think the catalogNumber is a property of the Occurrence, because it makes more sense to me to say
[specimen] catalogNumber ["12345"] than
[Occurrence] catalogNumber ["12345"]
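As a rough sketch of what dividing the metadata would involve, the following Python takes one "flat" row and turns it into statements about two separate resources. The TERM_SUBJECT mapping is my assumption for the sake of illustration (it simply encodes the opinions expressed above about catalogNumber, preparations, and rights), and split_row() and the identifiers occ1 and spec1 are made up.

```python
# Hypothetical assignment of each term to the resource it describes.
# This mapping is an illustration of the argument, not part of Darwin Core.
TERM_SUBJECT = {
    "recordedBy": "occurrence",
    "preparations": "token",
    "catalogNumber": "token",
    "dcterms:rights": "token",
}


def split_row(row, occurrence_id, token_id):
    """Turn one flat row into (subject, term, value) statements."""
    subjects = {"occurrence": occurrence_id, "token": token_id}
    return [(subjects[TERM_SUBJECT[term]], term, value)
            for term, value in row.items()]


# A flat record in which Occurrence and specimen metadata share one row.
flat_row = {
    "recordedBy": "Joe Curator",
    "preparations": "skin",
    "catalogNumber": "12345",
}

statements = split_row(flat_row, "occ1", "spec1")
for statement in statements:
    print(statement)
```

The hard part is not the code but agreeing on the mapping: every term whose proper subject is disputed (catalogNumber being the obvious example) needs a decision before a split like this can be done consistently.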
Realistically, I can't see this kind of separation ever happening, given the amount of trouble it's been just to get a few people to admit that Individuals exist. It is just too hard to get motion to happen in the TDWG community. As a practical matter, people who "compress" the system (which we admit happens and make concession to in Principle 3) by having record tables where a single row contains the metadata for both the Occurrence and the token (i.e. treat it as a 1:1 relationship) will simply have a column heading for catalogNumber and not care whether the catalogNumber applies to the Occurrence or the token. It's the people who want to do the more complicated stuff like simultaneously keep track of multiple tokens per Occurrence (like several images, a sound recording, and a specimen), people who want to write RDF, or people who want to merge databases containing many types of tokens who will have to pay attention to this distinction. Physical specimens would really be the only kind of class we would have to create because there already is a rich vocabulary for media items that is separate from DwC (i.e. the MRTG schema) and there are probably also vocabularies for stuff like tissue samples and DNA sequences (although I'm not familiar with them).
TYPING
Bob has warned us about the dangers of asserting that a term always applies to a certain type of resource by asserting that the term has an rdfs:domain . However, we should not avoid attempting to assert that a resource is itself of a certain type. Describing the "type" of a resource is an important part of letting potential users assess the possible fitness of use of that resource. For example, you can collect DNA from a preserved specimen but not from an image. You can include an image in a print journal article but not a sound recording. You can build a range map from Occurrences, but not from DNA samples. In RDF, one of the basic properties that should be described about every resource is its rdf:type . In the generic Linked Data world, you can pretty much use anything that you want as an rdf:type . If you decide to use something obscure, then the danger is that nobody else will have any idea what kind of thing you are describing. The Draft TDWG GUID Applicability Statement recommendation 11 says that "Objects in the biodiversity informatics domain that are identified by a GUID should be typed using the TDWG ontology or other well-known vocabularies in accordance with the TDWG common architecture." So in our community, we can't just type resources any way we want. But exactly how we SHOULD type things isn't clear. There isn't any functioning TDWG ontology at the moment. I have found it useful to use the DwC class as the rdf:type in my attempts to write RDF. That works pretty well for things that have DwC classes. But if we follow the explicit token model, we need to have some consensus on what we will use as the rdf:type for the tokens. At this point it looks to me like it would make sense to have the convention that for tokens one uses either a dcterms:type or a Darwin Core type (i.e. one of the types listed at http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm, although as I already noted, there is no need for HumanObservation in the case of describing a token because human observations don't have tokens). There isn't any sort of "collision" here of the sort that happened right after the adoption of the Darwin Core Standard when we tried to merge the Dublin and Darwin Core types (see http://www.keytonature.eu/wiki/MRTGv08_Type_term_inconsistent_with_DwC and http://lists.tdwg.org/pipermail/tdwg-content/2009-October/000301.html with many following responses for the gruesome details), since rdf:type doesn't demand any particular type vocabulary. I'm not entirely happy with this approach because for digital still images the logical type would be dctype:StillImage, which doesn't give any indication as to whether the image is film or digital, but I guess at this point in the 21st century most consuming applications will probably just assume digital anyway.
So (assuming that Individuals become a DwC class) I guess I don't really see that there is any problem in using the current Darwin Core classes to indicate the rdf:type of every kind of resource that we would be reasonably likely to assign GUIDs to EXCEPT for tokens. Typing of tokens could be done using a combination of Darwin Core and Dublin Core types. What I'm left scratching my head about is basisOfRecord. When I subscribed to the assumed token model (i.e. when I wrote the Biodiversity Informatics paper), I thought I knew what basisOfRecord meant. It meant the kind of token that backed up an Occurrence. So when I wrote RDF for a specimen (as in http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf) I used the "hand grenade" approach to typing. I lobbed every kind of "typing" that I knew of at the Occurrence record for a specimen:
[Occurrence] rdf:type [dwc:Occurrence]
[Occurrence] dwc:basisOfRecord [dwctype:PreservedSpecimen]
and
[Occurrence] dcterms:type [dctype:PhysicalObject]
Under the explicit token model, I would just use
[Occurrence] rdf:type [dwc:Occurrence]
for the Occurrence and
[specimen] rdf:type [dwctype:PreservedSpecimen]
for the specimen itself. If I also took an image at the same time and wanted to say that it was part of the same Occurrence as the specimen, I would use
[image] rdf:type [dctype:StillImage]
Under the explicit token model, I really can't see any use for dwc:basisOfRecord . Despite the resolution of the "train wreck" involving dcterms:type that we narrowly avoided after the adoption of Darwin Core, the definition still says "the specific nature of the data record - a subtype of the dcterms:type." I think this is clearly wrong because I think we established that it was NOT a subtype of dcterms:type in that discussion that I referenced above. So what is basisOfRecord??? What is "the data record" of which we are describing the nature? If it's the Occurrence, then I think the consensus that I'm hearing in the discussion is that an Occurrence data record shouldn't have as its type any of the dwctype terms except for dwctype:Occurrence. So what are all of the other terms like PreservedSpecimen for???
Under the explicit token model, what we really need is NOT basisOfRecord. What we need is some term like "dwc:tokenID" if you like the Darwin Core IDREF style or, if you prefer the style of the Linked Data community, "dwc:hasToken". In both cases, the object of the term would be an identifier for the token that's associated with a subject Occurrence. This term could be applied from zero (for observations) to many times to an Occurrence. People who want to flatten everything out will just ignore this term and cram all their metadata for the Occurrence, token, Event, and Location onto one line in their metadata table. People who are going to use any kind of one-to-many relationships at all will have to figure out how to handle that anyway and won't be daunted by having more than one dwc:tokenID per Occurrence. In the spirit of the complicated resource relationship diagrams from my paper, one could link primary tokens (like specimens) to secondary tokens (like specimen images) by using dwc:tokenID as well. Any kind of token (primary, secondary, tertiary, ad infinitum) could be linked to the Occurrence that it supports with dwc:occurrenceID.
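Here is a minimal Python sketch of how such links might be followed, using plain (subject, term, object) tuples rather than any RDF library. Keep in mind that dwc:tokenID is only the term proposed above and does not exist in Darwin Core, that the identifiers are made up, and that tokens_of() and all_tokens() are hypothetical helpers.

```python
# dwc:tokenID is the term proposed in the text, not an existing Darwin Core
# term; all identifiers below are made up.
triples = [
    ("occ1", "dwc:tokenID", "spec1"),   # primary token: a specimen
    ("occ1", "dwc:tokenID", "img1"),    # primary token: a field image
    ("spec1", "dwc:tokenID", "img2"),   # secondary token: image of the specimen
    ("spec1", "dwc:occurrenceID", "occ1"),
    ("img1", "dwc:occurrenceID", "occ1"),
    ("img2", "dwc:occurrenceID", "occ1"),
    # occ2 is an observation: it has no dwc:tokenID statement at all.
]


def tokens_of(subject, triples):
    """Return the tokens linked directly to a subject by dwc:tokenID."""
    return [o for s, p, o in triples if s == subject and p == "dwc:tokenID"]


def all_tokens(occurrence_id, triples):
    """Follow dwc:tokenID links through primary, secondary, ... tokens."""
    found = []
    queue = tokens_of(occurrence_id, triples)
    while queue:
        token = queue.pop(0)
        if token not in found:
            found.append(token)
            queue.extend(tokens_of(token, triples))
    return found


print(all_tokens("occ1", triples))
print(all_tokens("occ2", triples))
```

The dwc:occurrenceID statements make the reverse lookup cheap: a secondary token like img2 can be tied to its Occurrence directly, without walking back up through the specimen.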
WHAT DOES THIS DEMAND OF US?
OK, I've now gone on for eight pages of text explaining the rationale behind the question. So I'll return to the basic question: is the consensus for modeling the relationship between an Occurrence and associated token(s) the assumed token model:
http://bioimages.vanderbilt.edu/pages/token-assumed.gif
or the explicit token model:
http://bioimages.vanderbilt.edu/pages/token-explicit.gif
?
If we accept the assumed token model with all of its warts, then for consistency's sake, we must create dwctype terms for each of the types of tokens that people would reasonably want to use as evidence for Occurrences (and my proposal for adding DigitalStillImage as a Darwin Core type stands). We must also resign ourselves to assigning a separate occurrence to each token that users want to use to document the presence of a taxon at a time and place. We also must accept having goofy-sounding statements like
[Occurrence] dcterms:rights ["(c) 2002 Steven J. Baskauf"]
If we accept the explicit token model, then we need to either dump basisOfRecord or come up with some rational explanation for what it actually means (and my proposal to add DigitalStillImage as a Darwin Core type becomes irrelevant). We also need to create some kind of term like dwc:tokenID that will allow connections to be made between Occurrence records and their tokens. For people who want to flatten out their Occurrence records and put the tokens together with the Occurrence (i.e. "compress the diagram" to get rid of the token resource), and who feel some need to indicate the type of the token that they are using, let them use any appropriate term from the Dublin Core or Darwin Core types as a value for rdf:type.
Until we make one of these choices or the other and "fix" Darwin Core to work in a consistent way, we are just going to continue to misunderstand each other because each person will just "know an Occurrence when they see it".
In the interest of space, I am going to defer on explaining my opinions about LivingSpecimen and establishmentMeans. Those explanations are contingent on the conclusion that we reach on this issue.
--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu