Treatise on Occurrence, tokens, and basisOfRecord
I have been dreading trying to write this post which I have promised (or threatened depending on if you have enjoyed or been annoyed by the previous lengthy thread) for some time. I have dreaded it because this is a complicated subject and not one that is amenable to terse messages. However, after the previous conversation with Rich et al., I feel for the first time that I have the questions (not answers!) clearly in my mind. So rather than starting off rambling about LivingSpecimens and establishmentMeans as I had planned, I'm going to start by laying down several principles that have come into clarity in my mind after the previous conversation and the attempt to map things out in a diagram. I will apologize in advance for failure to use the correct database or IT technical terms when I'm in unfamiliar territory. Until there is a consensus about how we deal with the "tokens" we use to document Occurrences, I'm not sure that what I have to say about those other topics will make sense.
PRINCIPLES (derived from earlier discussion)
1. We have a number of kinds of "things" (which I will henceforth refer to as "resources") that are useful for describing and organizing metadata that we collect in our attempts to document biodiversity. For many of these types of resources, we have defined classes to categorize the terms that can be used to describe the properties of resources that are instances of that class. Describing the class helps us to understand the type of resources that constitute instances of that class.
2. A conscious decision was made to avoid formally defining rdfs:domain for Darwin Core terms. This decision was made to provide flexibility in the way the terms can be used and to avoid the situation where semantic clients would draw incorrect or silly conclusions about what kind of things resources are. However, this decision does not excuse us from thinking carefully about whether a term can be appropriately applied to a resource that is a member of some class (e.g. should we say that a digital photograph has a scientific name?). Placing a term within a class is a suggestion that the term would appropriately be applied as a property of an instance of a class.
3. When users want to "flatten" and simplify their databases, they tend to eliminate one-to-many (1:M) relationships in favor of one-to-one (1:1) relationships. The result of that is differences like we saw in
http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif (which allows 1:M relationships between Occurrences and Events and between Events and Locations) and
http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif (which "atomizes" every Occurrence by considering it to have its own separate eventTime and Location information).
A. There is nothing intrinsically "right" or "wrong" about either of these approaches, because they each have their own advantages. The 1:M approach is more efficient, but results in a more complicated database, while the 1:1 approach results in a simpler database but may require repeating some or many term values in the records.
B. The choices that users make in these situations are the cause of much of the disagreement about whether a certain class should exist. People taking the 1:1 approach "collapse" the relationship diagram and eliminate classes they don't need, while people who take the 1:M approach need instances of the class to act as nodes to connect their "many" resources to some other thing.
C. This collapsing of the diagram is also the reason for some disagreement about whether a term belongs in a certain class or not. In the example above, 1:1 people would say that eventDate is a property of an Occurrence, while 1:M people would say that eventDate is a property of an Event.
D. The choice of users on this issue influences their decision about whether or not to create resources that are instances of classes and hence to assign them identifiers. If users take the 1:M approach, they need identifiers for resources that are acting as connecting nodes so that they can make reference to that resource in the metadata of the many things they are connecting to it. If users take the 1:1 approach, they will probably skip creating explicit resources (and their corresponding identifiers) for the classes that they are "collapsing" out of the diagram.
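The difference between the two approaches in Principle 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers (loc1, ev1, occ1, occ2), the values, and the flatten function are invented for the example and are not part of Darwin Core.

```python
# Normalized (1:M) form: each resource has its own identifier, and the
# "many" side points at the "one" side by ID. All IDs and values are made up.
locations = {"loc1": {"locality": "Hypothetical Ridge", "decimalLatitude": 36.0}}
events = {"ev1": {"locationID": "loc1", "eventDate": "2010-05-01"}}
occurrences = [
    {"occurrenceID": "occ1", "eventID": "ev1", "recordedBy": "A. Collector"},
    {"occurrenceID": "occ2", "eventID": "ev1", "recordedBy": "A. Collector"},
]


def flatten(occurrences, events, locations):
    """Collapse Event and Location into each Occurrence row (1:1 form)."""
    rows = []
    for occ in occurrences:
        event = events[occ["eventID"]]
        location = locations[event["locationID"]]
        # Drop the connecting identifiers: the flattened row no longer
        # needs the Event or Location to exist as separate resources.
        row = {k: v for k, v in occ.items() if k != "eventID"}
        row.update({k: v for k, v in event.items() if k != "locationID"})
        row.update(location)
        rows.append(row)
    return rows


flat_rows = flatten(occurrences, events, locations)
for row in flat_rows:
    print(row)
```

Both flattened rows repeat the same eventDate and locality, which is the cost mentioned in 3A, and the Event and Location identifiers disappear, which is the effect described in 3D.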
4. I would propose that the "right" relationship diagram is not necessarily one that caters to a certain "right" philosophical point of view. Rather, the "right" diagram is the one that allows users to define the relationships that they need for the organization of their metadata in the simplest manner, and which provides the most clarity about what resources of various kinds are, and how they are connected.
A. "Right" as I have defined it above depends on how broadly the relationship diagram is intended to apply. An individual person or organization with limited interests may have a relationship diagram that is simpler than the diagram shown at http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif or might choose to add classes for other things that are of personal interest. An organization focused on different issues or with broader interests might opt for many more or different classes that would be connected to those shown in the diagram.
B. Given what I just said in A, what is "right" for Darwin Core is going to be defined by the needs of the Darwin Core constituency. At the TDWG meeting, John Wieczorek made a statement which I will paraphrase as "in order for a term to make it into Darwin Core, at least two people had to want it". I'm not sure to what extent he was joking about this, but it makes the point that one must consider community needs before saying that a certain part of the "diagram" is necessary. I think that the reason that Rich and I were so quickly able to come to a consensus on the organization of the left side of the diagram is because he realized that there was a significant part of the DwC constituency that needed a way to group occurrences (i.e. needed Individuals) and I realized that there was a significant part of the constituency that needed to group multiple Events at a Locality and multiple Occurrences at an Event. So in evaluating alternative conceptual systems for organizing resources, the question has to be asked as to the extent that an alternative allows broad segments of the DwC constituency to organize their metadata in an efficient and conceptually sensible way. If one alternative is more broadly applicable and conceptually clear than another, then that alternative is better regardless of the philosophical underpinnings of the argument.
5. The last point is one that has run as an undercurrent through various TDWG threads but which may not have been explicitly stated in this particular thread. That is that there should be a separation between what a resource IS and what we want to use a resource FOR. To use technical terms, we need to separate the "type" of a resource from its fitness of use. A digital image IS a digital image. It might be used FOR documenting that an organism was at a particular location at a particular time, but it could be used to illustrate a character, as a part of a visual key, as media for an educational presentation, as art, and probably many other things that aren't popping into my mind at the moment. I believe that much of the confusion about "what is an Occurrence" comes from a failure to make this distinction.
THE ISSUE OF THE TOKEN
Earlier in the thread of "What is an Occurrence", there was a general consensus that an Occurrence often had a "thing" that was associated with it that served as evidence that a taxon representative (i.e. Individual) occurred at a particular Location at a particular time. In my Biodiversity Informatics paper, I called this thing a "representation", but I now believe that "token" is a better term and will use it hereafter. There also seemed to be a consensus that an observation was simply an Occurrence that did not have an associated token. (This is with the understanding that observation is being narrowly defined as a type of Occurrence, with a definable time and location, as opposed to what I called the "checklist" definition which indicated that some undefined taxon representative was present in some defined geographical area at an indefinite time.) In one of my earlier posts, I pleaded for somebody to tell me whether there was an assumption that the token was considered a part of the Occurrence or whether it was a separate thing. I did not get any responses, which I'm construing to mean that people weren't sure about this. At the present, I now have a clearer idea of the general principles I outlined above, and also have the "Rich" diagram for modeling relationships, so I'm going to again pose this question, but in what I hope is a clearer way. I have re-made the earlier diagram as Rich suggested, using triangles rather than arrows. The wide side of the triangle is the "many" side of the relationship and the point is the "one" side. As before, I'm deferring on the right side of the diagram (to the right of Identification) to the taxonomists for now, so let's keep that out of the discussion for the moment. I have also clarified the diagram by coloring in the actual DwC classes to distinguish them from selected terms that fall within those classes (non-colored boxes) and which can be used as properties of resources that are instances of the class. 
The two alternatives that I'm discussing are:
http://bioimages.vanderbilt.edu/pages/token-assumed.gif which I will refer to as the "assumed token" model and
http://bioimages.vanderbilt.edu/pages/token-explicit.gif which I will refer to as the "explicit token" model.
I believe that historically the assumed token model has been the one which most people have had in mind. Before the new DwC standard, we had specimens and we had observations. In order to avoid redundancies in terms for those two types of "things", a combined "thing" called "Occurrence" was created. An Occurrence that was an observation didn't have a token and an Occurrence that was a specimen had a physical or living specimen as its token. That's all pretty simple and sensible and we see evidence of this kind of thinking in the descriptions given at http://rs.tdwg.org/dwc/terms/index.htm . A record for an Occurrence has a thing called its dwc:basisOfRecord that presumably describes the kind of token (if any). So if the token were a preserved specimen, we would say that [Occurrence] basisOfRecord [PreservedSpecimen]. If there were no token we would say [Occurrence] basisOfRecord [HumanObservation] or [Occurrence] basisOfRecord [MachineObservation]. Referring back to the assumed token diagram, in the case of a specimen there is no explicit reference to the specimen as a separate entity. The terms related to the specimen, such as preparations and disposition, are just plopped into the Occurrence class, which implies that they are properties of the Occurrence itself.
There seems to be a general consensus that other kinds of tokens can be used to document an Occurrence. However, the way that the current Darwin Core terms are designed and placed within classes is very inconsistent as to how they handle types of tokens other than specimens. According to the instructions at the top of http://rs.tdwg.org/dwc/terms/index.htm, a camera trap bird sighting should have [Occurrence] basisOfRecord [MachineObservation]. It is not clear how one is supposed to handle the actual metadata for the image that serves as the token. Unlike specimens, where the token's metadata terms are placed in the Occurrence class, I guess in the case of an image one is supposed to use associatedMedia to link the so-called MachineObservation to the image record. If DNA were extracted, one would link the sequence to the Occurrence using associatedSequences (although it's not clear to me what the basisOfRecord for that would be - "TookATissueSample"?). But what does one do for other kinds of tokens, like seeds or tissue samples - create terms like associatedSeed and associatedTissueSample? I think that the ResourceRelationship terms were supposed to handle this problem, but I have yet to see an example of exactly how this was supposed to work.
As an attempt to resolve this confusion in my mind, I wrote the Biodiversity Informatics paper that I've promoted frequently on this list (https://journals.ku.edu/index.php/jbi/article/view/3664). In that paper, I take the basic assumed token model and broaden it in an attempt to make it work for all kinds of tokens. Because I assumed that each occurrence has a single token, I "collapsed the diagram" and connected the properties of the token directly to the Occurrence resource (as was modeled when specimen properties were placed within the Occurrence class). If there were several tokens for a given Individual, I "flattened" the records by creating a separate Occurrence resource for each token. The model was generalized further by allowing secondary Occurrence records where the token was not derived directly from the organism but rather derived from a primary Occurrence record. This was meant to cover complicated circumstances such as those found in a botanical garden, where a seed or cutting might be collected from a tree with subsequent generation of a LivingSpecimen, which might have a PreservedSpecimen collected from it and a DigitalStillImage taken of the preserved specimen. You can see examples of the complex types of situations I tried to handle at
http://bioimages.vanderbilt.edu/pages/conceptual-scheme-insect.gif and
http://bioimages.vanderbilt.edu/pages/conceptual-scheme-botanical.gif
I created my own terms (like sernec:derivativeOccurrence and sernec:derivedFrom) to describe the connections among the individual and the various layers of Occurrences.
Does this system work? Yes, but there are a number of problems associated with it. The first problem is related to Principle 4 above. In order for this system to work, there needs to be a consensus in the DwC community about several things. One is that each Occurrence must have only one token. If we are going to "type" Occurrences by their basisOfRecord (and the acceptable values for basisOfRecord are officially DwC types, see http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm), then an Occurrence can't have two values for basisOfRecord. It is clear from the discussion we've had that people would like to consider a single Occurrence to be able to have multiple tokens as documentation. The second problem is that there needs to be a consensus that a secondary Occurrence can exist at all (i.e. can you call the image of a specimen "an Occurrence"?). It is clear to me from the discussion that when people are thinking about what an Occurrence means, they have in mind the documentation of the time and place of the Individual in its environment. In a previous communication, John Wieczorek clarified that terms describing Occurrences like recordedBy and eventDate should only apply to primary occurrences and that it would not be appropriate to use them as properties of what I'm calling a secondary occurrence (such as the image of a specimen). So I dealt with this by creating a distinction between Occurrences that document the distribution of a taxon (using the term sernec:documentsDistribution) and those that don't. This is something like the old validDistributionFlag, but I defined documentsDistribution specifically as having a value of "true" only for Occurrences that were derived directly from the Individual (gray arrows in the two diagrams from the paper).
But I think that the worst "crime" of the system I suggested is violation of Principle 5 above. By asserting an unvarying 1:1 relationship between the Occurrence and its token and by collapsing my relationship diagram to not explicitly include a resource that is the token itself, I am confusing the USE of an Occurrence (to demonstrate that a representative of a taxon was present at a particular Location at a particular time) with what the token IS (a dead organism in a jar or glued to paper, an electronic representation of photon patterns, a series of characters representing a nucleotide sequence). So I'm charging myself with this "crime", pleading guilty, and accepting my sentence, which is to admit that the system I suggested in the Biodiversity Informatics paper is "wrong" based on the principles I outlined above. What this amounts to is an acceptance of the "rightness" of the explicit token model (in the sense that I defined "right" in Principle 4 above).
However, if I'm going to make this admission, I demand that the other guilty parties also confess, namely people who want to assert that Occurrences have properties that actually are properties of specimens. If we are going to have a system that actually works, we can't straddle the fence and say that the assumed token model is correct for specimens and that the explicit token model is correct for every other kind of token. If we accept the explicit token model, then specimen will have to come off of its throne and be a token like all of the other ways that we provide evidence that an Occurrence happened. If we accept the explicit token model, then as a biodiversity informatics resource type "observation" will have to disappear into a puff of nothingness just like the "luminiferous ether", "centrifugal force", and other kinds of things that we thought we needed to have to explain things but which turned out to be unnecessary when we figured out more basic explanations. A human observation will simply be an Occurrence that doesn't have a token (which is what I've heard some people say all along). If we allow the Occurrence/token relationship to be a one-to-many relationship rather than one-to-one, then HumanObservation is just the one-to-zero case of the more general one-to-many. For those of you who like the idea of a "machine observation", that is just an Occurrence with a token that is whatever type of resource the machine produces (electronic data file, image of the organism, image of a graph, or whatever).
ADVANTAGES OF RECOGNIZING TOKENS EXPLICITLY
If we accept the explicit token model over the assumed token model, a number of problems get solved. Just as was the case with Events, people who want to flatten things out by having only one token per Occurrence can do so. For example, if I want to atomize things by defining my Occurrence to have taken place during an Event that lasted only the one second within which my camera shutter clicked, I can do that and have only a single token associated with that Occurrence. On the other hand, if others want to define their Occurrence as taking place over the time over which they photographed, collected a leaf tissue sample, and then collected a branch of a tree for an herbarium specimen, then they can do that and associate all of those tokens (one or more images, the tissue sample, and the preserved specimen) with the single Occurrence.
Another important benefit will come down the line when we actually try to develop RDF templates. Right now it is not exactly clear (at least to me) how properties should be divided up among resources that are being described in the RDF. Based on the assumed token model, I have been including the metadata for the token within the container element for the Occurrence. This leads to some of the kind of odd assertions that people have been objecting to, such as
[Occurrence] dcterms:rights ["(c) 2002 Steven J. Baskauf"] or
[Occurrence] preparations ["skin"].
In the explicit token model, dividing metadata up appropriately among separate Occurrence and token resources makes more sense, e.g.
[Occurrence] recordedBy ["Joe Curator"]
[image] dcterms:rights ["(c) 2002 Steven J. Baskauf"]
[specimen] preparations ["skin"]
If we wanted to be really explicit about this, we probably should have a separate class for PhysicalSpecimens and separate the terms that describe specimens from those that describe Occurrences in general. There might be some difficulty in doing this because there are some terms that might be hard to decide about, like catalogNumber. I don't really think the catalogNumber is a property of the Occurrence, because it makes more sense to me to say
[specimen] catalogNumber ["12345"] than
[Occurrence] catalogNumber ["12345"]
Realistically, I can't see this kind of separation ever happening, given the amount of trouble it's been just to get a few people to admit that Individuals exist. It is just too hard to get motion to happen in the TDWG community. As a practical matter, people who "compress" the system (which we admit happens and make concession to in Principle 3) by having record tables where a single row contains the metadata for both the Occurrence and the token (i.e. treat it as a 1:1 relationship) will simply have a column heading for catalogNumber and not care whether the catalogNumber applies to the Occurrence or the token. It's the people who want to do the more complicated stuff like simultaneously keep track of multiple tokens per Occurrence (like several images, a sound recording, and a specimen), people who want to write RDF, or people who want to merge databases containing many types of tokens who will have to pay attention to this distinction. Physical specimens would really be the only kind of class we would have to create because there already is a rich vocabulary for media items that is separate from DwC (i.e. the MRTG schema) and there are probably also vocabularies for stuff like tissue samples and DNA sequences (although I'm not familiar with them).
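The division of metadata between an Occurrence and its token discussed in this section could be sketched as follows. This is a rough illustration, not a proposal for an actual mapping: the TOKEN_TERMS set, the split_record function, and the identifiers "occurrence-1" and "specimen-1" are assumptions made up for the example, and where a term like catalogNumber belongs is exactly the judgment call described above.

```python
# Hypothetical assignment of terms to the resource they describe. Which
# terms belong where (catalogNumber especially) is a judgment call, not
# something Darwin Core currently specifies.
TOKEN_TERMS = {"preparations", "disposition", "catalogNumber", "dcterms:rights"}


def split_record(flat_record, occurrence_id, token_id):
    """Return (subject, term, value) statements for Occurrence and token."""
    statements = []
    for term, value in flat_record.items():
        # Terms listed as token terms describe the token; all others are
        # treated as properties of the Occurrence.
        subject = token_id if term in TOKEN_TERMS else occurrence_id
        statements.append((subject, term, value))
    return statements


# A "compressed" row in which Occurrence and token share one line.
flat_record = {
    "recordedBy": "Joe Curator",
    "preparations": "skin",
    "catalogNumber": "12345",
}
statements = split_record(flat_record, "occurrence-1", "specimen-1")
for statement in statements:
    print(statement)
```

People who keep one token per Occurrence never need to run anything like this; it only matters once the same row has to be expressed as statements about two separate resources.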
TYPING
Bob has warned us about the dangers of asserting that a term always applies to a certain type of resource by asserting that the term has an rdfs:domain . However, we should not avoid attempting to assert that a resource is itself of a certain type. Describing the "type" of a resource is an important part of letting potential users assess the possible fitness of use of that resource. For example, you can collect DNA from a preserved specimen but not from an image. You can include an image in a print journal article but not a sound recording. You can build a range map from Occurrences, but not from DNA samples. In RDF, one of the basic properties that should be described about every resource is its rdf:type . In the generic Linked Data world, you can pretty much use anything that you want as an rdf:type . If you decide to use something obscure, then the danger is that nobody else will have any idea what kind of thing you are describing. The Draft TDWG GUID Applicability Statement recommendation 11 says that "Objects in the biodiversity informatics domain that are identified by a GUID should be typed using the TDWG ontology or other well-known vocabularies in accordance with the TDWG common architecture." So in our community, we can't just type resources any way we want. But exactly how we SHOULD type things isn't clear. There isn't any functioning TDWG ontology at the moment. I have found it useful to use the DwC class as the rdf:type in my attempts to write RDF. That works pretty well for things that have DwC classes. But if we follow the explicit token model, we need to have some consensus on what we will use as the rdf:type for the tokens. At this point it looks to me like it would make sense to have the convention that for tokens one uses either a dcterms:type or a Darwin Core type (i.e. one of the types listed at http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm, although as I already noted, there is no need for HumanObservation in the case of describing a token because human observations don't have tokens). There isn't any sort of "collision" here of the sort that happened right after the adoption of the Darwin Core Standard when we tried to merge the Dublin and Darwin Core types (see http://www.keytonature.eu/wiki/MRTGv08_Type_term_inconsistent_with_DwC and
http://lists.tdwg.org/pipermail/tdwg-content/2009-October/000301.html with many following responses for the gruesome details) since rdf:type doesn't demand any particular type vocabulary. I'm not entirely happy with this approach because for digital still images the logical type would be dctype:StillImage, which doesn't give any indication as to whether the image is film or digital, but I guess at this point in the 21st century most consuming applications will probably just assume digital anyway.
So (assuming that Individuals become a DwC class) I guess I don't really see that there is any problem in using the current Darwin Core classes to indicate the rdf:type of every kind of resource that we would be reasonably likely to assign GUIDs to EXCEPT for tokens. Typing of tokens could be done using a combination of Darwin Core and Dublin Core types. What I'm left scratching my head about is basisOfRecord. When I subscribed to the assumed token model (i.e. when I wrote the Biodiversity Informatics paper), I thought I knew what basisOfRecord meant. It meant the kind of token that backed up an Occurrence. So when I wrote RDF for a specimen (as in http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf) I used the "hand grenade" approach to typing. I lobbed every kind of "typing" that I knew of at the Occurrence record for a specimen:
[Occurrence] rdf:type [dwc:Occurrence]
[Occurrence] dwc:basisOfRecord [dwctype:PreservedSpecimen]
and
[Occurrence] dcterms:type [dctype:PhysicalObject]
Under the explicit token model, I would just use
[Occurrence] rdf:type [dwc:Occurrence]
for the Occurrence and
[specimen] rdf:type [dwctype:PreservedSpecimen]
for the specimen itself. If I also took an image at the same time and wanted to say that it was part of the same Occurrence as the specimen, I would use
[image] rdf:type [dctype:StillImage]
Under the explicit token model, I really can't see any use for dwc:basisOfRecord . Despite the resolution of the "train wreck" involving dcterms:type that we narrowly avoided after the adoption of Darwin Core, the definition still says "the specific nature of the data record - a subtype of the dcterms:type." I think this is clearly wrong because I think we established that it was NOT a subtype of dcterms:type in that discussion that I referenced above. So what is basisOfRecord??? What is "the data record" of which we are describing the nature? If it's the Occurrence, then I think the consensus that I'm hearing in the discussion is that an Occurrence data record shouldn't have as its type any of the dwctype terms except for dwctype:Occurrence. So what are all of the other terms like PreservedSpecimen for???
Under the explicit token model, what we really need is NOT basisOfRecord. What we need is some term like "dwc:tokenID" if you like the Darwin Core IDREF style or, if you prefer the style of the Linked Data community, "dwc:hasToken". In both cases, the object of the term would be an identifier for the token that's associated with a subject Occurrence. This term could be applied from zero (for observations) to many times to an Occurrence. People who want to flatten everything out will just ignore this term and cram all their metadata for the Occurrence, token, Event, and Location onto one line in their metadata table. People who are going to use any kind of one-to-many relationships at all will have to figure out how to handle that anyway and won't be daunted by having more than one dwc:tokenID per Occurrence. In the spirit of the complicated resource relationship diagrams from my paper, one could link primary tokens (like specimens) to secondary tokens (like specimen images) by using dwc:tokenID as well. Any kind of token (primary, secondary, tertiary, ad infinitum) could be linked to the occurrence that it supports with dwc:occurrenceID.
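A minimal sketch of how the zero-to-many token link might behave, using plain Python tuples as stand-ins for RDF statements. Note the assumptions: "dwc:hasToken" is only the term proposed above and does not exist in Darwin Core, the identifiers (occ1, specimen1, and so on) are invented, the two helper functions are hypothetical, and the typing property is written as rdf:type, its name in the RDF vocabulary.

```python
# Statements as (subject, predicate, object) tuples. "dwc:hasToken" is the
# proposed term from the text, not an existing Darwin Core term; the
# identifiers are invented.
triples = [
    ("occ1", "rdf:type", "dwc:Occurrence"),
    ("occ1", "dwc:hasToken", "specimen1"),
    ("occ1", "dwc:hasToken", "image1"),
    ("specimen1", "rdf:type", "dwctype:PreservedSpecimen"),
    ("image1", "rdf:type", "dctype:StillImage"),
    ("occ2", "rdf:type", "dwc:Occurrence"),  # no token: a plain observation
]


def tokens_by_occurrence(triples):
    """Map every Occurrence to the (possibly empty) list of its tokens."""
    tokens = {}
    for subject, predicate, obj in triples:
        if predicate == "rdf:type" and obj == "dwc:Occurrence":
            tokens.setdefault(subject, [])  # registers zero-token cases too
        elif predicate == "dwc:hasToken":
            tokens.setdefault(subject, []).append(obj)
    return tokens


def token_types(triples, occurrence_id):
    """Return the rdf:type of each token attached to one Occurrence."""
    types = {s: o for s, p, o in triples if p == "rdf:type"}
    attached = tokens_by_occurrence(triples).get(occurrence_id, [])
    return [types.get(token) for token in attached]


print(tokens_by_occurrence(triples))
print(token_types(triples, "occ1"))
print(token_types(triples, "occ2"))  # empty list: the one-to-zero case
```

In this sketch nothing on the Occurrence itself says "HumanObservation" or "PreservedSpecimen"; the kind of evidence is read off the types of whatever tokens are attached, and an observation is simply an Occurrence whose token list is empty.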
WHAT DOES THIS DEMAND OF US?
OK, I've now gone on for eight pages of text explaining the rationale behind the question. So I'll return to the basic question: is the consensus for modeling the relationship between an Occurrence and associated token(s) the assumed token model:
http://bioimages.vanderbilt.edu/pages/token-assumed.gif
or the explicit token model:
http://bioimages.vanderbilt.edu/pages/token-explicit.gif
?
If we accept the assumed token model with all of its warts, then for consistency's sake, we must create dwctype terms for each of the types of tokens that people would reasonably want to use as evidence for Occurrences (and my proposal for adding DigitalStillImage as a Darwin Core type stands). We must also resign ourselves to assigning a separate occurrence to each token that users want to use to document the presence of a taxon at a time and place. We also must accept having goofy-sounding statements like
[Occurrence] dcterms:rights ["(c) 2002 Steven J. Baskauf"]
If we accept the explicit token model, then we need to either dump basisOfRecord or come up with some rational explanation for what it actually means (and my proposal to add DigitalStillImage as a Darwin Core type becomes irrelevant). We also need to create some kind of term like dwc:tokenID that will allow connections to be made between Occurrence records and their tokens. For people who want to flatten out their Occurrence records and put the tokens together with the Occurrence (i.e. "compress the diagram" to get rid of the token resource), and who feel some need to indicate the type of the token that they are using, let them use any appropriate term from the Dublin Core or Darwin Core types as a value for rdf:type.
Until we make one of these choices or the other and "fix" Darwin Core to work in a consistent way, we are just going to continue to misunderstand each other because each person will just "know an Occurrence when they see it".
In the interest of space, I am going to defer on explaining my opinions about LivingSpecimen and establishmentMeans. Those explanations are contingent on the conclusion that we reach on this issue.
Steve
Hi Steve,
Many thanks for taking the time to carefully articulate all of this. Though your post is long, I think it is clear and well-written, and perhaps a good genesis for a web or wiki page somewhere, to add to the collective documentation of DwC-space.
A couple of comments, as I read through what you wrote:
However, this decision does not excuse us from thinking carefully about whether a term can be appropriately applied to a resource that is a member of some class (e.g. should we say that a digital photograph has a scientific name?). Placing a term within a class is a suggestion that the term would appropriately be applied as a property of an instance of a class.
I'm a bit unsure where this notion of "digital photograph has a scientific name" comes into play. My best guess is that if the basisOfRecord for an Occurrence record is "Digital Image" (not actually listed among the examples at http://rs.tdwg.org/dwc/terms/#basisOfRecord), perhaps a consumer of such a record will misinterpret it as though the Image *is* the item that Occurs (and hence has a scientific name). But I think of basisOfRecord as the "basis" of our *belief* (aka "evidence") that the Occurrence was real. That is, the Occurrence is always understood to refer to an organism (though the documentation doesn't say this explicitly), and the "basis" of the Occurrence is the reason we have for believing that the organism occurred at a place and time. Again, my interpretation of this may be wrong, and it's confounded by our tendency to shortcut information. For example, often in our community the statement "an individual organism that was documented to occur at a place and time was identified by someone as belonging to a taxon concept that is best represented by this scientific name" is truncated to "Occurrence has scientific name". Because we tend to do this, I can easily understand taking it one step further and reducing the statement "an individual organism that was documented *by a digital image* to occur at a place and time was identified by someone as belonging to a taxon concept that is best represented by this scientific name" to "Digital Image has scientific name".
So, maybe we should try to avoid these short-cut representations of our data as much as possible?
- When users want to "flatten" and simplify their databases, they tend to eliminate one-to-many (1:M) relationships in favor of one-to-one (1:1) relationships. The result of that is differences like we saw in http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif (which allows 1:M relationships between Occurrences and Events and between Events and Locations) and http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif (which "atomizes" every Occurrence by considering it to have its own separate eventTime and Location information).
Another way to look at the difference between diagram1 and diagram2 is that the latter is simply a flattened (aka "denormalized") version of the first. I don't think the latter is really a more "atomized" version, because if eventTime and Location are recorded with such precision, then they could easily be represented in the structure of diagram1 -- except that the 1:M links would tend to shift in abundance towards more 1:1 (there is no problem representing a 1:M relationship where most instances actually are 1:1). But what really makes diagram2 different is that it is likely to include replication of identical Event property values over multiple records in cases where there really is a 1:M relationship between Event and Occurrence. No harm there -- it's just a bit denormalized. Denormalization is fine for mechanisms to transmit bulk content around -- especially if the content was generated from more normalized original data structures at the source.
A. There is nothing intrinsically "right" or "wrong" about either of these approaches, because they each have their own advantages. The 1:M approach is more efficient, but results in a more complicated database, while the 1:1 approach results in a simpler database but may require repeating some or many term values in the records.
Exactly. Perhaps I misunderstood your point about "atomized".
C. This collapsing of the diagram is also the reason for some disagreement about whether a term belongs in a certain class or not. In the example above, 1:1 people would say that eventDate is a property of an Occurrence, while 1:M people would say that eventDate is a property of an Event.
That's not quite how I would characterize it. I would say, if you establish an Event class at all, then eventDate is clearly a member of that Class. I think the real question is "Do we need an Event class?" If yes, then eventDate belongs to it. If no, then we "collapse" eventDate to Occurrence.
By the way, when I say "Do we need an Event class?", I mean it at two levels. At one level, the question is: "Is it useful to establish it within DwC?" At another level, it's "How shall I structure my package of data using DwC terms?" My understanding (which is crude) is that even with the Event class defined in DwC, I still have the choice of representing my Occurrence data as:
1) "Normalized":
===================
occurrenceID: 1234
eventID: 9876
identificationID: 7654
individualCount: 4
recordedBy: "J. Smith"
===================
eventID: 9876
LocationID: 4567
eventDate: 24-October-2010
eventTime: 02:13:00
===================
LocationID: 4567
decimalLatitude: 52.453016
decimalLongitude: 13.309418
geodeticDatum: "WGS84"
country: "Germany"
locality: "Botanischer Garten Und Botanisches Museum Berlin-Dahlem"
===================
identificationID: 7654
taxonID: 2345
identifiedBy: "Richard Pyle"
dateIdentified: 24-October-2010
===================
taxonID: 2345
scientificName: "Homo sapiens Linnaeus 1758"
namePublishedIn: "Linnaeus, C. 1758. Systema Naturae...."
nameAccordingTo: "Linnaeus, C. 1758. Systema Naturae...."
===================
2) "Flattened":
===================
occurrenceID: 1234
individualCount: 4
recordedBy: "J. Smith"
eventDate: 24-October-2010
eventTime: 02:13:00
decimalLatitude: 52.453016
decimalLongitude: 13.309418
geodeticDatum: "WGS84"
country: "Germany"
locality: "Botanischer Garten Und Botanisches Museum Berlin-Dahlem"
identifiedBy: "Richard Pyle"
dateIdentified: 24-October-2010
scientificName: "Homo sapiens Linnaeus 1758"
namePublishedIn: "Linnaeus, C. 1758. Systema Naturae...."
nameAccordingTo: "Linnaeus, C. 1758. Systema Naturae...."
===================
In my understanding, both of these would be legitimate implementations of the DwC terms. The difference is that in the first case, the content is normalized such that the values of the different properties (sorry, Bob -- not sure of the correct word here) are inherited through the various "[class]ID" links; whereas in the "flattened" version, the properties are represented directly on the Occurrence instance.
The advantage of the first is that the atomized and ID'd class instances can be reused for multiple occurrences, whereas the advantage of the second is that it greatly simplifies the content structure.
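To make the relationship between the two forms concrete, here is a minimal Python sketch of collapsing the normalized records into the flattened one by following the "[class]ID" links. The dictionary-per-class layout, the LINKS mapping, and the function names are illustrative assumptions on my part, not anything defined by DwC; the values are the ones from the example above.

```python
# Hypothetical in-memory "tables", one per class, keyed by that class's ID.
occurrences = {1234: {"eventID": 9876, "identificationID": 7654,
                      "individualCount": 4, "recordedBy": "J. Smith"}}
events = {9876: {"LocationID": 4567, "eventDate": "24-October-2010",
                 "eventTime": "02:13:00"}}
locations = {4567: {"decimalLatitude": 52.453016, "decimalLongitude": 13.309418,
                    "geodeticDatum": "WGS84", "country": "Germany",
                    "locality": "Botanischer Garten Und Botanisches Museum Berlin-Dahlem"}}
identifications = {7654: {"taxonID": 2345, "identifiedBy": "Richard Pyle",
                          "dateIdentified": "24-October-2010"}}
taxa = {2345: {"scientificName": "Homo sapiens Linnaeus 1758",
               "namePublishedIn": "Linnaeus, C. 1758. Systema Naturae....",
               "nameAccordingTo": "Linnaeus, C. 1758. Systema Naturae...."}}

# Which table each linking term points into (an assumption of this sketch).
LINKS = {"eventID": events, "LocationID": locations,
         "identificationID": identifications, "taxonID": taxa}


def flatten(record):
    """Replace every link term with the properties of the record it points to."""
    flat = {}
    for term, value in record.items():
        if term in LINKS:
            flat.update(flatten(LINKS[term][value]))  # follow the link recursively
        else:
            flat[term] = value
    return flat


def flatten_occurrence(occurrence_id):
    """Build the flattened record for one Occurrence, keeping only its own ID."""
    flat = {"occurrenceID": occurrence_id}
    flat.update(flatten(occurrences[occurrence_id]))
    return flat
```

Calling flatten_occurrence(1234) yields a single record carrying the same properties as the "Flattened" example, with the intermediate eventID, LocationID, identificationID, and taxonID links dropped.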
- I would propose that the "right" relationship diagram is not necessarily one that caters to a certain "right" philosophical point of view. Rather, the "right" diagram is the one that allows users to define the relationships that they need for the organization of their metadata in the simplest manner, and which provides the most clarity about what resources of various kinds are, and how they are connected.
Agreed. But another component to "rightness" is the extent to which users want to re-use content from the various class instances. For example, it's incredibly easy to convert the normalized version to the flattened version. But it's not always so easy to parse the flattened version back to the normalized one. You can always do it by creating unique values of the provided terms for each class, but this can be potentially misleading and artificial -- especially if there were more properties for each of the individual class entities that were not included with the packaged data.
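The reverse direction can be sketched as follows: a minimal Python example that pulls an Event table back out of flattened records by minting one synthetic eventID per distinct combination of Event term values. The choice of EVENT_TERMS, the "event-N" identifier pattern, and the function name are assumptions made only for illustration.

```python
# Terms treated as belonging to the Event class (an assumption of this sketch).
EVENT_TERMS = ("eventDate", "eventTime")


def split_events(flat_records):
    """Split flattened records into Occurrence records plus a derived Event table."""
    event_ids = {}      # tuple of Event term values -> synthetic eventID
    occurrences = []
    for record in flat_records:
        key = tuple(record.get(term) for term in EVENT_TERMS)
        # Reuse the ID if this combination was seen before, otherwise mint one.
        event_id = event_ids.setdefault(key, f"event-{len(event_ids) + 1}")
        occurrence = {t: v for t, v in record.items() if t not in EVENT_TERMS}
        occurrence["eventID"] = event_id
        occurrences.append(occurrence)
    event_table = {eid: dict(zip(EVENT_TERMS, key)) for key, eid in event_ids.items()}
    return occurrences, event_table
```

Note the artificiality: two records that happen to share a date and time are merged into one Event even if they were really separate collecting events, and any Event properties that were never exported cannot be recovered at all.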
There also seemed to be a consensus that an observation was simply an Occurrence that did not have an associated token.
Well...technically the "token" in this case is a pattern of neurons in the observer's brain that constitute a memory....but that may be a bit abstract.
http://bioimages.vanderbilt.edu/pages/token-assumed.gif which I will refer to as the "assumed token" model and
http://bioimages.vanderbilt.edu/pages/token-explicit.gif which I will refer to as the "explicit token" model.
Nice -- and without reading another word of your message, I'm going to take a chance and say that I conceptually agree with your "token-explicit" diagram. The hard part is (as with the case of class:individual) deciding whether this level of "normalization" is valuable for DwC purposes.
I believe that historically the assumed token model has been the one which most people have had in mind.
Actually, I've always envisioned it as you have in your token-explicit version (and have said as much at various meetings to discuss DwC, going back to 1.0). In fact, I remember discussing this exact issue with Stan Blum long before DwC existed (he was the first to suggest to me the term "evidence" in this context -- which I think is functionally equivalent to your "token"). However, I've conceded that this level of normalization would probably be too much for the intended purpose of the DwC terms. But I'll keep an open mind on that.
Before the new DwC standard, we had specimens and we had observations. In order to avoid redundancies in terms for those two types of "things", a combined "thing" called "Occurrence" was created. An Occurrence that was an observation didn't have a token and an Occurrence that was a specimen had a physical or living specimen as its token.
My rationalization of it in the early days (pre-DwC) was that *everything* was effectively an observation, and beyond that, the only question was a matter of evidence. In my earliest models, I categorized "evidence" into "Specimen", "Image", "Literature Report", and "Unvouchered Observation" (I was using the word "voucher" in the general sense, as in the verb "to vouch" -- not in the more specific sense for our community, which implies "Specimen preserved in Museum"). My read on the history of DwC is that it was initially established as a means to aggregate and/or share Specimen data amongst Museums (hence its Specimen-centric nature). Later, the Specimen/Observation dichotomy was introduced so that DwC content could give more sophisticated and complete representations of the occurrence of organisms in place and time, because there was much more information than what existed as specimens in Museums. In my mind, the "Observation" side was effectively a collapsing of my "Image", "Literature Report" and "Unvouchered Observation" -- which I was OK with in the context of the time, because the vast majority of content then available in computer databases came from museum specimen databases and from observational databases (largely in the bird realm).
So...I see the current iteration of DwC as another step in the evolution of moving from "sharing and aggregating specimen data among museums" to "documenting biodiversity in nature". It's not all the way into the fully normalized representation of biodiversity data, but it's far enough that it is a nice compromise between practical and effective for the majority of the user constituency. In my mind, the next logical step in this evolutionary trajectory would be to recognize "Individual" as a class (which DwC is already primed for, via individualID).
It is not clear how one is supposed to handle the actual metadata for the image that serves as the token.
That seems to me to be in the domain of the TDWG MRTG group (http://www.tdwg.org/standards/638/).
Unlike specimens where the token's metadata terms are placed in the Occurrence class, I guess in the case of an image one is supposed to use associatedMedia to link the so-called MachineObservation to the image record. If DNA were extracted, one would link the sequence to the Occurrence using associatedSequences (although it's not clear to me what the basisOfRecord for that would be - "TookATissueSample"?). But what does one do for other kinds of tokens, like seeds or tissue samples - create terms like associatedSeed and associatedTissueSample?
In my mind, things like seeds, tissue samples, and DNA sequences are simply different kinds of specimens (just like dried skeletons vs. botanical pressed sheets vs. whole organisms in jars of alcohol vs. prepared skins, etc.) They may have certain properties specific to each subclass of specimen, but fundamentally I think it's fair to treat them as specimens. DNA sequences are a bit different, of course, because they are not the "stuff" of an organism, but rather an indirect representation of the "stuff". In my mind, that difference justifies associatedSequences, where we don't have associatedSeeds, associatedTeeth, associatedSkins, associatedSkeletons, etc.
However, if I'm going to make this admission, I demand that the other guilty parties also confess, namely people who want to assert that Occurrences have properties that actually are properties of specimens.
I'll plead innocent of those charges, as I always understood the representation of Specimen-properties as applied to Occurrence instances as just another compromise of "flat" vs. "normalized", in much the same way that applying properties of Locations to Events, and Locations+Events to Occurrences, would likewise be compromises in the interest of simplification.
If we accept the explicit token model, then as a biodiversity informatics resource type "observation" will have to disappear into a puff of nothingness
Not necessarily. See my comment earlier about patterns of neurons in a human brain that constitute a memory. Just as a digital image rendered on a hard disk requires certain machinery to convert into photons that strike our retinas (i.e., a computer and monitor), so too does a memory require such machinery (e.g., the brain itself, transmission of sound waves via vocal cords, sound waves striking eardrums, etc.). This may sound weird, but I'm being serious: a human memory is, fundamentally, every bit as much of a "token" as a specimen or a digital image. It's just considerably less accessible and well-resolved.
Realistically, I can't see this kind of separation ever happening, given the amount of trouble it's been just to get a few people to admit that Individuals exist.
I don't think the issue was ever in convincing people that Individuals exist -- that much, I think, was clear to everyone (as proof: see dwc:individualID). The issue was always more about where the current DwC should lie on the scale of highly flattened (e.g., DwC 1.0) to highly normalized (e.g., ABCD and CDM). It's necessarily a compromise between modelling the information "as it really is", vs. modelling the information in a way that's both accessible to the majority of content providers, and useful to the majority of content consumers. I think we both understand what the trade-offs are in either direction. The question is, what is the "sweet spot" for the majority of our community at this time in history?
I would venture that at the time DwC 1.0 was developed, it hit the sweet spot reasonably well. As more content holders develop increasingly sophisticated DBMS for their content, and as the user community delves into increasingly sophisticated analyses of the data, the "sweet spot" will shift from the flattened end of the scale to the normalized end of the scale. And, I would hope, DwC will evolve accordingly.
It is just too hard to get motion to happen in the TDWG community.
People make the same complaint about another organization that I'm involved with (ICZN). But here's the thing: as in the case of nomenclature, stability in itself can be a very important thing. If DwC changed every six months, then by the time people developed software apps to work with it, those apps would already be obsolete. If someone writes code that consumes DwC content as expressed in the current version of DwC, then that code may break if people start providing content with class:individual and class:token content. If our community is going to move forward successfully, I think standards like DwC need to evolve in a punctuated way, rather than a gradualist way (same goes for the Codes of nomenclature). That is, a bit of inertia in the system is probably a good thing.
OK, I've now gone on for eight pages of text explaining the rationale behind the question. So I'll return to the basic question: is the consensus for modeling the relationship between an Occurrence and associated token(s) the assumed token model:
http://bioimages.vanderbilt.edu/pages/token-assumed.gif
or the explicit token model:
http://bioimages.vanderbilt.edu/pages/token-explicit.gif
?
Here's how I would answer: When modelling my own databases, tracking my own content, I would *definitely* (and indeed already have, for a long time now) go with the token-explicit.
But when deciding on a community data exchange standard (i.e., DwC), compromise between flat and normalized is still a necessity, and as such, the answer in terms of modifying DwC needs to take into account the form of the bulk of the existing content, the needs of the bulk of the existing users/consumers, and the virtues of stability of Standards in a world where software app development time stretches for months or years.
Maybe the answer to this is to treat different versions of DwC as concurrent, rather than serial. That is, as long as the next most sophisticated version can easily be "collapsed" to all previous versions (aka, backward compatibility), then maybe we just need a clear mechanism for consuming applications to indicate desired DwC version. That way, apps developed to work with v2.1 can indicate to a provider that is capable of producing v3.6 content, that they want it in v2.1 format. Assuming we maintain backward compatibility (i.e., the more-normalized version can be easily collapsed to the more flattened version), then it should be a very simple matter for the content provider to stream the same content in v2.1 format.
But now I'm dabbling in areas that are WAY outside my scope of expertise...
Anyway...I would reiterate that I, for one, appreciate that you took the time to write all this down (took me over 3 hours to read & respond -- so obviously I care! -- of course, I'm waiting for a taxi to go to the airport, so really not much else for me to do right now). If I didn't reply to parts of your message, it was either because I agreed with you and had nothing to elaborate or expound upon, or I didn't really understand (e.g., all the rdf stuff).
Aloha, Rich
Rich, Thanks for taking the time to read the whole thing. Based on the first series of comments you made, it seems as though we are in agreement on most points. I think that what I wrote was (as I had anticipated) somewhat less clear due to my use (or failure to use) some appropriate terms to describe what I was talking about. For example, when I said "atomized" I probably should have said something like "fine-grained" and correct use of the term "normalized" would have helped. Some other comments inline:
Richard Pyle wrote:
I believe that historically the assumed token model has been the one which most people have had in mind.
Actually, I've always envisioned it as you have in your token-explicit version (and have said as much at various meetings to discuss DwC, going back to 1.0). In fact, I remember discussing this exact issue with Stan Blum long before DwC existed (he was the first to suggest to me the term "evidence" in this context -- which I think is functionally equivalent to your "token"). However, I've conceded that this level of normalization would probably be too much for the intended purpose of the DwC terms. But I'll keep an open mind on that.
Before the new DwC standard, we had specimens and we had observations. In order to avoid redundancies in terms for those two types of "things", a combined "thing" called "Occurrence" was created. An Occurrence that was an observation didn't have a token and an Occurrence that was a specimen had a physical or living specimen as its token.
My rationalization of it in the early days (pre-DwC) was that *everything* was effectively an observation, and beyond that, the only question was a matter of evidence. In my earliest models, I categorized "evidence" into "Specimen", "Image", "Literature Report", and "Unvouchered Observation" (I was using the word "voucher" in the general sense, as in the verb "to vouch" -- not in the more specific sense for our community, which implies "Specimen preserved in Museum"). My read on the history of DwC is that it was initially established as a means to aggregate and/or share Specimen data amongst Museums (hence its Specimen-centric nature). Later, the Specimen/Observation dichotomy was introduced so that DwC content could give more sophisticated and complete representations of the occurrence of organisms in place and time, because there was much more information than what existed as specimens in Museums. In my mind, the "Observation" side was effectively a collapsing of my "Image", "Literature Report" and "Unvouchered Observation" -- which I was OK with in the context of the time, because the vast majority of content then available in computer databases came from museum specimen databases and from observational databases (largely in the bird realm).
Well, I'm not surprised that the ideas that I'm trying to put down in words and diagrams predate my entry into this arena a year and a half ago. What is a bit frustrating to me is that ideas like these aren't laid out in an easy-to-understand fashion and placed in easy-to-find places. I have spent much of the last year and a half trying to understand how the whole TDWG/DwC universe is supposed to fit together. I think that the idea of having the Google Code site where there are explanations and examples for the various DwC terms is the kind of thing we need. Unfortunately, most of the terms do not yet have entries there. Perhaps I'm just impatient. If it turns out that any of the summaries that I've written here accurately reflect any kind of consensus, then maybe someone could "clean them up" (i.e. use correct technical terms after giving definitions of what they mean) and paste them somewhere where people can find them. That would prevent another person 10 years from now re-articulating the same ideas a third time. I'm particularly thinking of the summary diagram http://bioimages.vanderbilt.edu/pages/token-explicit.gif along with an explanation of how people use the more normalized and more flattened versions of it. We already do have quite lucid examples in the Simple Darwin Core (flattened) and Darwin Core XML guide (normalized), but some sort of overview of the big picture might be helpful. If an RDF guide ever gets off the ground, that would be another example of how the relationships assumed in DwC are expressed in a very explicit way.
So...I see the current iteration of DwC as another step in the evolution of moving from "sharing and aggregating specimen data among museums" to "documenting biodiversity in nature". It's not all the way into the fully normalized representation of biodiversity data, but it's far enough that it is a nice compromise between practical and effective for the majority of the user constituency. In my mind, the next logical step in this evolutionary trajectory would be to recognize "Individual" as a class (which DwC is already primed for, via individualID).
I think I understand the message that you are trying to convey above and in your later comments about creating new versions of DwC (or new evolutionary states of DwC) that don't break the previous ones. I think that is one reason why the process of examining and clearly articulating the community consensus on what Darwin Core terms and classes "mean" and how they are connected to each other is so important before we embark on implementing GUIDs and RDF. Pete has suggested that we may need a second version of DwC in order to make it work in the Linked Open Data world and he's probably right. I'm not sure that the existing vocabulary has all of the terms we need to do that. However, if we are going to "evolve" Darwin Core so that it will work in the LOD world, I hope that we do it in such a way that we maintain the same "meaning" of things as Darwin Core 1.0. I think that is the way to maintain the kind of "stability" that you described below.
Unlike specimens where the token's metadata terms are placed in the Occurrence class, I guess in the case of an image one is supposed to use associatedMedia to link the so-called MachineObservation to the image record. If DNA were extracted, one would link the sequence to the Occurrence using associatedSequences (although it's not clear to me what the basisOfRecord for that would be - "TookATissueSample"?). But what does one do for other kinds of tokens, like seeds or tissue samples - create terms like associatedSeed and associatedTissueSample?
In my mind, things like seeds, tissue samples, and DNA sequences are simply different kinds of specimens (just like dried skeletons vs. botanical pressed sheets vs. whole organisms in jars of alcohol vs. prepared skins, etc.) They may have certain properties specific to each subclass of specimen, but fundamentally I think it's fair to treat them as specimens. DNA sequences are a bit different, of course, because they are not the "stuff" of an organism, but rather an indirect representation of the "stuff". In my mind, that difference justifies associatedSequences, where we don't have associatedSeeds, associatedTeeth, associatedSkins, associatedSkeletons, etc.
Your point is well taken in that we don't need a proliferation of types of associated tokens. We need as many different token "types" as we have coherent sets of metadata terms. One of the points of typing resources is to let potential users know what kinds of metadata properties (terms) they can reasonably expect to receive about that resource. If one will receive the same set of properties about two kinds of resources (e.g. skins and skeletons), there is no reason to type them differently. The point that I was trying to get at (eventually) was that it was inconsistent to say that images need to be referenced as associatedMedia and sequences need to be referenced as associatedSequences, and yet not say that specimens need to be referenced as "associatedSpecimens". I actually think that based on Roger's explanation of "to subclass or not" (http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot), it makes more sense to talk about using a generic "hasToken" or "tokenID" along with "tagging" the token using rdf:type (as I suggested toward the end of my "treatise") rather than a bunch of associatedXXXX terms.
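As a rough illustration of the difference, the following Python sketch converts a record that uses the per-kind associatedXXXX terms into a single list of generically typed tokens. Everything here is an assumption for the sake of the example: the "tokenID" and "type" keys, the type labels, the "|" separator used to split multi-valued fields, and the function name.

```python
# Hypothetical mapping from existing associatedXXXX terms to a token type label.
ASSOCIATED_TERMS = {
    "associatedMedia": "StillImage",
    "associatedSequences": "Sequence",
}


def tokens_from_record(record, separator="|"):
    """Collect every associatedXXXX value as a generic, typed token."""
    tokens = []
    for term, token_type in ASSOCIATED_TERMS.items():
        for value in record.get(term, "").split(separator):
            value = value.strip()
            if value:  # skip empty fields and stray separators
                tokens.append({"tokenID": value, "type": token_type})
    return tokens
```

With this shape, adding a new kind of token means adding a new type value rather than inventing another associatedXXXX term.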
If we accept the explicit token model, then as a biodiversity informatics resource type "observation" will have to disappear into a puff of nothingness
Not necessarily. See my comment earlier about patterns of neurons in a human brain that constitute a memory. Just as a digital image rendered on a hard disk requires certain machinery to convert into photons that strike our retinas (i.e., a computer and monitor), so too does a memory require such machinery (e.g., the brain itself, transmission of sound waves via vocal cords, sound waves striking eardrums, etc.). This may sound weird, but I'm being serious: a human memory is, fundamentally, every bit as much of a "token" as a specimen or a digital image. It's just considerably less accessible and well-resolved.
I guess I'm thinking about this in terms of a token being something to which we can assign an identifier and retrieve a representation (a la representational state transfer). Although I don't deny the existence of memory patterns in neurons that are associated with a HumanObservation, there isn't any way that we can receive a representation of that memory directly. If the person draws a sketch of what he/she remembers, then we have a media item that we can convert into a digital form and transmit through the Internet (a token). If the person types up notes, then we have a text document (a token that can also be delivered as a digital file or scan of typewritten page). On the other hand, if the person simply records the values of recordedBy, eventDate, and Location terms, then we have only Occurrence metadata (no token). If someone claims "basisOfRecord=HumanObservation" and has no token of any kind, then what is there that is deliverable other than the basic Occurrence metadata? That's why I'm claiming that basisOfRecord=HumanObservation simply corresponds to an Occurrence record with no token.
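The interpretation in the paragraph above can be sketched in Python as a small data model in which an Occurrence carries zero or more identifiable tokens, and "HumanObservation" is simply what is reported when the list is empty. The class names, field names, and token type labels are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    token_id: str    # identifier from which a representation can be retrieved
    token_type: str  # hypothetical label, e.g. "PreservedSpecimen" or "StillImage"


@dataclass
class Occurrence:
    occurrence_id: str
    recorded_by: str
    event_date: str
    tokens: List[Token] = field(default_factory=list)


def evidence_types(occurrence):
    """Return the sorted token types, or HumanObservation when there is no token."""
    if not occurrence.tokens:
        # Only the basic Occurrence metadata is deliverable in this case.
        return ["HumanObservation"]
    return sorted({token.token_type for token in occurrence.tokens})
```

Under this sketch a sketch drawing or typed field notes would be added as tokens of their own type, at which point the record stops being a bare observation.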
Realistically, I can't see this kind of separation ever happening, given the amount of trouble it's been just to get a few people to admit that Individuals exist.
I don't think the issue was ever in convincing people that Individuals exist -- that much, I think, was clear to everyone (as proof: see dwc:individualID). The issue was always more about where the current DwC should lie on the scale of highly flattened (e.g., DwC 1.0) to highly normalized (e.g., ABCD and CDM). It's necessarily a compromise between modelling the information "as it really is", vs. modelling the information in a way that's both accessible to the majority of content providers, and useful to the majority of content consumers. I think we both understand what the trade-offs are in either direction. The question is, what is the "sweet spot" for the majority of our community at this time in history?
I would venture that at the time DwC 1.0 was developed, it hit the sweet spot reasonably well. As more content holders develop increasingly sophisticated DBMS for their content, and as the user community delves into increasingly sophisticated analyses of the data, the "sweet spot" will shift from the flattened end of the scale to the normalized end of the scale. And, I would hope, DwC will evolve accordingly.
It is just too hard to get motion to happen in the TDWG community.
People make the same complaint about another organization that I'm involved with (ICZN). But here's the thing: as in the case of nomenclature, stability in itself can be a very important thing. If DwC changed every six months, then by the time people developed software apps to work with it, those apps would already be obsolete. If someone writes code that consumes DwC content as expressed in the current version of DwC, then that code may break if people start providing content with class:individual and class:token content. If our community is going to move forward successfully, I think standards like DwC need to evolve in a punctuated way, rather than a gradualist way (same goes for the Codes of nomenclature). That is, a bit of inertia in the system is probably a good thing.
OK, I've now gone on for eight pages of text explaining the rationale behind the question. So I'll return to the basic question: is the consensus for modeling the relationship between an Occurrence and associated token(s) the assumed token model:
http://bioimages.vanderbilt.edu/pages/token-assumed.gif or the explicit token model: http://bioimages.vanderbilt.edu/pages/token-explicit.gif ?
Here's how I would answer: When modelling my own databases, tracking my own content, I would *definitely* (and indeed already have, for a long time now) go with the token-explicit.
But when deciding on a community data exchange standard (i.e., DwC), compromise between flat and normalized is still a necessity, and as such, the answer in terms of modifying DwC needs to take into account the form of the bulk of the existing content, the needs of the bulk of the existing users/consumers, and the virtues of stability of Standards in a world where software app development time stretches for months or years.
Maybe the answer to this is to treat different versions of DwC as concurrent, rather than serial. That is, as long as the next most sophisticated version can easily be "collapsed" to all previous versions (aka, backward compatibility), then maybe we just need a clear mechanism for consuming applications to indicate desired DwC version. That way, apps developed to work with v2.1 can indicate to a provider that is capable of producing v3.6 content, that they want it in v2.1 format. Assuming we maintain backward compatibility (i.e., the more-normalized version can be easily collapsed to the more flattened version), then it should be a very simple matter for the content provider to stream the same content in v2.1 format.
Yes, I agree about this concept. I think that what I'm really advocating for is that we agree on what the most normalized model is that will connect all of the existing Darwin Core classes and terms. In that sense, when I'm asking for Individual to be accepted as a class, I'm not arguing for a "new" thing, I'm arguing for a clarification of what we mean when we use the existing term dwc:individualID. When I'm asking for terms to facilitate a logically consistent way to connect Occurrences with their tokens, I'm also not really asking for an expansion of Darwin Core, I'm asking for a more consistent model than "subclassing" by using associatedMedia and associatedSequences but not using "associatedSpecimens". I think that this is important because if we don't agree on these things, we are going to have a royal mess on our hands when we start trying to develop an RDF guide for Darwin Core. As an eternal optimist, I think that describing a fully normalized model that can be translated into RDF can be achieved with only a few minor additions to the existing terms as opposed to requiring a complete new version. If we really need to completely rewrite Darwin Core for RDF, I don't have any delusions that it will be accomplished before I retire.
But now I'm dabbling in areas that are WAY outside my scope of expertise...
Anyway...I would reiterate that I, for one, appreciate that you took the time to write all this down (took me over 3 hours to read & respond -- so obviously I care! -- of course, I'm waiting for a taxi to go to the airport, so really not much else for me to do right now). If I didn't reply to parts of your message, it was either because I agreed with you and had nothing to elaborate or expound upon, or I didn't really understand (e.g., all the rdf stuff).
Again, thanks for taking the time to read and comment.
Steve
What is a bit frustrating to me is that ideas like these aren't laid out in an easy-to-understand fashion and placed in easy-to-find places. I have spent much of the last year and a half trying to understand how the whole TDWG/DwC universe is supposed to fit together.
Understood, and agreed. Part of the problem is that a lot of this stuff is driven by passionate individuals, who also happen to be highly over-committed. There's barely enough time available to do the interesting bits (conceptualizing, experimenting with implementations), let alone the less-interesting bits (documentation). Having said that, there are some early documents that go into a lot of this in great detail. One is Stan Blum's description of the ASC model. Another is a series of publications from Walter Berendsohn on "potential taxa". A lot of other stuff is floating around the Specify project, and there are some other earlier sources. But I agree, it's not easy to find, and it doesn't always cover the details we need it to in today's context.
The point that I was trying to get at (eventually) was that it was inconsistent to say that images need to be referenced as associatedMedia and sequences needed to be referenced as associatedSequences, and yet not say that specimens needed to be referenced as "associatedSpecimens".
Hmmmm...not sure I agree. If it is so that Occurrence=Individual+Event, then a Specimen can be said to *be* the Individual, whereas images, DNA sequences, and the like are the tokens. In other words, Individual "is a" Specimen; but Individual "has an" image. Now that I think about it, perhaps Specimens should not be treated on an equal par with other tokens; and indeed, maybe specimens aren't tokens (per your definition) at all. This is not consistent with how I've always thought about it (see my previous email), but if the elusive "Individual" is key to this relationship, then perhaps Specimens serve as both "evidence" of an occurrence, and the "stuff" of the Individual represented by the Occurrence.
My brain hurts.
I guess I'm thinking about this in terms of a token being something to which we can assign an identifier and retrieve a representation (a la representational state transfer). Although I don't deny the existence of memory patterns in neurons that are associated with a HumanObservation, there isn't any way that we can receive a representation of that memory directly.
I guess it depends on what you mean by "representation". We can't retrieve a specimen directly either -- but we can retrieve a database record that represents the specimen, and metadata associated with it. I think the same can be said about a human memory (as the foundation of an observation). That is, there is a species identification, number of individuals, etc., associated with an observation that is based on the memory of the person who made the observation, and that memory is represented by a database record with associated metadata.
This conversation could go very weird, very quickly -- and maybe I'm just being difficult (in which case I apologize). But now that I see that a specimen may, in fact, be fundamentally different from other kinds of evidence supporting an occurrence, I'm no longer sure what I believe anymore (especially after the 11-hr flight from Berlin I just got off of).
Maybe the answer to this is to treat different versions of DwC as concurrent, rather than serial.
[etc.]
Yes, I agree about this concept. I think that what I'm really advocating for is that we agree on what the most normalized model is that will connect all of the existing Darwin Core classes and terms. In that sense, when I'm asking for Individual to be accepted as a class, I'm not arguing for a "new" thing, I'm arguing for a clarification of what we mean when we use the existing term dwc:individualID.
Makes sense to me.
Aloha, Rich
An individual may be represented in several occurrence records.
You might have a bird that was photographed in one study.
Banded in another study.
Then later, preserved in a museum.
I think there is a case for being able to track this individual over time.
- Pete
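The bird example above can be sketched in a few lines of Python: three Occurrence records share one individualID, and grouping on that term recovers the individual's history. The record values, identifiers, and the choice of basisOfRecord for the banding event are assumptions made up for illustration.

```python
from collections import defaultdict

# Made-up records; identifiers, dates, and basisOfRecord values are
# illustrative assumptions only.
occurrences = [
    {"occurrenceID": "occ-3", "individualID": "bird-42",
     "basisOfRecord": "PreservedSpecimen", "eventDate": "2010-03-20"},
    {"occurrenceID": "occ-1", "individualID": "bird-42",
     "basisOfRecord": "MachineObservation", "eventDate": "2008-05-01"},
    {"occurrenceID": "occ-4", "individualID": "bird-99",
     "basisOfRecord": "HumanObservation", "eventDate": "2009-01-10"},
    {"occurrenceID": "occ-2", "individualID": "bird-42",
     "basisOfRecord": "HumanObservation", "eventDate": "2009-06-15"},
]


def history_by_individual(records):
    """Group Occurrence records by individualID, oldest first."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["individualID"]].append(rec)
    for recs in grouped.values():
        # ISO 8601 dates sort chronologically as plain strings.
        recs.sort(key=lambda r: r["eventDate"])
    return dict(grouped)


history = history_by_individual(occurrences)
```

Without a shared identifier for the individual, the three records for "bird-42" would look like three unrelated occurrences.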
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
Hi Pete,
Yes, that's basically where the conversation on "Individual" began several weeks ago (i.e., that the same Individual could participate in more than one Occurrence). As we've mentioned, DwC already accommodates individualID, but there is no class for an individual. If there were, several of the properties of Occurrence would move over.
Aloha, Rich
_____
From: Peter DeVries [mailto:pete.devries@gmail.com] Sent: Sunday, October 24, 2010 1:18 PM To: Richard Pyle Cc: Steve Baskauf; tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] Treatise on Occurrence, tokens, and basisOfRecord
On Mon, Oct 25, 2010 at 9:43 AM, Richard Pyle deepreef@bishopmuseum.org wrote:
... Part of the problem is that a lot of this stuff is driven by passionate individuals, who also happen to be highly over-committed.
hmmm... passionate... "I cannot give any scientist of any age better advice than this: the intensity of a conviction that a hypothesis is true has no bearing on whether it is true. " Peter B. Medawar, Advice to a Young Scientist, 1979
Hmmmm...not sure I agree. If it is so that Occurrence=Individual+Event, then a Specimen can be said to *be* the Individual, whereas images, DNA sequences, and the like are the tokens. In other words, Individual "is a" Specimen;
That might work for fish, but with *real* organisms, such as plants, a specimen is a fragment or representation of an individual and thus conceptually not really different to a chunk of DNA or an image. It could be thought of as a token of stuff that was in a particular place at a particular time. Think I am with Steve on this one... if only to provoke a concept fight... :)
but Individual "has a" image.
and an individual has a fragment, sacrificed to become a specimen. It is just that in fish the sacrifice was entire and ultimate... :)
Now that I think about it, perhaps ... but if the elusive "Individual" is key to this relationship, then perhaps Specimens serve as both "evidence" of an occurrence, and the "stuff" of the Individual represented by the Occurrence.
The notion of the 'individual' is probably a furphy... for the different organisms the token might be an individual, but it might be a fragment, or a part of a population, or perhaps even the entire population.
The distinction between 'the stuff' and the specimen is only one of definition, isn't it? If a museum or herbarium agrees to accept and curate it, then 'stuff' becomes a specimen.
omg! ... curation is the point at which 'stuff' becomes 'things'! (yes, remember this, you heard it on TDWG first) ...
My brain hurts.
Hey, you only write this stream of subconsciousness... we have to read it... :)
Although I don't deny the existence of memory patterns in neurons that are associated with a HumanObservation, there isn't any way that we can receive a representation of that memory directly.
oh oh... metaconcept/metaphysics alert...
That is, there is a species identification, number of individuals, etc., associated with an observation that is based on the memory of the person who made the observation, and that memory is represented by a database record with associated metadata.
hmmm... thinking... repressed memories (misidentified and forgotten specimens, or, extinctions you refuse to accept)... false memories (occurrences you made up because you're the expert and the species should bloody well be there)... hallucinations (anybody else's taxonomy, identifications and survey results)...
At this point I want to fork to a cosmic metaphysical ramble about occurrence being a totally scale dependent many to many to many relationship between stuff (possibly represented by things), time and place... but I won't... ;)
This conversation could go very weird, very quickly
What is this 'could' of which you speak?
jim :)
hmmm... passionate... "I cannot give any scientist of any age better advice than this: the intensity of a conviction that a hypothesis is true has no bearing on whether it is true. " Peter B. Medawar, Advice to a Young Scientist, 1979
Hmmmm....who said anything about hypotheses?
Hmmmm...not sure I agree. If it is so that Occurrence=Individual+Event, then a Specimen can be said to *be* the Individual, whereas images, DNA sequences, and the like are the tokens. In other words, Individual "is a" Specimen;
That might work for fish, but with *real* organisms, such as plants, a specimen is a fragment or representation of an individual and thus conceptually not really different to a chunk of DNA or an image.
I disagree. A fragment of a "real" organism is no different from a skin of a "fake" organism. Neither of these is a "representation" of an individual -- they are both *part* of an individual. An image is a representation of an individual. A DNA "sequence" is also a representation. I would argue that the product of PCR is also a representation. Actual DNA molecules within actual organism cells would be a "part" of the individual.
and an individual has a fragment, sacrificed to become a specimen. It is just that in fish the sacrifice was entire and ultimate... :)
Without getting into the notion of what constitutes an "entire" organism (e.g., when the water inside the cells is replaced by alcohol, was the displaced water part of the organism?), I would say that both fish and plants are but mere footnotes on the biodiversity landscape, compared with insects (usually preserved as mostly intact whole organisms). And, of course, the insects are mere footnotes compared to the bacteria....but I digress.
The notion of the 'individual' is probably a furphy... for the different organisms the token might be an individual, but it might be a fragment, or a part of a population, or perhaps even the entire population.
My brain is starting to hurt again.
hmmm... thinking... repressed memories (misidentified and forgotten specimens, or, extinctions you refuse to accept)... false memories (occurrences you made up because you're the expert and the species should bloody well be there)... hallucinations (anybody else's taxonomy, identifications and survey results)...
Indeed. And I'm sure all such flavors of memories are represented in databases within our community.
At this point I want to fork to a cosmic metaphysical ramble about occurrence being a totally scale dependent many to many to many relationship between stuff (possibly represented by things), time and place... but I won't... ;)
My brain hurts a bit less now. Thanks.
This conversation could go very weird, very quickly
What is this 'could' of which you speak?
:-)
Evidently, you have very little experience in the realm of the "very" weird. Trust me, we haven't even come close yet. We've only just nudged our toes across the "weird" line.
Definitely time for some sleep.
Rich
Are these things available online somewhere? URLs? What's the ASC model? Steve
Richard Pyle wrote:
early documents that go into a lot of this in great detail. One is Stan Blum's description of the ASC model. Another is a series of publications from Walter Berendsohn on "potential taxa". A lot of other stuff is floating around the Specify project, and there are some other earlier sources. But I agree, it's not easy to find, and it doesn't always cover the details we need it to in today's context.
I'll post PDF versions of the ASC docs somewhere on the TDWG wiki. Looking for an appropriate place now...
-Stan
On 10/25/10 7:30 PM, "Steve Baskauf" steve.baskauf@vanderbilt.edu wrote:
Are these things available online somewhere? URLs? What's the ASC model? Steve
Richard Pyle wrote:
early documents that go into a lot of this in great detail. One is Stan Blum's description of the ASC model. Another is a series of publications from Walter Berendsohn on "potential taxa". A lot of other stuff is floating around the Specify project, and there are some other earlier sources. But I agree, it's not easy to find, and it doesn't always cover the details we need it to in today's context.
Steve et al.,
I've created a page on the TDWG TAG wiki for historical documents and diagrams and posted the ASC model there.
http://wiki.tdwg.org/twiki/bin/view/TAG/HistoricalDocuments
This may not be the best place for this, but I think TDWG should and will try to keep things like this available for their historical significance.
I think this was the first published model to specify the Locality, CollectingEvent, CollectingUnit chain of relationships. I think it was also the first to specify CollectingUnit as a series of subtypes. I think those features have held up pretty well over the intervening years (almost 2 decades!). As you will see, it was missing all the necessary detail to implement a real database, but many subsequent models and DBs reflected these concepts. Note also that the model never made it into any formal standards process, so remained a draft.
The next significant modeling efforts got very elaborate, such as the MVZ and Specify models, but those were explicitly for guiding the development of single applications, not for data exchange. As we got into data exchange with XML documents and XML schema specifications, we saw a strong disagreement within the TDWG (biodiversity informatics) community about the advisability of very simple and limited specifications (DarwinCore), versus very complex specifications (ABCD).
Over the last decade, what we have done in (most of?) the biodiversity information networks is to deploy application schemas; the data specification is used by a relatively limited number of software tools. With RDF I think we are trying to break the binding between a consuming application and a single source schema. I think this is still a thorny problem, however, and detailed conceptual models will have uses, but any single model will have its limits. What will be interesting, I think, will be to show how databases made for different purposes can be integrated with well-designed conceptual models and RDF. We need to hear from the other RDF wonks about the principles we should be using in constructing our schemas, like the warnings about property ranges and domains (which I still don't understand).
Cheers,
-Stan
Please note that in various examples, I have incorrectly placed rdf:type in the namespace rdfs: (http://www.w3.org/2000/01/rdf-schema#) rather than rdf: (http://www.w3.org/1999/02/22-rdf-syntax-ns#). Thanks to Bob for pointing out this serious error.
Also, the ASC model information is very cool. I wish I'd seen it a long time ago. I especially like the giant relationship chart. Thanks Stan and Rich. Steve
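For anyone checking their own examples against the namespace correction above, here is a small Python sketch that expands prefixed names into full IRIs using the two namespace IRIs quoted in that note. The `expand` helper is a made-up name for illustration, not part of any RDF library.

```python
# The two namespace IRIs are the ones quoted in the correction above.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

PREFIXES = {"rdf": RDF, "rdfs": RDFS}


def expand(qname, prefixes=PREFIXES):
    """Expand a prefixed name such as 'rdf:type' into a full IRI (hypothetical helper)."""
    prefix, local = qname.split(":", 1)
    return prefixes[prefix] + local


correct_type = expand("rdf:type")     # the property that actually exists
mistaken_type = expand("rdfs:type")   # expands fine, but names no defined term
```

The mistaken form still expands to a syntactically valid IRI, which is why the error is easy to miss: a client looking for rdf:type simply never matches it.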
Steve Baskauf wrote:
Rich, Thanks for taking the time to read the whole thing. Based on the first series of comments you made, it seems as though we are in agreement on most points. I think that what I wrote was (as I had anticipated) somewhat less clear due to my use of (or failure to use) some appropriate terms to describe what I was talking about. For example, when I said "atomized" I probably should have said something like "fine-grained" and correct use of the term "normalized" would have helped. Some other comments inline:
Richard Pyle wrote:
I believe that historically the assumed token model has been the one which most people have had in mind.
Actually, I've always envisioned it as you have in your token-explicit version (and have said as much at various meetings to discuss DwC, going back to 1.0). In fact, I remember discussing this exact issue with Stan Blum long before DwC existed (he was the first to suggest to me the term "evidence" in this context -- which I think is functionally equivalent to your "token"). However, I've conceded that this level of normalization would probably be too much for the intended purpose of the DwC terms. But I'll keep an open mind on that.
Before the new DwC standard, we had specimens and we had observations. In order to avoid redundancies in terms for those two types of "things", a combined "thing" called "Occurrence" was created. An Occurrence that was an observation didn't have a token and an Occurrence that was a specimen had a physical or living specimen as its token.
My rationalization of it in the early days (pre-DwC) was that *everything* was effectively an observation, and beyond that, the only question was a matter of evidence. In my earliest models, I categorized "evidence" into "Specimen", "Image", "Literature Report", and "Unvouchered Observation" (I was using the word "voucher" in the general sense, as in the verb "to vouch" -- not in the more specific sense for our community, which implies "Specimen preserved in Museum"). My read on the history of DwC is that it was initially established as a means to aggregate and/or share Specimen data amongst Museums (hence its Specimen-centric nature). Later, the Specimen/Observation dichotomy was introduced so that DwC content could give more sophisticated and complete representations of the occurrence of organisms in place and time, because there was much more information than what existed as specimens in Museums. In my mind, the "Observation" side was effectively a collapsing of my "Image", "Literature Report" and "Unvouchered Observation" -- which I was OK with in the context of the time, because the vast majority of content then available in computer databases came from museum specimen databases and from observational databases (largely in the bird realm).
Well, I'm not surprised that the ideas that I'm trying to put down in words and diagrams predate my entry into this arena a year and a half ago. What is a bit frustrating to me is that ideas like these aren't laid out in an easy-to-understand fashion and placed in easy-to-find places. I have spent much of the last year and a half trying to understand how the whole TDWG/DwC universe is supposed to fit together. I think that the idea of having the Google Code site where there are explanations and examples for the various DwC terms is the kind of thing we need. Unfortunately, most of the terms do not yet have entries there. Perhaps I'm just impatient. If it turns out that any of the summaries that I've written here accurately reflect any kind of consensus, then maybe someone could "clean them up" (i.e. use correct technical terms after giving definitions of what they mean) and paste them somewhere where people can find them. That would prevent another person 10 years from now re-articulating the same ideas a third time. I'm particularly thinking of the summary diagram http://bioimages.vanderbilt.edu/pages/token-explicit.gif along with an explanation of how people use the more normalized and more flattened versions of it. We already do have quite lucid examples in the Simple Darwin Core (flattened) and Darwin Core XML guide (normalized), but some sort of overview of the big picture might be helpful. If an RDF guide ever gets off the ground, that would be another example of how the relationships assumed in DwC are expressed in a very explicit way.
So...I see the current iteration of DwC as another step in the evolution of moving from "sharing and aggregating specimen data among museums" to "documenting biodiversity in nature". It's not all the way into the fully normalized representation of biodiversity data, but it's far enough that it is a nice compromise between practical and effective for the majority of the user constituency. In my mind, the next logical step in this evolutionary trajectory would be to recognize "Individual" as a class (which DwC is already primed for, via individualID).
I think I understand the message that you are trying to convey above and in your later comments about creating new versions of DwC (or new evolutionary states of DwC) that don't break the previous ones. I think that is one reason why the process of examining and clearly articulating the community consensus on what Darwin Core terms and classes "mean" and how they are connected to each other is so important before we embark on implementing GUIDs and RDF. Pete has suggested that we may need a second version of DwC in order to make it work in the Linked Open Data world and he's probably right. I'm not sure that the existing vocabulary has all of the terms we need to do that. However, if we are going to "evolve" Darwin Core so that it will work in the LOD world, I hope that we do it in such a way that we maintain the same "meaning" of things as Darwin Core 1.0. I think that is the way to maintain the kind of "stability" that you described below.
Unlike specimens where the token's metadata terms are placed in the Occurrence class, I guess in the case of an image one is supposed to use associatedMedia to link the so-called MachineObservation to the image record. If DNA were extracted, one would link the sequence to the Occurrence using associatedSequences (although it's not clear to me what the basisOfRecord for that would be - "TookATissueSample"?). But what does one do for other kinds of tokens, like seeds or tissue samples - create terms like associatedSeed and associatedTissueSample?
In my mind, things like seeds, tissue samples, and DNA sequences are simply different kinds of specimens (just like dried skeletons vs. botanical pressed sheets vs. whole organisms in jars of alcohol vs. prepared skins, etc.) They may have certain properties specific to each subclass of specimen, but fundamentally I think it's fair to treat them as specimens. DNA sequences are a bit different, of course, because they are not the "stuff" of an organism, but rather an indirect representation of the "stuff". In my mind, that difference justifies associatedSequences, where we don't have associatedSeeds, associatedTeeth, associatedSkins, associatedSkeletons, etc.
Your point is well taken in that we don't need a proliferation of types of associated tokens. We need as many different token "types" as we have coherent sets of metadata terms. One of the points of typing resources is to let potential users know what kinds of metadata properties (terms) they can reasonably expect to receive about that resource. If one will receive the same set of properties about two kinds of resources (e.g. skins and skeletons), there is no reason to type them differently. The point that I was trying to get at (eventually) was that it was inconsistent to say that images need to be referenced as associatedMedia and sequences needed to be referenced as associatedSequences, and yet not say that specimens needed to be referenced as "associatedSpecimens". I actually think that based on Roger's explanation of "to subclass or not" (http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot), it makes more sense to talk about using a generic "hasToken" or "tokenID" along with "tagging" the token using rdf:type (as I suggested toward the end of my "treatise") rather than a bunch of associatedXXXX terms.
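As a rough illustration of the generic-link idea, here is a Python sketch that stores statements as plain (subject, predicate, object) tuples: one "hasToken" predicate connects an Occurrence to every token, and a separate type statement says what kind of thing each token is. "hasToken", the identifiers, and the type labels are hypothetical names used only for this sketch; none of them are existing Darwin Core terms.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical statements: "hasToken", the identifiers, and the type
# labels are assumptions for illustration only.
triples = [
    ("occ-1", "hasToken", "img-1"),
    ("occ-1", "hasToken", "seq-1"),
    ("occ-1", "hasToken", "spec-1"),
    ("img-1", RDF_TYPE, "StillImage"),
    ("seq-1", RDF_TYPE, "Sequence"),
    ("spec-1", RDF_TYPE, "PreservedSpecimen"),
]


def tokens_of(occurrence, statements):
    """Map each token linked to an Occurrence onto its declared type (or None)."""
    types = {s: o for s, p, o in statements if p == RDF_TYPE}
    return {
        o: types.get(o)
        for s, p, o in statements
        if s == occurrence and p == "hasToken"
    }


linked = tokens_of("occ-1", triples)
```

Adding a new kind of token (a seed, a tissue sample) then needs only a new type label, not a new associatedXXXX term.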
If we accept the explicit token model, then as a biodiversity informatics resource type "observation" will have to disappear into a puff of nothingness
Not necessarily. See my comment earlier about patterns on neurons in a human brain that constitute a memory. Just as a digital image rendered on a hard disk requires certain machinery to convert into photons that strike our retinas (i.e., a computer and monitor), so too does a memory require such machinery (e.g., the brain itself, transmission of sound waves via vocal cords, sound waves striking eardrums, etc.) This may sound weird, but I'm being serious: a human memory is, fundamentally, every bit as much of a "token" as a specimen or a digital image. It's just considerably less accessible and well-resolved.
I guess I'm thinking about this in terms of a token being something to which we can assign an identifier and retrieve a representation (a la representational state transfer). Although I don't deny the existence of memory patterns in neurons that are associated with a HumanObservation, there isn't any way that we can receive a representation of that memory directly. If the person draws a sketch of what he/she remembers, then we have a media item that we can convert into a digital form and transmit through the Internet (a token). If the person types up notes, then we have a text document (a token that can also be delivered as a digital file or scan of typewritten page). On the other hand, if the person simply records the values of recordedBy, eventDate, and Location terms, then we have only Occurrence metadata (no token). If someone claims "basisOfRecord=HumanObservation" and has no token of any kind, then what is there that is deliverable other than the basic Occurrence metadata? That's why I'm claiming that basisOfRecord=HumanObservation simply corresponds to an Occurrence record with no token.
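The claim above can be illustrated with a minimal sketch. The class names (Token, Occurrence), the function basis_of_record, and the identifiers are hypothetical and not part of Darwin Core; the only idea taken from the discussion is that an Occurrence with no deliverable token is what basisOfRecord=HumanObservation amounts to.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    """A deliverable piece of evidence (hypothetical class, not a DwC class)."""
    token_id: str    # identifier from which a representation can be retrieved
    token_type: str  # e.g. "PreservedSpecimen", "StillImage", "Text"


@dataclass
class Occurrence:
    """Occurrence metadata plus zero or more explicit tokens."""
    occurrence_id: str
    recorded_by: str
    event_date: str
    tokens: List[Token] = field(default_factory=list)


def basis_of_record(occ: Occurrence) -> List[str]:
    """Derive basisOfRecord-like values from the attached tokens.

    Assumption under discussion: an Occurrence with no token at all
    corresponds to basisOfRecord=HumanObservation.
    """
    if not occ.tokens:
        return ["HumanObservation"]
    return sorted({t.token_type for t in occ.tokens})


# Only recordedBy, eventDate, etc. were recorded: metadata, no token.
sighting = Occurrence("occ-1", "A. Observer", "2010-10-24")

# The observer also drew a sketch that was scanned: one media token.
sketched = Occurrence("occ-2", "A. Observer", "2010-10-24",
                      tokens=[Token("img-9", "StillImage")])
```

In this sketch basis_of_record(sighting) yields only "HumanObservation", while the second record is typed by its token rather than by a separate kind of Occurrence.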
Realistically, I can't see this kind of separation ever happening, given the amount of trouble it's been just to get a few people to admit that Individuals exist.
I don't think the issue was ever in convincing people that Individuals exist -- that much, I think, was clear to everyone (as proof: see dwc:individualID). The issue was always more about where the current DwC should lie on the scale of highly flattened (e.g., DwC 1.0) to highly normalized (e.g., ABCD and CDM). It's necessarily a compromise between modelling the information "as it really is", vs. modelling the information in a way that's both accessible to the majority of content providers, and useful to the majority of content consumers. I think we both understand what the trade-offs are in either direction. The question is, what is the "sweet spot" for the majority of our community at this time in history?
I would venture that at the time DwC 1.0 was developed, that hit the sweet spot reasonably well. As more content holders develop increasingly sophisticated DBMS for their content, and as the user community delves into increasingly sophisticated analyses of the data, the "sweet spot" will shift from the flattened end of the scale to the normalized end of the scale. And, I would hope, DwC will evolve accordingly.
It is just too hard to get motion to happen in the TDWG community.
People make the same complaint about another organization that I'm involved with (ICZN). But here's the thing: as in the case of nomenclature, stability in itself can be a very important thing. If DwC changed every six months, then by the time people developed software apps to work with it, those apps would already be obsolete. If someone writes code that consumes DwC content as expressed in the current version of DwC, then that code may break if people start providing content with class:individual and class:token content. If our community is going to move forward successfully, I think standards like DwC need to evolve in a punctuated way, rather than a gradualist way (same goes for the Codes of nomenclature). That is, a bit of inertia in the system is probably a good thing.
OK, I've now gone on for eight pages of text explaining the rationale behind the question. So I'll return to the basic question: is the consensus for modeling the relationship between an Occurrence and associated token(s) the assumed token model:
http://bioimages.vanderbilt.edu/pages/token-assumed.gif or the explicit token model: http://bioimages.vanderbilt.edu/pages/token-explicit.gif ?
Here's how I would answer: When modelling my own databases, tracking my own content, I would *definitely* (and indeed already have, for a long time now) go with the token-explicit.
But when deciding on a community data exchange standard (i.e., DwC), compromise between flat and normalized is still a necessity, and as such, the answer in terms of modifying DwC needs to take into account the form of the bulk of the existing content, the needs of the bulk of the existing users/consumers, and the virtues of stability of Standards in a world where software app development time stretches for months or years.
Maybe the answer to this is to treat different versions of DwC as concurrent, rather than serial. That is, as long as the next most sophisticated version can easily be "collapsed" to all previous versions (aka, backward compatibility), then maybe we just need a clear mechanism for consuming applications to indicate desired DwC version. That way, apps developed to work with v2.1 can indicate to a provider that is capable of producing v3.6 content, that they want it in v2.1 format. Assuming we maintain backward compatibility (i.e., the more-normalized version can be easily collapsed to the more flattened version), then it should be a very simple matter for the content provider to stream the same content in v2.1 format.
Yes, I agree about this concept. I think that what I'm really advocating for is that we agree on what the most normalized model is that will connect all of the existing Darwin Core classes and terms. In that sense, when I'm asking for Individual to be accepted as a class, I'm not arguing for a "new" thing, I'm arguing for a clarification of what we mean when we use the existing term dwc:individualID. When I'm asking for terms to facilitate a logically consistent way to connect Occurrences with their tokens, I'm also not really asking for an expansion of Darwin Core, I'm asking for a more consistent model than "subclassing" by using associatedMedia and associatedSequences but not using "associatedSpecimens". I think that this is important because if we don't agree on these things, we are going to have a royal mess on our hands if we start trying to develop an RDF guide for Darwin Core. As an eternal optimist, I think that describing a fully normalized model that can be translated into RDF can be achieved with only a few minor additions to the existing terms as opposed to requiring a complete new version. If we really need to completely rewrite Darwin Core for RDF I don't have any delusions that it will be accomplished before I retire.
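The idea of "collapsing" a more normalized representation into a flatter one can be sketched roughly as follows. This is only an illustration under stated assumptions: the function collapse, the dictionary layout, the token type labels, and the " | " list separator are all hypothetical, and no actual DwC version negotiation mechanism is implied. Only the term names (occurrenceID, individualID, associatedMedia, associatedSequences) come from Darwin Core.

```python
from typing import Dict, List

LIST_SEPARATOR = " | "  # assumed delimiter for multi-valued flat fields


def collapse(occurrence: Dict[str, str],
             individual_id: str,
             tokens: List[Dict[str, str]]) -> Dict[str, str]:
    """Collapse a normalized Occurrence/Individual/token set into one flat record."""
    flat = dict(occurrence)               # copy the Occurrence-level terms
    flat["individualID"] = individual_id  # the Individual becomes a single ID field

    # Each kind of token is squeezed into the matching associatedXXXX term.
    media = [t["id"] for t in tokens if t["type"] == "StillImage"]
    sequences = [t["id"] for t in tokens if t["type"] == "Sequence"]
    if media:
        flat["associatedMedia"] = LIST_SEPARATOR.join(media)
    if sequences:
        flat["associatedSequences"] = LIST_SEPARATOR.join(sequences)
    return flat


flat_record = collapse(
    {"occurrenceID": "occ-1", "recordedBy": "A. Observer", "eventDate": "2010-10-24"},
    "ind-7",
    [{"id": "img-1", "type": "StillImage"},
     {"id": "img-2", "type": "StillImage"},
     {"id": "seq-1", "type": "Sequence"}],
)
```

The collapse is lossy in one direction only: the flat record can always be produced from the normalized one, but the token-level metadata cannot be recovered from the flat record, which is the sense in which the normalized model would need to be agreed on first.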
But now I'm dabbling in areas that are WAY outside my scope of expertise...
Anyway...I would reiterate that I, for one, appreciate that you took the time to write all this down (took me over 3 hours to read & respond -- so obviously I care! -- of course, I'm waiting for a taxi to go to the airport, so really not much else for me to do right now). If I didn't reply to parts of your message, it was either because I agreed with you and had nothing to elaborate or expound upon, or I didn't really understand (e.g., all the rdf stuff).
Again, thanks for taking the time to read and comment.
Steve
Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
Dear Moderator,
Could you please unsubscribe me from all TDWG mailing lists? Unfortunately, I have been flooded with TDWG emails, against my wishes, since 8th October. Most of the subjects are very interesting, but I have my job to do.
Yours, Yuri
---------------------------------------------------------------------- Dr. Yury Roskov Catalogue of Life Executive Editor School of Biological Sciences The Harborne Building The University of Reading Reading, RG6 6AS, UK
Tel. +44 (0) 118 378 6466 Fax +44 (0) 118 378 8160 E-mail: y.roskov@reading.ac.uk
www.sp2000.org, www.catalogueoflife.org EC projects: www.4d4life.eu, www.i4life.eu ----------------------------------------------------------------------
----- Original Message ----- From: Steve Baskauf Cc: tdwg-content@lists.tdwg.org Sent: Wednesday, October 27, 2010 11:42 AM Subject: Re: [tdwg-content] Treatise on Occurrence, tokens, and basisOfRecord
Please note that in various examples, I have incorrectly placed rdf:type in the namespace rdfs: (http://www.w3.org/2000/01/rdf-schema#) rather than rdf: (http://www.w3.org/1999/02/22-rdf-syntax-ns#). Thanks to Bob for pointing out this serious error.
Also, the ACS model information is very cool. I wish I'd seen it a long time ago. I especially like the giant relationship chart. Thanks Stan and Rich. Steve
Steve Baskauf wrote: Rich, Thanks for taking the time to read the whole thing. Based on the first series of comments you made, it seems as though we are in agreement on most points. I think that what I wrote was (as I had anticipated) somewhat less clear due to my use of (or failure to use) some appropriate terms to describe what I was talking about. For example, when I said "atomized" I probably should have said something like "fine-grained" and correct use of the term "normalized" would have helped. Some other comments inline:
Richard Pyle wrote:
I believe that historically the assumed token model has been the one which most people have had in mind.
Actually, I've always envisioned it as you have in your token-explicit version (and have said as much at various meetings to discuss DwC, going back to 1.0). In fact, I remember discussing this exact issue with Stan Blum long before DwC existed (he was the first to suggest to me the term "evidence" in this context -- which I think is functionally equivalent to your "token"). However, I've conceded that this level of normalization would probably be too much for the intended purpose of the DwC terms. But I'll keep an open mind on that.
Before the new DwC standard, we had specimens and we had observations. In order to avoid redundancies in terms for those two types of "things", a combined "thing" called "Occurrence" was created. An Occurrence that was an observation didn't have a token and an Occurrence that was a specimen had a physical or living specimen as its token.
My rationalization of it in the early days (pre-DwC) was that *everything* was effectively an observation, and beyond that, the only question was a matter of evidence. In my earliest models, I categorized "evidence" into "Specimen", "Image", "Literature Report", and "Unvouchered Observation" (I was using the word "voucher" in the general sense, as in the verb "to vouch" -- not in the more specific sense for our community, which implies "Specimen preserved in Museum"). My read on the history of DwC is that it was initially established as a means to aggregate and/or share Specimen data amongst Museums (hence its Specimen-centric nature). Later, the Specimen/Observation dichotomy was introduced to allow DwC content to provide more sophisticated and complete representations of the occurrence of organisms in place and time, because there was much more information than what existed as specimens in Museums. In my mind, the "Observation" side was effectively a collapsing of my "Image", "Literature Report" and "Unvouchered Observation" -- which I was OK with in the context of the time. Because at the time, the vast majority of content available in computer databases came from museum specimen databases, and from observational databases (largely in the bird realm).
Well, I'm not surprised that the ideas that I'm trying to put down in words and diagrams predate my entry into this arena a year and a half ago. What is a bit frustrating to me is that ideas like these aren't laid out in an easy-to-understand fashion and placed in easy-to-find places. I have spent much of that last year and a half trying to understand how the whole TDWG/DwC universe is supposed to fit together. I think that the idea of having the Google Code site where there are explanations and examples for the various DwC terms is the kind of thing we need. Unfortunately, most of the terms do not yet have entries there. Perhaps I'm just impatient.
If it turns out that any of the summaries that I've written here accurately reflect any kind of consensus, then maybe someone could "clean them up" (i.e. use correct technical terms after giving definitions of what they mean) and paste them somewhere where people can find them. That would prevent another person 10 years from now re-articulating the same ideas a third time. I'm particularly thinking of the summary diagram http://bioimages.vanderbilt.edu/pages/token-explicit.gif along with an explanation of how people use the more normalized and more flattened versions of it. We already do have quite lucid examples in the Simple Darwin Core (flattened) and Darwin Core XML guide (normalized), but some sort of overview of the big picture might be helpful. If an RDF guide ever gets off the ground, that would be another example of how the relationships assumed in DwC are expressed in a very explicit way.
So...I see the current iteration of DwC as another step in the evolution of moving from "sharing and aggregating specimen data among museums" to "documenting biodiversity in nature". It's not all the way into the fully normalized representation of biodiversity data, but it's far enough that it is a nice compromise between practical and effective for the majority of the user constituency. In my mind, the next logical step in this evolutionary trajectory would be to recognize "Individual" as a class (which DwC is already primed for, via individualID).
I think I understand the message that you are trying to convey above and in your later comments about creating new versions of DwC (or new evolutionary states of DwC) that don't break the previous ones. I think that is one reason why the process of examining and clearly articulating the community consensus on what Darwin Core terms and classes "mean" and how they are connected to each other is so important before we embark on implementing GUIDs and RDF. Pete has suggested that we may need a second version of DwC in order to make it work in the Linked Open Data world and he's probably right. I'm not sure that the existing vocabulary has all of the terms we need to do that. However, if we are going to "evolve" Darwin Core so that it will work in the LOD world, I hope that we do it in such a way that we maintain the same "meaning" of things as Darwin Core 1.0. I think that is the way to maintain the kind of "stability" that you described below.
On 24/10/2010, at 10:02 AM, Steve Baskauf wrote:
Bob has warned us about the dangers of asserting that a term always applies to a certain type of resource by asserting that the term has an rdfs:domain. However, we should not avoid attempting to assert that a resource is itself of a certain type. Describing the "type" of a resource is an important part of letting potential users assess the possible fitness of use of that resource.
Absolutely. Removing the domain and range specifiers from the properties does rather take the "semantic" out of "the semantic web". The difficulty, however, is getting them right.
For instance: at the moment I am attempting to apply the DwC properties to our data at biodiversity.org.au. Our data has a fairly strict distinction between a name and a taxon. Taxon http://biodiversity.org.au/apni.taxon/54321 has name http://biodiversity.org.au/apni.taxon/2422 . To mark up these entities using the DwC properties, I would want to add scientificNameID, nameAccordingToID, and higherTaxonConceptID to the taxon record, and acceptedNameUsageID, namePublishedInID, and originalNameUsageID to the name record.
Now ... arguably the name record by itself can be taken as being the "nominal" taxon concept. But that's not really what our data means. As it is, I can add the properties without asserting that the name is a taxon.
However, I can think of two approaches to adding the domain. On the one hand, you could simply wear the implication. Yes indeed: an APNI "name" all by itself is indeed what DwC means by the word "taxon". Alternatively, the vocabulary could be corrected. A third approach: declaring my own property and stating that it is a superproperty of DwC:acceptedNameUsageID, seems like rather too much work.
By the way: is this the correct place to complain about missing bits and pieces in the DwC vocabulary?
------ If you have received this transmission in error please notify us immediately by return e-mail and delete all copies. If this e-mail or any attachments have been sent to you in error, that error does not constitute waiver of any confidentiality, privilege or copyright in respect of information in the e-mail or attachments.
Please consider the environment before printing this email.
------
Steve misinterprets me. My warning is not that a term always applies to a certain type of resource by asserting that the term has an rdfs:domain. It is that if a term is given an rdfs:domain, then that term \only/ applies to a resource that is of rdf:type that domain. That is the formal semantics of rdfs:domain. It is somewhat the opposite of the usual meaning of "domain" as used by mathematicians. My position is that it should not be done without substantial thought, because it closes the world somewhat, and the open world assumption is a hallmark of RDF.
Bob Morris
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
On 25/10/2010, at 3:12 PM, Bob Morris wrote:
Steve misinterprets me. My warning is not that a term always applies to a certain type of resource by asserting that the term has an rdfs:domain. It is that if a term is given an rdfs:domain, then that term \only/ applies to a resource that is of rdf:type that domain. That is the formal semantics of rdfs:domain. It is somewhat the opposite of the usual meaning of "domain" as used by mathematicians. My position is that it should not be done without substantial thought, because it closes the world somewhat, and the open world assumption is a hallmark of RDF.
And because the meaning is somewhat the opposite of what you expect, it's easy to say something you didn't mean. At the risk of teaching my grandmother to suck eggs, in OWL:
If a property P is declared to have a domain T:
* It is *not* the case that every instance x of T *must* have a value for property P.
* It is *not* the case that any x having property P *must* be declared as having type T.
* It is *not* the case that any x having property P *must not* be declared as having any other type.
According to the spec: "9.2.5 Object Property Domain: ObjectPropertyDomain( OPE CE ) states that ... if an individual x is connected by OPE with some other individual, then x is an instance of CE."
That is: "P domain T" and "x P whatever" -> "x hasType T", in addition to any other types x might also have.
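The rule just stated can be mimicked in a few lines of Python. This is a toy forward-chaining sketch, not an OWL reasoner; the function name infer_types, the tuple representation of triples, and the sample data are all hypothetical, with nameComplete and the classes Name and Specimen borrowed from this thread for illustration.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def infer_types(triples: List[Triple],
                domains: Dict[str, str]) -> Dict[str, Set[str]]:
    """Apply the domain rule: "P domain T" and "x P whatever" -> "x hasType T"."""
    types: Dict[str, Set[str]] = {}
    for subject, prop, obj in triples:
        if prop == "rdf:type":
            # Explicitly asserted types are kept as they are.
            types.setdefault(subject, set()).add(obj)
        elif prop in domains:
            # Using the property is enough to make the subject a T,
            # in addition to any other types it already has.
            types.setdefault(subject, set()).add(domains[prop])
    return types


domains = {"nameComplete": "Name"}
triples = [
    ("specimen1", "rdf:type", "Specimen"),
    ("specimen1", "nameComplete", "Quercus alba L."),
]
inferred = infer_types(triples, domains)
```

Nothing is rejected here: specimen1 simply ends up typed as both a Specimen and a Name, which is the counterintuitive part.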
Very much the opposite of what people accustomed to computer languages expect: "int x;" normally means you cannot put a string into x. In OWL it means that if you put a string into x, then that string is an int. My first reaction was: what's the use of that? Everything would eventually get typed as everything! The important bit is that your vocabulary can also declare semantic rules. You can declare:
* that every instance x of T *must* have a value for property P: SubClassOf(T ObjectMinCardinality(1 P))
* that any x having property P *must not* be declared as some other type T2: SubClassOf(ObjectMinCardinality(1 P) ObjectComplementOf(T2))
You can state that every Name has a nameComplete property, or that a Name is never a Specimen. If you do this, and someone uses your classes and properties wrongly, then a reasoner given that ontology (a set of RDF files and their imports closure) will conclude that it is *inconsistent*.
Without these extra rules, people using properties in a way that seems natural to them can inadvertently say things they probably didn't mean. If you declare that nameComplete has a domain of Name, then someone using it to attach a name to a specimen is not accomplishing what they thought they were accomplishing: they have also asserted that the specimen is a Name. ("But I just wanted to say that this is the complete name of the specimen!")
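To make the nameComplete case concrete, here is a minimal sketch in Python of the domain rule applied to a handful of triples. It is not a real reasoner, and the "ex:" names and the specimen identifier are made up for illustration; only the rule itself ("P domain T" and "x P whatever" gives "x hasType T") comes from the discussion above.

```python
# Minimal sketch of the rdfs:domain entailment rule over plain tuples.
# All "ex:" identifiers are hypothetical; this is not a full OWL/RDFS reasoner.
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def apply_domain_rule(triples):
    """Return the input triples plus every rdf:type triple entailed by rdfs:domain."""
    triples = set(triples)
    # Collect (property, class) pairs declared with rdfs:domain.
    domains = {(s, o) for (s, p, o) in triples if p == RDFS_DOMAIN}
    # Any subject that uses the property is inferred to be of the domain class.
    inferred = {
        (s, RDF_TYPE, cls)
        for (s, p, _o) in triples
        for (prop, cls) in domains
        if p == prop
    }
    return triples | inferred


graph = {
    ("ex:nameComplete", RDFS_DOMAIN, "ex:Name"),
    ("ex:specimen42", RDF_TYPE, "ex:Specimen"),
    ("ex:specimen42", "ex:nameComplete", "Quercus alba L."),
}
closed = apply_domain_rule(graph)
for triple in sorted(closed):
    print(triple)
```

The specimen keeps its Specimen type and silently gains the Name type as well. Nothing is rejected unless the vocabulary also says that a Name is never a Specimen.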
But what's the alternative? People are always going to misinterpret things unless the property names are entire sentences. Perhaps a "strict" vocabulary and a "loose" vocabulary, with identically-named properties, the strict ones being declared as subproperties of the loose ones. Or maybe at the end of the day it is simply correct to reply "No, you ought not use properties from the TDWG TaxonName namespace to decorate Specimen records. Sorry. You'll need to choose something else appropriate.". There's nothing at all to stop anyone from declaring their own vocabularies.
I suppose the point is:
* these sorts of issues don't really become issues at the mechanical layer until someone starts to write reasoning rules.
* a hasDomain declaration *is* an OWL reasoning rule; it isn't just an annotation or a comment, and you can't ignore the implications of doing it.
* OWL perhaps provides a concrete formalism for some of the discussions here, well beyond simply being able to declare properties and annotate them.
* that formalism is a little tricky and can be counterintuitive.
=================================
Here's a simple example (using the RDF serialisation) that can be read into Protege. The reasoner likes it, and deduces that x is a T. If you comment out the indicated line, the reasoner concludes that Nothing is a Thing (you can deduce anything from an inconsistent ontology).
<?xml version="1.0"?>
<rdf:RDF xmlns="http://foo#"
    xml:base="http://foo"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <owl:ObjectProperty rdf:about="#P1">
    <rdfs:domain rdf:resource="#T"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:about="#P2"/>
  <owl:Class rdf:about="#T">
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#P2"/>
        <owl:cardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#nonNegativeInteger">1</owl:cardinality>
      </owl:Restriction>
    </rdfs:subClassOf>
  </owl:Class>
  <owl:Thing rdf:about="#x">
    <P1 rdf:resource="#y"/>
    <P2 rdf:resource="#y"/> <!-- comment out this line -->
  </owl:Thing>
  <owl:Thing rdf:about="#y"/>
</rdf:RDF>
------ If you have received this transmission in error please notify us immediately by return e-mail and delete all copies. If this e-mail or any attachments have been sent to you in error, that error does not constitute waiver of any confidentiality, privilege or copyright in respect of information in the e-mail or attachments.
Please consider the environment before printing this email.
------
Dear Steve,
Thanks for this clear and compelling argument in favor of Occurrences being different from the tokens created in their documentation.
So I'll return to the basic question: is the consensus for modeling the relationship between an Occurrence and associated token(s) the assumed token model: ... or the explicit token model
Having no long personal history of use of Occurrence, and with respect for the huge amount of work that crafting the DwC terms must have taken, but having tried semantic modeling (in a previous post) using the overloaded term Occurrence, I for one vote for the latter, as conceptually clearer. A specimen is then a Specimen, an image an Image, and so on.
But then what exactly are the Occurrences themselves? From Richard Pyle:
``So, an Occurrence is the intersection of an Individual and an Event. An Event is a Location+Time[+other metadata]. Each Event may have multiple Occurrences (i.e., one for each distinct Individual at the same Location+Time). Also, an Individual may have multiple Occurrences (one for each Event at which the same Individual was documented).''
So the Occurrence is the Individual _itself_ bounded by space and time, the latter data currently recorded in the Event class. What I then want to ask is, 1. do the terms for clearly defining the bounds of the Occurrence already exist? There exist terms for spatial uncertainty: dwc:coordinateUncertaintyInMeters, and coarse ones for temporal bounds: startDayOfYear + endDayOfYear, but not for temporal uncertainty, or spatial bounds (but see Pete's http://lod.taxonconcept.org/ontology/dwc_area.owl). Also, 2. if there was a consensus for moving to the `explicit token' model, should the space-time bounds of the Occurrence still be contained in an associated (often blank) Event, or accepted as properties of the Occurrence itself (e.g., occurrenceDate, occurrenceDuration, occurrenceLocation, occurrenceRadius)? I would support the latter.
Finally, 3. if there was a consensus for moving to the `explicit token' model, and a human observation was a token-less Occurrence, would we best specify who made the observation with dwc:recordedBy and what the observation was with dwc:occurrenceRemarks, or would it be better to create a second new token (along with `Physical specimens') that was an explicit Observation class, that would link explicitly to, say, an external observational ontology (i.e., OBOE)? The issue of GUIDs for non-physical observations comes up, but this could still be solved in various ways.
Stepping back from the details for a moment, and reading some of the replies to Steve's post that have come in, I am wondering how many readers are thinking, ``the need for a semantic web standard for biodiversity information might be better achieved by a deep fork of Darwin Core, adopting new Classes and explicit domains and ranges for each term, to create a `Darwin SW,' rather than by an effort to evolve Darwin Core itself.'' I'm sure the question of forking Darwin Core has come up before, and I'm sure the discussion was passionate!
Best,
Cam
Hi Cam,
> What I then want to ask is, 1. do the terms for clearly defining the bounds of the Occurrence already exist? There exist terms for spatial uncertainty: dwc:coordinateUncertaintyInMeters, and coarse ones for temporal bounds: startDayOfYear + endDayOfYear, but not for temporal uncertainty, or spatial bounds (but see Pete's http://lod.taxonconcept.org/ontology/dwc_area.owl).
The question of temporal uncertainty is an excellent one, and after years of struggling, I have no "elegant" solution ("elegant" in this case being a high degree of capturing what we want to capture, with a minimal set of attributes in a simple structure). The problem is that given a range represented by startDate and endDate (how I handle it in my database), the interpretation may be any of the following:
- Event occurred at a singular (imprecisely known) point between startDate and endDate
- Event occurred at multiple points between startDate and endDate
- Event occurred continuously beginning startDate and ending endDate
To overcome this, I added two additional fields:
verbatimDate: Used for historical datasets to record the verbatim date information (useful for things like "Summer 1984")
dateRemarks: Some sort of text description of what is meant by the startDate and endDate values
Of course, neither of these date qualifiers is of much use in the semantic context. So we may need some sort of controlled vocabulary for "dateRangeQualifier" or something, that indicates how to interpret a date range given only dateStart and dateEnd.
Another solution would be something analogous to my approach for handling the second part of your question: spatial bounds.
In my universe of datasets, I have lat/lon coordinates that fall into one of several types:
- single point with uncertainty
- two points representing a transect line (commonly used for plankton tows)
- two points representing two corners of a bounding box
- series of multiple points representing a non-straight line (e.g., a river, road, or a non-straight survey path)
- series of multiple points representing a polygon
In addition to the fact that there are 1...n points to represent the bounding of a place, there may be multiple re-interpretations of a point or set of points, when retroactively georeferencing locality data.
So...the model I came up with looks something like this (using my same ASCII-art notation as before)
Location--<CoordinateSet--<Coordinate
                |
        CoordinateSetType
A Coordinate minimally consists of a decimalLatitude, decimalLongitude, and Sequence.
CoordinateSetType is a controlled vocabulary that defines the five types listed above (point, transect, boundingBox, line, polygon).
Each CoordinateSet consists of 1, 2, or >2 Coordinates, depending on the CoordinateSetType.
Attached to "CoordinateSet" is all the MaNIS-style metadata for who/when/how the coordinate was derived.
The reason for the 1:M Location:CoordinateSet is to allow for multiple interpretations of retroactively established coordinates (e.g., following the MaNIS protocol).
Whether or not things like Datum and Uncertainty are attached to CoordinateSet or Coordinate depends on how much flexibility you need for capturing heterogeneous Datum or Uncertainty values within a particular set (e.g., if certain nodes on a polygon are more precise than other nodes). I would definitely put Datum on CoordinateSet, and probably also put uncertainty there as well (which assumes that the same datum applies to each point in a set, and also that uncertainty is consistent for each point in a set).
This pretty much covers all spatial bounding protocols, and it's not too difficult to derive a "point" coordinate from any of the other four, for purposes of "dumbing the data down" to fit into DwC. There are some problems, but they are not important enough to go into now.
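For readers who think more easily in code than in ASCII art, the Location / CoordinateSet / Coordinate structure might be sketched roughly as follows in Python. This is only an illustration of the model as described, not the actual database schema: the field names, the default datum, and the example values are all assumptions, and "line" and "polygon" are taken to need more than two points, following the "1, 2, or >2" wording.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class CoordinateSetType(Enum):
    POINT = "point"
    TRANSECT = "transect"
    BOUNDING_BOX = "boundingBox"
    LINE = "line"
    POLYGON = "polygon"


@dataclass(frozen=True)
class Coordinate:
    decimal_latitude: float
    decimal_longitude: float
    sequence: int  # position of this point within its set


@dataclass
class CoordinateSet:
    set_type: CoordinateSetType
    coordinates: List[Coordinate]
    datum: str = "WGS84"                       # assumed: one datum per set
    uncertainty_in_meters: Optional[float] = None  # assumed: one value per set
    georeferenced_by: Optional[str] = None     # stand-in for who/when/how metadata

    def __post_init__(self):
        n = len(self.coordinates)
        if self.set_type is CoordinateSetType.POINT:
            ok = n == 1
        elif self.set_type in (CoordinateSetType.TRANSECT, CoordinateSetType.BOUNDING_BOX):
            ok = n == 2
        else:  # LINE or POLYGON
            ok = n > 2
        if not ok:
            raise ValueError(f"{self.set_type.value} cannot have {n} coordinate(s)")
        # Keep the points in their declared order.
        self.coordinates = sorted(self.coordinates, key=lambda c: c.sequence)


@dataclass
class Location:
    location_id: str
    # 1:M, so that several georeferencing interpretations can coexist.
    coordinate_sets: List[CoordinateSet] = field(default_factory=list)


loc = Location("loc-1")
loc.coordinate_sets.append(
    CoordinateSet(CoordinateSetType.POINT, [Coordinate(21.3, -157.8, 1)], uncertainty_in_meters=500.0)
)
loc.coordinate_sets.append(
    CoordinateSet(
        CoordinateSetType.TRANSECT,
        [Coordinate(21.4, -157.7, 2), Coordinate(21.3, -157.8, 1)],
        georeferenced_by="second interpretation",
    )
)
```

Putting datum and uncertainty on CoordinateSet rather than on Coordinate reflects the preference stated above; moving them to Coordinate would allow per-node precision at the cost of a heavier structure.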
But the point is, you could conceivably handle dates using a similar structure -- but the cases where parsing the precise date information is important are so few that we probably don't need a semantic structure for it, and can simply capture it in a human-readable dateRemarks.
Now... obviously this is all in data modelling space, not necessarily DwC-space. But I think it is useful to discuss how databases capture this kind of information at the source when trying to figure out how best to simplify it for a content exchange protocol.
> Also, 2. if there was a consensus for moving to the `explicit token' model, should the space-time bounds of the Occurrence still be contained in an associated (often blank) Event, or accepted as properties of the Occurrence itself (e.g., occurrenceDate, occurrenceDuration, occurrenceLocation, occurrenceRadius)? I would support the latter.
I would support the former. I'm not sure I understand why Event is "often blank". If there is any space-time information, then Event is not blank. In the context of the data I manage, it makes much more sense (in a DwC context) to capture Events and Locations as distinct classes, than representing multiple tokens for the same Occurrence. Even if we don't establish an individual class, dwc:individualID within the Occurrence class allows us to deal with both the "same-organism-at-multiple-events" situation, and the "multiple-tokens-for-same-organism-at-same-event" situation.
> Finally, 3. if there was a consensus for moving to the `explicit token' model, and a human observation was a token-less Occurrence, would we best specify who made the observation with dwc:recordedBy and what the observation was with dwc:occurrenceRemarks, or would it be better to create a second new token (along with `Physical specimens') that was an explicit Observation class, that would link explicitly to, say, an external observational ontology (i.e., OBOE)? The issue of GUIDs for non-physical observations comes up, but this could still be solved in various ways.
I would favor at the very least a "place-holder" or "implied" token for a human observation. It's functionally analogous to the situation where a photo was taken, but then accidentally destroyed or lost. The only difference between an image and a memory is that the image is generally more durable, and is more easily and precisely conveyed from person to person.
> Stepping back from the details for a moment, and reading some of the replies to Steve's post that have come in, I am wondering how many readers are thinking, ``the need for a semantic web standard for biodiversity information might be better achieved by a deep fork of Darwin Core, adopting new Classes and explicit domains and ranges for each term, to create a `Darwin SW,' rather than by an effort to evolve Darwin Core itself.'' I'm sure the question of forking Darwin Core has come up before, and I'm sure the discussion was passionate!
To the extent that I understand both DwC and the semantic web, this seems to me to be the most parsimonious approach.
Aloha, Rich
> - Event occurred at a singular (imprecisely known) point between startDate and endDate
> - Event occurred at multiple points between startDate and endDate
> - Event occurred continuously beginning startDate and ending endDate
Has anyone seen http://www.w3.org/TR/owl-time/ ? There's a page of links here http://www.isi.edu/~hobbs/owl-time.html . Perhaps the way to go is contributing to that w3c draft - it seems very incomplete at present.
CalDAV has a model of time, but looking at it I don't think the concepts there will suit, and of course there is no OWL vocabulary.
I have had the dubious pleasure of modelling times and durations in Java/SQL on a couple of occasions. The main thing I concluded was that it is important to distinguish between a date range expressed in calendar units and the concept of a time interval in the sense of from and to an instant, that is, to clearly distinguish discrete time from continuous time. In particular, it is very useful to treat timestamp intervals as being from some named instant up to *but not including* some other named instant. This is because our names for times always have a degree of granularity, depending on the naming scheme, and we take them to refer to the instant at the start of their range.
Doing this makes it easy to calculate total durations, to determine if two ranges are contiguous, disjoint, or overlapping, and to take intersections and unions. For instance, is the date range "2010-01-31T22:00 to 2010-01-31T23:59" contiguous with the date range "year 2011"? What interval of time covers all of May 1988 and all of Year 1920? You can easily answer if you convert them to their equivalent timestamp ranges ("up to but not including") according to the way each model divides up time into named bits. Attempting to do it any other way is very horrible.
This means that "from 2011 to 2012" may have a duration of one year or two, depending on how you meant it, and there are types permitting you to be explicit.
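A rough sketch of the "up to but not including" convention, in Python rather than Java/SQL, may help. The function names are invented for this example, and only year, month, and minute naming schemes are covered; the point is just that once each named range is converted to a pair of instants, the questions above become simple comparisons.

```python
from datetime import datetime, timedelta

# Every interval is a (start, end) pair meaning
# "from start up to but not including end".


def year_range(first, last=None):
    """Calendar years first..last, with both named years included."""
    last = first if last is None else last
    return (datetime(first, 1, 1), datetime(last + 1, 1, 1))


def month_range(year, month):
    """One named calendar month."""
    start = datetime(year, month, 1)
    end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
    return (start, end)


def minute_range(first, last):
    """Minutes named first..last, with both named minutes included."""
    return (first, last + timedelta(minutes=1))


def contiguous(a, b):
    """True if one interval ends exactly where the other begins."""
    return a[1] == b[0] or b[1] == a[0]


def overlaps(a, b):
    """True if the two intervals share at least one instant."""
    return a[0] < b[1] and b[0] < a[1]


def cover(a, b):
    """Smallest single interval containing both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def duration(a):
    return a[1] - a[0]


evening = minute_range(datetime(2010, 1, 31, 22, 0), datetime(2010, 1, 31, 23, 59))
print(contiguous(evening, year_range(2011)))
print(cover(month_range(1988, 5), year_range(1920)))
print(duration(year_range(2011, 2012)).days)
print(duration((datetime(2011, 1, 1), datetime(2012, 1, 1))).days)
```

Read as calendar years with both ends included, "from 2011 to 2012" comes out at 731 days; read as instant to instant, it is 365 days. Making the caller choose between `year_range(2011, 2012)` and an explicit pair of instants is one way of forcing that intent to be stated.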
It does seem that there ought to be a vocabulary for this sort of thing outside of rs.tdwg.org, which DwC can reference.
Two points (sorry about the pun) of information. First, DwC supports geometries (all of Rich's examples, and many more) through dwc:footprintWKT as Well-known Text. Second, there are no startDate and endDate terms in DwC. The single eventDate term is meant to comply with ISO 8601, which is capable of expressing not just dates, but also intervals, among other expressions of time. eventDate is one of the DwC terms that does have an elaboration in the DwC wiki pages, at http://code.google.com/p/darwincore/wiki/Event#eventDate. Note that those who would express eventDate in an application profile as conforming to the W3C xs:dateTime will be over-restrictive. The constraints on xs:dateTime can be found at http://www.w3.org/TR/xmlschema11-2/#dateTime.
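As a small illustration of why a single ISO 8601 string can stand in for a startDate/endDate pair, here is a sketch in Python that reads an eventDate-style value into a pair of bounding dates. It is deliberately limited: it assumes only calendar dates at year, month, or day precision, optionally joined by "/" as an interval, and it rejects everything else (times, durations, week dates). The function names are made up for this example and are not part of DwC.

```python
from datetime import date, timedelta


def _bounds(part):
    """Return (first day, day after last day) for YYYY, YYYY-MM or YYYY-MM-DD."""
    fields = [int(f) for f in part.split("-")]  # raises ValueError on anything else
    if len(fields) == 1:
        (y,) = fields
        return date(y, 1, 1), date(y + 1, 1, 1)
    if len(fields) == 2:
        y, m = fields
        following = date(y + 1, 1, 1) if m == 12 else date(y, m + 1, 1)
        return date(y, m, 1), following
    if len(fields) == 3:
        d = date(*fields)
        return d, d + timedelta(days=1)
    raise ValueError(f"unsupported date form: {part!r}")


def parse_event_date(value):
    """Parse 'A' or 'A/B' into (start, end), where end is not included."""
    parts = value.strip().split("/")
    if len(parts) > 2:
        raise ValueError(f"too many '/' separators: {value!r}")
    start, _ = _bounds(parts[0])
    _, end = _bounds(parts[-1])
    if end <= start:
        raise ValueError(f"interval ends before it starts: {value!r}")
    return start, end


print(parse_event_date("1984-06/1984-08"))  # one way to encode "Summer 1984"
print(parse_event_date("2010-10-25"))
print(parse_event_date("1920"))
```

Note that this only recovers the bounds of the range. It says nothing about whether the event happened once somewhere in the range, several times, or continuously throughout it.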
(In reply to Richard Pyle's message of Mon, Oct 25, 2010 at 12:26 PM, which appears in full above.)
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
This is a composite response to several posts by Cam, Rich, and Jim.
This thread has been extremely enlightening to me for several reasons. One is that as a "right brain" type person, the evolving diagram of the relationships among the Darwin Core classes (i.e. http://bioimages.vanderbilt.edu/pages/token-explicit.gif) has really clarified some things in my mind. The other reason is that the thread has convinced me that the best approach is to clearly separate things conceptually and avoid "overloading" the terms and classes by expecting them to simultaneously accomplish too many different things. Although this overloading may be convenient from the standpoint of how we like to think about "things" (a.k.a. resources), it causes problems when we try to explicitly define the properties and relationships of those resources. In particular, I'm thinking of trying to have classes both "be" and "do" two things at once.
Some of the disagreement that has emerged regarding Occurrences comes from what we (based on our different personal experiences) think that an Occurrence should BE. I think that a more productive approach would be to ask "what do we want Occurrences to DO?" I will illustrate that approach with the case of the proposed class Individual, then try to see what this approach tells us about how Occurrences should be defined.
Initially, I wanted to think of instances of the proposed class Individual as actual biological individuals. That was, in most cases, what I was interested in tracking. However, when I considered what I wanted the record for an Individual to "do" I realized that many times it was useful to consider an "individual" to include small populations of organisms of the same taxon (species or lower rank if it exists; assume this when I say "taxon" here). Sometimes this was just convenient and sometimes it was necessary (like in the case of moss) because I couldn't tell where one biological individual ended and another began. When I began to try to map out what I meant by an Individual (in terms of diagrams or in RDF), it became clear to me that what I really was interested in was a way to connect multiple Occurrences to (possibly) multiple determinations. That's why I included in my paper's title "... as resource relationship nodes", i.e. as a way to connect those things. Since the beginning of this recent thread, it has been even clearer that the functional approach to defining Individuals defines them better than any conceptual idea that I had about what an individual was. The consensus definition of an Occurrence seemed to be something like "a record that a taxon representative occurred at a particular location at a particular time". "Taxon representative" could legitimately include any unit that could reliably be said to represent a single taxon, from a single biological organism to a small group as long as one could be reasonably sure that all of the biological individuals in that group were of the same taxon. If (as someone noted) the group of biological individuals got big enough that it included (perhaps by accident) several species, then it was too large and needed to be split into smaller groups where only a single taxon was included.
If that group were to be resampled at a later time (as individualID was designed to facilitate), then the group would need to have some kind of stability (like plants growing together or a stable herd of animals). The point I'm trying to get at here is that the useful way of defining Individual is to define it in a way that it "does" what we want: connect Occurrences to Determinations in a way that allows for resampling (which is functionally equivalent to saying multiple Occurrences per Individual). That is far more productive than trying to make a philosophical argument about what constitutes an individual, or what we would like for an individual to "be".
Applying this approach to Occurrence, we should ask the functional question "What do we want Occurrences to do?" rather than "What do we think that they are?" Let's return to the diagram which seems to be the current favorite model: http://bioimages.vanderbilt.edu/pages/token-explicit.gif . If the "consensus" definition of an Occurrence is that it tells us that a taxon representative was at a particular location at a particular time (and if we accept that Event represents a time and a Location), then what we want Occurrence to "do" is to act as a node that connects an Event to an Individual (i.e. the taxon representative). There also seems to be a consensus that we would, if possible, like to associate Occurrence records with evidence that supports them (called "tokens" by me). Thus we can expand the description of what we want an Occurrence to "do" to include connecting one or more tokens to an Event and an Individual. I submit that we should really forget about whether we think that specimens are somehow more representative of the Individual than sounds, photos, etc. or not. The bottom line is that what we need an Occurrence record to do is to act as a conceptual resource that connects an Event, an Individual, and zero to many tokens (or one to many tokens if a memory is considered a token).
By this functional definition, we can clearly say what an Occurrence is (a resource of the type dwctype:Occurrence) and say what its properties are (ones that always have a one-to-one relationship with a single occurrence, such as recordedBy). If we take a philosophical approach to defining an occurrence and say that specimen metadata should be included with Occurrence metadata because somehow specimens better represent the individuals than "representations" like images, then we have a mess. We would have to say that an Occurrence has dwctype:Occurrence, but that it's also a resource of dwctype:PreservedSpecimen, except of course if it's an observation, in which case it's NOT also dwctype:PreservedSpecimen. We would have to say that Occurrences always can have a recordedBy property, but sometimes they will have a dwc:preparations or a dwc:disposition property and sometimes they won't. It seems to me that it would be far simpler and semantically clearer to just say that an occurrence is a dwctype:Occurrence with properties that only occurrences have, that a specimen is a dwctype:PreservedSpecimen with properties that only specimens have, and that an image is a dctype:StillImage with properties that MRTG says it has. In other words, separate the token (evidence) from the Occurrence no matter what kind of evidence the token is.
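The functional definition can be sketched as a handful of plain record types. This is only an illustration in Python of the "explicit token" idea described here, not a proposed DwC structure: the class layout, the field names, and all identifiers and values in the example are hypothetical. The one design point it tries to show is that an Occurrence carries only its own properties plus links to one Event, one Individual, and zero to many tokens, and each token carries its own type.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Event:
    event_id: str
    event_date: str
    location_id: str


@dataclass(frozen=True)
class Individual:
    individual_id: str  # a single organism or a small single-taxon group


@dataclass(frozen=True)
class Token:
    token_id: str
    rdf_type: str  # e.g. "dwctype:PreservedSpecimen" or "dctype:StillImage"


@dataclass
class Occurrence:
    occurrence_id: str
    recorded_by: str          # a property of the Occurrence itself
    event: Event              # exactly one Event
    individual: Individual    # exactly one Individual
    tokens: List[Token] = field(default_factory=list)  # zero to many


def occurrences_of(individual, occurrences):
    """All Occurrences that document the same Individual (i.e. resampling)."""
    return [o for o in occurrences if o.individual == individual]


# Hypothetical data: one tree documented at two Events.
tree = Individual("ind-001")
first_visit = Occurrence(
    "occ-001",
    "A. Collector",
    Event("evt-001", "2009-05-14", "loc-1"),
    tree,
    tokens=[
        Token("specimen-17", "dwctype:PreservedSpecimen"),
        Token("image-203", "dctype:StillImage"),
    ],
)
second_visit = Occurrence(
    "occ-002",
    "A. Collector",
    Event("evt-002", "2010-05-20", "loc-1"),
    tree,
)  # a bare observation: no tokens attached

all_occurrences = [first_visit, second_visit]
print([o.occurrence_id for o in occurrences_of(tree, all_occurrences)])
```

In this layout nothing about preparations or disposition ever needs to appear on Occurrence; such properties would live on whichever token type they describe.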
I was thinking about walking out onto the dwctype:LivingSpecimen minefield tonight (because I think it is related to this issue), but decided that I would rather hold off until somebody who was involved in the development of the current and previous incarnations of DwC explains exactly what dwc:basisOfRecord is for (since LivingSpecimen is a controlled value for basisOfRecord). I think there is danger of me blowing myself up (i.e. making an idiot of myself) if I don't know the answer to that question first. However, since those people may not be reading the detailed posts, I'm going to post that question as a separate item.
Steve
Cam Webb wrote:
Dear Steve,
Thanks for this clear and compelling argument in favor of Occurrences being different from the tokens created in their documentation.
So I'll return to the basic question: is the consensus for modeling the relationship between an Occurrence and associated token(s) the assumed token model: ... or the explicit token model
Having no long personal history of use of Occurrence, and with respect for the huge amount of work that crafting the DwC terms must have taken, but having tried semantic modeling (in a previous post) using the overloaded term Occurrence, I for one vote for the latter, as conceptually clearer. A specimen is then a Specimen, an image an Image, and so on.
But then what exactly are the Occurrences themselves? From Richard Pyle:
``So, an Occurrence is the intersection of an Individual and an Event. An Event is a Location+Time[+other metadata]. Each Event may have multiple Occurrences (i.e., one for each distinct Individual at the same Location+Time). Also, an Individual may have multiple Occurrences (one for each Event at which the same Individual was documented).''
So the Occurrence is the Individual _itself_ bounded by space and time, the latter data currently recorded in the Event class. What I then want to ask is, 1. do the terms for clearly defining the bounds of the Occurrence already exist? There exist terms for spatial uncertainty: dwc:coordinateUncertaintyInMeters, and coarse ones for temporal bounds: startDayOfYear + endDayOfYear, but not for temporal uncertainty, or spatial bounds (but see Pete's http://lod.taxonconcept.org/ontology/dwc_area.owl). Also, 2. if there was a consensus for moving to the `explicit token' model, should the space-time bounds of the Occurrence still be contained in an associated (often blank) Event, or accepted as properties of the Occurrence itself (e.g., occurrenceDate, occurrenceDuration, occurrenceLocation, occurrenceRadius)? I would support the latter.
Finally, 3. if there was a consensus for moving to the `explicit token' model, and a human observation was a token-less Occurrence, would we best specify who made the observation with dwc:recordedBy and what the observation was with dwc:occurrenceRemarks, or would it be better to create a second new token (along with `Physical specimens') that was an explicit Observation class, that would link explicitly to, say, an external observational ontology (i.e., OBOE)? The issue of GUIDs for non-physical observations comes up, but this could still be solved in various ways.
Stepping back from the details for a moment, and reading some of the replies to Steve's post that have come in, I am wondering how many readers are thinking, ``the need for a semantic web standard for biodiversity information might be better achieved by a deep fork of Darwin Core, adopting new Classes and explicit domains and ranges for each term, to create a `Darwin SW,' rather than by an effort to evolve Darwin Core itself.'' I'm sure the question of forking Darwin Core has come up before, and I'm sure the discussion was passionate!
Best,
Cam
Hi Steve,
I read every word, and don't really disagree with anything. I think the key point is this:
"separate the token (evidence) from the Occurrence no matter what kind of evidence the token is"
This is the main reason I've been quick to support your notion of a separate "Individual" (sensu lato) class, and why certain properties of Occurrence should port over to that Class.
Aloha, Rich
________________________________
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Steve Baskauf Sent: Monday, October 25, 2010 7:32 PM To: Cam Webb Cc: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] Treatise on Occurrence, tokens, and basisOfRecord This is a composite response to several posts by Cam, Rich, and Jim.
This thread has been extremely enlightening to me for several reasons. One is that as a "right brain" type person, the evolving diagram of the relationships among the Darwin Core classes (i.e. http://bioimages.vanderbilt.edu/pages/token-explicit.gif) has really clarified some things in my mind. The other reason is that the thread has convinced me that the best approach is to clearly separate things conceptually and avoid "overloading" the terms and classes by expecting them to simultaneously accomplish too many different things. Although this overloading may be convenient from the standpoint of how we like to think about "things" (a.k.a. resources), it causes problems when we try to explicitly define the properties and relationships of those resources. In particular, I'm thinking of trying to have classes both "be" and "do" two things at once. Some of the disagreement that has emerged regarding Occurrences comes from what we (based on our different personal experiences) think that an Occurrence should BE. I think that a more productive approach would be to ask "what do we want Occurrences to DO?" I will illustrate that approach with the case of the proposed class Individual, then try to see what this approach tells us about how Occurrences should be defined. Initially, I wanted to think of instances of the proposed class Individual as actual biological individuals. That was, in most cases, what I was interested in tracking. However, when I considered what I wanted the record for an Individual to "do" I realized that many times it was useful to consider an "individual" to include small populations of organisms of the same taxon (species or lower rank if it exists; assume this when I say "taxon" here). Sometimes this was just convenient and sometimes it was necessary (like in the case of moss) because I couldn't tell where one biological individual ended and another began. 
When I began to try to map out what I meant by an Individual (in terms of diagrams or in RDF), it became clear to me that what I really was interested in was a way to connect multiple Occurrences to (possibly) multiple determinations. That's why I included in my paper's title "... as resource relationship nodes", i.e. as a way to connect those things. Since the beginning of this recent thread, it has been even clearer that the functional approach to defining Individuals defines them better than any conceptual idea that I had about what an individual was. The consensus definition of an Occurrence seemed to be something like "a record that a taxon representative occurred at a particular location at a particular time". "Taxon representative" could legitimately include any unit that could reliably be said to represent a single taxon, from a single biological organism to a small group as long as one could be reasonably sure that all of the biological individuals in that group were of the same taxon. If (as someone noted) the group of biological individuals got big enough that it included (perhaps by accident) several species, then it was too large and needed to be split into smaller groups where only a single taxon was included. If that group were to be resampled at a later time (as individualID was designed to facilitate), then the group would need to have some kind of stability (like plants growing together or a stable herd of animals). The point I'm trying to get at here is that the useful way of defining Individual is to define it in a way that it "does" what we want: connect Occurrences to Determinations in a way that allows for resampling (which is functionally equivalent to saying multiple Occurrences per Individual). That is far more productive than trying to make a philosophical argument about what constitutes an individual, or what we would like for an individual to "be".
Applying this approach to Occurrence, we should ask the functional question "What do we want Occurrences to do?" rather than "What do we think that they are?" Let's return to the diagram which seems to be the current favorite model: http://bioimages.vanderbilt.edu/pages/token-explicit.gif . If the "consensus" definition of an Occurrence is that it tells us that a taxon representative was at a particular location at a particular time (and if we accept that Event represents a time and a Location), then what we want Occurrence to "do" is to act as a node that connects an Event to an Individual (i.e. the taxon representative). There also seems to be a consensus that we would, if possible, like to associate Occurrence records with evidence that supports them (called "tokens" by me). Thus we can expand the description of what we want an Occurrence to "do" to include connecting one or more tokens to an Event and an Individual. I submit that we should really forget about whether we think that specimens are somehow more representative of the Individual than sounds, photos, etc. or not. The bottom line is that what we need an Occurrence record to do is to act as a conceptual resource that connects an Event, an Individual, and zero to many tokens (or one to many tokens if a memory is considered a token). By this functional definition, we can clearly say what an Occurrence is (a resource of the type dwctype:Occurrence) and say what its properties are (ones that always have a one-to-one relationship with a single occurrence, such as recordedBy). If we take a philosophical approach to defining an occurrence and say that specimen metadata should be included with Occurrence metadata because somehow specimens better represent the individuals than "representations" like image, then we have a mess. 
We would have to say that an Occurrence has dwctype:Occurrence, but that it's also a resource of dwctype:PreservedSpecimen, except of course if it's an observation, in which case it's NOT also dwctype:PreservedSpecimen. We would have to say that Occurrences always can have a recordedBy property, but sometimes they will have a dwc:preparations, or a dwc:disposition property but sometimes they won't. It seems to me that it would be far simpler and semantically clearer to just say that an occurrence is a dwctype:Occurrence with properties that only occurrences have, that a specimen is a dwctype:PreservedSpecimen with properties that only specimens have, and that an image is a dctype:StillImage with properties that MRTG says it has. In other words, separate the token (evidence) from the Occurrence no matter what kind of evidence the token is. I was thinking about walking out onto the dwctype:LivingSpecimen minefield tonight (because I think it is related to this issue), but decided that I would rather hold off until somebody who was involved in the development of the current and previous incarnations of DwC explains exactly what dwc:basisOfRecord is for (since LivingSpecimen is a controlled value for basisOfRecord). I think there is danger of me blowing myself up (i.e. making an idiot of myself) if I don't know the answer to that question first. However, since those people may not be reading the detailed posts, I'm going to post that question as a separate item. Steve
On Oct 25, 2010, at 4:37 AM, Cam Webb wrote:
But then what exactly are the Occurrences themselves? From Richard Pyle:
``So, an Occurrence is the intersection of an Individual and an Event. An Event is a Location+Time[+other metadata]. Each Event may have multiple Occurrences (i.e., one for each distinct Individual at the same Location+Time). Also, an Individual may have multiple Occurrences (one for each Event at which the same Individual was documented).''
So the Occurrence is the Individual _itself_ bounded by space and time,
While for the purposes of exchanging occurrence data in a commonly agreed upon markup, i.e., Darwin Core, this may be perfectly acceptable, I think there are some serious issues in the above when we try to tighten up the semantics so that machines could do something with them, or so they can seamlessly integrate into the semantic web.
First there is an internal inconsistency: on the one hand occurrences *are* individuals (albeit only a subset - though see below), and on the other hand individuals *have* occurrences.
Second, occurrence is said to be the intersection of an individual and an event, or an individual and space and time. In the semantic web, OWL models deal with sets of individuals. I would argue that the intersection set of an individual organism (or a set of individual organisms) and an event (or a set of events) is empty, because there are no events that are also individual organisms, and vice versa.
Alternatively, and using "Individuals" as short hand for "instances of an organism" we could say that an Occurrence is the intersection of all Individuals belonging to a specific taxon, all Individuals at a specific location, and all Individuals existing at a specific time. Then an instance of an Occurrence would be an Individual in that intersection, and taxon, location, and time would be (among) its properties.
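To make the two readings of "intersection" concrete, here is a minimal sketch using plain Python sets as a stand-in for OWL classes. All of the identifiers (bear_1, event_site_A, and so on) are invented for illustration, and ordinary sets only approximate what a reasoner does.

```python
# Toy data: every name below is invented for illustration.
individuals = {"bear_1", "bear_2", "moss_patch_7"}
events = {"event_site_A", "event_site_B"}

# Literal reading: intersect the set of organisms with the set of events.
# Nothing is both an organism and an event, so the result is empty.
literal_intersection = individuals & events
print(literal_intersection)

# Alternative reading: intersect three sets of Individuals, selected by
# taxon, by location, and by time.
of_taxon = {"bear_1", "bear_2"}
at_location = {"bear_1", "moss_patch_7"}
at_time = {"bear_1", "bear_2", "moss_patch_7"}

occurrence_members = of_taxon & at_location & at_time
print(occurrence_members)

# Every member of an intersection is also a member of each operand; this
# is the set-level counterpart of a subclass inference.
print(occurrence_members <= of_taxon)
```

Under the second reading each member of the result is still an Individual, which is why taxon, location, and time end up as its properties.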
Just some thoughts.
-hilmar
As shown in this RDF
http://lod.taxonconcept.org/ses/ICmLC.rdf
All the occurrences of this species will have this "type"
http://lod.taxonconcept.org/ses/ICmLC#Occurrence
All the individuals of this species will have this "type"
http://lod.taxonconcept.org/ses/ICmLC#Individual
Which then makes it easy to query for all occurrences or all individuals of this species concept.
The exact predicate is different, but when you tie a subject to an object you are essentially making the subject a "type" of the object.
Which is why you can see the occurrences for that species in the triplestore, when you ask it to describe that type.
< http://lsd.taxonconcept.org/describe/?url=http%3A%2F%2Flod.taxonconcept.org%...
- Pete
--
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
On Oct 26, 2010, at 11:13 PM, Peter DeVries wrote:
As shown in this RDF
http://lod.taxonconcept.org/ses/ICmLC.rdf
All the occurrences of this species will have this "type"
http://lod.taxonconcept.org/ses/ICmLC#Occurrence
All the individuals of this species will have this "type"
Do you mean rdf:type when you say "have this 'type'"? If so, I don't see that from the RDF.
-hilmar
If you look at the occurrence record:
http://ocs.taxonconcept.org/ocs/f522444a-2dd9-400e-be59-47213ef38cb9.rdf
You will see the following link to the species occurrence tag; this essentially makes the occurrence a "type" of this tag.
<txn:occurrenceHasSpeciesOccurrenceTag rdf:resource="http://lod.taxonconcept.org/ses/ICmLC#Occurrence"/>
Allowing you to browse from
< http://lsd.taxonconcept.org/describe/?url=http%3A%2F%2Flod.taxonconcept.org%...
to the Occurrence Record
< http://lsd.taxonconcept.org/describe/?url=http%3A%2F%2Focs.taxonconcept.org%...
Or run the following SPARQL query.
DESCRIBE <http://lod.taxonconcept.org/ses/ICmLC#Occurrence>
< http://lsd.taxonconcept.org/isparql/view/?query=DESCRIBE%20%3Chttp%3A%2F%2Fl...
or run the following SPARQL query:
PREFIX txn: <http://lod.taxonconcept.org/ontology/txn.owl#>
PREFIX boloria_selene_occ: <http://lod.taxonconcept.org/ses/ICmLC#Occurrence>
DESCRIBE ?x WHERE { ?x txn:occurrenceHasSpeciesOccurrenceTag boloria_selene_occ: . }
Which can be run on the LOD cloud with the following query (or any endpoint)
< http://lod.openlinksw.com/isparql/view/?query=PREFIX%20txn%3A%20%3Chttp%3A%2...
Bit.ly http://bit.ly/dbIiLQ
Respectfully,
- Pete
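For readers without a SPARQL endpoint at hand, the difference between linking through the tag predicate and asserting rdf:type can be sketched with a toy in-memory triple list. This is only an illustration: the occurrence URI is assumed to be the record URL above without its .rdf suffix, the second triple is invented, and the helper subjects() is hypothetical rather than part of any RDF library.

```python
TAG = "http://lod.taxonconcept.org/ontology/txn.owl#occurrenceHasSpeciesOccurrenceTag"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SPECIES_OCC = "http://lod.taxonconcept.org/ses/ICmLC#Occurrence"

# Assumption: the resource URI is the record URL minus ".rdf".
occ = "http://ocs.taxonconcept.org/ocs/f522444a-2dd9-400e-be59-47213ef38cb9"

# A toy graph as (subject, predicate, object) tuples; the second triple
# is invented so that the filter has something to exclude.
triples = [
    (occ, TAG, SPECIES_OCC),
    ("http://example.org/other-occurrence", TAG,
     "http://example.org/other-species#Occurrence"),
]


def subjects(graph, predicate, obj):
    """Return the subjects of all triples matching (?x, predicate, obj)."""
    return [s for s, p, o in graph if p == predicate and o == obj]


# Matching on the tag predicate finds the occurrence record.
via_tag = subjects(triples, TAG, SPECIES_OCC)

# Matching on rdf:type finds nothing, because no such triple was asserted.
via_type = subjects(triples, RDF_TYPE, SPECIES_OCC)

print(via_tag)
print(via_type)
```

The pattern in the WHERE clause plays the role of subjects() here: it selects by the txn predicate, so the grouping works without any rdf:type statement being present.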
Hilmar,
I think Rich's statements are consistent:
1) An Occurrence is the intersection of an Individual and an Event (has place and time dimensions); and 2) an Occurrence is an Individual bounded by (in) space and time.
It would be incorrect to truncate #2 as: "an occurrence is an individual", or even to say that an individual is_a (kind of) occurrence.
In general, however, I agree with your point, we should try to be precise in our definitions. This whole discussion about individuals and occurrences has driven home one of my pet peeves. A lot of people refer to the millions of records made accessible by GBIF as species occurrence records. I think that's an unfortunate simplification; they are organism occurrence records. People identify those organisms as being members of a species. There is metadata in those identifications, which enables the assessment of fitness for use.
-Stan
Hi Stan:
On Oct 27, 2010, at 3:31 AM, Blum, Stan wrote:
1) An Occurrence is the intersection of an Individual and an Event (has place and time dimensions); and 2) an Occurrence is an Individual bounded by (in) space and time.
It would be incorrect to truncate #2 as: "an occurrence is an individual", or even to say that an individual is_a (kind of) occurrence.
I wasn't trying to say that Rich is inconsistent. I was trying to make the point that if we want to model and express things with explicit semantics, we need to also properly choose our terminology, as in any domain of science. In OWL, the term intersection has an already defined meaning in the language. If A is an (is equivalent to an) intersection of classes B and C, any reasoner will automatically infer that A subClassOf B, and A subClassOf C. If such inference is incorrect, I suggest that we shouldn't call it an intersection to begin with. Avoids misunderstandings down the road.
In fact, I think this is fully in line with the rest of your post.
-hilmar
Thanks, Hilmar.
I must admit that I was initially confused about your earlier post, because you seemed to be saying the same thing that I was saying (at least from the perspective of logic). However, I now realize that the problem was a semantic one: i.e., the specific definition of the word "intersection". I was not even aware that there was an OWL definition for this term, and thus certainly I was not using it in a way that is consistent with that particular definition.
I'll add the word "intersection" to the long list of homonymous and ambiguous (out of context) terms in our domain ("name", "class", "type", "natural key", "significant", etc.).
Meanwhile, what would be an appropriate term to use for what I originally meant, that would not collide with a more specific definition in some particular space?
Some possibilities:
An Occurrence is a combination of an Individual and an Event. An Occurrence is a coupling of an Individual and an Event. An Occurrence is a pairing of an Individual and an Event.
Perhaps we need a comprehensive glossary of terms to cover the entire spectrum of biodiversity informatics, including both the biological terms and the informatics terms.
Aloha, Rich
On Oct 27, 2010, at 2:59 PM, Richard Pyle wrote:
Some possibilities:
An Occurrence is a combination of an Individual and an Event. An Occurrence is a coupling of an Individual and an Event. An Occurrence is a pairing of an Individual and an Event.
Probably the latter, assuming that by pairing we do not mean union, but forming a tuple. Thus, more formally, an Occurrence would be a tuple of (Individual, Event), and has properties for referring to the Individual and the Event.
Is that congruent with what you had in mind?
-hilmar
Yes, that is exactly congruent with what I had in mind.
Many thanks for the clarification!
Aloha, Rich
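The tuple reading agreed on above could be sketched roughly as follows. The class and field names are illustrative assumptions, not Darwin Core terms, and the records are invented.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """A Location plus a Time (other metadata omitted)."""
    location: str
    time: str


class Individual(NamedTuple):
    individual_id: str


class Occurrence(NamedTuple):
    """A pairing (tuple) of an Individual and an Event, not a union."""
    individual: Individual
    event: Event


tree = Individual("tree-42")
shrub = Individual("shrub-7")
spring = Event("plot A", "2010-04-01")
autumn = Event("plot A", "2010-10-25")

occurrences = [
    Occurrence(tree, spring),
    Occurrence(tree, autumn),   # the same Individual, resampled
    Occurrence(shrub, autumn),  # a second Individual at the same Event
]

# One Individual may have several Occurrences ...
tree_occurrences = [o for o in occurrences if o.individual == tree]
# ... and one Event may have several Occurrences.
autumn_occurrences = [o for o in occurrences if o.event == autumn]

print(len(tree_occurrences), len(autumn_occurrences))
```

Each Occurrence carries references to its Individual and its Event as properties, and neither of the two is treated as a kind of the other.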
On 28/10/2010, at 5:35 AM, Hilmar Lapp wrote:
- An Occurrence is the intersection of an Individual and an Event
I suggest that we shouldn't call it an intersection to begin with. Avoids misunderstandings down the road.
Perhaps the word is 'product'? I am thinking along the lines that the cartesian product of the set of all Individuals and the set of all Events is the set of all possible Occurrences. Unfortunately, 'intersection' is quite a good word if we are thinking in natural language, and 'product' is not - not unless you are already thinking in terms of a mathematical vocabulary.
Perhaps the word is 'product'? I am thinking along the lines that the cartesian product of the set of all Individuals and the set of all Events is the set of all possible Occurrences. Unfortunately, 'intersection' is quite a good word if we are thinking in natural language, and 'product' is not - not unless you are already thinking in terms of a mathematical vocabulary.
Hmmm, thinking about the mathematical definition of "product", it would seem to me that the "product" would be some sort of representation of all *possible* combinations of unique events and unique individuals; rather than just the tiny, tiny, tiny subset of those that ever existed, and the even tinier, tinier, tinier subset that have actually been documented.
In database-speak, an Occurrence is simply a many-to-many join between Events and Individuals, with a few additional properties.
...which is what I *used* to use the word "intersection" for.
It reminds me of the time when, as a young undergrad, I was informed that I could no longer use the word "significant" in the sense of "important"....
Aloha, Rich
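The distinction between the Cartesian product and the join table can be shown in a few lines. This is a toy sketch with invented identifiers, with a plain set of pairs standing in for a database join table.

```python
from itertools import product

# Invented identifiers for illustration only.
individuals = ["ind_1", "ind_2", "ind_3"]
events = ["evt_A", "evt_B"]

# The Cartesian product: every *possible* (Individual, Event) pairing.
possible = set(product(individuals, events))

# The documented Occurrences: the rows actually present in a
# many-to-many join table between Individuals and Events.
documented = {
    ("ind_1", "evt_A"),
    ("ind_1", "evt_B"),
    ("ind_3", "evt_B"),
}

# The documented pairings are a subset of the product, not the product.
print(len(possible), len(documented))
print(documented <= possible)
```

So "product" names the space that Occurrences are drawn from, while the join table holds only the pairings someone actually recorded.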
Hey Rich, Hilmar, Paul, and everyone -
I liked the definition from a couple of weeks ago:
"An occurrence is a tuple consisting of time, place, individual, and some optional properties."
What's that lacking?
Thanks - Joel.
On Oct 28, 2010, at 11:58 AM, joel sachs wrote:
"An occurrence is a tuple consisting of time, place, individual, and some optional properties."
What's that lacking?
Nothing in my mind. I missed that definition (was it yours?) :-) - so much stuff flying by left and right
-hilmar
What exactly is an individual? A flock? A herd? A breading pair? A colony? A clonal stand? And of course the individual that was not there - the absence!
:)
Roger
breeding pear !
On Thu, Oct 28, 2010 at 2:00 PM, Roger Hyam rogerhyam@mac.com wrote:
On 28 Oct 2010, at 17:37, Roger Hyam wrote:
A breading pair?
but of course he meant "bedding pair" or was it "breeding pair"
Hi Roger -
On Thu, 28 Oct 2010, Roger Hyam wrote:
What exactly is an individual? A flock? A herd? A breading pair? A colony? A clonal stand?
I thought that Rich convinced everyone that a DwC:individual can be any of those things.
And of course the individual that was not there - the absence!
Absence assertions are a good example of something discussed earlier - that occurrence records can be both raw data, and also the output of analysis. It seems that one typical (or at least proper) workflow would be to i) infer the absence of a taxon in an area from an absence of occurrence records together with an event record with an appropriate value for samplingEffort. ii) assert the absence, via an occurrence record w. occurrenceStatus="absent". (Of course, there are cleaner ways to assert absence.)
Joel.
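The two-step workflow above (infer, then assert) might be sketched like this. It is only an illustration under loud assumptions: samplingEffort is treated as a number of hours, although the Darwin Core term is free text; the threshold, the dictionary keys other than occurrenceStatus, and the records themselves are invented.

```python
def infer_absences(taxon, events, occurrences, min_effort_hours):
    """Return absence records for well-sampled events lacking the taxon.

    Step i: an event with enough sampling effort and no occurrence of the
    taxon is taken as evidence of absence.
    Step ii: the absence is asserted as a record with
    occurrenceStatus="absent".
    """
    seen = {o["eventID"] for o in occurrences if o["taxon"] == taxon}
    absences = []
    for event in events:
        if event["eventID"] in seen:
            continue  # the taxon was recorded at this event
        if event["samplingEffortHours"] < min_effort_hours:
            continue  # too little effort to conclude anything
        absences.append({
            "eventID": event["eventID"],
            "taxon": taxon,
            "occurrenceStatus": "absent",
        })
    return absences


# Invented example records.
events = [
    {"eventID": "E1", "samplingEffortHours": 8.0},
    {"eventID": "E2", "samplingEffortHours": 0.5},
    {"eventID": "E3", "samplingEffortHours": 6.0},
]
occurrences = [{"eventID": "E1", "taxon": "Boloria selene"}]

result = infer_absences("Boloria selene", events, occurrences, 4.0)
print(result)
```

Only E3 yields an absence record: E1 has an occurrence of the taxon, and E2 was sampled too briefly to support any conclusion. The output is itself a derived record, which is the sense in which an occurrence record can be the product of analysis rather than raw data.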
On Oct 28, 2010, at 12:37 PM, Roger Hyam wrote:
What exactly is an individual? A flock? A herd? A breading pair? A colony? A clonal stand?
One or more members of a class, for example, the class defined as all members of a taxon.
And of course the individual that was not there - the absence!
No member is not included in one or more members, so negation needs to be treated differently (in a reasoning framework, that is; how it is recorded matters less).
-hilmar
What exactly is an individual? A flock? A herd? A breading pair? A colony? A clonal stand?
One or more members of a class, for example, the class defined as all members of a taxon.
We'll have to add "individual" to the list of overloaded terms.
In the world of taxonomy and specimen curation, it apparently can mean various things (perhaps "living things you can count"? "Living things that are identifiably the same thing from one day to another"? The boundaries of individuals are sometimes wobbly.).
In the world of OWL and RDF, an individual is an unspecified something that can be the subject or object of a (object) property. Individuals can be named with URIs.
Perhaps, then, an individual is simply "A living thing that we are sufficiently interested in to identify as an individual". That is: essentially to leave the term undefined and axiomatic.
OK, I'm going to respectfully disagree here. dwc:Individual is not "overloaded" any more than dwc:class is overloaded. We know that dwc:class does not mean the same thing as "class" in RDF or Java because the term name is http://rs.tdwg.org/dwc/terms/class, not "class". We know that the proposed dwc:Individual has a specific meaning because it would be http://rs.tdwg.org/dwc/terms/Individual and not "individual" in the sense of OWL or RDF or anything else.
The problem here is not the lack of a clear definition for the proposed DwC class dwc:Individual. That thing has been defined to death, having been the subject of an entire published paper (Biodiversity Informatics 7:17-44) and having had its definition restated at least three times in this thread. The problem is people entering the thread without being aware that it has been defined, or without having read any of the definitions (I'm not trying to be rude here; I'm just observing that this has happened several times in the thread). So one last time, I'll define what I intend dwc:Individual to mean ("taxon" here means terminal taxon, species, ssp., or var.):
Layman's definition: a representative of a single taxon that serves to connect one or more dwc:Occurrences to one or more dwc:Identifications.
More technical definition: a resource representing a single taxon that serves as a node (sensu RDF) connecting one or more instances of the class dwc:Occurrence to one or more instances of the class dwc:Identification.
These are functional definitions: they define what dwc:Individual "does", not what dwc:Individual "is". What dwc:Individual "is" is anything that fits the definition. Thus a biological individual can be a dwc:Individual, as can a clump of moss. The mixed-species content of a pitfall trap cannot be an individual because it does not represent a single taxon. Groups of biological individuals that are too large to know for sure that they are a single taxon probably shouldn't be considered a dwc:Individual.
I would be perfectly happy with changing the term name from "Individual" to something else as long as the definition of its purpose doesn't change and as long as dwc:individualID and the proposed dwc:individualRemarks are changed to match.
Leaving the term undefined and axiomatic is not an option. We have a proposal for a term addition to DwC (http://code.google.com/p/darwincore/issues/detail?id=69) that's been on the table for nine months and I've essentially "called for the question" on the proposal. So unless somebody has something to add that's different from what has already been discussed at great length, let's move on.
Steve
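To make the functional definition concrete, here is a minimal sketch of dwc:Individual acting as a connecting node, written as plain subject-predicate-object triples in Python. All ex: identifiers and the predicates ex:occurrenceOf and ex:identifies are hypothetical names chosen for illustration; the proposal does not specify them.

```python
# Triples as (subject, predicate, object); every ex: name is hypothetical.
triples = [
    ("ex:occ1", "rdf:type", "dwc:Occurrence"),
    ("ex:occ2", "rdf:type", "dwc:Occurrence"),
    ("ex:id1", "rdf:type", "dwc:Identification"),
    ("ex:tree7", "rdf:type", "dwc:Individual"),
    ("ex:occ1", "ex:occurrenceOf", "ex:tree7"),
    ("ex:occ2", "ex:occurrenceOf", "ex:tree7"),
    ("ex:id1", "ex:identifies", "ex:tree7"),
]


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


def identifications_for(occurrence):
    """Follow occurrence -> individual -> identifications."""
    result = []
    for s, p, individual in triples:
        if s == occurrence and p == "ex:occurrenceOf":
            result.extend(subjects("ex:identifies", individual))
    return result
```

Because both occurrences point at the same node, the single identification reaches each of them without being repeated on every occurrence record; a later re-identification of ex:tree7 would likewise apply to both.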
Steve,
Can you add a comment to Issue 69 in which you state the updated term recommendation for the following?
Definition:
Comment:
Refines:
It might also be a good time to decide if Individual as a term name is equally offensive to all. Sure, it doesn't capture exactly all of the things an Individual might be, but the same is true of almost every term name - people should always consult the definitions, comments, and secondary documentation.
John,
Thanks for the suggestion. It is appropriate given the clarification that has been made through the course of the discussion on this list. I have created a revised term definition and comments at http://code.google.com/p/darwincore/issues/detail?id=69 .
With regard to the actual term name, I don't have any better idea. If someone has a suggestion, perhaps they can post it to the list for comment.
Steve
I think the only alternative to "Individual" that has been floated, and that might be more appropriate, is "Organism". In my mind, at least, the word "Organism" can apply equally to a single cell, a single multicellular organism, a group of individuals, a colony, a population, or even a taxon. The advantage it has over "Individual" is that it is more clearly related to the biology domain (not to be confused with other things called "Individual" in other domains); also, "Individual" might lead people to assume that groups, populations, and the like are not within scope.
I don't feel strongly about it either way -- it's just a suggestion.
Rich
I like Organism, but I don't like the inconsistency it would create with individualID and individualCount on the one hand, or the extra work of changing these to organismID and organismCount on the other. Individual doesn't carry these extra burdens, and could be added without breaking any existing applications.
Good point -- I agree.
_____
From: gtuco.btuco@gmail.com [mailto:gtuco.btuco@gmail.com] On Behalf Of John Wieczorek Sent: Tuesday, November 02, 2010 6:31 PM To: Richard Pyle Cc: Steve Baskauf; tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] Treatise on Occurrence, tokens, and basisOfRecord [SEC=UNCLASSIFIED]
I like Organism, but I don't like the inconsistency it would make with individualID and individualCount on the one hand, or extra work to change these to organismID and organismCount on the other. Individual doesn't carry these extra burdens, and could be added without breaking any existing applications.
On Tue, Nov 2, 2010 at 8:30 PM, Richard Pyle deepreef@bishopmuseum.org wrote:
I think the only alternative to "Individual" that has been floated, and might be more appropriate, is "Organism". In my mind, at least, the word "Organism" can apply equally to a single cell, or a single multicellular organism, or a group of individuals, or a colony, or a population, or even a taxon. The advantage it has over "Individual" is that is more clearly related to the biology domain (not to be confused with other things called "Individual" in other domains), and also "Individual" might lead people to assume that gorups and populations and such are not within scope.
I don't feel strongly about it either way -- it's just a suggestion.
Rich
_____
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Steve Baskauf Sent: Tuesday, November 02, 2010 5:21 PM To: tuco@berkeley.edu
Cc: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] Treatise on Occurrence, tokens, and basisOfRecord [SEC=UNCLASSIFIED]
John, Thanks for the suggestion. It is appropriate given the clarification that has been made through the course of the discussion on this list. I have created a revised term definition and comments at http://code.google.com/p/darwincore/issues/detail?id=69 .
With regards to the actual term name, I don't have any better idea. If someone has a suggestion, perhaps they can post it to the list for comment. Steve
John Wieczorek wrote:
Steve,
Can you add a comment to Issue 69 in which you state the updated term recommendation for the following?
Definition: Comment: Refines:
It might also be a good time to decide if Individual as a term name is equally offensive to all. Sure, it doesn't capture exactly all of the things an Individual might be, but the same is true of almost every term name - people should always consult the definitions, comments, and secondary documentation.
On Tue, Nov 2, 2010 at 7:03 AM, Steve Baskauf steve.baskauf@vanderbilt.edu wrote:
OK, I'm going to respectfully disagree here. dwc:Individual is not "overloaded" any more than dwc:class is overloaded. We know that dwc:class does not mean the same thing as "class" in RDF or Java because the term name is http://rs.tdwg.org/dwc/terms/class, not "class". We know that the proposed dwc:Individual has a specific meaning because it would be http://rs.tdwg.org/dwc/terms/Individual and not "individual" in the sense of OWL or RDF or anything else.
The problem here is not lack of a clear definition for the proposed DwC class dwc:Individual . That thing has been defined to death, having been the subject of an entire published paper (Biodiversity Informatics 7:17-44), and having its definition restated at least three times in this thread. The problem is people entering the thread without being aware that it's been defined or having not read any of the definitions (I'm not trying to be rude here, I'm just observing that this has happened several times in the thread). So one last time, I'll define what I intend for dwc:Individual to mean ("taxon" here means terminal taxon, species, ssp., or var.):
Layman's definition: a representative of a single taxon that serves to connect one or more dwc:Occurrences to one or more dwc:Identifications.
More technical definition: a resource representing a single taxon that serves as a node (sensu RDF) connecting one or more instances of the class dwc:Occurrence to one or more instances of the class dwc:Identification .
These are functional definitions - they define what dwc:Individual "does" not what dwc:Individual "is". What dwc:Individual "is" is anything that fits the definition. Thus a biological individual can be a dwc:Individual, as can a clump of moss. The mixed-species content of a pitfall trap cannot be an individual because it does not represent a single taxon. Groups of biological individuals that are too large to know for sure that they are a single taxon probably shouldn't be considered a dwc:Individual.
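The functional definition above can be sketched in a few lines of Python. This is only an illustration: the class names, field names, and identifiers below are assumptions made for the example and are not Darwin Core terms or any agreed serialization.

```python
from dataclasses import dataclass, field


@dataclass
class Occurrence:
    occurrence_id: str  # hypothetical identifier


@dataclass
class Identification:
    identification_id: str  # hypothetical identifier
    taxon: str              # the single terminal taxon asserted by this identification


@dataclass
class Individual:
    """A node whose only job is to link Occurrences to Identifications."""
    individual_id: str
    occurrences: list = field(default_factory=list)
    identifications: list = field(default_factory=list)

    def connects(self) -> bool:
        # Functional test: at least one Occurrence linked to at least one Identification.
        return bool(self.occurrences) and bool(self.identifications)


# A clump of moss documented twice and identified once (made-up data).
moss = Individual("ind-1")
moss.occurrences += [Occurrence("occ-1"), Occurrence("occ-2")]
moss.identifications.append(Identification("det-1", "Bryum argenteum"))
```

Nothing in the sketch says what the Individual "is"; it only records what it connects, which is the point of a functional definition.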
I would be perfectly happy with changing the term name from "Individual" to something else as long as the definition of its purpose doesn't change and as long as dwc:individualID and the proposed dwc:individualRemarks are changed to match.
Leaving the term undefined and axiomatic is not an option. We have a proposal for a term addition to DwC (http://code.google.com/p/darwincore/issues/detail?id=69) that's been on the table for nine months and I've essentially "called for the question" on the proposal. So unless somebody has something to add that's different from what has already been discussed at great length, let's move on.
Steve
Paul Murray wrote:
What exactly is an individual? A flock? A herd? A breeding pair? A colony? A clonal stand?
One or more members of a class, for example, the class defined as all members of a taxon.
We'll have to add "individual" to the list of overloaded terms.
In the world of taxonomy and specimen curation, it apparently can mean various things (perhaps "living things you can count"? or "living things that are identifiably the same thing from one day to another"? The boundaries of individuals are sometimes wobbly).
In the world of OWL and RDF, an individual is an unspecified something that can be the subject or object of a (object) property. Individuals can be named with URIs.
Perhaps, then, an individual is simply "A living thing that we are sufficiently interested in to identify as an individual". That is: essentially to leave the term undefined and axiomatic.
_______________________________________________
tdwg-content mailing list
tdwg-content@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content
On Thu, 28 Oct 2010, Hilmar Lapp wrote:
On Oct 28, 2010, at 11:58 AM, joel sachs wrote:
"An occurrence is a tuple consisting of time, place, individual, and some optional properties."
What's that lacking?
Nothing in my mind. I missed that definition (was it yours?) :-)
Oh God ... maybe.
Looking back, you introduced the idea of a (taxonConcept, location, time) tuple. Syntactically, I think
(taxonConcept, location, time)
(Individual, location, time)
(vernacularName, location, time)
(Identification, location, time)
etc.
should all be considered valid representations of an occurrence. Semantically, the Individual strikes me as being implied in each case.
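As a rough illustration of the "occurrence as tuple" idea, the Python sketch below stores each of the listed forms in one tuple type. The type name, field names, and all values are hypothetical and chosen only for the example.

```python
from typing import NamedTuple, Optional


class OccurrenceTuple(NamedTuple):
    token_kind: str                    # which kind of "what" is recorded
    token: str                         # the value standing in for the organism
    location: str
    time: str
    properties: Optional[dict] = None  # the "optional properties"


# Four syntactically different records of the same sighting (made-up data).
records = [
    OccurrenceTuple("taxonConcept", "Puma concolor", "site-1", "2010-10-28"),
    OccurrenceTuple("Individual", "individual-42", "site-1", "2010-10-28"),
    OccurrenceTuple("vernacularName", "cougar", "site-1", "2010-10-28"),
    OccurrenceTuple("Identification", "identification-7", "site-1", "2010-10-28",
                    {"recordedBy": "J. Doe"}),
]

# All four share the same place and time; only the first element differs.
shared = {(r.location, r.time) for r in records}
```

In each case the Individual is not stored explicitly but is implied by whatever token fills the first slot.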
Joel.
- so much stuff flying by left and right
-hilmar
=========================================================== : Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org : ===========================================================
On Oct 28, 2010, at 11:58 AM, joel sachs wrote:
Hey Rich, Hilmar, Paul, and everyone -
I liked the definition from a couple of weeks ago:
"An occurrence is a tuple consisting of time, place, individual, and some optional properties."
What's that lacking?
Nothing, but the form of "occurrence" as a 3-way relation between a thing, a place and a time might be less amenable to formal reasoning than some other formulations, depending on how it's rendered.
I thought that was the reason others had introduced "event" as place + time. Otherwise, what is the reason for "event"?
Arlin ------- Arlin Stoltzfus (arlin@umd.edu) Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST IBBR, 9600 Gudelsky Drive, Rockville, MD tel: 240 314 6208; web: www.molevol.org
On Thu, 28 Oct 2010, Arlin Stoltzfus wrote:
On Oct 28, 2010, at 11:58 AM, joel sachs wrote:
Hey Rich, Hilmar, Paul, and everyone -
I liked the definition from a couple of weeks ago:
"An occurrence is a tuple consisting of time, place, individual, and some optional properties."
What's that lacking?
Nothing, but the form of "occurrence" as a 3-way relation between a thing, a place and a time might be less amenable to formal reasoning than some other formulations, depending on how it's rendered.
It might be. But I haven't seen any examples of the kind of reasoning that the alternative definitions seek to enable. Did I miss them?
I thought that was the reason others had introduced "event" as place + time. Otherwise, what is the reason for "event"?
I was wondering the same thing. Event=(Location, Time) strikes me as weird, for a couple of reasons:
i. Intuitively, an event is something that happens at a region of space/time, not the region itself.
ii. More significantly, a DwC:event is a container for metadata that gets attached to occurrences via eventID. A DwC:event corresponds to the intuitive definition above, since you can have multiple DwC:events over the same space/time region, e.g. two groups surveying for different taxa, using different methodologies.
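Point ii can be sketched as follows: two events cover the same place and date but carry different metadata, and occurrences are attached to one or the other through an event identifier. The class names, fields, and data are assumptions for illustration only, not the Darwin Core schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    location: str
    date: str
    taxa_surveyed: str  # hypothetical metadata field
    method: str         # hypothetical metadata field


@dataclass(frozen=True)
class Occurrence:
    occurrence_id: str
    event_id: str       # link to the Event that carries the shared metadata
    individual: str


# Two survey teams working the same plot on the same day (made-up data).
events = {
    "ev-1": Event("ev-1", "plot A", "2010-10-28", "birds", "point count"),
    "ev-2": Event("ev-2", "plot A", "2010-10-28", "beetles", "pitfall trap"),
}
occurrences = [
    Occurrence("occ-1", "ev-1", "bird 1"),
    Occurrence("occ-2", "ev-2", "beetle 1"),
    Occurrence("occ-3", "ev-2", "beetle 2"),
]


def event_for(occurrence: Occurrence) -> Event:
    """Look up the Event whose metadata applies to this occurrence."""
    return events[occurrence.event_id]


def same_region(a: Event, b: Event) -> bool:
    """True when two events cover the same place and date."""
    return (a.location, a.date) == (b.location, b.date)
```

If an event were nothing more than (Location, Time), `ev-1` and `ev-2` would collapse into one record and the difference in method would be lost.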
Joel.
I agree with most of the statements made during this thread (the majority of which were posted prior to sunrise in Hawaii).
I thought that was the reason others had introduced "event" as place + time. Otherwise, what is the reason for "event"?
I was wondering the same thing. Event=(Location, Time) strikes me as weird, for a couple of reasons:
i. Intuitively, an event is something that happens at a region of space/time, not the region itself.
ii. More significantly, a DwC:event is a container for metadata that gets attached to occurrences via eventID. A DwC:event corresponds to the intuitive definition above, since you can have multiple DwC:events over the same space/time region, e.g. two groups surveying for different taxa, using different methodologies.
A couple of things:
1) In my mind, an Event is, and has always been, in its most fundamental form, a "tuple" (exercising my new vocabulary) of Location + Time, both of which are scoped in some way.
2) Less clear is whether an "Event" also includes things like collectors, collection method, habitat description, etc. (deliberately avoiding dwc terms at the moment).
3) Coming back to point 1, an Event does not necessarily need to be treated as a tuple of Location + Time, if you view it from the perspective of 4-dimensional space-time (with time representing the 4th dimension, in addition to the more conventional 3 dimensions of space). Location is already a tuple of latitude and longitude; and indeed is most properly represented as a triple of latitude+longitude+altitude (X+Y+Z). Time is simply the fourth dimension. In other words, there is a fundamental argument that could be made that really Location+Time boils down to a single coordinate in 4-dimensional space-time.
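The difference between the two views in point 3 is small enough to show directly. In this sketch the type names, field names, and coordinate values are made up, and a single point is a simplification: it ignores the "scoping" (extent or uncertainty) of both location and time mentioned in point 1.

```python
from typing import NamedTuple


class Location(NamedTuple):
    latitude: float
    longitude: float
    altitude_m: float


class SpaceTimePoint(NamedTuple):
    latitude: float
    longitude: float
    altitude_m: float
    time: str  # ISO 8601 timestamp, the fourth coordinate


def as_location_time_tuple(location: Location, time: str) -> tuple:
    """View 1: an Event as a 2-tuple of (Location, Time)."""
    return (location, time)


def as_space_time_point(location: Location, time: str) -> SpaceTimePoint:
    """View 2: the same information flattened into one 4-D coordinate."""
    return SpaceTimePoint(*location, time)


site = Location(21.3, -157.8, 0.0)  # illustrative values only
nested = as_location_time_tuple(site, "2010-10-28T06:00:00-10:00")
flat = as_space_time_point(site, "2010-10-28T06:00:00-10:00")
```

Both forms hold exactly the same four values; the choice is about structure, not content.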
Side Note to Jim Croft: We're still only treading on the outskirts of weirdness -- we still have a lot of weirdness-space that we have yet to venture into.
Aloha, Rich
Hmm, perhaps I had no good reason to assume that Event was being used as a shorthand for (Location, Time). I can see that there is a DwC:event implied by every occurrence record, even when that event is nowhere described. (And now I see that Rich thinks that Event does equal Location + Time, and further see that this could get wiry.)
But my objection isn't to Occurrence=(Individual, Event). I simply think that we're better off conceiving of occurrences as tuples, rather than defining them as intersections, or products, or whatever.
Joel.
On Oct 28, 2010, at 2:23 PM, joel sachs wrote:
I was wondering the same thing. Event=(Location, Time) strikes me as weird, for a couple of reasons:
i. Intuitively, an event is something that happens at a region of space/time, not the region itself.
Yeah, some act took place at that place and time. In the context of Occurrence, probably an act of sampling or collecting, and so one should expect metadata documenting that act (who, how, etc)
-hilmar
Yeah, some act took place at that place and time. In the context of Occurrence, probably an act of sampling or collecting, and so one should expect metadata documenting that act (who, how, etc)
As I alluded to in my previous post, it's not clear to me whether the who, how, etc. are intrinsic properties of the Event, or if the "Event" simply represents the single coordinate in 4-D space-time; and the other stuff is more a function of the Occurrence (i.e., more metadata about documenting the tuple of Event+Individual). I can see rational arguments either way. My inclination is that the word "Event" goes beyond merely the space-time coordinate, and implies some sort of "action". As such, my inclination is to define the "Event" as more than just the 4D space-time coordinate, and include who, how, etc. as part of the "action" of the Event.
As is probably obvious, I haven't thought this through extensively yet, so I reserve the right to change my mind.
Rich
As Stan said in an earlier message today:
"Back to the issue of event definition: If Event is defined as just the conjunction (association, intersection, join) of space and time, there is nothing to tell you why this particular interval is of interest."
In other words the who and how and what they did that made the event of interest.
julian
This "three-way" was the essence of the definition used in the ASC model.
<copied text, from http://wiki.tdwg.org/twiki/bin/viewfile/TAG/HistoricalDocuments?rev=1;filename=Ascmodrpt.pdf page 24>
Entity Name: COLLECTING-EVENT (Supertype) Description: The act of collecting zero or more COLLECTING-UNITs at a particular LOCALITY and TIME.
</copied text>
Other CollectingEvent properties discussed included the conceptual equivalents of: CollectingMethod, Collectors (0-many), StatedDateTime, and StatedLocality (=verbatimLocality).
Note that the zero-or-more cardinality on collecting-units (covering Roger's observed absence) was discussed at some length and kept intentionally. (I'll try to find John Damuth's very funny proposal to establish the department of NULL collections at the Smithsonian.) I think observations were also discussed as a type of collecting-unit, but they weren't included in the draft and I don't remember why. Perhaps because the focus was on collections, and observations would have expanded the scope too much to be dealt with adequately. Also, the model did not include, in a structured way, any measure of collecting (sampling) effort. That would have been relegated to text in a collecting method or collecting event remarks (inadequate for quantifying abundance).
Back to the issue of event definition: If Event is defined as just the conjunction (association, intersection, join) of space and time, there is nothing to tell you why this particular interval is of interest. From the old school information modeling perspective, the definition should say WHAT happened. In our biodiversity domain, it implies the act of trying to collect or observe and that implies a collector/observer and something collected/observed, including zeros. I see Joel just posted support for that notion.
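A small sketch of the COLLECTING-EVENT idea, with its deliberate zero-or-more cardinality, might look like this. The Python class and field names are only loose stand-ins for the ASC entity and the properties listed above, and the data are invented.

```python
from dataclasses import dataclass, field


@dataclass
class CollectingEvent:
    """The act of collecting zero or more units at a locality and time."""
    locality: str
    date: str
    collectors: list = field(default_factory=list)        # 0-many collectors
    collecting_method: str = ""
    collecting_units: list = field(default_factory=list)  # zero or more units

    def found_nothing(self) -> bool:
        # An event with no units still records that someone looked.
        return len(self.collecting_units) == 0


# Made-up data: one productive event and one that documents an absence.
productive = CollectingEvent("creek bank", "1992-06-01", ["A. Collector"],
                             "hand net", ["unit-1", "unit-2"])
empty = CollectingEvent("creek bank", "1992-06-02", ["A. Collector"], "hand net")
```

Because the event says WHAT happened (who tried to collect, and how), the second record is meaningful even though nothing was collected.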
And just to show that I'm not completely stuck in 1992, the MVZ model (a more detailed model for mammal, bird and herp collections, completed as recently as 1996) recognized a distinction between the number (count) of items observed and the number collected.
Our challenge is still how to accumulate these artifacts of conceptualization in an organized way, and to record how much support there is for particular concepts.
Cheers,
-Stan
Stan, it does my heart good to see both the continuing relevance of that lengthy weekend's discussions (and BBQ), and how much we actually got "right."
On 29/10/2010, at 2:58 AM, joel sachs wrote:
Hey Rich, Hilmar, Paul, and everyone - I liked the definition from a couple of weeks ago: "An occurrence is a tuple consisting of time, place, individual, and some optional properties." What's that lacking?
I joined this list just recently, and missed that post. I like 'Tuple'.
Actually, I should get on with what I am actually supposed to be doing, and add the DwC predicates to the data at biodiversity.org.au .
Speaking of which - (Looking back on what I have written below, it's very disorganised. Just a brain dump, really. ) :
Currently, I have used the TDWG rdf vocabulary as far as I am able to work it out. For instance: http://biodiversity.org.au/apni.name/33407.rdf (aka: http://biodiversity.org.au/apni.name/33407 , urn:lsid:biodiversity.org.au:apni.name:33407 )
Of course, not having an rdfs:domain predicate does make things difficult to untangle: when I read the DwC vocabulary in with Protege, I just have a list of predicate names. Luckily, the quick reference guide (http://rs.tdwg.org/dwc/terms/index.htm) does organise the properties into the classes they apply to. The only DwC classes that our data involves at this stage would be Taxon and ResourceRelationship.
----------------------
As per the TDWG vocabulary, we make a fairly strong distinction between taxonomic and nomenclatural components. A TaxonName is not a TaxonConcept. I'm finding that the Taxon predicates in the DwC vocabulary seem to be a mix of things that variously belong to names and taxa. My impression is that the distinction is there, in fact - it is modelled by a DwC taxon having or not having a nameAccordingTo rather than by an explicit class. If there is no AccordingTo, then we are discussing the "nominal taxon" - what the name means in the absence of any specific information about what it means.
But as we are so careful to distinguish between name and taxon, I think I will take the (safer) position that a Name is not the same thing as its nominal taxon. That is, I will not declare that biodiversity.org.au names are DwC taxa, even though they have properties from DwC.
(Perhaps our data should generate an id for these nominal taxa - it's easy enough, just use the name objectid as the taxon objectid and "[afd|apni].taxon.nominal" as the LSID namespace. In principle, everyone who uses a name is also asserting that their taxon "is congruent to" the nominal taxon. Every synonym relationship is also an assertion of synonymy to the nominal taxon. But that's an awful lot of unnecessary detail to make explicit - over-engineering things is one of my failings. Forget I said it.)
----------------------
DwC properties variously use "taxonID" and also "nameUsageID". Now, I believe I understand the distinction: not all usages of a name are of taxonomic interest (my favourite example is a bottle of weedkiller that happens to mention a scientific name.) Our databases only contain name usages that are taxa, so the distinction does not arise - a name usage is simply a taxon.
However, not all of our names are scientific names. We have cultivar names, and we have vernacular names. All usages of these are TDWG TaxonConcepts - they have synonymy relationships and so on. However, the DwC property for declaring that a taxon record has a name seems to be "scientificNameID". This would seem to be inappropriate for taxa that don't have scientific names. I think that the correct way for me to go is to not declare these taxa as DwC taxa at all. That is, the absence of a "nameID" property seems to indicate that DwC is only "interested" in scientific names - scientific taxa if you will.
To continue: These properties apply to our taxa (TaxonConcepts) without difficulty: scientificNameID, parentNameUsageID, nameAccordingToID
These apply to our taxon names: acceptedNameUsageID, originalNameUsageID, namePublishedInID, scientificNameAuthorship
One of the wiki pages seemed to indicate that Taxa would have both a nameAccordingToID and also the namePublishedInID (the two being equal indicating that the taxon is the original one), but I think we will continue to not do this on the grounds that it's best to assert things only once to avoid data inconsistencies.
----------------------
scientificName | higherClassification | kingdom | phylum | class | order | family | genus | subgenus | specificEpithet | infraspecificEpithet
The various properties for name parts are ... problematic from the point of view of our data. These properties sort of do double duty: they are places for putting parts of names (ie, strictly nomenclatural), and they also are places to put taxonomy.
With respect to holding name parts, there seems to be no property in which to put - for instance - a subfamily name. The closest thing is "infraspecificEpithet", which contains the terminal epithet, but obviously that's not right for supergeneric names. TCS and the TDWG vocabulary have "uninomial". It might be nice to have this property, and to declare the other bits as being subproperties.
With respect to taxonomy, if you want to use these for holding taxonomic relationships, then you don't need "order", you need "orderNameUsageID" or "orderTaxonID".
Of course, what's really going on here is that these fields are simply a denormalisation of the data. Let's face it: in my data, I do indeed have the scientific name string in the taxon record even though *technically* it's duplicating the data. So I think the conclusion is that these properties *on taxon records* are denormalisation, whereas these properties *on name records* are primary data. This is fine for me, but only because I have a separate TaxonName class.
----------------------
taxonRank | verbatimTaxonRank
Simple enough - "taxonRank" is controlled, "verbatim" is not. It's yet another mapping exercise for me, but them's the breaks. The whole "rank" issue is so fraught that one of our datasets here uses numeric codes. Which is fine, until you fill up all of the slots. What the world really needs is a dotted decimal notation, where negative numbers are allowed. Family, subfamily, and superfamily would be "5", "5.1", "5.-1". If you ever need a sub-superfamily, then it's "5.-1.1" . But maybe that's over-engineering things again.
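The dotted-decimal rank idea can be made concrete with a short sort key. This is a sketch under stated assumptions: the code "5" for family comes from the example above, but the ordering rule (smaller values sort as higher ranks, and missing components count as zero) and the function names are assumptions introduced here.

```python
def parse_rank_code(code: str) -> tuple:
    """Split a dotted rank code such as "5.-1.1" into integers (5, -1, 1)."""
    return tuple(int(part) for part in code.split("."))


def rank_sort_key(code: str, width: int = 4) -> tuple:
    """Pad with zeros so that "5" sorts between "5.-1" and "5.1"."""
    parts = parse_rank_code(code)
    if len(parts) > width:
        raise ValueError(f"rank code {code!r} has more than {width} components")
    return parts + (0,) * (width - len(parts))


codes = {
    "5": "family",
    "5.1": "subfamily",
    "5.-1": "superfamily",
    "5.-1.1": "sub-superfamily",
}

# Order the ranks from highest to lowest.
ordered = [codes[c] for c in sorted(codes, key=rank_sort_key)]
```

The zero padding matters: comparing the raw tuples would put (5,) ahead of (5, -1), placing family above superfamily.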
In any case. According to the wiki page, the controlled vocabulary seems to be just a list of strings. I would have expected them to be typed named individuals, permitting you to have an abbreviation, and the English and Latin name. A difficulty is that in order to render a botanical name correctly, you need the rank abbreviation string: "Evolvulus alsinoides var. sericeus". At present, there is no DwC property for that.
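To illustrate why a rank vocabulary of structured entries would help with rendering, here is a minimal sketch. The `Rank` structure, the `RANKS` table, and the function name are hypothetical; only the abbreviations and the English and Latin rank names are standard botanical usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rank:
    abbreviation: str
    english: str
    latin: str


# A tiny hypothetical vocabulary of infraspecific ranks.
RANKS = {
    "variety": Rank("var.", "variety", "varietas"),
    "subspecies": Rank("subsp.", "subspecies", "subspecies"),
    "form": Rank("f.", "form", "forma"),
}


def render_infraspecific_name(genus: str, specific_epithet: str,
                              rank_key: str, infraspecific_epithet: str) -> str:
    """Join the name parts, inserting the rank abbreviation botany requires."""
    rank = RANKS[rank_key]
    return f"{genus} {specific_epithet} {rank.abbreviation} {infraspecific_epithet}"


name = render_infraspecific_name("Evolvulus", "alsinoides", "variety", "sericeus")
```

With a plain list of rank strings, the "var." would have to come from somewhere outside the vocabulary.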
----------------------
In summary - shouldn't be too difficult. At least, to get the basics up.
Hi Paul,
I'll try to address your questions relating to DwC terms in the class "Taxon". I'm partly responsible for some of them being there, and even I was a bit confused about what a couple of them meant. Fortunately, I had the opportunity to sit down with Markus in Berlin one evening last week, and that conversation helped clear up a number of things in my mind (of course, Markus may well contradict what I'm about to write).
As per the TDWG vocabulary, we make a fairly strong distinction between taxonomic and nomenclatural components. A TaxonName is not a TaxonConcept. I'm finding that the Taxon predicates in the DwC vocabulary seem to be a mix of things that variously belong to names and taxa. My impression is that the distinction is there, in fact - it is modelled by a DwC taxon having or not having a nameAccordingTo rather than by an explicit class. If there is no AccordingTo, then we are discussing the "nominal taxon" - what the name means in the absence of any specific information about what it means.
I think that's generally a safe assumption; but I think it's a bit more involved than that.
But as we are so careful to distinguish between name and taxon, I think I will take the (safer) position that a Name is not the same thing as its nominal taxon. That is, I will not declare that biodiversity.org.au names are DwC taxa, even though they have properties from DwC.
Treating "name" as a distinct entity (independent of a particular usage of the name) is problematic, because there are several different interpretations of what a "name" is. Is it a simple text string? Or, is it a nomenclatural "object" with properties beyond the text string? Are all orthographic variants and misspellings different representations of the same "name" (object perspective); or is each variant a different "name" (text-string perspective)? Is a name formatted as "Genus (Subgenus) species" the same as a name formatted as "Genus species"? Is authorship and associated details part of the name? What about infraspecific prefixes such as "var." and "subsp."? This is just a sampling of questions for which you will find a variety of answers when talking to different people in our community about what a "name" is.
For this reason, I'm rather unclear on what sorts of identifiers one might populate dwc:scientificNameID with. I would have guessed that this is where you would put an identifier for a Taxon Name Usage (TNU) record that represents either a Protonym (~=basionym), a New Combination, a Replacement name (nom. nov.), or the like; but we already have dwc:originalNameUsageID for that function. Perhaps dwc:scientificNameID should link to a nominal concept record? Or maybe something like an ITIS TSN record? I've never been very clear on this. The example given (urn:lsid:ipni.org:names:37829-1:1.3) doesn't seem to resolve right now, but the corresponding IPNI record shows details for what looks to me like a TNU (and hence, probably best represented via dwc:originalNameUsageID).
(Perhaps our data should generate an id for these nominal taxa - it's easy enough, just use the name objectid as the taxon objectid and "[afd|apni].taxon.nominal" as the LSID namespace. In principle, everyone who uses a name is also asserting that their taxon "is congruent to" the nominal taxon. Every synonym relationship is also an assertion of synonymy to the nominal taxon. But that's an awful lot of unnecessary detail to make explicit - over-engineering things is one of my failings. Forget I said it.)
The topic of Nominal Concepts is definitely one that needs to be hammered out at some point -- but I agree, now may not be the best time.
DwC properties variously use "taxonID" and also "nameUsageID".
That's what I used to think too -- but then I realised that the unqualified "nameUsageID" isn't in the dwc spec (as far as I can tell) -- only the qualified versions (dwc:acceptedNameUsageID, dwc:parentNameUsageID, and dwc:originalNameUsageID).
Thus, I interpret TaxonID to effectively be nameUsageID (and Markus agreed when we discussed this -- right Markus???).
However, not all of our names are scientific names. We have cultivar names, and we have vernacular names. All usages of these are TDWG TaxonConcepts - they have synonymy relationships and so on. However, the DwC property for declaring that a taxon record has a name seems to be "scientificNameID". This would seem to be inappropriate for taxa that don't have scientific names. I think that the correct way for me to go is to not declare these taxa as DwC taxa at all. That is, the absence of a "nameID" property seems to indicate that DwC is only "interested" in scientific names - scientific taxa if you will.
I tend to agree. I think cultivars will fit within the scientificName framework reasonably well; but not so much for vernaculars. I think that they could be represented by a taxonID instance -- but I don't see where you would put the actual vernacular name-string.
To continue: These properties apply to our taxa (TaxonConcepts) without difficulty: scientificNameID parentNameUsageID nameAccordingToID
These apply to our taxon names: acceptedNameUsageID originalNameUsageID namePublishedInID scientificNameAuthorship
One of the wiki pages seemed to indicate that Taxa would have both a nameAccordingToID and also the namePublishedInID (the two being equal indicating that the taxon is the original one), but I think we will continue to not do this on the grounds that it's best to assert things only once to avoid data inconsistencies.
Actually, these are quite different things. They only are identical if you are passing the original taxon concept circumscription that was used when the name was first established under the Code. In the (vast?) majority of cases, they will be different; with dwc:nameAccordingToID pointing to the publication representing the particular taxon concept circumscription, and namePublishedInID pointing to the publication in which the name was formally established under the relevant Code.
scientificName higherClassification kingdom | phylum | class | order | family | genus | subgenus| specificEpithet | infraspecificEpithet
The various properties for name parts are ... problematic from the point of view of our data. These properties sort of do double duty: they are places for putting parts of names (ie, strictly nomenclatural), and they also are places to put taxonomy.
With respect to holding name parts, there seems to be no property in which to put - for instance - a subfamily name. The closest thing is "infraspecificEpithet", which contains the terminal epithet, but obviously that's not right for supergeneric names. TCS and the TDWG vocabulary have "uninomial". It might be nice to have this property, and to declare the other bits as being subproperties.
The idea is that for a record representing the subfamily name itself, the text-string subfamily name goes in dwc:scientificName. But for names below the rank of subfamily, the subfamily name is generally not included among the parsed classification elements (neither are any other higher infra-rank names). The real information, I think, goes in dwc:scientificName. The terms dwc:genus, dwc:subgenus, dwc:specificEpithet, and dwc:infraspecificEpithet (as well as dwc:scientificNameAuthorship) are there to allow you to provide pre-parsed name elements of a compound name represented in scientificName.
With respect to taxonomy, if you want to use these for holding taxonomic relationships, then you don't need "order", you need "orderNameUsageID" or "orderTaxonID".
No, because presumably there would be a record for the Order name itself (linked to child names via a series of parentNameUsageID), which would have its own value of taxonID (="nameUsageID")
Of course, what's really going on here is that these fields are simply a denormalisation of the data.
Yes, exactly. You can represent them in a normalized way using the available terms, but not all people have the information broken into a normalised form, so DwC accommodates a denormalized representation as well.
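The relationship between the two representations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers ("tnu:1" and so on), the family name "Aidae", and the helper denormalise are invented for the example, and the field names are simply the DwC term names used as dictionary keys.

```python
# Hypothetical normalized records: one row per name usage, each linked to its
# parent through parentNameUsageID. All identifiers and names are invented.
RECORDS = {
    "tnu:1": {"scientificName": "Aidae", "taxonRank": "family",
              "parentNameUsageID": None},
    "tnu:2": {"scientificName": "Aus", "taxonRank": "genus",
              "parentNameUsageID": "tnu:1"},
    "tnu:3": {"scientificName": "Aus bus", "taxonRank": "species",
              "parentNameUsageID": "tnu:2"},
}


def denormalise(taxon_id, records):
    """Walk the parentNameUsageID chain and copy each ancestor's name into a
    flat field named after that ancestor's rank (genus, family, ...)."""
    flat = {"taxonID": taxon_id,
            "scientificName": records[taxon_id]["scientificName"]}
    seen = {taxon_id}
    parent_id = records[taxon_id]["parentNameUsageID"]
    while parent_id is not None:
        if parent_id in seen:  # guard against a malformed, cyclic chain
            raise ValueError(f"cycle in parent chain at {parent_id}")
        seen.add(parent_id)
        parent = records[parent_id]
        flat[parent["taxonRank"]] = parent["scientificName"]
        parent_id = parent["parentNameUsageID"]
    return flat


print(denormalise("tnu:3", RECORDS))
```

The flat record carries copies of the genus and family strings, which is the sense in which the name-part fields on a taxon record are a denormalisation of data that the parent links already express.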
Let's face it: in my data, I do indeed have the scientific name string in the taxon record even though *technically* it's duplicating the data. So I think the conclusion is that these properties *on taxon records* are denormalisation, whereas these properties *on name records* are primary data. This is fine for me, but only because I have a separate TaxonName class.
I don't think I understand the difference in normalisation between records representing names, and records representing concepts. They both seem equally denormalised to me. Either the name *is* the object being described, or the name elements are labels for the element being described, but in both cases, the same amount of denormalisation seems to be happening.
taxonRank | verbatimTaxonRank
Simple enough - "taxonRank" is controlled, "verbatim" is not. It's yet another mapping exercise for me, but them's the breaks. The whole "rank" issue is so fraught that one of our datasets here uses numeric codes. Which is fine, until you fill up all of the slots. What the world really needs is a dotted decimal notation, where negative numbers are allowed. Family, subfamily, and superfamily would be "5", "5.1", "5.-1". If you ever need a sub-superfamily, then it's "5.-1.1" . But maybe that's over-engineering things again.
I'm not sure I understand the value of a numeric surrogate for rank in DwC in place of (or in addition to) a controlled vocabulary for taxonRank. Sure, you can do clever semantic things, but it seems to me that that kind of cleverness should be embedded within code logic tied to the controlled vocabulary, not made part of DwC itself.
In any case. According to the wiki page, the controlled vocabulary seems to be just a list of strings. I would have expected them to be typed named individuals, permitting you to have an abbreviation, and the English and Latin name. A difficulty is that in order to render a botanical name correctly, you need the rank abbreviation string: "Evolvulus alsinoides var. sericeus". At present, there is no DwC property for that.
Agreed -- there needs to be more robust attributes for the taxonRank controlled vocabulary. They probably shouldn't be part of DwC, but we should have a community-shared representation of what those attributes are (e.g., standard abbreviations for each rank that can be used for concatenating a "standard" compound name-string). Markus and I discussed that a bit in Berlin. Once I get my own head around it, I'll try to draft something for further discussion.
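As a rough illustration of what such shared rank attributes would enable, the following Python sketch concatenates a compound name-string from parsed elements. The abbreviation table is an assumption made for the example (no community vocabulary is being quoted here), and render_botanical_name is a hypothetical helper, not part of DwC.

```python
# Assumed abbreviations for a few infraspecific ranks; a community-shared
# vocabulary would be the real source of these strings.
RANK_ABBREVIATIONS = {
    "subspecies": "subsp.",
    "variety": "var.",
    "form": "f.",
}


def render_botanical_name(genus, specific_epithet, rank=None,
                          infraspecific_epithet=None):
    """Join parsed name elements, inserting the rank abbreviation that a
    botanical infraspecific name requires."""
    parts = [genus, specific_epithet]
    if infraspecific_epithet:
        if rank not in RANK_ABBREVIATIONS:
            raise ValueError(f"no abbreviation known for rank {rank!r}")
        parts += [RANK_ABBREVIATIONS[rank], infraspecific_epithet]
    return " ".join(parts)


print(render_botanical_name("Evolvulus", "alsinoides", "variety", "sericeus"))
# prints: Evolvulus alsinoides var. sericeus
```

Without the abbreviation attached to the controlled rank value, the parsed elements alone cannot reproduce the name-string.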
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences Associate Zoologist in Ichthyology Dive Safety Officer Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
Treating "name" as a distinct entity (independent of a particular usage of the name) is problematic, because there are several different interpretations of what a "name" is. Is it a simple text string? Or, is it a nomenclatural "object" with properties beyond the text string?
I believe that the position around these parts (which position I believe I have absorbed mainly by osmosis) is that a name is the thing that comes into being with a nomenclatural act. Which I can't help picturing with the image of someone being knighted: "I dub thee Apus apus! Arise, Sir Taxon!". A text string refers to the abstract thing created by the act ... by the act that the writer of that text string *intended* it to refer to. Of course, a number of things can go wrong at that point.
Are all orthographic variants and misspellings different representations of the same "name" (object perspective);
Drat. Quite right - by what I have said above, orth vars refer to the same name as the correct spelling does. But the fact is that we have them in separate records with separate ids - what else can we do? Does this mean that our "name" records are not "name" records at all? Or that some are, some are not, and that some synonym types indicate not synonymy, but something different? Shall we distinguish between *taxonomic* synonymy (two names meaning the same taxon) and *foo* synonymy (two name strings meaning the same name)? "Nomenclatural synonymy" already means "two names having the same type specimen", so we can't call it that. But whatever we call it, it does seem that there's another layer of mapping. But what if the writer of a string had no specific idea they were using a homonym? Do we have "nominal name" as well as "nominal taxon"?
With respect to holding name parts, there seems to be no property in which to put - for instance - a subfamily name.
The idea is that for a record representing the subfamily name itself, the text-string subfamily name goes in dwc:scientificName.
Thanks. Will do.
I don't think I understand the difference in normalisation between records representing names, and records representing concepts. They both seem equally denormalised to me. Either the name *is* the object being described, or the name elements are labels for the element being described, but in both cases, the same amount of denormalisation seems to be happening.
For our data, a name record only contains elements that are part of the name. A record for a generic name will not contain a family name - the family that a genus is in is taxonomy. A taxon record has a name by way of a name id. In principle, it doesn't have any name strings in it at all. In practice, it contains data drawn from its name and from the names of its supertaxa. However, this data is not "primary" - it's a copy.
I suppose ... In saying that a name record has its name parts "not denormalised", I am saying that the strings representing the parts of the name (including its author and year) are a key - the combination of those parts uniquely identify the name. A taxon has-a set of name parts, by virtue of it having a name and some supertaxa, whereas a name is-a particular combination of name parts. It seems I'm rather more wedded to this "a name is a set of part tuples" than I thought.
I'm not sure I understand the value of a numeric surrogate for rank in DwC in place of (or in addition to) a controlled vocabulary for taxonRank.
I suppose what I was suggesting is that the semantic web (etc) needs another primitive datatype, alongside xs:string and xs:int, to represent a position in an abstract ordering. Dotted decimals - used in jar file manifests, for instance - fit the bill. It has even less chance of becoming a standard than my other bright idea: that "schema URI" and "element name" should be standard facets for the XML datatype, but I can dream.
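A minimal Python sketch of how such dotted-decimal codes could be ordered, using the family/subfamily/superfamily codes given earlier in the thread. Two assumptions are made that the original proposal does not spell out: missing trailing components count as zero, and a smaller key means a higher rank. The function rank_key and the fixed width of 4 are invented for the example.

```python
def rank_key(code, width=4):
    """Turn a dotted-decimal rank code such as "5.-1.1" into a tuple of
    integers, padded with zeros so that codes of different depth compare."""
    parts = [int(p) for p in code.split(".")]
    if len(parts) > width:
        raise ValueError(f"rank code {code!r} is deeper than {width} levels")
    return tuple(parts + [0] * (width - len(parts)))


codes = {
    "5": "family",
    "5.1": "subfamily",
    "5.-1": "superfamily",
    "5.-1.1": "sub-superfamily",
}

# Sort from highest rank to lowest under the assumptions stated above.
for code in sorted(codes, key=rank_key):
    print(code, codes[code])
```

The zero padding is what places "5.-1" above "5": a plain tuple comparison without padding would sort the shorter code first.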
I believe that the position around these parts (which position I believe I have absorbed mainly by osmosis) is that a name is the thing that comes into being with a nomenclatural act.
OK, then we're on the same page. This is what I've been calling a "Protonym", which, as I said in my earlier reply to Steve, means the taxonNameUsage instance that represents the original establishment of a name.
Are all orthographic variants and misspellings different representations of the same "name" (object perspective);
Drat. Quite right - by what I have said above, orth vars refer to the same name as the correct spelling does. But the fact is that we have them in separate records with separate ids - what else can we do?
What I do is track individual usage instances of the orth vars, all anchored back to the protonym. Everything becomes easy if you can atomize things down to individual TNUs. The problem, of course, is that most datasets don't exist at that level of resolution.
Does this mean that our "name" records are not "name" records at all? Or that some are, some are not, and that some synonym types indicate not synonymy, but something different? Shall we distinguish between *taxonomic* synonymy (two names meaning the same taxon) and *foo* synonymy (two name strings meaning the same name)? "Nomenclatural synonymy" already means "two names having the same type specimen", so we can't call it that. But whatever we call it, it does seem that there's another layer of mapping. But what if the writer of a string had no specific idea they were using a homonym? Do we have "nominal name" as well as "nominal taxon"?
Like I said, it all becomes relatively straightforward if you can atomize all the way down to individual Taxon Name Usage instances (aka individual "Treatments" of taxon names).
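To make the "anchored back to the protonym" idea concrete, here is a small Python sketch. The usage records, the misspelling "Aus buss", and the convention that a protonym's originalNameUsageID points to itself are all assumptions made for illustration; spellings_by_protonym is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical usage instances (TNUs). Orthographic variants and new
# combinations keep their own record and id but share one protonym.
USAGES = [
    {"taxonID": "tnu:1", "scientificName": "Aus bus",
     "originalNameUsageID": "tnu:1"},  # assumed: a protonym points to itself
    {"taxonID": "tnu:2", "scientificName": "Aus buss",
     "originalNameUsageID": "tnu:1"},  # invented misspelling
    {"taxonID": "tnu:3", "scientificName": "Xus bus",
     "originalNameUsageID": "tnu:1"},  # later combination
    {"taxonID": "tnu:4", "scientificName": "Aus cus",
     "originalNameUsageID": "tnu:4"},
]


def spellings_by_protonym(usages):
    """Group every name-string that has been used under the protonym it is
    anchored to."""
    groups = defaultdict(set)
    for usage in usages:
        groups[usage["originalNameUsageID"]].add(usage["scientificName"])
    return dict(groups)


for protonym_id, spellings in spellings_by_protonym(USAGES).items():
    print(protonym_id, sorted(spellings))
```

Separate records with separate ids are then not a problem: whether two strings "mean the same name" reduces to whether their usages share a protonym.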
For our data, a name record only contains elements that are part of the name. A record for a generic name will not contain a family name - the family that a genus is in is taxonomy. A taxon record has a name by way of a name id. In principle, it doesn't have any name strings in it at all. In practice, it contains data drawn from its name and from the names of its supertaxa. However, this data is not "primary" - it's a copy.
OK, got it. Thanks.
Aloha, Rich
Paul, Rich, et al. I have decided that it was time to face up to trying to understand the "right side" of that diagram that Rich made a while back, which I tried to put in readable form at: http://bioimages.vanderbilt.edu/pages/token-explicit.gif Up to this point I ignored the right side of the diagram because I basically am unfamiliar with taxon/name issues. But I feel that I should be! I read through Rich's various emails in the thread, Paul's email (below), and looked at the Darwin Core Taxon and Identification class terms and came up with: http://bioimages.vanderbilt.edu/pages/taxon-diagram1.gif Please note that this diagram does NOT represent an opinion on my part but rather an attempt to summarize what people have said in a graphical way.
In general, I have come to understand the following:
1. There are taxon concepts, which I guess represent a particular circumscription of individuals. The taxon concept is the result of some kind of rule that allows one to decide whether particular individuals should be included in that taxon or not. The set of all biological individuals that are included is the actual concept (or maybe not?).
2. There are taxon names, which have been published for the purpose of identifying taxa.
3. There are taxon name usages, which are a sort of node that connects a name with a concept. If I'm getting Rich right, this is the resource to which dwc:Identifications should be tied. Rich also suggested that taxon name usages might be instances of the dwc:Taxon class.
Although these three types of resources aren't all defined as "classes" in Darwin Core, it seems to me that they are classes in the "RDF sense" (i.e. that their instances can be typed to them).
In the diagram, I used triangles to indicate 1:many relationships similar to the way Rich did in his original diagram. In the DwC term index, (http://rs.tdwg.org/dwc/terms/index.htm), there seem to be other entities represented that are like the "acceptedName" (i.e. originalName, parentName) but I've left them off the diagram for simplicity and because I don't fully understand them. Because of the dual use of the xxxxxxxxID terms, I should clarify their use in the diagram. The arrows are used in the way you would use an arrow in an RDF graph. The subject is at the tail, the object is at the head and the predicate is the term beside the arrow. Where I put an xxxxxxxID term, I'm using it in a way that the Linked Data world would use "hasXxxxx". So the arrow from taxonNameUsage to taxonName asserts the statement [taxonNameUsage] hasTaxonName [taxonName] which I'm guessing is equivalent to [taxonNameUsage] scientificNameID [taxonName] Again, I'm not asserting that the xxxxxxID terms mean what I've put on the diagram. I'm guessing and asking if that's what they mean.
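The arrow convention described above can be written out as subject-predicate-object triples. The Python sketch below mirrors the guessed reading of the xxxxxxxID terms as "hasXxxxx" links and makes no claim about what the terms are defined to mean. The "ex:" identifiers and the objects helper are invented for the example.

```python
from collections import namedtuple

# One arrow in the diagram: tail = subject, label = predicate, head = object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical resources; the predicates follow the guessed "hasXxxxx" reading.
graph = [
    Triple("ex:usage1", "dwc:scientificNameID", "ex:name1"),
    Triple("ex:usage1", "dwc:nameAccordingToID", "ex:reference1"),
    Triple("ex:identification1", "dwc:taxonID", "ex:usage1"),
]


def objects(graph, subject, predicate):
    """Return the heads of all arrows leaving `subject` with this label."""
    return [t.object for t in graph
            if t.subject == subject and t.predicate == predicate]


print(objects(graph, "ex:usage1", "dwc:scientificNameID"))
```

Read this way, [taxonNameUsage] scientificNameID [taxonName] is a single triple whose object is another resource rather than a string literal.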
I've indicated my guess about some of the key properties of each of the four "classes" (including also Identifications) by arrows pointing away from the boxes. In a lot of cases there are both xyz and xyzID terms, where xyz would be for a string literal and xyzID would be for a GUID (e.g. URI). But they would behave the same way, so I only showed the xyzID version in the diagram.
Here is a use case for me. I refer to the Gleason and Cronquist key and the Golden Guide to the Trees to identify a tree that I've documented in an Occurrence. identifiedBy would be me. nameAccordingTo would be a reference to the Gleason and Cronquist treatment. namePublishedIn would be the original publication for the species name. identificationReferences would be the Golden Guide and the Gleason and Cronquist key. If somebody like Pete had created a URI to represent the concept, I could refer to that using taxonConceptID. If there weren't such a URI, I'd just skip making a reference to the concept. The identifier for the taxonName could be something like a TSNID and the taxonName instance could have properties like string values for genus, species, scientificName, etc.
So is this anything close to reality? Steve
Paul Murray wrote:
Speaking of which - (Looking back on what I have written below, it's very disorganised. Just a brain dump, really. ) :
Currently, I have used the TDWG rdf vocabulary as far as I am able to work it out. For instance: http://biodiversity.org.au/apni.name/33407.rdf (aka: http://biodiversity.org.au/apni.name/33407 , urn:lsid:biodiversity.org.au:apni.name:33407 )
Of course, not having an rdfs:domain predicate does make things difficult to untangle: when I read the DwC vocabulary in with Protege, I just have a list of predicate names. Luckily, the quick reference guide (http://rs.tdwg.org/dwc/terms/index.htm) does organise the properties into the classes they apply to. The only DwC classes that our data involves at this stage would be Taxon and ResourceRelationship.
As per the TDWG vocabulary, we make a fairly strong distinction between taxonomic and nomenclatural components. A TaxonName is not a TaxonConcept. I'm finding that the Taxon predicates in the DwC vocabulary seem to be a mix of things that variously belong to names and taxa. My impression is that the distinction is there, in fact - it is modelled by a DwC taxon having or not having a nameAccordingTo rather than by an explicit class. If there is no AccordingTo, then we are discussing the "nominal taxon" - what the name means in the absence of any specific information about what it means.
But as we are so careful to distinguish between name and taxon, I think I will take the (safer) position that a Name is not the same thing as its nominal taxon. That is, I will not declare that biodiversity.org.au names are DwC taxa, even though they have properties from DwC.
(Perhaps our data should generate an id for these nominal taxa - it's easy enough, just use the name objectid as the taxon objectid and "[afd|apni].taxon.nominal" as the LSID namespace. In principle, everyone who uses a name is also asserting that their taxon "is congruent to" the nominal taxon. Every synonym relationship is also an assertion of synonymy to the nominal taxon. But that's an awful lot of unnecessary detail to make explicit - over-engineering things is one of my failings. Forget I said it.)
DwC properties variously use "taxonID" and also "nameUsageID". Now, I believe I understand the distinction: not all usages of a name are of taxonomic interest (my favourite example is a bottle of weedkiller that happens to mention a scientific name.) Our databases only contain name usages that are taxa, so the distinction does not arise - a name usage is simply a taxon.
However, not all of our names are scientific names. We have cultivar names, and we have vernacular names. All usages of these are TDWG TaxonConcepts - they have synonymy relationships and so on. However, the DwC property for declaring that a taxon record has a name seems to be "scientificNameID". This would seem to be inappropriate for taxa that don't have scientific names. I think that the correct way for me to go is to not declare these taxa as DwC taxa at all. That is, the absence of a "nameID" property seems to indicate that DwC is only "interested" in scientific names - scientific taxa if you will.
To continue: These properties apply to our taxa (TaxonConcepts) without difficulty: scientificNameID parentNameUsageID nameAccordingToID
These apply to our taxon names: acceptedNameUsageID originalNameUsageID namePublishedInID scientificNameAuthorship
One of the wiki pages seemed to indicate that Taxa would have both a nameAccordingToID and also the namePublishedInID (the two being equal indicating that the taxon is the original one), but I think we will continue to not do this on the grounds that it's best to assert things only once to avoid data inconsistencies.
scientificName higherClassification kingdom | phylum | class | order | family | genus | subgenus| specificEpithet | infraspecificEpithet
The various properties for name parts are ... problematic from the point of view of our data. These properties sort of do double duty: they are places for putting parts of names (ie, strictly nomenclatural), and they also are places to put taxonomy.
With respect to holding name parts, there seems to be no property in which to put - for instance - a subfamily name. The closest thing is "infraspecificEpithet", which contains the terminal epithet, but obviously that's not right for supergeneric names. TCS and the TDWG vocabulary have "uninomial". It might be nice to have this property, and to declare the other bits as being subproperties.
With respect to taxonomy, if you want to use these for holding taxonomic relationships, then you don't need "order", you need "orderNameUsageID" or "orderTaxonID".
Of course, what's really going on here is that these fields are simply a denormalisation of the data. Let's face it: in my data, I do indeed have the scientific name string in the taxon record even though *technically* it's duplicating the data. So I think the conclusion is that these properties *on taxon records* are denormalisation, whereas these properties *on name records* are primary data. This is fine for me, but only because I have a separate TaxonName class.
taxonRank | verbatimTaxonRank
Simple enough - "taxonRank" is controlled, "verbatim" is not. It's yet another mapping exercise for me, but them's the breaks. The whole "rank" issue is so fraught that one of our datasets here uses numeric codes. Which is fine, until you fill up all of the slots. What the world really needs is a dotted decimal notation, where negative numbers are allowed. Family, subfamily, and superfamily would be "5", "5.1", "5.-1". If you ever need a sub-superfamily, then it's "5.-1.1" . But maybe that's over-engineering things again.
In any case. According to the wiki page, the controlled vocabulary seems to be just a list of strings. I would have expected them to be typed named individuals, permitting you to have an abbreviation, and the English and Latin name. A difficulty is that in order to render a botanical name correctly, you need the rank abbreviation string: "Evolvulus alsinoides var. sericeus". At present, there is no DwC property for that.
In summary - shouldn't be too difficult. At least, to get the basics up.
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content .
Dear Steve et al.:
I would've phrased it more or less like this:
2. The names (example: Curculio L.) are labels (!) that denote one or more concept labels (example: 1. Curculio L. sec. Linnaeus, 1758 and 2. Curculio L. sec. Pelsue & O'Brien, 2010).
1. The concepts are something like this (from Franz & Peet, 2009): A taxonomic concept is the underlying meaning, or referential extension, of a scientific name as stated by a particular author in a particular publication. It represents the author’s full-blown view of how the name reaches out to observed or unobserved objects in nature (beyond statements about type specimens). It is a direct reflection of what has been written, illustrated, and deposited by a taxonomist, regardless of his or her theoretical orientation. Taxonomic concepts are labelled using the abbreviation ‘sec.’ for the Latin secundum, or ‘according to’ (Berendsohn, 1995). The ‘sec.’ is preceded by the full Linnaean name and followed by the specific author and publication, as in Andropogon virginicus L. sec. Radford et al. (1968), an earlier concept, versus Andropogon virginicus L. sec. Weakley (2006), which is a later and narrower concept. The consistent practice of handling a taxonomic name only in connection with a specific source makes it possible to trace the evolution of its multiple meanings through time.
So according to this (debatable) view, two things are perhaps most important: (1) by using a fairly rigorous name + reference approach to recognizing concepts, you can have a lot of concepts (inflation) that point to the same set of biological individuals (referring back to your words) but are nevertheless separate as data entities, even if their meanings are apparently congruent. So (2) in that sense the "concept" is not the set of individuals (past, present, future) - that would be the taxon, I presume. The concept is more like a perspective of what the taxon is or might be - and always according to a particular author and reference. Identifications differ, in my mind, from this mainly because the identifier makes no strong claim about challenging a published concept or authoring a new one.
3. It seems that name usage and what I call above taxon concept label are close to synonymous (?).
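The name + reference labelling convention quoted above is mechanical enough to sketch in Python. The example labels are the ones given in the text; the helper names concept_label and split_label are invented, and the sketch assumes that the string " sec. " never occurs inside a name.

```python
from collections import defaultdict


def concept_label(name, source):
    """Build a taxonomic concept label: full name, 'sec.', then the source."""
    return f"{name} sec. {source}"


def split_label(label):
    """Split a concept label back into (name, source)."""
    name, source = label.split(" sec. ", 1)
    return name, source


labels = [
    concept_label("Andropogon virginicus L.", "Radford et al. (1968)"),
    concept_label("Andropogon virginicus L.", "Weakley (2006)"),
    concept_label("Curculio L.", "Linnaeus, 1758"),
    concept_label("Curculio L.", "Pelsue & O'Brien, 2010"),
]

# One name can label several concepts, each tied to a different source.
concepts_by_name = defaultdict(list)
for label in labels:
    name, source = split_label(label)
    concepts_by_name[name].append(source)

for name, sources in concepts_by_name.items():
    print(name, "->", sources)
```

Each label is a separate data entity even when two of them turn out to circumscribe the same organisms, which is the "inflation" noted in point (1).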
Respectfully,
Nico
Nico M. Franz Department of Biology University of Puerto Rico Call Box 9000 Mayagüez, PR 00681-9000
Phone: (787) 832-4040, ext. 3005 Fax: (787) 834-3673 E-mail: nico.franz@upr.edu Website: http://academic.uprm.edu/~franz/
On 11/1/2010 1:33 PM, Steve Baskauf wrote:
Paul, Rich, et al. [...] In general, I have come to understand the following:
1. There are taxon concepts, which I guess represent a particular circumscription of individuals. The taxon concept is the result of some kind of rule that allows one to decide whether particular individuals should be included in that taxon or not. The set of all biological individuals that are included is the actual concept (or maybe not?).
2. There are taxon names, which have been published for the purpose of identifying taxa.
3. There are taxon name usages, which are a sort of node that connects a name with a concept. If I'm getting Rich right, this is the resource to which dwc:Identifications should be tied. Rich also suggested that taxon name usages might be instances of the dwc:Taxon class.
Although these three types of resources aren't all defined as "classes" in Darwin Core, it seems to me that they are classes in the "RDF sense" (i.e. that their instances can be typed to them).
[...]
So is this anything close to reality? Steve
[...]
Hi Steve,
I read through Rich's various emails in the thread, Paul's email (below), and looked at the Darwin Core Taxon and Identification class terms and came up with: http://bioimages.vanderbilt.edu/pages/taxon-diagram1.gif Please note that this diagram does NOT represent an opinion on my part but rather an attempt to summarize what people have said in a graphical way.
That's pretty close. The way we deal with the "taxonName" box, however, is as a subset of taxonNameUsage which we've been calling a "Protonym". Roughly equivalent to a botanical basionym, but more general. Basically, a protonym is "the taxonNameUsage instance that established the name". The word "established" is the one that needs to be carefully defined; for scientific names, it refers to the particular taxonNameUsage instance that established the name in accordance (or attempted accordance) with the relevant Code(s) ("Code(s)" because some names fall under more than one Code -- see Ambiregnal).
Anyway, the point is, I would represent the diagram with the taxonName box represented as a subset of taxonNameUsage, or as a recursive relationship (not sure how best to do that using your symbols).
A more detailed explanation starts on p. 18 (after "Taxa" heading) of this article: http://systbio.org/files/phyloinformatics/1.pdf ...where the term "Assertion" is equivalent to "taxonNameUsage".
1. There are taxon concepts, which I guess represent a particular circumscription of individuals. The taxon concept is the result of some kind of rule that allows one to decide whether particular individuals should be included in that taxon or not. The set of all biological individuals that are included is the actual concept (or maybe not?).
I think that's probably about right, but I generally prefer Nico's wording in his reply. Only thing I'm a little uncomfortable with in his #1 (second paragraph) is the notion that a Taxon Concept relies on the existence of a scientific name. I think that a taxon concept exists independently of the name(s) that have been used to label it, to include circumscribed sets of individuals that have not yet been assigned a scientific name. I also think they can exist independently of a publication. I also tend to think of the concept as the "implied" set of organisms (living, dead, yet-to-be-born), which is more or less what I think Nico means, but perhaps worded slightly differently.
I agree that what I think of as "taxonNameUsage" is somewhat close to what Nico defines as "Taxon Concept", but a TNU doesn't always come with an implied concept -- sometimes it's just a raw name-usage without an implied concept (e.g., in a catalog of type specimens at a Museum). However, I would say that *all* taxon concepts are anchored to at least one TNU instance, keeping in mind that the "N" part doesn't have to be a scientific name, and the "U" doesn't have to occur within the scope of a publication.
3. There are taxon name usages, which are a sort of node that connects a name with a concept. If I'm getting Rich right, this is the resource to which dwc:Identifications should be tied. Rich also suggested that taxon name usages might be instances of the dwc:Taxon class. Although these three types of resources aren't all defined as "classes" in Darwin Core, it seems to me that they are classes in the "RDF sense" (i.e. that their instances can be typed to them).
As mentioned, I think of taxon concepts as additional contextual information that exists in the form of TNUs. Just like Nomenclatural Acts (which are not concepts). Just like a lot of other information that's not really part of a defined taxon concept (but often constitutes information that defines the boundaries of the concept circumscription).
TNUs are really just a core index of documented usages of names for organisms, where "documented" is usually a publication, but can include other forms of static documentation. But the thing is, basically all nomenclatural acts, and all concept definitions, and a very, very large proportion of information about biodiversity (at least the ones that provide context via scientific names) exist as part of a TNU.
In the diagram, I used triangles to indicate 1:many relationships similar to the way Rich did in his original diagram. In the DwC term index, (http://rs.tdwg.org/dwc/terms/index.htm), there seem to be other entities represented that are like the "acceptedName" (i.e. originalName, parentName) but I've left them off the diagram for simplicity and because I don't fully understand them.
acceptedName[ID] refers to the particular usage instance of a name that the data provider believes "got it right" (i.e., right spelling, right parent, right rank, right set of heterotypic synonyms). It's a way of saying, "When we use the name Aus bus, we mean it in the sense of Jones, 1980." It's the "sec" reference that Nico was referring to.
originalName[ID] is a pointer to the Protonym TNU for the terminal name. For example, if Aus bus was originally described by Linnaeus in 1758, there would be a TNU for the epithet "bus" linked to the reference of Linnaeus 1758, and that particular usage instance for "bus" would be the Protonym. If the data provider wanted to represent the name as "Xus bus (Linnaeus) Smith" (botanical standard for saying the species epithet "bus" established by Linnaeus, first combined with the genus "Xus" by Smith), then the originalName[ID] would refer to the (protonym) usage instance of "Aus bus" by Linnaeus.
parentName[ID] is a link to the parent taxon of the particular usage. For example, if the given record (taxonID) referred to the protonym instance of Aus bus in Linnaeus 1758, then the parentName[ID] would point to another TNU representing the treatment of the genus "Aus" in Linnaeus 1758. If the taxonID record is for the usage of Xus bus (L.) by Smith, then the parentName[ID] would refer to the usage instance of the genus name "Xus" by Smith.
All three are recursive links back to other taxonNameUsage instances.
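A minimal sketch of how these recursive links might look in code, using the Aus bus / Xus bus example above. The class name, the field names, and the "tnu:" identifiers are hypothetical illustrations, not part of Darwin Core; only the links described in the text are filled in.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TaxonNameUsage:
    """One documented usage of a name (hypothetical structure, not a DwC schema)."""
    taxon_id: str
    scientific_name: str
    name_according_to: str
    accepted_name_id: Optional[str] = None   # usage the provider believes "got it right"
    original_name_id: Optional[str] = None   # protonym usage for the terminal name
    parent_name_id: Optional[str] = None     # usage of the parent taxon in the same treatment


def parent_chain(usages: Dict[str, TaxonNameUsage], taxon_id: str) -> List[str]:
    """Follow parent links upward and return the names visited, starting with taxon_id."""
    names: List[str] = []
    seen = set()
    current = usages.get(taxon_id)
    while current is not None and current.taxon_id not in seen:
        seen.add(current.taxon_id)  # guard against accidental cycles in the data
        names.append(current.scientific_name)
        current = usages.get(current.parent_name_id) if current.parent_name_id else None
    return names


# Made-up identifiers; every link points back to another usage instance.
usages = {
    u.taxon_id: u
    for u in [
        TaxonNameUsage("tnu:1", "Aus", "Linnaeus 1758"),
        TaxonNameUsage("tnu:2", "Aus bus", "Linnaeus 1758", parent_name_id="tnu:1"),
        TaxonNameUsage("tnu:3", "Xus", "Smith"),
        TaxonNameUsage("tnu:4", "Xus bus", "Smith",
                       original_name_id="tnu:2", parent_name_id="tnu:3"),
    ]
}

protonym = usages[usages["tnu:4"].original_name_id]
print(parent_chain(usages, "tnu:4"), "protonym:", protonym.scientific_name)
```

The point of the sketch is only that all three links resolve to records of the same kind, so a single lookup table of usages is enough to follow any of them.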
I've indicated my guess about some of the key properties of each of the four "classes" (including also Identifications) by arrows pointing away from the boxes. In a lot of cases there are both xyz and xyzID terms, where xyz would be for a string literal and xyzID would be for a GUID (e.g. URI). But they would behave the same way, so I only showed the xyzID version in the diagram.
Most of these seem about right to me, but I haven't examined them in detail yet.
Here is a use case for me. I refer to the Gleason and Cronquist key and the Golden Guide to the Trees to identify a tree that I've documented in an Occurrence. identifiedBy would be me. nameAccordingTo would be a reference to the Gleason and Cronquist treatment. namePublishedIn would be the original publication for the species name.
Agreed.
identificationReferences would be the Golden Guide and the Gleason and Cronquist key.
Not sure -- I guess this presumes that the taxon concept represented in G&C is congruent with the taxon concept represented in the Golden Guide.
If somebody like Pete had created a URI to represent the concept, I could refer to that using taxonConceptID.
Right...and presumably this taxonConceptID would include the fact that the G&C TNU and the Golden Guide TNU represent congruent Concepts.
So is this anything close to reality?
I would say very close...but I didn't study your email or diagram in detail.
Aloha, Rich
Rich:
As a matter of clarification, perhaps also to the group - the "definition" I provided for a taxonomic concept has a bit of a normative quality (agenda is too grand a word). The thinking behind it is that concept taxonomy will fairly rapidly dissolve into name taxonomy if the distinction between acts of authoring (even if congruently), citing, and identifying to, concepts is not maintained with some consistency.
Sure, the Catalogue of Life (as just one example) purports to present some authoritative (mix of) taxonomic view(s). An informal name on a museum specimen by a late expert of the group probably translates into a concept in the mind of a student familiar with the group.
I just think that there's this other taxonomy out there in the future, where we taxonomists think and act more like we care for others (incl. computers) to understand our classifications, where the parts come from, what's congruent and what has changed, how to precisely reconcile with previous views, etc. And for that future to become more real, perhaps a high threshold for identifying new concepts (in the sense of authoring anew [versus citing], not necessarily a new meaning) is needed.
In other contexts, possibly including the representation of identification events in museums, the bar for calling something a concept need not be that high (informal names, names outside of publications, local checklists, etc.). In any case, it's a matter of where one puts the emphasis, and hopefully I've pointed out where I would set it and why.
Respectfully,
Nico
On 11/2/2010 4:19 AM, Richard Pyle wrote:
Hi Steve,
[...]
- There are taxon concepts, which I guess represents a
particular circumscription of individuals. The taxon concept is the result of some kind of rule that allows one to decide whether particular individuals should be included in that taxon or not. The set of all biological individuals that are included are the actual concept (or maybe not?).
I think that's probably about right, but I generally prefer Nico's wording in his reply. Only thing I'm a little uncomfortable with in his #1 (second paragraph) is the notion that a Taxon Concept relies on the existence of a scientific name. I think that a taxon concept exists independently of the name(s) that have been used to label it, to include circumscribed sets of individuals that have not yet been assigned a scientific name. I also think they can exist independently of a publication. I also tend to think of the concept as the "implied" set of organisms (living, dead, yet-to-be-born), which is more or less what I think Nico means, but perhaps worded slightly differently.
I agree that what I think of as "taxonNameUsage" is somewhat close to what Nico defines as "Taxon Concept", but a TNU doesn't always come with an implied concept -- sometimes it's just a raw name-usage without an implied concept (e.g., in a catalog of type specimens at a Museum). However, I would say that *all* taxon concepts are anchored to at least one TNU instance, keeping in mind that the "N" part doesn't have to be a scientific name, and the "U" doesn't have to occur within the scope of a publication. I would say very close...but I didn't study your email or diagram in detail.
Aloha, Rich
Hi Nico,
As a matter of clarification, perhaps also to the group -
the "definition" I provided for a taxonomic concept has a bit of a normative quality (agenda is too grand a word). The thinking behind it is that concept taxonomy will fairly rapidly dissolve into name taxonomy if the distinction between acts of authoring (even if congruently), citing, and identifying to, concepts is not maintained with some consistency.
Agreed (I think...if I understand you correctly).
Sure, the Catalogue of Life (as just one example)
purports to present some authoritative (mix of) taxonomic view(s). An informal name on a museum specimen by a late expert of the group probably translates into a concept in the mind of a student familiar with the group.
Yes, they're all implied concepts. It's just that only a tiny, tiny subset of them are well-defined.
I just think that there's this other taxonomy out there
in the future, where we taxonomists think and act more like we care for others (incl. computers) to understand our classifications, where the parts come from, what's congruent and what has changed, how to precisely reconcile with previous views, etc. And for that future to become more real, perhaps a high threshold for identifying new concepts (in the sense of authoring anew [versus citing], not necessarily a new meaning) is needed.
Yes, I can see that (and we've had this conversation before). But first we need to come up with a common understanding of what a "taxon concept" is, and how to articulate it with some precision. One way to define a concept is as a clade; that is, something like "all descendants of the most recent common ancestor of these two organisms". This is one of the ways that Phylocode establishes the definition of a clade. It's not perfect (nothing is ever perfect), because with such definitions you will often have organisms that belong to two different taxon concepts simultaneously (hybridization, introgression, etc.). But that's not intrinsically a bad thing, as long as it's understood and accommodated. The real problem, of course, is that we're still in our relative infancy in our ability to discern whether or not a particular organism is, or is not, a descendant of the most recent common ancestor of two other organisms. Also, there is the problem of mapping to centuries of legacy information.
So I think the most practical way to define the circumscription boundaries of a taxon concept at this point in history (and the one that is most likely to leverage historical content) is via type specimens (proxied by heterotypic synonyms). It's much fuzzier and less precise than the mechanism described in the previous paragraph, but far more practical.
What you're advocating, I think, represents a reasonable path forward towards more robustly defined, and objectively articulated, taxon concepts.
In other contexts, possibly including the representation
of identification events in museums, the bar for calling something a concept need not be that high (informal names, names outside of publications, local checklists, etc.). In any case, it's a matter of where one puts the emphasis, and hopefully I've pointed out where I would set it and why.
I think the key distinction that should be made is the distinction between a "defined" concept, and an "implied" concept. Almost every Taxon Name Usage (sensu lato) instance carries with it an implied taxon concept, but as I said, the vast, vast majority of those (especially if you include Museum specimen identifications) are extremely anemic on details for understanding the boundaries of the implied taxon concept, and therefore it's difficult or impossible to reliably map the congruency (or not) with other implied or defined taxon concepts. On the other hand, what we should be really striving for is recognition of taxon concepts that are well "defined". These are also rooted in TNUs, but carry with them robust information for inferring the boundaries of the circumscribed concept (full synonymy, robust material examined, robust descriptions of morphological and/or genetic characters, etc.). I think this distinction is important to make ("defined", vs. merely "implied"), because what we'd ultimately like to do is find a way to map implied concepts to well-defined concepts.
Getting back to DwC, one area of direct relevance to this is the Identification class.
Most datasets out there simply slap a taxon name to a specimen or observation. Some of them go so far as to say who identified it to that name, and when. But most do not take the final step and anchor the identification to a particular well-defined concept (or, indeed, any TNU). All specimen identifications represent an action that places the specimen within the boundaries of a circumscribed taxon concept (whether the person making the identification realizes this or not). What we should be striving for is a mechanism to tie specimen identifications to particular TNUs that represent reasonably well-defined taxon concepts. The statement should be:
"On this date, this person asserted that this specimen falls within the taxon concept circumscription of Aus bus (L.) sec. Smith 1990".
If an Identification is a tuple of a Taxon instance (in the DwC sense) and an Occurrence instance (or an Individual instance, if that class becomes established), with associated metadata, then I think DwC is already primed to make the quoted statement above (i.e., to anchor Identifications to particular usage instances, because "taxonID" can represent a specific usage instance -- assuming the right attributes are included). Restated in DwC terms, this would be:
"On [dwc:dateIdentified], [dwc:identifiedBy] asserted that [dwc:occurrenceID|dwc:individualID] falls within the taxon concept circumscription of [taxonID], which represents a taxon name usage for [dwc:scientificName] according to [dwc:nameAccordingToID]"
In the future, if Peter DeVries is successful with his ambitions, then this could be simplified to:
"On [dwc:dateIdentified], [dwc:identifiedBy] asserted that [dwc:occurrenceID|dwc:individualID] falls within the taxon concept circumscription of [taxonID], which represents [dwc:taxonConceptID]"
Or maybe even:
"On [dwc:dateIdentified], [dwc:identifiedBy] asserted that [dwc:occurrenceID|dwc:individualID] falls within the taxon concept circumscription of [taxonConceptID]"
In the latter two examples, taxonConceptID would represent an array or set of TNUs with congruent taxon concept definitions.
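To make the three statement templates above concrete, here is a rough sketch that fills them in from a flat record keyed by DwC term names. The function name, the choice of which template to use, and all record values are assumptions made for illustration; nothing here is prescribed by Darwin Core.

```python
from typing import Mapping


def identification_statement(record: Mapping[str, str]) -> str:
    """Render an identification assertion from DwC-style keys (hypothetical helper)."""
    # Either identifier may name the thing that was identified.
    subject = record.get("occurrenceID") or record.get("individualID")
    if subject is None:
        raise ValueError("record needs an occurrenceID or an individualID")

    head = (
        f"On {record['dateIdentified']}, {record['identifiedBy']} asserted that "
        f"{subject} falls within the taxon concept circumscription of "
    )
    if "taxonID" not in record:
        # Third template: the concept identifier alone.
        return head + record["taxonConceptID"]
    if "taxonConceptID" in record:
        # Second template: a usage instance that represents a concept identifier.
        return head + f"{record['taxonID']}, which represents {record['taxonConceptID']}"
    # First template: a usage instance anchored by name and "according to" reference.
    return head + (
        f"{record['taxonID']}, which represents a taxon name usage for "
        f"{record['scientificName']} according to {record['nameAccordingToID']}"
    )


# All values below are invented placeholders.
example = {
    "dateIdentified": "2010-11-02",
    "identifiedBy": "R. Pyle",
    "occurrenceID": "occ:123",
    "taxonID": "tnu:4",
    "scientificName": "Aus bus (L.)",
    "nameAccordingToID": "ref:Smith1990",
}
print(identification_statement(example))
```

The same record shape serves all three templates; which one is rendered depends only on whether taxonID and taxonConceptID are present.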
I'm still not 100% sure how best to use dwc:identificationReferences, other than perhaps as a method for aggregating multiple TNUs (assumed to refer to congruent, or at least overlapping concept circumscriptions), which were used by someone when making an identification.
Aloha, Rich
Thanks, Rich:
This might quickly turn into a test of endurance for all involved. Nevertheless, I think a couple of things are worth taking up specifically, as done below.
On 11/2/2010 2:36 PM, Richard Pyle wrote:
I just think that there's this other taxonomy out there
in the future, where we taxonomists think and act more like we care for others (incl. computers) to understand our classifications, where the parts come from, what's congruent and what has changed, how to precisely reconcile with previous views, etc. And for that future to become more real, perhaps a high threshold for identifying new concepts (in the sense of authoring anew [versus citing], not necessarily a new meaning) is needed.
Yes, I can see that (and we've had this conversation before). But first we need to come up with a common understanding of what a "taxon concept" is, and how to articulate it with some precision. One way to define a concept is as a clade; that is, something like "all descendants of the most recent common ancestor of these two organisms". This is one of the ways that Phylocode establishes the definition of a clade. It's not perfect (nothing is ever perfect), because with such definitions you will often have organisms that belong to two different taxon concepts simultaneously (hybridization, introgression, etc.). But that's not intrinsically a bad thing, as long as it's understood and accommodated. The real problem, of course, is that we're still in our relative infancy in our ability to discern whether or not a particular organism is, or is not, a descendant of the most recent common ancestor of two other organisms. Also, there is the problem of mapping to centuries of legacy information.
Disagree, on the following grounds. The basic model of reference in play is (crudely): uses of human language <==> some sort of mapping/reference <==> entities in nature. This model allows for the mapping of language to nature to be spot on or way off, which is critical for proper modeling of taxonomic practice through time. Using precise terminology, a taxonomic concept could never be a clade, that would be nature (on the right side of the equation), but at best a perceived clade, according to a particular perspective. In addition, there can be valid concepts that are not clades (in the Hennigian/phylogenetic sense), and not even intended to be clades, for example in groups with lots of horizontal gene transfer or at and below the species level. In short, concepts are meant to variously refer to clades or other reference-worthy groups in nature, but they are not fruitfully equated with clades themselves.
So I think the most practical way to define the circumscription boundaries of a taxon concept at this point in history (and the one that is most likely to leverage historical content) is via type specimens (proxied by heterotypic synonyms). It's much fuzzier and less precise than the mechanism described in the previous paragraph, but far more practical.
What you're advocating, I think, represents a reasonable path forward towards more robustly defined, and objectively articulated, taxon concepts.
Many historically published concepts make use of a combination of ostensive components (things being pointed to; type specimens, type species, other members) and intensional components (properties being referenced; diagnostic features, synapomorphies, metabolic functions, etc.). Each component has strengths and weaknesses, but it's hard for me to imagine that we can do a passable representation job focusing mainly on one and not the other. I'll leave it at that.
In other contexts, possibly including the representation
of identification events in museums, the bar for calling something a concept need not be that high (informal names, names outside of publications, local checklists, etc.). In any case, it's a matter of where one puts the emphasis, and hopefully I've pointed out where I would set it and why.
I think the key distinction that should be made is the distinction between a "defined" concept, and an "implied" concept. Almost every Taxon Name Usage (sensu lato) instance carries with it an implied taxon concept, but as I said, the vast, vast majority of those (especially if you include Museum specimen identifications) are extremely anemic on details for understanding the boundaries of the implied taxon concept, and therefore it's difficult or impossible to reliably map the congruency (or not) with other implied or defined taxon concepts. On the other hand, what we should be really striving for is recognition of taxon concepts that are well "defined". These are also rooted in TNUs, but carry with them robust information for inferring the boundaries of the circumscribed concept (full synonymy, robust material examined, robust descriptions of morphological and/or genetic characters, etc.). I think this distinction is important to make ("defined", vs. merely "implied"), because what we'd ultimately like to do is find a way to map implied concepts to well-defined concepts.
Agree.
Getting back to DwC, one area of direct relevance to this is the Identification class.
Most datasets out there simply slap a taxon name to a specimen or observation. Some of them go so far as to say who identified it to that name, and when. But most do not take the final step and anchor the identification to a particular well-defined concept (or, indeed, any TNU). All specimen identifications represent an action that places the specimen within the boundaries of a circumscribed taxon concept (whether the person making the identification realizes this or not). What we should be striving for is a mechanism to tie specimen identifications to particular TNUs that represent reasonably well-defined taxon concepts. The statement should be:
"On this date, this person asserted that this specimen falls within the taxon concept circumscription of Aus bus (L.) sec. Smith 1990".
Yes.
If an Identification is a tuple of a Taxon instance (in the DwC sense) and an Occurrence instance (or an Individual instance, if that class becomes established), with associated metadata, then I think DwC is already primed to make the quoted statement above (i.e., to anchor Identifications to particular usage instances, because "taxonID" can represent a specific usage instance -- assuming the right attributes are included). Restated in DwC terms, this would be:
"On [dwc:dateIdentified], [dwc:identifiedBy] asserted that [dwc:occurrenceID|dwc:individualID] falls within the taxon concept circumscription of [taxonID], which represents a taxon name usage for [dwc:scientificName] according to [dwc:nameAccordingToID]"
In the future, if Peter DeVries is successful with his ambitions, then this could be simplified to:
"On [dwc:dateIdentified], [dwc:identifiedBy] asserted that [dwc:occurrenceID|dwc:individualID] falls within the taxon concept circumscription of [taxonID], which represents [dwc:taxonConceptID]"
Or maybe even:
"On [dwc:dateIdentified], [dwc:identifiedBy] asserted that [dwc:occurrenceID|dwc:individualID] falls within the taxon concept circumscription of [taxonConceptID]"
In the latter two examples, taxonConceptID would represent an array or set of TNUs with congruent taxon concept definitions.
I'm still not 100% sure how best to use dwc:identificationReferences, other than perhaps as a method for aggregating multiple TNUs (assumed to refer to congruent, or at least overlapping concept circumscriptions), which were used by someone when making an identification.
Aloha, Rich
Cheers,
Nico
Nico M. Franz Department of Biology University of Puerto Rico Call Box 9000 Mayagüez, PR 00681-9000
Phone: (787) 832-4040, ext. 3005 Fax: (787) 834-3673 E-mail: nico.franz@upr.edu Website: http://academic.uprm.edu/~franz/
Hi Nico,
This might quickly turn into a test of endurance for all
involved.
:-)
Disagree,
There's a lot of stuff I wrote that you quoted, and I'm not sure if you disagree with all of it, or just parts of it; but I'll respond to the bits you commented on.
on the following grounds. The basic model of reference in play is (crudely): uses of human language <==> some sort of mapping/reference <==> entities in nature.
I guess I need to understand what you mean by "entities". Is a taxon an entity in nature (existing independently of a human's definition of it), in your view? If so, we may be stuck on first principles, at which point the safest thing to do (for all parties involved) is to agree to disagree.
I'm also struggling to understand the scope of "uses of human language". Are we talking just taxon name-labels? Or do you also include the way we refer to diagnostic characters and such?
This model allows for the mapping of language to nature to be spot on or way off, which is critical for proper modeling of taxonomic practice through time. Using precise terminology, a taxonomic concept could never be a clade, that would be nature (on the right side of the equation), but at best a perceived clade, according to a particular perspective.
Hmmmm...I would say that sentiment applies to all notions of "taxon concept", regardless of whether they're framed as a clade or in some other way.
In addition, there can be valid concepts that are not clades (in the Hennigian/phylogenetic sense), and not even intended to be clades, for example in groups with lots of horizontal gene transfer or at and below the species level. In short, concepts are meant to variously refer to clades or other reference-worthy groups in nature, but they are not fruitfully equated with clades themselves.
Fair enough. Keep in mind that I'm not a clade-warrior - I was only using it as a way to illustrate a somewhat less fuzzy/more objective mechanism for defining a meaningful circumscription of organisms. If one defined the taxon concept circumscription as "all individuals containing genetic material inherited from the most recent common ancestor of this specimen and that specimen", you'll at least have the *potential* to come a hell of a lot closer to objectively determining "within" circumscription vs. "outside" circumscription; compared with almost any other mechanism for doing so. Obviously, hybridization, introgression and (I forgot to include this -- thanks for the catch) lateral gene transfer mess this up. But they mess it up a lot less than most other methods for determining inside vs. outside circumscription.
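As a rough illustration of the "most recent common ancestor" style of circumscription described above, the sketch below tests membership in a strictly tree-shaped genealogy. It assumes every individual has exactly one parent, which is exactly the simplification that hybridization, introgression and lateral gene transfer break; the node names and function names are made up.

```python
from typing import Dict, List, Optional


def ancestors(parent: Dict[str, Optional[str]], node: str) -> List[str]:
    """Return the node followed by each of its ancestors up to the root."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = parent.get(current)
    return path


def mrca(parent: Dict[str, Optional[str]], x: str, y: str) -> str:
    """Most recent common ancestor of x and y in a single-parent tree."""
    lineage_of_x = set(ancestors(parent, x))
    for node in ancestors(parent, y):  # walks from y toward the root
        if node in lineage_of_x:
            return node
    raise ValueError("no common ancestor found")


def in_circumscription(parent: Dict[str, Optional[str]],
                       individual: str, x: str, y: str) -> bool:
    """True if the individual descends from (or is) the MRCA of specimens x and y."""
    return mrca(parent, x, y) in ancestors(parent, individual)


# Toy child -> parent map; "root" has no parent.
parent = {
    "root": None,
    "A": "root", "D": "root",
    "B": "A", "C": "A",
    "b1": "B", "c1": "C", "d1": "D",
}

print(mrca(parent, "b1", "c1"))
print(in_circumscription(parent, "c1", "b1", "c1"),
      in_circumscription(parent, "d1", "b1", "c1"))
```

In this toy setting the inside/outside test is fully objective once the tree is known; the practical difficulty noted in the thread is that the tree itself is rarely known with that certainty.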
Many historically published concepts make use of a combination of ostensive components (things being pointed to; type specimens, type species, other members) and intensional components (properties being referenced; diagnostic features, synapomorphies, metabolic functions, etc.). Each component has strengths and weaknesses, but it's hard for me to imagine that we can do a passable representation job focusing mainly on one and not the other. I'll leave it at that.
No argument here! I'd even take it a step further, in that rarely do we achieve a passable representation even when focusing on *both*. Bottom line: we (as a community) just don't have enough clarity and/or consensus on what a taxon concept should be, and how it should be defined, that we can realistically approach an objective determination of whether a particular organism is within vs. outside a particular taxon concept circumscription.
If I understood the rest of your message correctly, we seem to agree on the part relevant to this list -- that is, how to represent the relationship between an organism (i.e., "Individual") and an implied taxon concept.
Aloha, Rich
Thanks, Rich:
I just HAVE to answer this one thing... (the rest seems either not too important and/or we're largely in agreement).
On 11/2/2010 9:10 PM, Richard Pyle wrote:
on the following grounds. The basic model of reference in play is (crudely): uses of human language <==> some sort of mapping/reference <==> entities in nature.
I guess I need to understand what you mean by "entities". Is a taxon an entity in nature (existing independently of a human's definition of it), in your view? If so, we may be stuck on first principles, at which point the safest thing to do (for all parties involved) is to agree to disagree.
I'm also struggling to understand the scope of "uses of human language". Are we talking just taxon name-labels? Or do you also include the way we refer to diagnostic characters and such?
The whole point of the taxon concept approach - done right in my (not really all that humble) opinion - is that the question about reality, versus construct, versus some mix thereof, is not really relevant. Wrong question, so to speak. A solid taxonomic concept approach should be able to accommodate taxonomic practice as it is actually being done.
If all taxonomists thought that their perceptions of taxa (including feature diagnoses) map to something "objectively" real (independently of the particularities of human cognitive abilities and semantic conventions => "the causal structure of the world") - fine, then a well-executed taxon concept approach shouldn't have a problem with that.
If, on the other hand, taxonomists thought of their products merely as a matter of quasi-reliable and convenient vocabularies that somehow reflect something about the human-external world but could well be very different and still serve their purpose ("arbitrary constructs" - though it's never that arbitrary once you start down a given path and test for reliability) - then just the same that should be accommodated within a taxon concept approach.
So then, the reason why mentions of PhyloCode-like definitions of clades vis-a-vis concept taxonomy tend to give me light allergies, is because phylogenetic taxonomy actually does on occasion make fairly strong claims about what nature is like, and how good taxonomic practice should reflect this ("use definition type X, not Y"). In that sense, I regard concept taxonomy as a full-fledged alternative and competitor to PhyloCode-like taxonomy. Both, I think, try to improve upon the semantics of Linnaean taxonomy and ultimately help users. But the PhyloCode, if I am allowed the strong oversimplification, tries to do so by getting definitions of taxa right once and for all. Concept taxonomy, on the other hand, is exclusively interested in comparing and reconciling different taxonomic "products" (concepts, classifications) published at different times and likely under different systematic paradigms. The issue is not at all whether we got the concepts "right", i.e. whether they closely map to natural taxa. Instead, the goal is to properly archive the sequence of views (so that ontological reasoning may come into play). Normative claims about practice are restricted to the practice of archiving only.
I think taxonomic publications are real (enough), and I think occurrences of intersubjective human understanding and misunderstanding are real (enough). That's what concept taxonomy should concentrate on representing. The rest is up to the producers and users.
Respectfully,
Nico
Hi Nico,
I just HAVE to answer this one thing... (the rest seems
either not too important and/or we're largely in agreement).
[...]
Thanks for that.
I read all of it, understand most of it (I think), and, to the extent that I understand it, I also tend to agree. I do want to clarify that my use of the word "objective" in earlier posts was in no way intended to suggest that I believe taxa exist as objectively-defined entities in nature; but rather that, given an implied taxon concept circumscription (regardless of its correlation with some "natural" entity in nature), to what extent is it an objective (rather than subjective) exercise to place a given individual within, or outside of, that circumscription.
In other words, I'm very much in this camp:
If, on the other hand, taxonomists thought of their
products merely as a matter of quasi-reliable and convenient vocabularies that somehow reflect something about the human-external world but could well be very different and still serve their purpose ("arbitrary constructs" - though it's never that arbitrary once you start down a given path and test for reliability) - then just the same that should be accommodated within a taxon concept approach.
But I gather that's beside the point.
So then, the reason why mentions of PhyloCode-like
definitions of clades vis-a-vis concept taxonomy tend to give me light allergies, is because phylogenetic taxonomy actually does on occasion make fairly strong claims about what nature is like, and how good taxonomic practice should reflect this ("use definition type X, not Y").
Yes...you and I have had several discussions about Phylocode in the past...usually in the company of some alcohol-containing beverage or another....
:-)
I really didn't intend to open that can of worms; but leaving Phylocode out of it, I still maintain that it is legitimate to define a circumscribed set of organisms as "all descendants of the most recent common ancestor of 'X' and 'Y'"; keeping in mind all the caveats that go into the words "descendants" and "common ancestor". Certainly, this is not the only legitimate way to define a circumscribed set of organisms, and I would argue that it's probably not the "best" one either (depending on the metric of "best").
In that sense, I regard concept taxonomy as a full-fledged alternative and competitor to PhyloCode-like taxonomy. Both, I think, try to improve upon the semantics of Linnaean taxonomy and ultimately help users.
I take a more general approach to the notion of Taxon Concept. To me, any circumscribed set of organisms (living, dead, and yet-to-be-born) purported to represent "a taxon", is a taxon concept -- regardless of what methods or metrics are used to define the boundaries of the circumscription. Linnaeus, having preceded Darwin by a century, was a Creationist; yet some of his taxon concepts seem to have persisted over 250 years (e.g., Homo sapiens -- at least at the species level).
The issue is not at all whether we got the concepts "right", i.e. whether they closely map to natural taxa. Instead, the goal is to properly archive the sequence of views (so that ontological reasoning may come into play).
Agreed!
I think taxonomic publications are real (enough), and I
think occurrences of intersubjective human understanding and misunderstanding are real (enough). That's what concept taxonomy should concentrate on representing. The rest is up to the producers and users.
If I understand you correctly, I agree here as well.
Aloha, Rich
My concern is that all this hard work and great discussion is going to end up in the mailing list archives! What we really need is someone to maintain a wiki (or something) that summarises the discussions had here - I know no-one is keen to do this really, but I wonder if this might be something TDWG needs to invest in ???
Kevin
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Nico Franz Sent: Wednesday, 3 November 2010 1:09 p.m. To: Richard Pyle Cc: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] Taxon and Name
Thanks, Rich:
This might quickly turn into a test of endurance for all involved. Nevertheless, I think a couple of things are worth taking up specifically, as done below.
On 11/2/2010 2:36 PM, Richard Pyle wrote:
I just think that there's this other taxonomy out there
in the future, where we taxonomists think and act more like we care for others (incl. computers) to understand our classifications, where the parts come from, what's congruent and what has changed, how to precisely reconcile with previous views, etc. And for that future to become more real, perhaps a high threshold for identifying new concepts (in the sense of authoring anew [versus citing], not necessarily a new meaning) is needed.
Yes, I can see that (and we've had this conversation before). But first we need to come up with a common understanding of what a "taxon concept" is, and how to articulate it with some precision. One way to define a concept is as a clade; that is, something like "all descendants of the most recent common ancestor of these two organisms". This is one of the ways that PhyloCode establishes the definition of a clade. It's not perfect (nothing is ever perfect), because with such definitions you will often have organisms that belong to two different taxon concepts simultaneously (hybridization, introgression, etc.). But that's not intrinsically a bad thing, as long as it's understood and accommodated. The real problem, of course, is that we're still in our relative infancy in our ability to discern whether or not a particular organism is, or is not, a descendant of the most recent common ancestor of two other organisms. Also, there is the problem of mapping to centuries of legacy information.
Disagree, on the following grounds. The basic model of reference in play is (crudely): uses of human language <==> some sort of mapping/reference <==> entities in nature. This model allows for the mapping of language to nature to be spot on or way off, which is critical for proper modeling of taxonomic practice through time. Using precise terminology, a taxonomic concept could never be a clade (that would be nature, on the right side of the equation), but at best a perceived clade, according to a particular perspective. In addition, there can be valid concepts that are not clades (in the Hennigian/phylogenetic sense), and not even intended to be clades, for example in groups with lots of horizontal gene transfer or at and below the species level. In short, concepts are meant to variously refer to clades or other reference-worthy groups in nature, but they are not fruitfully equated with clades themselves.
So I think the most practical way to define the circumscription boundaries of a taxon concept at this point in history (and the one that is most likely to leverage historical content) is via type specimens (proxied by heterotypic synonyms). It's much fuzzier and less precise than the mechanism described in the previous paragraph, but far more practical.
What you're advocating, I think, represents a reasonable path forward towards more robustly defined, and objectively articulated, taxon concepts.
Many historically published concepts make use of a combination of ostensive components (things being pointed to; type specimens, type species, other members) and intensional components (properties being referenced; diagnostic features, synapomorphies, metabolic functions, etc.). Each component has strengths and weaknesses, but it's hard for me to imagine that we can do a passable representation job focusing mainly on one and not the other. I'll leave it at that.
In other contexts, possibly including the representation of identification events in museums, the bar for calling something a concept need not be that high (informal names, names outside of publications, local checklists, etc.). In any case, it's a matter of where one puts the emphasis, and hopefully I've pointed out where I would set it and why.
I think the key distinction that should be made is the distinction between a "defined" concept and an "implied" concept. Almost every Taxon Name Usage (sensu lato) instance carries with it an implied taxon concept, but as I said, the vast, vast majority of those (especially if you include museum specimen identifications) are extremely anemic on details for understanding the boundaries of the implied taxon concept, and therefore it's difficult or impossible to reliably map the congruency (or not) with other implied or defined taxon concepts. On the other hand, what we should really be striving for is recognition of taxon concepts that are well "defined". These are also rooted in TNUs, but carry with them robust information for inferring the boundaries of the circumscribed concept (full synonymy, robust material examined, robust descriptions of morphological and/or genetic characters, etc.). I think this distinction is important to make ("defined" vs. merely "implied"), because what we'd ultimately like to do is find a way to map implied concepts to well-defined concepts.
Agree.
Getting back to DwC, one area of direct relevance to this is the Identification class.
Most datasets out there simply slap a taxon name on a specimen or observation. Some of them go so far as to say who identified it to that name, and when. But most do not take the final step and anchor the identification to a particular well-defined concept (or, indeed, any TNU). All specimen identifications represent an action that places the specimen within the boundaries of a circumscribed taxon concept (whether the person making the identification realizes this or not). What we should be striving for is a mechanism to tie specimen identifications to particular TNUs that represent reasonably well-defined taxon concepts. The statement should be:
"On this date, this person asserted that this specimen falls within the taxon concept circumscription of Aus bus (L.) sec. Smith 1990".
Yes.
If an Identification is a tuple of a Taxon instance (in the DwC sense) and an Occurrence instance (or an Individual instance, if that class becomes established), with associated metadata, then I think DwC is already primed to make the quoted statement above (i.e., to anchor Identifications to particular usage instances, because "taxonID" can represent a specific usage instance, assuming the right attributes are included). Restated in DwC terms, this would be:
"On [dwc:dateIdentified], [dwc:identifiedBy] asserted that [dwc:occurrenceID|dwc:individualID] falls within the taxon concept circumscription of [taxonID], which represents a taxon name usage for [dwc:scientificName] according to [dwc:nameAccordingToID]"
In the future, if Peter DeVries is successful with his ambitions, then this could be simplified to:
"On [dwc:dateIdentified], [dwc:identifiedBy] asserted that [dwc:occurrenceID|dwc:individualID] falls within the taxon concept circumscription of [taxonID], which represents [dwc:taxonConceptID]"
Or maybe even:
"On [dwc:dateIdentified], [dwc:identifiedBy] asserted that [dwc:occurrenceID|dwc:individualID] falls within the taxon concept circumscription of [taxonConceptID]"
In the latter two examples, taxonConceptID would represent an array or set of TNUs with congruent taxon concept definitions.
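To make the statement templates above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the class names, the identifier values ("tnu:0001", "occ:12345", "concept:A", and so on), and the dictionary that groups TNUs under a taxonConceptID are all hypothetical and not part of Darwin Core; only the term names noted in the comments come from the discussion.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class TaxonNameUsage:
    """A usage instance: a name as applied in a particular reference."""
    taxon_id: str              # stands in for taxonID
    scientific_name: str       # dwc:scientificName
    name_according_to_id: str  # dwc:nameAccordingToID


@dataclass(frozen=True)
class Identification:
    """Ties one Occurrence to one usage instance, with who and when."""
    occurrence_id: str         # dwc:occurrenceID
    usage: TaxonNameUsage
    identified_by: str         # dwc:identifiedBy
    date_identified: str       # dwc:dateIdentified, as an ISO 8601 string


def concept_for_usage(taxon_id: str, concepts: Dict[str, Set[str]]) -> Optional[str]:
    """Return the taxonConceptID whose set of TNUs contains taxon_id, if any."""
    for concept_id, usage_ids in concepts.items():
        if taxon_id in usage_ids:
            return concept_id
    return None


def statement(ident: Identification, concepts: Dict[str, Set[str]]) -> str:
    """Render the shortest statement form that the available data supports."""
    prefix = (
        f"On {ident.date_identified}, {ident.identified_by} asserted that "
        f"{ident.occurrence_id} falls within the taxon concept circumscription of "
    )
    concept_id = concept_for_usage(ident.usage.taxon_id, concepts)
    if concept_id is not None:
        # Third form: the usage has been mapped to a taxonConceptID.
        return prefix + concept_id
    # First form: fall back to the usage instance itself.
    u = ident.usage
    return prefix + (
        f"{u.taxon_id}, which represents a taxon name usage for "
        f"{u.scientific_name} according to {u.name_according_to_id}"
    )


# Hypothetical data: one concept grouping two TNUs judged to be congruent.
concepts = {"concept:A": {"tnu:0001", "tnu:0002"}}
mapped = Identification(
    "occ:12345",
    TaxonNameUsage("tnu:0001", "Aus bus (L.)", "ref:Smith1990"),
    "J. Doe", "2010-11-02",
)
unmapped = Identification(
    "occ:67890",
    TaxonNameUsage("tnu:0099", "Aus cus", "ref:Jones2001"),
    "J. Doe", "2010-11-02",
)
print(statement(mapped, concepts))
print(statement(unmapped, concepts))
```

The fallback branch reflects the distinction drawn earlier in the thread: an identification anchored only to a usage instance still carries an implied concept, while one that resolves to a taxonConceptID points at a set of usages whose circumscriptions someone has judged congruent.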
I'm still not 100% sure how best to use dwc:identificationReferences, other than perhaps as a method for aggregating multiple TNUs (assumed to refer to congruent, or at least overlapping concept circumscriptions), which were used by someone when making an identification.
Aloha, Rich
Cheers,
Nico
Nico M. Franz Department of Biology University of Puerto Rico Call Box 9000 Mayagüez, PR 00681-9000
Phone: (787) 832-4040, ext. 3005 Fax: (787) 834-3673 E-mail: nico.franz@upr.edu Website: http://academic.uprm.edu/~franz/ _______________________________________________ tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
That task falls upon my shoulders, I think (as convener of the TDWG TNC working group). I'm actually working on documentation for GNUB now, and I hope to summarize a bunch of these ideas in there (at least the ones I'm professing). But in my copious free time, I'll review the taxon-related threads on this list and try to summarize the alternate perspectives, then format something on the TDWG TNC wiki.
For now, though, I think email technology is still the best way to engage the broadest set of perspectives.
Rich
participants (17)
- Arlin Stoltzfus
- Blum, Stan
- Bob Morris
- Cam Webb
- Hilmar Lapp
- Jim Croft
- joel sachs
- John Wieczorek
- Julian H
- Kevin Richards
- Nico Franz
- Paul Murray
- Peter DeVries
- Richard Pyle
- Roger Hyam
- Steve Baskauf
- Yuri Roskov