What is an Occurrence? [followup to "Wrong" RDF and What I learned... threads]
After the flurry of emails recently, I had an opportunity to carefully read all the way through the threads again, followed by enforced "think time" during my long commute. I was actually pretty cheerful after that because I think that in essence, most of the conversation about what constitutes an Occurrence really boils down to the same thing. So I have sat down and tried to summarize what seems to me to be a consensus about Occurrences. To follow my points, please refer to the diagram at: http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif
Consensus on relationships
1. The fundamental definition of an Occurrence involves evidence that a representative of a taxon occurred at a place and time.
Note 1.A: For clarity, I have modified John's statement in his last email by replacing "taxon" with "representative of a taxon". I'm considering a taxon to be an abstract concept that is applied to individuals or groups of organisms.
Note 1.B: This definition is far more useful than the official definition of the class Occurrence "The category of information pertaining to evidence of an occurrence..." which is essentially circular.
Note 1.C: This statement is extremely broad because the evidence could be of many sorts, the representative could range from a single individual to all organisms on the earth, the taxon could be anyone's definition at any taxonomic level, the place could range from a GPS point with uncertainty of less than 10 meters to the entire planet earth, and the time could range from a shutter click of less than one second to 3.4 billion years.
2. The diagram is an attempt to summarize in pictorial form statements and relationships that have been described in the thread. The taxon representative is recorded as existing at a particular time and place (the arrow) and the result is an Occurrence record. That Occurrence record exists as metadata which may be associated with a token that can be used to voucher the fact that the taxon representative existed. That token may be the organism itself (or a living part of it as in a twig for grafting), all or part of the organism in preserved form, an electronic representation such as an image or sound recording, and other kinds of things like tissue or DNA samples. There may also be no token at all, in which case we call the Occurrence record an observation. Based on direct observation of the taxon representative, examination of one or more tokens, or both, some determiner asserts that a taxon concept applies to the taxon representative and as a result a scientific name can be used to "identify" the taxon representative. (There may be a lot of other complicated stuff above the Identification box, but that will have to be filled in by the taxonomists.)
Note 2.A: I have mapped onto this diagram the letters that John used in his last email to refer to entities that are involved in an Occurrence (T, E, L, O, and G). I will beg the forgiveness of fossil people because I don't really know how the geological context fits in. I'm assuming that it is a way of asserting time and location on a much broader scale than we do for extant organisms.
Note 2.B: I have put a dotted line around the part of the diagram that I think includes all the things that people might consider part of the Occurrence itself. I have left out "T" and the other parts related to identification because it seems to me that you can have an occurrence that you document which does not yet (and perhaps never will) have an identification. The Occurrence still asserts that a taxon representative existed at a time and place; we just don't yet know what the taxon is.
3. The red lines indicate the relationships that connect the various entities (I'm going to go ahead and call them resources). Consistent with popular opinion, the Occurrence record is the center of the universe and most things are connected to it.
Note 3.A: I am sticking to my guns and refuse to connect the Identification directly to the Occurrence. It is the taxon representative that is being identified, not the occurrence. One can assert another sort of relationship between the identification and the occurrence if one wants to say that one consulted the occurrence metadata and token in order to decide about the identification, but it is not correct to say that the Identification identifies either the Occurrence metadata or the token (as Rich pointed out).
OK, so that's step one - defining what is related to what. If anyone disagrees with these relationships, please clarify or create your own diagram.
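For readers who find it easier to check relationships in code than in a picture, here is a minimal sketch of the connections described above, written as subject-predicate-object triples in Python. The "ex:" identifiers are invented for illustration, and the choice of which Darwin Core xxxxID term labels which line is an assumption based on the text, not something read off the diagram. The helper functions are hypothetical as well.

```python
# Hypothetical identifiers (the "ex:" prefix is invented for illustration).
# Predicates are Darwin Core xxxxID terms; which term labels which line of
# the diagram is an assumption made for this sketch.
TRIPLES = [
    ("ex:occurrence1", "dwc:individualID", "ex:individual1"),
    ("ex:occurrence1", "dwc:eventID", "ex:event1"),
    ("ex:occurrence1", "dwc:locationID", "ex:location1"),
    # Note 3.A: the Identification points at the taxon representative
    # (the Individual), not at the Occurrence.
    ("ex:identification1", "dwc:individualID", "ex:individual1"),
    ("ex:identification1", "dwc:taxonConceptID", "ex:taxonConcept1"),
]


def directly_linked(triples, a, b):
    """Return True if some triple connects resources a and b in either direction."""
    return any({subject, obj} == {a, b} for subject, _predicate, obj in triples)


def linked_resources(triples, resource):
    """Return the set of resources that `resource` points to."""
    return {obj for subject, _predicate, obj in triples if subject == resource}


if __name__ == "__main__":
    print(sorted(linked_resources(TRIPLES, "ex:occurrence1")))
    print(directly_linked(TRIPLES, "ex:identification1", "ex:occurrence1"))
```

In this sketch the Identification and the Occurrence are related only indirectly, through the Individual they both reference. Someone who prefers a "flattened" model would add a direct triple between them.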
Complicating circumstances/caveats
1. It is noted and recognized that some users will not care to include all of these relationships in their models. In the interest of simplification or "flattening" the relationships, they may wish to collapse some parts of this diagram (e.g. incorporating time and location metadata within the Occurrence metadata rather than considering them separate resources, applying scientific names directly to the taxon representatives without defining a taxon concept or recording the determination metadata, connecting identifications directly to the occurrence, etc.). This doesn't mean that the relationships don't exist; it just means that some users don't care about them.
2. It is recognized that different users will be interested in or able to specify the various resources to differing degrees of precision. Examples: A photographer might record times to the nearest second, whereas a collector may only be interested in noting the date on which a specimen was collected. A location may be specified to the precision of a GPS reading or be defined as some geographic or political subdivision. The taxon representative may be an individual organism, a flock or clump, or some larger aggregation of taxon representatives.
That's step two. If I've missed any complications, please point them out.
My opinions about the implications of this diagram
1. The circle I've labeled as "taxon representative" is the resource type that I'm proposing to be represented by the class Individual. You will note that in both the definition of dwc:individualID ("An identifier for an individual or named group of individual organisms...") and the proposed class definition ("The category of information pertaining to an individual or named group of individual organisms represented in an Occurrence"), groups of individual organisms are included. Thus John's example of a fossil having myriad individuals, or Richard's examples of thousands of plankton, a large school of fish, herd of wildebeest, flock of birds, could all be categorized as "Individual" under this definition if there is a reasonable expectation that all of the individuals in the group are members of the same taxon. Perhaps there is a better name for this resource, but since dwc:individualID was already extant, I chose Individual as the class name for consistency with the pattern established with other classes and their associated xxxxID terms.
2. Although in note 1.C I have taken the ranges of the various resources to their logical extremes (as was done previously in the thread), I think that as a practical matter we can adopt guidelines to set reasonable values for the "normal" ranges of the resources. One such guideline might be that we suggest a range that can accommodate about 95% of the user needs within the community (this came from Rich's comment about satisfying 95% of the user need with an establishmentMeans controlled vocabulary). For example, it was suggested that the range for the location of an Occurrence could span the entire planet Earth. True enough, but virtually nobody would find such a span useful. 95% of users would probably find a range between a GPS reading with 10 meter precision and the extent of a county or province useful for recording the location of an Occurrence. I can suggest similar "useful" ranges: one second to one day for an event time (excluding fossils), one individual organism to the number of organisms that would fit within a 50 meter radius for an "individual", and taxon identified to family for plants and maybe mammals, genus for birds, and order for insects. So framing the definition of an Occurrence in these terms it would be something like: "An occurrence involves evidence (consisting of a physical token, electronic record, or personal observation) that a representative (ranging from a single individual to the number that would fit on a football field) of a taxon (hopefully identified to some lower taxonomic level) occurred at a place (determined to a precision between that of a GPS reading and the size of a county/province) and time (spanning one second to one day)." A few people might object to this level of restrictiveness, but I would guess that it would make 95% of us happy.
3. Every resource type on this diagram has a Darwin Core class, except for the "missing" class Individual, the "token", and the scientific name. Every resource type on the diagram except for "token" has a dwc:xxxxID term that can be used to refer to a GUID for the resource. The implication of this is that any resource on this diagram except for the token and taxon representative (i.e. Individual) is ready to be represented in RDF by Darwin Core terms in the sense that the relationships (red lines) can be represented by the xxxxID terms and that the resources can be rdf:type'd using Darwin Core classes. (Lacking a class for the scientific name doesn't seem like a big deal to me since the scientific name can be a string literal - but then I'm not a taxonomist.)
4. OK, I've avoided it as long as I can, so I'm going to confess now to the RDF-phobes. The red lines and shapes are something pretty close to an RDF graph. What that means is that if the community can agree that this diagram correctly represents the relationships among the kinds of biodiversity resources that we care about, then the matter of providing guidelines on how to represent Darwin Core in RDF suddenly gets a lot simpler. Just convert the "picture" of the RDF graph into XML format and we have a template. Alright, that's an oversimplification, but I think it is essentially true because the most difficult part of achieving a consensus on RDF representations is to decide how we connect the resource types, not on the literals that we hang onto resources as properties.
5. While I'm beating the RDF drum again, the importance of my opinion number 2 can be extended into the GUID adoption process. In my comments to Kevin about the Beginner's Guide to Persistent Identifiers, I think I commented on the question of how one decides whether a GUID needs to be assigned to something or not. I believe that the answer to that question boils down to this: we need a GUID for any resource that will be referenced by more than one other resource. Do we need to be able to assign a GUID to Taxon concepts? Yes, because it is likely that many identifications will want to reference a particular taxon concept. Do we need to be able to assign a GUID to an Event? Maybe or maybe not. If every occurrence has its own separate time recorded, then no GUID is needed because the time is just a part of every separate occurrence record. If the event is defined to be a time range that represents a collecting trip, then there may be many Occurrences that are associated with that trip and all of them could reference the GUID for that event rather than repeating the event information for every Occurrence. The point here is that every shape (class of resources) on this diagram at least has the POTENTIAL to be a node connecting multiple resources and therefore should have the capability of being assigned a GUID, having its own RDF record, and being appropriately typed (presumably by a DwC class). So this is a final technical argument for why we need to have the DwC class Individual. Whether or not people ultimately choose to assign GUIDs to particular resource types is their own choice, but they need to at least be ABLE to if they need that resource to serve as a node given the structure of their metadata.
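The rule of thumb for GUIDs (a resource needs one if more than one other resource references it) is simple enough to sketch in a few lines of Python. This is only an illustration under assumed data: the "ex:" identifiers and the function name are hypothetical, and the triples stand in for whatever links a real dataset contains.

```python
from collections import defaultdict


def resources_needing_guid(triples):
    """Return resources referenced by more than one distinct other resource."""
    referrers = defaultdict(set)
    for subject, _predicate, obj in triples:
        referrers[obj].add(subject)
    return {resource for resource, subjects in referrers.items() if len(subjects) > 1}


# Hypothetical data: two Occurrences from one collecting trip share an Event,
# each has its own Location, and two Identifications cite one taxon concept.
EXAMPLE = [
    ("ex:occ1", "dwc:eventID", "ex:trip1"),
    ("ex:occ2", "dwc:eventID", "ex:trip1"),
    ("ex:occ1", "dwc:locationID", "ex:loc1"),
    ("ex:occ2", "dwc:locationID", "ex:loc2"),
    ("ex:ident1", "dwc:taxonConceptID", "ex:concept1"),
    ("ex:ident2", "dwc:taxonConceptID", "ex:concept1"),
]

if __name__ == "__main__":
    print(sorted(resources_needing_guid(EXAMPLE)))
```

With this data the shared Event and the shared taxon concept come out as needing GUIDs, while the two Locations, each referenced once, could simply be folded into their Occurrence records.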
We need to clarify how the "token" thing fits in, but I'm stopping there for now. I would very much appreciate responses indicating that:
A. you agree with the diagram and connections (and consider this definition and diagram a consensus)
B. you disagree with the diagram (and articulate why)
C. you provide an alternative diagram or explanation of the relationships among the classes related to Occurrences.
Thanks for your patience with another tome.
Steve
As a background to this post, I want to reference a post by Bob called "SubclassOrNot". I discovered this page on an early foray into the TDWG website labyrinth and it has been very influential on my thinking since then. The idea Bob discusses is central to what I'm writing below so if you haven't read it you might want to do so first. You can probably skip the "OWL Inference" section and still get the point which is described in the first two sections of his post. The URL for the page is http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot .
To preface what I'm going to say below, I want to put Darwin Core Occurrences in the context of what Bob wrote. In my mind, one of the hallmarks of the Darwin Core standard and one thing that makes it a great improvement over previous versions is that the decision was made to use what Bob called the "has a" approach rather than the "is a" approach. In particular, the Darwin Core standard has a single class called dwc:Occurrence rather than subclasses called "Specimen", "Observation", and other possible things. The way that we differentiate among different kinds of Occurrences is by using the DwC types which are the controlled values for the term dwc:basisOfRecord. Thus we say an Occurrence "has a" basisOfRecord=PreservedSpecimen rather than saying it "is a" PreservedSpecimen. We say an Occurrence "has a" basisOfRecord=HumanObservation rather than saying it "is a" HumanObservation. This approach has greatly reduced the number of different terms in the standard since we don't have to have separate "ObservedBy" and "CollectedBy" terms, but rather can just have a single "recordedBy" term that applies to both specimens and observations. The same thing applies to many other things, like eventDate rather than DateCollected and DateObserved, locality rather than collectionLocality and observationLocality, etc. With the ratification of Darwin Core, this decision is now a fait accompli and not a subject of discussion or something optional for users of the standard. It also seems to be clear that, as necessary, new terms can be added to the DwC types, which would then be valid controlled values for basisOfRecord.
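To make the "has a" approach concrete, here is a minimal Python sketch of a single Occurrence type that is tagged by its basisOfRecord rather than split into subclasses. The class, its field names, and the set of allowed values are assumptions for illustration: the set lists only the three values mentioned in this post, not the full DwC type vocabulary.

```python
from dataclasses import dataclass

# Only the values named in this post; the real controlled vocabulary is larger.
BASIS_OF_RECORD_VALUES = {"PreservedSpecimen", "LivingSpecimen", "HumanObservation"}


@dataclass
class Occurrence:
    """One class for every kind of Occurrence; the kind is a property value."""
    occurrence_id: str
    basis_of_record: str
    recorded_by: str  # one term instead of separate "ObservedBy"/"CollectedBy"
    event_date: str   # one term instead of DateCollected/DateObserved

    def __post_init__(self):
        if self.basis_of_record not in BASIS_OF_RECORD_VALUES:
            raise ValueError(f"unknown basisOfRecord: {self.basis_of_record}")


if __name__ == "__main__":
    specimen = Occurrence("ex:occ1", "PreservedSpecimen", "A. Collector", "2010-10-20")
    sighting = Occurrence("ex:occ2", "HumanObservation", "An Observer", "2010-10-21")
    print(specimen.recorded_by, sighting.recorded_by)
```

Adding a new kind of Occurrence in this design means adding a value to the controlled vocabulary, not writing a new class with its own parallel set of terms.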
Since the adoption of the DwC standard, the approach to Occurrences has been what I would describe as "I know an Occurrence when I see one". I consider this a pretty sloppy practice and, as I indicated in my post last night, I think there is enough consensus about what an Occurrence is that we can come up with a better definition than "an occurrence is the category of information pertaining to evidence of an occurrence...". Another part of what I would characterize as sloppiness is the lack of a clear definition of what exactly basisOfRecord means. When I wrote my attempt at summarizing consensus last night, I dodged the question about what I called the "token". This "thing" has been called various names. In the previous discussion on the list, it was sometimes called "the evidence" of the occurrence. In the past I have called it "a representation" - however, I now think the term "token" is better because "representation" has a different technical meaning in the context of content negotiation. When we type an Occurrence by saying it has a basisOfRecord=PreservedSpecimen, we are saying that this Occurrence has as supporting evidence, or as a "token" if you prefer, all or part of the dead remains of the organism (i.e. what I'm calling "the Individual") that was being documented by the Occurrence. When we type an Occurrence by saying it has a basisOfRecord=LivingSpecimen, we are saying that this Occurrence has as a "token" the entire organism that was being documented (or some vegetative part of the live organism that was propagated). When we type an Occurrence by saying it has a basisOfRecord=HumanObservation, we are saying that the Occurrence has no supporting evidence other than the reputation of the observer to accurately record the metadata about the Occurrence. In other words, we "tag" an instance of a core class (to use Bob's words), Occurrence, by telling a metadata consumer what kind of token we are using as evidence of the Occurrence.
A fundamental part of creating a clear definition of what an Occurrence is, is to define exactly what we are including in the concept of Occurrence. One possibility is to (1) say that the two boxes at the right side of the diagram at http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif are fused and that both the Occurrence metadata and its associated token are what we consider to be "the Occurrence". Another approach (2) would be to say that the actual Occurrence as an entity is only the metadata part and that the token is a separate thing. A third approach is to say (3) that everything within the blue dotted line is considered a part of the Occurrence (i.e. the metadata, the token, the event, and the locality). I don't think that, in an absolute sense, any one of these approaches is "right". The problem is that these approaches are used inconsistently, sometimes even by the same person, depending on the basisOfRecord. Differences in ways of thinking about this issue are part of why people aren't understanding the way other people are approaching the structuring of metadata. I have tried to consistently take the approach (1) that the two boxes on the right are fused, i.e. that the Occurrence metadata and the token should both be considered part of the entity that we call "an Occurrence". I think this is why Rich was confused in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001666.html when I said that it was "wrong" to assert that a scientific name is a property of an Occurrence - obviously it is silly to say that the token (photons on a film, sound patterns in a digital file) has a scientific name. Yet that is exactly what people do routinely when the token is a branch cut off a tree and glued to a piece of paper. They say that they are "identifying a specimen". What I am asking (actually demanding) is that the TDWG community get its act together and come to some consistency on this. 
If we are going to take the approach (2), then we need to take specimens off their pedestal and treat them like we do any other token that we are using as evidence that an Occurrence happened. If we are going to do what was suggested for the BioBlitz in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001603.html, i.e. to call Occurrences "observations" and then link the tokens to them by associatedMedia, ResourceRelationship, or some other means (approach 2) then do it consistently for every kind of token, including specimens, and don't single out media tokens for punishment.
I have in a sense "thrown down the gauntlet" on this issue by proposing that DigitalStillImage be added as a DwC type and as a controlled value for basisOfRecord (http://code.google.com/p/darwincore/issues/detail?id=68). I know what some people are going to say in response to this proposal. "Why do you need to have 'DigitalStillImage' as a value for basisOfRecord when you can just say that the resource's dcterms:type=StillImage?" The answer goes back to Bob's point. If we are going to go the "has a" path (which we already have in DwC for Occurrences) rather than subclassing everything, then we need to provide an appropriate value for the "tag" for any type of resource that a reasonable number of users will want to use as a token. I think it is clear from this and other Bioblitzes, my work in Bioimages, the whale tracking project, and many other examples, that there are plenty of people who are already using DigitalStillImages as tokens and we all need a controlled value to use for basisOfRecord.
The other thing that we accomplish when we type an Occurrence by its basisOfRecord is to tell a consumer what kind of metadata to expect to get about the token in addition to the generic metadata that is provided for all Occurrences. Thus for a LivingSpecimen we expect to be told what zoo, botanical garden, bacterial collection, etc. contains the specimen. For a PreservedSpecimen we expect to be told the preparation type, the location of the repository, etc. For a DigitalStillImage we expect to be told the file type, accessURL, etc. Simply providing a value for dcterms:type=StillImage doesn't indicate whether the image is a physical one (i.e. on film) or a digital one. It is also unreasonable to expect a client to have to be checking two different terms (basisOfRecord and dcterms:type) to find out what they could learn from one (basisOfRecord). Of course it would be advisable to provide a value for dcterms:type as well for clients outside the biodiversity community who may not "understand" what basisOfRecord means.
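As a rough illustration of how a client could use basisOfRecord alone to know what token metadata to expect, here is a small Python sketch. The field names in the table are hypothetical labels derived from the examples in the paragraph above, not Darwin Core terms, and DigitalStillImage is the value proposed in issue 68, not an accepted one.

```python
# Hypothetical expectation table. Keys are basisOfRecord values; the field
# names are illustrative labels, not actual Darwin Core term names.
EXPECTED_TOKEN_METADATA = {
    "LivingSpecimen": ("holdingInstitution",),
    "PreservedSpecimen": ("preparationType", "repositoryLocation"),
    "DigitalStillImage": ("fileType", "accessURL"),  # proposed value (issue 68)
    "HumanObservation": (),  # no token, so no token metadata is expected
}


def missing_token_metadata(record):
    """List the token metadata a consumer would expect but the record lacks."""
    expected = EXPECTED_TOKEN_METADATA[record["basisOfRecord"]]
    return [field for field in expected if not record.get(field)]


if __name__ == "__main__":
    image_record = {"basisOfRecord": "DigitalStillImage", "fileType": "image/jpeg"}
    print(missing_token_metadata(image_record))
```

The client consults one term to decide what else to look for. A dcterms:type value could still be carried alongside for consumers outside the biodiversity community.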
I hate to keep bringing my posts back to the RDF issue, but thinking about how one would write RDF forces clear thinking about how metadata should be structured. If we intend to separate tokens as entities from their associated Occurrence metadata, i.e. approach (2), then we open up a whole other can of worms. To associate the occurrence resources (i.e. the metadata) with the "different" resource (i.e. the token), we will probably have to be able to create URIs for the tokens and separate RDF metadata blocks which will have to be rdf:type'd. What are we going to use for that rdf:type - create another Darwin Core class? I simply don't think that is a complicated road that we want to travel. It would be far easier to just say that every Occurrence has a one-to-one relationship with its token (which could be "the empty set" for observations). This would not work for people who want to hang multiple tokens on a single observation event, but I think that itself is a bad idea because it makes it even harder to have "flat" occurrence datasets. Just say that every time we collect a different token (or make an observation that has no token), it is a new Occurrence record. Realistically, a single collector can't actually take a picture of a plant at the same time he or she collects it for a specimen anyway. Those really should be considered two different events because they happen at different times.
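The one-token-per-Occurrence rule can be sketched as a small Python function that turns whatever was gathered from one individual into flat records. Everything here is an assumption for illustration: the function name, the input shape, and the "tokenRef" key are hypothetical, and DigitalStillImage is the proposed value, not an accepted one.

```python
def occurrences_for(individual_id, observed_at, tokens):
    """Return one flat Occurrence record per token.

    `tokens` is a list of dicts with "basisOfRecord", "eventDate", and
    "tokenRef" keys (all hypothetical). An empty list yields a single
    HumanObservation record with no token.
    """
    if not tokens:
        return [{
            "individualID": individual_id,
            "eventDate": observed_at,
            "basisOfRecord": "HumanObservation",
            "token": None,
        }]
    return [{
        "individualID": individual_id,
        "eventDate": token["eventDate"],  # each token has its own time
        "basisOfRecord": token["basisOfRecord"],
        "token": token["tokenRef"],
    } for token in tokens]


if __name__ == "__main__":
    records = occurrences_for("ex:tree1", "2010-10-20T10:00", [
        {"basisOfRecord": "DigitalStillImage", "eventDate": "2010-10-20T10:01",
         "tokenRef": "ex:image1"},
        {"basisOfRecord": "PreservedSpecimen", "eventDate": "2010-10-20T10:05",
         "tokenRef": "ex:sheet1"},
    ])
    for record in records:
        print(record)
```

Both records point at the same individual, so the photograph and the specimen stay connected through the Individual even though each is its own flat Occurrence row.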
OK, enough said. Consider this my defense of my proposal "issue 68" to add DigitalStillImage. I would urge the powers that be to respond to the issues that I've raised here before having any kind of "vote" (or whatever is ultimately going to happen when there is an up or down decision about the proposal).
Steve
Steve Baskauf wrote:
After the flurry of emails recently, I had an opportunity to carefully read all the way through the threads again, followed by enforced "think time" during my long commute. I was actually pretty cheerful after that because I think that in essence, most of the conversation about what constitutes an Occurrence really boils down to the same thing. So I have sat down and tried to summarize what seems to me to be a consensus about Occurrences. To follow my points, please refer to the diagram at: http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif
Consensus on relationships
- The fundamental definition of an Occurrence involves evidence that a
representative of a taxon occurred at a place and time. Note 1.A: For clarity, I have modified John's statement in his last email by replacing "taxon" with "representative of a taxon". I'm considering a taxon to be an abstract concept that is applied to individuals or groups of organisms. Note 1.B. This definition is far more useful than the official definition of the class Occurrence "The category of information pertaining to evidence of an occurrence..." which is essentially circular. Note 1.C: This statement is extremely broad because the evidence could be of many sorts, the representative could range from a single individual to all organisms on the earth, the taxon could be anyone's definition at any taxonomic level, the place could range from a GPS point with uncertainty of less than 10 meters to the entire planet earth, and the time could range from a shutter click of less than one second to 3.4 billion years. 2. The diagram is an attempt to summarize in pictorial form statements and relationships that have been described in the thread. The taxon representative is recorded as existing at a particular time and place (the arrow) and the result is an Occurrence record. That Occurrence record exists as metadata which may be associated with a token that can be used to voucher the fact that the taxon representative existed. That token may be the organism itself (or a living part of it as in a twig for grafting), all or part of the organism in preserved form, an electronic representation such as an image or sound recording, and other kinds of things like tissue or DNA samples. There may also be no token at all, in which case we call the Occurrence record an observation. Based on direct observation of the taxon representative, examination of one or more tokens, or both, some determiner asserts that a taxon concept applies to the taxon representative and as a result a scientific name can be used to "identify" the taxon representative. 
(There may be a lot of other complicated stuff above the Identification box, but that will have to be filled in by the taxonomists.) Note 2.A: I have mapped onto this diagram the letters that John used in his last email to refer to entities that are involved in an Occurrence (T, E, L, O, and G). I will beg the forgiveness of fossil people because I don't really know how the geological context fits in. I'm assuming that it is a way of asserting time and location on a much broader scale than we do for extant organisms. Note 2.B: I have put a dotted line around the part of the diagram that I think includes all the things that people might consider part of the Occurrence itself. I have left out "T" and the other parts related to identification because it seems to me that you can have an occurrence that you document which does not yet (and perhaps never will) have an identification. The Occurrence still asserts that a taxon representative existed at a time and place; we just don't yet know what the taxon is. 3. The red lines indicate the relationships that connect the various entities (I'm going to go ahead and call them resources). Consistent with popular opinion, the Occurrence record is the center of the universe and most things are connected to it. Note 3.A: I am sticking to my guns and refuse to connect the Identification directly to the Occurrence. It is the taxon representative that is being identified, not the occurrence. One can assert another sort of relationship between the identification and the occurrence if one wants to say that one consulted the occurrence metadata and token in order to decide about the identification, but it is not correct to say that the Identification identifies either the Occurrence metadata or the token (as Rich pointed out).
OK, so that's step one - defining what is related to what. If anyone disagrees with these relationships, please clarify or create your own diagram.
Complicating circumstances/caveats
- It is noted and recognized that some users will not care to include
all of these relationships in their models. In the interest of simplification or "flattening" the relationships, they may wish to collapse some parts of this diagram (e.g. incorporate time and location metadata within the Occurrence metadata rather than considering them separate resources, applying scientific names directly to the taxon representatives without defining a taxon concept or recording the determination metadata, connecting identifications directly to the occurrence, etc.). This doesn't mean that the relationships don't exist, it just means that some users don't care about them. 2. It is recognized that different users will be interested in or able to specify the various resources to differing degrees of precision. Examples: A photographer might record times to the nearest second, a collector may only be interested in noting the date on which a specimen was collected. A location may be specified to the precision of a GPS reading or be defined as some geographic or political subdivision. The taxon representative may be an individual organism, a flock or clump, or some larger aggregation of taxon representatives.
That's step two. If I've missed any complications, please point them out.
My opinions about the implications of this diagram
- The circle I've labeled as "taxon representative" is the resource
type that I'm proposing to be represented by the class Individual. You will note that in both the definition of dwc:individualID ("An identifier for an individual or named group of individual organisms...") and the proposed class definition ("The category of information pertaining to an individual or named group of individual organisms represented in an Occurrence"), groups of individual organisms are included. Thus John's example of a fossil having myriad individuals, or Richard's examples of thousands of plankton, a large school of fish, herd of wildebeest, flock of birds, could all be categorized as "Individual" under this definition if there is a reasonable expectation that all of the individuals in the group are members of the same taxon. Perhaps there is a better name for this resource, but since dwc:individualID was already extant, I chose Individual as the class name for consistency with the pattern established with other classes and their associated xxxxID terms.
2. Although in note 1.C. I have given the ranges of the various resources to their logical extreme (as was done previously in the thread), I think that as a practical matter we can adopt guidelines to set reasonable values for the "normal" ranges of the resources. One such guideline might be that we suggest a range that can accommodate about 95% of the user needs within the community (this came from Rich's comment about satisfying 95% of the user need with an establishmentMeans controlled vocabulary). For example, it was suggested that the range for the location of an Occurrence could span the entire planet Earth. True enough, but virtually nobody would find such a span useful. 95% of users would probably find a range between a GPS reading with 10 meter precision and the extent of a county or province useful for recording the location of an Occurrence. I can suggest similar "useful" ranges: one second to one day for an event time (excluding fossils), one individual organism to the number of organisms that would fit within a 50 meter radius for an "individual", and taxon identified to family for plants and maybe mammals, genus for birds, and order for insects. So framing the definition of an Occurrence in these terms, it would be something like: "An occurrence involves evidence (consisting of a physical token, electronic record, or personal observation) that a representative (ranging from a single individual to the number that would fit on a football field) of a taxon (hopefully identified to some lower taxonomic level) occurred at a place (determined to a precision between that of a GPS reading and the size of a county/province) and time (spanning one second to one day)." A few people might object to this level of restrictiveness, but I would guess that it would make 95% of us happy.
3. With the exception of the "missing" class Individual, every resource type on this diagram except for the "token" and Scientific name has a Darwin Core class. Every resource type on the diagram except for "token" has a dwc:xxxxID term that can be used to refer to a GUID for the resource. The implication of this is that any resource on this diagram except for the token and taxon representative (i.e. Individual) is ready to be represented in RDF by Darwin Core terms, in the sense that the relationships (red lines) can be represented by the xxxxID terms and that the resources can be rdf:type'd using Darwin Core classes. (Lacking a class for the scientific name doesn't seem like a big deal to me since the scientific name can be a string literal - but then I'm not a taxonomist.)
4. OK, I've avoided it as long as I can, so I'm going to confess now to the RDF-phobes. The red lines and shapes are something pretty close to an RDF graph. What that means is that if the community can agree that this diagram correctly represents the relationships among the kinds of biodiversity resources that we care about, then the matter of providing guidelines on how to represent Darwin Core in RDF suddenly gets a lot simpler. Just convert the "picture" of the RDF graph into XML format and we have a template. Alright, that's an oversimplification, but I think it is essentially true because the most difficult part of achieving a consensus on RDF representations is to decide how we connect the resource types, not on the literals that we hang onto resources as properties.
5. While I'm beating the RDF drum again, the importance of my opinion number 2 can be extended into the GUID adoption process. In my comments to Kevin about the Beginner's Guide to Persistent Identifiers, I think I commented on the question of how one decides whether a GUID needs to be assigned to something or not. I believe that the answer to that question boils down to this: we need a GUID for any resource that will be referenced by more than one other resource. Do we need to be able to assign a GUID to Taxon concepts? Yes, because it is likely that many identifications will want to reference a particular taxon concept. Do we need to be able to assign a GUID to an Event? Maybe or maybe not. If every occurrence has its own separate time recorded, then no GUID is needed because the time is just a part of every separate occurrence record. If the event is defined to be a time range that represents a collecting trip, then there may be many Occurrences that are associated with that trip and all of them could reference the GUID for that event rather than repeating the event information for every Occurrence. The point here is that every shape (class of resources) on this diagram at least has the POTENTIAL to be a node connecting multiple resources and therefore should have the capability of being assigned a GUID, having its own RDF record, and being appropriately typed (presumably by a DwC class). So this is a final technical argument for why we need to have the DwC class Individual. Whether or not people ultimately choose to assign GUIDs to particular resource types is their own choice, but they need to at least be ABLE to if they need that resource to serve as a node given the structure of their metadata.
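A minimal sketch of the GUID rule of thumb in opinion 5 (a resource needs a GUID when more than one other resource references it). The identifiers and sample records are made up for illustration; only the term names eventID and individualID come from Darwin Core.

```python
from collections import Counter

# Hypothetical flat records: each Occurrence points at other resources by
# identifier. The identifier values are illustrative, not real data.
occurrences = [
    {"occurrenceID": "occ:1", "eventID": "evt:trip-A", "individualID": "ind:1"},
    {"occurrenceID": "occ:2", "eventID": "evt:trip-A", "individualID": "ind:2"},
    {"occurrenceID": "occ:3", "eventID": "evt:solo", "individualID": "ind:2"},
]


def resources_needing_guid(records, link_fields=("eventID", "individualID")):
    """Return the identifiers that more than one record references."""
    counts = Counter(
        rec[field] for rec in records for field in link_fields if field in rec
    )
    return sorted(ident for ident, n in counts.items() if n > 1)


shared = resources_needing_guid(occurrences)
print(shared)  # ['evt:trip-A', 'ind:2']
```

Here the collecting trip and the twice-documented individual act as nodes shared by several Occurrences, so they are the ones that would benefit from their own GUID; the one-off event could stay embedded in its Occurrence record.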
We need to clarify how the "token" thing fits in, but I'm stopping there for now. I would very much appreciate responses indicating that:
A. you agree with the diagram and connections (and consider this definition and diagram a consensus)
B. you disagree with the diagram (and articulate why)
C. you provide an alternative diagram or explanation of the relationships among the classes related to Occurrences.
Thanks for your patience with another tome. Steve
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content .
Oops. I just went by the name at the top of the page. Sorry Roger.
Gregor Hagedorn wrote:
As a background to this post, I want to reference a post by Bob called "SubclassOrNot".
For the record: Roger Hyam wrote this post, Bob only made a small improvement... TWiki is a bit blind on collaboration...
Hi Steve,
I would hypothesize that for the vast majority of identified records the process is something like this:
1) An individual uses some sort of key to determine what species (taxon concept) to assign to a given individual.
* They may have created some sort of mental key in which, once they recognize one individual mosquito, they can then pretty quickly sort a number of individuals into collections.
2) The actual name they assign to the specimen is usually based on what their key says the name is. Often this does not specify the authorship. Most of these human identifiers have not read the original species descriptions for the species they are identifying. So the specimen is actually tied to a concept that is based more on the "key" than the original description.
* An exception would be where there is a key in the original description and that was what was used.
3) So in a sense, the process of modeling this as if the identifier actually asserted that the concept was the same as that described by the original description or a subsequent revision is "fudging".
Side effects of this process include:
1) A new key for North American Mosquitoes comes out that incorporates recent changes in nomenclature. The major change being the elevation of a subgenus to a genus. For most of the species described the "key concept" is unchanged.
Student identifier, Bob, in state X is using the latest key, while student identifier, Joe, in state Z is using a slightly older edition of the same key.
Bob identifies the species as *Ochlerotatus triseriatus*, while Joe identifies what should be the same species as *Aedes triseriatus*.
These show up in GBIF on two different maps, they show up in the EOL as two different pages.
Various TDWG'ers continue to argue that the original description and subsequent revisions were really important in determining what these individuals actually meant when they assigned a name to a specimen, and that this is how we should model it in excruciating detail.
I would argue this should be modeled as best as possible to what actually happens.
For example, how many of the species observed in the recent BioBlitz were identified by referring to the original species description or subsequent revisions?
In your diagram, I would suggest that you show that a taxon concept may have many names associated with it. Since it is not clear what the identifier intended by his or her choice of a name, it is often difficult to determine what taxon concept they actually meant.
This is why I advocate a move to a more taxon concept based identifier to link these data sets together, because this allows the intent of the identifier to be more accurately modeled.
This would be done in the form of:
"I assert that this specimen (of what I call *Aedes triseriatus*) was observed here. I also assert that it is an instance of this species concept => URI"
Or: I assert that this is an individual of the type "Individual of species concept X" => URI
All of these are instances of the class "Individual"
So the resulting DarwinCore record would contain both the name and an optional, but I think needed, asserted species concept.
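A minimal sketch of why carrying an asserted concept alongside the name helps with the Bob and Joe case above. The concept URI is a placeholder (the thread does not give one); identifiedBy, scientificName and taxonConceptID are used here as plain dictionary keys.

```python
from collections import defaultdict

# Placeholder URI for the shared species concept; purely hypothetical.
CONCEPT = "http://example.org/concept/triseriatus"

records = [
    {"identifiedBy": "Bob", "scientificName": "Ochlerotatus triseriatus",
     "taxonConceptID": CONCEPT},
    {"identifiedBy": "Joe", "scientificName": "Aedes triseriatus",
     "taxonConceptID": CONCEPT},
]


def group_by(recs, key):
    """Group the identifiers' names by the value of one field."""
    groups = defaultdict(list)
    for rec in recs:
        groups[rec[key]].append(rec["identifiedBy"])
    return dict(groups)


by_name = group_by(records, "scientificName")
by_concept = group_by(records, "taxonConceptID")

print(len(by_name))     # 2: the records land on two maps or pages
print(len(by_concept))  # 1: the records are linked through the concept
```

Grouping on the name string splits the two identifications, while grouping on the asserted concept keeps them together, which is the linking behavior being argued for.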
The species concept is a subclass of taxon concept, but is fundamentally different than the higher clades.
There are some guidelines as to what an entity needs to be considered a species.
While there are no real guidelines as to what clades should be considered genera and what clades should be considered families, etc.
Assigning properties at the level of genus or family is also problematic because it assumes that there will be inferencing, and it will require rechecking that those properties are still valid if the species within that genus change.
So if there is some property that is common to all the species in the genus, make that a property of each of the individual species - not a property of the genus.
Respectfully,
- Pete
On Fri, Oct 15, 2010 at 10:45 AM, Steve Baskauf < steve.baskauf@vanderbilt.edu> wrote:
As a background to this post, I want to reference a post by Bob called "SubclassOrNot". I discovered this page on an early foray into the TDWG website labyrinth and it has been very influential on my thinking since then. The idea Bob discusses is central to what I'm writing below so if you haven't read it you might want to do so first. You can probably skip the "OWL Inference" section and still get the point which is described in the first two sections of his post. The URL for the page is http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot .
To preface what I'm going to say below, I want to put Darwin Core Occurrences in the context of what Bob wrote. In my mind, one of the hallmarks of the Darwin Core standard and one thing that makes it a great improvement over previous versions is that the decision was made to use what Bob called the "has a" approach rather than the "is a" approach. In particular, the Darwin Core standard has a single class called dwc:Occurrence rather than subclasses called "Specimen", "Observation", and other possible things. The way that we differentiate among different kinds of Occurrences is by using the DwC types which are the controlled values for the term dwc:basisOfRecord. Thus we say an Occurrence "has a" basisOfRecord=PreservedSpecimen rather than saying it "is a" PreservedSpecimen. We say an Occurrence "has a" basisOfRecord=HumanObservation rather than saying it "is a" HumanObservation. This approach has greatly reduced the number of different terms in the standard since we don't have to have separate "ObservedBy" and "CollectedBy" terms, but rather can just have a single "RecordedBy" term that applies to both specimens and observations. The same thing applies to many other things, like eventDate rather than DateCollected and DateObserved, locality rather than collectionLocality and observationLocality, etc. With the ratification of Darwin Core, this decision is now a fait accompli and not a subject of discussion or something optional for users of the standard. It also seems to be clear that as necessary new terms can be added to the DwC types which would then be valid controlled values for basisOfRecord.
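A minimal sketch of the "has a" approach described above, assuming a single Python class standing in for dwc:Occurrence. The class layout, the sample people and the sample values are illustrative; only the two basisOfRecord values named so far in this post are listed, and the real DwC type vocabulary is larger.

```python
from dataclasses import dataclass

# Controlled values mentioned in the text; new values could be added later.
BASIS_OF_RECORD = {"PreservedSpecimen", "HumanObservation"}


@dataclass
class Occurrence:
    """One class for every kind of Occurrence; the kind is a tag, not a subclass."""
    recordedBy: str      # replaces separate ObservedBy / CollectedBy terms
    eventDate: str       # replaces DateCollected / DateObserved
    locality: str        # replaces collectionLocality / observationLocality
    basisOfRecord: str   # the "has a" tag

    def __post_init__(self):
        if self.basisOfRecord not in BASIS_OF_RECORD:
            raise ValueError(f"unknown basisOfRecord: {self.basisOfRecord}")


specimen = Occurrence("A. Collector", "2010-10-15", "Nashville, TN", "PreservedSpecimen")
sighting = Occurrence("B. Watcher", "2010-10-16", "Nashville, TN", "HumanObservation")

# Both kinds share one class and one set of terms.
print(type(specimen) is type(sighting))  # True
```

Adding a new kind of Occurrence under this design means adding one value to the controlled set rather than a new class with its own parallel terms.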
Since the adoption of the DwC standard, the approach to Occurrences has been what I would describe as "I know an Occurrence when I see one". I consider this a pretty sloppy practice and, as I indicated in my post last night, I think there is enough consensus about what an Occurrence is that we can come up with a better definition than "an occurrence is the category of information pertaining to evidence of an occurrence...". Another part of what I would characterize as sloppiness is the lack of a clear definition of what exactly basisOfRecord means. When I wrote my attempt at summarizing consensus last night, I dodged the question about what I called the "token". This "thing" has been called various names. In the previous discussion on the list, it was sometimes called "the evidence" of the occurrence. In the past I have called it "a representation" - however, I now think the term "token" is better because "representation" has a different technical meaning in the context of content negotiation. When we type an Occurrence by saying it has a basisOfRecord=PreservedSpecimen, we are saying that this Occurrence has as supporting evidence, or as a "token" if you prefer, all or part of the dead remains of the organism (i.e. what I'm calling "the Individual") that was being documented by the Occurrence. When we type an Occurrence by saying it has a basisOfRecord=LivingSpecimen, we are saying that this Occurrence has as a "token" the entire organism that was being documented (or some vegetative part of the live organism that was propagated). When we type an Occurrence by saying it has a basisOfRecord=HumanObservation, we are saying that the Occurrence has no supporting evidence other than the reputation of the observer to accurately record the metadata about the Occurrence. In other words, we "tag" an instance of a core class (to use Bob's words), Occurrence, by telling a metadata consumer what kind of token we are using as evidence of the Occurrence.
A fundamental part of creating a clear definition of what an Occurrence is, is to define exactly what we are including in the concept of Occurrence. One possibility is to (1) say that the two boxes at the right side of the diagram at http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif are fused and that both the Occurrence metadata and its associated token are what we consider to be "the Occurrence". Another approach (2) would be to say that the actual Occurrence as an entity is only the metadata part and that the token is a separate thing. A third approach is to say (3) that everything within the blue dotted lines is considered a part of the Occurrence (i.e. the metadata, the token, the event, and the locality). I don't think in an absolute sense, any one of these approaches is "right". The problem is that these approaches are used inconsistently, sometimes even by the same person, depending on the basisOfRecord. Differences in ways of thinking about this issue are a part of why people aren't understanding the way other people are approaching the structuring of metadata. I have tried to consistently take the approach (1) that the two boxes on the right are fused, i.e. that the Occurrence metadata and the token should both be considered part of the entity that we call "an Occurrence". I think this is why Rich was confused in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001666.html when I said that it was "wrong" to assert that a scientific name is a property of an Occurrence - obviously it is silly to say that the token (photons on a film, sound patterns in a digital file) has a scientific name. Yet that is exactly what people do routinely when the token is a branch cut off a tree and glued to a piece of paper. They say that they are "identifying a specimen". What I am asking (actually demanding) is that the TDWG community get its act together and come to some consistency on this.
If we are going to take the approach (2), then we need to take specimens off their pedestal and treat them like we do any other token that we are using as evidence that an Occurrence happened. If we are going to do what was suggested for the BioBlitz in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001603.html, i.e. to call Occurrences "observations" and then link the tokens to them by associatedMedia, ResourceRelationship, or some other means (approach 2) then do it consistently for every kind of token, including specimens, and don't single out media tokens for punishment. I have in a sense "thrown down the gauntlet" on this issue by proposing that DigitalStillImage be added as a DwC type and as a controlled value for basisOfRecord (http://code.google.com/p/darwincore/issues/detail?id=68). I know what some people are going to say in response to this proposal. "Why do you need to have 'DigitalStillImage' as a value for basisOfRecord when you can just say that the resource's dcterms:type=StillImage?" The answer goes back to Bob's point. If we are going to go the "has a" path (which we already have in DwC for Occurrences) rather than subclassing everything, then we need to provide an appropriate value for the "tag" for any type of resource that a reasonable number of users will want to use as a token. I think it is clear from this and other Bioblitzes, my work in Bioimages, the whale tracking project, and many other examples, that there are plenty of people who are already using DigitalStillImages as tokens and we all need a controlled value to use for basisOfRecord. The other thing that we accomplish when we type an Occurrence by its basisOfRecord is to tell a consumer what kind of metadata to expect to get about the token in addition to the generic metadata that is provided for all Occurrences. Thus for a LivingSpecimen we expect to be told what zoo, botanical garden, bacterial collection, etc. contains the specimen. 
For a PreservedSpecimen we expect to be told the preparation type, the location of the repository, etc. For a DigitalStillImage we expect to be told the file type, accessURL, etc. Simply providing a value for dcterms:type=StillImage doesn't indicate whether the image is a physical one (i.e. on film) or a digital one. It is also unreasonable to expect a client to have to be checking two different terms (basisOfRecord and dcterms:type) to find out what they could learn from one (basisOfRecord). Of course it would be advisable to provide a value for dcterms:type as well for clients outside the biodiversity community who may not "understand" what basisOfRecord means. I hate to keep bringing my posts back to the RDF issue, but thinking about how one would write RDF forces clear thinking about how metadata should be structured. If we intend to separate tokens as entities from their associated Occurrence metadata, i.e. approach (2), then we open up a whole other can of worms. To associate the occurrence resources (i.e. the metadata) with the "different" resource (i.e. the token), we will probably have to be able to create URIs for the tokens and separate RDF metadata blocks which will have to be rdf:type'd. What are we going to use for that rdf:type - create another Darwin Core class? I simply think that is a complicated road that we don't want to travel. It would be far easier to just say that every Occurrence has a one-to-one relationship with its token (which could be "the empty set" for observations). This would not work for people who want to hang multiple tokens on a single observation event, but I think that itself is a bad idea because it makes it even harder to have "flat" occurrence datasets. Just say that every time we collect a different token (or make an observation that has no token), it is a new Occurrence record. Realistically, a single collector can't actually take a picture of a plant at the same time he or she collects it for a specimen anyway.
Those really should be considered two different events because they happen at different times.
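A minimal sketch of the one-token-per-Occurrence idea, assuming flat records in which basisOfRecord tells a consumer which token metadata to expect. The token field names (holdingInstitution, preparationType, repositoryLocation, fileType) are hypothetical stand-ins for the kinds of metadata listed above, and the sample values are invented.

```python
# Token metadata a consumer might expect for each basisOfRecord value.
# Only the kinds of metadata come from the text; the field names are made up.
EXPECTED_TOKEN_FIELDS = {
    "LivingSpecimen": {"holdingInstitution"},
    "PreservedSpecimen": {"preparationType", "repositoryLocation"},
    "DigitalStillImage": {"fileType", "accessURL"},
    "HumanObservation": set(),  # no token: the "empty set" case
}


def missing_token_fields(record):
    """Return the expected token fields absent from one flat Occurrence record."""
    expected = EXPECTED_TOKEN_FIELDS[record["basisOfRecord"]]
    return sorted(expected - set(record))


# Photographing and then collecting the same plant gives two flat records,
# one per token, each with its own event time.
photo = {
    "basisOfRecord": "DigitalStillImage",
    "eventDate": "2010-10-15T10:00:00",
    "fileType": "image/jpeg",
    "accessURL": "http://example.org/img/1.jpg",
}
sheet = {
    "basisOfRecord": "PreservedSpecimen",
    "eventDate": "2010-10-15T10:05:00",
    "preparationType": "herbarium sheet",
}

print(missing_token_fields(photo))  # []
print(missing_token_fields(sheet))  # ['repositoryLocation']
```

Because each record carries at most one token, the dataset stays flat and a client only has to read basisOfRecord to know which extra fields to look for.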
OK, enough said. Consider this my defense of my proposal "issue 68" to add DigitalStillImage. I would urge the powers that be to respond to the issues that I've raised here before having any kind of "vote" (or whatever is ultimately going to happen when there is an up or down decision about the proposal).
Steve
Various TDWG'ers continue to argue that the original description and subsequent revisions were really important in determining what these individuals actually meant when they assigned a name to a specimen, and that this is how we should model it in excruciating detail.
Most of the "TDWG'ers" that I know are FULLY aware that many "modern" taxon concepts are not congruent to the concepts as originally circumscribed when a Code-compliant name was first established. Obviously, the more recent the original description, the more congruent the original taxon concept will be to a "modern" concept.
The reason why it's important to be cognizant of original descriptions of names is to ensure that when one applies a taxon name to a modern concept, the modern concept includes within its circumscription the type specimen for the name that is used. The original description is relevant primarily for nomenclatural purposes, and to ensure that a modern taxon concept does not exclude the type specimen for the name being applied to the modern concept.
Subsequent revisions *are* important to modern concepts, because those are the places where real taxon concept definitions (e.g., the sort that are used when people construct keys) are documented.
For example, how many of the species observed in the recent BioBlitz were identified by referring to the original species description or subsequent revisions?
Probably none. More likely they were identified using field guides, and the field guides more than likely base their concept boundaries (=implied synonymies) on a (relatively) recent taxonomic work.
Aloha, Rich
On Sun, Oct 17, 2010 at 9:40 AM, Richard Pyle <deepreef@bishopmuseum.org> wrote:
Various TDWG'ers continue to argue that the original description and subsequent revisions were really important in determining what these individuals actually meant when they assigned a name to a specimen, and that this is how we should model it in excruciating detail.
Most of the "TDWG'ers" that I know are FULLY aware that many "modern" taxon concepts are not congruent to the concepts as originally circumscribed when a Code-compliant name was first established. Obviously, the more recent the original description, the more congruent the original taxon concept will be to a "modern" concept.
This was a bit of a "straw man", but I think we would both agree that annotating the identification to the "concept" "as described by the key" would more accurately represent the assertion that was made. It is as if there is more pressure to make documenting the identification process "code compliant" than to make it accurately reflect what happened.
In my experience with my bugs and some of the mammals, the original descriptions and subsequent revisions are not as informative as some in the community portray them. They often do not serve as good guides as to what specimens are instances of that concept and what specimens are not.
Also, maybe someone can tell me where the type specimen for *Culex triseriatus* Say, 1823 is? (*Aedes triseriatus*/*Ochlerotatus triseriatus*)
Perhaps the species descriptions need to be done in a way that they serve both as a description and as a "key element": descriptions that are more informative as to what specimens are instances of that species concept and what specimens are not.
Also, that perhaps the Code should be revised to fit the biology, rather than trying to get the biology and related databases to fit the Code.
Respectfully,
- Pete
This was a bit of a "straw man", but I think we would both agree that annotating the identification to the "concept" "as described by the key" would more accurately represent the assertion that was made. It is as if there is more pressure to make documenting the identification process "code compliant" than to make it accurately reflect what happened.
Yes, exactly! The *SINGLE* most important thing we can do to reduce the taxonomic ambiguity in our databases is to get people in the habit of recording what field guide/monograph/key/whatever was used in making the determination of the specimen's taxonomic identity. Even if an expert pulled the identification out of his/her head, s/he should document the best published representation of the taxon concept that matches what the identifier (person, not GUID) had in mind when making the determination.
Jim Croft once told me that he tried to get his users to do this many years ago, but he simply couldn't persuade them to do this. (I think it was Jim who told me this.)
There's such a huge difference in informatic value between "This specimen is Aus bus", vs. "This specimen falls within the species concept of Aus bus as circumscribed by Jones, 1950". The latter sounds like a lot of extra work, but in fact, all you need is one field labelled "sec", or "in the sense of", with a drop-down list of publications that treated "Aus bus". For most field surveys & collections, you can probably find a single default reference that would apply in 90% of the cases, and then tag only the remaining 10% with a different reference, as needed.
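As a rough illustration of the single "sec" field with a survey-wide default, here is a minimal Python sketch. The class name, field names, specimen identifiers, and the reference "Smith, 1975" are hypothetical and chosen only for the example; "Jones, 1950" is the placeholder reference from the paragraph above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical default reference applied to most records in a survey.
DEFAULT_SEC = "Jones, 1950"


@dataclass
class Determination:
    specimen_id: str            # hypothetical local specimen identifier
    name: str                   # the name applied, e.g. "Aus bus"
    sec: Optional[str] = None   # "in the sense of"; None means "use the default"

    def concept_label(self, default_sec: str = DEFAULT_SEC) -> str:
        """Return the name qualified by the reference that circumscribes it."""
        return f"{self.name} sec. {self.sec or default_sec}"


records = [
    Determination("spec-001", "Aus bus"),                     # falls back to the default
    Determination("spec-002", "Aus bus", sec="Smith, 1975"),  # tagged exception
]
labels = [r.concept_label() for r in records]
```

Only the exceptions need to be tagged by hand; every other record picks up the default reference.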
The biggest problem I have with dwc:identificationReferences (http://rs.tdwg.org/dwc/terms/index.htm#identificationReferences), is that it allows many. What to do, then, if two listed references present two different concept circumscriptions (e.g., one sensu lato, and one sensu stricto)? Always fall to the strictest sense?
Personally, I think a "best practices" approach to this term would be "list only the best Reference, unless it's absolutely necessary to indicate more than one reference, from which a composite concept can be established".
In my experience with my bugs and some of the mammals, the original descriptions and subsequent revisions are not as informative as some in the community portray them. They often do not serve as good guides as to what specimens are instances of that concept and what specimens are not.
....so what, then, are the guides following? Or are they presenting original taxonomy within the guide itself? If you can anchor the identification to the field guide, that's 90% of the battle right there. Later we can map the field guide to a monograph, or some other source for the full concept definition.
Also, that perhaps the Code should be revised to fit the biology, rather than trying to get the biology and related databases to fit the Code.
I don't follow. Can you give me an example of what you mean?
Are you saying that the Code(s) should make rules for defining taxon concepts, rather than just rules for establishing names? I hope not! But if so, then you might want to check out the Phylocode, which basically does exactly that (to the extent that a clade is also a form of defining a taxon concept).
Aloha, Rich
Dear All,
Again I support Rich's analysis:
" Yes, exactly! The *SINGLE* most important thing we can do to reduce the taxonomic ambiguity in our databases is to get people in the habit of recording what field guide/monograph/key/whatever was used in making the determination of the specimen's taxonomic identity. Even if an expert pulled the identification out of his/her head, s/he should document the best published representation of the taxon concept that matches what the identifier (person, not GUID) had in mind when making the determination."
I would go even further.
As taxa are only hypotheses, their use must be clarified in Material and Methods in any publication (I mean in biology, ecology and others in general): a number of citations must be made:
- the one that establishes the circumscription used in the publication.
- the one with identification key or diagnostic character used to identify specimens or individuals in the work.
- the one that gives the phylogeny.
- the one that gives the classification (as far as we recognized that the classification is a simplification of the phylogeny, but it is not the point of discussion here).
- and any other citations on topics, e.g., distribution, used in the paper.
Hopefully, a taxonomic revision encompasses all of them, but when the last revision is old, then additional citations must be made.
In chemical papers, for instance, there is always a description of methods, sometimes even including the brand of the chemicals ... so why not make concepts equally precise in biological publications?
If these citations were done properly, there would be no problem using impact factors for taxonomy as well, and databases today would be terrific. Doing the work retrospectively looks overwhelming, but CLEMAM on European marine Mollusks is an example showing that it is doable, although it does include few publications outside taxonomy.
I feel that the taxonomic community should have been more aggressive to journal editors and other colleagues to move that way, a role for systematics societies all around the world I suggest.
BW Nicolas.
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Richard Pyle Sent: Monday 18 October 2010 04:18 To: 'Peter DeVries' Cc: tdwg-content@lists.tdwg.org; tdwg-bioblitz@googlegroups.com Subject: Re: [tdwg-content] What is an Occurrence? [what about the "token"]
[snip]
Hi Nicolas -
I like your suggestion! I've always tried to document whose concepts I am using in a paper; but I think going forward I will be more explicit about it in the M&M, as you suggest.
Aloha, Rich
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Bailly, Nicolas (WorldFish) Sent: Sunday, October 17, 2010 4:23 PM To: tdwg-content@lists.tdwg.org; tdwg-bioblitz@googlegroups.com Subject: [tdwg-content] Dlarification of taxon concept uses in biologicalpapers (from Rich's statement)
[snip]
Dear all:
Interesting discussion. I meant to inject that, based on some version of a "causal theory of reference", it is not relevant at all that a contemporary identifier (a person) has actually read or is aware of an original description, for him or her to still be causally linked to that initial event of baptism through a chain of verbal and written communication, and thus in essence have knowledge of what species X's name according to authors Y and Z "means". Apologies for the convoluted phrase.
http://en.wikipedia.org/wiki/Causal_theory_of_reference
For downstream semantic resolution, imperfect but presumably congruent passing on of an original concept is relevant, as is the flagging of multiple causal chains of reference originating from two or more non-congruent concepts. I think that this might be the next state for semantic annotation in taxonomy (and yes, only real believers will do it at first), beyond the kinds of things already promoted and made reality mainly in ZooKeys; i.e. new revisions that come out and make clear distinctions between identifications, names, concepts, and their relationships to previous works.
Respectfully (I like that!),
Nico
Nico M. Franz, Ph.D. Assistant Professor Director, UPRM Invertebrate Collection Department of Biology University of Puerto Rico Call Box 9000 Mayagüez, PR 00681-9000
Phone: (787) 832-4040, ext. 3005 Fax: (787) 834-3673 E-mail: nico.franz@upr.edu Laboratory website: http://academic.uprm.edu/~franz/ UPRM-INVCOL: http://uprm-invcol-project.tumblr.com/
On 10/17/2010 10:40 AM, Richard Pyle wrote:
[snip]
...
After asking, very nicely, for more than twenty years, that botanists do their best to anchor determinations to published taxonomic fact, and providing, at much expense, applications to make this easy to do through lookups in the Australian Plant Name Index and the Australian Plant Census, I have come to the conclusion that it is probably not achievable using this route.
In some ways I no longer think that this even matters very much. If we can achieve robust standards for semantic interoperability between *all* of the parts of our domain, it should eventually be possible to infer such causal relationships through the evidence of taxa, names, annotations, specimens, locality, agents and the like - in much the same ways that people do. The work of the APNI and APC teams in documenting concepts and the linked data experiments of Rod Page, Pete DeVries and others provide hope that somewhere someone will be battling against all odds to make this possible. One certain thing is that we still have to deal with the hundreds of millions of specimens out there that are simply tagged with names.
The route to acceptance of the need for semantic integration of these data may come, as Nico suggests, through the publication of born interoperable content. Names, taxa, individuals (or their parts) and their interrelationships published in such a way that reuse is trivial - using simple tools to hide perceived complexity within familiar semantic frameworks and well known forms. Value adding by reference as simple as drag-and-drop. Point and click for detail. A complete taxonomic object no bigger than a URI looking for all the world like a simple taxon name.
greg.
On Mon, 2010-10-18 at 10:10, Nico Franz wrote:
[snip]
I've fallen behind on systematically perusing the list responses, but I would like to focus in on a point that seems to be a consensus in the responses that have shown up recently. The consensus seems to be that documenting determinations (a.k.a. instances of the dwc:Identification class) that are applied to Individuals (or Occurrences if you don't believe in Individuals) is the way to go. So in my usual graphical way of thinking about this, I would draw a "relationship line" from the determination to the Individual (or Occurrence) on one side and from the determination to the species concept on the other. I will leave it up to the taxonomy people to decide which things would be connected to the species concept and how all of their lines would be connected. The determination would have any of the properties that are terms listed in the dwc:Identification class (identifiedBy, dateIdentified, identificationReferences, identificationRemarks, identificationQualifier, and typeStatus). Some properties like dateIdentified and identificationReferences would be string literals and others (especially identifiedBy) should probably be GUIDs but could be literals if they had to be.
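To make the two "relationship lines" concrete, here is a minimal Python sketch that treats a determination as a node and emits subject-predicate-object triples. It is only an illustration under stated assumptions: the linking predicates ex:identifies and ex:toConcept and all example.org identifiers are hypothetical, since Darwin Core does not define them here; only the dwc: property names come from the Identification class terms listed above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Determination:
    guid: str                       # identifier of the determination itself
    individual: str                 # GUID of the Individual (or Occurrence)
    concept: str                    # GUID of the species concept
    identified_by: str              # ideally a GUID, a literal if it has to be
    date_identified: Optional[str] = None            # string literal
    identification_references: Optional[str] = None  # string literal

    def triples(self) -> List[Triple]:
        out: List[Triple] = [
            (self.guid, "rdf:type", "dwc:Identification"),
            # Hypothetical linking predicates (not Darwin Core terms):
            (self.guid, "ex:identifies", self.individual),
            (self.guid, "ex:toConcept", self.concept),
            (self.guid, "dwc:identifiedBy", self.identified_by),
        ]
        if self.date_identified is not None:
            out.append((self.guid, "dwc:dateIdentified", self.date_identified))
        if self.identification_references is not None:
            out.append((self.guid, "dwc:identificationReferences",
                        self.identification_references))
        return out


det = Determination(
    guid="http://example.org/det/1",
    individual="http://example.org/ind/42",
    concept="http://example.org/concept/7",
    identified_by="http://example.org/agent/3",
    date_identified="2010-10-18",
)
det_triples = det.triples()
```

Because the determination has its own identifier, several determinations can point at the same Individual without overwriting one another.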
That all seems pretty clear. However, when I've started trying to do this in real life, I immediately have questions. Take a look at http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf which should show up as a web page in your browser.
1. The original label identifies the species as Juncus diffusissimus. However, there is no indicator as to who originally identified it or when. My assumption is that it was the collector (Glen N. Montz) but I don't really know that. Do I assume that, or list the original determiner as "unknown"?
2. Do we draw a distinction between the initial identification and subsequent annotations? I think the answer should be "no" and that's why I refer to both generically as "determinations".
3. There is really no indication given on the annotation labels as to many of the things that we would like to know, such as the concept they had in mind, any source they used (if any), or the reason why they did the annotation. So how does one connect the name that they applied to the determination when there is no indication of the concept? Is this just something we can't do for old annotations and just something that we try to do from this point forward?
4. The last question is one that I really want some opinions about. It seems to me that there are a number of reasons why one would apply a determination. One would be to correct an actual error in identification. One would be to increase the precision of a previous determination (e.g. an insect identified to family now is identified to species). One would be to assert a difference in opinion as to the correct way to group this individual with others (i.e. as in a taxonomic revision). Finally, a single determiner might apply several determinations to one individual and indicate in each determination the concept intended (i.e. if you subscribe to Cronquist, you'd call it X; if you like Radford's book, you'd call it Y; if you like Weakley's treatment, you'd call it Z). Some of these four reasons may be functionally equivalent, but how would you use Darwin Core to indicate the reason why you applied the determination? Please don't say "identificationRemarks"!
From a machine-processing standpoint, this is something we should know and there should be some kind of controlled vocabulary to express it. For instance, if an identification is "deprecated" because it was in error (perhaps by the determiner him/herself), one would like the incorrect determination to show up in the historical metadata, but I wouldn't want it to be listed in a website index. The same would hold true if an annotator was able to pin the taxon down to a lower taxonomic level than the original identifier. If someone goes to the trouble to connect an Individual/Occurrence to several names under alternative concepts, there should be a way that a machine would know this so that a software user could select the concept they wanted to use and the name under that concept would pop up.
I don't really see any term under the current DwC that could be used to do this last thing. Am I missing something? Do we need several terms to explain the reason why we made the determination because the reasons fall into different categories?
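To make the question concrete, here is a minimal sketch of what a machine-readable "reason" could allow. Everything in it is an assumption for illustration: the DeterminationReason values, the superseded flag, and the field names are hypothetical and are not Darwin Core terms, and the names F, X, Y, Z are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DeterminationReason(Enum):
    # Hypothetical controlled values; none of these exist in Darwin Core.
    ERROR_CORRECTION = "errorCorrection"
    INCREASED_PRECISION = "increasedPrecision"
    TAXONOMIC_OPINION = "taxonomicOpinion"
    ALTERNATIVE_CONCEPT = "alternativeConcept"


@dataclass
class Determination:
    scientific_name: str
    reason: Optional[DeterminationReason] = None  # None for old labels that state no reason
    name_according_to: Optional[str] = None       # concept source, if the determiner gave one
    superseded: bool = False                      # True once a later determination replaces it


def names_for_index(determinations):
    """Names a website index would list: every determination not superseded."""
    return [d.scientific_name for d in determinations if not d.superseded]


def name_under_concept(determinations, source):
    """Name applied under the concept source the user selected, or None."""
    for d in determinations:
        if not d.superseded and d.name_according_to == source:
            return d.scientific_name
    return None


# Placeholder data: a family-level name later pinned down to species,
# then three names under alternative concepts.
dets = [
    Determination("F (family level)", superseded=True),
    Determination("X", DeterminationReason.ALTERNATIVE_CONCEPT, "Cronquist"),
    Determination("Y", DeterminationReason.ALTERNATIVE_CONCEPT, "Radford"),
    Determination("Z", DeterminationReason.ALTERNATIVE_CONCEPT, "Weakley"),
]

index_names = names_for_index(dets)
radford_name = name_under_concept(dets, "Radford")
```

The superseded determination stays in the list as history but is kept out of the index, and a user who prefers Radford's treatment gets the name applied under that concept.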
The other comment that I'll throw out (since this is going out to the bioblitz list as well as to tdwg-content) is that those of you who are building apps to collect metadata in the field really need to separate the process of entering (or acquiring) the collection metadata from the determination process. In at least some apps, the user immediately has to commit to a taxon as they enter the data at the time of collection. It seems to me that it would be a very common situation (especially in the case of "citizen science") that the collector/observer/photographer would have no idea what the taxonomic identity was at the time of collection. The process of determination (and the recording of the various dwc:Identification class terms) is really a separate process that should be able to happen at the time of collection OR later.
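As a rough sketch of that separation, assuming a made-up record layout (the class and field names below are hypothetical and only loosely echo DwC terms, and the people and places are invented), an app could store the collection metadata first and let identifications be attached at any later time:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Identification:
    scientific_name: str
    identified_by: str
    date_identified: str


@dataclass
class OccurrenceRecord:
    # Collection metadata is complete on its own; no taxon is required.
    recorded_by: str
    event_date: str
    locality: str
    identifications: List[Identification] = field(default_factory=list)

    def current_name(self) -> Optional[str]:
        """Most recently added name, or None if nobody has identified it yet."""
        if not self.identifications:
            return None
        return self.identifications[-1].scientific_name


# Invented example: recorded in the field with no idea what it is.
record = OccurrenceRecord("A. Observer", "2010-10-15", "hypothetical park")
name_at_collection = record.current_name()

# Days later, someone else supplies a determination.
record.identifications.append(
    Identification("Juncus diffusissimus", "B. Botanist", "2010-10-18")
)
name_after_determination = record.current_name()
```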
Steve
Peter DeVries wrote:
Hi Steve,
I would hypothesize that for the vast majority of identified records the process is something like this:
- An individual uses some sort of key to determine what species
(taxon concept) to assign to a given individual
- They may have created some sort of mental key in which once they
recognize one individual mosquito they can then pretty quickly sort a number of individuals into collections.
- The actual name they assign to the specimen is usually based on
what their key says the name is. Often this does not specify the authorship. Most of these human identifiers have not read the original species descriptions for the species they are identifying. So the specimen is actually tied to a concept that is based more on the "key" than the original description. * An exception would be where there is a key in the original description and that was what was used.
- So in a sense, the process of modeling this as if the
identifier actually asserted that the concept was the same as that described by the original description or a subsequent revision is "fudging".
Side effects of this process include:
- A new key for North American Mosquitoes comes out that incorporates
recent changes in nomenclature. The major change being the elevation of a subgenus to a genus. For most of the species described the "key concept" is unchanged.
Student identifier, Bob, in state X is using the latest key, while student identifier, Joe, in state Z is using a slightly older edition of the same key.
Bob identifies the species as /Ochlerotatus triseriatus/, while Joe identifies what should be the same species as /Aedes triseriatus/.
These show up in GBIF on two different maps, they show up in the EOL as two different pages.
Various TDWG'ers continue to argue that the original description and subsequent revisions were really important in determining what these individuals actually meant when they assigned a name to a specimen, and that this is how we should model it in excruciating detail.
I would argue this should be modeled as best as possible to what actually happens.
For example, how many of the species observed in the recent BioBlitz were identified by referring to the original species description or subsequent revisions?
In your diagram, I would suggest that you show that a taxon concept may have many names associated with it. Since it is not clear what the identifier intended by his or her choice of a name, it is often difficult to determine what taxon concept they actually meant.
This is why I advocate a move to a more taxon concept based identifier to link these data sets together, because this allows the intent of the identifier to be more accurately modeled.
This would be done in the form of:
"I assert that this specimen (of what I call /Aedes triseriatus/) was observed here. I also assert that it is an instance of this species concept => URI"
Or I assert that this is an individual of the type "Individual of species concept X" => URI
All of these are instances of the class "Individual"
So the resulting DarwinCore record would contain both the name and an optional, but I think needed, asserted species concept.
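A minimal sketch of why the extra assertion helps, using the Bob and Joe example from above. The speciesConcept key and the example.org URI are placeholders made up for illustration; they are not Darwin Core terms or real identifiers.

```python
from collections import defaultdict

# Hypothetical concept URI; a real one would come from a concept registry.
CONCEPT_URI = "http://example.org/concept/triseriatus"

records = [
    {"scientificName": "Ochlerotatus triseriatus", "identifiedBy": "Bob",
     "speciesConcept": CONCEPT_URI},
    {"scientificName": "Aedes triseriatus", "identifiedBy": "Joe",
     "speciesConcept": CONCEPT_URI},
]


def group_by(rows, key):
    """Group record dicts by the value stored under the given key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


by_name = group_by(records, "scientificName")     # two groups: two maps, two pages
by_concept = group_by(records, "speciesConcept")  # one group: the same species
```

Grouping on the name string splits the two records, while grouping on the asserted concept keeps them together, regardless of which edition of the key each identifier used.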
The species concept is a subclass of taxon concept, but is fundamentally different than the higher clades.
There are some guidelines as to what an entity needs to be considered a species, while there are no real guidelines as to what clades should be considered genera and what clades should be considered families, etc.
Assigning properties at the level of genus or family is also problematic because it assumes that there will be inferencing, and it will require rechecking that those properties are still valid if the species within that genus change.
So if there is some property that is common to all the species in the genus, make that a property of each of the individual species - not a property of the genus.
Respectfully,
- Pete
On Fri, Oct 15, 2010 at 10:45 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:
As a background to this post, I want to reference a post by Bob called "SubclassOrNot". I discovered this page on an early foray into the TDWG website labyrinth and it has been very influential on my thinking since then. The idea Bob discusses is central to what I'm writing below, so if you haven't read it you might want to do so first. You can probably skip the "OWL Inference" section and still get the point, which is described in the first two sections of his post. The URL for the page is http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot .
To preface what I'm going to say below, I want to put Darwin Core Occurrences in the context of what Bob wrote. In my mind, one of the hallmarks of the Darwin Core standard and one thing that makes it a great improvement over previous versions is that the decision was made to use what Bob called the "has a" approach rather than the "is a" approach. In particular, the Darwin Core standard has a single class called dwc:Occurrence rather than subclasses called "Specimen", "Observation", and other possible things. The way that we differentiate among different kinds of Occurrences is by using the DwC types, which are the controlled values for the term dwc:basisOfRecord. Thus we say an Occurrence "has a" basisOfRecord=PreservedSpecimen rather than saying it "is a" PreservedSpecimen. We say an Occurrence "has a" basisOfRecord=HumanObservation rather than saying it "is a" HumanObservation. This approach has greatly reduced the number of different terms in the standard since we don't have to have separate "ObservedBy" and "CollectedBy" terms, but rather can just have a single "RecordedBy" term that applies to both specimens and observations. The same thing applies to many other things, like eventDate rather than DateCollected and DateObserved, locality rather than collectionLocality and observationLocality, etc.
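The "has a" approach can be sketched in a few lines. This is only an illustration under stated assumptions: the Python class and field names are made up, the vocabulary below lists only the three basisOfRecord values mentioned in this post rather than the full set of DwC types, and the collector, dates, and localities are invented.

```python
from dataclasses import dataclass

# Only the values named in this post; the real DwC type vocabulary has others.
BASIS_OF_RECORD = {"PreservedSpecimen", "LivingSpecimen", "HumanObservation"}


@dataclass
class Occurrence:
    # One class for every kind of Occurrence: the kind is a tagged value,
    # so recorded_by, event_date, and locality serve specimens and
    # observations alike.
    basis_of_record: str
    recorded_by: str
    event_date: str
    locality: str

    def __post_init__(self):
        if self.basis_of_record not in BASIS_OF_RECORD:
            raise ValueError(f"unknown basisOfRecord: {self.basis_of_record}")


specimen = Occurrence("PreservedSpecimen", "A. Collector", "2010-10-15", "hypothetical bog")
sighting = Occurrence("HumanObservation", "A. Collector", "2010-10-16", "hypothetical bog")
same_fields = vars(specimen).keys() == vars(sighting).keys()
```

Under the "is a" alternative, Specimen and Observation would be separate subclasses, each tending to grow its own CollectedBy or ObservedBy field. Here a new kind of Occurrence needs only a new controlled value.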
With the ratification of Darwin Core, this decision is now a fait accompli and not a subject of discussion or something optional for users of the standard. It also seems to be clear that new terms can be added to the DwC types as necessary, and these would then be valid controlled values for basisOfRecord.
Since the adoption of the DwC standard, the approach to Occurrences has been what I would describe as "I know an Occurrence when I see one". I consider this a pretty sloppy practice and, as I indicated in my post last night, I think there is enough consensus about what an Occurrence is that we can come up with a better definition than "an occurrence is the category of information pertaining to evidence of an occurrence...". Another part of what I would characterize as sloppiness is the lack of a clear definition of what exactly basisOfRecord means.
When I wrote my attempt at summarizing consensus last night, I dodged the question about what I called the "token". This "thing" has been called various names. In the previous discussion on the list, it was sometimes called "the evidence" of the occurrence. In the past I have called it "a representation" - however, I now think the term "token" is better because "representation" has a different technical meaning in the context of content negotiation.
When we type an Occurrence by saying it has a basisOfRecord=PreservedSpecimen, we are saying that this Occurrence has as supporting evidence, or as a "token" if you prefer, all or part of the dead remains of the organism (i.e. what I'm calling "the Individual") that was being documented by the Occurrence. When we type an Occurrence by saying it has a basisOfRecord=LivingSpecimen, we are saying that this Occurrence has as a "token" the entire organism that was being documented (or some vegetative part of the live organism that was propagated).
When we type an Occurrence by saying it has a basisOfRecord=HumanObservation, we are saying that the Occurrence has no supporting evidence other than the reputation of the observer to accurately record the metadata about the Occurrence. In other words, we "tag" an instance of a core class (to use Bob's words), Occurrence, by telling a metadata consumer what kind of token we are using as evidence of the Occurrence.
A fundamental part of creating a clear definition of what an Occurrence is, is to define exactly what we are including in the concept of Occurrence. One possibility is to (1) say that the two boxes at the right side of the diagram at http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif are fused and that both the Occurrence metadata and its associated token are what we consider to be "the Occurrence". Another approach (2) would be to say that the actual Occurrence as an entity is only the metadata part and that the token is a separate thing. A third approach is to say (3) that everything within the blue dotted lines is considered a part of the Occurrence (i.e. the metadata, the token, the event, and the locality).
I don't think in an absolute sense, any one of these approaches is "right". The problem is that these approaches are used inconsistently, sometimes even by the same person, depending on the basisOfRecord. Differences in ways of thinking about this issue are part of why people aren't understanding the way other people are approaching the structuring of metadata. I have tried to consistently take the approach (1) that the two boxes on the right are fused, i.e. that the Occurrence metadata and the token should both be considered part of the entity that we call "an Occurrence".
I think this is why Rich was confused in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001666.html when I said that it was "wrong" to assert that a scientific name is a property of an Occurrence - obviously it is silly to say that the token (photons on a film, sound patterns in a digital file) has a scientific name. Yet that is exactly what people do routinely when the token is a branch cut off a tree and glued to a piece of paper. They say that they are "identifying a specimen". What I am asking (actually demanding) is that the TDWG community get its act together and come to some consistency on this. If we are going to take the approach (2), then we need to take specimens off their pedestal and treat them like we do any other token that we are using as evidence that an Occurrence happened. If we are going to do what was suggested for the BioBlitz in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001603.html, i.e. to call Occurrences "observations" and then link the tokens to them by associatedMedia, ResourceRelationship, or some other means (approach 2) then do it consistently for every kind of token, including specimens, and don't single out media tokens for punishment. I have in a sense "thrown down the gauntlet" on this issue by proposing that DigitalStillImage be added as a DwC type and as a controlled value for basisOfRecord (http://code.google.com/p/darwincore/issues/detail?id=68). I know what some people are going to say in response to this proposal. "Why do you need to have 'DigitalStillImage' as a value for basisOfRecord when you can just say that the resource's dcterms:type=StillImage?" The answer goes back to Bob's point. If we are going to go the "has a" path (which we already have in DwC for Occurrences) rather than subclassing everything, then we need to provide an appropriate value for the "tag" for any type of resource that a reasonable number of users will want to use as a token. 
I think it is clear from this and other Bioblitzes, my work in Bioimages, the whale tracking project, and many other examples, that there are plenty of people who are already using DigitalStillImages as tokens and we all need a controlled value to use for basisOfRecord.
The other thing that we accomplish when we type an Occurrence by its basisOfRecord is to tell a consumer what kind of metadata to expect to get about the token in addition to the generic metadata that is provided for all Occurrences. Thus for a LivingSpecimen we expect to be told what zoo, botanical garden, bacterial collection, etc. contains the specimen. For a PreservedSpecimen we expect to be told the preparation type, the location of the repository, etc. For a DigitalStillImage we expect to be told the file type, accessURL, etc. Simply providing a value for dcterms:type=StillImage doesn't indicate whether the image is a physical one (i.e. on film) or a digital one. It is also unreasonable to expect a client to have to be checking two different terms (basisOfRecord and dcterms:type) to find out what they could learn from one (basisOfRecord). Of course it would be advisable to provide a value for dcterms:type as well for clients outside the biodiversity community who may not "understand" what basisOfRecord means.
I hate to keep bringing my posts back to the RDF issue, but thinking about how one would write RDF forces clear thinking about how metadata should be structured. If we intend to separate tokens as entities from their associated Occurrence metadata, i.e. approach (2), then we open up a whole other can of worms. To associate the occurrence resources (i.e. the metadata) with the "different" resource (i.e. the token), we will probably have to be able to create URIs for the tokens and separate RDF metadata blocks which will have to be rdf:type'd. What are we going to use for that rdf:type - create another Darwin Core class?
I simply don't think that is a complicated road that we want to travel. It would be far easier to just say that every Occurrence has a one-to-one relationship with its token (which could be "the empty set" for observations). This would not work for people who want to hang multiple tokens on a single observation event, but I think that itself is a bad idea because it makes it even harder to have "flat" occurrence datasets. Just say that every time we collect a different token (or make an observation that has no token), it is a new Occurrence record. Realistically, a single collector can't actually take a picture of a plant at the same time he or she collects it for a specimen anyway. Those really should be considered two different events because they happen at different times. OK, enough said. Consider this my defense of my proposal "issue 68" to add DigitalStillImage. I would urge the powers that be to respond to the issues that I've raised here before having any kind of "vote" (or whatever is ultimately going to happen when there is an up or down decision about the proposal). Steve Steve Baskauf wrote: After the flurry of emails recently, I had an opportunity to carefully read all the way through the threads again, followed by enforced "think time" during my long commute. I was actually pretty cheerful after that because I think that in essence, most of the conversation about what constitutes an Occurrence really boils down to the same thing. So I have sat down and tried to summarize what seems to me to be a consensus about Occurrences. To follow my points, please refer to the diagram at: http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif Consensus on relationships 1. The fundamental definition of an Occurrence involves evidence that a representative of a taxon occurred at a place and time. Note 1.A: For clarity, I have modified John's statement in his last email by replacing "taxon" with "representative of a taxon". 
I'm considering a taxon to be an abstract concept that is applied to individuals or groups of organisms. Note 1.B. This definition is far more useful than the official definition of the class Occurrence "The category of information pertaining to evidence of an occurrence..." which is essentially circular. Note 1.C: This statement is extremely broad because the evidence could be of many sorts, the representative could range from a single individual to all organisms on the earth, the taxon could be anyone's definition at any taxonomic level, the place could range from a GPS point with uncertainty of less than 10 meters to the entire planet earth, and the time could range from a shutter click of less than one second to 3.4 billion years. 2. The diagram is an attempt to summarize in pictorial form statements and relationships that have been described in the thread. The taxon representative is recorded as existing at a particular time and place (the arrow) and the result is an Occurrence record. That Occurrence record exists as metadata which may be associated with a token that can be used to voucher the fact that the taxon representative existed. That token may be the organism itself (or a living part of it as in a twig for grafting), all or part of the organism in preserved form, an electronic representation such as an image or sound recording, and other kinds of things like tissue or DNA samples. There may also be no token at all, in which case we call the Occurrence record an observation. Based on direct observation of the taxon representative, examination of one or more tokens, or both, some determiner asserts that a taxon concept applies to the taxon representative and as a result a scientific name can be used to "identify" the taxon representative. (There may be a lot of other complicated stuff above the Identification box, but that will have to be filled in by the taxonomists.) 
Note 2.A: I have mapped onto this diagram the letters that John used in his last email to refer to entities that are involved in an Occurrence (T, E, L, O, and G). I will beg the forgiveness of fossil people because I don't really know how the geological context fits in. I'm assuming that it is a way of asserting time and location on a much broader scale than we do for extant organisms. Note 2.B: I have put a dotted line around the part of the diagram that I think includes all the things that people might consider part of the Occurrence itself. I have left out "T" and the other parts related to identification because it seems to me that you can have an occurrence that you document which does not yet (and perhaps never will) have an identification. The Occurrence still asserts that a taxon representative existed at a time and place; we just don't yet know what the taxon is. 3. The red lines indicate the relationships that connect the various entities (I'm going to go ahead and call them resources). Consistent with popular opinion, the Occurrence record is the center of the universe and most things are connected to it. Note 3.A: I am sticking to my guns and refuse to connect the Identification directly to the Occurrence. It is the taxon representative that is being identified, not the occurrence. One can assert another sort of relationship between the identification and the occurrence if one wants to say that one consulted the occurrence metadata and token in order to decide about the identification, but it is not correct to say that the Identification identifies either the Occurrence metadata or the token (as Rich pointed out). OK, so that's step one - defining what is related to what. If anyone disagrees with these relationships, please clarify or create your own diagram. Complicating circumstances/caveats 1. It is noted and recognized that some users will not care to include all of these relationships in their models. 
In the interest of simplification or "flattening" the relationships, they may wish to collapse some parts of this diagram (e.g. incorporate time and location metadata within the Occurrence metadata rather than considering them separate resources, applying scientific names directly to the taxon representatives without defining a taxon concept or recording the determination metadata, connecting identifications directly to the occurrence, etc.). This doesn't mean that the relationships don't exist, it just means that some users don't care about them. 2. It is recognized that different users will be interested in or able to specify the various resources to differing degrees of precision. Examples: A photographer might record times to the nearest second, a collector may only be interested in noting the date on which a specimen was collected. A location may be specified to the precision of a GPS reading or be defined as some geographic or political subdivision. The taxon representative may be an individual organism, a flock or clump, or some larger aggregation of taxon representatives. That's step two. If I've missed any complications, please point them out. My opinions about the implications of this diagram 1. The circle I've labeled as "taxon representative" is the resource type that I'm proposing to be represented by the class Individual. You will note that in both the definition of dwc:individualID ("An identifier for an individual or named group of individual organisms...") and the proposed class definition ("The category of information pertaining to an individual or named group of individual organisms represented in an Occurrence"), groups of individual organisms are included. 
Thus John's example of a fossil having myriad individuals, or Richard's examples of thousands of plankton, a large school of fish, herd of wildebeest, flock of birds, could all be categorized as "Individual" under this definition if there is a reasonable expectation that all of the individuals in the group are members of the same taxon. Perhaps there is a better name for this resource, but since dwc:individualID was already extant, I chose Individual as the class name for consistency with the pattern established with other classes and their associated xxxxID terms.
2. Although in note 1.C. I have given the ranges of the various resources to their logical extreme (as was done previously in the thread), I think that as a practical matter we can adopt guidelines to set reasonable values for the "normal" ranges of the resources. One such guideline might be that we suggest a range that can accommodate about 95% of the user needs within the community (this came from Rich's comment about satisfying 95% of the user need with an establishmentMeans controlled vocabulary). For example, it was suggested that the range for the location of an Occurrence could span the entire planet Earth. True enough, but virtually nobody would find such a span useful. 95% of users would probably find a range between a GPS reading with 10 meter precision and the extent of a county or province useful for recording the location of an Occurrence. I can suggest similar "useful" ranges: one second to one day for an event time (excluding fossils), one individual organism to the number of organisms that would fit within a 50 meter radius for an "individual", and taxon identified to family for plants and maybe mammals, genus for birds, and order for insects.
So framing the definition of an Occurrence in these terms it would be something like: "An occurrence involves evidence (consisting of a physical token, electronic record, or personal observation) that a representative (ranging from a single individual to the number that would fit on a football field) of a taxon (hopefully identified to some lower taxonomic level) occurred at a place (determined to a precision between that of a GPS reading and the size of a county/province) and time (spanning one second to one day)." A few people might object to this level of restrictiveness, but I would guess that it would make 95% of us happy.
3. With the exception of the "missing" class Individual, every resource type on this diagram except for the "token" and Scientific name has a Darwin Core class. Every resource type on the diagram except for "token" has a dwc:xxxxID term that can be used to refer to a GUID for the resource. The implication of this is that any resource on this diagram except for the token and taxon representative (i.e. Individual) is ready to be represented in RDF by Darwin Core terms in the sense that the relationships (red lines) can be represented by the xxxxID terms and that the resources can be rdf:type'd using Darwin Core classes. (Lacking a class for the scientific name doesn't seem like a big deal to me since the scientific name can be a string literal - but then I'm not a taxonomist.)
4. OK, I've avoided it as long as I can, so I'm going to confess now to the RDF-phobes. The red lines and shapes are something pretty close to an RDF graph. What that means is that if the community can agree that this diagram correctly represents the relationships among the kinds of biodiversity resources that we care about, then the matter of providing guidelines on how to represent Darwin Core in RDF suddenly gets a lot simpler. Just convert the "picture" of the RDF graph into XML format and we have a template.
Alright, that's an oversimplification, but I think it is essentially true because the most difficult part of achieving a consensus on RDF representations is to decide how we connect the resource types, not on the literals that we hang onto resources as properties. 5. While I'm beating the RDF drum again, the importance of my opinion number 2 can be extended into the GUID adoption process. In my comments to Kevin about the Beginner's Guide to Persistent Identifiers, I think I commented on the question of how one decides whether a GUID needs to be assigned to something or not. I believe that the answer to that question boils down to this: we need a GUID for any resource that will be referenced by more than one other resource. Do we need to be able to assign a GUID to Taxon concepts? Yes, because it is likely that many identifications will want to reference a particular taxon concept. Do we need to be able to assign a GUID to an Event? Maybe or maybe not. If every occurrence has its own separate time recorded, then no GUID is needed because the time is just a part of every separate occurrence record. If the event is defined to be a time range that represents a collecting trip, then there may be many Occurrences that are associated with that trip and all of them could reference the GUID for that event rather than repeating the event information for every Occurrence. The point here is that every shape (class of resources) on this diagram at least has the POTENTIAL to be a node connecting multiple resources and therefore should have the capability of being assigned a GUID, having its own RDF record, and being appropriately typed (presumably by a DwC class). So this is a final technical argument for why we need to have the DwC class Individual. 
Whether or not people ultimately choose to assign GUIDs to particular resource types is their own choice, but they need to at least be ABLE to if they need that resource to serve as a node given the structure of their metadata.
We need to clarify how the "token" thing fits in, but I'm stopping there for now. I would very much appreciate responses indicating that:
A. you agree with the diagram and connections (and consider this definition and diagram a consensus)
B. you disagree with the diagram (and articulate why)
C. you provide an alternative diagram or explanation of the relationships among the classes related to Occurrences.
Thanks for your patience with another tome.
Steve
--
Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A. delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235 office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base http://www.taxonconcept.org/ / GeoSpecies Knowledge Base http://lod.geospecies.org/ About the GeoSpecies Knowledge Base http://about.geospecies.org/
I am sorry I don't have the time to follow this extensive thread, but I can manage at least the first paragraphs ;) A quick comment on tying identification sources to a scientific name. As for other taxon concepts, this is usually done with the sec/sensu reference, which should be recorded as dwc:nameAccordingTo:
http://rs.tdwg.org/dwc/terms/index.htm#nameAccordingTo
I am slightly irritated that we seem to have some term duplicates for this use case. Maybe dwc:identificationReferences is supposed to only list additional references?
Markus
On Oct 18, 2010, at 18:49, Steve Baskauf wrote:
I've fallen behind on systematically perusing the list responses, but I would like to focus in on a point that seems to be a consensus in the responses that have shown up recently. The consensus seems to be that documenting determinations (a.k.a. instances of dwc:Identification class) that are applied to Individuals (or Occurrences if you don't believe in Individuals) is the way to go. So in my usual graphical way of thinking about this, I would draw a "relationship line" from the determination to the Individual (or Occurrence) on one side and from the determination to the species concept on the other. I will leave it up to the taxonomy people to decide which other things would be connected to the species concept and how all of their lines would be connected. The determination would have any of the properties that are terms listed in the dwc:Identification class (identifiedBy, dateIdentified, identificationReferences, identificationRemarks, identificationQualifier, and typeStatus). Some properties like dateIdentified and identificationReferences would be string literals and others (especially identifiedBy) should probably be GUIDs but could be literals if they had to be.
That all seems pretty clear. However, when I've started trying to do this in real life, I immediately have questions. Take a look at http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf which should show up as a web page in your browser.
- The original label identifies the species as Juncus diffusissimus. However, there is no indicator as to who originally identified it or when. My assumption is that it was the collector (Glen N. Montz) but I don't really know that. Do I assume that, or list the original determiner as "unknown"?
- Do we draw a distinction between the initial identification and subsequent annotations? I think the answer should be "no" and that's why I refer to both generically as "determinations".
- There is really no indication given on the annotation labels as to many of the things that we would like to know, such as the concept they had in mind, any source they used (if any), or the reason why they did the annotation. So how does one connect the name that they applied to the determination when there is no indication of the concept? Is this just something we can't do for old annotations and just something that we try to do from this point forward?
- The last question is one that I really want some opinions about. It seems to me that there are a number of reasons why one would apply a determination. One would be to correct an actual error in identification. One would be to increase the precision of a previous determination (e.g. an insect identified to family now is identified to species). One would be to assert a difference in opinion as to the correct way to group this individual with others (i.e. as in a taxonomic revision). Finally, a single determiner might apply several determinations to one individual and indicate in each determination the concept intended (i.e. if you subscribe to Cronquist, you'd call it X; if you like Radford's book, you'd call it Y; if you like Weakley's treatment, you'd call it Z). Some of these four reasons may be functionally equivalent, but how would you use Darwin Core to indicate the reason why you applied the determination? Please don't say "identificationRemarks"! From a machine-processing standpoint, this is something we should know and there should be some kind of controlled vocabulary to express it. For instance, if an identification is "deprecated" because it was in error (perhaps by the determiner him/herself), one would like the incorrect determination to show up in the historical metadata, but I wouldn't want it to be listed in a website index. The same would hold true if an annotator was able to pin the taxon down to a lower taxonomic level than the original identifier. If someone goes to the trouble to connect an Individual/Occurrence to several names under alternative concepts, there should be a way that a machine would know this so that a software user could select the concept they wanted to use and the name under that concept would pop up.
I don't really see any term under the current DwC that could be used to do this last thing. Am I missing something? Do we need several terms to explain the reason why we made the determination because the reasons fall into different categories?
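To make the question concrete, here is a minimal sketch of what a machine could do if each determination carried a reason from a controlled vocabulary. Everything in it is an assumption for illustration: the Reason values, the field names, and the rule that a correction or refinement supersedes earlier determinations are not part of Darwin Core. The names "Family A", "Species X", and "Species Y" are placeholders in the spirit of the X/Y/Z example above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Reason(Enum):
    # Hypothetical controlled vocabulary; none of these values exist in Darwin Core.
    INITIAL = "initial"
    CORRECTION = "correction"            # fixes an actual identification error
    REFINEMENT = "refinement"            # pins the taxon down to a lower level
    REVISION = "revision"                # difference of opinion about grouping
    ALTERNATIVE_CONCEPT = "alternative"  # name under another stated concept


@dataclass
class Determination:
    scientific_name: str
    identified_by: str
    date_identified: str                 # ISO 8601, so string order is date order
    reason: Reason
    according_to: Optional[str] = None   # the concept or treatment, if stated


def current_determinations(history: List[Determination]) -> List[Determination]:
    """Determinations to show in an index; the full history stays untouched."""
    current: List[Determination] = []
    for det in sorted(history, key=lambda d: d.date_identified):
        if det.reason in (Reason.CORRECTION, Reason.REFINEMENT):
            current = [det]              # assumed rule: supersedes what came before
        else:
            current.append(det)
    return current


def name_under(history: List[Determination], concept: str) -> Optional[str]:
    """Let a user pick a concept and get the name applied under it."""
    for det in current_determinations(history):
        if det.according_to == concept:
            return det.scientific_name
    return None


history = [
    Determination("Family A", "collector", "1990-05-01", Reason.INITIAL),
    Determination("Species X", "annotator", "2005-03-02", Reason.REFINEMENT, "Cronquist"),
    Determination("Species Y", "annotator", "2005-03-02", Reason.ALTERNATIVE_CONCEPT, "Radford"),
]
index_names = [d.scientific_name for d in current_determinations(history)]
```

With this kind of tagging the superseded family-level determination remains in the history but drops out of the index, and name_under(history, "Radford") returns the name applied under that treatment.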
The other comment that I'll throw out (since this is going out to the bioblitz list as well as to tdwg-content) is that those of you who are building apps to collect metadata in the field really need to separate the process of entering (or acquiring) the collection metadata from the determination process. In at least some apps, the user immediately has to commit to a taxon as they enter the data at the time of collection. It seems to me that it would be a very common situation (especially in the case of "citizen science") that the collector/observer/photographer would have no idea what the taxonomic identity was at the time of collection. The process of determination (and the recording of the various dwc:Identification class terms) is really a separate process that should be able to happen at the time of collection OR later.
Steve
Peter DeVries wrote:
Hi Steve,
I would hypothesize that for the vast majority of identified records the process is something like this:
An individual uses some sort of key to determine what species (taxon concept) to assign to a given individual
- They may have created some sort of mental key in which once they recognize one individual mosquito they can then pretty quickly sort a number of individuals into collections.
The actual name they assign to the specimen is usually based on what their key says the name is. Often this does not specify the authorship. Most of these human identifiers have not read the original species descriptions for the species they are identifying. So the specimen is actually tied to a concept that is based more on the "key" than the original description.
- An exception would be where there is a key in the original description and that was what was used.
So in a sense, modeling this as if the identifier actually asserted that the concept was the same as that described by the original description or a subsequent revision is "fudging".
Side effects of this process include:
- A new key for North American Mosquitoes comes out that incorporates recent changes in nomenclature. The major change being the elevation of a subgenus to a genus. For most of the species described the "key concept" is unchanged.
Student identifier Bob, in state X, is using the latest key, while student identifier Joe, in state Z, is using a slightly older edition of the same key.
Bob identifies the species as Ochlerotatus triseriatus, while Joe identifies what should be the same species as Aedes triseriatus.
These show up in GBIF on two different maps, and they show up in the EOL as two different pages.
Various TDWG'ers continue to argue that the original description and subsequent revisions were really important in determining what these individuals actually meant when they assigned a name to a specimen, and that this is how we should model it in excruciating detail.
I would argue this should be modeled as best as possible to what actually happens.
For example, how many of the species observed in the recent BioBlitz were identified by referring to the original species description or subsequent revisions?
In your diagram, I would suggest that you show that a taxon concept may have many names associated with it. Since it is not clear what the identifier intended by his or her choice of a name, it is often difficult to determine what taxon concept they actually meant.
This is why I advocate a move to a more taxon concept based identifier to link these data sets together, because this allows the intent of the identifier to be more accurately modeled.
This would be done in the form of:
"I assert that this specimen (of what I call Aedes triseriatus) was observed here. I also assert that it is an instance of this species concept => URI"
Or I assert that this is an individual of the type "Individual of species concept X" => URI
All of these are instances of the class "Individual"
So the resulting DarwinCore record would contain both the name and an optional, but I think needed, asserted species concept.
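As a rough illustration of why the asserted concept helps, the sketch below groups Bob's and Joe's records first by name string and then by concept URI. The "speciesConcept" key and the example.org URI are made up for this example; they are not Darwin Core terms or real identifiers.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical records: "speciesConcept" is an illustrative key, not a DwC term,
# and the URI is a placeholder.
records = [
    {"identifiedBy": "Bob", "scientificName": "Ochlerotatus triseriatus",
     "speciesConcept": "http://example.org/concept/123"},
    {"identifiedBy": "Joe", "scientificName": "Aedes triseriatus",
     "speciesConcept": "http://example.org/concept/123"},
]


def group_by(rows: List[Dict[str, str]], key: str) -> Dict[str, List[str]]:
    """Map each distinct value of `key` to the identifiers who used it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["identifiedBy"])
    return dict(groups)


by_name = group_by(records, "scientificName")     # two groups: two maps, two pages
by_concept = group_by(records, "speciesConcept")  # one group: one map, one page
```

Grouping on the name string splits the two records, while grouping on the asserted concept keeps them together.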
The species concept is a subclass of taxon concept, but is fundamentally different than the higher clades.
There are some guidelines as to what an entity needs to be considered a species, while there are no real guidelines as to what clades should be considered genera and what clades should be considered families, etc.
Assigning properties at the level of genus or family is also problematic because it assumes that there will be inferencing, and it will require rechecking that those properties are still valid if the species within that genus change.
So if there is some property that is common to all the species in the genus, make that a property of each of the individual species - not a property of the genus.
Respectfully,
- Pete
On Fri, Oct 15, 2010 at 10:45 AM, Steve Baskauf steve.baskauf@vanderbilt.edu wrote: As a background to this post, I want to reference a post by Bob called "SubclassOrNot". I discovered this page on an early foray into the TDWG website labyrinth and it has been very influential on my thinking since then. The idea Bob discusses is central to what I'm writing below so if you haven't read it you might want to do so first. You can probably skip the "OWL Inference" section and still get the point which is described in the first two sections of his post. The URL for the page is http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot .
To preface what I'm going to say below, I want to put Darwin Core Occurrences in the context of what Bob wrote. In my mind, one of the hallmarks of the Darwin Core standard and one thing that makes it a great improvement over previous versions is that the decision was made to use what Bob called the "has a" approach rather than the "is a" approach. In particular, the Darwin Core standard has a single class called dwc:Occurrence rather than subclasses called "Specimen", "Observation", and other possible things. The way that we differentiate among different kinds of Occurrences is by using the DwC types which are the controlled values for the term dwc:basisOfRecord. Thus we say an Occurrence "has a" basisOfRecord=PreservedSpecimen rather than saying it "is a" PreservedSpecimen. We say an Occurrence "has a" basisOfRecord=HumanObservation rather than saying it "is a" HumanObservation. This approach has greatly reduced the number of different terms in the standard since we don't have to have separate "ObservedBy" and "CollectedBy" terms, but rather can just have a single "recordedBy" term that applies to both specimens and observations. The same thing applies to many other things, like eventDate rather than DateCollected and DateObserved, locality rather than collectionLocality and observationLocality, etc. With the ratification of Darwin Core, this decision is now a fait accompli and not a subject of discussion or something optional for users of the standard. It also seems to be clear that as necessary new terms can be added to the DwC types which would then be valid controlled values for basisOfRecord.
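The "has a" approach can be sketched in a few lines. This is only an illustration under stated assumptions: the Python class and its field names are mine, and the set of allowed values is limited to the three basisOfRecord values mentioned in this post rather than the full DwC type vocabulary.

```python
from dataclasses import dataclass

# Only the values named in this post; the real DwC type vocabulary is larger.
BASIS_OF_RECORD = {"PreservedSpecimen", "LivingSpecimen", "HumanObservation"}


@dataclass
class Occurrence:
    """One class for every kind of Occurrence, tagged by basis_of_record."""
    basis_of_record: str
    recorded_by: str   # one term instead of separate "CollectedBy"/"ObservedBy"
    event_date: str    # one term instead of DateCollected/DateObserved
    locality: str      # one term instead of collection/observation locality

    def __post_init__(self) -> None:
        if self.basis_of_record not in BASIS_OF_RECORD:
            raise ValueError(f"unknown basisOfRecord: {self.basis_of_record}")


# Placeholder values, not real data.
specimen = Occurrence("PreservedSpecimen", "A. Collector", "2010-10-15", "Somewhere")
sighting = Occurrence("HumanObservation", "A. Observer", "2010-10-15", "Somewhere")
```

Adding a new kind of Occurrence then means adding one controlled value rather than a new class with its own parallel set of terms.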
Since the adoption of the DwC standard, the approach to Occurrences has been what I would describe as "I know an Occurrence when I see one". I consider this a pretty sloppy practice and as I indicated in my post last night, I think there is enough consensus about what an Occurrence is that we can come up with a better definition than "an occurrence is the category of information pertaining to evidence of an occurrence...". Another part of what I would characterize as sloppiness is the lack of a clear definition of what exactly basisOfRecord means. When I wrote my attempt at summarizing consensus last night, I dodged the question about what I called the "token". This "thing" has been called various names. In the previous discussion on the list, it was sometimes called "the evidence" of the occurrence. In the past I have called it "a representation" - however, I now think the term "token" is better because "representation" has a different technical meaning in the context of content negotiation. When we type an Occurrence by saying it has a basisOfRecord=PreservedSpecimen, we are saying that this Occurrence has as supporting evidence, or as a "token" if you prefer, all or part of the dead remains of the organism (i.e. what I'm calling "the Individual") that was being documented by the Occurrence. When we type an Occurrence by saying it has a basisOfRecord=LivingSpecimen, we are saying that this Occurrence has as a "token" the entire organism that was being documented (or some vegetative part of the live organism that was propagated). When we type an Occurrence by saying it has a basisOfRecord=HumanObservation, we are saying that the Occurrence has no supporting evidence other than the reputation of the observer to accurately record the metadata about the Occurrence. In other words, we "tag" an instance of a core class (to use Bob's words), Occurrence, by telling a metadata consumer what kind of token we are using as evidence of the Occurrence.
A fundamental part of creating a clear definition of what an Occurrence is, is to define exactly what we are including in the concept of Occurrence. One possibility is to (1) say that the two boxes at the right side of the diagram at http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif are fused and that both the Occurrence metadata and its associated token are what we consider to be "the Occurrence". Another approach (2) would be to say that the actual Occurrence as an entity is only the metadata part and that the token is a separate thing. A third approach is to say (3) that everything within the blue dotted lines is considered a part of the Occurrence (i.e. the metadata, the token, the event, and the locality). I don't think that, in an absolute sense, any one of these approaches is "right". The problem is that these approaches are used inconsistently, sometimes even by the same person, depending on the basisOfRecord. Differences in ways of thinking about this issue are part of why people aren't understanding the way other people are approaching the structuring of metadata. I have tried to consistently take the approach (1) that the two boxes on the right are fused, i.e. that the Occurrence metadata and the token should both be considered part of the entity that we call "an Occurrence". I think this is why Rich was confused in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001666.html when I said that it was "wrong" to assert that a scientific name is a property of an Occurrence - obviously it is silly to say that the token (photons on a film, sound patterns in a digital file) has a scientific name. Yet that is exactly what people do routinely when the token is a branch cut off a tree and glued to a piece of paper. They say that they are "identifying a specimen". What I am asking (actually demanding) is that the TDWG community get its act together and come to some consistency on this.
If we are going to take the approach (2), then we need to take specimens off their pedestal and treat them like we do any other token that we are using as evidence that an Occurrence happened. If we are going to do what was suggested for the BioBlitz in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001603.html, i.e. to call Occurrences "observations" and then link the tokens to them by associatedMedia, ResourceRelationship, or some other means (approach 2) then do it consistently for every kind of token, including specimens, and don't single out media tokens for punishment. I have in a sense "thrown down the gauntlet" on this issue by proposing that DigitalStillImage be added as a DwC type and as a controlled value for basisOfRecord (http://code.google.com/p/darwincore/issues/detail?id=68). I know what some people are going to say in response to this proposal. "Why do you need to have 'DigitalStillImage' as a value for basisOfRecord when you can just say that the resource's dcterms:type=StillImage?" The answer goes back to Bob's point. If we are going to go the "has a" path (which we already have in DwC for Occurrences) rather than subclassing everything, then we need to provide an appropriate value for the "tag" for any type of resource that a reasonable number of users will want to use as a token. I think it is clear from this and other Bioblitzes, my work in Bioimages, the whale tracking project, and many other examples, that there are plenty of people who are already using DigitalStillImages as tokens and we all need a controlled value to use for basisOfRecord. The other thing that we accomplish when we type an Occurrence by its basisOfRecord is to tell a consumer what kind of metadata to expect to get about the token in addition to the generic metadata that is provided for all Occurrences. Thus for a LivingSpecimen we expect to be told what zoo, botanical garden, bacterial collection, etc. contains the specimen. 
For a PreservedSpecimen we expect to be told the preparation type, the location of the repository, etc. For a DigitalStillImage we expect to be told the file type, accessURL, etc. Simply providing a value for dcterms:type=StillImage doesn't indicate whether the image is a physical one (i.e. on film) or a digital one. It is also unreasonable to expect a client to have to be checking two different terms (basisOfRecord and dcterms:type) to find out what they could learn from one (basisOfRecord). Of course it would be advisable to provide a value for dcterms:type as well for clients outside the biodiversity community who may not "understand" what basisOfRecord means.
I hate to keep bringing my posts back to the RDF issue, but thinking about how one would write RDF forces clear thinking about how metadata should be structured. If we intend to separate tokens as entities from their associated Occurrence metadata, i.e. approach (2), then we open up a whole other can of worms. To associate the occurrence resources (i.e. the metadata) with the "different" resource (i.e. the token), we will probably have to be able to create URIs for the tokens and separate RDF metadata blocks which will have to be rdfs:type'd. What are we going to use for that rdfs:type - create another Darwin Core class? I simply don't think that is a complicated road that we want to travel. It would be far easier to just say that every Occurrence has a one-to-one relationship with its token (which could be "the empty set" for observations). This would not work for people who want to hang multiple tokens on a single observation event, but I think that itself is a bad idea because it makes it even harder to have "flat" occurrence datasets. Just say that every time we collect a different token (or make an observation that has no token), it is a new Occurrence record. Realistically, a single collector can't actually take a picture of a plant at the same time he or she collects it for a specimen anyway.
Those really should be considered two different events because they happen at different times.
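Here is a minimal sketch of the one-token-per-Occurrence idea, turning one collecting event with several tokens into "flat" Occurrence records. The dictionary layout, the "tokens" and "tokenRef" keys, and the way occurrenceID values are built are assumptions for illustration only, and DigitalStillImage is the value proposed in issue 68, not an existing controlled value.

```python
from typing import Dict, List


def flatten(event: Dict) -> List[Dict]:
    """Return one flat Occurrence record per token (or one if there is no token)."""
    shared = {k: v for k, v in event.items() if k != "tokens"}
    tokens = event.get("tokens") or [None]   # "the empty set" for observations
    records = []
    for n, token in enumerate(tokens, start=1):
        record = dict(shared)
        record["occurrenceID"] = f"{event['eventID']}-occ{n}"  # hypothetical ID scheme
        if token is None:
            record["basisOfRecord"] = "HumanObservation"
        else:
            record["basisOfRecord"] = token["type"]
            record["tokenRef"] = token["ref"]  # hypothetical pointer to the token
        records.append(record)
    return records


# Placeholder data: one trip that yielded a specimen and a photograph.
trip = {
    "eventID": "trip1",
    "recordedBy": "A. Collector",
    "tokens": [
        {"type": "PreservedSpecimen", "ref": "sheet-0001"},
        {"type": "DigitalStillImage", "ref": "image-0001.jpg"},  # proposed value
    ],
}
flat = flatten(trip)
sighting = flatten({"eventID": "walk1", "recordedBy": "A. Observer"})
```

Each resulting record stands on its own and carries exactly one basisOfRecord, which is what keeps the dataset flat.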
OK, enough said. Consider this my defense of my proposal "issue 68" to add DigitalStillImage. I would urge the powers that be to respond to the issues that I've raised here before having any kind of "vote" (or whatever is ultimately going to happen when there is an up or down decision about the proposal).
Steve
Steve Baskauf wrote: After the flurry of emails recently, I had an opportunity to carefully read all the way through the threads again, followed by enforced "think time" during my long commute. I was actually pretty cheerful after that because I think that in essence, most of the conversation about what constitutes an Occurrence really boils down to the same thing. So I have sat down and tried to summarize what seems to me to be a consensus about Occurrences. To follow my points, please refer to the diagram at: http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif
Consensus on relationships
- The fundamental definition of an Occurrence involves evidence that a
representative of a taxon occurred at a place and time. Note 1.A: For clarity, I have modified John's statement in his last email by replacing "taxon" with "representative of a taxon". I'm considering a taxon to be an abstract concept that is applied to individuals or groups of organisms. Note 1.B. This definition is far more useful than the official definition of the class Occurrence "The category of information pertaining to evidence of an occurrence..." which is essentially circular. Note 1.C: This statement is extremely broad because the evidence could be of many sorts, the representative could range from a single individual to all organisms on the earth, the taxon could be anyone's definition at any taxonomic level, the place could range from a GPS point with uncertainty of less than 10 meters to the entire planet earth, and the time could range from a shutter click of less than one second to 3.4 billion years. 2. The diagram is an attempt to summarize in pictorial form statements and relationships that have been described in the thread. The taxon representative is recorded as existing at a particular time and place (the arrow) and the result is an Occurrence record. That Occurrence record exists as metadata which may be associated with a token that can be used to voucher the fact that the taxon representative existed. That token may be the organism itself (or a living part of it as in a twig for grafting), all or part of the organism in preserved form, an electronic representation such as an image or sound recording, and other kinds of things like tissue or DNA samples. There may also be no token at all, in which case we call the Occurrence record an observation. Based on direct observation of the taxon representative, examination of one or more tokens, or both, some determiner asserts that a taxon concept applies to the taxon representative and as a result a scientific name can be used to "identify" the taxon representative. 
(There may be a lot of other complicated stuff above the Identification box, but that will have to be filled in by the taxonomists.) Note 2.A: I have mapped onto this diagram the letters that John used in his last email to refer to entities that are involved in an Occurrence (T, E, L, O, and G). I will beg the forgiveness of fossil people because I don't really know how the geological context fits in. I'm assuming that it is a way of asserting time and location on a much broader scale than we do for extant organisms. Note 2.B: I have put a dotted line around the part of the diagram that I think includes all the things that people might consider part of the Occurrence itself. I have left out "T" and the other parts related to identification because it seems to me that you can have an occurrence that you document which does not yet (and perhaps never will) have an identification. The Occurrence still asserts that a taxon representative existed at a time and place; we just don't yet know what the taxon is. 3. The red lines indicate the relationships that connect the various entities (I'm going to go ahead and call them resources). Consistent with popular opinion, the Occurrence record is the center of the universe and most things are connected to it. Note 3.A: I am sticking to my guns and refuse to connect the Identification directly to the Occurrence. It is the taxon representative that is being identified, not the occurrence. One can assert another sort of relationship between the identification and the occurrence if one wants to say that one consulted the occurrence metadata and token in order to decide about the identification, but it is not correct to say that the Identification identifies either the Occurrence metadata or the token (as Rich pointed out).
OK, so that's step one - defining what is related to what. If anyone disagrees with these relationships, please clarify or create your own diagram.
Complicating circumstances/caveats
- It is noted and recognized that some users will not care to include
all of these relationships in their models. In the interest of simplification or "flattening" the relationships, they may wish to collapse some parts of this diagram (e.g. incorporate time and location metadata within the Occurrence metadata rather than considering them separate resources, applying scientific names directly to the taxon representatives without defining a taxon concept or recording the determination metadata, connecting identifications directly to the occurrence, etc.). This doesn't mean that the relationships don't exist, it just means that some users don't care about them. 2. It is recognized that different users will be interested in or able to specify the various resources to differing degrees of precision. Examples: A photographer might record times to the nearest second, a collector may only be interested in noting the date on which a specimen was collected. A location may be specified to the precision of a GPS reading or be defined as some geographic or political subdivision. The taxon representative may be an individual organism, a flock or clump, or some larger aggregation of taxon representatives.
That's step two. If I've missed any complications, please point them out.
My opinions about the implications of this diagram
- The circle I've labeled as "taxon representative" is the resource
type that I'm proposing to be represented by the class Individual. You will note that in both the definition of dwc:individualID ("An identifier for an individual or named group of individual organisms...") and the proposed class definition ("The category of information pertaining to an individual or named group of individual organisms represented in an Occurrence"), groups of individual organisms are included. Thus John's example of a fossil having myriad individuals, or Richard's examples of thousands of plankton, a large school of fish, herd of wildebeest, flock of birds, could all be categorized as "Individual" under this definition if there is a reasonable expectation that all of the individuals in the group are members of the same taxon. Perhaps there is a better name for this resource, but since dwc:individualID was already extant, I chose Individual as the class name for consistency with the pattern established with other classes and their associated xxxxID terms. 2. Although in note 1.C. I have given the ranges of the various resources to their logical extreme (as was done previously in the thread), I think that as a practical matter we can adopt guidelines to set reasonable values for the "normal" ranges of the resources. One such guideline might be that we suggest a range that can accommodate about 95% of the user needs within the community (this came from Rich's comment about satisfying 95% of the user need with an establishmentMeans controlled vocabulary). For example, it was suggested that the range for the location of an Occurrence could span the entire planet Earth. True enough, but virtually nobody would find such a span useful. 95% of users would probably find a range between a GPS reading with 10 meter precision and the extent of a county or province useful for recording the location of an Occurrence.
I can suggest similar "useful" ranges: one second to one day for an event time (excluding fossils), one individual organism to the number of organisms that would fit within a 50 meter radius for an "individual", and taxon identified to family for plants and maybe mammals, genus for birds, and order for insects. So framing the definition of an Occurrence in these terms it would be something like: "An occurrence involves evidence (consisting of a physical token, electronic record, or personal observation) that a representative (ranging from a single individual to the number that would fit on a football field) of a taxon (hopefully identified to some lower taxonomic level) occurred at a place (determined to a precision between that of a GPS reading and the size of a county/province) and time (spanning one second to one day)." A few people might object to this level of restrictiveness, but I would guess that it would make 95% of us happy. 3. With the exception of the "missing" class Individual, every resource type on this diagram except for the "token" and Scientific name has a Darwin Core class. Every resource type on the diagram except for "token" has a dwc:xxxxID term that can be used to refer to a GUID for the resource. The implication of this is that any resource on this diagram except for the token and taxon representative (i.e. Individual) is ready to be represented in RDF by Darwin Core terms in the sense that the relationships (red lines) can be represented by the xxxxID terms and that the resources can be rdfs:type'd using Darwin Core classes. (Lacking a class for the scientific name doesn't seem like a big deal to me since the scientific name can be a string literal - but then I'm not a taxonomist.) 4. OK, I've avoided it as long as I can, so I'm going to confess now to the RDF-phobes. The red lines and shapes are something pretty close to an RDF graph. 
What that means is that if the community can agree that this diagram correctly represents the relationships among the kinds of biodiversity resources that we care about, then the matter of providing guidelines on how to represent Darwin Core in RDF suddenly gets a lot simpler. Just convert the "picture" of the RDF graph into XML format and we have a template. Alright, that's an oversimplification, but I think it is essentially true because the most difficult part of achieving a consensus on RDF representations is to decide how we connect the resource types, not on the literals that we hang onto resources as properties. 5. While I'm beating the RDF drum again, the importance of my opinion number 2 can be extended into the GUID adoption process. In my comments to Kevin about the Beginner's Guide to Persistent Identifiers, I think I commented on the question of how one decides whether a GUID needs to be assigned to something or not. I believe that the answer to that question boils down to this: we need a GUID for any resource that will be referenced by more than one other resource. Do we need to be able to assign a GUID to Taxon concepts? Yes, because it is likely that many identifications will want to reference a particular taxon concept. Do we need to be able to assign a GUID to an Event? Maybe or maybe not. If every occurrence has its own separate time recorded, then no GUID is needed because the time is just a part of every separate occurrence record. If the event is defined to be a time range that represents a collecting trip, then there may be many Occurrences that are associated with that trip and all of them could reference the GUID for that event rather than repeating the event information for every Occurrence. 
The point here is that every shape (class of resources) on this diagram at least has the POTENTIAL to be a node connecting multiple resources and therefore should have the capability of being assigned a GUID, having its own RDF record, and being appropriately typed (presumably by a DwC class). So this is a final technical argument for why we need to have the DwC class Individual. Whether or not people ultimately choose to assign GUIDs to particular resource types or not is their own choice, but they need to at least be ABLE to if they need that resource to serve as a node given the structure of their metadata.
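The rule of thumb above (a resource needs a GUID when more than one other resource references it) is easy to check mechanically. The sketch below assumes the relationships are available as simple (subject, property, object) triples; the resource names are placeholders, and the property names are used only as labels for the red lines.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (referencing resource, property, referenced resource)

# Placeholder graph: two occurrences share a collecting trip, two determinations
# share a taxon concept, and one location is referenced only once.
triples: List[Triple] = [
    ("occ1", "eventID", "trip1"),
    ("occ2", "eventID", "trip1"),
    ("occ1", "locationID", "loc1"),
    ("det1", "taxonConceptID", "conceptA"),
    ("det2", "taxonConceptID", "conceptA"),
]


def needs_guid(graph: List[Triple]) -> List[str]:
    """Resources referenced by more than one distinct other resource."""
    referrers: Dict[str, Set[str]] = defaultdict(set)
    for subject, _prop, obj in graph:
        referrers[obj].add(subject)
    return sorted(obj for obj, subs in referrers.items() if len(subs) > 1)


shared_nodes = needs_guid(triples)
```

In this toy graph the trip and the taxon concept act as nodes and so would need GUIDs, while the location could simply be folded into its one Occurrence record.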
We need to clarify how the "token" thing fits in, but I'm stopping there for now. I would very much appreciate responses indicating that:
A. you agree with the diagram and connections (and consider this definition and diagram a consensus)
B. you disagree with the diagram (and articulate why)
C. you provide an alternative diagram or explanation of the relationships among the classes related to Occurrences.
Thanks for your patience with another tome. Steve
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content .
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base / GeoSpecies Knowledge Base About the GeoSpecies Knowledge Base
Hi Markus,
I feel your pain. :-)
Maybe an example might help clarify this.
I use the key* listed below to id my mosquitoes.
So I should mark up my RDF for the identification with something like:
<dwcterms:nameAccordingTo>Identification And Geographical Distribution Of The Mosquitoes: Of North America, North Of Mexico By Richard F., Jr. Darsie et al. 2004</dwcterms:nameAccordingTo>
Rather than use some other term like "dwc:identificationReferences"
Correct?
- Pete
* Identification And Geographical Distribution Of The Mosquitoes: Of North America, North Of Mexico By Richard F., Jr. Darsie, RONALD A. WARD, Chien C. Chang, Taina Litwak University Press of Florida, 2004 ISBN: 0813027845 Cite: 13463
=================================================================
On Mon, Oct 18, 2010 at 12:19 PM, "Markus Döring (GBIF)" mdoering@gbif.org wrote:
I am sorry I don't have the time to follow this extensive thread, but I can manage at least the first paragraphs ;) A quick comment on tying identification sources to a scientific name. As for other taxon concepts, this is usually done with the sec/sensu reference, which should be recorded as dwc:nameAccordingTo:
http://rs.tdwg.org/dwc/terms/index.htm#nameAccordingTo
I am slightly irritated that we seem to have some term duplicates for this use case. Maybe dwc:identificationReferences is supposed to only list additional references?
Markus
On Oct 18, 2010, at 18:49, Steve Baskauf wrote:
I've fallen behind on systematically perusing the list responses, but I would like to focus in on a point that seems to be a consensus in the responses that have shown up recently. The consensus seems to be that documenting determinations (a.k.a. instances of the dwc:Identification class) that are applied to Individuals (or Occurrences if you don't believe in Individuals) is the way to go. So in my usual graphical way of thinking about this, I would draw a "relationship line" from the determination to the Individual (or Occurrence) on one side and from the determination to the species concept on the other. I will leave it up to the taxonomy people to decide what different things would be connected to the species concept and how all of their lines would be connected. The determination would have any of the properties that are terms listed in the dwc:Identification class (identifiedBy, dateIdentified, identificationReferences, identificationRemarks, identificationQualifier, and typeStatus). Some properties like dateIdentified and identificationReferences would be string literals and others (especially identifiedBy) should probably be GUIDs but could be literals if they had to be.
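The two "relationship lines" can be sketched as a small record type. This is only an illustration: the class name Determination, the Python field names, and all example.org URIs are hypothetical, and only the six property names come from the dwc:Identification class listed above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Determination:
    """One dwc:Identification instance, linking an Individual to a species concept."""
    individual_id: str                      # the Individual (or Occurrence) being identified
    taxon_concept_id: Optional[str] = None  # None when the concept was never stated
    identified_by: Optional[str] = None     # ideally a GUID, but may be a literal
    date_identified: Optional[str] = None
    identification_references: Optional[str] = None
    identification_remarks: Optional[str] = None
    identification_qualifier: Optional[str] = None
    type_status: Optional[str] = None


def determinations_for(individual_id: str,
                       determinations: List[Determination]) -> List[Determination]:
    """Return every determination that points at the given Individual."""
    return [d for d in determinations if d.individual_id == individual_id]


all_determinations = [
    Determination("http://example.org/individual/1",
                  "http://example.org/concept/A",
                  identified_by="unknown"),
    Determination("http://example.org/individual/1",
                  "http://example.org/concept/B",
                  identified_by="http://example.org/agent/7",
                  date_identified="2010-10-18"),
]

found = determinations_for("http://example.org/individual/1", all_determinations)
none_yet = determinations_for("http://example.org/individual/2", all_determinations)
print(len(found), len(none_yet))
```

Because the determination is its own record, an Individual can carry several determinations or none at all.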
That all seems pretty clear. However, when I've started trying to do
this in real life, I immediately have questions. Take a look at
http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf which
should show up as a web page in your browser.
- The original label identifies the species as Juncus diffusissimus.
However, there is no indicator as to who originally identified it or when. My assumption is that it was the collector (Glen N. Montz) but I don't really know that. Do I assume that, or list the original determiner as "unknown"?
- Do we draw a distinction between the initial identification and
subsequent annotations? I think the answer should be "no" and that's why I refer to both generically as "determinations".
- There is really no indication given on the annotation labels as to
many of the things that we would like to know, such as the concept they had in mind, any source they used (if any), or the reason why they did the annotation. So how does one connect the name that they applied to the determination when there is no indication of the concept? Is this just something we can't do for old annotations and just something that we try to do from this point forward?
- The last question is one that I really want some opinions about.
It seems to me that there are a number of reasons why one would apply a determination. One would be to correct an actual error in identification. One would be to increase the precision of a previous determination (e.g. an insect identified to family now is identified to species). One would be to assert a difference in opinion as to the correct way to group this individual with others (i.e. as in a taxonomic revision). Finally, a single determiner might apply several determinations to one individual and indicate in each determination the concept intended (i.e. if you subscribe to Cronquist, you'd call it X; if you like Radford's book, you'd call it Y; if you like Weakley's treatment, you'd call it Z). Some of these four reasons may be functionally equivalent, but how would you use Darwin Core to indicate the reason why you applied the determination? Please don't say "identificationRemarks"! From a machine-processing standpoint, this is something we should know and there should be some kind of controlled vocabulary to express it. For instance, if an identification is "deprecated" because it was in error (perhaps by the determiner him/herself), one would like the incorrect determination to show up in the historical metadata, but I wouldn't want it to be listed in a website index. The same would hold true if an annotator was able to pin the taxon down to a lower taxonomic level than the original identifier. If someone goes to the trouble to connect an Individual/Occurrence to several names under alternative concepts, there should be a way that a machine would know this so that a software user could select the concept they wanted to use and the name under that concept would pop up.
I don't really see any term under the current DwC that could be used to
do this last thing. Am I missing something? Do we need several terms to explain the reason why we made the determination because the reasons fall into different categories?
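To make the request concrete, here is a rough sketch of what a controlled vocabulary for the reason behind a determination might let software do. No such term exists in Darwin Core; the four values, the field names, and the rule for hiding superseded determinations are all hypothetical and simply mirror the four reasons listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class DeterminationReason(Enum):
    """Hypothetical controlled values, one per reason described above."""
    ERROR_CORRECTION = "errorCorrection"
    INCREASED_PRECISION = "increasedPrecision"
    TAXONOMIC_OPINION = "taxonomicOpinion"
    ALTERNATIVE_CONCEPT = "alternativeConcept"


# Reasons that retire the earlier determination from a website index.
SUPERSEDING = {
    DeterminationReason.ERROR_CORRECTION,
    DeterminationReason.INCREASED_PRECISION,
}


@dataclass
class ReasonedDetermination:
    det_id: str
    name: str
    reason: Optional[DeterminationReason] = None  # None for the first determination
    replaces: Optional[str] = None                # det_id of the determination it annotates


def index_names(dets: List[ReasonedDetermination]) -> List[str]:
    """Names to list in an index; superseded determinations stay in the history only."""
    hidden = {
        d.replaces
        for d in dets
        if d.replaces is not None and d.reason in SUPERSEDING
    }
    return [d.name for d in dets if d.det_id not in hidden]


history = [
    ReasonedDetermination("det1", "Culicidae"),
    ReasonedDetermination("det2", "Aedes triseriatus",
                          DeterminationReason.INCREASED_PRECISION, replaces="det1"),
]
shown = index_names(history)
print(shown)
```

The full history is kept, while the index shows only the determination that was not superseded. Determinations made under alternative concepts would all remain visible, so that a user could pick among them.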
The other comment that I'll throw out (since this is going out to the
bioblitz list as well as to tdwg-content) is that those of you who are building apps to collect metadata in the field really need to separate the process of entering (or acquiring) the collection metadata from the determination process. In at least some apps, the user immediately has to commit to a taxon as they enter the data at the time of collection. It seems to me that it would be a very common situation (especially in the case of "citizen science") that the collector/observer/photographer would have no idea what the taxonomic identity was at the time of collection. The process of determination (and the recording of the various dwc:Identification class terms) is really a separate process that should be able to happen at the time of collection OR later.
Steve
Peter DeVries wrote:
Hi Steve,
I would hypothesize that for the vast majority of identified records the
process is something like this:
- An individual uses some sort of key to determine what species (taxon
concept) to assign to a given individual
- They may have created some sort of mental key in which once they
recognize one individual mosquito they can then pretty quickly sort
a number of individuals into collections.
- The actual name they assign to the specimen is usually based on what
their key says the name is. Often this does not specify the authorship.
Most of these human identifiers have not read the original species
descriptions for the species they are identifying.
So the specimen is actually tied to a concept that is based more on
the "key" than the original description.
* An exception would be where there is a key in the original
description and that was what was used.
- So in a sense, modeling this as if the identifier actually asserted
that the concept was the same as that described by
the original description or a subsequent revision is "fudging".
Side effects of this process include:
- A new key for North American Mosquitoes comes out that incorporates
recent changes in nomenclature. The major change being the elevation of
a subgenus to a genus. For most of the species described the "key
concept" is unchanged.
Student identifier Bob, in state X, is using the latest key, while
student identifier Joe, in state Z, is using a slightly older edition of the same key.
Bob identifies the species as Ochlerotatus triseriatus, while Joe
identifies what should be the same species as Aedes triseriatus.
These show up in GBIF on two different maps, they show up in the EOL as
two different pages.
Various TDWG'ers continue to argue that the original description and
subsequent revisions were really important in determining what these individuals
actually meant when they assigned a name to a specimen, and that this is
how we should model it in excruciating detail.
I would argue this should be modeled as best as possible to what
actually happens.
For example, how many of the species observed in the recent BioBlitz
were identified by referring to the original species description or subsequent revisions?
In your diagram, I would suggest that you show that a taxon concept may
have many names associated with it. Since it is not clear what the identifier intended by his or her choice of a name, it is often difficult to determine what taxon concept they actually meant.
This is why I advocate a move to a more taxon-concept-based identifier
to link these data sets together, because this allows the intent of the identifier
to be more accurately modeled.
This would be done in the form of:
"I assert that this specimen (of what I call Aedes triseriatus) was
observed here. I also assert that it is an instance of this species concept => URI"
Or: I assert that this is an individual of the type "Individual of
species concept X" => URI
All of these are instances of the class "Individual".
So the resulting DarwinCore record would contain both the name and
an optional, but I think needed, asserted species concept.
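The Bob and Joe example can be sketched in a few lines. The record keys and the concept URI below are hypothetical; the point is only that grouping by an asserted concept identifier reunites records that grouping by name string splits apart.

```python
from collections import defaultdict

# Hypothetical concept URI shared by both records.
CONCEPT = "http://example.org/concept/triseriatus"

records = [
    {"identifiedBy": "Bob", "scientificName": "Ochlerotatus triseriatus", "conceptURI": CONCEPT},
    {"identifiedBy": "Joe", "scientificName": "Aedes triseriatus", "conceptURI": CONCEPT},
]


def group_by(records, key):
    """Map each distinct value of `key` to the list of identifiers who used it."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["identifiedBy"])
    return dict(groups)


by_name = group_by(records, "scientificName")  # two groups: two maps, two pages
by_concept = group_by(records, "conceptURI")   # one group: one map, one page
print(len(by_name), len(by_concept))
```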
The species concept is a subclass of taxon concept, but is fundamentally
different than the higher clades.
There are some guidelines as to what an entity needs to be considered a
species, while there are no real guidelines as to what clades should be
considered genera, what clades should be considered families, etc.
Assigning properties at the level of genus or family is also
problematic because it assumes that there will be inferencing, and it will require rechecking
that those properties are still valid if the species within that genus
change.
So if there is some property that is common to all the species in the
genus, make that a property of each of the individual species - not a property
of the genus.
Respectfully,
- Pete
On Fri, Oct 15, 2010 at 10:45 AM, Steve Baskauf <
steve.baskauf@vanderbilt.edu> wrote:
As a background to this post, I want to reference a post by Bob called
"SubclassOrNot". I discovered this page on an early foray into the TDWG website labyrinth and it has been very influential on my thinking since then. The idea Bob discusses is central to what I'm writing below so if you haven't read it you might want to do so first. You can probably skip the "OWL Inference" section and still get the point which is described in the first two sections of his post. The URL for the page is http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot .
To preface what I'm going to say below, I want to put Darwin Core Occurrences in the context of what Bob wrote. In my mind, one of the hallmarks of the Darwin Core standard, and one thing that makes it a great improvement over previous versions, is that the decision was made to use what Bob called the "has a" approach rather than the "is a" approach. In particular, the Darwin Core standard has a single class called dwc:Occurrence rather than subclasses called "Specimen", "Observation", and other possible things. The way that we differentiate among different kinds of Occurrences is by using the DwC types, which are the controlled values for the term dwc:basisOfRecord. Thus we say an Occurrence "has a" basisOfRecord=PreservedSpecimen rather than saying it "is a" PreservedSpecimen. We say an Occurrence "has a" basisOfRecord=HumanObservation rather than saying it "is a" HumanObservation. This approach has greatly reduced the number of different terms in the standard since we don't have to have separate "ObservedBy" and "CollectedBy" terms, but rather can just have a single "recordedBy" term that applies to both specimens and observations. The same thing applies to many other things, like eventDate rather than DateCollected and DateObserved, locality rather than collectionLocality and observationLocality, etc. With the ratification of Darwin Core, this decision is now a fait accompli and not a subject of discussion or something optional for users of the standard. It also seems to be clear that, as necessary, new terms can be added to the DwC types, which would then be valid controlled values for basisOfRecord.
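A small sketch of the "has a" approach in code form. The Python class and field names are illustrative, the example people, dates, and places are invented, and the set of allowed values is only the subset of DwC types mentioned in this thread.

```python
from dataclasses import dataclass

# Subset of the DwC type vocabulary mentioned in this thread (not the full list).
BASIS_OF_RECORD_VALUES = {"PreservedSpecimen", "LivingSpecimen", "HumanObservation"}


@dataclass
class Occurrence:
    """A single class for all kinds of Occurrences; the kind is a tagged value."""
    basis_of_record: str
    recorded_by: str  # one term instead of separate ObservedBy / CollectedBy
    event_date: str   # one term instead of DateObserved / DateCollected
    locality: str     # one term instead of observationLocality / collectionLocality

    def __post_init__(self):
        if self.basis_of_record not in BASIS_OF_RECORD_VALUES:
            raise ValueError(f"not a controlled value: {self.basis_of_record}")


specimen = Occurrence("PreservedSpecimen", "A. Collector", "2010-10-15", "Example County")
sighting = Occurrence("HumanObservation", "An Observer", "2010-10-16", "Example Park")

# Both records share one class and one set of terms; only the tag differs.
print(type(specimen) is type(sighting), specimen.basis_of_record, sighting.basis_of_record)
```

Adding a new kind of Occurrence then means adding one controlled value rather than a new subclass with its own parallel set of terms.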
Since the adoption of the DwC standard, the approach to Occurrences has been what I would describe as "I know an Occurrence when I see one". I consider this a pretty sloppy practice and, as I indicated in my post last night, I think there is enough consensus about what an Occurrence is that we can come up with a better definition than "an occurrence is the category of information pertaining to evidence of an occurrence...". Another part of what I would characterize as sloppiness is the lack of a clear definition of what exactly basisOfRecord means. When I wrote my attempt at summarizing consensus last night, I dodged the question about what I called the "token". This "thing" has been called various names. In the previous discussion on the list, it was sometimes called "the evidence" of the occurrence. In the past I have called it "a representation" - however, I now think the term "token" is better because "representation" has a different technical meaning in the context of content negotiation. When we type an Occurrence by saying it has a basisOfRecord=PreservedSpecimen, we are saying that this Occurrence has as supporting evidence, or as a "token" if you prefer, all or part of the dead remains of the organism (i.e. what I'm calling "the Individual") that was being documented by the Occurrence. When we type an Occurrence by saying it has a basisOfRecord=LivingSpecimen, we are saying that this Occurrence has as a "token" the entire organism that was being documented (or some vegetative part of the live organism that was propagated). When we type an Occurrence by saying it has a basisOfRecord=HumanObservation, we are saying that the Occurrence has no supporting evidence other than the reputation of the observer to accurately record the metadata about the Occurrence. In other words, we "tag" an instance of a core class (to use Bob's words), Occurrence, by telling a metadata consumer what kind of token we are using as evidence of the Occurrence.
A fundamental part of creating a clear definition of what an Occurrence is, is to define exactly what we are including in the concept of Occurrence. One possibility is to (1) say that the two boxes at the right side of the diagram at http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif are fused and that both the Occurrence metadata and its associated token are what we consider to be "the Occurrence". Another approach (2) would be to say that the actual Occurrence as an entity is only the metadata part and that the token is a separate thing. A third approach is to say (3) that everything within the blue dotted lines is considered a part of the Occurrence (i.e. the metadata, the token, the event, and the locality). I don't think that, in an absolute sense, any one of these approaches is "right". The problem is that these approaches are used inconsistently, sometimes even by the same person, depending on the basisOfRecord. Differences in ways of thinking about this issue are part of why people aren't understanding the way other people are approaching the structuring of metadata. I have tried to consistently take approach (1), that the two boxes on the right are fused, i.e. that the Occurrence metadata and the token should both be considered part of the entity that we call "an Occurrence". I think this is why Rich was confused in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001666.html when I said that it was "wrong" to assert that a scientific name is a property of an Occurrence - obviously it is silly to say that the token (photons on a film, sound patterns in a digital file) has a scientific name. Yet that is exactly what people do routinely when the token is a branch cut off a tree and glued to a piece of paper. They say that they are "identifying a specimen". What I am asking (actually demanding) is that the TDWG community get its act together and come to some consistency on this.
If we are going to take the approach (2), then we need to take specimens off their pedestal and treat them like we do any other token that we are using as evidence that an Occurrence happened. If we are going to do what was suggested for the BioBlitz in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001603.html, i.e. to call Occurrences "observations" and then link the tokens to them by associatedMedia, ResourceRelationship, or some other means (approach 2) then do it consistently for every kind of token, including specimens, and don't single out media tokens for punishment.
I have in a sense "thrown down the gauntlet" on this issue by proposing
that DigitalStillImage be added as a DwC type and as a controlled value for basisOfRecord (http://code.google.com/p/darwincore/issues/detail?id=68). I know what some people are going to say in response to this proposal. "Why do you need to have 'DigitalStillImage' as a value for basisOfRecord when you can just say that the resource's dcterms:type=StillImage?" The answer goes back to Bob's point. If we are going to go the "has a" path (which we already have in DwC for Occurrences) rather than subclassing everything, then we need to provide an appropriate value for the "tag" for any type of resource that a reasonable number of users will want to use as a token. I think it is clear from this and other Bioblitzes, my work in Bioimages, the whale tracking project, and many other examples, that there are plenty of people who are already using DigitalStillImages as tokens and we all need a controlled value to use for basisOfRecord.
The other thing that we accomplish when we type an Occurrence by its
basisOfRecord is to tell a consumer what kind of metadata to expect to get about the token in addition to the generic metadata that is provided for all Occurrences. Thus for a LivingSpecimen we expect to be told what zoo, botanical garden, bacterial collection, etc. contains the specimen. For a PreservedSpecimen we expect to be told the preparation type, the location of the repository, etc. For a DigitalStillImage we expect to be told the file type, accessURL, etc. Simply providing a value for dcterms:type=StillImage doesn't indicate whether the image is a physical one (i.e. on film) or a digital one. It is also unreasonable to expect a client to have to be checking two different terms (basisOfRecord and dcterms:type) to find out what they could learn from one (basisOfRecord). Of course it would be advisable to provide a value for dcterms:type as well for clients outside the biodiversity community who may not "understand" what basisOfRecord means.
I hate to keep bringing my posts back to the RDF issue, but thinking about how one would write RDF forces clear thinking about how metadata should be structured. If we intend to separate tokens as entities from their associated Occurrence metadata, i.e. approach (2), then we open up a whole other can of worms. To associate the occurrence resources (i.e. the metadata) with the "different" resource (i.e. the token), we will probably have to be able to create URIs for the tokens and separate RDF metadata blocks, which will have to be rdf:type'd. What are we going to use for that rdf:type - create another Darwin Core class? I simply don't think that is a complicated road that we want to travel. It would be far easier to just say that every Occurrence has a one-to-one relationship with its token (which could be "the empty set" for observations). This would not work for people who want to hang multiple tokens on a single observation event, but I think that itself is a bad idea because it makes it even harder to have "flat" occurrence datasets. Just say that every time we collect a different token (or make an observation that has no token), it is a new Occurrence record. Realistically, a single collector can't actually take a picture of a plant at the same time he or she collects it for a specimen anyway. Those really should be considered two different events because they happen at different times.
OK, enough said. Consider this my defense of my proposal "issue 68" to
add DigitalStillImage. I would urge the powers that be to respond to the issues that I've raised here before having any kind of "vote" (or whatever is ultimately going to happen when there is an up or down decision about the proposal).
Steve
Steve Baskauf wrote:
After the flurry of emails recently, I had an opportunity to carefully read all the way through the threads again, followed by enforced "think time" during my long commute. I was actually pretty cheerful after that because I think that in essence, most of the conversation about what constitutes an Occurrence really boils down to the same thing. So I have sat down and tried to summarize what seems to me to be a consensus about Occurrences. To follow my points, please refer to the diagram at: http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif
Consensus on relationships
1. The fundamental definition of an Occurrence involves evidence that a representative of a taxon occurred at a place and time.
Note 1.A: For clarity, I have modified John's statement in his last email by replacing "taxon" with "representative of a taxon". I'm considering a taxon to be an abstract concept that is applied to individuals or groups of organisms.
Note 1.B: This definition is far more useful than the official definition of the class Occurrence "The category of information pertaining to evidence of an occurrence..." which is essentially circular.
Note 1.C: This statement is extremely broad because the evidence could be of many sorts, the representative could range from a single individual to all organisms on the earth, the taxon could be anyone's definition at any taxonomic level, the place could range from a GPS point with uncertainty of less than 10 meters to the entire planet earth, and the time could range from a shutter click of less than one second to 3.4 billion years.
2. The diagram is an attempt to summarize in pictorial form statements and relationships that have been described in the thread. The taxon representative is recorded as existing at a particular time and place (the arrow) and the result is an Occurrence record. That Occurrence record exists as metadata which may be associated with a token that can be used to voucher the fact that the taxon representative existed. That token may be the organism itself (or a living part of it as in a twig for grafting), all or part of the organism in preserved form, an electronic representation such as an image or sound recording, and other kinds of things like tissue or DNA samples. There may also be no token at all, in which case we call the Occurrence record an observation. Based on direct observation of the taxon representative, examination of one or more tokens, or both, some determiner asserts that a taxon concept applies to the taxon representative and as a result a scientific name can be used to "identify" the taxon representative. (There may be a lot of other complicated stuff above the Identification box, but that will have to be filled in by the taxonomists.)
Note 2.A: I have mapped onto this diagram the letters that John used in his last email to refer to entities that are involved in an Occurrence (T, E, L, O, and G). I will beg the forgiveness of fossil people because I don't really know how the geological context fits in. I'm assuming that it is a way of asserting time and location on a much broader scale than we do for extant organisms.
Note 2.B: I have put a dotted line around the part of the diagram that I think includes all the things that people might consider part of the Occurrence itself. I have left out "T" and the other parts related to identification because it seems to me that you can have an occurrence that you document which does not yet (and perhaps never will) have an identification. The Occurrence still asserts that a taxon representative existed at a time and place; we just don't yet know what the taxon is.
3. The red lines indicate the relationships that connect the various entities (I'm going to go ahead and call them resources). Consistent with popular opinion, the Occurrence record is the center of the universe and most things are connected to it.
Note 3.A: I am sticking to my guns and refuse to connect the Identification directly to the Occurrence. It is the taxon representative that is being identified, not the occurrence. One can assert another sort of relationship between the identification and the occurrence if one wants to say that one consulted the occurrence metadata and token in order to decide about the identification, but it is not correct to say that the Identification identifies either the Occurrence metadata or the token (as Rich pointed out).
OK, so that's step one - defining what is related to what. If anyone disagrees with these relationships, please clarify or create your own diagram.
Complicating circumstances/caveats
1. It is noted and recognized that some users will not care to include all of these relationships in their models. In the interest of simplification or "flattening" the relationships, they may wish to collapse some parts of this diagram (e.g. incorporate time and location metadata within the Occurrence metadata rather than considering them separate resources, applying scientific names directly to the taxon representatives without defining a taxon concept or recording the determination metadata, connecting identifications directly to the occurrence, etc.). This doesn't mean that the relationships don't exist, it just means that some users don't care about them.
2. It is recognized that different users will be interested in or able to specify the various resources to differing degrees of precision. Examples: A photographer might record times to the nearest second, a collector may only be interested in noting the date on which a specimen was collected. A location may be specified to the precision of a GPS reading or be defined as some geographic or political subdivision. The taxon representative may be an individual organism, a flock or clump, or some larger aggregation of taxon representatives.
That's step two. If I've missed any complications, please point them
out.
My opinions about the implications of this diagram
1. The circle I've labeled as "taxon representative" is the resource type that I'm proposing to be represented by the class Individual. You will note that in both the definition of dwc:individualID ("An identifier for an individual or named group of individual organisms...") and the proposed class definition ("The category of information pertaining to an individual or named group of individual organisms represented in an Occurrence"), groups of individual organisms are included. Thus John's example of a fossil having myriad individuals, or Richard's examples of thousands of plankton, a large school of fish, herd of wildebeest, flock of birds, could all be categorized as "Individual" under this definition if there is a reasonable expectation that all of the individuals in the group are members of the same taxon. Perhaps there is a better name for this resource, but since dwc:individualID was already extant, I chose Individual as the class name for consistency with the pattern established with other classes and their associated xxxxID terms.
2. Although in note 1.C. I have given the ranges of the various resources to their logical extreme (as was done previously in the thread), I think that as a practical matter we can adopt guidelines to set reasonable values for the "normal" ranges of the resources. One such guideline might be that we suggest a range that can accommodate about 95% of the user needs within the community (this came from Rich's comment about satisfying 95% of the user need with an establishmentMeans controlled vocabulary). For example, it was suggested that the range for the location of an Occurrence could span the entire planet Earth. True enough, but virtually nobody would find such a span useful. 95% of users would probably find a range between a GPS reading with 10 meter precision and the extent of a county or province useful for recording the location of an Occurrence. I can suggest similar "useful" ranges: one second to one day for an event time (excluding fossils), one individual organism to the number of organisms that would fit within a 50 meter radius for an "individual", and taxon identified to family for plants and maybe mammals, genus for birds, and order for insects. So framing the definition of an Occurrence in these terms it would be something like: "An occurrence involves evidence (consisting of a physical token, electronic record, or personal observation) that a representative (ranging from a single individual to the number that would fit on a football field) of a taxon (hopefully identified to some lower taxonomic level) occurred at a place (determined to a precision between that of a GPS reading and the size of a county/province) and time (spanning one second to one day)." A few people might object to this level of restrictiveness, but I would guess that it would make 95% of us happy.
3. With the exception of the "missing" class Individual, every resource type on this diagram except for the "token" and Scientific name has a Darwin Core class. Every resource type on the diagram except for "token" has a dwc:xxxxID term that can be used to refer to a GUID for the resource. The implication of this is that any resource on this diagram except for the token and taxon representative (i.e. Individual) is ready to be represented in RDF by Darwin Core terms in the sense that the relationships (red lines) can be represented by the xxxxID terms and that the resources can be rdf:type'd using Darwin Core classes. (Lacking a class for the scientific name doesn't seem like a big deal to me since the scientific name can be a string literal - but then I'm not a taxonomist.)
4. OK, I've avoided it as long as I can, so I'm going to confess now to the RDF-phobes. The red lines and shapes are something pretty close to an RDF graph. What that means is that if the community can agree that this diagram correctly represents the relationships among the kinds of biodiversity resources that we care about, then the matter of providing guidelines on how to represent Darwin Core in RDF suddenly gets a lot simpler. Just convert the "picture" of the RDF graph into XML format and we have a template. Alright, that's an oversimplification, but I think it is essentially true because the most difficult part of achieving a consensus on RDF representations is to decide how we connect the resource types, not on the literals that we hang onto resources as properties.
5. While I'm beating the RDF drum again, the importance of my opinion number 2 can be extended into the GUID adoption process. In my comments to Kevin about the Beginner's Guide to Persistent Identifiers, I think I commented on the question of how one decides whether a GUID needs to be assigned to something or not. I believe that the answer to that question boils down to this: we need a GUID for any resource that will be referenced by more than one other resource. Do we need to be able to assign a GUID to Taxon concepts? Yes, because it is likely that many identifications will want to reference a particular taxon concept. Do we need to be able to assign a GUID to an Event? Maybe or maybe not. If every occurrence has its own separate time recorded, then no GUID is needed because the time is just a part of every separate occurrence record. If the event is defined to be a time range that represents a collecting trip, then there may be many Occurrences that are associated with that trip and all of them could reference the GUID for that event rather than repeating the event information for every Occurrence. The point here is that every shape (class of resources) on this diagram at least has the POTENTIAL to be a node connecting multiple resources and therefore should have the capability of being assigned a GUID, having its own RDF record, and being appropriately typed (presumably by a DwC class). So this is a final technical argument for why we need to have the DwC class Individual. Whether or not people ultimately choose to assign GUIDs to particular resource types is their own choice, but they need to at least be ABLE to if they need that resource to serve as a node given the structure of their metadata.
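The GUID rule of thumb in opinion 5 can be sketched by treating the red lines as (subject, predicate, object) triples and counting how many distinct resources point at each object. The resource identifiers are invented, and using eventID, individualID, and taxonID as the link labels is an assumption made for illustration, not an agreed mapping of the diagram.

```python
from collections import defaultdict

# Hypothetical links between resources, written as (subject, predicate, object).
triples = [
    ("occurrence/1", "eventID", "event/trip-A"),
    ("occurrence/2", "eventID", "event/trip-A"),
    ("occurrence/1", "individualID", "individual/1"),
    ("occurrence/2", "individualID", "individual/2"),
    ("identification/1", "taxonID", "taxon/X"),
    ("identification/2", "taxonID", "taxon/X"),
]


def needs_guid(triples):
    """Return the resources that are referenced by more than one other resource."""
    referrers = defaultdict(set)
    for subject, _predicate, obj in triples:
        referrers[obj].add(subject)
    return sorted(obj for obj, subjects in referrers.items() if len(subjects) > 1)


shared = needs_guid(triples)
print(shared)
```

In this toy data set the collecting-trip event and the taxon concept are shared nodes, while each individual is referenced only once. A second Occurrence of individual/1 (for example a later photograph) would make that Individual a shared node as well.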
We need to clarify how the "token" thing fits in, but I'm stopping there for now. I would very much appreciate responses indicating that:
A. you agree with the diagram and connections (and consider this definition and diagram a consensus)
B. you disagree with the diagram (and articulate why)
C. you provide an alternative diagram or explanation of the relationships among the classes related to Occurrences.
Thanks for your patience with another tome. Steve
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content .
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base / GeoSpecies Knowledge Base About the GeoSpecies Knowledge Base
yes Steve, and in case one needs a normalised uri version there is also dwcterms:nameAccordingToID
On Oct 18, 2010, at 19:57, Peter DeVries wrote:
Hi Markus,
I feel your pain. :-)
Maybe an example might help clarify this.
I use the key* listed below to id my mosquitoes.
So I should mark up my RDF for the identification with something like:
<dwcterms:nameAccordingTo>Identification And Geographical Distribution Of The Mosquitoes: Of North America, North Of Mexico By Richard F., Jr. Darsie et al. 2004</dwcterms:nameAccordingTo>
Rather than use some other term like "dwc:identificationReferences"
Correct?
- Pete
Identification And Geographical Distribution Of The Mosquitoes: Of North America, North Of Mexico By Richard F., Jr. Darsie, RONALD A. WARD, Chien C. Chang, Taina Litwak University Press of Florida, 2004 ISBN: 0813027845 Cite: 13463
=================================================================
On Mon, Oct 18, 2010 at 12:19 PM, "Markus Döring (GBIF)" mdoering@gbif.org wrote:
I am sorry I don't have the time to follow this extensive thread, but I can manage at least the first paragraphs ;) A quick comment on tying identification sources to a scientific name. As for other taxon concepts, this is usually done with the sec/sensu reference, which should be recorded as dwc:nameAccordingTo:
http://rs.tdwg.org/dwc/terms/index.htm#nameAccordingTo
I am slightly irritated that we seem to have some term duplicates for this use case. Maybe dwc:identificationReferences is supposed to only list additional references?
Markus
On Oct 18, 2010, at 18:49, Steve Baskauf wrote:
I've fallen behind on systematically perusing the list responses, but I would like to focus in on a point that seems to be a consensus in the responses that have shown up recently. The consensus seems to be that documenting determinations (a.k.a. instances of the dwc:Identification class) that are applied to Individuals (or Occurrences if you don't believe in Individuals) is the way to go. So in my usual graphical way of thinking about this, I would draw a "relationship line" from the determination to the Individual (or Occurrence) on one side and from the determination to the species concept on the other. I will leave it up to the taxonomy people to decide what other things would be connected to the species concept and how all of their lines would be connected. The determination would have any of the properties that are terms listed in the dwc:Identification class (identifiedBy, dateIdentified, identificationReferences, identificationRemarks, identificationQualifier, and typeStatus). Some properties like dateIdentified and identificationReferences would be string literals and others (especially identifiedBy) should probably be GUIDs but could be literals if they had to be.
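A rough sketch of the determination as a node with two "relationship lines", one to the Individual (or Occurrence) and one to the species concept. The property names follow the dwc:Identification terms listed above; the class layout, the `urn:example:` identifiers, and the date are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Identification:
    # The two relationship lines.
    individual_id: str                       # GUID of the Individual (or Occurrence)
    taxon_concept_id: Optional[str] = None   # GUID of the concept, if it is known
    # Properties named after terms in the dwc:Identification class.
    identified_by: Optional[str] = None      # ideally a GUID, but could be a literal
    date_identified: Optional[str] = None    # string literal
    identification_references: Optional[str] = None
    identification_remarks: Optional[str] = None
    identification_qualifier: Optional[str] = None
    type_status: Optional[str] = None


def determinations_for(individual_id: str,
                       all_determinations: List[Identification]) -> List[Identification]:
    """Collect every determination that points at one Individual."""
    return [d for d in all_determinations if d.individual_id == individual_id]


# Placeholder data: an original label with no recorded determiner or concept,
# followed by a later annotation on the same Individual.
determinations = [
    Identification("urn:example:individual:1"),
    Identification("urn:example:individual:1",
                   taxon_concept_id="urn:example:concept:A",
                   identified_by="urn:example:agent:7",
                   date_identified="2010-10-18"),
    Identification("urn:example:individual:2",
                   taxon_concept_id="urn:example:concept:A"),
]

for d in determinations_for("urn:example:individual:1", determinations):
    print(d.identified_by, d.taxon_concept_id)
```

Nothing in this sketch distinguishes the initial identification from later annotations; both are simply determinations attached to the same Individual.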
That all seems pretty clear. However, when I've started trying to do this in real life, I immediately have questions. Take a look at http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf which should show up as a web page in your browser.
- The original label identifies the species as Juncus diffusissimus. However, there is no indicator as to who originally identified it or when. My assumption is that it was the collector (Glen N. Montz) but I don't really know that. Do I assume that, or list the original determiner as "unknown"?
- Do we draw a distinction between the initial identification and subsequent annotations? I think the answer should be "no" and that's why I refer to both generically as "determinations".
- There is really no indication given on the annotation labels as to many of the things that we would like to know, such as the concept they had in mind, any source they used (if any), or the reason why they did the annotation. So how does one connect the name that they applied to the determination when there is no indication of the concept? Is this just something we can't do for old annotations and just something that we try to do from this point forward?
- The last question is one that I really want some opinions about. It seems to me that there are a number of reasons why one would apply a determination. One would be to correct an actual error in identification. One would be to increase the precision of a previous determination (e.g. an insect identified to family now is identified to species). One would be to assert a difference in opinion as to the correct way to group this individual with others (i.e. as in a taxonomic revision). Finally, a single determiner might apply several determinations to one individual and indicate in each determination the concept intended (i.e. if you subscribe to Cronquist, you'd call it X; if you like Radford's book, you'd call it Y; if you like Weakley's treatment, you'd call it Z). Some of these four reasons may be functionally equivalent, but how would you use Darwin Core to indicate the reason why you applied the determination? Please don't say "identificationRemarks"! From a machine-processing standpoint, this is something we should know and there should be some kind of controlled vocabulary to express it. For instance, if an identification is "deprecated" because it was in error (perhaps by the determiner him/herself), one would like the incorrect determination to show up in the historical metadata, but I wouldn't want it to be listed in a website index. The same would hold true if an annotator was able to pin the taxon down to a lower taxonomic level than the original identifier. If someone goes to the trouble to connect an Individual/Occurrence to several names under alternative concepts, there should be a way that a machine would know this so that a software user could select the concept they wanted to use and the name under that concept would pop up.
I don't really see any term under the current DwC that could be used to do this last thing. Am I missing something? Do we need several terms to explain the reason why we made the determination because the reasons fall into different categories?
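To make the question concrete, here is a sketch of what a machine-readable "reason for determination" might allow. No such Darwin Core term exists in the discussion above; the `DeterminationReason` vocabulary, the `superseded` flag, and the field names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class DeterminationReason(Enum):
    """Hypothetical controlled vocabulary for the four reasons listed above."""
    ERROR_CORRECTION = "errorCorrection"
    INCREASED_PRECISION = "increasedPrecision"
    TAXONOMIC_OPINION = "taxonomicOpinion"
    ALTERNATIVE_CONCEPT = "alternativeConcept"


@dataclass
class Determination:
    name: str                                   # name applied
    concept: Optional[str]                      # concept/treatment intended, if stated
    reason: Optional[DeterminationReason] = None
    superseded: bool = False                    # kept for history, hidden from indexes


def index_names(dets: List[Determination]) -> List[str]:
    """Names that should appear in a website index (superseded ones are skipped)."""
    return [d.name for d in dets if not d.superseded]


def name_under_concept(dets: List[Determination], concept: str) -> Optional[str]:
    """Let a user pick a concept and get back the name applied under it."""
    for d in dets:
        if d.concept == concept and not d.superseded:
            return d.name
    return None


# One individual: an erroneous original name "W", then X/Y/Z under three treatments.
dets = [
    Determination("W", None, superseded=True),
    Determination("X", "Cronquist", DeterminationReason.ALTERNATIVE_CONCEPT),
    Determination("Y", "Radford", DeterminationReason.ALTERNATIVE_CONCEPT),
    Determination("Z", "Weakley", DeterminationReason.ALTERNATIVE_CONCEPT),
]

print(index_names(dets))
print(name_under_concept(dets, "Radford"))
```

The superseded determination stays in the list, so it remains available as historical metadata even though it is left out of the index.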
The other comment that I'll throw out (since this is going out to the bioblitz list as well as to tdwg-content) is that those of you who are building apps to collect metadata in the field really need to separate the process of entering (or acquiring) the collection metadata from the determination process. In at least some apps, the user immediately has to commit to a taxon as they enter the data at the time of collection. It seems to me that it would be a very common situation (especially in the case of "citizen science") that the collector/observer/photographer would have no idea what the taxonomic identity was at the time of collection. The process of determination (and the recording of the various dwc:Identification class terms) is really a separate process that should be able to happen at the time of collection OR later.
Steve
Peter DeVries wrote:
Hi Steve,
I would hypothesize that for the vast majority of identified records the process is something like this:
An individual uses some sort of key to determine what species (taxon concept) to assign to a given individual
- They may have created some sort of mental key in which once they recognize one individual mosquito they can then pretty quickly sort a number of individuals into collections.
The actual name they assign to the specimen is usually based on what their key says the name is. Often this does not specify the authorship. Most of these human identifiers have not read the original species descriptions for the species they are identifying. So the specimen is actually tied to a concept that is based more on the "key" than the original description.
- An exception would be where there is a key in the original description and that was what was used.
So in a sense, the process of modeling this as if the identifier actually asserted that the concept was the same as that described by the original description or a subsequent revision is "fudging".
Side effects of this process include:
- A new key for North American Mosquitoes comes out that incorporates recent changes in nomenclature. The major change being the elevation of a subgenus to a genus. For most of the species described the "key concept" is unchanged.
Student identifier Bob, in state X, is using the latest key, while student identifier Joe, in state Z, is using a slightly older edition of the same key.
Bob identifies the species as Ochlerotatus triseriatus, while Joe identifies what should be the same species as Aedes triseriatus.
These show up in GBIF on two different maps, they show up in the EOL as two different pages.
Various TDWG'ers continue to argue that the original description and subsequent revisions were really important in determining what these individuals actually meant when they assigned a name to a specimen, and that this is how we should model it in excruciating detail.
I would argue this should be modeled as best as possible to what actually happens.
For example, how many of the species observed in the recent BioBlitz were identified by referring to the original species description or subsequent revisions?
In your diagram, I would suggest that you show that a taxon concept may have many names associated with it. Since it is not clear what the identifier intended by his or her choice of a name, it is often difficult to determine what taxon concept they actually meant.
This is why I advocate a move to a more taxon-concept-based identifier to link these data sets together, because this allows the intent of the identifier to be more accurately modeled.
This would be done in the form of:
"I assert that this specimen (of what I call Aedes triseriatus) was observed here. I also assert that it is an instance of this species concept => URI"
Or "I assert that this is an individual of the type 'Individual of species concept X' => URI"
All of these are instances of the class "Individual"
So the resulting DarwinCore record would contain both the name and an optional, but I think needed, asserted species concept.
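A small sketch of the Bob and Joe case, assuming each record carries both the name string and an asserted concept URI. The `conceptURI` key and the `http://example.org/...` URI are placeholders, not real identifiers.

```python
from collections import defaultdict

# Two records of what should be the same species, named from different
# editions of the same key. The concept URI is a made-up placeholder.
records = [
    {"identifiedBy": "Bob",
     "scientificName": "Ochlerotatus triseriatus",
     "conceptURI": "http://example.org/concept/triseriatus"},
    {"identifiedBy": "Joe",
     "scientificName": "Aedes triseriatus",
     "conceptURI": "http://example.org/concept/triseriatus"},
]


def group_by(records, key):
    """Group the identifiers' names by the value each record has for `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["identifiedBy"])
    return dict(groups)


by_name = group_by(records, "scientificName")
by_concept = group_by(records, "conceptURI")

print(len(by_name), "groups when linked by name")
print(len(by_concept), "group when linked by concept URI")
```

Grouping on the name string splits the two records apart, which is the "two maps, two pages" side effect; grouping on the asserted concept keeps them together.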
The species concept is a subclass of taxon concept, but is fundamentally different than the higher clades.
There are some guidelines as to what an entity needs to be considered a species, while there are no real guidelines as to what clades should be considered genera and what clades should be considered families, etc.
Assigning properties at the level of genus or family is also problematic because it assumes that there will be inferencing, and it will require rechecking that those properties are still valid if the species within that genus change.
So if there is some property that is common to all the species in the genus, make that a property of each of the individual species - not a property of the genus.
Respectfully,
- Pete
On Fri, Oct 15, 2010 at 10:45 AM, Steve Baskauf steve.baskauf@vanderbilt.edu wrote: As a background to this post, I want to reference a post by Bob called "SubclassOrNot". I discovered this page on an early foray into the TDWG website labyrinth and it has been very influential on my thinking since then. The idea Bob discusses is central to what I'm writing below so if you haven't read it you might want to do so first. You can probably skip the "OWL Inference" section and still get the point which is described in the first two sections of his post. The URL for the page is http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot .
To preface what I'm going to say below, I want to put Darwin Core Occurrences in the context of what Bob wrote. In my mind, one of the hallmarks of the Darwin Core standard and one thing that makes it a great improvement over previous versions is that the decision was made to use what Bob called the "has a" approach rather than the "is a" approach. In particular, the Darwin Core standard has a single class called dwc:Occurrence rather than subclasses called "Specimen", "Observation", and other possible things. The way that we differentiate among different kinds of Occurrences is by using the DwC types which are the controlled values for the term dwc:basisOfRecord. Thus we say an Occurrence "has a" basisOfRecord=PreservedSpecimen rather than saying it "is a" PreservedSpecimen. We say an Occurrence "has a" basisOfRecord=HumanObservation rather than saying it "is a" HumanObservation. This approach has greatly reduced the number of different terms in the standard since we don't have to have separate "ObservedBy" and "CollectedBy" terms, but rather can just have a single "RecordedBy" term that applies to both specimens and observations. The same thing applies to many other things, like eventDate rather than DateCollected and DateObserved, locality rather than collectionLocality and observationLocality, etc. With the ratification of Darwin Core, this decision is now a fait accompli and not a subject of discussion or something optional for users of the standard. It also seems to be clear that as necessary new terms can be added to the DwC types which would then be valid controlled values for basisOfRecord.
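The "has a" approach can be sketched as a single Occurrence type that is tagged by basisOfRecord instead of being subclassed. The Python class and attribute names are illustrative, and the field values are placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class BasisOfRecord(Enum):
    """A few of the controlled values mentioned above."""
    PRESERVED_SPECIMEN = "PreservedSpecimen"
    LIVING_SPECIMEN = "LivingSpecimen"
    HUMAN_OBSERVATION = "HumanObservation"


@dataclass
class Occurrence:
    # One class for everything; the kind of Occurrence is a property ("has a"),
    # not a subclass ("is a").
    basis_of_record: BasisOfRecord
    recorded_by: str   # replaces separate CollectedBy / ObservedBy terms
    event_date: str    # replaces separate DateCollected / DateObserved terms
    locality: str      # replaces separate collection / observation localities


specimen = Occurrence(BasisOfRecord.PRESERVED_SPECIMEN,
                      "collector (placeholder)", "2010-10-15", "locality (placeholder)")
observation = Occurrence(BasisOfRecord.HUMAN_OBSERVATION,
                         "observer (placeholder)", "2010-10-15", "locality (placeholder)")

# Both records share one type and one set of terms; only the tag differs.
print(type(specimen) is type(observation))
print(specimen.basis_of_record.value, observation.basis_of_record.value)
```

Adding a new kind of Occurrence in this style means adding one more controlled value rather than a new class with its own parallel set of terms.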
Since the adoption of the DwC standard, the approach to Occurrences has been what I would describe as "I know an Occurrence when I see one". I consider this a pretty sloppy practice and as I indicated in my post last night, I think there is enough consensus about what an Occurrence is that we can come up with a better definition than "an occurrence is the category of information pertaining to evidence of an occurrence...". Another part of what I would characterize as sloppiness is the lack of a clear definition of what exactly basisOfRecord means. When I wrote my attempt at summarizing consensus last night, I dodged the question about what I called the "token". This "thing" has been called various names. In the previous discussion on the list, it was sometimes called "the evidence" of the occurrence. In the past I have called it "a representation" - however, I now think the term "token" is better because "representation" has a different technical meaning in the context of content negotiation. When we type an Occurrence by saying it has a basisOfRecord=PreservedSpecimen, we are saying that this Occurrence has as supporting evidence, or as a "token" if you prefer, all or part of the dead remains of the organism (i.e. what I'm calling "the Individual") that was being documented by the Occurrence. When we type an Occurrence by saying it has a basisOfRecord=LivingSpecimen, we are saying that this Occurrence has as a "token" the entire organism that was being documented (or some vegetative part of the live organism that was propagated). When we type an Occurrence by saying it has a basisOfRecord=HumanObservation, we are saying that the Occurrence has no supporting evidence other than the reputation of the observer to accurately record the metadata about the Occurrence. In other words, we "tag" an instance of a core class (to use Bob's words), Occurrence, by telling a metadata consumer what kind of token we are using as evidence of the Occurrence.
A fundamental part of creating a clear definition of what an Occurrence is, is to define exactly what we are including in the concept of Occurrence. One possibility is to (1) say that the two boxes at the right side of the diagram at http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif are fused and that both the Occurrence metadata and its associated token are what we consider to be "the Occurrence". Another approach (2) would be to say that the actual Occurrence as an entity is only the metadata part and that the token is a separate thing. A third approach is to say (3) that everything within the blue dotted lines is considered a part of the Occurrence (i.e. the metadata, the token, the event, and the locality). I don't think in an absolute sense, any one of these approaches is "right". The problem is that these approaches are used inconsistently, sometimes even by the same person, depending on the basisOfRecord. Differences in ways of thinking about this issue are part of why people aren't understanding the way other people are approaching the structuring of metadata. I have tried to consistently take the approach (1) that the two boxes on the right are fused, i.e. that the Occurrence metadata and the token should both be considered part of the entity that we call "an Occurrence". I think this is why Rich was confused in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001666.html when I said that it was "wrong" to assert that a scientific name is a property of an Occurrence - obviously it is silly to say that the token (photons on a film, sound patterns in a digital file) has a scientific name. Yet that is exactly what people do routinely when the token is a branch cut off a tree and glued to a piece of paper. They say that they are "identifying a specimen". What I am asking (actually demanding) is that the TDWG community get its act together and come to some consistency on this.
If we are going to take the approach (2), then we need to take specimens off their pedestal and treat them like we do any other token that we are using as evidence that an Occurrence happened. If we are going to do what was suggested for the BioBlitz in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001603.html, i.e. to call Occurrences "observations" and then link the tokens to them by associatedMedia, ResourceRelationship, or some other means (approach 2) then do it consistently for every kind of token, including specimens, and don't single out media tokens for punishment. I have in a sense "thrown down the gauntlet" on this issue by proposing that DigitalStillImage be added as a DwC type and as a controlled value for basisOfRecord (http://code.google.com/p/darwincore/issues/detail?id=68). I know what some people are going to say in response to this proposal. "Why do you need to have 'DigitalStillImage' as a value for basisOfRecord when you can just say that the resource's dcterms:type=StillImage?" The answer goes back to Bob's point. If we are going to go the "has a" path (which we already have in DwC for Occurrences) rather than subclassing everything, then we need to provide an appropriate value for the "tag" for any type of resource that a reasonable number of users will want to use as a token. I think it is clear from this and other Bioblitzes, my work in Bioimages, the whale tracking project, and many other examples, that there are plenty of people who are already using DigitalStillImages as tokens and we all need a controlled value to use for basisOfRecord. The other thing that we accomplish when we type an Occurrence by its basisOfRecord is to tell a consumer what kind of metadata to expect to get about the token in addition to the generic metadata that is provided for all Occurrences. Thus for a LivingSpecimen we expect to be told what zoo, botanical garden, bacterial collection, etc. contains the specimen. 
For a PreservedSpecimen we expect to be told the preparation type, the location of the repository, etc. For a DigitalStillImage we expect to be told the file type, accessURL, etc. Simply providing a value for dcterms:type=StillImage doesn't indicate whether the image is a physical one (i.e. on film) or a digital one. It is also unreasonable to expect a client to have to be checking two different terms (basisOfRecord and dcterms:type) to find out what they could learn from one (basisOfRecord). Of course it would be advisable to provide a value for dcterms:type as well for clients outside the biodiversity community who may not "understand" what basisOfRecord means.
I hate to keep bringing my posts back to the RDF issue, but thinking about how one would write RDF forces clear thinking about how metadata should be structured. If we intend to separate tokens as entities from their associated Occurrence metadata, i.e. approach (2), then we open up a whole other can of worms. To associate the occurrence resources (i.e. the metadata) with the "different" resource (i.e. the token), we will probably have to be able to create URIs for the tokens and separate RDF metadata blocks which will have to be rdf:type'd. What are we going to use for that rdf:type - create another Darwin Core class? I simply don't think that is a complicated road that we want to travel. It would be far easier to just say that every Occurrence has a one-to-one relationship with its token (which could be "the empty set" for observations). This would not work for people who want to hang multiple tokens on a single observation event, but I think that itself is a bad idea because it makes it even harder to have "flat" occurrence datasets. Just say that every time we collect a different token (or make an observation that has no token), it is a new Occurrence record. Realistically, a single collector can't actually take a picture of a plant at the same time he or she collects it for a specimen anyway.
Those really should be considered two different events because they happen at different times.
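A sketch of how a consumer might use basisOfRecord alone to know what token metadata to expect, under the one-Occurrence-per-token approach. DigitalStillImage is the proposed value from issue 68, not an accepted one, and apart from accessURL the field names below are invented for this example.

```python
# Token metadata a consumer might expect for each basisOfRecord value.
# An empty list means "no token": an observation backed only by the observer.
EXPECTED_TOKEN_FIELDS = {
    "LivingSpecimen": ["holdingInstitution"],
    "PreservedSpecimen": ["preparationType", "repositoryLocation"],
    "DigitalStillImage": ["fileType", "accessURL"],
    "HumanObservation": [],
}


def missing_token_fields(record):
    """List the expected token fields that a flat Occurrence record lacks."""
    expected = EXPECTED_TOKEN_FIELDS[record["basisOfRecord"]]
    return [field for field in expected if field not in record]


# One flat record per token collected (or per token-less observation).
image_record = {"basisOfRecord": "DigitalStillImage", "fileType": "image/jpeg"}
observation_record = {"basisOfRecord": "HumanObservation"}

print(missing_token_fields(image_record))
print(missing_token_fields(observation_record))
```

Because each record describes exactly one token, the check needs only the one term and the dataset stays flat; a client never has to consult dcterms:type to decide what to look for.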
OK, enough said. Consider this my defense of my proposal "issue 68" to add DigitalStillImage. I would urge the powers that be to respond to the issues that I've raised here before having any kind of "vote" (or whatever is ultimately going to happen when there is an up or down decision about the proposal).
Steve
Steve Baskauf wrote: After the flurry of emails recently, I had an opportunity to carefully read all the way through the threads again, followed by enforced "think time" during my long commute. I was actually pretty cheerful after that because I think that in essence, most of the conversation about what constitutes an Occurrence really boils down to the same thing. So I have sat down and tried to summarize what seems to me to be a consensus about Occurrences. To follow my points, please refer to the diagram at: http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif
Consensus on relationships
- The fundamental definition of an Occurrence involves evidence that a
representative of a taxon occurred at a place and time. Note 1.A: For clarity, I have modified John's statement in his last email by replacing "taxon" with "representative of a taxon". I'm considering a taxon to be an abstract concept that is applied to individuals or groups of organisms. Note 1.B. This definition is far more useful than the official definition of the class Occurrence "The category of information pertaining to evidence of an occurrence..." which is essentially circular. Note 1.C: This statement is extremely broad because the evidence could be of many sorts, the representative could range from a single individual to all organisms on the earth, the taxon could be anyone's definition at any taxonomic level, the place could range from a GPS point with uncertainty of less than 10 meters to the entire planet earth, and the time could range from a shutter click of less than one second to 3.4 billion years. 2. The diagram is an attempt to summarize in pictorial form statements and relationships that have been described in the thread. The taxon representative is recorded as existing at a particular time and place (the arrow) and the result is an Occurrence record. That Occurrence record exists as metadata which may be associated with a token that can be used to voucher the fact that the taxon representative existed. That token may be the organism itself (or a living part of it as in a twig for grafting), all or part of the organism in preserved form, an electronic representation such as an image or sound recording, and other kinds of things like tissue or DNA samples. There may also be no token at all, in which case we call the Occurrence record an observation. Based on direct observation of the taxon representative, examination of one or more tokens, or both, some determiner asserts that a taxon concept applies to the taxon representative and as a result a scientific name can be used to "identify" the taxon representative. 
(There may be a lot of other complicated stuff above the Identification box, but that will have to be filled in by the taxonomists.) Note 2.A: I have mapped onto this diagram the letters that John used in his last email to refer to entities that are involved in an Occurrence (T, E, L, O, and G). I will beg the forgiveness of fossil people because I don't really know how the geological context fits in. I'm assuming that it is a way of asserting time and location on a much broader scale than we do for extant organisms. Note 2.B: I have put a dotted line around the part of the diagram that I think includes all the things that people might consider part of the Occurrence itself. I have left out "T" and the other parts related to identification because it seems to me that you can have an occurrence that you document which does not yet (and perhaps never will) have an identification. The Occurrence still asserts that a taxon representative existed at a time and place; we just don't yet know what the taxon is. 3. The red lines indicate the relationships that connect the various entities (I'm going to go ahead and call them resources). Consistent with popular opinion, the Occurrence record is the center of the universe and most things are connected to it. Note 3.A: I am sticking to my guns and refuse to connect the Identification directly to the Occurrence. It is the taxon representative that is being identified, not the occurrence. One can assert another sort of relationship between the identification and the occurrence if one wants to say that one consulted the occurrence metadata and token in order to decide about the identification, but it is not correct to say that the Identification identifies either the Occurrence metadata or the token (as Rich pointed out).
OK, so that's step one - defining what is related to what. If anyone disagrees with these relationships, please clarify or create your own diagram.
Complicating circumstances/caveats
- It is noted and recognized that some users will not care to include
all of these relationships in their models. In the interest of simplification or "flattening" the relationships, they may wish to collapse some parts of this diagram (e.g. incorporate time and location metadata within the Occurrence metadata rather than considering them separate resources, applying scientific names directly to the taxon representatives without defining a taxon concept or recording the determination metadata, connecting identifications directly to the occurrence, etc.). This doesn't mean that the relationships don't exist, it just means that some users don't care about them. 2. It is recognized that different users will be interested in or able to specify the various resources to differing degrees of precision. Examples: A photographer might record times to the nearest second, a collector may only be interested in noting the date on which a specimen was collected. A location may be specified to the precision of a GPS reading or be defined as some geographic or political subdivision. The taxon representative may be an individual organism, a flock or clump, or some larger aggregation of taxon representatives.
That's step two. If I've missed any complications, please point them out.
My opinions about the implications of this diagram
- The circle I've labeled as "taxon representative" is the resource
type that I'm proposing to be represented by the class Individual. You will note that in both the definition of dwc:individualID ("An identifier for an individual or named group of individual organisms...") and the proposed class definition ("The category of information pertaining to an individual or named group of individual organisms represented in an Occurrence"), groups of individual organisms are included. Thus John's example of a fossil having myriad individuals, or Richard's examples of thousands of plankton, a large school of fish, herd of wildebeest, flock of birds, could all be categorized as "Individual" under this definition if there is a reasonable expectation that all of the individuals in the group are members of the same taxon. Perhaps there is a better name for this resource, but since dwc:individualID was already extant, I chose Individual as the class name for consistency with the pattern established with other classes and their associated xxxxID terms. 2. Although in note 1.C. I have given the ranges of the various resources to their logical extreme (as was done previously in the thread), I think that as a practical matter we can adopt guidelines to set reasonable values for the "normal" ranges of the resources. One such guideline might be that we suggest a range that can accommodate about 95% of the user needs within the community (this came from Rich's comment about satisfying 95% of the user need with an establishmentMeans controlled vocuabulary). For example, it was suggested that the range for the location of an Occurrence could span the entire planet Earth. True enough, but virtually nobody would find such a span useful. 95% of users would probably find a range between a GPS reading with 10 meter precision and the extent of a county or province useful for recording the location of an Occurrence. 
I can suggest similar "useful" ranges: one second to one day for an event time (excluding fossils), one individual organism to the number of organisms that would fit within a 50 meter radius for an "individual", and taxon identified to family for plants and maybe mammals, genus for birds, and order for insects. So framing the definition of an Occurrence in these terms it would be something like: "An occurrence involves evidence (consisting of a physical token, electronic record, or personal observation) that a representative (ranging from a single individual to the number that would fit on a football field) of a taxon (hopefully identified to some lower taxonomic level) occurred at a place (determined to a precision between that of a GPS reading and the size of a county/province) and time (spanning one second to one day)." A few people might object to this level of restrictiveness, but I would guess that it would make 95% of us happy. 3. With the exception of the "missing" class Individual, every resource type on this diagram except for the "token" and Scientific name has a Darwin Core class. Every resource type on the diagram except for "token" has a dwc:xxxxID term that can be used to refer to a GUID for the resource. The implication of this is that any resource on this diagram except for the token and taxon representative (i.e. Individual) is ready to be represented in RDF by Darwin Core terms in the sense that the relationships (red lines) can be represented by the xxxxID terms and that the resources can be rdfs:type'd using Darwin Core classes. (Lacking a class for the scientific name doesn't seem like a big deal to me since the scientific name can be a string literal - but then I'm not a taxonomist.) 4. OK, I've avoided it as long as I can, so I'm going to confess now to the RDF-phobes. The red lines and shapes are something pretty close to an RDF graph. 
What that means is that if the community can agree that this diagram correctly represents the relationships among the kinds of biodiversity resources that we care about, then the matter of providing guidelines on how to represent Darwin Core in RDF suddenly gets a lot simpler. Just convert the "picture" of the RDF graph into XML format and we have a template. Alright, that's an oversimplification, but I think it is essentially true because the most difficult part of achieving a consensus on RDF representations is to decide how we connect the resource types, not on the literals that we hang onto resources as properties. 5. While I'm beating the RDF drum again, the importance of my opinion number 2 can be extended into the GUID adoption process. In my comments to Kevin about the Beginner's Guide to Persistent Identifiers, I think I commented on the question of how one decides whether a GUID needs to be assigned to something or not. I believe that the answer to that question boils down to this: we need a GUID for any resource that will be referenced by more than one other resource. Do we need to be able to assign a GUID to Taxon concepts? Yes, because it is likely that many identifications will want to reference a particular taxon concept. Do we need to be able to assign a GUID to an Event? Maybe or maybe not. If every occurrence has its own separate time recorded, then no GUID is needed because the time is just a part of every separate occurrence record. If the event is defined to be a time range that represents a collecting trip, then there may be many Occurrences that are associated with that trip and all of them could reference the GUID for that event rather than repeating the event information for every Occurrence. 
The point here is that every shape (class of resources) on this diagram at least has the POTENTIAL to be a node connecting multiple resources and therefore should have the capability of being assigned a GUID, having its own RDF record, and being appropriately typed (presumably by a DwC class). So this is a final technical argument for why we need to have the DwC class Individual. Whether or not people ultimately choose to assign GUIDs to particular resource types or not is their own choice, but they need to at least be ABLE to if they need that resource to serve as a node given the structure of their metadata.
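To make the rule in point 5 concrete, here is a minimal sketch in Python of the diagram's red lines written down as subject/predicate/object triples, with a helper that reports which resources are referenced by more than one other resource and so would need a GUID under that rule. All identifiers (occ:1, event:trip, ind:7, taxon:X, and so on) are hypothetical placeholders, and attaching dwc:individualID and dwc:taxonConceptID to an Identification is an assumption made for illustration, not an agreed Darwin Core usage.

```python
from collections import defaultdict

# Hypothetical resources; only the dwc:xxxxID term names come from Darwin Core.
TRIPLES = [
    ("occ:1", "rdf:type", "dwc:Occurrence"),
    ("occ:1", "dwc:eventID", "event:trip"),
    ("occ:1", "dwc:locationID", "loc:a"),
    ("occ:1", "dwc:individualID", "ind:7"),
    ("occ:2", "rdf:type", "dwc:Occurrence"),
    ("occ:2", "dwc:eventID", "event:trip"),
    ("occ:2", "dwc:locationID", "loc:b"),
    ("occ:2", "dwc:individualID", "ind:8"),
    # Assumption: the Identification points at the Individual, not the Occurrence.
    ("ident:1", "rdf:type", "dwc:Identification"),
    ("ident:1", "dwc:individualID", "ind:7"),
    ("ident:1", "dwc:taxonConceptID", "taxon:X"),
    ("ident:2", "rdf:type", "dwc:Identification"),
    ("ident:2", "dwc:individualID", "ind:8"),
    ("ident:2", "dwc:taxonConceptID", "taxon:X"),
]


def resources_needing_guids(triples):
    """Return resources referenced (via an xxxxID term) by more than one subject."""
    referrers = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate.endswith("ID"):  # only the linking terms, not rdf:type
            referrers[obj].add(subject)
    return sorted(res for res, subjects in referrers.items() if len(subjects) > 1)


if __name__ == "__main__":
    for resource in resources_needing_guids(TRIPLES):
        print(resource)
```

In this toy data the two locations are each referenced only once, so they could stay embedded in their Occurrence records, while the shared event, the Individuals, and the taxon concept act as nodes.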
We need to clarify how the "token" thing fits in, but I'm stopping there for now. I would very much appreciate responses indicating that:
A. you agree with the diagram and connections (and consider this definition and diagram a consensus)
B. you disagree with the diagram (and articulate why)
C. you provide an alternative diagram or explanation of the relationships among the classes related to Occurrences.
Thanks for your patience with another tome. Steve
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content .
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base / GeoSpecies Knowledge Base About the GeoSpecies Knowledge Base
So are we saying that dwc:nameAccordingTo can be a property of a dwc:Identification? What's dwc:identificationReferences for? I'm sorry if this is a dumb question, but I can plead ignorance on this topic. Steve
Peter DeVries wrote:
Hi Markus,
I feel your pain. :-)
Maybe an example might help clarify this.
I use the key* listed below to id my mosquitoes.
So I should mark up my RDF for the identification with something like:
<dwcterms:nameAccordingTo>Identification And Geographical Distribution Of The Mosquitoes: Of North America, North Of Mexico By Richard F., Jr. Darsie et al. 2004</dwcterms:nameAccordingTo>
Rather than use some other term like "dwc:identificationReferences"
Correct?
- Pete
* Identification And Geographical Distribution Of The Mosquitoes: Of North America, North Of Mexico By Richard F., Jr. Darsie, Ronald A. Ward, Chien C. Chang, Taina Litwak University Press of Florida, 2004 ISBN: 0813027845 Cite: 13463
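As a minimal sketch of the markup Pete describes, the following Python builds an RDF/XML description of an identification that carries the key as a dwcterms:nameAccordingTo literal. The subject URI (http://example.org/identification/1) and the function name are hypothetical, and whether nameAccordingTo or identificationReferences is the right term here is exactly the open question in this thread, so treat the choice of property as an assumption.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DWC = "http://rs.tdwg.org/dwc/terms/"

# Ask ElementTree to serialize with readable prefixes instead of ns0, ns1.
ET.register_namespace("rdf", RDF)
ET.register_namespace("dwcterms", DWC)


def identification_rdf(about, name_according_to):
    """Serialize one Identification with a nameAccordingTo literal as RDF/XML."""
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": about})
    # Type the resource with the Darwin Core Identification class.
    ET.SubElement(desc, f"{{{RDF}}}type", {f"{{{RDF}}}resource": DWC + "Identification"})
    # The key used for the determination, recorded as a plain string literal.
    ET.SubElement(desc, f"{{{DWC}}}nameAccordingTo").text = name_according_to
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(identification_rdf(
        "http://example.org/identification/1",  # hypothetical URI
        "Identification And Geographical Distribution Of The Mosquitoes "
        "Of North America, North Of Mexico. Darsie et al. 2004",
    ))
```

If identificationReferences turns out to be the intended term, only the element name in the last SubElement call would change.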
=================================================================
On Mon, Oct 18, 2010 at 12:19 PM, "Markus Döring (GBIF)" <mdoering@gbif.org> wrote:
I am sorry I don't have the time to follow this extensive thread, but I can manage at least the first paragraphs ;)

A quick comment on tying identification sources to a scientific name. As for other taxon concepts, this is usually done with the sec/sensu reference, which should be recorded as dwc:nameAccordingTo: http://rs.tdwg.org/dwc/terms/index.htm#nameAccordingTo

I am slightly irritated that we seem to have some term duplicates for this use case. Maybe dwc:identificationReferences is supposed to only list additional references?

Markus

On Oct 18, 2010, at 18:49, Steve Baskauf wrote:
> I've fallen behind on systematically perusing the list responses, but I would like to focus in on a point that seems to be a consensus in the responses that have shown up recently. The consensus seems to be that documenting determinations (a.k.a. instances of dwc:Identification class) that are applied to Individuals (or Occurrences if you don't believe in Individuals) is the way to go. So in my usual graphical way of thinking about this, I would draw a "relationship line" from the determination to the Individual (or Occurrence) on one side and from the determination to the species concept on the other. I will leave it up to the taxonomy people to say which different things would be connected to the species concept and how all of their lines would be connected. The determination would have any of the properties that are terms listed in the dwc:Identification class (identifiedBy, dateIdentified, identificationReferences, identificationRemarks, identificationQualifier, and typeStatus). Some properties like dateIdentified and identificationReferences would be string literals and others (especially identifiedBy) should probably be GUIDs but could be literals if they had to be.
>
> That all seems pretty clear. However, when I've started trying to do this in real life, I immediately have questions.
> Take a look at http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf which should show up as a web page in your browser.
>
> 1. The original label identifies the species as Juncus diffusissimus. However, there is no indicator as to who originally identified it or when. My assumption is that it was the collector (Glen N. Montz) but I don't really know that. Do I assume that, or list the original determiner as "unknown"?
> 2. Do we draw a distinction between the initial identification and subsequent annotations? I think the answer should be "no" and that's why I refer to both generically as "determinations".
> 3. There is really no indication given on the annotation labels as to many of the things that we would like to know, such as the concept they had in mind, any source they used (if any), or the reason why they did the annotation. So how does one connect the name that they applied to the determination when there is no indication of the concept? Is this just something we can't do for old annotations and just something that we try to do from this point forward?
> 4. The last question is one that I really want some opinions about. It seems to me that there are a number of reasons why one would apply a determination. One would be to correct an actual error in identification. One would be to increase the precision of a previous determination (e.g. an insect identified to family now is identified to species). One would be to assert a difference in opinion as to the correct way to group this individual with others (i.e. as in a taxonomic revision). Finally, a single determiner might apply several determinations to one individual and indicate in each determination the concept intended (i.e. if you subscribe to Cronquist, you'd call it X; if you like Radford's book, you'd call it Y; if you like Weakley's treatment, you'd call it Z).
> Some of these four reasons may be functionally equivalent, but how would you use Darwin Core to indicate the reason why you applied the determination? Please don't say "identificationRemarks"! From a machine-processing standpoint, this is something we should know and there should be some kind of controlled vocabulary to express it. For instance, if an identification is "deprecated" because it was in error (perhaps by the determiner him/herself), one would like the incorrect determination to show up in the historical metadata, but I wouldn't want it to be listed in a website index. The same would hold true if an annotator was able to pin the taxon down to a lower taxonomic level than the original identifier. If someone goes to the trouble to connect an Individual/Occurrence to several names under alternative concepts, there should be a way that a machine would know this so that a software user could select the concept they wanted to use and the name under that concept would pop up.
>
> I don't really see any term under the current DwC that could be used to do this last thing. Am I missing something? Do we need several terms to explain the reason why we made the determination because the reasons fall into different categories?
>
> The other comment that I'll throw out (since this is going out to the bioblitz list as well as to tdwg-content) is that those of you who are building apps to collect metadata in the field really need to separate the process of entering (or acquiring) the collection metadata from the determination process. In at least some apps, the user immediately has to commit to a taxon as they enter the data at the time of collection. It seems to me that it would be a very common situation (especially in the case of "citizen science") that the collector/observer/photographer would have no idea what the taxonomic identity was at the time of collection.
> The process of determination (and the recording of the various dwc:Identification class terms) is really a separate process that should be able to happen at the time of collection OR later.
>
> Steve
>
> Peter DeVries wrote:
>> Hi Steve,
>>
>> I would hypothesize that for the vast majority of identified records the process is something like this:
>>
>> 1) An individual uses some sort of key to determine what species (taxon concept) to assign to a given individual
>> * They may have created some sort of mental key in which once they recognize one individual mosquito they can then pretty quickly sort a number of individuals into collections.
>>
>> 2) The actual name they assign to the specimen is usually based on what their key says the name is. Often this does not specify the authorship. Most of these human identifiers have not read the original species descriptions for the species they are identifying. So the specimen is actually tied to a concept that is based more on the "key" than the original description.
>> * An exception would be where there is a key in the original description and that was what was used.
>>
>> 3) So in a sense, the process of modeling this as if the identifier actually asserted that the concept was the same as that described by the original description or a subsequent revision is "fudging"
>>
>> Side effects of this process include:
>>
>> 1) A new key for North American Mosquitoes comes out that incorporates recent changes in nomenclature. The major change being the elevation of a subgenus to a genus. For most of the species described the "key concept" is unchanged.
>>
>> Student identifier, Bob, in state X is using the latest key, while student identifier, Joe, in state Z is using a slightly older edition of the same key.
>>
>> Bob identifies the species as Ochlerotatus triseriatus, while Joe identifies what should be the same species as Aedes triseriatus.
>>
>> These show up in GBIF on two different maps, they show up in the EOL as two different pages.
>>
>> Various TDWG'ers continue to argue that the original description and subsequent revisions were really important in determining what these individuals actually meant when they assigned a name to a specimen, and that this is how we should model it in excruciating detail.
>>
>> I would argue this should be modeled as best as possible to what actually happens.
>>
>> For example, how many of the species observed in the recent BioBlitz were identified by referring to the original species description or subsequent revisions?
>>
>> In your diagram, I would suggest that you show that a taxon concept may have many names associated with it. Since it is not clear what the identifier intended by his or her choice of a name, it is often difficult to determine what taxon concept they actually meant.
>>
>> This is why I advocate a move to a more taxon concept based identifier to link these data sets together because this allows the intent of the identifier to be more accurately modeled.
>>
>> This would be done in the form of:
>>
>> "I assert that this specimen (of what I call Aedes triseriatus) was observed here. I also assert that it is an instance of this species concept => URI"
>>
>> Or I assert that this is an individual of the type "Individual of species concept X" => URI
>>
>> All of these are instances of the class "Individual"
>>
>> So the resulting DarwinCore record would contain both the name and an optional, but I think needed, asserted species concept.
>>
>> The species concept is a subclass of taxon concept, but is fundamentally different than the higher clades.
>>
>> There are some guidelines as to what an entity needs to be considered a species.
>>
>> While there are no real guidelines as to what clades should be considered genera and what clades should be considered families etc.
>>
>> Assigning properties at the level of genera or family is also problematic because it assumes that there will be inferencing and it will require rechecking that those properties are still valid if the species within that genus change.
>>
>> So if there is some property that is common to all the species in the genus, make that a property of each of the individual species - not a property of the genus.
>>
>> Respectfully,
>>
>> - Pete
>>
>> On Fri, Oct 15, 2010 at 10:45 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:
>> As a background to this post, I want to reference a post by Bob called "SubclassOrNot". I discovered this page on an early foray into the TDWG website labyrinth and it has been very influential on my thinking since then. The idea Bob discusses is central to what I'm writing below so if you haven't read it you might want to do so first. You can probably skip the "OWL Inference" section and still get the point which is described in the first two sections of his post. The URL for the page is http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot .
>>
>> To preface what I'm going to say below, I want to put Darwin Core Occurrences in the context of what Bob wrote. In my mind, one of the hallmarks of the Darwin Core standard and one thing that makes it a great improvement over previous versions is that the decision was made to use what Bob called the "has a" approach rather than the "is a" approach. In particular, the Darwin Core standard has a single class called dwc:Occurrence rather than subclasses called "Specimen", "Observation", and other possible things. The way that we differentiate among different kinds of Occurrences is by using the DwC types which are the controlled values for the term dwc:basisOfRecord. Thus we say an Occurrence "has a" basisOfRecord=PreservedSpecimen rather than saying it "is a" PreservedSpecimen.
>> We say an Occurrence "has a" basisOfRecord=HumanObservation rather than saying it "is a" HumanObservation. This approach has greatly reduced the number of different terms in the standard since we don't have to have separate "ObservedBy" and "CollectedBy" terms, but rather can just have a single "RecordedBy" term that applies to both specimens and observations. The same thing applies to many other things, like eventDate rather than DateCollected and DateObserved, locality rather than collectionLocality and observationLocality, etc. With the ratification of Darwin Core, this decision is now a fait accompli and not a subject of discussion or something optional for users of the standard. It also seems to be clear that, as necessary, new terms can be added to the DwC types which would then be valid controlled values for basisOfRecord.
>>
>> Since the adoption of the DwC standard, the approach to Occurrences has been what I would describe as "I know an Occurrence when I see one". I consider this a pretty sloppy practice and, as I indicated in my post last night, I think there is enough consensus about what an Occurrence is that we can come up with a better definition than "an occurrence is the category of information pertaining to evidence of an occurrence...". Another part of what I would characterize as sloppiness is the lack of a clear definition of what exactly basisOfRecord means. When I wrote my attempt at summarizing consensus last night, I dodged the question about what I called the "token". This "thing" has been called various names. In the previous discussion on the list, it was sometimes called "the evidence" of the occurrence. In the past I have called it "a representation" - however, I now think the term "token" is better because "representation" has a different technical meaning in the context of content negotiation.
>> When we type an Occurrence by saying it has a basisOfRecord=PreservedSpecimen, we are saying that this Occurrence has as supporting evidence, or as a "token" if you prefer, all or part of the dead remains of the organism (i.e. what I'm calling "the Individual") that was being documented by the Occurrence. When we type an Occurrence by saying it has a basisOfRecord=LivingSpecimen, we are saying that this Occurrence has as a "token" the entire organism that was being documented (or some vegetative part of the live organism that was propagated). When we type an Occurrence by saying it has a basisOfRecord=HumanObservation, we are saying that the Occurrence has no supporting evidence other than the reputation of the observer to accurately record the metadata about the Occurrence. In other words, we "tag" an instance of a core class (to use Bob's words), Occurrence, by telling a metadata consumer what kind of token we are using as evidence of the Occurrence.
>>
>> A fundamental part of creating a clear definition of what an Occurrence is, is to define exactly what we are including in the concept of Occurrence. One possibility is to (1) say that the two boxes at the right side of the diagram at http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif are fused and that both the Occurrence metadata and its associated token are what we consider to be "the Occurrence". Another approach (2) would be to say that the actual Occurrence as an entity is only the metadata part and that the token is a separate thing. A third approach is to say (3) that everything within the blue dotted lines is considered a part of the Occurrence (i.e. the metadata, the token, the event, and the locality). I don't think in an absolute sense, any one of these approaches is "right". The problem is that these approaches are used inconsistently, sometimes even by the same person, depending on the basisOfRecord.
Differences in ways of thinking about this issue is a part of why people aren't understanding the way other people are approaching the structuring of metadata. I have tried to consistently take the approach (1) that the two boxes on the right are fused, i.e. that the Occurrence metadata and the token should both be considered part of the entity that we call "an Occurrence". I think this is why Rich was confused in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001666.html when I said that it was "wrong" to assert that a scientific name is a property of an Occurrence - obviously it is silly to say that the token (photons on a film, sound patterns in a digital file) has a scientific name. Yet that is exactly what people do routinely when the token is a branch cut off a tree and glued to a piece of paper. They say that they are "identifying a specimen". What I am asking (actually demanding) is that the TDWG community get its act together and come to some consistency on this. If we are going to take the approach (2), then we need to take specimens off their pedestal and treat them like we do any other token that we are using as evidence that an Occurrence happened. If we are going to do what was suggested for the BioBlitz in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001603.html, i.e. to call Occurrences "observations" and then link the tokens to them by associatedMedia, ResourceRelationship, or some other means (approach 2) then do it consistently for every kind of token, including specimens, and don't single out media tokens for punishment. >> I have in a sense "thrown down the gauntlet" on this issue by proposing that DigitalStillImage be added as a DwC type and as a controlled value for basisOfRecord (http://code.google.com/p/darwincore/issues/detail?id=68). I know what some people are going to say in response to this proposal. 
"Why do you need to have 'DigitalStillImage' as a value for basisOfRecord when you can just say that the resource's dcterms:type=StillImage?" The answer goes back to Bob's point. If we are going to go the "has a" path (which we already have in DwC for Occurrences) rather than subclassing everything, then we need to provide an appropriate value for the "tag" for any type of resource that a reasonable number of users will want to use as a token. I think it is clear from this and other Bioblitzes, my work in Bioimages, the whale tracking project, and many other examples, that there are plenty of people who are already using DigitalStillImages as tokens and we all need a controlled value to use for basisOfRecord. >> The other thing that we accomplish when we type an Occurrence by its basisOfRecord is to tell a consumer what kind of metadata to expect to get about the token in addition to the generic metadata that is provided for all Occurrences. Thus for a LivingSpecimen we expect to be told what zoo, botanical garden, bacterial collection, etc. contains the specimen. For a PreservedSpecimen we expect to be told the preparation type, the location of the repository, etc. For a DigitalStillImage we expect to be told the file type, accessURL, etc. Simply providing a value for dcterms:type=StillImage doesn't indicate whether the image is a physical one (i.e. on film) or a digital one. It is also unreasonable to expect a client to have to be checking two different terms (basisOfRecord and dcterms:type) to find out what they could learn from one (basisOfRecord). Of course it would be advisable to provide a value for dcterms:type as well for clients outside the biodiversity community who may not "understand" what basisOfRecord means. >> I hate to keep bringing my posts back to the RDF issue, but thinking about how one would write RDF forces clear thinking about how metadata should be structured. 
>> If we intend to separate tokens as entities from their associated Occurrence metadata, i.e. approach (2), then we open up a whole other can of worms. To associate the occurrence resources (i.e. the metadata) with the "different" resource (i.e. the token), we will probably have to be able to create URIs for the tokens and separate RDF metadata blocks which will have to be rdf:type'd. What are we going to use for that rdf:type - create another Darwin Core class? I simply don't think that is a complicated road that we want to travel. It would be far easier to just say that every Occurrence has a one-to-one relationship with its token (which could be "the empty set" for observations). This would not work for people who want to hang multiple tokens on a single observation event, but I think that itself is a bad idea because it makes it even harder to have "flat" occurrence datasets. Just say that every time we collect a different token (or make an observation that has no token), it is a new Occurrence record. Realistically, a single collector can't actually take a picture of a plant at the same time he or she collects it for a specimen anyway. Those really should be considered two different events because they happen at different times.
>>
>> OK, enough said. Consider this my defense of my proposal "issue 68" to add DigitalStillImage. I would urge the powers that be to respond to the issues that I've raised here before having any kind of "vote" (or whatever is ultimately going to happen when there is an up or down decision about the proposal).
>>
>> Steve
>>
>> Steve Baskauf wrote:
>> After the flurry of emails recently, I had an opportunity to carefully read all the way through the threads again, followed by enforced "think time" during my long commute. I was actually pretty cheerful after that because I think that in essence, most of the conversation about what constitutes an Occurrence really boils down to the same thing.
So I >> have sat down and tried to summarize what seems to me to be a consensus >> about Occurrences. To follow my points, please refer to the diagram at: >> http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif >> >> Consensus on relationships >> 1. The fundamental definition of an Occurrence involves evidence that a >> representative of a taxon occurred at a place and time. >> Note 1.A: For clarity, I have modified John's statement in his last >> email by replacing "taxon" with "representative of a taxon". I'm >> considering a taxon to be an abstract concept that is applied to >> individuals or groups of organisms. >> Note 1.B. This definition is far more useful than the official >> definition of the class Occurrence "The category of information >> pertaining to evidence of an occurrence..." which is essentially circular. >> Note 1.C: This statement is extremely broad because the evidence could >> be of many sorts, the representative could range from a single >> individual to all organisms on the earth, the taxon could be anyone's >> definition at any taxonomic level, the place could range from a GPS >> point with uncertainty of less than 10 meters to the entire planet >> earth, and the time could range from a shutter click of less than one >> second to 3.4 billion years. >> 2. The diagram is an attempt to summarize in pictorial form statements >> and relationships that have been described in the thread. The taxon >> representative is recorded as existing at a particular time and place >> (the arrow) and the result is an Occurrence record. That Occurrence >> record exists as metadata which may be associated with a token that can >> be used to voucher the fact that the taxon representative existed. 
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base http://www.taxonconcept.org/ / GeoSpecies Knowledge Base http://lod.geospecies.org/ About the GeoSpecies Knowledge Base http://about.geospecies.org/
It's a property of Taxon if you have a fully normalised model. But as a shortcut, together with scientificName, it might be right in Identification, I suppose?
Markus
On Oct 18, 2010, at 21:38, Steve Baskauf wrote:
So are we saying that dwc:nameAccordingTo can be a property of a dwc:Identification? What's dwc:identificationReferences for? I'm sorry if this is a dumb question, but I can plead ignorance on this topic. Steve
Peter DeVries wrote:
Hi Markus,
I feel your pain. :-)
Maybe an example might help clarify this.
I use the key* listed below to id my mosquitoes.
So I should mark up my RDF for the identification with something like:
<dwcterms:nameAccordingTo>Identification And Geographical Distribution Of The Mosquitoes: Of North America, North Of Mexico By Richard F., Jr. Darsie et al. 2004</dwcterms:nameAccordingTo>
Rather than use some other term like "dwc:identificationReferences"
Correct?
- Pete
Identification And Geographical Distribution Of The Mosquitoes: Of North America, North Of Mexico By Richard F., Jr. Darsie, RONALD A. WARD, Chien C. Chang, Taina Litwak University Press of Florida, 2004 ISBN: 0813027845 Cite: 13463
=================================================================
On Mon, Oct 18, 2010 at 12:19 PM, "Markus Döring (GBIF)" mdoering@gbif.org wrote: I am sorry I don't have the time to follow this extensive thread, but I can manage at least the first paragraphs ;) A quick comment on tying identification sources to a scientific name. As for other taxon concepts, this is usually done with the sec/sensu reference, which should be recorded as dwc:nameAccordingTo:
http://rs.tdwg.org/dwc/terms/index.htm#nameAccordingTo
I am slightly irritated that we seem to have some term duplicates for this use case. Maybe dwc:identificationReferences is supposed to only list additional references?
Markus
On Oct 18, 2010, at 18:49, Steve Baskauf wrote:
I've fallen behind on systematically perusing the list responses, but I would like to focus in on a point that seems to be a consensus in the responses that have shown up recently. The consensus seems to be that documenting determinations (a.k.a. instances of the dwc:Identification class) that are applied to Individuals (or Occurrences if you don't believe in Individuals) is the way to go. So in my usual graphical way of thinking about this, I would draw a "relationship line" from the determination to the Individual (or Occurrence) on one side and from the determination to the species concept on the other. I will leave it up to the taxonomy people to decide what different things would be connected to the species concept and how all of their lines would be connected. The determination would have any of the properties that are terms listed in the dwc:Identification class (identifiedBy, dateIdentified, identificationReferences, identificationRemarks, identificationQualifier, and typeStatus). Some properties like dateIdentified and identificationReferences would be string literals and others (especially identifiedBy) should probably be GUIDs but could be literals if they had to be.
That all seems pretty clear. However, when I've started trying to do this in real life, I immediately have questions. Take a look at http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf which should show up as a web page in your browser.
- The original label identifies the species as Juncus diffusissimus. However, there is no indicator as to who originally identified it or when. My assumption is that it was the collector (Glen N. Montz) but I don't really know that. Do I assume that, or list the original determiner as "unknown"?
- Do we draw a distinction between the initial identification and subsequent annotations? I think the answer should be "no" and that's why I refer to both generically as "determinations".
- There is really no indication given on the annotation labels as to many of the things that we would like to know, such as the concept they had in mind, any source they used (if any), or the reason why they did the annotation. So how does one connect the name that they applied to the determination when there is no indication of the concept? Is this just something we can't do for old annotations and just something that we try to do from this point forward?
- The last question is one that I really want some opinions about. It seems to me that there are a number of reasons why one would apply a determination. One would be to correct an actual error in identification. One would be to increase the precision of a previous determination (e.g. an insect identified to family now is identified to species). One would be to assert a difference in opinion as to the correct way to group this individual with others (i.e. as in a taxonomic revision). Finally, a single determiner might apply several determinations to one individual and indicate in each determination the concept intended (i.e. if you subscribe to Cronquist, you'd call it X; if you like Radford's book, you'd call it Y; if you like Weakley's treatment, you'd call it Z). Some of these four reasons may be functionally equivalent, but how would you use Darwin Core to indicate the reason why you applied the determination? Please don't say "identificationRemarks"! From a machine-processing standpoint, this is something we should know and there should be some kind of controlled vocabulary to express it. For instance, if an identification is "deprecated" because it was in error (perhaps by the determiner him/herself), one would like the incorrect determination to show up in the historical metadata, but I wouldn't want it to be listed in a website index. The same would hold true if an annotator was able to pin the taxon down to a lower taxonomic level than the original identifier. If someone goes to the trouble to connect an Individual/Occurrence to several names under alternative concepts, there should be a way that a machine would know this so that a software user could select the concept they wanted to use and the name under that concept would pop up.
I don't really see any term under the current DwC that could be used to do this last thing. Am I missing something? Do we need several terms to explain the reason why we made the determination because the reasons fall into different categories?
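As a rough illustration of the question above, here is a minimal Python sketch of what a machine-readable "reason" on a determination might allow. Everything in it is an assumption made for illustration: the DeterminationReason values, the reason and deprecated fields, and the two helper functions are hypothetical and are not Darwin Core terms. The names X, Y, Z and the sources Cronquist, Radford, and Weakley are the placeholders used in the text; "W", the determiner label, and the dates are invented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class DeterminationReason(Enum):
    # Hypothetical controlled values; none of these exist in Darwin Core.
    ERROR_CORRECTION = "errorCorrection"
    INCREASED_PRECISION = "increasedPrecision"
    TAXONOMIC_OPINION = "taxonomicOpinion"
    ALTERNATIVE_CONCEPT = "alternativeConcept"


@dataclass
class Determination:
    scientific_name: str
    identified_by: str
    date_identified: str                           # ISO 8601 date string
    name_according_to: Optional[str] = None        # source of the concept, if known
    reason: Optional[DeterminationReason] = None   # hypothetical property
    deprecated: bool = False                       # hypothetical flag


def names_for_index(dets: List[Determination]) -> List[str]:
    """Names to show in a website index: every non-deprecated determination."""
    return [d.scientific_name for d in dets if not d.deprecated]


def name_under_concept(dets: List[Determination], source: str) -> Optional[str]:
    """Name applied under the concept source the user selected, if any."""
    for d in dets:
        if not d.deprecated and d.name_according_to == source:
            return d.scientific_name
    return None


# One individual with an erroneous early determination and three
# alternative-concept determinations (all values are placeholders).
dets = [
    Determination("W", "Determiner A", "2001-05-01", deprecated=True),
    Determination("X", "Determiner B", "2010-10-18", "Cronquist",
                  DeterminationReason.ALTERNATIVE_CONCEPT),
    Determination("Y", "Determiner B", "2010-10-18", "Radford",
                  DeterminationReason.ALTERNATIVE_CONCEPT),
    Determination("Z", "Determiner B", "2010-10-18", "Weakley",
                  DeterminationReason.ALTERNATIVE_CONCEPT),
]

print(names_for_index(dets))
print(name_under_concept(dets, "Radford"))
```

The deprecated determination stays in the list as historical metadata but is left out of the index, which is the behavior asked for above.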
The other comment that I'll throw out (since this is going out to the bioblitz list as well as to tdwg-content) is that those of you who are building apps to collect metadata in the field really need to separate the process of entering (or acquiring) the collection metadata from the determination process. In at least some apps, the user immediately has to commit to a taxon as they enter the data at the time of collection. It seems to me that it would be a very common situation (especially in the case of "citizen science") that the collector/observer/photographer would have no idea what the taxonomic identity was at the time of collection. The process of determination (and the recording of the various dwc:Identification class terms) is really a separate process that should be able to happen at the time of collection OR later.
Steve
Peter DeVries wrote:
Hi Steve,
I would hypothesize that for the vast majority of identified records the process is something like this:
An individual uses some sort of key to determine what species (taxon concept) to assign to a given individual
- They may have created some sort of mental key in which once they recognize one individual mosquito they can then pretty quickly sort a number of individuals into collections.
The actual name they assign to the specimen is usually based on what their key says the name is. Often this does not specify the authorship. Most of these human identifiers have not read the original species descriptions for the species they are identifying. So the specimen is actually tied to a concept that is based more on the "key" than the original description.
- An exception would be where there is a key in the original description and that was what was used.
So in a sense, the process of modeling this as if the identifier actually asserted that the concept was the same as that described by the original description or a subsequent revision is "fudging".
Side effects of this process include:
- A new key for North American Mosquitoes comes out that incorporates recent changes in nomenclature. The major change being the elevation of a subgenus to a genus. For most of the species described the "key concept" is unchanged.
Student identifier Bob, in state X, is using the latest key, while student identifier Joe, in state Z, is using a slightly older edition of the same key.
Bob identifies the species as Ochlerotatus triseriatus, while Joe identifies what should be the same species as Aedes triseriatus.
These show up in GBIF on two different maps, they show up in the EOL as two different pages.
Various TDWG'ers continue to argue that the original description and subsequent revisions were really important in determining what these individuals actually meant when they assigned a name to a specimen, and that this is how we should model it in excruciating detail.
I would argue this should be modeled as best as possible to what actually happens.
For example, how many of the species observed in the recent BioBlitz were identified by referring to the original species description or subsequent revisions?
In your diagram, I would suggest that you show that a taxon concept may have many names associated with it. Since it is not clear what the identifier intended by his or her choice of a name, it is often difficult to determine what taxon concept they actually meant.
This is why I advocate a move to a more taxon concept based identifier to link these data sets together, because this allows the intent of the identifier to be more accurately modeled.
This would be done in the form of:
"I assert that this specimen (of what I call Aedes triseriatus) was observed here. I also assert that it is an instance of this species concept => URI"
Or "I assert that this is an individual of the type 'Individual of species concept X' => URI"
All of these are instances of the class "Individual"
So the resulting DarwinCore record would contain both the name and an optional, but I think needed, asserted species concept.
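The Bob and Joe case above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "speciesConcept" key and the example.org URI are hypothetical stand-ins for whatever term and identifier would carry the asserted concept, and the records are plain dictionaries rather than real Darwin Core records.

```python
from collections import defaultdict

# Hypothetical concept identifier; not a real TaxonConcept or GeoSpecies URI.
CONCEPT_URI = "http://example.org/concept/triseriatus"

# Two records of what should be the same species, named from different
# editions of the same key.
records = [
    {"recordedBy": "Bob",
     "scientificName": "Ochlerotatus triseriatus",
     "speciesConcept": CONCEPT_URI},
    {"recordedBy": "Joe",
     "scientificName": "Aedes triseriatus",
     "speciesConcept": CONCEPT_URI},
]


def group_by(recs, key):
    """Group the recorders of each record by the value stored under `key`."""
    groups = defaultdict(list)
    for rec in recs:
        groups[rec[key]].append(rec["recordedBy"])
    return dict(groups)


by_name = group_by(records, "scientificName")     # two groups, one per name
by_concept = group_by(records, "speciesConcept")  # one group for the shared concept

print(len(by_name), len(by_concept))
```

Grouping on the name string splits the two records apart, which is the "two different maps" effect; grouping on the asserted concept keeps them together.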
The species concept is a subclass of taxon concept, but is fundamentally different than the higher clades.
There are some guidelines as to what an entity needs to be considered a species.
There are, however, no real guidelines as to what clades should be considered genera and what clades should be considered families, etc.
Assigning properties at the level of genus or family is also problematic because it assumes that there will be inferencing, and it will require rechecking that those properties are still valid if the species within that genus change.
So if there is some property that is common to all the species in the genus, make that a property of each of the individual species - not a property of the genus.
Respectfully,
- Pete
On Fri, Oct 15, 2010 at 10:45 AM, Steve Baskauf steve.baskauf@vanderbilt.edu wrote: As a background to this post, I want to reference a post by Bob called "SubclassOrNot". I discovered this page on an early foray into the TDWG website labyrinth and it has been very influential on my thinking since then. The idea Bob discusses is central to what I'm writing below so if you haven't read it you might want to do so first. You can probably skip the "OWL Inference" section and still get the point which is described in the first two sections of his post. The URL for the page is http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot .
To preface what I'm going to say below, I want to put Darwin Core Occurrences in the context of what Bob wrote. In my mind, one of the hallmarks of the Darwin Core standard and one thing that makes it a great improvement over previous versions is that the decision was made to use what Bob called the "has a" approach rather than the "is a" approach. In particular, the Darwin Core standard has a single class called dwc:Occurrence rather than subclasses called "Specimen", "Observation", and other possible things. The way that we differentiate among different kinds of Occurrences is by using the DwC types, which are the controlled values for the term dwc:basisOfRecord. Thus we say an Occurrence "has a" basisOfRecord=PreservedSpecimen rather than saying it "is a" PreservedSpecimen. We say an Occurrence "has a" basisOfRecord=HumanObservation rather than saying it "is a" HumanObservation. This approach has greatly reduced the number of different terms in the standard, since we don't have to have separate "ObservedBy" and "CollectedBy" terms, but rather can just have a single "RecordedBy" term that applies to both specimens and observations. The same thing applies to many other things, like eventDate rather than DateCollected and DateObserved, locality rather than collectionLocality and observationLocality, etc. With the ratification of Darwin Core, this decision is now a fait accompli and not a subject of discussion or something optional for users of the standard. It also seems to be clear that, as necessary, new terms can be added to the DwC types, which would then be valid controlled values for basisOfRecord.
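A minimal Python sketch of the "has a" approach may make the contrast concrete. It is an illustration only: the Occurrence dataclass, its field names, and the validation step are assumptions, and the set of values is limited to the three basisOfRecord values named in this thread rather than the full DwC type vocabulary. The collector, dates, and localities are placeholders.

```python
from dataclasses import dataclass

# Only the basisOfRecord values mentioned in this thread; the real
# vocabulary is larger and can be extended with new DwC types.
BASIS_OF_RECORD = {"PreservedSpecimen", "LivingSpecimen", "HumanObservation"}


@dataclass
class Occurrence:
    """One class for every kind of Occurrence, tagged by basis_of_record."""
    basis_of_record: str
    recorded_by: str   # replaces separate "CollectedBy" / "ObservedBy" terms
    event_date: str    # replaces separate DateCollected / DateObserved terms
    locality: str

    def __post_init__(self):
        # Reject values that are not in the controlled vocabulary.
        if self.basis_of_record not in BASIS_OF_RECORD:
            raise ValueError(f"unknown basisOfRecord: {self.basis_of_record}")


# A specimen and an observation share the same class and the same terms.
specimen = Occurrence("PreservedSpecimen", "Collector A", "2010-10-15", "Locality 1")
sighting = Occurrence("HumanObservation", "Observer B", "2010-10-16", "Locality 2")

print(specimen.basis_of_record, sighting.basis_of_record)
```

Adding a new kind of Occurrence under this design means adding one value to the controlled set, not a new subclass with its own parallel terms.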
Since the adoption of the DwC standard, the approach to Occurrences has been what I would describe as "I know an Occurrence when I see one". I consider this a pretty sloppy practice and, as I indicated in my post last night, I think there is enough consensus about what an Occurrence is that we can come up with a better definition than "an occurrence is the category of information pertaining to evidence of an occurrence...". Another part of what I would characterize as sloppiness is the lack of a clear definition of what exactly basisOfRecord means. When I wrote my attempt at summarizing consensus last night, I dodged the question about what I called the "token". This "thing" has been called various names. In the previous discussion on the list, it was sometimes called "the evidence" of the occurrence. In the past I have called it "a representation" - however, I now think the term "token" is better because "representation" has a different technical meaning in the context of content negotiation. When we type an Occurrence by saying it has a basisOfRecord=PreservedSpecimen, we are saying that this Occurrence has as supporting evidence, or as a "token" if you prefer, all or part of the dead remains of the organism (i.e. what I'm calling "the Individual") that was being documented by the Occurrence. When we type an Occurrence by saying it has a basisOfRecord=LivingSpecimen, we are saying that this Occurrence has as a "token" the entire organism that was being documented (or some vegetative part of the live organism that was propagated). When we type an Occurrence by saying it has a basisOfRecord=HumanObservation, we are saying that the Occurrence has no supporting evidence other than the reputation of the observer to accurately record the metadata about the Occurrence. In other words, we "tag" an instance of a core class (to use Bob's words), Occurrence, by telling a metadata consumer what kind of token we are using as evidence of the Occurrence.
A fundamental part of creating a clear definition of what an Occurrence is, is to define exactly what we are including in the concept of Occurrence. One possibility is to (1) say that the two boxes at the right side of the diagram at http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif are fused and that both the Occurrence metadata and its associated token are what we consider to be "the Occurrence". Another approach (2) would be to say that the actual Occurrence as an entity is only the metadata part and that the token is a separate thing. A third approach is to say (3) that everything within the blue dotted lines is considered a part of the Occurrence (i.e. the metadata, the token, the event, and the locality). I don't think that, in an absolute sense, any one of these approaches is "right". The problem is that these approaches are used inconsistently, sometimes even by the same person, depending on the basisOfRecord. Differences in ways of thinking about this issue are a part of why people aren't understanding the way other people are approaching the structuring of metadata. I have tried to consistently take the approach (1) that the two boxes on the right are fused, i.e. that the Occurrence metadata and the token should both be considered part of the entity that we call "an Occurrence". I think this is why Rich was confused in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001666.html when I said that it was "wrong" to assert that a scientific name is a property of an Occurrence - obviously it is silly to say that the token (photons on a film, sound patterns in a digital file) has a scientific name. Yet that is exactly what people do routinely when the token is a branch cut off a tree and glued to a piece of paper. They say that they are "identifying a specimen". What I am asking (actually demanding) is that the TDWG community get its act together and come to some consistency on this.
If we are going to take the approach (2), then we need to take specimens off their pedestal and treat them like we do any other token that we are using as evidence that an Occurrence happened. If we are going to do what was suggested for the BioBlitz in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001603.html, i.e. to call Occurrences "observations" and then link the tokens to them by associatedMedia, ResourceRelationship, or some other means (approach 2) then do it consistently for every kind of token, including specimens, and don't single out media tokens for punishment. I have in a sense "thrown down the gauntlet" on this issue by proposing that DigitalStillImage be added as a DwC type and as a controlled value for basisOfRecord (http://code.google.com/p/darwincore/issues/detail?id=68). I know what some people are going to say in response to this proposal. "Why do you need to have 'DigitalStillImage' as a value for basisOfRecord when you can just say that the resource's dcterms:type=StillImage?" The answer goes back to Bob's point. If we are going to go the "has a" path (which we already have in DwC for Occurrences) rather than subclassing everything, then we need to provide an appropriate value for the "tag" for any type of resource that a reasonable number of users will want to use as a token. I think it is clear from this and other Bioblitzes, my work in Bioimages, the whale tracking project, and many other examples, that there are plenty of people who are already using DigitalStillImages as tokens and we all need a controlled value to use for basisOfRecord. The other thing that we accomplish when we type an Occurrence by its basisOfRecord is to tell a consumer what kind of metadata to expect to get about the token in addition to the generic metadata that is provided for all Occurrences. Thus for a LivingSpecimen we expect to be told what zoo, botanical garden, bacterial collection, etc. contains the specimen. 
For a PreservedSpecimen we expect to be told the preparation type, the location of the repository, etc. For a DigitalStillImage we expect to be told the file type, accessURL, etc. Simply providing a value for dcterms:type=StillImage doesn't indicate whether the image is a physical one (i.e. on film) or a digital one. It is also unreasonable to expect a client to have to be checking two different terms (basisOfRecord and dcterms:type) to find out what they could learn from one (basisOfRecord). Of course it would be advisable to provide a value for dcterms:type as well for clients outside the biodiversity community who may not "understand" what basisOfRecord means. I hate to keep bringing my posts back to the RDF issue, but thinking about how one would write RDF forces clear thinking about how metadata should be structured. If we intend to separate tokens as entities from their associated Occurrence metadata, i.e. approach (2), then we open up a whole other can of worms. To associate the occurrence resources (i.e. the metadata) with the "different" resource (i.e. the token), we will probably have to be able to create URIs for the tokens and separate RDF metadata blocks which will have to be rdf:type'd. What are we going to use for that rdf:type - create another Darwin Core class? I simply don't think that is a complicated road that we want to travel. It would be far easier to just say that every Occurrence has a one-to-one relationship with its token (which could be "the empty set" for observations). This would not work for people who want to hang multiple tokens on a single observation event, but I think that itself is a bad idea because it makes it even harder to have "flat" occurrence datasets. Just say that every time we collect a different token (or make an observation that has no token), it is a new Occurrence record. Realistically, a single collector can't actually take a picture of a plant at the same time he or she collects it for a specimen anyway.
Those really should be considered two different events because they happen at different times.
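The one-token-per-Occurrence rule described above can be sketched as a small Python function. This is a sketch under stated assumptions, not a Darwin Core implementation: the function name, the dictionary layout, and the sample identifier and times are invented for illustration, and DigitalStillImage is the value proposed in issue 68, not an accepted basisOfRecord value.

```python
def occurrence_records(individual_id, locality, tokens, observed_at=None):
    """Return one flat Occurrence record per token.

    `tokens` is a list of (basisOfRecord, eventDate) pairs. An empty list
    means there is no token, so a single HumanObservation record is made.
    """
    if not tokens:
        tokens = [("HumanObservation", observed_at)]
    return [
        {
            "individualID": individual_id,
            "basisOfRecord": basis,
            "eventDate": when,
            "locality": locality,
        }
        for basis, when in tokens
    ]


# A plant that was photographed and then collected a few minutes later:
# two tokens, two times, therefore two Occurrence records.
plant = occurrence_records(
    "individual-1",                                   # placeholder identifier
    "Locality 1",
    [("DigitalStillImage", "2010-10-15T10:00:00"),    # proposed value (issue 68)
     ("PreservedSpecimen", "2010-10-15T10:05:00")],
)

# An observation with no token yields exactly one record.
sighting = occurrence_records("individual-2", "Locality 2", [],
                              observed_at="2010-10-16T08:30:00")

print(len(plant), len(sighting))
```

Because each record carries its own eventDate and a single basisOfRecord, the result stays "flat" and no separate URI or class is needed for the token.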
OK, enough said. Consider this my defense of my proposal "issue 68" to add DigitalStillImage. I would urge the powers that be to respond to the issues that I've raised here before having any kind of "vote" (or whatever is ultimately going to happen when there is an up or down decision about the proposal).
Steve
Steve Baskauf wrote: After the flurry of emails recently, I had an opportunity to carefully read all the way through the threads again, followed by enforced "think time" during my long commute. I was actually pretty cheerful after that because I think that in essence, most of the conversation about what constitutes an Occurrence really boils down to the same thing. So I have sat down and tried to summarize what seems to me to be a consensus about Occurrences. To follow my points, please refer to the diagram at: http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif
Consensus on relationships
1. The fundamental definition of an Occurrence involves evidence that a representative of a taxon occurred at a place and time.
Note 1.A: For clarity, I have modified John's statement in his last email by replacing "taxon" with "representative of a taxon". I'm considering a taxon to be an abstract concept that is applied to individuals or groups of organisms.
Note 1.B: This definition is far more useful than the official definition of the class Occurrence "The category of information pertaining to evidence of an occurrence..." which is essentially circular.
Note 1.C: This statement is extremely broad because the evidence could be of many sorts, the representative could range from a single individual to all organisms on the earth, the taxon could be anyone's definition at any taxonomic level, the place could range from a GPS point with uncertainty of less than 10 meters to the entire planet earth, and the time could range from a shutter click of less than one second to 3.4 billion years.
2. The diagram is an attempt to summarize in pictorial form statements and relationships that have been described in the thread. The taxon representative is recorded as existing at a particular time and place (the arrow) and the result is an Occurrence record. That Occurrence record exists as metadata which may be associated with a token that can be used to voucher the fact that the taxon representative existed. That token may be the organism itself (or a living part of it as in a twig for grafting), all or part of the organism in preserved form, an electronic representation such as an image or sound recording, and other kinds of things like tissue or DNA samples. There may also be no token at all, in which case we call the Occurrence record an observation. Based on direct observation of the taxon representative, examination of one or more tokens, or both, some determiner asserts that a taxon concept applies to the taxon representative and as a result a scientific name can be used to "identify" the taxon representative. (There may be a lot of other complicated stuff above the Identification box, but that will have to be filled in by the taxonomists.)
Note 2.A: I have mapped onto this diagram the letters that John used in his last email to refer to entities that are involved in an Occurrence (T, E, L, O, and G). I will beg the forgiveness of fossil people because I don't really know how the geological context fits in. I'm assuming that it is a way of asserting time and location on a much broader scale than we do for extant organisms.
Note 2.B: I have put a dotted line around the part of the diagram that I think includes all the things that people might consider part of the Occurrence itself. I have left out "T" and the other parts related to identification because it seems to me that you can have an occurrence that you document which does not yet (and perhaps never will) have an identification. The Occurrence still asserts that a taxon representative existed at a time and place; we just don't yet know what the taxon is.
3. The red lines indicate the relationships that connect the various entities (I'm going to go ahead and call them resources). Consistent with popular opinion, the Occurrence record is the center of the universe and most things are connected to it.
Note 3.A: I am sticking to my guns and refuse to connect the Identification directly to the Occurrence. It is the taxon representative that is being identified, not the occurrence. One can assert another sort of relationship between the identification and the occurrence if one wants to say that one consulted the occurrence metadata and token in order to decide about the identification, but it is not correct to say that the Identification identifies either the Occurrence metadata or the token (as Rich pointed out).
OK, so that's step one - defining what is related to what. If anyone disagrees with these relationships, please clarify or create your own diagram.
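As a rough illustration of step one, the relationships described above can be written down as subject-predicate-object triples. This is only a sketch under stated assumptions: the identifiers (`occ:1`, `ind:1`, and so on) and the predicate names (`documents`, `identifies`, `hasToken`, etc.) are invented for the example and are not Darwin Core terms.

```python
# Hypothetical triples sketching the diagram; identifiers and predicate
# names are illustrative assumptions, not Darwin Core terms.
TRIPLES = [
    ("occ:1", "documents", "ind:1"),    # Occurrence -> taxon representative
    ("occ:1", "atEvent", "evt:1"),      # Occurrence -> time
    ("occ:1", "atLocation", "loc:1"),   # Occurrence -> place
    ("occ:1", "hasToken", "tok:1"),     # Occurrence -> voucher (optional)
    ("det:1", "identifies", "ind:1"),   # Identification -> taxon representative
    ("det:1", "toTaxon", "tax:1"),      # Identification -> taxon concept
]


def objects_of(subject, predicate):
    """Return every object linked from `subject` by `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def is_observation(occurrence):
    """An Occurrence with no token is what the text calls an observation."""
    return not objects_of(occurrence, "hasToken")


def identified_things():
    """Everything that some Identification points at via `identifies`."""
    return {o for s, p, o in TRIPLES if p == "identifies"}
```

In this sketch the Identification points at the taxon representative and never at the Occurrence or the token, which is the point of Note 3.A.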
Complicating circumstances/caveats
1. It is noted and recognized that some users will not care to include all of these relationships in their models. In the interest of simplification or "flattening" the relationships, they may wish to collapse some parts of this diagram (e.g. incorporate time and location metadata within the Occurrence metadata rather than considering them separate resources, applying scientific names directly to the taxon representatives without defining a taxon concept or recording the determination metadata, connecting identifications directly to the occurrence, etc.). This doesn't mean that the relationships don't exist, it just means that some users don't care about them.
2. It is recognized that different users will be interested in or able to specify the various resources to differing degrees of precision. Examples: A photographer might record times to the nearest second, a collector may only be interested in noting the date on which a specimen was collected. A location may be specified to the precision of a GPS reading or be defined as some geographic or political subdivision. The taxon representative may be an individual organism, a flock or clump, or some larger aggregation of taxon representatives.
That's step two. If I've missed any complications, please point them out.
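The "flattening" described in the first caveat can be pictured with a small sketch. The class layout and sample values here are assumptions made for illustration; only the output keys `occurrenceID`, `eventDate` and `locality` are Darwin Core term names.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_id: str
    event_date: str


@dataclass
class Location:
    location_id: str
    locality: str


@dataclass
class Occurrence:
    occurrence_id: str
    event: Event        # separate resource in the full model
    location: Location  # separate resource in the full model


def flatten(occ):
    """Collapse the linked Event and Location into one flat record."""
    return {
        "occurrenceID": occ.occurrence_id,
        "eventDate": occ.event.event_date,
        "locality": occ.location.locality,
    }
```

The flat record drops the Event and Location identifiers, so the relationships still exist in the source data but are no longer visible to a consumer of the flat form.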
My opinions about the implications of this diagram
1. The circle I've labeled as "taxon representative" is the resource type that I'm proposing to be represented by the class Individual. You will note that in both the definition of dwc:individualID ("An identifier for an individual or named group of individual organisms...") and the proposed class definition ("The category of information pertaining to an individual or named group of individual organisms represented in an Occurrence"), groups of individual organisms are included. Thus John's example of a fossil having myriad individuals, or Richard's examples of thousands of plankton, a large school of fish, herd of wildebeest, flock of birds, could all be categorized as "Individual" under this definition if there is a reasonable expectation that all of the individuals in the group are members of the same taxon. Perhaps there is a better name for this resource, but since dwc:individualID was already extant, I chose Individual as the class name for consistency with the pattern established with other classes and their associated xxxxID terms.
2. Although in note 1.C. I have given the ranges of the various resources to their logical extreme (as was done previously in the thread), I think that as a practical matter we can adopt guidelines to set reasonable values for the "normal" ranges of the resources. One such guideline might be that we suggest a range that can accommodate about 95% of the user needs within the community (this came from Rich's comment about satisfying 95% of the user need with an establishmentMeans controlled vocabulary). For example, it was suggested that the range for the location of an Occurrence could span the entire planet Earth. True enough, but virtually nobody would find such a span useful. 95% of users would probably find a range between a GPS reading with 10 meter precision and the extent of a county or province useful for recording the location of an Occurrence. 
I can suggest similar "useful" ranges: one second to one day for an event time (excluding fossils), one individual organism to the number of organisms that would fit within a 50 meter radius for an "individual", and taxon identified to family for plants and maybe mammals, genus for birds, and order for insects. So framing the definition of an Occurrence in these terms it would be something like: "An occurrence involves evidence (consisting of a physical token, electronic record, or personal observation) that a representative (ranging from a single individual to the number that would fit on a football field) of a taxon (hopefully identified to some lower taxonomic level) occurred at a place (determined to a precision between that of a GPS reading and the size of a county/province) and time (spanning one second to one day)." A few people might object to this level of restrictiveness, but I would guess that it would make 95% of us happy. 3. With the exception of the "missing" class Individual, every resource type on this diagram except for the "token" and Scientific name has a Darwin Core class. Every resource type on the diagram except for "token" has a dwc:xxxxID term that can be used to refer to a GUID for the resource. The implication of this is that any resource on this diagram except for the token and taxon representative (i.e. Individual) is ready to be represented in RDF by Darwin Core terms in the sense that the relationships (red lines) can be represented by the xxxxID terms and that the resources can be rdfs:type'd using Darwin Core classes. (Lacking a class for the scientific name doesn't seem like a big deal to me since the scientific name can be a string literal - but then I'm not a taxonomist.) 4. OK, I've avoided it as long as I can, so I'm going to confess now to the RDF-phobes. The red lines and shapes are something pretty close to an RDF graph. 
What that means is that if the community can agree that this diagram correctly represents the relationships among the kinds of biodiversity resources that we care about, then the matter of providing guidelines on how to represent Darwin Core in RDF suddenly gets a lot simpler. Just convert the "picture" of the RDF graph into XML format and we have a template. Alright, that's an oversimplification, but I think it is essentially true because the most difficult part of achieving a consensus on RDF representations is to decide how we connect the resource types, not on the literals that we hang onto resources as properties. 5. While I'm beating the RDF drum again, the importance of my opinion number 2 can be extended into the GUID adoption process. In my comments to Kevin about the Beginner's Guide to Persistent Identifiers, I think I commented on the question of how one decides whether a GUID needs to be assigned to something or not. I believe that the answer to that question boils down to this: we need a GUID for any resource that will be referenced by more than one other resource. Do we need to be able to assign a GUID to Taxon concepts? Yes, because it is likely that many identifications will want to reference a particular taxon concept. Do we need to be able to assign a GUID to an Event? Maybe or maybe not. If every occurrence has its own separate time recorded, then no GUID is needed because the time is just a part of every separate occurrence record. If the event is defined to be a time range that represents a collecting trip, then there may be many Occurrences that are associated with that trip and all of them could reference the GUID for that event rather than repeating the event information for every Occurrence. 
The point here is that every shape (class of resources) on this diagram at least has the POTENTIAL to be a node connecting multiple resources and therefore should have the capability of being assigned a GUID, having its own RDF record, and being appropriately typed (presumably by a DwC class). So this is a final technical argument for why we need to have the DwC class Individual. Whether or not people ultimately choose to assign GUIDs to particular resource types or not is their own choice, but they need to at least be ABLE to if they need that resource to serve as a node given the structure of their metadata.
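The rule of thumb in opinion 5 (a resource needs a GUID when more than one other resource references it) can be sketched as a count of inbound references. The triples and identifiers below are hypothetical examples, not real data.

```python
from collections import Counter


def needs_guid(triples):
    """Return resources referenced by more than one distinct other resource."""
    # Count distinct (subject, object) pairs so that repeated statements
    # between the same two resources are not double-counted.
    pairs = {(s, o) for s, _p, o in triples}
    inbound = Counter(obj for _subj, obj in pairs)
    return {resource for resource, count in inbound.items() if count > 1}


# Hypothetical example: two Occurrences share a collecting-trip Event,
# a third has its own time, and two determinations share a taxon concept.
EXAMPLE = [
    ("occ:1", "atEvent", "evt:trip"),
    ("occ:2", "atEvent", "evt:trip"),
    ("occ:3", "atEvent", "evt:solo"),
    ("det:1", "toTaxon", "tax:1"),
    ("det:2", "toTaxon", "tax:1"),
]
```

Here `needs_guid(EXAMPLE)` picks out the shared Event and the shared taxon concept, while the Event used by a single Occurrence could simply be recorded inline.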
We need to clarify how the "token" thing fits in, but I'm stopping there for now. I would very much appreciate responses indicating that:
A. you agree with the diagram and connections (and consider this definition and diagram a consensus) B. you disagree with the diagram (and articulate why) C. you provide an alternative diagram or explanation of the relationships among the classes related to Occurrences.
Thanks for your patience with another tome. Steve
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content .
OK, I'll confess that I don't know what a fully normalised model is and a two-minute reading of Wikipedia didn't help me. I was just going by Pete's snippet of an RDF element. I'm assuming it was a part of something like:
<dwc:Identification rdf:about="http://www.example.org/determinations/12345">
  ...
  <dwcterms:nameAccordingTo>Identification And Geographical Distribution Of The Mosquitoes: Of North America, North Of Mexico By Richard F., Jr. Darsie et al. 2004</dwcterms:nameAccordingTo>
  ...
</dwc:Identification>
which would be making the statement [the determination] dwcterms:nameAccordingTo [that long literal string for the reference] in my shortcut RDF triple notation.
But it sounds like you are saying that one should say that the determination has a certain name and then that the name should have the property dwcterms:nameAccordingTo.
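The difference between the shortcut and the fully normalised reading can be sketched with plain triples. This is one interpretation of the exchange, not a settled model: the identifiers `det:12345` and `tax:1` and the linking predicate `ex:taxon` are hypothetical, and the citation is shortened for readability.

```python
REFERENCE = "Darsie et al. 2004"  # shortened stand-in for the long citation string

# Shortcut: the reference hangs directly on the determination.
shortcut = [
    ("det:12345", "dwcterms:nameAccordingTo", REFERENCE),
]

# Normalised: the determination points at a taxon, which carries the reference.
normalised = [
    ("det:12345", "ex:taxon", "tax:1"),  # ex:taxon is a made-up predicate
    ("tax:1", "dwcterms:nameAccordingTo", REFERENCE),
]


def reference_for(resource, triples):
    """Find nameAccordingTo on `resource`, or follow ex:taxon links to it."""
    direct = [o for s, p, o in triples
              if s == resource and p == "dwcterms:nameAccordingTo"]
    if direct:
        return direct[0]
    for s, p, o in triples:
        if s == resource and p == "ex:taxon":
            return reference_for(o, triples)
    return None
```

Both forms answer the question "according to whom?" for the determination; the normalised form just takes one more hop to get there. The lookup assumes the `ex:taxon` links contain no cycles.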
Maybe this is a question that should be deferred to a later discussion prior to a putative RDF guide. Steve
Markus Döring wrote:
It's a property of Taxon if you have a fully normalised model. But as a shortcut together with scientificName it might be right in Identification, I suppose?
Markus
On Oct 18, 2010, at 21:38, Steve Baskauf wrote:
So are we saying that dwc:nameAccordingTo can be a property of a dwc:Identification? What's dwc:identificationReferences for? I'm sorry if this is a dumb question but I can plead ignorance on this topic. Steve
Peter DeVries wrote:
Hi Markus,
I feel your pain. :-)
Maybe an example might help clarify this.
I use the key* listed below to id my mosquitoes.
So I should mark up my RDF for the identification with something like:
<dwcterms:nameAccordingTo>Identification And Geographical Distribution Of The Mosquitoes: Of North America, North Of Mexico By Richard F., Jr. Darsie et al. 2004</dwcterms:nameAccordingTo>
Rather than use some other term like "dwc:identificationReferences"
Correct?
- Pete
Identification And Geographical Distribution Of The Mosquitoes: Of North America, North Of Mexico By Richard F., Jr. Darsie, RONALD A. WARD, Chien C. Chang, Taina Litwak University Press of Florida, 2004 ISBN: 0813027845 Cite: 13463
=================================================================
On Mon, Oct 18, 2010 at 12:19 PM, "Markus Döring (GBIF)" mdoering@gbif.org wrote: I am sorry I don't have the time to follow this extensive thread, but I can manage at least the first paragraphs ;) A quick comment on tying identification sources to a scientific name. As for other taxon concepts this is usually done with the sec/sensu reference which should be recorded as dwc:nameAccordingTo:
http://rs.tdwg.org/dwc/terms/index.htm#nameAccordingTo
I am slightly irritated that we seem to have some term duplicates for this use case. Maybe dwc:identificationReferences is supposed to only list additional references?
Markus
On Oct 18, 2010, at 18:49, Steve Baskauf wrote:
I've fallen behind on systematically perusing the list responses, but I would like to focus in on a point that seems to be a consensus in the responses that have shown up recently. The consensus seems to be that documenting determinations (a.k.a. instances of dwc:Identification class) that are applied to Individuals (or Occurrences if you don't believe in Individuals) is the way to go. So in my usual graphical way of thinking about this, I would draw a "relationship line" from the determination to the Individual (or Occurrence) on one side and from the determination to the species concept on the other. I will leave it up to the taxonomy people to decide what different things would be connected to the species concept and how all of their lines would be connected. The determination would have any of the properties that are terms listed in the dwc:Identification class (identifiedBy, dateIdentified, identificationReferences, identificationRemarks, identificationQualifier, and typeStatus). Some properties like dateIdentified and identificationReferences would be string literals and others (especially identifiedBy) should probably be GUIDs but could be literals if they had to be.
That all seems pretty clear. However, when I've started trying to do this in real life, I immediately have questions. Take a look at http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf which should show up as a web page in your browser.
- The original label identifies the species as Juncus diffusissimus. However, there is no indicator as to who originally identified it or when. My assumption is that it was the collector (Glen N. Montz) but I don't really know that. Do I assume that, or list the original determiner as "unknown"?
- Do we draw a distinction between the initial identification and subsequent annotations? I think the answer should be "no" and that's why I refer to both generically as "determinations".
- There is really no indication given on the annotation labels as to many of the things that we would like to know, such as the concept they had in mind, any source they used (if any), or the reason why they did the annotation. So how does one connect the name that they applied to the determination when there is no indication of the concept? Is this just something we can't do for old annotations and just something that we try to do from this point forward?
- The last question is one that I really want some opinions about. It seems to me that there are a number of reasons why one would apply a determination. One would be to correct an actual error in identification. One would be to increase the precision of a previous determination (e.g. an insect identified to family now is identified to species). One would be to assert a difference in opinion as to the correct way to group this individual with others (i.e. as in a taxonomic revision). Finally, a single determiner might apply several determinations to one individual and indicate in each determination the concept intended (i.e. if you subscribe to Cronquist, you'd call it X; if you like Radford's book, you'd call it Y; if you like Weakley's treatment, you'd call it Z). Some of these four reasons may be functionally equivalent, but how would you use Darwin Core to indicate the reason why you applied the determination? Please don't say "identificationRemarks"! From a machine-processing standpoint, this is something we should know and there should be some kind of controlled vocabulary to express it. For instance if an identification is "deprecated" because it was in error (perhaps by the determiner him/herself), one would like the incorrect determination to show up in the historical metadata, but I wouldn't want it to be listed in a website index. The same would hold true if an annotator was able to pin the taxon down to a lower taxonomic level than the original identifier. If someone goes to the trouble to connect an Individual/Occurrence to several names under alternative concepts, there should be a way that a machine would know this so that a software user could select the concept they wanted to use and the name under that concept would pop up.
I don't really see any term under the current DwC that could be used to do this last thing. Am I missing something? Do we need several terms to explain the reason why we made the determination because the reasons fall into different categories?
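To make the request concrete, here is a minimal sketch of what a controlled vocabulary for the reason behind a determination might look like, and how software could use it. Everything in it is hypothetical: the vocabulary values, the `deprecated` flag and the placeholder names are assumptions for illustration, and none of them is an existing Darwin Core term.

```python
from dataclasses import dataclass
from enum import Enum


class DeterminationReason(Enum):
    # Hypothetical vocabulary mirroring the four reasons listed above.
    CORRECTION = "correction"            # fixes an earlier error
    REFINEMENT = "refinement"            # e.g. family narrowed to species
    REVISION = "revision"                # different opinion on grouping
    ALTERNATIVE_CONCEPT = "alternative"  # same determiner, another concept


@dataclass
class Determination:
    name: str
    reason: DeterminationReason
    deprecated: bool = False  # kept in the history, hidden from indexes


def names_for_index(determinations):
    """Names to list in a website index: skip deprecated determinations."""
    return [d.name for d in determinations if not d.deprecated]


def alternatives(determinations):
    """Names offered under alternative concepts, for a user to choose from."""
    return [d.name for d in determinations
            if d.reason is DeterminationReason.ALTERNATIVE_CONCEPT]
```

With something like this, an erroneous determination stays in the record but drops out of the index, and names applied under alternative concepts can be offered as a selectable list.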
The other comment that I'll throw out (since this is going out to the bioblitz list as well as to tdwg-content) is that those of you who are building apps to collect metadata in the field really need to separate the process of entering (or acquiring) the collection metadata from the determination process. In at least some apps, the user immediately has to commit to a taxon as they enter the data at the time of collection. It seems to me that it would be a very common situation (especially in the case of "citizen science") that the collector/observer/photographer would have no idea what the taxonomic identity was at the time of collection. The process of determination (and the recording of the various dwc:Identification class terms) is really a separate process that should be able to happen at the time of collection OR later.
Steve
Peter DeVries wrote:
Hi Steve,
I would hypothesize that for the vast majority of identified records the process is something like this:
An individual uses some sort of key to determine what species (taxon concept) to assign to a given individual
- They may have created some sort of mental key in which once they recognize one individual mosquito they can then pretty quickly sort a number of individuals into collections.
The actual name they assign to the specimen is usually based on what their key says the name is. Often this does not specify the authorship. Most of these human identifiers have not read the original species descriptions for the species they are identifying. So the specimen is actually tied to a concept that is based more on the "key" than the original description.
- An exception would be where there is a key in the original description and that was what was used.
So in a sense, modeling this as if the identifier actually asserted that the concept was the same as that described by the original description or a subsequent revision is "fudging".
Side effects of this process include:
- A new key for North American Mosquitoes comes out that incorporates recent changes in nomenclature. The major change being the elevation of a subgenus to a genus. For most of the species described the "key concept" is unchanged.
Student identifier, Bob, in state X is using the latest key, while student identifier, Joe, in state Z is using a slightly older edition of the same key.
Bob identifies the species as Ochlerotatus triseriatus, while Joe identifies what should be the same species as Aedes triseriatus.
These show up in GBIF on two different maps, they show up in the EOL as two different pages.
Various TDWG'ers continue to argue that the original description and subsequent revisions were really important in determining what these individuals actually meant when they assigned a name to a specimen, and that this is how we should model it in excruciating detail.
I would argue this should be modeled as best as possible to what actually happens.
For example, how many of the species observed in the recent BioBlitz were identified by referring to the original species description or subsequent revisions?
In your diagram, I would suggest that you show that a taxon concept may have many names associated with it. Since it is not clear what the identifier intended by his or her choice of a name, it is often difficult to determine what taxon concept they actually meant.
This is why I advocate a move to a more taxon concept based identifier to link these data sets together, because this allows the intent of the identifier to be more accurately modeled.
This would be done in the form of:
"I assert that this specimen (of what I call Aedes triseriatus) was observed here. I also assert that it is an instance of this species concept => URI"
Or I assert that this is an individual of the type "Individual of species concept X" = > URI
All of these are instances of the class "Individual"
So the resulting DarwinCore record would contain both the name and an optional, but I think needed, asserted species concept.
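The Bob and Joe example can be sketched to show why an asserted concept helps. The concept URI and the `concept` key are hypothetical; `identifiedBy` and `scientificName` are Darwin Core term names, and the two names come from the example above.

```python
from collections import defaultdict

# Hypothetical concept URI shared by both identifications.
CONCEPT = "http://example.org/concept/triseriatus"

records = [
    {"identifiedBy": "Bob", "scientificName": "Ochlerotatus triseriatus", "concept": CONCEPT},
    {"identifiedBy": "Joe", "scientificName": "Aedes triseriatus", "concept": CONCEPT},
]


def group_by(records, key):
    """Group records by the value stored under `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)
```

Grouping by `scientificName` splits the two records onto separate maps or pages, while grouping by the asserted concept keeps them together.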
The species concept is a subclass of taxon concept, but is fundamentally different than the higher clades.
There are some guidelines as to what an entity needs to be considered a species.
While there are no real guidelines as to what clades should be considered genera and what clades should be considered families etc.
Assigning properties at the level of genera or family is also problematic because it assumes that there will be inferencing and it will require rechecking that those properties are still valid if the species within that genus change.
So if there is some property that is common to all the species in the genus, make that a property of each of the individual species - not a property of the genus.
Respectfully,
- Pete
On Fri, Oct 15, 2010 at 10:45 AM, Steve Baskauf steve.baskauf@vanderbilt.edu wrote: As a background to this post, I want to reference a post by Bob called "SubclassOrNot". I discovered this page on an early foray into the TDWG website labyrinth and it has been very influential on my thinking since then. The idea Bob discusses is central to what I'm writing below so if you haven't read it you might want to do so first. You can probably skip the "OWL Inference" section and still get the point which is described in the first two sections of his post. The URL for the page is http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot .
To preface what I'm going to say below, I want to put Darwin Core Occurrences in the context of what Bob wrote. In my mind, one of the hallmarks of the Darwin Core standard and one thing that makes it a great improvement over previous versions is that the decision was made to use what Bob called the "has a" approach rather than the "is a" approach. In particular, the Darwin Core standard has a single class called dwc:Occurrence rather than subclasses called "Specimen", "Observation", and other possible things. The way that we differentiate among different kinds of Occurrences is by using the DwC types which are the controlled values for the term dwc:basisOfRecord. Thus we say an Occurrence "has a" basisOfRecord=PreservedSpecimen rather than saying it "is a" PreservedSpecimen. We say an Occurrence "has a" basisOfRecord=HumanObservation rather than saying it "is a" HumanObservation. This approach has greatly reduced the number of different terms in the standard since we don't have to have separate "ObservedBy" and "CollectedBy" terms, but rather can just have a single "RecordedBy" term that applies to both specimens and observations. The same thing applies to many other things, like eventDate rather than DateCollected and DateObserved, locality rather than collectionLocality and observationLocality, etc. With the ratification of Darwin Core, this decision is now a fait accompli and not a subject of discussion or something optional for users of the standard. It also seems to be clear that as necessary new terms can be added to the DwC types which would then be valid controlled values for basisOfRecord.
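The "has a" approach can be sketched as a single class that carries a tag, in place of one subclass per kind of Occurrence. The Python field names are assumptions for the example, and the set of allowed values lists only the ones named in this thread, not the full controlled vocabulary.

```python
from dataclasses import dataclass

# Values named in the text; the real controlled vocabulary is longer.
BASIS_OF_RECORD = {"PreservedSpecimen", "LivingSpecimen", "HumanObservation"}


@dataclass
class Occurrence:
    recorded_by: str      # one term for collectors and observers alike
    event_date: str       # one term in place of DateCollected / DateObserved
    locality: str
    basis_of_record: str  # the "has a" tag that replaces subclasses

    def __post_init__(self):
        # Reject values outside the (abbreviated) controlled vocabulary.
        if self.basis_of_record not in BASIS_OF_RECORD:
            raise ValueError(f"unknown basisOfRecord: {self.basis_of_record}")
```

A specimen and an observation are then instances of the same class and share the same fields; only the tag differs, and adding a new kind of Occurrence means adding a value to the vocabulary and leaves the class unchanged.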
Since the adoption of the DwC standard, the approach to Occurrences has been what I would describe as "I know an Occurrence when I see one". I consider this a pretty sloppy practice and as I indicated in my post last night, I think there is enough consensus about what an Occurrence is that we can come up with a better definition than "an occurrence is the category of information pertaining to evidence of an occurrence...". Another part of what I would characterize as sloppiness is the lack of a clear definition of what exactly basisOfRecord means. When I wrote my attempt at summarizing consensus last night, I dodged the question about what I called the "token". This "thing" has been called various names. In the previous discussion on the list, it was sometimes called "the evidence" of the occurrence. In the past I have called it "a representation" - however, I now think the term "token" is better because "representation" has a different technical meaning in the context of content negotiation. When we type an Occurrence by saying it has a basisOfRecord=PreservedSpecimen, we are saying that this Occurrence has as supporting evidence, or as a "token" if you prefer, all or part of the dead remains of the organism (i.e. what I'm calling "the Individual") that was being documented by the Occurrence. When we type an Occurrence by saying it has a basisOfRecord=LivingSpecimen, we are saying that this Occurrence has as a "token" the entire organism that was being documented (or some vegetative part of the live organism that was propagated). When we type an Occurrence by saying it has a basisOfRecord=HumanObservation, we are saying that the Occurrence has no supporting evidence other than the reputation of the observer to accurately record the metadata about the Occurrence. In other words, we "tag" an instance of a core class (to use Bob's words), Occurrence, by telling a metadata consumer what kind of token we are using as evidence of the Occurrence. 
A fundamental part of creating a clear definition of what an Occurrence is, is to define exactly what we are including in the concept of Occurrence. One possibility is to (1) say that the two boxes at the right side of the diagram at http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif are fused and that both the Occurrence metadata and its associated token are what we consider to be "the Occurrence". Another approach (2) would be to say that the actual Occurrence as an entity is only the metadata part and that the token is a separate thing. A third approach is to say (3) that everything with the blue dotted lines is considered a part of the Occurrence (i.e. the metadata, the token, the event, and the locality). I don't think in an absolute sense, any one of these approaches is "right". The problem is that these approaches are used inconsistently, sometimes even by the same person, depending on the basisOfRecord. Differences in ways of thinking about this issue is a part of why people aren't understanding the way other people are approaching the structuring of metadata. I have tried to consistently take the approach (1) that the two boxes on the right are fused, i.e. that the Occurrence metadata and the token should both be considered part of the entity that we call "an Occurrence". I think this is why Rich was confused in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001666.html when I said that it was "wrong" to assert that a scientific name is a property of an Occurrence - obviously it is silly to say that the token (photons on a film, sound patterns in a digital file) has a scientific name. Yet that is exactly what people do routinely when the token is a branch cut off a tree and glued to a piece of paper. They say that they are "identifying a specimen". What I am asking (actually demanding) is that the TDWG community get its act together and come to some consistency on this. 
If we are going to take the approach (2), then we need to take specimens off their pedestal and treat them like we do any other token that we are using as evidence that an Occurrence happened. If we are going to do what was suggested for the BioBlitz in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001603.html, i.e. to call Occurrences "observations" and then link the tokens to them by associatedMedia, ResourceRelationship, or some other means (approach 2) then do it consistently for every kind of token, including specimens, and don't single out media tokens for punishment. I have in a sense "thrown down the gauntlet" on this issue by proposing that DigitalStillImage be added as a DwC type and as a controlled value for basisOfRecord (http://code.google.com/p/darwincore/issues/detail?id=68). I know what some people are going to say in response to this proposal. "Why do you need to have 'DigitalStillImage' as a value for basisOfRecord when you can just say that the resource's dcterms:type=StillImage?" The answer goes back to Bob's point. If we are going to go the "has a" path (which we already have in DwC for Occurrences) rather than subclassing everything, then we need to provide an appropriate value for the "tag" for any type of resource that a reasonable number of users will want to use as a token. I think it is clear from this and other Bioblitzes, my work in Bioimages, the whale tracking project, and many other examples, that there are plenty of people who are already using DigitalStillImages as tokens and we all need a controlled value to use for basisOfRecord. The other thing that we accomplish when we type an Occurrence by its basisOfRecord is to tell a consumer what kind of metadata to expect to get about the token in addition to the generic metadata that is provided for all Occurrences. Thus for a LivingSpecimen we expect to be told what zoo, botanical garden, bacterial collection, etc. contains the specimen. 
For a PreservedSpecimen we expect to be told the preparation type, the location of the repository, etc. For a DigitalStillImage we expect to be told the file type, accessURL, etc. Simply providing a value for dcterms:type=StillImage doesn't indicate whether the image is a physical one (i.e. on film) or a digital one. It is also unreasonable to expect a client to have to check two different terms (basisOfRecord and dcterms:type) to find out what they could learn from one (basisOfRecord). Of course it would be advisable to provide a value for dcterms:type as well for clients outside the biodiversity community who may not "understand" what basisOfRecord means.
I hate to keep bringing my posts back to the RDF issue, but thinking about how one would write RDF forces clear thinking about how metadata should be structured. If we intend to separate tokens as entities from their associated Occurrence metadata, i.e. approach (2), then we open up a whole other can of worms. To associate the occurrence resources (i.e. the metadata) with the "different" resource (i.e. the token), we will probably have to be able to create URIs for the tokens and separate RDF metadata blocks which will have to be rdf:type'd. What are we going to use for that rdf:type - create another Darwin Core class? I simply don't think that is a complicated road that we want to travel. It would be far easier to just say that every Occurrence has a one-to-one relationship with its token (which could be "the empty set" for observations). This would not work for people who want to hang multiple tokens on a single observation event, but I think that itself is a bad idea because it makes it even harder to have "flat" occurrence datasets. Just say that every time we collect a different token (or make an observation that has no token), it is a new Occurrence record. Realistically, a single collector can't actually take a picture of a plant at the same time he or she collects it for a specimen anyway. Those really should be considered two different events because they happen at different times.
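The one-to-one idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the token field names (preparations, repository, holdingInstitution, fileType, accessURL), the identifiers and the dates are all hypothetical, and "DigitalStillImage" is the value proposed in issue 68 rather than an accepted controlled value.

```python
from dataclasses import dataclass, field

# Token metadata a consumer might expect for each basisOfRecord value.
# Field names are illustrative assumptions; "DigitalStillImage" is the value
# proposed in issue 68, not an accepted controlled value.
EXPECTED_TOKEN_FIELDS = {
    "PreservedSpecimen": {"preparations", "repository"},
    "LivingSpecimen": {"holdingInstitution"},
    "DigitalStillImage": {"fileType", "accessURL"},
    "HumanObservation": set(),  # no token: the "empty set" case
}


@dataclass
class Occurrence:
    """One flat record: generic Occurrence metadata plus at most one token."""
    occurrence_id: str
    basis_of_record: str
    recorded_by: str
    event_date: str
    token: dict = field(default_factory=dict)

    def missing_token_fields(self):
        """List expected token fields that this record does not supply."""
        expected = EXPECTED_TOKEN_FIELDS[self.basis_of_record]
        return sorted(expected - set(self.token))


# Photographing a plant and then collecting it yields two Occurrence records,
# each with its own time and its own single token (all values are made up).
records = [
    Occurrence("occ-1", "DigitalStillImage", "A. Collector", "2010-10-15T09:30:00",
               {"fileType": "image/jpeg", "accessURL": "http://www.example.org/img/1.jpg"}),
    Occurrence("occ-2", "PreservedSpecimen", "A. Collector", "2010-10-15T09:35:00",
               {"preparations": "pressed sheet"}),
    Occurrence("occ-3", "HumanObservation", "A. Collector", "2010-10-15T10:00:00"),
]

for record in records:
    print(record.occurrence_id, record.basis_of_record, record.missing_token_fields())
```

A client reading such records only has to look at basisOfRecord to know which token metadata to expect, and the dataset stays flat because no record carries more than one token.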
OK, enough said. Consider this my defense of my proposal "issue 68" to add DigitalStillImage. I would urge the powers that be to respond to the issues that I've raised here before having any kind of "vote" (or whatever is ultimately going to happen when there is an up or down decision about the proposal).
Steve
Steve Baskauf wrote: After the flurry of emails recently, I had an opportunity to carefully read all the way through the threads again, followed by enforced "think time" during my long commute. I was actually pretty cheerful after that because I think that in essence, most of the conversation about what constitutes an Occurrence really boils down to the same thing. So I have sat down and tried to summarize what seems to me to be a consensus about Occurrences. To follow my points, please refer to the diagram at: http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif
Consensus on relationships
1. The fundamental definition of an Occurrence involves evidence that a representative of a taxon occurred at a place and time.
Note 1.A: For clarity, I have modified John's statement in his last email by replacing "taxon" with "representative of a taxon". I'm considering a taxon to be an abstract concept that is applied to individuals or groups of organisms.
Note 1.B: This definition is far more useful than the official definition of the class Occurrence "The category of information pertaining to evidence of an occurrence..." which is essentially circular.
Note 1.C: This statement is extremely broad because the evidence could be of many sorts, the representative could range from a single individual to all organisms on the earth, the taxon could be anyone's definition at any taxonomic level, the place could range from a GPS point with uncertainty of less than 10 meters to the entire planet earth, and the time could range from a shutter click of less than one second to 3.4 billion years.
2. The diagram is an attempt to summarize in pictorial form statements and relationships that have been described in the thread. The taxon representative is recorded as existing at a particular time and place (the arrow) and the result is an Occurrence record. That Occurrence record exists as metadata which may be associated with a token that can be used to voucher the fact that the taxon representative existed. That token may be the organism itself (or a living part of it as in a twig for grafting), all or part of the organism in preserved form, an electronic representation such as an image or sound recording, and other kinds of things like tissue or DNA samples. There may also be no token at all, in which case we call the Occurrence record an observation. Based on direct observation of the taxon representative, examination of one or more tokens, or both, some determiner asserts that a taxon concept applies to the taxon representative and as a result a scientific name can be used to "identify" the taxon representative.
(There may be a lot of other complicated stuff above the Identification box, but that will have to be filled in by the taxonomists.) Note 2.A: I have mapped onto this diagram the letters that John used in his last email to refer to entities that are involved in an Occurrence (T, E, L, O, and G). I will beg the forgiveness of fossil people because I don't really know how the geological context fits in. I'm assuming that it is a way of asserting time and location on a much broader scale than we do for extant organisms. Note 2.B: I have put a dotted line around the part of the diagram that I think includes all the things that people might consider part of the Occurrence itself. I have left out "T" and the other parts related to identification because it seems to me that you can have an occurrence that you document which does not yet (and perhaps never will) have an identification. The Occurrence still asserts that a taxon representative existed at a time and place; we just don't yet know what the taxon is. 3. The red lines indicate the relationships that connect the various entities (I'm going to go ahead and call them resources). Consistent with popular opinion, the Occurrence record is the center of the universe and most things are connected to it. Note 3.A: I am sticking to my guns and refuse to connect the Identification directly to the Occurrence. It is the taxon representative that is being identified, not the occurrence. One can assert another sort of relationship between the identification and the occurrence if one wants to say that one consulted the occurrence metadata and token in order to decide about the identification, but it is not correct to say that the Identification identifies either the Occurrence metadata or the token (as Rich pointed out).
OK, so that's step one - defining what is related to what. If anyone disagrees with these relationships, please clarify or create your own diagram.
Complicating circumstances/caveats
1. It is noted and recognized that some users will not care to include all of these relationships in their models. In the interest of simplification or "flattening" the relationships, they may wish to collapse some parts of this diagram (e.g. incorporate time and location metadata within the Occurrence metadata rather than considering them separate resources, applying scientific names directly to the taxon representatives without defining a taxon concept or recording the determination metadata, connecting identifications directly to the occurrence, etc.). This doesn't mean that the relationships don't exist, it just means that some users don't care about them.
2. It is recognized that different users will be interested in or able to specify the various resources to differing degrees of precision. Examples: A photographer might record times to the nearest second, a collector may only be interested in noting the date on which a specimen was collected. A location may be specified to the precision of a GPS reading or be defined as some geographic or political subdivision. The taxon representative may be an individual organism, a flock or clump, or some larger aggregation of taxon representatives.
That's step two. If I've missed any complications, please point them out.
My opinions about the implications of this diagram
1. The circle I've labeled as "taxon representative" is the resource type that I'm proposing to be represented by the class Individual. You will note that in both the definition of dwc:individualID ("An identifier for an individual or named group of individual organisms...") and the proposed class definition ("The category of information pertaining to an individual or named group of individual organisms represented in an Occurrence"), groups of individual organisms are included. Thus John's example of a fossil having myriad individuals, or Richard's examples of thousands of plankton, a large school of fish, herd of wildebeest, flock of birds, could all be categorized as "Individual" under this definition if there is a reasonable expectation that all of the individuals in the group are members of the same taxon. Perhaps there is a better name for this resource, but since dwc:individualID was already extant, I chose Individual as the class name for consistency with the pattern established with other classes and their associated xxxxID terms.
2. Although in note 1.C. I have given the ranges of the various resources to their logical extreme (as was done previously in the thread), I think that as a practical matter we can adopt guidelines to set reasonable values for the "normal" ranges of the resources. One such guideline might be that we suggest a range that can accommodate about 95% of the user needs within the community (this came from Rich's comment about satisfying 95% of the user need with an establishmentMeans controlled vocabulary). For example, it was suggested that the range for the location of an Occurrence could span the entire planet Earth. True enough, but virtually nobody would find such a span useful. 95% of users would probably find a range between a GPS reading with 10 meter precision and the extent of a county or province useful for recording the location of an Occurrence.
I can suggest similar "useful" ranges: one second to one day for an event time (excluding fossils), one individual organism to the number of organisms that would fit within a 50 meter radius for an "individual", and taxon identified to family for plants and maybe mammals, genus for birds, and order for insects. So framing the definition of an Occurrence in these terms it would be something like: "An occurrence involves evidence (consisting of a physical token, electronic record, or personal observation) that a representative (ranging from a single individual to the number that would fit on a football field) of a taxon (hopefully identified to some lower taxonomic level) occurred at a place (determined to a precision between that of a GPS reading and the size of a county/province) and time (spanning one second to one day)." A few people might object to this level of restrictiveness, but I would guess that it would make 95% of us happy. 3. With the exception of the "missing" class Individual, every resource type on this diagram except for the "token" and Scientific name has a Darwin Core class. Every resource type on the diagram except for "token" has a dwc:xxxxID term that can be used to refer to a GUID for the resource. The implication of this is that any resource on this diagram except for the token and taxon representative (i.e. Individual) is ready to be represented in RDF by Darwin Core terms in the sense that the relationships (red lines) can be represented by the xxxxID terms and that the resources can be rdfs:type'd using Darwin Core classes. (Lacking a class for the scientific name doesn't seem like a big deal to me since the scientific name can be a string literal - but then I'm not a taxonomist.) 4. OK, I've avoided it as long as I can, so I'm going to confess now to the RDF-phobes. The red lines and shapes are something pretty close to an RDF graph. 
What that means is that if the community can agree that this diagram correctly represents the relationships among the kinds of biodiversity resources that we care about, then the matter of providing guidelines on how to represent Darwin Core in RDF suddenly gets a lot simpler. Just convert the "picture" of the RDF graph into XML format and we have a template. Alright, that's an oversimplification, but I think it is essentially true because the most difficult part of achieving a consensus on RDF representations is to decide how we connect the resource types, not on the literals that we hang onto resources as properties. 5. While I'm beating the RDF drum again, the importance of my opinion number 2 can be extended into the GUID adoption process. In my comments to Kevin about the Beginner's Guide to Persistent Identifiers, I think I commented on the question of how one decides whether a GUID needs to be assigned to something or not. I believe that the answer to that question boils down to this: we need a GUID for any resource that will be referenced by more than one other resource. Do we need to be able to assign a GUID to Taxon concepts? Yes, because it is likely that many identifications will want to reference a particular taxon concept. Do we need to be able to assign a GUID to an Event? Maybe or maybe not. If every occurrence has its own separate time recorded, then no GUID is needed because the time is just a part of every separate occurrence record. If the event is defined to be a time range that represents a collecting trip, then there may be many Occurrences that are associated with that trip and all of them could reference the GUID for that event rather than repeating the event information for every Occurrence. 
The point here is that every shape (class of resources) on this diagram at least has the POTENTIAL to be a node connecting multiple resources and therefore should have the capability of being assigned a GUID, having its own RDF record, and being appropriately typed (presumably by a DwC class). So this is a final technical argument for why we need to have the DwC class Individual. Whether or not people ultimately choose to assign GUIDs to particular resource types or not is their own choice, but they need to at least be ABLE to if they need that resource to serve as a node given the structure of their metadata.
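The rule of thumb in opinion 5 (a resource needs a GUID once more than one other resource references it) can be sketched as a small check over a list of triples. This is a minimal illustration, not a proposal: the function name and every identifier below are hypothetical, and the choice of dwc:eventID and dwc:taxonID as the linking terms is an assumption.

```python
from collections import defaultdict


def resources_needing_guid(triples):
    """Return the objects that are referenced by more than one distinct subject."""
    referrers = defaultdict(set)
    for subject, _predicate, obj in triples:
        referrers[obj].add(subject)
    return sorted(obj for obj, subjects in referrers.items() if len(subjects) > 1)


# Hypothetical links: two Occurrences share a collecting-trip Event, a third
# has its own time, and two determinations point at the same taxon concept.
triples = [
    ("occ-1", "dwc:eventID", "event-trip-A"),
    ("occ-2", "dwc:eventID", "event-trip-A"),
    ("occ-3", "dwc:eventID", "event-solo"),
    ("ident-1", "dwc:taxonID", "concept-X"),
    ("ident-2", "dwc:taxonID", "concept-X"),
]

print(resources_needing_guid(triples))
```

Here the shared trip and the shared concept act as nodes and would need GUIDs, while "event-solo" could just as well be folded into its single Occurrence record.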
We need to clarify how the "token" thing fits in, but I'm stopping there for now. I would very much appreciate responses indicating that:
A. you agree with the diagram and connections (and consider this definition and diagram a consensus) B. you disagree with the diagram (and articulate why) C. you provide an alternative diagram or explanation of the relationships among the classes related to Occurrences.
Thanks for your patience with another tome. Steve
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content .
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base / GeoSpecies Knowledge Base About the GeoSpecies Knowledge Base
In brief, dwc:nameAccordingTo is meant to be strictly about the name (Taxon) whereas the dwc:identificationReferences might include a list of references used in the assessment to make the determination. The dwc:identificationReferences might include taxonomic treatments, but they might also include something such as an expert range map or a field guide, which one would never find as the objects of a dwc:nameAccordingTo.
On Mon, Oct 18, 2010 at 1:40 PM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:
OK, I'll confess that I don't know what a fully normalised model is and a two-minute reading of Wikipedia didn't help me. I was just going by Pete's snippet of an RDF element. I'm assuming it was a part of something like:
<dwc:Identification rdf:about="http://www.example.org/determinations/12345"> ...
<dwcterms:nameAccordingTo>Identification And Geographical Distribution Of The Mosquitoes: Of North America, North Of Mexico By Richard F., Jr. Darsie et al. 2004</dwcterms:nameAccordingTo> ... </dwc:Identification>
which would be making the statement [the determination] dwcterms:nameAccordingTo [that long literal string for the reference] in my shortcut RDF triple notation.
But it sounds like you are saying that one should say that the determination has a certain name and then that the name should have the property dwcterms:nameAccordingTo.
Maybe this is a question that should be deferred to a later discussion prior to a putative RDF guide. Steve
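The two placements of nameAccordingTo discussed above can be compared with plain triples. This is a sketch under assumptions, not a recommendation for the RDF guide: the Taxon URI, the use of dwc:taxonID as the link from the determination to the Taxon, the abbreviated reference string and the example name are all made up for illustration.

```python
DET = "http://www.example.org/determinations/12345"
TAXON = "http://www.example.org/taxa/1"  # hypothetical URI for the Taxon resource
KEY = "Darsie et al. 2004"               # abbreviated reference literal

# Normalised form: nameAccordingTo hangs on the Taxon, not on the determination.
normalised = [
    (DET, "dwc:taxonID", TAXON),
    (DET, "dwc:identificationReferences", KEY),
    (TAXON, "dwc:scientificName", "Aedes triseriatus"),
    (TAXON, "dwc:nameAccordingTo", KEY),
]


def flatten(triples, link="dwc:taxonID"):
    """Copy each linked Taxon's properties onto the determination that points to it."""
    taxa = {obj for _, pred, obj in triples if pred == link}
    flat = []
    for subj, pred, obj in triples:
        if pred == link:
            # Replace the link with the Taxon's own properties.
            flat.extend((subj, t_pred, t_obj)
                        for t_subj, t_pred, t_obj in triples if t_subj == obj)
        elif subj not in taxa:
            flat.append((subj, pred, obj))
    return flat


# Shortcut form: scientificName and nameAccordingTo sit on the determination.
shortcut = flatten(normalised)
for triple in shortcut:
    print(triple)
```

In the flattened output the same reference appears under both dwc:nameAccordingTo and dwc:identificationReferences, which is the apparent duplication raised in this exchange; in the normalised form the two terms describe different resources.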
Markus Döring wrote:
It's a property of Taxon if you have a fully normalised model. But as a shortcut together with scientificName it might be right in Identification I suppose?
Markus
On Oct 18, 2010, at 21:38, Steve Baskauf wrote:
So are we saying that dwc:nameAccordingTo can be a property of a dwc:Identification? What's dwc:identificationReferences for? I'm sorry if this is a dumb question but I can plead ignorance on this topic. Steve
Peter DeVries wrote:
Hi Markus,
I feel your pain. :-)
Maybe an example might help clarify this.
I use the key* listed below to id my mosquitoes.
So I should mark up my RDF for the identification with something like:
<dwcterms:nameAccordingTo>Identification And Geographical Distribution Of The Mosquitoes: Of North America, North Of Mexico By Richard F., Jr. Darsie et al. 2004</dwcterms:nameAccordingTo>
Rather than use some other term like "dwc:identificationReferences"
Correct?
- Pete
Identification And Geographical Distribution Of The Mosquitoes: Of North America, North Of Mexico By Richard F., Jr. Darsie, RONALD A. WARD, Chien C. Chang, Taina Litwak University Press of Florida, 2004 ISBN: 0813027845 Cite: 13463
=================================================================
On Mon, Oct 18, 2010 at 12:19 PM, "Markus Döring (GBIF)" <mdoering@gbif.org> wrote: I am sorry I don't have the time to follow this extensive thread, but I can manage at least the first paragraphs ;) A quick comment on tying identification sources to a scientific name. As for other taxon concepts this is usually done with the sec/sensu reference which should be recorded as dwc:nameAccordingTo: http://rs.tdwg.org/dwc/terms/index.htm#nameAccordingTo
I am slightly irritated that we seem to have some term duplicates for this use case. Maybe dwc:identificationReferences is supposed to only list additional references?
Markus
On Oct 18, 2010, at 18:49, Steve Baskauf wrote:
I've fallen behind on systematically perusing the list responses, but I would like to focus in on a point that seems to be a consensus in the responses that have shown up recently. The consensus seems to be that documenting determinations (a.k.a. instances of the dwc:Identification class) that are applied to Individuals (or Occurrences if you don't believe in Individuals) is the way to go. So in my usual graphical way of thinking about this, I would draw a "relationship line" from the determination to the Individual (or Occurrence) on one side and from the determination to the species concept on the other. I will leave it up to the taxonomy people to decide what different things would be connected to the species concept and how all of their lines would be connected. The determination would have any of the properties that are terms listed in the dwc:Identification class (identifiedBy, dateIdentified, identificationReferences, identificationRemarks, identificationQualifier, and typeStatus). Some properties like dateIdentified and identificationReferences would be string literals and others (especially identifiedBy) should probably be GUIDs but could be literals if they had to be.
That all seems pretty clear. However, when I've started trying to do this in real life, I immediately have questions. Take a look at http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf which should show up as a web page in your browser.
- The original label identifies the species as Juncus diffusissimus. However, there is no indicator as to who originally identified it or when. My assumption is that it was the collector (Glen N. Montz) but I don't really know that. Do I assume that, or list the original determiner as "unknown"?
- Do we draw a distinction between the initial identification and subsequent annotations? I think the answer should be "no" and that's why I refer to both generically as "determinations".
- There is really no indication given on the annotation labels as to many of the things that we would like to know, such as the concept they had in mind, any source they used (if any), or the reason why they did the annotation. So how does one connect the name that they applied to the determination when there is no indication of the concept? Is this just something we can't do for old annotations and just something that we try to do from this point forward?
- The last question is one that I really want some opinions about. It seems to me that there are a number of reasons why one would apply a determination. One would be to correct an actual error in identification. One would be to increase the precision of a previous determination (e.g. an insect identified to family now is identified to species). One would be to assert a difference in opinion as to the correct way to group this individual with others (i.e. as in a taxonomic revision). Finally, a single determiner might apply several determinations to one individual and indicate in each determination the concept intended (i.e. if you subscribe to Cronquist, you'd call it X; if you like Radford's book, you'd call it Y; if you like Weakley's treatment, you'd call it Z). Some of these four reasons may be functionally equivalent, but how would you use Darwin Core to indicate the reason why you applied the determination? Please don't say "identificationRemarks"! From a machine-processing standpoint, this is something we should know and there should be some kind of controlled vocabulary to express it. For instance if an identification is "deprecated" because it was in error (perhaps by the determiner him/herself), one would like the incorrect determination to show up in the historical metadata, but I wouldn't want it to be listed in a website index. The same would hold true if an annotator was able to pin the taxon down to a lower taxonomic level than the original identifier. If someone goes to the trouble to connect an Individual/Occurrence to several names under alternative concepts, there should be a way that a machine would know this so that a software user could select the concept they wanted to use and the name under that concept would pop up.
I don't really see any term under the current DwC that could be used to do this last thing. Am I missing something? Do we need several terms to explain the reason why we made the determination because the reasons fall into different categories?
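To make the last question concrete, here is a minimal sketch of what a controlled vocabulary for the reason behind a determination might let software do. No such Darwin Core term exists as far as this thread establishes, so the vocabulary values, the class and field names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Reason(Enum):
    """Hypothetical controlled values for why a determination was applied."""
    ORIGINAL = "original"
    ERROR_CORRECTION = "errorCorrection"
    REFINEMENT = "refinement"
    TAXONOMIC_OPINION = "taxonomicOpinion"
    ALTERNATIVE_CONCEPT = "alternativeConcept"


@dataclass
class Determination:
    det_id: str
    name: str
    reason: Reason
    replaces: Optional[str] = None      # det_id of the determination it supersedes
    according_to: Optional[str] = None  # treatment whose concept was intended


def indexable(determinations):
    """Drop determinations superseded by a correction or refinement.

    The full list remains available as historical metadata; only the index view shrinks.
    """
    superseding = (Reason.ERROR_CORRECTION, Reason.REFINEMENT)
    hidden = {d.replaces for d in determinations
              if d.reason in superseding and d.replaces}
    return [d for d in determinations if d.det_id not in hidden]


def name_under(determinations, treatment):
    """Return the name applied under the treatment the user selected, if any."""
    for d in indexable(determinations):
        if d.according_to == treatment:
            return d.name
    return None


history = [
    Determination("det-1", "family-level name", Reason.ORIGINAL),
    Determination("det-2", "name X", Reason.REFINEMENT, replaces="det-1",
                  according_to="Cronquist"),
    Determination("det-3", "name Y", Reason.ALTERNATIVE_CONCEPT,
                  according_to="Radford"),
]

print([d.det_id for d in indexable(history)])
print(name_under(history, "Radford"))
```

The point of the sketch is only that none of this filtering is possible if the reason lives in a free-text remarks field.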
The other comment that I'll throw out (since this is going out to the bioblitz list as well as to tdwg-content) is that those of you who are building apps to collect metadata in the field really need to separate the process of entering (or acquiring) the collection metadata from the determination process. In at least some apps, the user immediately has to commit to a taxon as they enter the data at the time of collection. It seems to me that it would be a very common situation (especially in the case of "citizen science") that the collector/observer/photographer would have no idea what the taxonomic identity was at the time of collection. The process of determination (and the recording of the various dwc:Identification class terms) is really a separate process that should be able to happen at the time of collection OR later.
Steve
Peter DeVries wrote:
Hi Steve,
I would hypothesize that for the vast majority of identified records the process is something like this:
An individual uses some sort of key to determine what species (taxon concept) to assign to a given individual
- They may have created some sort of mental key in which once they recognize one individual mosquito they can then pretty quickly sort a number of individuals into collections.
The actual name they assign to the specimen is usually based on what their key says the name is. Often this does not specify the authorship. Most of these human identifiers have not read the original species descriptions for the species they are identifying. So the specimen is actually tied to a concept that is based more on the "key" than the original description.
- An exception would be where there is a key in the original description and that was what was used.
So in a sense, the process of modeling this as if the identifier actually asserted that the concept was the same as that described by the original description or a subsequent revision is "fudging".
Side effects of this process include:
- A new key for North American Mosquitoes comes out that incorporates recent changes in nomenclature. The major change being the elevation of a subgenus to a genus. For most of the species described the "key concept" is unchanged.
Student identifier, Bob, in state X is using the latest key, while student identifier, Joe, in state Z is using a slightly older edition of the same key.
Bob identifies the species as Ochlerotatus triseriatus, while Joe identifies what should be the same species as Aedes triseriatus.
These show up in GBIF on two different maps, they show up in the EOL as two different pages.
Various TDWG'ers continue to argue that the original description and subsequent revisions were really important in determining what these individuals actually meant when they assigned a name to a specimen, and that this is how we should model it in excruciating detail.
I would argue this should be modeled as best as possible to what actually happens.
For example, how many of the species observed in the recent BioBlitz were identified by referring to the original species description or subsequent revisions?
In your diagram, I would suggest that you show that a taxon concept may have many names associated with it. Since it is not clear what the identifier intended by his or her choice of a name, it is often difficult to determine what taxon concept they actually meant.
This is why I advocate a move to a more taxon concept based identifier to link these data sets together, because this allows the intent of the identifier to be more accurately modeled.
This would be done in the form of:
"I assert that this specimen (of what I call Aedes triseriatus) was observed here. I also assert that it is an instance of this species concept => URI"
Or I assert that this is an individual of the type "Individual of species concept X" = > URI
All of these are instances of the class "Individual"
So the resulting DarwinCore record would contain both the name and an optional, but I think needed, asserted species concept.
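The Bob and Joe example can be sketched to show why carrying an asserted concept identifier alongside the name helps aggregation. The concept URI and the "speciesConcept" field name below are hypothetical placeholders, not existing Darwin Core terms or real identifiers.

```python
from collections import defaultdict

CONCEPT = "http://www.example.org/concepts/triseriatus"  # hypothetical concept URI

# Two identifiers, two editions of the same key, one intended species concept.
records = [
    {"identifiedBy": "Bob", "scientificName": "Ochlerotatus triseriatus",
     "speciesConcept": CONCEPT},
    {"identifiedBy": "Joe", "scientificName": "Aedes triseriatus",
     "speciesConcept": CONCEPT},
]


def group_by(rows, key):
    """Group the identifiers of the records by the value of one field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["identifiedBy"])
    return dict(groups)


print(group_by(records, "scientificName"))  # two groups: two maps, two pages
print(group_by(records, "speciesConcept"))  # one group: the records line up
```

Grouping on the name string splits what should be one species, while grouping on the asserted concept keeps the records together without discarding the name each identifier actually used.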
The species concept is a subclass of taxon concept, but is fundamentally different than the higher clades.
There are some guidelines as to what an entity needs to be considered a species.
While there are no real guidelines as to what clades should be considered genera and what clades should be considered families etc.
Assigning properties at the level of genus or family is also problematic because it assumes that there will be inferencing and it will require rechecking that those properties are still valid if the species within that genus change.
So if there is some property that is common to all the species in the genus, make that a property of each of the individual species - not a property of the genus.
Respectfully,
- Pete
On Fri, Oct 15, 2010 at 10:45 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote: As a background to this post, I want to reference a post by Bob called "SubclassOrNot". I discovered this page on an early foray into the TDWG website labyrinth and it has been very influential on my thinking since then. The idea Bob discusses is central to what I'm writing below so if you haven't read it you might want to do so first. You can probably skip the "OWL Inference" section and still get the point which is described in the first two sections of his post. The URL for the page is http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot .
To preface what I'm going to say below, I want to put Darwin Core Occurrences in the context of what Bob wrote. In my mind, one of the hallmarks of the Darwin Core standard and one thing that makes it a great improvement over previous versions is that the decision was made to use what Bob called the "has a" approach rather than the "is a" approach. In particular, the Darwin Core standard has a single class called dwc:Occurrence rather than subclasses called "Specimen", "Observation", and other possible things. The way that we differentiate among different kinds of Occurrences is by using the DwC types which are the controlled values for the term dwc:basisOfRecord. Thus we say an Occurrence "has a" basisOfRecord=PreservedSpecimen rather than saying it "is a" PreservedSpecimen. We say an Occurrence "has a" basisOfRecord=HumanObservation rather than saying it "is a" HumanObservation. This approach has greatly reduced the number of different terms in the standard since we don't have to have separate "ObservedBy" and "CollectedBy" terms, but rather can just have a single "RecordedBy" term that applies to both specimens and observations. The same thing applies to many other things, like eventDate rather than DateCollected and DateObserved, locality rather than collectionLocality and observationLocality, etc. With the ratification of Darwin Core, this decision is now a fait accompli and not a subject of discussion or something optional for users of the standard. It also seems to be clear that, as necessary, new terms can be added to the DwC types which would then be valid controlled values for basisOfRecord.
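The difference between the two approaches can be sketched as a small conversion from "is a" style records to "has a" style records. The legacy field names are the ones mentioned above; treating a legacy "Specimen" as a PreservedSpecimen is an assumption made only for this illustration, as are the function name and the sample values.

```python
# Legacy subclass-specific field names mapped to the single shared terms.
LEGACY_FIELD_MAP = {
    "CollectedBy": "recordedBy",
    "ObservedBy": "recordedBy",
    "DateCollected": "eventDate",
    "DateObserved": "eventDate",
    "collectionLocality": "locality",
    "observationLocality": "locality",
}

# Assumed mapping from a legacy "is a" type to a basisOfRecord value.
LEGACY_TYPE_MAP = {
    "Specimen": "PreservedSpecimen",
    "Observation": "HumanObservation",
}


def to_has_a(legacy_type, legacy_record):
    """Turn an "is a" record into an Occurrence that "has a" basisOfRecord."""
    record = {"basisOfRecord": LEGACY_TYPE_MAP[legacy_type]}
    for key, value in legacy_record.items():
        record[LEGACY_FIELD_MAP.get(key, key)] = value
    return record


specimen = to_has_a("Specimen", {"CollectedBy": "A. Collector",
                                 "DateCollected": "2010-10-15"})
sighting = to_has_a("Observation", {"ObservedBy": "A. Observer",
                                    "DateObserved": "2010-10-16"})

# Both results share one set of field names, so one consumer handles both.
print(specimen)
print(sighting)
```

After conversion the only thing distinguishing the two records is the basisOfRecord tag, which is the essence of the "has a" design.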
Since the adoption of the DwC standard, the approach to Occurrences has been what I would describe as "I know an Occurrence when I see one". I consider this a pretty sloppy practice and, as I indicated in my post last night, I think there is enough consensus about what an Occurrence is that we can come up with a better definition than "an occurrence is the category of information pertaining to evidence of an occurrence...". Another part of what I would characterize as sloppiness is the lack of a clear definition of what exactly basisOfRecord means. When I wrote my attempt at summarizing consensus last night, I dodged the question about what I called the "token". This "thing" has been called various names. In the previous discussion on the list, it was sometimes called "the evidence" of the occurrence. In the past I have called it "a representation" - however, I now think the term "token" is better because "representation" has a different technical meaning in the context of content negotiation. When we type an Occurrence by saying it has a basisOfRecord=PreservedSpecimen, we are saying that this Occurrence has as supporting evidence, or as a "token" if you prefer, all or part of the dead remains of the organism (i.e. what I'm calling "the Individual") that was being documented by the Occurrence. When we type an Occurrence by saying it has a basisOfRecord=LivingSpecimen, we are saying that this Occurrence has as a "token" the entire organism that was being documented (or some vegetative part of the live organism that was propagated). When we type an Occurrence by saying it has a basisOfRecord=HumanObservation, we are saying that the Occurrence has no supporting evidence other than the reputation of the observer to accurately record the metadata about the Occurrence. In other words, we "tag" an instance of a core class (to use Bob's words), Occurrence, by telling a metadata consumer what kind of token we are using as evidence of the Occurrence.
A fundamental part of creating a clear definition of what an Occurrence is, is to define exactly what we are including in the concept of Occurrence. One possibility is to (1) say that the two boxes at the right side of the diagram at http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif are fused and that both the Occurrence metadata and its associated token are what we consider to be "the Occurrence". Another approach (2) would be to say that the actual Occurrence as an entity is only the metadata part and that the token is a separate thing. A third approach is to say (3) that everything within the blue dotted lines is considered a part of the Occurrence (i.e. the metadata, the token, the event, and the locality). I don't think in an absolute sense, any one of these approaches is "right". The problem is that these approaches are used inconsistently, sometimes even by the same person, depending on the basisOfRecord. Differences in ways of thinking about this issue are a part of why people aren't understanding the way other people are approaching the structuring of metadata. I have tried to consistently take the approach (1) that the two boxes on the right are fused, i.e. that the Occurrence metadata and the token should both be considered part of the entity that we call "an Occurrence". I think this is why Rich was confused in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001666.html when I said that it was "wrong" to assert that a scientific name is a property of an Occurrence - obviously it is silly to say that the token (photons on a film, sound patterns in a digital file) has a scientific name. Yet that is exactly what people do routinely when the token is a branch cut off a tree and glued to a piece of paper. They say that they are "identifying a specimen". What I am asking (actually demanding) is that the TDWG community get its act together and come to some consistency on this.
If we are going to take the approach (2), then we need to take specimens off their pedestal and treat them like we do any other token that we are using as evidence that an Occurrence happened. If we are going to do what was suggested for the BioBlitz in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001603.html, i.e. to call Occurrences "observations" and then link the tokens to them by associatedMedia, ResourceRelationship, or some other means (approach 2), then do it consistently for every kind of token, including specimens, and don't single out media tokens for punishment. I have in a sense "thrown down the gauntlet" on this issue by proposing that DigitalStillImage be added as a DwC type and as a controlled value for basisOfRecord (http://code.google.com/p/darwincore/issues/detail?id=68). I know what some people are going to say in response to this proposal. "Why do you need to have 'DigitalStillImage' as a value for basisOfRecord when you can just say that the resource's dcterms:type=StillImage?" The answer goes back to Bob's point. If we are going to go the "has a" path (which we already have in DwC for Occurrences) rather than subclassing everything, then we need to provide an appropriate value for the "tag" for any type of resource that a reasonable number of users will want to use as a token. I think it is clear from this and other Bioblitzes, my work in Bioimages, the whale tracking project, and many other examples, that there are plenty of people who are already using DigitalStillImages as tokens and we all need a controlled value to use for basisOfRecord. The other thing that we accomplish when we type an Occurrence by its basisOfRecord is to tell a consumer what kind of metadata to expect to get about the token in addition to the generic metadata that is provided for all Occurrences. Thus for a LivingSpecimen we expect to be told what zoo, botanical garden, bacterial collection, etc. contains the specimen.
For a PreservedSpecimen we expect to be told the preparation type, the location of the repository, etc. For a DigitalStillImage we expect to be told the file type, accessURL, etc. Simply providing a value for dcterms:type=StillImage doesn't indicate whether the image is a physical one (i.e. on film) or a digital one. It is also unreasonable to expect a client to have to be checking two different terms (basisOfRecord and dcterms:type) to find out what they could learn from one (basisOfRecord). Of course it would be advisable to provide a value for dcterms:type as well for clients outside the biodiversity community who may not "understand" what basisOfRecord means. I hate to keep bringing my posts back to the RDF issue, but thinking about how one would write RDF forces clear thinking about how metadata should be structured. If we intend to separate tokens as entities from their associated Occurrence metadata, i.e. approach (2), then we open up a whole other can of worms. To associate the occurrence resources (i.e. the metadata) with the "different" resource (i.e. the token), we will probably have to be able to create URIs for the tokens and separate RDF metadata blocks, which will have to be rdf:type'd. What are we going to use for that rdf:type - create another Darwin Core class? I simply don't think that is a complicated road that we want to travel. It would be far easier to just say that every Occurrence has a one-to-one relationship with its token (which could be "the empty set" for observations). This would not work for people who want to hang multiple tokens on a single observation event, but I think that itself is a bad idea because it makes it even harder to have "flat" occurrence datasets. Just say that every time we collect a different token (or make an observation that has no token), it is a new Occurrence record. Realistically, a single collector can't actually take a picture of a plant at the same time he or she collects it for a specimen anyway. Those really should be considered two different events because they happen at different times.
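As a rough sketch of the one-token-per-Occurrence idea, the following Python turns each piece of evidence about one individual into its own flat record. The function name, the identifier scheme, and the sample values are assumptions made up for illustration, and DigitalStillImage is the value proposed in issue 68, not an accepted DwC type.

```python
def occurrences_for(individual_id, recorded_by, evidence):
    """Return one flat Occurrence record per piece of evidence.

    evidence is a list of (basisOfRecord, eventDate) pairs; each pair
    becomes its own record, so no record ever carries two tokens.
    """
    return [
        {
            "occurrenceID": f"{individual_id}-occ{n}",  # hypothetical ID scheme
            "individualID": individual_id,
            "recordedBy": recorded_by,
            "basisOfRecord": basis,
            "eventDate": when,
        }
        for n, (basis, when) in enumerate(evidence, start=1)
    ]


records = occurrences_for(
    "ind-42",
    "Collector A",
    [
        ("DigitalStillImage", "2010-10-15T10:02:00"),  # photo taken first
        ("PreservedSpecimen", "2010-10-15T10:05:00"),  # specimen collected later
    ],
)
```

Both records point at the same individualID but carry different times and different basisOfRecord values, so the dataset stays flat.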
OK, enough said. Consider this my defense of my proposal "issue 68" to add DigitalStillImage. I would urge the powers that be to respond to the issues that I've raised here before having any kind of "vote" (or whatever is ultimately going to happen when there is an up or down decision about the proposal).
Steve
Steve Baskauf wrote:
After the flurry of emails recently, I had an opportunity to carefully read all the way through the threads again, followed by enforced "think time" during my long commute. I was actually pretty cheerful after that because I think that in essence, most of the conversation about what constitutes an Occurrence really boils down to the same thing. So I have sat down and tried to summarize what seems to me to be a consensus about Occurrences. To follow my points, please refer to the diagram at: http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif
Consensus on relationships
1. The fundamental definition of an Occurrence involves evidence that a representative of a taxon occurred at a place and time.
Note 1.A: For clarity, I have modified John's statement in his last email by replacing "taxon" with "representative of a taxon". I'm considering a taxon to be an abstract concept that is applied to individuals or groups of organisms.
Note 1.B: This definition is far more useful than the official definition of the class Occurrence "The category of information pertaining to evidence of an occurrence..." which is essentially circular.
Note 1.C: This statement is extremely broad because the evidence could be of many sorts, the representative could range from a single individual to all organisms on the earth, the taxon could be anyone's definition at any taxonomic level, the place could range from a GPS point with uncertainty of less than 10 meters to the entire planet earth, and the time could range from a shutter click of less than one second to 3.4 billion years.
2. The diagram is an attempt to summarize in pictorial form statements and relationships that have been described in the thread. The taxon representative is recorded as existing at a particular time and place (the arrow) and the result is an Occurrence record. That Occurrence record exists as metadata which may be associated with a token that can be used to voucher the fact that the taxon representative existed. That token may be the organism itself (or a living part of it as in a twig for grafting), all or part of the organism in preserved form, an electronic representation such as an image or sound recording, and other kinds of things like tissue or DNA samples. There may also be no token at all, in which case we call the Occurrence record an observation. Based on direct observation of the taxon representative, examination of one or more tokens, or both, some determiner asserts that a taxon concept applies to the taxon representative and as a result a scientific name can be used to "identify" the taxon representative. (There may be a lot of other complicated stuff above the Identification box, but that will have to be filled in by the taxonomists.)
Note 2.A: I have mapped onto this diagram the letters that John used in his last email to refer to entities that are involved in an Occurrence (T, E, L, O, and G). I will beg the forgiveness of fossil people because I don't really know how the geological context fits in. I'm assuming that it is a way of asserting time and location on a much broader scale than we do for extant organisms.
Note 2.B: I have put a dotted line around the part of the diagram that I think includes all the things that people might consider part of the Occurrence itself. I have left out "T" and the other parts related to identification because it seems to me that you can have an occurrence that you document which does not yet (and perhaps never will) have an identification. The Occurrence still asserts that a taxon representative existed at a time and place; we just don't yet know what the taxon is.
3. The red lines indicate the relationships that connect the various entities (I'm going to go ahead and call them resources). Consistent with popular opinion, the Occurrence record is the center of the universe and most things are connected to it.
Note 3.A: I am sticking to my guns and refuse to connect the Identification directly to the Occurrence. It is the taxon representative that is being identified, not the occurrence. One can assert another sort of relationship between the identification and the occurrence if one wants to say that one consulted the occurrence metadata and token in order to decide about the identification, but it is not correct to say that the Identification identifies either the Occurrence metadata or the token (as Rich pointed out).
OK, so that's step one - defining what is related to what. If anyone disagrees with these relationships, please clarify or create your own diagram.
Complicating circumstances/caveats
1. It is noted and recognized that some users will not care to include all of these relationships in their models. In the interest of simplification or "flattening" the relationships, they may wish to collapse some parts of this diagram (e.g. incorporate time and location metadata within the Occurrence metadata rather than considering them separate resources, applying scientific names directly to the taxon representatives without defining a taxon concept or recording the determination metadata, connecting identifications directly to the occurrence, etc.). This doesn't mean that the relationships don't exist, it just means that some users don't care about them.
2. It is recognized that different users will be interested in or able to specify the various resources to differing degrees of precision. Examples: A photographer might record times to the nearest second, a collector may only be interested in noting the date on which a specimen was collected. A location may be specified to the precision of a GPS reading or be defined as some geographic or political subdivision. The taxon representative may be an individual organism, a flock or clump, or some larger aggregation of taxon representatives.
That's step two. If I've missed any complications, please point them out.
My opinions about the implications of this diagram
1. The circle I've labeled as "taxon representative" is the resource type that I'm proposing to be represented by the class Individual. You will note that in both the definition of dwc:individualID ("An identifier for an individual or named group of individual organisms...") and the proposed class definition ("The category of information pertaining to an individual or named group of individual organisms represented in an Occurrence"), groups of individual organisms are included. Thus John's example of a fossil having myriad individuals, or Richard's examples of thousands of plankton, a large school of fish, herd of wildebeest, flock of birds, could all be categorized as "Individual" under this definition if there is a reasonable expectation that all of the individuals in the group are members of the same taxon. Perhaps there is a better name for this resource, but since dwc:individualID was already extant, I chose Individual as the class name for consistency with the pattern established with other classes and their associated xxxxID terms.
2. Although in note 1.C. I have given the ranges of the various resources to their logical extreme (as was done previously in the thread), I think that as a practical matter we can adopt guidelines to set reasonable values for the "normal" ranges of the resources. One such guideline might be that we suggest a range that can accommodate about 95% of the user needs within the community (this came from Rich's comment about satisfying 95% of the user need with an establishmentMeans controlled vocabulary). For example, it was suggested that the range for the location of an Occurrence could span the entire planet Earth. True enough, but virtually nobody would find such a span useful. 95% of users would probably find a range between a GPS reading with 10 meter precision and the extent of a county or province useful for recording the location of an Occurrence. I can suggest similar "useful" ranges: one second to one day for an event time (excluding fossils), one individual organism to the number of organisms that would fit within a 50 meter radius for an "individual", and taxon identified to family for plants and maybe mammals, genus for birds, and order for insects. So framing the definition of an Occurrence in these terms it would be something like: "An occurrence involves evidence (consisting of a physical token, electronic record, or personal observation) that a representative (ranging from a single individual to the number that would fit on a football field) of a taxon (hopefully identified to some lower taxonomic level) occurred at a place (determined to a precision between that of a GPS reading and the size of a county/province) and time (spanning one second to one day)." A few people might object to this level of restrictiveness, but I would guess that it would make 95% of us happy.
3. With the exception of the "missing" class Individual, every resource type on this diagram except for the "token" and Scientific name has a Darwin Core class. Every resource type on the diagram except for "token" has a dwc:xxxxID term that can be used to refer to a GUID for the resource. The implication of this is that any resource on this diagram except for the token and taxon representative (i.e. Individual) is ready to be represented in RDF by Darwin Core terms in the sense that the relationships (red lines) can be represented by the xxxxID terms and that the resources can be rdf:type'd using Darwin Core classes. (Lacking a class for the scientific name doesn't seem like a big deal to me since the scientific name can be a string literal - but then I'm not a taxonomist.)
4. OK, I've avoided it as long as I can, so I'm going to confess now to the RDF-phobes. The red lines and shapes are something pretty close to an RDF graph. What that means is that if the community can agree that this diagram correctly represents the relationships among the kinds of biodiversity resources that we care about, then the matter of providing guidelines on how to represent Darwin Core in RDF suddenly gets a lot simpler. Just convert the "picture" of the RDF graph into XML format and we have a template. Alright, that's an oversimplification, but I think it is essentially true because the most difficult part of achieving a consensus on RDF representations is to decide how we connect the resource types, not on the literals that we hang onto resources as properties.
5. While I'm beating the RDF drum again, the importance of my opinion number 2 can be extended into the GUID adoption process. In my comments to Kevin about the Beginner's Guide to Persistent Identifiers, I think I commented on the question of how one decides whether a GUID needs to be assigned to something or not. I believe that the answer to that question boils down to this: we need a GUID for any resource that will be referenced by more than one other resource. Do we need to be able to assign a GUID to Taxon concepts? Yes, because it is likely that many identifications will want to reference a particular taxon concept. Do we need to be able to assign a GUID to an Event? Maybe or maybe not. If every occurrence has its own separate time recorded, then no GUID is needed because the time is just a part of every separate occurrence record. If the event is defined to be a time range that represents a collecting trip, then there may be many Occurrences that are associated with that trip and all of them could reference the GUID for that event rather than repeating the event information for every Occurrence. The point here is that every shape (class of resources) on this diagram at least has the POTENTIAL to be a node connecting multiple resources and therefore should have the capability of being assigned a GUID, having its own RDF record, and being appropriately typed (presumably by a DwC class). So this is a final technical argument for why we need to have the DwC class Individual. Whether or not people ultimately choose to assign GUIDs to particular resource types is their own choice, but they need to at least be ABLE to if they need that resource to serve as a node given the structure of their metadata.
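The rule of thumb in point 5 (a resource needs a GUID when more than one other resource references it) can be sketched in a few lines of Python. The function name and all of the resource labels below are made up for illustration; they are not Darwin Core identifiers.

```python
def resources_needing_guids(links):
    """links: iterable of (referrer, referenced) pairs.

    Return, in sorted order, every referenced resource that is pointed
    to by more than one distinct referrer.
    """
    referrers = {}
    for referrer, referenced in links:
        referrers.setdefault(referenced, set()).add(referrer)
    return sorted(r for r, who in referrers.items() if len(who) > 1)


links = [
    ("occurrence1", "event:trip1"),   # two occurrences share a collecting trip
    ("occurrence2", "event:trip1"),
    ("occurrence3", "event:single"),  # this event belongs to one occurrence only
    ("identification1", "taxonConcept:A"),
    ("identification2", "taxonConcept:A"),
]

shared = resources_needing_guids(links)
```

In this sketch the shared collecting trip and the shared taxon concept are flagged, while the event that belongs to a single occurrence could simply be folded into that occurrence record.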
We need to clarify how the "token" thing fits in, but I'm stopping there for now. I would very much appreciate responses indicating that:
A. you agree with the diagram and connections (and consider this definition and diagram a consensus)
B. you disagree with the diagram (and articulate why)
C. you provide an alternative diagram or explanation of the relationships among the classes related to Occurrences.
Thanks for your patience with another tome.
Steve
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base / GeoSpecies Knowledge Base About the GeoSpecies Knowledge Base
The dwc:identificationReferences might include taxonomic treatments, but they might also include something such as an expert range map or a field guide, which one would never find as the objects of a dwc:nameAccordingTo.
Thanks, John. This answers my earlier question about why one might include more than one item in identificationReferences. That is, not necessarily because multiple TNUs (representing possible incongruent taxon concepts) are referred to, but rather non-TNU-based references may also play a role in the identification.
Aloha, Rich
Hi Steve,
You need to fix this in two ways (independent of the vocab, which I did not check)
1) It should show up correctly in URIburner.
http://linkeddata.uriburner.com/about/html/http/bioimages.vanderbilt.edu/rdf...
http://linkeddata.uriburner.com/about/html/http/bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf
2) In the description of the RDF itself (in your example it is at the bottom), you need to make a foaf:topic link between that element and each of the entities that start with "rdf:about". This will allow you to find the actual RDF page that describes these. To get the link back from the entity to the page, add a "foaf:page" that points back to the RDF.
Remember that in the cloud or in your triple store, entities like <http://www.cyberfloralouisiana.com/specimens/lsu000/0428#ind> are not tied to the RDF that contains statements about them without some link to and from the page http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf
* You could get the same result by using "dcterms:references" and its inverse "dcterms:isReferencedBy", but let me run that past someone to see if it is equally accepted.
Here is an abbreviated version of what this might look like:
<rdf:Description rdf:about="http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf">
  <dcterms:description>RDF formatted description of the preserved specimen http://www.cyberfloralouisiana.com/specimens/lsu000/0428</dcterms:description>
  <dcterms:modified>2010-09-25T06:35:58</dcterms:modified>
  <xmp:MetadataDate>2010-09-25T06:35:58</xmp:MetadataDate>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#ind"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#39265b"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#39231b"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#39231a"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#39265a"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#img"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#bq"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#ind"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#bq"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#tn"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#lq"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#gq"/>
</rdf:Description>

<rdf:Description rdf:about="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#ind">
  <foaf:page rdf:resource="http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf"/>
</rdf:Description>
Following this pattern, your RDF will be browsable as in this example:
http://linkeddata.uriburner.com/about/html/http/lod.taxonconcept.org/rdf/are...
Note how you can click back and forth between the location and the RDF that describes it.
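For anyone generating these links rather than typing them by hand, here is a minimal Python sketch that builds the two-way foaf:topic / foaf:page pattern with the standard library's xml.etree.ElementTree. The function name is an assumption, only three of the entity URIs from the example are used, and the namespace URIs are the usual RDF and FOAF ones.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("foaf", FOAF)


def link_document_and_entities(doc_uri, entity_uris):
    """Build rdf:RDF with document -> entity (foaf:topic) and
    entity -> document (foaf:page) links."""
    root = ET.Element(f"{{{RDF}}}RDF")
    # The RDF document lists every entity it describes as a topic.
    doc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": doc_uri})
    for uri in entity_uris:
        ET.SubElement(doc, f"{{{FOAF}}}topic", {f"{{{RDF}}}resource": uri})
    # Each entity points back to the document that describes it.
    for uri in entity_uris:
        entity = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": uri})
        ET.SubElement(entity, f"{{{FOAF}}}page", {f"{{{RDF}}}resource": doc_uri})
    return root


doc_uri = "http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf"
base = "http://www.cyberfloralouisiana.com/specimens/lsu000/0428"
root = link_document_and_entities(doc_uri, [base, base + "#ind", base + "#img"])
xml_text = ET.tostring(root, encoding="unicode")
```

The resulting xml_text holds only the linking statements; in a real file they would sit alongside the existing descriptions of each entity.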
- Pete
On Mon, Oct 18, 2010 at 11:49 AM, Steve Baskauf < steve.baskauf@vanderbilt.edu> wrote:
I've fallen behind on systematically perusing the list responses, but I would like to focus in on a point that seems to be a consensus in the responses that have shown up recently. The consensus seems to be that documenting determinations (a.k.a. instances of dwc:Identification class) that are applied to Individuals (or Occurrences if you don't believe in Individuals) is the way to go. So in my usual graphical way of thinking about this, I would draw a "relationship line" from the determination to the Individual (or Occurrence) on one side and from the determination to the species concept on the other. I will leave it up to the taxonomy people to decide what different things would be connected to the species concept and how all of their lines would be connected. The determination would have any of the properties that are terms listed in the dwc:Identification class (identifiedBy, dateIdentified, identificationReferences, identificationRemarks, identificationQualifier, and typeStatus). Some properties like dateIdentified and identificationReferences would be string literals and others (especially identifiedBy) should probably be GUIDs but could be literals if they had to be.
That all seems pretty clear. However, when I've started trying to do this in real life, I immediately have questions. Take a look at http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf which should show up as a web page in your browser.
1. The original label identifies the species as Juncus diffusissimus. However, there is no indicator as to who originally identified it or when. My assumption is that it was the collector (Glen N. Montz) but I don't really know that. Do I assume that, or list the original determiner as "unknown"?
2. Do we draw a distinction between the initial identification and subsequent annotations? I think the answer should be "no" and that's why I refer to both generically as "determinations".
3. There is really no indication given on the annotation labels as to many of the things that we would like to know, such as the concept they had in mind, any source they used (if any), or the reason why they did the annotation. So how does one connect the name that they applied to the determination when there is no indication of the concept? Is this just something we can't do for old annotations and just something that we try to do from this point forward?
4. The last question is one that I really want some opinions about. It seems to me that there are a number of reasons why one would apply a determination. One would be to correct an actual error in identification. One would be to increase the precision of a previous determination (e.g. an insect identified to family now is identified to species). One would be to assert a difference in opinion as to the correct way to group this individual with others (i.e. as in a taxonomic revision). Finally, a single determiner might apply several determinations to one individual and indicate in each determination the concept intended (i.e. if you subscribe to Cronquist, you'd call it X; if you like Radford's book, you'd call it Y; if you like Weakley's treatment, you'd call it Z). Some of these four reasons may be functionally equivalent, but how would you use Darwin Core to indicate the reason why you applied the determination? Please don't say "identificationRemarks"! From a machine-processing standpoint, this is something we should know and there should be some kind of controlled vocabulary to express it. For instance, if an identification is "deprecated" because it was in error (perhaps by the determiner him/herself), one would like the incorrect determination to show up in the historical metadata, but I wouldn't want it to be listed in a website index. The same would hold true if an annotator was able to pin the taxon down to a lower taxonomic level than the original identifier. If someone goes to the trouble to connect an Individual/Occurrence to several names under alternative concepts, there should be a way that a machine would know this so that a software user could select the concept they wanted to use and the name under that concept would pop up.
I don't really see any term under the current DwC that could be used to do this last thing. Am I missing something? Do we need several terms to explain the reason why we made the determination because the reasons fall into different categories?
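To make the question concrete, here is a rough Python sketch of what machine-readable "reasons" could look like. Everything in it is hypothetical: the Reason vocabulary, the supersedes field, and the placeholder names are assumptions for illustration and do not correspond to any existing Darwin Core term.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Reason(Enum):
    """Hypothetical controlled vocabulary; not part of Darwin Core."""
    INITIAL = "initial"
    CORRECTS_ERROR = "correctsError"
    INCREASES_PRECISION = "increasesPrecision"
    TAXONOMIC_OPINION = "taxonomicOpinion"
    ALTERNATIVE_CONCEPT = "alternativeConcept"


@dataclass
class Determination:
    individual_id: str
    scientific_name: str
    reason: Reason
    name_according_to: Optional[str] = None  # concept source, if stated
    supersedes: Optional[int] = None         # position of an earlier determination


def names_for_index(determinations):
    """Names to list in a website index: drop determinations replaced by a
    correction or by a more precise determination, keep everything else."""
    replacing = (Reason.CORRECTS_ERROR, Reason.INCREASES_PRECISION)
    hidden = {
        d.supersedes
        for d in determinations
        if d.reason in replacing and d.supersedes is not None
    }
    return [d.scientific_name for i, d in enumerate(determinations) if i not in hidden]


history = [
    Determination("ind-1", "Name A", Reason.INITIAL),
    Determination("ind-1", "Name B", Reason.CORRECTS_ERROR, supersedes=0),
    Determination("ind-1", "Name C", Reason.ALTERNATIVE_CONCEPT,
                  name_according_to="Treatment Z"),
]
index_names = names_for_index(history)
```

The full history list still holds all three determinations, so the deprecated one stays in the historical metadata while the index shows only the other two.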
The other comment that I'll throw out (since this is going out to the bioblitz list as well as to tdwg-content) is that those of you who are building apps to collect metadata in the field really need to separate the process of entering (or acquiring) the collection metadata from the determination process. In at least some apps, the user immediately has to commit to a taxon as they enter the data at the time of collection. It seems to me that it would be a very common situation (especially in the case of "citizen science") that the collector/observer/photographer would have no idea what the taxonomic identity was at the time of collection. The process of determination (and the recording of the various dwc:Identification class terms) is really a separate process that should be able to happen at the time of collection OR later.
Steve
Peter DeVries wrote:
Hi Steve,
I would hypothesize that for the vast majority of identified records the process is something like this:
- An individual uses some sort of key to determine what species (taxon concept) to assign to a given individual.
- They may have created some sort of mental key in which, once they recognize one individual mosquito, they can then pretty quickly sort a number of individuals into collections.
- The actual name they assign to the specimen is usually based on what their key says the name is. Often this does not specify the authorship. Most of these human identifiers have not read the original species descriptions for the species they are identifying. So the specimen is actually tied to a concept that is based more on the "key" than the original description. * An exception would be where there is a key in the original description and that was what was used.
- So in a sense, modeling this as if the identifier actually asserted that the concept was the same as that described by the original description or a subsequent revision is "fudging".
Side effects of this process include:
- A new key for North American Mosquitoes comes out that incorporates recent changes in nomenclature. The major change is the elevation of a subgenus to a genus. For most of the species described, the "key concept" is unchanged.
Student identifier Bob, in state X, is using the latest key, while student identifier Joe, in state Z, is using a slightly older edition of the same key.
Bob identifies the species as *Ochlerotatus triseriatus*, while Joe identifies what should be the same species as *Aedes triseriatus*.
These show up in GBIF on two different maps, they show up in the EOL as two different pages.
Various TDWG'ers continue to argue that the original description and subsequent revisions were really important in determining what these individuals actually meant when they assigned a name to a specimen, and that this is how we should model it in excruciating detail.
I would argue this should be modeled as best as possible to what actually happens.
For example, how many of the species observed in the recent BioBlitz were identified by referring to the original species description or subsequent revisions?
In your diagram, I would suggest that you show that a taxon concept may have many names associated with it. Since it is not clear what the identifier intended by his or her choice of a name, it is often difficult to determine what taxon concept they actually meant.
This is why I advocate a move to a more taxon concept based identifier to link these data sets together, because this allows the intent of the identifier to be modeled more accurately.
This would be done in the form of:
"I assert that this specimen (of what I call *Aedes triseriatus*) was observed here. I also assert that it is an instance of this species concept => URI"
Or: I assert that this is an individual of the type "Individual of species concept X" => URI
All of these are instances of the class "Individual"
So the resulting DarwinCore record would contain both the name and an optional, but I think needed, asserted species concept.
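The Bob and Joe case can be sketched in Python to show why the extra concept link matters. The concept URI below is a made-up placeholder, and putting it in a `taxonConceptID` field is an assumption about how such a record might be laid out, not a statement of how any aggregator works.

```python
from collections import defaultdict
from typing import Dict, List

# Placeholder URI for illustration only; not a real species concept identifier.
CONCEPT = "http://example.org/concept/triseriatus"

records = [
    {"identifiedBy": "Bob", "stateProvince": "X",
     "scientificName": "Ochlerotatus triseriatus", "taxonConceptID": CONCEPT},
    {"identifiedBy": "Joe", "stateProvince": "Z",
     "scientificName": "Aedes triseriatus", "taxonConceptID": CONCEPT},
]


def group_by(rows: List[dict], key: str) -> Dict[str, List[dict]]:
    """Group records on the value of one field."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


by_name = group_by(records, "scientificName")      # two groups: two maps, two pages
by_concept = group_by(records, "taxonConceptID")   # one group: one map
print(len(by_name), len(by_concept))
```

Grouping on the name string splits what both identifiers meant as one species, while grouping on the asserted concept keeps the two records together.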
The species concept is a subclass of taxon concept, but is fundamentally different than the higher clades.
There are some guidelines as to what an entity needs to be considered a species, while there are no real guidelines as to what clades should be considered genera and what clades should be considered families, etc.
Assigning properties at the level of genera or family is also problematic because it assumes that there will be inferencing, and it will require rechecking that those properties are still valid if the species within that genus change.
So if there is some property that is common to all the species in the genus, make that a property of each of the individual species - not a property of the genus.
Respectfully,
- Pete
On Fri, Oct 15, 2010 at 10:45 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:
As a background to this post, I want to reference a post by Bob called "SubclassOrNot". I discovered this page on an early foray into the TDWG website labyrinth and it has been very influential on my thinking since then. The idea Bob discusses is central to what I'm writing below so if you haven't read it you might want to do so first. You can probably skip the "OWL Inference" section and still get the point which is described in the first two sections of his post. The URL for the page is http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot .
To preface what I'm going to say below, I want to put Darwin Core Occurrences in the context of what Bob wrote. In my mind, one of the hallmarks of the Darwin Core standard and one thing that makes it a great improvement over previous versions is that the decision was made to use what Bob called the "has a" approach rather than the "is a" approach. In particular, the Darwin Core standard has a single class called dwc:Occurrence rather than subclasses called "Specimen", "Observation", and other possible things. The way that we differentiate among different kinds of Occurrences is by using the DwC types which are the controlled values for the term dwc:basisOfRecord. Thus we say an Occurrence "has a" basisOfRecord=PreservedSpecimen rather than saying it "is a" PreservedSpecimen. We say an Occurrence "has a" basisOfRecord=HumanObservation rather than saying it "is a" HumanObservation. This approach has greatly reduced the number of different terms in the standard since we don't have to have separate "ObservedBy" and "CollectedBy" terms, but rather can just have a single "RecordedBy" term that applies to both specimens and observations. The same thing applies to many other things, like eventDate rather than DateCollected and DateObserved, locality rather than collectionLocality and observationLocality, etc. With the ratification of Darwin Core, this decision is now a fait accompli and not a subject of discussion or something optional for users of the standard. It also seems to be clear that, as necessary, new terms can be added to the DwC types which would then be valid controlled values for basisOfRecord.
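As a rough illustration of the "has a" pattern, the Python sketch below uses one `Occurrence` class whose kind is carried by a controlled `basis_of_record` value instead of by subclasses. The class layout and the `BASIS_OF_RECORD` set are assumptions made for this example; the set holds only the three values named in this message, not the full DwC type vocabulary.

```python
from dataclasses import dataclass

# Controlled values mentioned in this message; the real vocabulary is longer.
BASIS_OF_RECORD = {"PreservedSpecimen", "LivingSpecimen", "HumanObservation"}


@dataclass
class Occurrence:
    """One class for every kind of Occurrence ("has a" basisOfRecord)."""
    basis_of_record: str
    recorded_by: str      # replaces separate CollectedBy / ObservedBy terms
    event_date: str       # replaces DateCollected / DateObserved
    locality: str         # replaces collectionLocality / observationLocality

    def __post_init__(self) -> None:
        if self.basis_of_record not in BASIS_OF_RECORD:
            raise ValueError(f"unknown basisOfRecord: {self.basis_of_record}")


# Made-up records: the same three shared fields serve both kinds.
specimen = Occurrence("PreservedSpecimen", "Collector A", "2010-05-01", "Locality 1")
sighting = Occurrence("HumanObservation", "Observer B", "2010-05-02", "Locality 2")
```

Under this pattern, supporting a new kind of token means adding one value to the controlled set, whereas the "is a" pattern would need a new subclass and possibly new per-subclass terms.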
Since the adoption of the DwC standard, the approach to Occurrences has been what I would describe as "I know an Occurrence when I see one". I consider this a pretty sloppy practice and, as I indicated in my post last night, I think there is enough consensus about what an Occurrence is that we can come up with a better definition than "an occurrence is the category of information pertaining to evidence of an occurrence...". Another part of what I would characterize as sloppiness is the lack of a clear definition of what exactly basisOfRecord means. When I wrote my attempt at summarizing consensus last night, I dodged the question about what I called the "token". This "thing" has been called various names. In the previous discussion on the list, it was sometimes called "the evidence" of the occurrence. In the past I have called it "a representation" - however, I now think the term "token" is better because "representation" has a different technical meaning in the context of content negotiation. When we type an Occurrence by saying it has a basisOfRecord=PreservedSpecimen, we are saying that this Occurrence has as supporting evidence, or as a "token" if you prefer, all or part of the dead remains of the organism (i.e. what I'm calling "the Individual") that was being documented by the Occurrence. When we type an Occurrence by saying it has a basisOfRecord=LivingSpecimen, we are saying that this Occurrence has as a "token" the entire organism that was being documented (or some vegetative part of the live organism that was propagated). When we type an Occurrence by saying it has a basisOfRecord=HumanObservation, we are saying that the Occurrence has no supporting evidence other than the reputation of the observer to accurately record the metadata about the Occurrence. In other words, we "tag" an instance of a core class (to use Bob's words), Occurrence, by telling a metadata consumer what kind of token we are using as evidence of the Occurrence.
A fundamental part of creating a clear definition of what an Occurrence is, is to define exactly what we are including in the concept of Occurrence. One possibility is to (1) say that the two boxes at the right side of the diagram at http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif are fused and that both the Occurrence metadata and its associated token are what we consider to be "the Occurrence". Another approach (2) would be to say that the actual Occurrence as an entity is only the metadata part and that the token is a separate thing. A third approach is to say (3) that everything within the blue dotted lines is considered a part of the Occurrence (i.e. the metadata, the token, the event, and the locality). I don't think in an absolute sense, any one of these approaches is "right". The problem is that these approaches are used inconsistently, sometimes even by the same person, depending on the basisOfRecord. Differences in ways of thinking about this issue are part of why people aren't understanding the way other people are approaching the structuring of metadata. I have tried to consistently take the approach (1) that the two boxes on the right are fused, i.e. that the Occurrence metadata and the token should both be considered part of the entity that we call "an Occurrence". I think this is why Rich was confused in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001666.html when I said that it was "wrong" to assert that a scientific name is a property of an Occurrence - obviously it is silly to say that the token (photons on a film, sound patterns in a digital file) has a scientific name. Yet that is exactly what people do routinely when the token is a branch cut off a tree and glued to a piece of paper. They say that they are "identifying a specimen". What I am asking (actually demanding) is that the TDWG community get its act together and come to some consistency on this.
If we are going to take the approach (2), then we need to take specimens off their pedestal and treat them like we do any other token that we are using as evidence that an Occurrence happened. If we are going to do what was suggested for the BioBlitz in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001603.html, i.e. to call Occurrences "observations" and then link the tokens to them by associatedMedia, ResourceRelationship, or some other means (approach 2) then do it consistently for every kind of token, including specimens, and don't single out media tokens for punishment. I have in a sense "thrown down the gauntlet" on this issue by proposing that DigitalStillImage be added as a DwC type and as a controlled value for basisOfRecord (http://code.google.com/p/darwincore/issues/detail?id=68). I know what some people are going to say in response to this proposal. "Why do you need to have 'DigitalStillImage' as a value for basisOfRecord when you can just say that the resource's dcterms:type=StillImage?" The answer goes back to Bob's point. If we are going to go the "has a" path (which we already have in DwC for Occurrences) rather than subclassing everything, then we need to provide an appropriate value for the "tag" for any type of resource that a reasonable number of users will want to use as a token. I think it is clear from this and other Bioblitzes, my work in Bioimages, the whale tracking project, and many other examples, that there are plenty of people who are already using DigitalStillImages as tokens and we all need a controlled value to use for basisOfRecord. The other thing that we accomplish when we type an Occurrence by its basisOfRecord is to tell a consumer what kind of metadata to expect to get about the token in addition to the generic metadata that is provided for all Occurrences. Thus for a LivingSpecimen we expect to be told what zoo, botanical garden, bacterial collection, etc. contains the specimen. 
For a PreservedSpecimen we expect to be told the preparation type, the location of the repository, etc. For a DigitalStillImage we expect to be told the file type, accessURL, etc. Simply providing a value for dcterms:type=StillImage doesn't indicate whether the image is a physical one (i.e. on film) or a digital one. It is also unreasonable to expect a client to have to be checking two different terms (basisOfRecord and dcterms:type) to find out what they could learn from one (basisOfRecord). Of course it would be advisable to provide a value for dcterms:type as well for clients outside the biodiversity community who may not "understand" what basisOfRecord means. I hate to keep bringing my posts back to the RDF issue, but thinking about how one would write RDF forces clear thinking about how metadata should be structured. If we intend to separate tokens as entities from their associated Occurrence metadata, i.e. approach (2), then we open up a whole other can of worms. To associate the occurrence resources (i.e. the metadata) with the "different" resource (i.e. the token), we will probably have to be able to create URIs for the tokens and separate RDF metadata blocks which will have to be rdfs:type'd. What are we going to use for that rdfs:type - create another Darwin Core class? I simply don't think that is a complicated road that we want to travel. It would be far easier to just say that every Occurrence has a one-to-one relationship with its token (which could be "the empty set" for observations). This would not work for people who want to hang multiple tokens on a single observation event, but I think that itself is a bad idea because it makes it even harder to have "flat" occurrence datasets. Just say that every time we collect a different token (or make an observation that has no token), it is a new Occurrence record. Realistically, a single collector can't actually take a picture of a plant at the same time he or she collects it for a specimen anyway. Those really should be considered two different events because they happen at different times.
OK, enough said. Consider this my defense of my proposal "issue 68" to add DigitalStillImage. I would urge the powers that be to respond to the issues that I've raised here before having any kind of "vote" (or whatever is ultimately going to happen when there is an up or down decision about the proposal).
Steve
Steve Baskauf wrote:
After the flurry of emails recently, I had an opportunity to carefully read all the way through the threads again, followed by enforced "think time" during my long commute. I was actually pretty cheerful after that because I think that in essence, most of the conversation about what constitutes an Occurrence really boils down to the same thing. So I have sat down and tried to summarize what seems to me to be a consensus about Occurrences. To follow my points, please refer to the diagram at: http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif
Consensus on relationships
1. The fundamental definition of an Occurrence involves evidence that a representative of a taxon occurred at a place and time. Note 1.A: For clarity, I have modified John's statement in his last email by replacing "taxon" with "representative of a taxon". I'm considering a taxon to be an abstract concept that is applied to individuals or groups of organisms. Note 1.B: This definition is far more useful than the official definition of the class Occurrence "The category of information pertaining to evidence of an occurrence..." which is essentially circular. Note 1.C: This statement is extremely broad because the evidence could be of many sorts, the representative could range from a single individual to all organisms on the earth, the taxon could be anyone's definition at any taxonomic level, the place could range from a GPS point with uncertainty of less than 10 meters to the entire planet earth, and the time could range from a shutter click of less than one second to 3.4 billion years.
2. The diagram is an attempt to summarize in pictorial form statements and relationships that have been described in the thread. The taxon representative is recorded as existing at a particular time and place (the arrow) and the result is an Occurrence record. That Occurrence record exists as metadata which may be associated with a token that can be used to voucher the fact that the taxon representative existed. That token may be the organism itself (or a living part of it as in a twig for grafting), all or part of the organism in preserved form, an electronic representation such as an image or sound recording, and other kinds of things like tissue or DNA samples. There may also be no token at all, in which case we call the Occurrence record an observation. Based on direct observation of the taxon representative, examination of one or more tokens, or both, some determiner asserts that a taxon concept applies to the taxon representative and as a result a scientific name can be used to "identify" the taxon representative.
(There may be a lot of other complicated stuff above the Identification box, but that will have to be filled in by the taxonomists.) Note 2.A: I have mapped onto this diagram the letters that John used in his last email to refer to entities that are involved in an Occurrence (T, E, L, O, and G). I will beg the forgiveness of fossil people because I don't really know how the geological context fits in. I'm assuming that it is a way of asserting time and location on a much broader scale than we do for extant organisms. Note 2.B: I have put a dotted line around the part of the diagram that I think includes all the things that people might consider part of the Occurrence itself. I have left out "T" and the other parts related to identification because it seems to me that you can have an occurrence that you document which does not yet (and perhaps never will) have an identification. The Occurrence still asserts that a taxon representative existed at a time and place; we just don't yet know what the taxon is. 3. The red lines indicate the relationships that connect the various entities (I'm going to go ahead and call them resources). Consistent with popular opinion, the Occurrence record is the center of the universe and most things are connected to it. Note 3.A: I am sticking to my guns and refuse to connect the Identification directly to the Occurrence. It is the taxon representative that is being identified, not the occurrence. One can assert another sort of relationship between the identification and the occurrence if one wants to say that one consulted the occurrence metadata and token in order to decide about the identification, but it is not correct to say that the Identification identifies either the Occurrence metadata or the token (as Rich pointed out).
OK, so that's step one - defining what is related to what. If anyone disagrees with these relationships, please clarify or create your own diagram.
Complicating circumstances/caveats
1. It is noted and recognized that some users will not care to include all of these relationships in their models. In the interest of simplification or "flattening" the relationships, they may wish to collapse some parts of this diagram (e.g. incorporate time and location metadata within the Occurrence metadata rather than considering them separate resources, applying scientific names directly to the taxon representatives without defining a taxon concept or recording the determination metadata, connecting identifications directly to the occurrence, etc.). This doesn't mean that the relationships don't exist, it just means that some users don't care about them.
2. It is recognized that different users will be interested in or able to specify the various resources to differing degrees of precision. Examples: A photographer might record times to the nearest second, a collector may only be interested in noting the date on which a specimen was collected. A location may be specified to the precision of a GPS reading or be defined as some geographic or political subdivision. The taxon representative may be an individual organism, a flock or clump, or some larger aggregation of taxon representatives.
That's step two. If I've missed any complications, please point them out.
My opinions about the implications of this diagram
1. The circle I've labeled as "taxon representative" is the resource type that I'm proposing to be represented by the class Individual. You will note that in both the definition of dwc:individualID ("An identifier for an individual or named group of individual organisms...") and the proposed class definition ("The category of information pertaining to an individual or named group of individual organisms represented in an Occurrence"), groups of individual organisms are included. Thus John's example of a fossil having myriad individuals, or Richard's examples of thousands of plankton, a large school of fish, herd of wildebeest, flock of birds, could all be categorized as "Individual" under this definition if there is a reasonable expectation that all of the individuals in the group are members of the same taxon. Perhaps there is a better name for this resource, but since dwc:individualID was already extant, I chose Individual as the class name for consistency with the pattern established with other classes and their associated xxxxID terms.
2. Although in note 1.C. I have given the ranges of the various resources to their logical extreme (as was done previously in the thread), I think that as a practical matter we can adopt guidelines to set reasonable values for the "normal" ranges of the resources. One such guideline might be that we suggest a range that can accommodate about 95% of the user needs within the community (this came from Rich's comment about satisfying 95% of the user need with an establishmentMeans controlled vocabulary). For example, it was suggested that the range for the location of an Occurrence could span the entire planet Earth. True enough, but virtually nobody would find such a span useful. 95% of users would probably find a range between a GPS reading with 10 meter precision and the extent of a county or province useful for recording the location of an Occurrence.
I can suggest similar "useful" ranges: one second to one day for an event time (excluding fossils), one individual organism to the number of organisms that would fit within a 50 meter radius for an "individual", and taxon identified to family for plants and maybe mammals, genus for birds, and order for insects. So framing the definition of an Occurrence in these terms it would be something like: "An occurrence involves evidence (consisting of a physical token, electronic record, or personal observation) that a representative (ranging from a single individual to the number that would fit on a football field) of a taxon (hopefully identified to some lower taxonomic level) occurred at a place (determined to a precision between that of a GPS reading and the size of a county/province) and time (spanning one second to one day)." A few people might object to this level of restrictiveness, but I would guess that it would make 95% of us happy. 3. With the exception of the "missing" class Individual, every resource type on this diagram except for the "token" and Scientific name has a Darwin Core class. Every resource type on the diagram except for "token" has a dwc:xxxxID term that can be used to refer to a GUID for the resource. The implication of this is that any resource on this diagram except for the token and taxon representative (i.e. Individual) is ready to be represented in RDF by Darwin Core terms in the sense that the relationships (red lines) can be represented by the xxxxID terms and that the resources can be rdfs:type'd using Darwin Core classes. (Lacking a class for the scientific name doesn't seem like a big deal to me since the scientific name can be a string literal - but then I'm not a taxonomist.) 4. OK, I've avoided it as long as I can, so I'm going to confess now to the RDF-phobes. The red lines and shapes are something pretty close to an RDF graph. 
What that means is that if the community can agree that this diagram correctly represents the relationships among the kinds of biodiversity resources that we care about, then the matter of providing guidelines on how to represent Darwin Core in RDF suddenly gets a lot simpler. Just convert the "picture" of the RDF graph into XML format and we have a template. Alright, that's an oversimplification, but I think it is essentially true because the most difficult part of achieving a consensus on RDF representations is to decide how we connect the resource types, not on the literals that we hang onto resources as properties. 5. While I'm beating the RDF drum again, the importance of my opinion number 2 can be extended into the GUID adoption process. In my comments to Kevin about the Beginner's Guide to Persistent Identifiers, I think I commented on the question of how one decides whether a GUID needs to be assigned to something or not. I believe that the answer to that question boils down to this: we need a GUID for any resource that will be referenced by more than one other resource. Do we need to be able to assign a GUID to Taxon concepts? Yes, because it is likely that many identifications will want to reference a particular taxon concept. Do we need to be able to assign a GUID to an Event? Maybe or maybe not. If every occurrence has its own separate time recorded, then no GUID is needed because the time is just a part of every separate occurrence record. If the event is defined to be a time range that represents a collecting trip, then there may be many Occurrences that are associated with that trip and all of them could reference the GUID for that event rather than repeating the event information for every Occurrence. 
The point here is that every shape (class of resources) on this diagram at least has the POTENTIAL to be a node connecting multiple resources and therefore should have the capability of being assigned a GUID, having its own RDF record, and being appropriately typed (presumably by a DwC class). So this is a final technical argument for why we need to have the DwC class Individual. Whether or not people ultimately choose to assign GUIDs to particular resource types or not is their own choice, but they need to at least be ABLE to if they need that resource to serve as a node given the structure of their metadata.
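The rule of thumb stated above (a resource needs a GUID when more than one other resource references it) is easy to express as a check over the relationship lines. The Python sketch below is an illustration under assumptions: the `needs_guid` function and the short identifiers in the example are invented, and a real dataset would carry these links in the xxxxID terms.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def needs_guid(references: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return every resource referenced by more than one other resource.

    `references` holds (referring resource, referenced resource) pairs,
    i.e. the relationship lines of the diagram.
    """
    referrers: Dict[str, Set[str]] = defaultdict(set)
    for source, target in references:
        if source != target:              # a self-reference does not count
            referrers[target].add(source)
    return {target for target, sources in referrers.items() if len(sources) > 1}


# Made-up example: two Occurrences share a collecting-trip Event, a third has
# its own Event, and two Identifications point at the same taxon concept.
links = [
    ("occ-1", "event-trip"),
    ("occ-2", "event-trip"),
    ("occ-3", "event-solo"),
    ("ident-1", "concept-A"),
    ("ident-2", "concept-A"),
]
print(sorted(needs_guid(links)))
```

In this example the shared trip event and the shared taxon concept act as nodes, while the event used by a single Occurrence could just as well be folded into that Occurrence record.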
We need to clarify how the "token" thing fits in, but I'm stopping there for now. I would very much appreciate responses indicating that:
A. you agree with the diagram and connections (and consider this definition and diagram a consensus)
B. you disagree with the diagram (and articulate why)
C. you provide an alternative diagram or explanation of the relationships among the classes related to Occurrences.
Thanks for your patience with another tome.
Steve
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base http://www.taxonconcept.org/ / GeoSpecies Knowledge Base http://lod.geospecies.org/ About the GeoSpecies Knowledge Base http://about.geospecies.org/
Pete, Sorry - time out. This was not intended as a functional RDF example - I only intended for people to look at the web page so they could see the specimen sheet and the metadata that shows up on the web page for the purposes of understanding the questions I was posing. The RDF was originally supposed to be put on www.cyberfloralouisiana.com, not bioimages.vanderbilt.edu, but that never was implemented, so the URIs in the RDF won't make sense at the actual current file location. That's why the page has a warning that it's for testing purposes only.
Steve
Peter DeVries wrote:
Hi Steve,
You need to fix this in two ways (independent of the vocab, which I did not check)
- It should show up correctly in URIburner.
http://linkeddata.uriburner.com/about/html/http/bioimages.vanderbilt.edu/rdf...
- In the description of the RDF itself (in your example it is at the bottom), you need to make a foaf:topic link between that element and each of the entities that start with "rdf:about". This will allow you to find the actual rdf page that describes these. To get the link back from the entity to the page add a "foaf:page" that points back to the RDF.
Remember that in the cloud or in your triple store entities like http://www.cyberfloralouisiana.com/specimens/lsu000/0428#ind are not tied to the RDF that contains statements about them, without some link to and from the page http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf
- You could get the same result by using "dcterms:references" and its inverse "dcterms:isReferencedBy", but let me run that past someone to see if it is equally accepted.
Here is an abbreviated version of what this might look like:
<rdf:Description rdf:about="http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf">
  <dcterms:description>RDF formatted description of the preserved specimen http://www.cyberfloralouisiana.com/specimens/lsu000/0428</dcterms:description>
  <dcterms:modified>2010-09-25T06:35:58</dcterms:modified>
  <xmp:MetadataDate>2010-09-25T06:35:58</xmp:MetadataDate>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#ind"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#39265b"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#39231b"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#39231a"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#39265a"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#img"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#bq"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#ind"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#bq"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#tn"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#lq"/>
  <foaf:topic rdf:resource="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#gq"/>
</rdf:Description>

<rdf:Description rdf:about="http://www.cyberfloralouisiana.com/specimens/lsu000/0428#ind">
  <foaf:page rdf:resource="http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf"/>
</rdf:Description>
Following this pattern, your RDF will be browsable as in this example:
http://linkeddata.uriburner.com/about/html/http/lod.taxonconcept.org/rdf/are...
Note how you can click back and forth between the location and the RDF that describes it.
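The reciprocal linking pattern above can also be sketched in code. This is only a minimal illustration, not the process used to generate the Bioimages files: the function name reciprocal_links is hypothetical, and the only things taken from the example are the foaf:topic / foaf:page pairing and the URIs.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"

# Ask ElementTree to serialize with the conventional prefixes.
ET.register_namespace("rdf", RDF)
ET.register_namespace("foaf", FOAF)


def reciprocal_links(document_uri, resource_uris):
    """Build an rdf:RDF element in which the document lists each resource
    as a foaf:topic and each resource points back via foaf:page.
    (Hypothetical helper, written for illustration only.)"""
    root = ET.Element(f"{{{RDF}}}RDF")

    # The RDF document describes every resource it is "about".
    doc = ET.SubElement(root, f"{{{RDF}}}Description",
                        {f"{{{RDF}}}about": document_uri})
    for uri in resource_uris:
        ET.SubElement(doc, f"{{{FOAF}}}topic", {f"{{{RDF}}}resource": uri})

    # Each resource links back to the document that describes it.
    for uri in resource_uris:
        res = ET.SubElement(root, f"{{{RDF}}}Description",
                            {f"{{{RDF}}}about": uri})
        ET.SubElement(res, f"{{{FOAF}}}page",
                      {f"{{{RDF}}}resource": document_uri})
    return root


if __name__ == "__main__":
    base = "http://www.cyberfloralouisiana.com/specimens/lsu000/0428"
    root = reciprocal_links(
        "http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf",
        [base + "#ind", base + "#img"],
    )
    print(ET.tostring(root, encoding="unicode"))
```

Generating both directions from one list of URIs keeps the two sets of links from drifting apart, which is what makes the back-and-forth browsing work.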
- Pete
On Mon, Oct 18, 2010 at 11:49 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:

I've fallen behind on systematically perusing the list responses, but I would like to focus in on a point that seems to be a consensus in the responses that have shown up recently. The consensus seems to be that documenting determinations (a.k.a. instances of the dwc:Identification class) that are applied to Individuals (or Occurrences if you don't believe in Individuals) is the way to go. So in my usual graphical way of thinking about this, I would draw a "relationship line" from the determination to the Individual (or Occurrence) on one side and from the determination to the species concept on the other. I will leave it up to the taxonomy people how the different things would be connected to the species concept and how all of their lines would be connected. The determination would have any of the properties that are terms listed in the dwc:Identification class (identifiedBy, dateIdentified, identificationReferences, identificationRemarks, identificationQualifier, and typeStatus). Some properties like dateIdentified and identificationReferences would be string literals and others (especially identifiedBy) should probably be GUIDs but could be literals if they had to be. That all seems pretty clear.

However, when I've started trying to do this in real life, I immediately have questions. Take a look at http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf which should show up as a web page in your browser.

1. The original label identifies the species as Juncus diffusissimus. However, there is no indicator as to who originally identified it or when. My assumption is that it was the collector (Glen N. Montz) but I don't really know that. Do I assume that, or list the original determiner as "unknown"?

2. Do we draw a distinction between the initial identification and subsequent annotations? I think the answer should be "no" and that's why I refer to both generically as "determinations".

3. There is really no indication given on the annotation labels as to many of the things that we would like to know, such as the concept the annotators had in mind, any source they used, or the reason why they did the annotation. So how does one connect the name that they applied to the determination when there is no indication of the concept? Is this just something we can't do for old annotations and just something that we try to do from this point forward?

4. The last question is one that I really want some opinions about. It seems to me that there are a number of reasons why one would apply a determination. One would be to correct an actual error in identification. One would be to increase the precision of a previous determination (e.g. an insect identified to family now is identified to species). One would be to assert a difference in opinion as to the correct way to group this individual with others (i.e. as in a taxonomic revision). Finally, a single determiner might apply several determinations to one individual and indicate in each determination the concept intended (i.e. if you subscribe to Cronquist, you'd call it X; if you like Radford's book, you'd call it Y; if you like Weakley's treatment, you'd call it Z). Some of these four reasons may be functionally equivalent, but how would you use Darwin Core to indicate the reason why you applied the determination? Please don't say "identificationRemarks"! From a machine-processing standpoint, this is something we should know and there should be some kind of controlled vocabulary to express it. For instance, if an identification is "deprecated" because it was in error (perhaps recognized by the determiner him/herself), one would like the incorrect determination to show up in the historical metadata, but I wouldn't want it to be listed in a website index. The same would hold true if an annotator was able to pin the taxon down to a lower taxonomic level than the original identifier. If someone goes to the trouble to connect an Individual/Occurrence to several names under alternative concepts, there should be a way that a machine would know this so that a software user could select the concept they wanted to use and the name under that concept would pop up. I don't really see any term under the current DwC that could be used to do this last thing. Am I missing something? Do we need several terms to explain the reason why we made the determination because the reasons fall into different categories?

The other comment that I'll throw out (since this is going out to the bioblitz list as well as to tdwg-content) is that those of you who are building apps to collect metadata in the field really need to separate the process of entering (or acquiring) the collection metadata from the determination process. In at least some apps, the user immediately has to commit to a taxon as they enter the data at the time of collection. It seems to me that it would be a very common situation (especially in the case of "citizen science") that the collector/observer/photographer would have no idea what the taxonomic identity was at the time of collection. The process of determination (and the recording of the various dwc:Identification class terms) is really a separate process that should be able to happen at the time of collection OR later.

Steve

Peter DeVries wrote:
Hi Steve,

I would hypothesize that for the vast majority of identified records the process is something like this:

1) An individual uses some sort of key to determine what species (taxon concept) to assign to a given individual.
   * They may have created some sort of mental key, so that once they recognize one individual mosquito they can then pretty quickly sort a number of individuals into collections.

2) The actual name they assign to the specimen is usually based on what their key says the name is. Often this does not specify the authorship. Most of these human identifiers have not read the original species descriptions for the species they are identifying. So the specimen is actually tied to a concept that is based more on the "key" than the original description.
   * An exception would be where there is a key in the original description and that was what was used.

3) So in a sense, the process of modeling this as if the identifier actually asserted that the concept was the same as that described by the original description or a subsequent revision is "fudging".

Side effects of this process include:

1) A new key for North American Mosquitoes comes out that incorporates recent changes in nomenclature. The major change is the elevation of a subgenus to a genus. For most of the species described the "key concept" is unchanged. Student identifier Bob, in state X, is using the latest key, while student identifier Joe, in state Z, is using a slightly older edition of the same key. Bob identifies the species as /Ochlerotatus triseriatus/, while Joe identifies what should be the same species as /Aedes triseriatus/. These show up in GBIF on two different maps, and they show up in the EOL as two different pages.

Various TDWG'ers continue to argue that the original description and subsequent revisions were really important in determining what these individuals actually meant when they assigned a name to a specimen, and that this is how we should model it in excruciating detail. I would argue this should be modeled as closely as possible to what actually happens. For example, how many of the species observed in the recent BioBlitz were identified by referring to the original species description or subsequent revisions?

In your diagram, I would suggest that you show that a taxon concept may have many names associated with it. Since it is not clear what the identifier intended by his or her choice of a name, it is often difficult to determine what taxon concept they actually meant. This is why I advocate a move to a more taxon concept based identifier to link these data sets together, because this allows the intent of the identifier to be more accurately modeled. This would be done in the form of:

"I assert that this specimen (of what I call /Aedes triseriatus/) was observed here. I also assert that it is an instance of this species concept => URI"

Or: I assert that this is an individual of the type "Individual of species concept X" => URI

All of these are instances of the class "Individual".

So the resulting DarwinCore record would contain both the name and an optional, but I think needed, asserted species concept.

The species concept is a subclass of taxon concept, but is fundamentally different from the higher clades. There are some guidelines as to what an entity needs to be considered a species, while there are no real guidelines as to what clades should be considered genera and what clades should be considered families etc. Assigning properties at the level of genera or family is also problematic because it assumes that there will be inferencing and it will require rechecking that those properties are still valid if the species within that genus change.
So if there is some property that is common to all the species in the genus, make that a property of each of the individual species - not a property of the genus.

Respectfully,

- Pete

On Fri, Oct 15, 2010 at 10:45 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:

As a background to this post, I want to reference a post by Bob called "SubclassOrNot". I discovered this page on an early foray into the TDWG website labyrinth and it has been very influential on my thinking since then. The idea Bob discusses is central to what I'm writing below, so if you haven't read it you might want to do so first. You can probably skip the "OWL Inference" section and still get the point, which is described in the first two sections of his post. The URL for the page is http://wiki.tdwg.org/twiki/bin/view/TAG/SubclassOrNot .

To preface what I'm going to say below, I want to put Darwin Core Occurrences in the context of what Bob wrote. In my mind, one of the hallmarks of the Darwin Core standard, and one thing that makes it a great improvement over previous versions, is that the decision was made to use what Bob called the "has a" approach rather than the "is a" approach. In particular, the Darwin Core standard has a single class called dwc:Occurrence rather than subclasses called "Specimen", "Observation", and other possible things. The way that we differentiate among different kinds of Occurrences is by using the DwC types, which are the controlled values for the term dwc:basisOfRecord. Thus we say an Occurrence "has a" basisOfRecord=PreservedSpecimen rather than saying it "is a" PreservedSpecimen. We say an Occurrence "has a" basisOfRecord=HumanObservation rather than saying it "is a" HumanObservation. This approach has greatly reduced the number of different terms in the standard since we don't have to have separate "ObservedBy" and "CollectedBy" terms, but rather can just have a single "RecordedBy" term that applies to both specimens and observations. The same thing applies to many other things, like eventDate rather than DateCollected and DateObserved, locality rather than collectionLocality and observationLocality, etc. With the ratification of Darwin Core, this decision is now a fait accompli and not a subject of discussion or something optional for users of the standard. It also seems clear that, as necessary, new terms can be added to the DwC types, which would then be valid controlled values for basisOfRecord.

Since the adoption of the DwC standard, the approach to Occurrences has been what I would describe as "I know an Occurrence when I see one". I consider this a pretty sloppy practice and, as I indicated in my post last night, I think there is enough consensus about what an Occurrence is that we can come up with a better definition than "an occurrence is the category of information pertaining to evidence of an occurrence...". Another part of what I would characterize as sloppiness is the lack of a clear definition of what exactly basisOfRecord means. When I wrote my attempt at summarizing consensus last night, I dodged the question about what I called the "token". This "thing" has been called various names. In the previous discussion on the list, it was sometimes called "the evidence" of the occurrence. In the past I have called it "a representation" - however, I now think the term "token" is better because "representation" has a different technical meaning in the context of content negotiation.

When we type an Occurrence by saying it has a basisOfRecord=PreservedSpecimen, we are saying that this Occurrence has as supporting evidence, or as a "token" if you prefer, all or part of the dead remains of the organism (i.e. what I'm calling "the Individual") that was being documented by the Occurrence. When we type an Occurrence by saying it has a basisOfRecord=LivingSpecimen, we are saying that this Occurrence has as a "token" the entire organism that was being documented (or some vegetative part of the live organism that was propagated). When we type an Occurrence by saying it has a basisOfRecord=HumanObservation, we are saying that the Occurrence has no supporting evidence other than the reputation of the observer to accurately record the metadata about the Occurrence. In other words, we "tag" an instance of a core class (to use Bob's words), Occurrence, by telling a metadata consumer what kind of token we are using as evidence of the Occurrence.

A fundamental part of creating a clear definition of what an Occurrence is, is to define exactly what we are including in the concept of Occurrence. One possibility is to (1) say that the two boxes at the right side of the diagram at http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif are fused and that both the Occurrence metadata and its associated token are what we consider to be "the Occurrence". Another approach (2) would be to say that the actual Occurrence as an entity is only the metadata part and that the token is a separate thing. A third approach is to say (3) that everything within the blue dotted lines is considered a part of the Occurrence (i.e. the metadata, the token, the event, and the locality). I don't think in an absolute sense, any one of these approaches is "right". The problem is that these approaches are used inconsistently, sometimes even by the same person, depending on the basisOfRecord. Differences in ways of thinking about this issue are part of why people aren't understanding the way other people are approaching the structuring of metadata.

I have tried to consistently take the approach (1) that the two boxes on the right are fused, i.e. that the Occurrence metadata and the token should both be considered part of the entity that we call "an Occurrence". I think this is why Rich was confused in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001666.html when I said that it was "wrong" to assert that a scientific name is a property of an Occurrence - obviously it is silly to say that the token (photons on a film, sound patterns in a digital file) has a scientific name. Yet that is exactly what people do routinely when the token is a branch cut off a tree and glued to a piece of paper. They say that they are "identifying a specimen". What I am asking (actually demanding) is that the TDWG community get its act together and come to some consistency on this. If we are going to take the approach (2), then we need to take specimens off their pedestal and treat them like we do any other token that we are using as evidence that an Occurrence happened. If we are going to do what was suggested for the BioBlitz in http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001603.html, i.e. to call Occurrences "observations" and then link the tokens to them by associatedMedia, ResourceRelationship, or some other means (approach 2), then do it consistently for every kind of token, including specimens, and don't single out media tokens for punishment.

I have in a sense "thrown down the gauntlet" on this issue by proposing that DigitalStillImage be added as a DwC type and as a controlled value for basisOfRecord (http://code.google.com/p/darwincore/issues/detail?id=68). I know what some people are going to say in response to this proposal. "Why do you need to have 'DigitalStillImage' as a value for basisOfRecord when you can just say that the resource's dcterms:type=StillImage?" The answer goes back to Bob's point. If we are going to go the "has a" path (which we already have in DwC for Occurrences) rather than subclassing everything, then we need to provide an appropriate value for the "tag" for any type of resource that a reasonable number of users will want to use as a token. I think it is clear from this and other Bioblitzes, my work in Bioimages, the whale tracking project, and many other examples, that there are plenty of people who are already using DigitalStillImages as tokens and we all need a controlled value to use for basisOfRecord.

The other thing that we accomplish when we type an Occurrence by its basisOfRecord is to tell a consumer what kind of metadata to expect to get about the token in addition to the generic metadata that is provided for all Occurrences. Thus for a LivingSpecimen we expect to be told what zoo, botanical garden, bacterial collection, etc. contains the specimen. For a PreservedSpecimen we expect to be told the preparation type, the location of the repository, etc. For a DigitalStillImage we expect to be told the file type, accessURL, etc. Simply providing a value for dcterms:type=StillImage doesn't indicate whether the image is a physical one (i.e. on film) or a digital one. It is also unreasonable to expect a client to have to check two different terms (basisOfRecord and dcterms:type) to find out what they could learn from one (basisOfRecord). Of course it would be advisable to provide a value for dcterms:type as well for clients outside the biodiversity community who may not "understand" what basisOfRecord means.

I hate to keep bringing my posts back to the RDF issue, but thinking about how one would write RDF forces clear thinking about how metadata should be structured. If we intend to separate tokens as entities from their associated Occurrence metadata, i.e. approach (2), then we open up a whole other can of worms. To associate the occurrence resources (i.e. the metadata) with the "different" resource (i.e.
the token), we will probably have to be able to create URIs for the tokens and separate RDF metadata blocks which will have to be rdf:type'd. What are we going to use for that rdf:type - create another Darwin Core class? I simply don't think that is a complicated road that we want to travel. It would be far easier to just say that every Occurrence has a one-to-one relationship with its token (which could be "the empty set" for observations). This would not work for people who want to hang multiple tokens on a single observation event, but I think that itself is a bad idea because it makes it even harder to have "flat" occurrence datasets. Just say that every time we collect a different token (or make an observation that has no token), it is a new Occurrence record. Realistically, a single collector can't actually take a picture of a plant at the same time he or she collects it for a specimen anyway. Those really should be considered two different events because they happen at different times.

OK, enough said. Consider this my defense of my proposal "issue 68" to add DigitalStillImage. I would urge the powers that be to respond to the issues that I've raised here before having any kind of "vote" (or whatever is ultimately going to happen when there is an up or down decision about the proposal).

Steve

--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu

--
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories 1630 Linden Drive Madison, WI 53706
TaxonConcept Knowledge Base <http://www.taxonconcept.org/> / GeoSpecies Knowledge Base <http://lod.geospecies.org/>
About the GeoSpecies Knowledge Base <http://about.geospecies.org/>
All,
I'm in Stockholm, and right now it's 10am in Hawaii, and I've effectively been awake since 7pm Hawaii time -- so my brain is a bit mush. But I'll take a chance and comment anyway.
I will leave it up to the taxonomy people how the different things would be connected to the species concept and how all of their lines would be connected.
In my mind the "fully-normalised" (sensu Döring) relationship graph is something like this (notation is [One]--<[Many]; [One]--[One]) (Be sure to view as a fixed-width font, like Courier):
                                                      [identifiedBy]
                                                            |
[Location]--<[Event]--<[Occurrence]>--[Individual]--<[Identification]--[TaxonNameUsage]>--[nameAccordingTo]
                |                                           |                 |
           [eventTime]                               [dateIdentified]  [scientificName]
I'm following what I *think* Steve defined for [Individual], which is that it can be either a single individual organism or a defined set of organisms (e.g., up to at least a population).
So, an Occurrence is the intersection of an Individual and an Event. An Event is a Location+Time[+other metadata]. Each Event may have multiple Occurrences (i.e., one for each distinct Individual at the same Location+Time). Also, an Individual may have multiple Occurrences (one for each Event at which the same Individual was documented).
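The relationships just described can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the class and field names are hypothetical stand-ins for the boxes in the diagram, not Darwin Core terms, and the example identifiers and dates are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    location_id: str


@dataclass(frozen=True)
class Event:
    # An Event is a Location plus a time (plus optional other metadata, omitted here).
    event_id: str
    location: Location
    event_time: str


@dataclass(frozen=True)
class Individual:
    # A single organism or a defined set of organisms.
    individual_id: str


@dataclass(frozen=True)
class Occurrence:
    # An Occurrence joins exactly one Individual with exactly one Event.
    occurrence_id: str
    event: Event
    individual: Individual


def occurrences_at(event, occurrences):
    """All Occurrences documented at one Event (one per distinct Individual)."""
    return [o for o in occurrences if o.event == event]


def occurrences_of(individual, occurrences):
    """All Occurrences of one Individual (one per Event where it was documented)."""
    return [o for o in occurrences if o.individual == individual]


# Hypothetical example data.
site = Location("loc-1")
first_visit = Event("evt-1", site, "2010-10-01")
second_visit = Event("evt-2", site, "2010-10-15")
tree = Individual("ind-tree")
shrub = Individual("ind-shrub")

records = [
    Occurrence("occ-1", first_visit, tree),
    Occurrence("occ-2", first_visit, shrub),
    Occurrence("occ-3", second_visit, tree),
]

at_first_visit = occurrences_at(first_visit, records)
of_tree = occurrences_of(tree, records)
```

Here the first visit has two Occurrences (one per Individual) and the tree has two Occurrences (one per Event), which matches the one-to-many relationships on both sides of Occurrence.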
An Individual may have multiple Identifications. I make no distinction between "Identification" and "Determination" (nor do I make a distinction between the first identification and subsequent identifications). I slightly prefer "Identification", because "Determination" seems to imply that there is a correct answer, whereas "Identification" (to me, anyway), implies an opinion. Steve, I didn't quite follow how you were distinguishing these two terms -- so if you have a clear reason for distinguishing them, I'd like to understand it better.
A single Identification should, in my mind, always join a single individual with a single "TaxonNameUsage" instance. I'm not 100% sure how TaxonNameUsage maps in DwC. I *think* it's an instance of a dwc:Taxon, as most of the core attributes of a TNU (acceptedNameUsage[ID], parentNameUsage[ID], originalNameUsage[ID], scientificName, taxonRank) are represented as terms in the Taxon Class. But I'm a little fuzzy on whether a "taxonID" maps directly to a TNUID, or if a TNUID more correctly maps to taxonConceptID.
The determination would have any of the properties that are terms listed in the dwc:Identification class (identifiedBy, dateIdentified, identificationReferences, identificationRemarks, identificationQualifier, and typeStatus). Some properties like dateIdentified and identificationReferences would be string literals and others (especially identifiedBy) should probably be GUIDs but could be literals if they had to be.
I agree with what Steve wrote above. However, I'm uncomfortable with Markus' suggestion of treating dwc:nameAccordingTo as a property of an Identification -- even as a shortcut. I think this is a bit dangerous. If there is no TaxonID instance (aka "TaxonNameUsage" in my diagram above) available to link the Identification to, then I would suggest using identificationReferences as the shortcut. But that would still force you to attach scientificName directly to the Identification instance, which I think is also unwise. I'd rather the Best Practice be to "manufacture" a place-holder dwc:Taxon instance (if a proper one doesn't already exist in the content source), and apply the scientificName property to that Taxon instance, rather than directly to an Identification. I know it's often short-hand to attach the scientificName directly to the Occurrence instance; but I actually feel less uneasy about that, because it is much more obviously a shortcut. But if you're going to the trouble to provide an instantiated "Identification", then you ought to anchor it to a Taxon instance (manufactured or real).
But, I guess as Greg said in his post, it may not really matter, as in the long run, we'll probably be able to make inferences about the proper Individual<-->TaxonConcept mapping, even when it's not explicitly documented.
- The original label identifies the species as Juncus diffusissimus. However, there is no indicator as to who originally identified it or when. My assumption is that it was the collector (Glen N. Montz) but I don't really know that. Do I assume that, or list the original determiner as "unknown"?
I would make no assumptions about who was the identifiedBy person. Instead, I handle these cases by either going with "Unspecified", or, in some cases (when I have confidence), something like "Bishop Museum Staff Member". Often I can deduce the identifier with some degree of confidence, but usually I don't have the time to do this. The dateIdentified can either not be provided, or set as some range (e.g., at the very worst, on or after the eventDate/eventTime, and before today).
This is why I think that identification tags ("annotations" sensu Baskauf) can be "documentation sources" for TNUs.
In the web example given by Steve, we have an identification as follows:
Juncus diffusissimus Buckl. Determined by: L. Urbatsch Determination date: 2009
Completely independently of the specimen itself, we can infer from the tag that:
- Sometime between 1 Jan 2009 and 31 Dec 2009, L. Urbatsch regarded the genus "Juncus" as valid.
- Sometime between 1 Jan 2009 and 31 Dec 2009, L. Urbatsch regarded the species epithet "diffusissimus" [of Buckl.] as a valid species, placed within the genus "Juncus".
Thus, we have at least two implied TNUs from this identification, which was documented on a piece of paper that happens to be fixed to LSU-BR 39823.
The Identification instance would link the Individual (manifest as a specimen, in this case) to the TNU of "[Juncus] diffusissimus Buckl. sec L. Urbatsch 2009". The nameAccordingTo would be "L. Urbatsch 2009". It may seem redundant to have "L. Urbatsch 2009" in both the nameAccordingTo attribute of the Taxon instance, and in the identifiedBy & dateIdentified attributes of the Identification instance -- but the fact remains they are fundamentally different pieces of information. One establishes an instance of an (implied) taxon concept, and the other establishes the placement of LSU-BR 39823 within that taxon concept circumscription.
Eventually, a third party may be able to deduce (perhaps through a suite of other, external information) a RelationshipAssertion that maps the TNU "[Juncus] diffusissimus Buckl. sec L. Urbatsch 2009" to some other, perhaps published and well-defined taxon concept (of the same or different name). Also, if there are 100 specimens in the collection that L. Urbatsch identified as "Juncus diffusissimus Buckl." in 2009, then anchoring all 100 Identification instances to the one TNU, allows all of those specimens to inherit the mapping of the one "[Juncus] diffusissimus Buckl. sec L. Urbatsch 2009" TNU instance to some other better-defined taxon concept.
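The benefit of anchoring many Identification instances to one TNU can be sketched in Python as below. This is an illustration only: the class names, the `concept_mappings` dictionary, the specimen identifiers, and the target concept label are all hypothetical, and a real RelationshipAssertion would carry more than a single key-to-label mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonNameUsage:
    scientific_name: str
    name_according_to: str


@dataclass(frozen=True)
class Identification:
    individual_id: str
    identified_by: str
    date_identified: str
    tnu: TaxonNameUsage


# One TNU shared by every specimen that L. Urbatsch identified this way in 2009.
tnu = TaxonNameUsage("Juncus diffusissimus Buckl.", "L. Urbatsch 2009")

# 100 hypothetical specimens, each with its own Identification pointing at that TNU.
identifications = [
    Identification(f"specimen-{n}", "L. Urbatsch", "2009", tnu)
    for n in range(1, 101)
]


def resolved_concept(identification, mappings):
    """Look up the better-defined concept, if any, asserted for the Identification's TNU."""
    return mappings.get(identification.tnu)


# Before any third party asserts a relationship, nothing resolves.
unresolved = resolved_concept(identifications[0], {})

# A single (hypothetical) assertion about the TNU then applies to all 100 specimens.
concept_mappings = {tnu: "hypothetical-well-defined-concept"}
resolved = [resolved_concept(i, concept_mappings) for i in identifications]
```

The design point is that the mapping is stored once, against the TNU, and never copied onto the individual Identification instances.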
I know this is a lot of stuff to keep in one's head at the same time -- but as cumbersome as it seems, I am convinced it can be packaged into a relatively straightforward and intuitive UI, and modelling it this way improves the utility of the data (maybe dramatically) in the long run.
- Do we draw a distinction between the initial identification and subsequent annotations?
I think the answer should be "no" and that's why I refer to both generically as "determinations".
I agree.
- There is really no indication given on the annotation labels as to many of the things that we would like to know, such as the concept they had in mind, any source they used (if any), or the reason why they did the annotation. So how does one connect the name that they applied to the determination when there is no indication of the concept?
As I said in an earlier post, the single most important way to reduce taxonomic ambiguity is to try to capture (or confidently deduce) the source (=mapping to taxon concept). But if it can't be done, then it can't be done -- so I'm inclined to establish a "place-holder" dwc:Taxon instance, with no nameAccordingTo, and no other metadata besides the scientificName.
Is this just something we can't do for old annotations and just something that we try to do from this point forward?
Probably.
- The last question is one that I really want some opinions about. It seems to me that there are a number of reasons why one would apply a determination.
Hmmm....I don't think this is really useful information. I don't understand how you would use this information in a machine-processing sort of way. An Identification is an Identification. In some cases, the Identifier may not even be aware of the previous identification, and so we cannot necessarily infer there was a particular "reason". And even if there is a reason, how do we use that information? What if there is more than one reason (i.e., if we are restricted to a controlled vocabulary)?
As far as I'm concerned, the Identifications should stand as they are. If needed people can annotate the Identification instances; but I don't see the value in machine-processing these things.
Also:
Finally, a single determiner might apply several determinations to one individual and indicate in each determination the concept intended (i.e. if you subscribe to Cronquist, you'd call it X; if you like Radford's book, you'd call it Y; if you like Weakley's treatment, you'd call it Z).
YIKES! I don't like the idea of loading all that information on an Identification instance. If the person wants to make this sort of assertion, then they should establish the appropriate relationshipAssertion instances among the various taxonConcepts cited.
Damn. Now my head is really tired. And so is the rest of me....
Aloha, and g'night..
Rich
Rich, Thanks for the great summary diagram and even more amazing that it was made under mushed brain conditions. Hopefully you've gotten sleep since then. Unfortunately, when I tried to look at it I had some problems with line breaks. I've tried to recreate your diagram at http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif Please correct me if I didn't get it right. My arrow-drawing utility put the arrow heads on the other end of the lines, but I think the arrows still maintain the "many to one" relationships you were trying to represent. I also replaced eventTime with eventDate since the latter is a broader term that also can include the time.
In principle, I agree with this diagram to the left of taxonNameUsage completely. (I still need clarification about a few things on the right end.) My main reason for using determination as a term rather than identification is because it is not ambiguous to refer to the person doing the identifying as the determiner, whereas referring to that person as the "identifier" creates confusion between that person and the identifying string for resources (as in "persistent identifier"). So if we agree that determination, annotation, and identification all mean the same thing (namely an instance of the dwc:Identification class), I'm happy to just use the term "identification". For the person doing it, I guess dwc:identifiedBy would be the best term although it's a bit awkward in regular speech so I may slip and still say "determiner".
Although I agree in principle that there can be many occurrences at an Event and many events at a Location, I think there are two practical reasons why it may be better to assign separate eventDate and Location metadata to each Occurrence. The first is that it makes the database structure simpler. As Markus has already noted, we really would prefer for the database to be as "flat" as possible. When I look at the terms listed in the DwC term page (http://rs.tdwg.org/dwc/terms/index.htm) under Event, the most important one that I see which everyone should be providing is eventDate. The rest I would pretty much consider optional and as a shortcut Rich's diagram could be collapsed to make them direct properties of the Occurrence. The second reason involves the practical matter of defining a Location. I will note that my thinking about this has been deeply influenced by a previous discussion on the topic from 2008-2009 summarized at http://www.sernec.org/files/summary-of-discussion.pdf on p.78-84. I don't think most people will want to wade through all of that text, so I'll just sum it up here. Somebody (I think it might have been Debbie Paul at Morphbank) suggested to me that we really have an intrinsically globally unique identifier for Location. It's the combination of dwc:decimalLatitude and dwc:decimalLongitude along with dwc:coordinateUncertaintyInMeters to establish precision and dwc:geodeticDatum to establish the reference system. (If we like geo:lat and geo:long, then the reference system is implied and we are down to three terms to unambiguously define a Location and its uncertainty. For the benefit of humans, a Locality description is probably also beneficial. Also, elevation and depth might be provided, although at least in theory elevation could be calculated with a sufficiently good digital elevation model). 
I will grant that we don't have this information for a lot of old records, but based on the massive efforts to geolocate specimens, I would say it's pretty clear that this is what we would like to have if we could get it. I certainly hope that there aren't any serious collectors, observers, and live organism photographers who aren't by this point trying to record this information as they establish new Occurrence records. If you look at all of the Location terms on the dwc list, most of the other terms are either concessions to the fact that we don't have what we want (e.g. the "verbatim" terms), things we could generate using a computer program if we were clever (like stateProvince, county, etc. - I know at least Mike Giddens has succeeded in doing this), ways of indicating how we got lat and long from old records (e.g. georeferenceSources), or methods to define larger scale Locations that aren't points (e.g. footprintWKT). I think it is safe to say that in the future (if not now already), many or most Events associated with Occurrences will have an associated button click (on a GPS receiver, camera phone, or GPS-enabled camera) that will automatically generate dwc:eventDate, dwc:decimalLatitude, dwc:decimalLongitude (with geodeticDatum=WGS84) and maybe coordinateUncertaintyInMeters. Thus designing a system that requires that these time/space snapshots be grouped together into artificial "Locations" is really counterproductive when those data are now generated and can be associated with Occurrences automatically. I don't know if Greg Riccardi of Morphbank is following this thread or not. If so he may want to comment on this issue based on practical experience at Morphbank. When the Morphbank system was set up, it required the creation of a separate Location record which was assigned a unique Morphbank identifier. Specimens were then linked to this Location. 
What ended up happening was that each Specimen having GPS metadata ended up being assigned to its own separate Location even if it was 20 meters from another specimen. In effect, each Occurrence record ended up having its own decimalLatitude/decimalLongitude record anyway. So the system ended up being more complicated than necessary.
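As a rough illustration of treating the coordinate terms as an intrinsic Location identifier on flat Occurrence records, here is a minimal Python sketch. The `LocationKey` type, the `location_key` helper, the record values, and the choice to default a missing geodeticDatum to WGS84 are all assumptions made for the example; only the four dictionary keys are Darwin Core term names.

```python
from typing import NamedTuple


class LocationKey(NamedTuple):
    decimal_latitude: float
    decimal_longitude: float
    coordinate_uncertainty_in_meters: float
    geodetic_datum: str


def location_key(record):
    """Build a Location key directly from a flat Occurrence record."""
    return LocationKey(
        record["decimalLatitude"],
        record["decimalLongitude"],
        record["coordinateUncertaintyInMeters"],
        # Assumption for this sketch: a missing datum is treated as WGS84.
        record.get("geodeticDatum", "WGS84"),
    )


# Hypothetical flat Occurrence records; coordinates are invented for illustration.
occ_a = {"eventDate": "2010-10-19", "decimalLatitude": 36.1447,
         "decimalLongitude": -86.8027, "coordinateUncertaintyInMeters": 10.0,
         "geodeticDatum": "WGS84"}
occ_b = {"eventDate": "2010-10-20", "decimalLatitude": 36.1447,
         "decimalLongitude": -86.8027, "coordinateUncertaintyInMeters": 10.0}
occ_c = {"eventDate": "2010-10-20", "decimalLatitude": 36.1449,
         "decimalLongitude": -86.8027, "coordinateUncertaintyInMeters": 10.0}

keys = {location_key(r) for r in (occ_a, occ_b, occ_c)}
```

Records `occ_a` and `occ_b` yield the same key without any separate Location record being created, while `occ_c`, a short distance away, gets its own key, much as happened with the Morphbank specimens.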
As I said, I agree in principle with the left side of Rich's diagram. Taking the practical considerations I just mentioned into account, I would simplify the diagram as http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif Superficially, it looks more complicated, but I've gotten rid of several "one to many" relationships and enthroned Occurrence at its accustomed place in the center of the universe (or at least the center of the left side of the diagram). I don't have any philosophical objections to people structuring their data according to Rich's original diagram and the existing Darwin Core terms certainly make it possible to do so (well except for the Individual thing). However, I submit that many people will find it simpler (and easier to use tools like Darwin Core Archives) if they use the flatter structure that I have in the revised diagram.
I will save my questions about the right side of Rich's diagram for later. Steve
On Oct 19, 2010, at 11:35 AM, Steve Baskauf wrote:
I've tried to recreate your diagram at http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif
Note that the visible label gives the correct URL (http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif), but for some reason it's linked to the wrong URL -- so don't click it, just cut & paste it.
Arlin
Please correct me if I didn't get it right. My arrow-drawing utility put the arrow heads on the other end of the lines, but I think the arrows still maintain the "many to one" relationships you were trying to represent. I also replaced eventTime with eventDate since the latter is a broader term that also can include the time.
In principle, I agree with this diagram to the left of taxonNameUsage completely. (I still need clarification about a few things on the right end.) My main reason for using determination as a term rather than identification is because it is not ambiguous to refer to the person doing the identifying as the determiner, whereas referring to that person as the "identifier" creates confusion between that person and the identifying string for resources (as in "persistent identifier"). So if we agree that determination, annotation, and identification all mean the same thing (namely an instance of the dwc:Identification class), I'm happy to just use the term "identification". For the person doing it, I guess dwc:identifiedBy would be the best term although it's a bit awkward in regular speech so I may slip and still say "determiner".
Although I agree in principle that there can be many occurrences at an Event and many events at a Location, I think there are two practical reasons why it may be better to assign separate eventDate and Location metadata to each Occurrence. The first is that it makes the database structure simpler. As Markus has already noted, we really would prefer for the database to be as "flat" as possible. When I look at the terms listed in the DwC term page (http://rs.tdwg.org/dwc/terms/index.htm ) under Event, the most important one that I see which everyone should be providing is eventDate. The rest I would pretty much consider optional and as a shortcut Rich's diagram could be collapsed to make them direct properties of the Occurrence. The second reason involves the practical matter of defining a Location. I will note that my thinking about this has been deeply influenced by a previous discussion on the topic from 2008-2009 summarized at http://www.sernec.org/files/summary-of-discussion.pdf on p.78-84. I don't think most people will want to wade through all of that text, so I'll just sum it up here. Somebody (I think it might have been Debbie Paul at Morphbank) suggested to me that we really have an intrinsically globally unique identifier for Location. It's the combination of dwc:decimalLatitude and dwc:decimalLongitude along with dwc:coordinateUncertaintyInMeters to establish precision and dwc:geodeticDatum to establish the reference system. (If we like geo:lat and geo:long, then the reference system is implied and we are down to three terms to unambiguously define a Location and its uncertainty. For the benefits of humans, a Locality description is probably also beneficial. Also, elevation and depth might be provided, although at least in theory elevation could be calculated with a sufficiently good digital elevation model). 
I will grant that we don't have this information for a lot of old records, but based on the massive efforts to geolocate specimens, I would say it's pretty clear that this is what we would like to have if we could get it. I certainly hope that there aren't any serious collectors, observers, and live organism photographers who aren't by this point trying to record this information as they establish new Occurrence records. If you look at all of the Location terms on the dwc list, most of the other terms are either concessions to the fact that we don't have what we want (e.g. the "verbatum" terms), things we could generate using a computer program if we were clever (like stateProvince, county, etc. - I know at least Mike Giddens has succeeded in doing this), ways of indicating how we got lat and long from old records (e.g. georefererenceSources), or methods to define larger scale Locations that aren't points (e.g. footprintWKT). I think it is safe to say that in the future (if not now already), many or most Events associated with Occurrences will have an associated button click (on a GPS receiver, camera phone, or GPS enabled camera) that will automatically generate dwc:eventDate, dwc:decimalLatitude, dwc:decimalLongitude (with geodeticDatum=WGS84) and maybe coordinateUncertaintyInMeters. Thus designing a system that requires that these time/space snapshots be grouped together into artificial "Locations" is really counterproductive when those data are now generated and can be associated with Occurrences automatically. I don't know if Greg Riccardi of Morphbank is following this thread or not. If so he may want to comment on this issue based on practical experience at Morphbank. When the Morphbank system was set up, it required the creation of a separate Location record which was assigned a unique Morphbank identifier. Specimens were then linked to this Location. 
What ended up happening was that each Specimen having GPS metadata ended up being assigned to its own separate Location even if it was 20 meters from another specimen. In effect, each Occurrence record ended up having its own decimalLatitude/decimalLongitude record anyway. So the system ended up being more complicated than necessary.
As I said, I agree in principle with the left side of Rich's diagram. Taking the practical considerations I just mentioned into account, I would simplify the diagram as http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif Superficially, it looks more complicated, but I've gotten rid of several "one to many" relationships and enthroned Occurrence at its accustomed place in the center of the universe (or at least the center of the left side of the diagram). I don't have any philosophical objections to people structuring their data according to Rich's original diagram and the existing Darwin Core terms certainly make it possible to do so (well except for the Individual thing). However, I submit that many people will find it simpler (and easier to use tools like Darwin Core Archives) if they use the flatter structure that I have in the revised diagram.
I will save my questions about the right side of Rich's diagram for later. Steve
Richard Pyle wrote:
All,
I'm in Stockholm, and right now it's 10am in Hawaii, and I've effectively been awake since 7pm Hawaii time -- so my brain is a bit mush. But I'll take a chance and comment anyway.
I will leave it up to the taxonomy people to work out how the different things would be connected to the species concept and how all of their lines would be connected.
In my mind the "fully-normalised" (sensu Döring) relationship graph is something like this (notation is [One]--<[Many]; [One]--[One]) (Be sure to view as a fixed-width font, like Courier):
                                                     [identifiedBy]
                                                            |
[Location]--<[Event]--<[Occurrence]>--[Individual]--<[Identification]--[TaxonNameUsage]>--[nameAccordingTo]
                |                                           |                 |
           [eventTime]                              [dateIdentified]  [scientificName]
I'm following what I *think* Steve defined for [Individual], which is that it can be either a single individual organism or a defined set of organisms (e.g., up to at least a population).
So, an Occurrence is the intersection of an Individual and an Event. An Event is a Location+Time[+other metadata]. Each Event may have multiple Occurrences (i.e., one for each distinct Individual at the same Location+Time). Also, an Individual may have multiple Occurrences (one for each Event at which the same Individual was documented).
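The relationships described above can be sketched in a few lines of Python. This is only an illustration, assuming hypothetical class names, field names, and identifiers (none of them are Darwin Core terms or anyone's actual schema): an Occurrence carries a reference to exactly one Event and one Individual, so one Event can have many Occurrences and one Individual can have many Occurrences.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # An Event is a Location plus a time (other metadata omitted here).
    event_id: str
    location_id: str
    event_date: str


@dataclass(frozen=True)
class Occurrence:
    # An Occurrence is the intersection of one Individual and one Event.
    occurrence_id: str
    event_id: str
    individual_id: str


# Hypothetical sample data: two Events at the same Location.
events = [
    Event("ev1", "loc1", "2010-10-19"),
    Event("ev2", "loc1", "2010-10-20"),
]
occurrences = [
    Occurrence("occ1", "ev1", "ind1"),
    Occurrence("occ2", "ev1", "ind2"),  # second Individual at the same Event
    Occurrence("occ3", "ev2", "ind1"),  # same Individual at a later Event
]

# Group Occurrence identifiers by Event and by Individual.
by_event = defaultdict(list)
by_individual = defaultdict(list)
for occ in occurrences:
    by_event[occ.event_id].append(occ.occurrence_id)
    by_individual[occ.individual_id].append(occ.occurrence_id)

print(dict(by_event))
print(dict(by_individual))
```

Here "ev1" has two Occurrences (one per Individual) and "ind1" has two Occurrences (one per Event), which is the pair of one-to-many relationships described in the paragraph above.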
An Individual may have multiple Identifications. I make no distinction between "Identification" and "Determination" (nor do I make a distinction between the first identification and subsequent identifications). I slightly prefer "Identification", because "Determination" seems to imply that there is a correct answer, whereas "Identification" (to me, anyway), implies an opinion. Steve, I didn't quite follow how you were distinguishing these two terms -- so if you have a clear reason for distinguishing them, I'd like to understand it better.
A single Identification should, in my mind, always join a single individual with a single "TaxonNameUsage" instance. I'm not 100% sure how TaxonNameUsage maps in DwC. I *think* it's an instance of a dwc:Taxon, as most of the core attributes of a TNU (acceptedNameUsage[ID], parentNameUsage[ID], originalNameUsage[ID], scientificName, taxonRank) are represented as terms in the Taxon Class. But I'm a little fuzzy on whether a "taxonID" maps directly to a TNUID, or if a TNUID more correctly maps to taxonConceptID.
The determination would have any of the properties that are terms listed in the dwc:Identification class (identifiedBy, dateIdentified, identificationReferences, identificationRemarks, identificationQualifier, and typeStatus). Some properties like dateIdentified and identificationReferences would be string literals and others (especially identifiedBy) should probably be GUIDs but could be literals if they had to be.
I agree with what Steve wrote above. However, I'm uncomfortable with Markus' suggestion of treating dwc:nameAccordingTo as a property of an Identification -- even as a shortcut. I think this is a bit dangerous. If there is no TaxonID instance (aka "TaxonNameUsage" in my diagram above) available to link the Identification to, then I would suggest using identificationReferences as the shortcut. But that would still force you to attach scientificName directly to the Identification instance, which I think is also unwise. I'd rather the Best Practice be to "manufacture" a place-holder dwc:Taxon instance (if a proper one doesn't already exist in the content source), and apply the scientificName property to that Taxon instance, rather than directly to an Identification. I know it's often short-hand to attach the scientificName directly to the Occurrence instance; but I actually feel less uneasy about that, because it is much more obviously a shortcut. But if you're going to the trouble to provide an instantiated "Identification", then you ought to anchor it to a Taxon instance (manufactured or real).
But, I guess as Greg said in his post, it may not really matter, as in the long run, we'll probably be able to make inferences about the proper Individual<-->TaxonConcept mapping, even when it's not explicitly documented.
- The original label identifies the species as Juncus diffusissimus. However, there is no indicator as to who originally identified it or when. My assumption is that it was the collector (Glen N. Montz) but I don't really know that. Do I assume that, or list the original determiner as "unknown"?
I would make no assumptions about who was the identifiedBy person. Instead, I handle these cases by either going with "Unspecified", or, in some cases (when I have confidence), something like "Bishop Museum Staff Member". Often I can deduce the identifier with some degree of confidence, but usually I don't have the time to do this. The dateIdentified can either not be provided, or set as some range (e.g., at the very worst, on or after the eventDate/eventTime, and before today).
This is why I think that identification tags ("annotations" sensu Baskauf) can be "documentation sources" for TNUs.
In the web example given by Steve, we have an identification as follows:
Juncus diffusissimus Buckl. Determined by: L. Urbatsch Determination date: 2009
Completely independently of the specimen itself, we can infer from the tag that:
- Sometime between 1 Jan 2009 and 31 Dec 2009, L. Urbatsch regarded the genus "Juncus" as valid.
- Sometime between 1 Jan 2009 and 31 Dec 2009, L. Urbatsch regarded the species epithet "diffusissimus" [of Buckl.] as a valid species, placed within the genus "Juncus".
Thus, we have at least two implied TNUs from this identification, which was documented on a piece of paper that happens to be fixed to LSU-BR 39823.
The Identification instance would link the Individual (manifest as a specimen, in this case) to the TNU of "[Juncus] diffusissimus Buckl. sec L. Urbatsch 2009". The nameAccordingTo would be "L. Urbatsch 2009". It may seem redundant to have "L. Urbatsch 2009" in both the nameAccordingTo attribute of the Taxon instance, and in the identifiedBy & dateIdentified attributes of the Identification instance -- but the fact remains they are fundamentally different pieces of information. One establishes an instance of an (implied) taxon concept, and the other establishes the placement of LSU-BR 39823 within that taxon concept circumscription.
Eventually, a third party may be able to deduce (perhaps through a suite of other, external information) a RelationshipAssertion that maps the TNU "[Juncus] diffusissimus Buckl. sec L. Urbatsch 2009" to some other, perhaps published and well-defined taxon concept (of the same or different name). Also, if there are 100 specimens in the collection that L. Urbatsch identified as "Juncus diffusissimus Buckl." in 2009, then anchoring all 100 Identification instances to the one TNU, allows all of those specimens to inherit the mapping of the one "[Juncus] diffusissimus Buckl. sec L. Urbatsch 2009" TNU instance to some other better-defined taxon concept.
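The benefit of anchoring many Identification instances to one TNU can be illustrated with a short Python sketch. Everything named here is an assumption made for illustration: the identifier strings, the class, and the dictionary standing in for RelationshipAssertions are hypothetical, not part of Darwin Core or GNUB.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Identification:
    individual_id: str    # the specimen being identified
    tnu_id: str           # the single TNU this Identification is anchored to
    identified_by: str
    date_identified: str


# Hypothetical identifier for "[Juncus] diffusissimus Buckl. sec L. Urbatsch 2009".
TNU_ID = "tnu:juncus-diffusissimus-sec-urbatsch-2009"

# 100 specimens, all identified by the same person against the same TNU.
identifications = [
    Identification(f"specimen-{n}", TNU_ID, "L. Urbatsch", "2009")
    for n in range(1, 101)
]

# Stand-in for RelationshipAssertions: TNU identifier -> better-defined concept.
relationship_assertions: Dict[str, str] = {}


def resolved_concept(ident: Identification,
                     assertions: Dict[str, str]) -> Optional[str]:
    """Return the concept the Identification's TNU has been mapped to, if any."""
    return assertions.get(ident.tnu_id)


# Before any third party makes an assertion, nothing is resolved.
unmapped = [resolved_concept(i, relationship_assertions) for i in identifications]

# One assertion about the one TNU (the target concept name is hypothetical)...
relationship_assertions[TNU_ID] = "concept:hypothetical-published-treatment"

# ...and all 100 Identifications inherit the mapping.
mapped = [resolved_concept(i, relationship_assertions) for i in identifications]
print(len(mapped), set(mapped))
```

The point of the design is that the mapping is stated once, on the TNU, rather than 100 times on the individual Identification instances.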
I know this is a lot of stuff to keep in one's head at the same time -- but as cumbersome as it seems, I am convinced it can be packaged into a relatively straightforward and intuitive user interface, and modelling it this way improves the utility of the data (maybe dramatically) in the long run.
- Do we draw a distinction between the initial identification and subsequent annotations?
I think the answer should be "no" and that's why I refer to both generically as "determinations".
I agree.
- There is really no indication given on the annotation labels as to many of the things that we would like to know, such as the concept they had in mind, any source they used (if any), or the reason why they did the annotation. So how does one connect the name that they applied to the determination when there is no indication of the concept?
As I said in an earlier post, the single most important way to reduce taxonomic ambiguity is to try to capture (or confidently deduce) the source (=mapping to taxon concept). But if it can't be done, then it can't be done -- so I'm inclined to establish a "place-holder" dwc:Taxon instance, with no nameAccordingTo, and no other metadata besides the scientificName.
Is this just something we can't do for old annotations and just something that we try to do from this point forward?
Probably.
- The last question is one that I really want some opinions about. It seems to me that there are a number of reasons why one would apply a determination.
Hmmm....I don't think this is really useful information. I don't understand how you would use this information in a machine-processing sort of way. An Identification is an Identification. In some cases, the Identifier may not even be aware of the previous identification, and so we cannot necessarily infer there was a particular "reason". And even if there is a reason, how do we use that information? What if there is more than one reason (i.e., if we are restricted to a controlled vocabulary)?
As far as I'm concerned, the Identifications should stand as they are. If needed people can annotate the Identification instances; but I don't see the value in machine-processing these things.
Also:
Finally, a single determiner might apply several determinations to one individual and indicate in each determination the concept intended (i.e. if you subscribe to Cronquist, you'd call it X; if you like Radford's book, you'd call it Y; if you like Weakley's treatment, you'd call it Z).
YIKES! I don't like the idea of loading all that information on an Identification instance. If the person wants to make this sort of assertion, then they should establish the appropriate relationshipAssertion instances among the various taxonConcepts cited.
Damn. Now my head is really tired. And so is the rest of me....
Aloha, and g'night..
Rich
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
------- Arlin Stoltzfus (arlin@umd.edu) Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST IBBR, 9600 Gudelsky Drive, Rockville, MD tel: 240 314 6208; web: www.molevol.org
Thanks, Steve.
The diagram looks about right, except for the arrow heads as you noted. If there's a way you can replace the arrows with some sort of 1:Many line notation, that would be better. As you have it now, the arrowhead is on the "one" side; but I think it's more intuitive to have a "crows-foot" sort of symbol on the "many" side. I can send an example of what I mean. Not a big deal.
Yeah, I originally had it as eventDate, but then switched to eventTime. If Date can include time (and Time is assumed not to include date), then using eventDate is fine.
In principle, I agree with this diagram to the left of taxonNameUsage completely.
(I still need clarification about a few things on the right end.)
Yes, that's a failing on my part to get better documentation out there for GNUB (from which TaxonNameUsage, and all of the other "Usage" terms in DWC come). I hope to correct this by the end of the year.
My main reason for using determination as a term rather than identification is because it is not ambiguous to refer to the person doing the identifying as the determiner, whereas referring to that person as the "identifier" creates confusion between that person and the identifying string for resources (as in "persistent identifier").
Ah! Got it. Makes sense.
So if we agree that determination, annotation, and identification all mean the same thing (namely an instance of the dwc:Identification class), I'm happy to just use the term "identification". For the person doing it, I guess dwc:identifiedBy would be the best term although it's a bit awkward in regular speech so I may slip and still say "determiner".
Either way. Now that you put it in that context, I'm also happy to go with "Determination" and "Determiner".
But I would avoid "Annotation". That word has a much more general meaning, and we'll likely be hearing more and more about it (in the more general sense) as several big-ish projects are working on Annotations (in general) right now.
Although I agree in principle that there can be many occurrences at an Event and many events at a Location, I think there are two practical reasons why it may be better to assign separate eventDate and Location metadata to each Occurrence.
Hmmm...not sure I follow. Are you saying that a new Event record (ID) should be created for every Occurrence record, and that a new Location record (ID) should be created for every Event record? If so, then it's going to be very difficult to convince me of this. I don't think that our database is unusual in having many (sometimes hundreds) of Occurrences at the same Event (e.g., a large fish poison station), and many (again, sometimes hundreds) of Events at the same Location.
The first is that it makes the database structure simpler. As Markus has already noted, we really would prefer for the database to be as "flat" as possible.
Which database? Are we talking about DwCA? If so, I understand the rationale for flattening out content to make it easier to batch-package records and ship them around. But if we're talking about actual database implementations at the content provider end, I think I'm not alone in wanting to stick with a more normalized approach. Besides, what's the point of even defining the different DwC classes, each with their own ID, if we're just going to flatten them all out anyway (as per old DwC)?
When I look at the terms listed in the DwC term page (http://rs.tdwg.org/dwc/terms/index.htm) under Event, the most important one that I see which everyone should be providing is eventDate. The rest I would pretty much consider optional and as a shortcut Rich's diagram could be collapsed to make them direct properties of the Occurrence.
Yes, they could be collapsed to Occurrence -- in the same way that properties of "Individual" are currently collapsed to Occurrence. But after pleading your case to normalize "Individual" as its own separate class, I'm kinda surprised to see you arguing in favor of collapsing the Event class into Occurrence.
The second reason involves the practical matter of defining a Location. I will note that my thinking about this has been deeply influenced by a previous discussion on the topic from 2008-2009 summarized at http://www.sernec.org/files/summary-of-discussion.pdf on p.78-84. I don't think most people will want to wade through all of that text, so I'll just sum it up here. Somebody (I think it might have been Debbie Paul at Morphbank) suggested to me that we really have an intrinsically globally unique identifier for Location. It's the combination of dwc:decimalLatitude and dwc:decimalLongitude along with dwc:coordinateUncertaintyInMeters to establish precision and dwc:geodeticDatum to establish the reference system.
Yes....sort of. Doesn't help for localities defined as bounded boxes, polygons or lines (e.g., transects, as we often have for data from plankton tows) -- but it certainly does serve as a handy "natural key" of sorts for point localities. The problem is that so much of our existing content is not reliably georeferenced yet. Thus, we need all those other terms to accommodate various location descriptors, which will eventually allow us to after-the-fact georeference the localities. Also, many after-the-fact georeferenced points are interpretations. Keeping the descriptors around can allow someone else to come up with a better/more precise lat/long/uncertainty interpretation. Also, errors are abundant (particularly in failing to represent decimal degrees with negatives). Having the descriptors allows us to catch such errors much more quickly.
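A rough sketch of such a "natural key" for point localities, in Python. The function and class names, the default datum, the rounding to six decimal places, and the sample coordinates are all assumptions made for illustration; the thread only names the four Darwin Core terms. As noted above, this covers point-radius localities only, not bounding boxes, polygons, or lines.

```python
from typing import NamedTuple


class PointLocationKey(NamedTuple):
    decimal_latitude: float
    decimal_longitude: float
    coordinate_uncertainty_in_meters: float
    geodetic_datum: str


def point_location_key(lat: float, lon: float, uncertainty_m: float,
                       datum: str = "WGS84") -> PointLocationKey:
    """Build a comparable, hashable key for a point-radius locality."""
    # Range checks catch gross errors only; a dropped minus sign on a
    # longitude still yields a value inside the valid range.
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    # Rounding (an assumption) keeps trivially different floats from
    # producing different keys.
    return PointLocationKey(round(lat, 6), round(lon, 6),
                            float(uncertainty_m), datum.upper())


# Hypothetical coordinates: two records that share a key share a Location.
key_a = point_location_key(36.1447, -86.8027, 10)
key_b = point_location_key(36.1447, -86.8027, 10.0, "wgs84")
key_c = point_location_key(36.1447, -86.8027, 10, "NAD27")

print(key_a == key_b, key_a == key_c)
```

Because the key is a plain tuple, it can be used directly as a dictionary key to group Events or Occurrences that fall on the same point, without assigning a separate Location identifier.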
(If we like geo:lat and geo:long, then the reference system is implied and we are down to three terms to unambiguously define a Location and its uncertainty. For the benefit of humans, a Locality description is probably also beneficial. Also, elevation and depth might be provided, although at least in theory elevation could be calculated with a sufficiently good digital elevation model).
Well, that depends on the extent of coordinateUncertaintyInMeters. Original data often have fairly precise elevations, but imprecise lat/long. Thus, one can often narrow down the likely location more precisely than a circle described by a point/radius. In other words, Lat+Long+coordinateUncertaintyInMeters may describe a circle that includes a range of elevations, and thus an elevation cannot be reliably calculated. Moreover, a lot of modelling use-cases will want as precise of an elevation as possible.
I will grant that we don't have this information for a lot of old records, but based on the massive efforts to geolocate specimens, I would say it's pretty clear that this is what we would like to have if we could get it.
Sure -- but we can't really ignore the massive numbers of non-georeferenced data points that already exist. And even when they are georeferenced, we'll still want to keep the original location descriptors.
I certainly hope that there aren't any serious collectors, observers, and live organism photographers who aren't by this point trying to record this information as they establish new Occurrence records. If you look at all of the Location terms on the dwc list, most of the other terms are either concessions to the fact that we don't have what we want (e.g. the "verbatim" terms), things we could generate using a computer program if we were clever (like stateProvince, county, etc. - I know at least Mike Giddens has succeeded in doing this), ways of indicating how we got lat and long from old records (e.g. georeferenceSources), or methods to define larger scale Locations that aren't points (e.g. footprintWKT). I think it is safe to say that in the future (if not now already), many or most Events associated with Occurrences will have an associated button click (on a GPS receiver, camera phone, or GPS-enabled camera) that will automatically generate dwc:eventDate, dwc:decimalLatitude, dwc:decimalLongitude (with geodeticDatum=WGS84) and maybe coordinateUncertaintyInMeters. Thus designing a system that requires that these time/space snapshots be grouped together into artificial "Locations" is really counterproductive when those data are now generated and can be associated with Occurrences automatically.
OK, I see where you're coming from. But I guess my response is that we're a LONG way from:
- A world where most existing Occurrence content is well georeferenced;
- A world where reliable services allow me to easily/automatically query records on, say "Northwestern Hawaiian Islands", based on GIS polygon querying;
- A world where content holders are not going to want to share all the textual locality descriptors with their records.
Besides, we're going to always want to maintain the ability to define locations as bounded boxes, polygons, and lines; not just point-radius.
Moreover, there are many cases where a single location is re-used for many different events. The same tree monitored continuously over years. The same field station re-visited year after year. The same transect repeated every month/season/year to monitor populations & presence/absence. LTER data. Christmas Count data. In other words, it's common to have multiple Events stacked at the same locality; and we won't want to always limit ourselves to defining that locality using point/radius.
So I still see the advantage of keeping Location as a separate class, maintaining those extra "human-friendly" descriptor terms, and conceptualizing 1:M Location:Events.
I don't know if Greg Riccardi of Morphbank is following this thread or not. If so he may want to comment on this issue based on practical experience at Morphbank. When the Morphbank system was set up, it required the creation of a separate Location record which was assigned a unique Morphbank identifier. Specimens were then linked to this Location. What ended up happening was that each Specimen having GPS metadata ended up being assigned to its own separate Location even if it was 20 meters from another specimen. In effect, each Occurrence record ended up having its own decimalLatitude/decimalLongitude record anyway. So the system ended up being more complicated than necessary.
Yes -- that's definitely a trend of modern collecting data with GPS....a tendency towards fewer and fewer instances of Events per instance of Location -- to the point where many records are now 1:1. However, that doesn't change the fact that an enormous volume of content currently is, and always will be, best structured as many Events per location, and many Occurrences per Event. I think DwC needs to accommodate that content. It's easy to store 1:1 records in a structure designed to accommodate 1:M. But it's a lot messier to generate 1:M content in a structure designed for 1:1.
As I said, I agree in principle with the left side of Rich's diagram. Taking the practical considerations I just mentioned into account, I would simplify the diagram as http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif
I think that's perfectly fine as a simplified structure for data exchange, such as DwCA and other applications that aggregate content and/or provide value-added or indexing services on aggregated content. But for original source data, I think it would be unwise to advocate such a simplified structure to anything but small ad-hoc project-based systems. It's incredibly easy to flatten out the more normalized form into the more flattened form; not so easy to go the other way around.
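The "easy direction" can be shown in a few lines of Python. This is a sketch under assumptions: the records, identifiers, and coordinate values are invented for illustration, and the dictionaries merely stand in for normalized Location, Event, and Occurrence tables.

```python
# Hypothetical normalized source data: one Location, one Event, two Occurrences.
locations = {
    "loc1": {"decimalLatitude": 21.3, "decimalLongitude": -157.8},
}
events = {
    "ev1": {"locationID": "loc1", "eventDate": "2010-10-19"},
}
occurrences = [
    {"occurrenceID": "occ1", "eventID": "ev1"},
    {"occurrenceID": "occ2", "eventID": "ev1"},
]


def flatten(occurrences, events, locations):
    """Produce one flat row per Occurrence by following the foreign keys."""
    rows = []
    for occ in occurrences:
        event = events[occ["eventID"]]
        location = locations[event["locationID"]]
        # Copy Event and Location properties onto the Occurrence row.
        rows.append({**occ, **event, **location})
    return rows


flat_rows = flatten(occurrences, events, locations)
for row in flat_rows:
    print(row)
```

Flattening is a mechanical join. Going the other way means deciding which flat rows really share an Event or a Location, and once the identifiers are dropped the flat rows no longer record that, so it has to be inferred from matching values.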
I also still find it interesting that you are quite content to flatten Locality and Event class terms into Occurrence, while simultaneously wanting to normalize Individual as a new class, separate from Occurrence. I'm not saying that establishing an Individual class is a bad idea -- in principle I support it. But I am curious as to why you think it important to push for normalization on the Individual side of an Occurrence, but push for de-normalization on the Event side of an Occurrence.
Superficially, it looks more complicated, but I've gotten rid of several "one to many" relationships and enthroned Occurrence at its accustomed place in the center of the universe (or at least the center of the left side of the diagram). I don't have any philosophical objections to people structuring their data according to Rich's original diagram and the existing Darwin Core terms certainly make it possible to do so (well except for the Individual thing). However, I submit that many people will find it simpler (and easier to use tools like Darwin Core Archives) if they use the flatter structure that I have in the revised diagram.
That may be true. But I've spent more than two decades taking over-simplified, flattened database structures and transforming them into more normalized structures, because the flattened structures consistently limited my ability to ask novel questions of the database, and also encouraged inconsistency of data entry practices. The small price I pay for increased normalization has yielded ample return in more "powerful" datasets (i.e., more flexibility in how I can frame and/or analyze the data).
I will save my questions about the right side of Rich's diagram for later.
That would be best answered through documentation of GNUB, which I will be working on intensively over the next two months.
Aloha, Rich
On Tue, Oct 19, 2010 at 10:12 AM, Richard Pyle deepreef@bishopmuseum.org wrote:
Yeah, I originally had it as eventDate, but then switched to eventTime. If Date can include time (and Time is assumed not to include date), then using eventDate is fine.
I would recommend using eventTime for a date + time-of-day. "Time" is more general than "date." This is the usage in the Ruby world.
///ark Web Applications Developer Center for Applied Biodiversity Informatics California Academy of Sciences
I was going by the definitions at http://rs.tdwg.org/dwc/terms/index.htm#eventDate and http://rs.tdwg.org/dwc/terms/index.htm#eventTime Going by these definitions, eventDate is an ISO 8601 encoded thing that can include both date and time (or only date at a lower resolution). eventTime appears to only refer to the time (at least based on the examples). If we are going to call these things dwc:eventDate and dwc:eventTime we have to go with the way they are defined in the Darwin Core standard. Steve
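To make the distinction concrete, here is a small Python sketch that accepts an eventDate-style value holding either a date alone or a date plus a time of day. The function name and sample values are assumptions for illustration, and the standard library parser used here handles only a subset of ISO 8601 (no intervals, and no lower-resolution values such as a year and month alone).

```python
from datetime import date, datetime, time
from typing import Optional, Tuple


def split_event_date(value: str) -> Tuple[date, Optional[time]]:
    """Split an ISO 8601 date or date-time string into (date, time or None)."""
    if "T" in value:
        # Date plus time of day, e.g. recorded automatically by a camera.
        stamp = datetime.fromisoformat(value)
        return stamp.date(), stamp.time()
    # Date only: the time of day is simply unknown.
    return date.fromisoformat(value), None


with_time = split_event_date("2010-10-19T10:12:00")
date_only = split_event_date("2010-10-19")

print(with_time)
print(date_only)
```

A single field can therefore carry both resolutions, and a consumer can tell from the value itself whether a time of day was recorded.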
Those definitions in the DwC documentation are correct. Note that they are not implementation-specific. The caution here is that an ISO 8601 date time is much more expressive than an xs:dateTime (a specific implementation), for example.
On Tue, Oct 19, 2010 at 12:16 PM, Steve Baskauf < steve.baskauf@vanderbilt.edu> wrote:
I was going by the definitions at http://rs.tdwg.org/dwc/terms/index.htm#eventDate and http://rs.tdwg.org/dwc/terms/index.htm#eventTime Going by these definitions, eventDate is an ISO 8601 encoded thing that can include both date and time (or only date at a lower resolution). eventTime appears to only refer to the time (at least based on the examples). If we are going to call these things dwc:eventDate and dwc:eventTime we have to go with the way they are defined in the Darwin Core standard. Steve
That was my original rationale; but I'll go with whatever the consensus is.
Rich
-----Original Message----- From: Mark Wilden [mailto:mark@mwilden.com] Sent: Tuesday, October 19, 2010 8:21 AM To: Richard Pyle Cc: Steve Baskauf; tdwg-content@lists.tdwg.org; tdwg-bioblitz@googlegroups.com Subject: Re: [tdwg-content] practical details of recording a determination What is an Occurrence?
On Tue, Oct 19, 2010 at 10:12 AM, Richard Pyle deepreef@bishopmuseum.org wrote:
Yeah, I originally had it as eventDate, but then switched to eventTime. If Date can include time (and Time is assumed not to include date), then using eventDate is fine.
I would recommend using eventTime for a date + time-of-day. "Time" is more general than "date." This is the usage in the Ruby world.
///ark Web Applications Developer Center for Applied Biodiversity Informatics California Academy of Sciences
Specific responses inline
Hmmm...not sure I follow. Are you saying that a new Event record (ID) should be created for every Occurrence record, and that a new Location record (ID) should be created for every Event record? If so, then it's going to be very difficult to convince me of this. I don't think that our database is unusual in having many (sometimes hundreds) of Occurrences at the same Event (e.g., a large fish poison station), and many (again, sometimes hundreds) of Events at the same Location.
I think I answered this (in a sense) in the other email that I sent a little while ago. In principle, if every record of an Occurrence has a time that is discernibly different from the times of other Occurrences (i.e. because the time was recorded to the nearest second by a machine), then yes, I consider it to be a different event. To force people to lump such Occurrences together into one Event (say comprising a day or some other time interval larger than a second) is essentially throwing away data that we already have about the Occurrences. I have already found it useful to know exactly what time of day a flower was opened rather than closed or the order in which I took several images.
As for creating a new Event record ID for all of those one-second events, I would submit the same solution that I gave in the previous post. In my database, I don't have separate Event records for the times when I took live plant images (which I am rightly or wrongly calling Occurrences). I just have a "flat" table where each image has an eventTime (the time recorded automatically in the EXIF data for the image). If I needed an RDF structure that was compatible with the kind of structure used by people who have many Occurrences associated with a single event (e.g. your fish kill), then for the image http://bioimages.vanderbilt.edu/baskauf/57755 I would automatically create an event identifier for its creation as http://bioimages.vanderbilt.edu/baskauf/57755#event . There! I have a perfectly valid GUID that represents the event without any additional burden on my record-keeping system since I don't have to keep any additional records about that GUID beyond the ones I'm already keeping for the image. I'm not doing this now, but I suppose I should if there is a consensus that things are related to each other in the way shown in your diagram. The same approach could be taken for Location. People who care about having unrelated identifiers for their Events and Locations (because of one-to-many relationships) would be welcome to do so, but I wouldn't need to in my internal database.
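The identifier scheme described above amounts to appending a fragment to the image's own URI. A minimal Python sketch follows; the function name is hypothetical, and while "#event" comes from the example above, the "#location" fragment is an assumption, since the text says only that the same approach could be taken for Location.

```python
def derived_iri(resource_iri: str, fragment: str = "event") -> str:
    """Derive an identifier for a related thing from a resource's own IRI."""
    # Drop any existing fragment so the result has exactly one.
    base = resource_iri.split("#", 1)[0]
    return f"{base}#{fragment}"


image_iri = "http://bioimages.vanderbilt.edu/baskauf/57755"

event_iri = derived_iri(image_iri)
# The "#location" fragment name is hypothetical, chosen by analogy.
location_iri = derived_iri(image_iri, "location")

print(event_iri)
print(location_iri)
```

Because the identifiers are computed from the image URI, nothing extra has to be stored in the flat table, which is the point being made above.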
Which database? Are we talking about DwCA? If so, I understand the rationale for flattening out content to make it easier to batch-package records and ship them around. But if we're talking about actual database implementations at the content provider end, I think I'm not alone in wanting to stick with a more normalized approach. Besides, what's the point of even defining the different DwC classes, each with their own ID, if we're just going to flatten them all out anyway (as per old DwC)?
Well, yes I had DwCA (Darwin Core Archives) specifically in mind. But it could be any other "shipping format" or local database format. I think we need to draw the distinction between having a way that people can understand the meaning of metadata records and fields that we are shipping to them (e.g. DwCA), and describing the properties and connections between resources (e.g. in RDF). I think this may be what Pete means when he says that we need two "kinds" of DwC. The first use is pretty much "ready to go". I don't think we know how close we are to being able to use the existing DwC terms for describing relationships until we have some more conversation of the sort we are having now as well as conversation about which existing terms can be used (perhaps in ways that weren't originally intended) to express the relationships needed in RDF. I feel like I hear Pete saying that we are still lacking a lot of the predicates we need, while I (and maybe Cam) feel that we are most of the way there.
For clarification, when I argue that the class Individual should exist in Darwin Core, it's not because I'm insisting that all users must have an Individual table in their database. What I want is for people to be ABLE to have an Individual table in their database (if they need it) and have others understand what it means and how the entities described in that table are related conceptually to other things like Identifications and Occurrences. If all of the records in their database have only one occurrence per individual, they don't "need" to keep track of Individuals.
Yes, they could be collapsed to Occurrence -- in the same way that properties of "Individual" are currently collapsed to Occurrence. But after pleading your case to normalize "Individual" as its own separate class, I'm kinda surprised to see you arguing in favor of collapsing the Event class into Occurrence.
I confess my crime. In penance, I freely confess that the Event class exists and that people should use it in their databases if it helps them cluster Occurrences. In addition, I confess that I should probably acknowledge the existence of Events and their relationship to other Darwin Core classes when I write RDF. Guilty as charged!
Yes....sort of. Doesn't help for localities defined as bounding boxes, polygons or lines (e.g., transects, as we often have for data from plankton tows) -- but it certainly does serve as a handy "natural key" of sorts for point localities. The problem is that so much of our existing content is not reliably georeferenced yet. Thus, we need all those other terms to accommodate various location descriptors, which will eventually allow us to after-the-fact georeference the localities. Also, many after-the-fact georeferenced points are interpretations. Keeping the descriptors around can allow someone else to come up with a better/more precise lat/long/uncertainty interpretation. Also, errors are abundant (particularly in failing to represent decimal degrees with negatives). Having the descriptors allows us to catch such errors much more quickly.
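To illustrate how a retained descriptor can expose a missing minus sign, here is a rough sketch. It assumes a lookup table of approximate bounding boxes keyed by locality descriptor; the GAZETTEER contents, the box values for "Honolulu", and the function names are all hypothetical and only for illustration.

```python
# Hypothetical gazetteer: descriptor -> (min_lat, max_lat, min_lon, max_lon).
# The box is an approximate placeholder, not an authoritative boundary.
GAZETTEER = {"Honolulu": (21.2, 21.4, -158.0, -157.6)}


def in_box(lat, lon, box):
    """True if the point lies inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def check_sign(descriptor, lat, lon):
    """Compare coordinates with the descriptor and flag likely sign errors."""
    box = GAZETTEER.get(descriptor)
    if box is None:
        return "unknown locality"
    if in_box(lat, lon, box):
        return "ok"
    # Try each sign flip; if one lands inside the box, a sign was probably dropped.
    for flipped_lat, flipped_lon in ((lat, -lon), (-lat, lon), (-lat, -lon)):
        if in_box(flipped_lat, flipped_lon, box):
            return "possible sign error"
    return "mismatch"


print(check_sign("Honolulu", 21.3, 157.86))
print(check_sign("Honolulu", 21.3, -157.86))
```

Without the descriptor there is nothing to compare the coordinates against, which is the point being made about keeping the "human-friendly" terms.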
Here I will confess the crime of ignorance. I'm still trying to understand the need for and uses of a number of the Darwin Core dcterms:Location class terms. I guess I need to spend some more time reading the Guide to Best Practices for Georeferencing (I was going to include the link here, but the link at http://www.biogeomancer.org/library.html is broken).
Sure -- but we can't really ignore the massive numbers of non-georeferenced datapoints that already exist. And even when they are georeferenced, we'll still want to keep the original location descriptors.
Agree on this and all other points I deleted here.
OK, I see where you're coming from. But I guess my response is that we're a LONG way from:
- A world where most existing Occurrence content is well georeferenced;
[text omitted for brevity]
So I still see the advantage of keeping Location as a separate class, maintaining those extra "human-friendly" descriptor terms, and conceptualizing 1:M Location:Events.
I'm totally convinced. Location and Event belong where you had them in the diagram.
[more text listed below]
I also still find it interesting that you are quite content to flatten Locality and Event class terms into Occurrence, while simultaneously wanting to normalize Individual as a new class, separate from Occurrence. I'm not saying that establishing an Individual class is a bad idea -- in principle I support it. But I am curious as to why you think it important to push for normalization on the Individual side of an Occurrence, but push for de-normalization on the Event side of an Occurrence.
Again, I plead guilty as charged. The left side of the chart should be allowed to not be flat. However, I maintain my stance that many (most?) new Occurrence records will in the future have their own individual latitude/longitude/elevation or depth/time. Those "atomized events points" could easily be aggregated by software into larger scale events and locations by some simple rules about the timespan for events and geographic bounds for locations. Those larger scale events and locations could be used to ask the kinds of questions you describe below. As long as you aren't requiring me to do this kind of aggregation BEFORE I create my records (and hence requiring me to lose the data that my GPS has collected for me automatically), I'm happy to allow others to define events and locations on larger scales and with 1:M relationships.
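As a rough sketch of the "simple rules" idea, the following groups per-record timestamps into larger events whenever consecutive records are close in time. The one-hour threshold, the function name group_into_events, and the sample timestamps are assumptions for illustration; a geographic rule (e.g. a distance bound) could be layered on in the same way.

```python
from datetime import datetime, timedelta


def group_into_events(timestamps, max_gap=timedelta(hours=1)):
    """Cluster timestamps: start a new group when the gap to the
    previous timestamp exceeds max_gap."""
    groups = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][-1] <= max_gap:
            groups[-1].append(ts)
        else:
            groups.append([ts])
    return groups


# Illustrative machine-recorded times (e.g. from EXIF data).
times = [
    datetime(2010, 10, 20, 9, 0, 5),
    datetime(2010, 10, 20, 9, 0, 47),
    datetime(2010, 10, 20, 9, 12, 30),
    datetime(2010, 10, 20, 15, 30, 0),
]

events = group_into_events(times)
print([len(group) for group in events])
```

The original timestamps are kept inside each group, so the aggregation happens after the fact and no resolution is lost.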
That may be true. But I've spent more than two decades taking over-simplified, flattened database structures and transforming them into more normalized structures, because the flattened structures consistently limited my ability to ask novel questions of the database, and also encouraged inconsistency of data entry practices. The small price I pay for increased normalization has yielded ample return in more "powerful" datasets (i.e., more flexibility in how I can frame and/or analyze the data).
As I said in my earlier email, I'm encouraged by the consistency in the way I hear people talking about the relationships among the DwC classes. I was a bit afraid when we started this thread that I would turn out to have some kind of fringe ideas. Now what I'm seeing is a lot of variation on how people choose to "collapse" the basic model to meet their individual needs, but not a lot of disagreement about what the basic model is.
Thanks for your great feedback and for challenging my statements. I need that!
Steve
I think I answered this (in a sense) in the other email that I sent a little while ago.
Yes, you did; and I now understand where you're coming from.
In principle, if every record of an Occurrence has a time that is discernibly different from the times of other Occurrences (i.e. because the time was recorded to the nearest second by a machine), then yes, I consider it to be a different event. To force people to lump such Occurrences together into one Event (say comprising a day or some other time interval larger than a second) is essentially throwing away data that we already have about the Occurrences. I have already found it useful to know exactly what time of day a flower was opened rather than closed or the order in which I took several images.
Understood. But we essentially never throw data away -- the problem is we have no way of tracking the high resolution data. For example, if we run a plankton tow, or a fish poison station, we have no real way of putting individual timestamps or geopoints on every individual -- our only option is to collapse it all into one event. We can then ditto the same values for all 200 specimens; or we can normalize it to one Event linked to 200 Individuals.
Likewise with locality. There are specific study sites that get revisited over and over, so it's useful to have a single Locality instance linked to all of the events. Also, many, many records in our database only have generic locality descriptors (e.g., "Honolulu"), so there is no point in duplicating the lat/long/uncertainty/descriptors for hundreds of events -- just define a unique locality, and link it to the hundreds of events. Then if we get a re-interpretation of lat/long/uncertainty for "Honolulu", or if we establish it as a polygon, we simply attach that metadata to the one Locality record, rather than the hundreds of events (or thousands of Occurrences). You know -- the usual reasons for normalizing a data model.
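The normalization argument above can be sketched in a few lines, assuming one Locality record shared by many Events. The class and field names, the record count, and the coordinate values are illustrative placeholders, not anyone's actual schema or an authoritative georeference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Locality:
    locality_id: str
    description: str
    decimal_latitude: Optional[float] = None
    decimal_longitude: Optional[float] = None
    uncertainty_m: Optional[float] = None


@dataclass
class Event:
    event_id: str
    locality_id: str  # many Events point at one Locality


localities = {"loc1": Locality("loc1", "Honolulu")}
events = [Event(f"ev{i}", "loc1") for i in range(300)]

# A later georeference is attached once, to the single Locality record.
loc = localities["loc1"]
loc.decimal_latitude = 21.3      # placeholder value
loc.decimal_longitude = -157.86  # placeholder value
loc.uncertainty_m = 20000.0      # placeholder value

# Every linked Event sees the new interpretation without being edited.
georeferenced = [
    e for e in events
    if localities[e.locality_id].decimal_latitude is not None
]
print(len(georeferenced))
```

In the flattened alternative the same three values would have to be written to all 300 event rows, which is where inconsistencies tend to creep in.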
For clarification, when I argue that the class Individual should exist in Darwin Core, it's not because I'm insisting that all users must have an Individual table in their database. What I want is for people to be ABLE to have an Individual table in their database (if they need it) and have others understand what it means and how the entities described in that table are related conceptually to other things like Identifications and Occurrences. If all of the records in their database have only one occurrence per individual, they don't "need" to keep track of Individuals.
Having read through this complete discussion, I am now convinced you are right. We already have dwc:individualID; so we're primed. Logically, if such a class is established, then there are several terms in the Occurrence Class that ought to be migrated to the Individual Class. I don't know how much inertia must be overcome in order to propose/review/discuss/vote/ratify such a change in DwC, but if/when we get to that point, count me as a strong supporter.
I confess my crime. In penance, I freely confess that the Event class exists and that people should use it in their databases if it helps them cluster Occurrences. In addition, I confess that I should probably acknowledge the existence of Events and their relationship to other Darwin Core classes when I write RDF. Guilty as charged!
Likewise for me! I realized afterward that I was defending the collapsing of Individual into Occurrence, while at the same time fighting the (equally justified) collapse of Location and/or Event into Occurrence. So I was playing both sides too. Mea culpa, and I now support the class Individual.
[Lots of stuff we agree on deleted]
As I said in my earlier email, I'm encouraged by the consistency in the way I hear people talking about the relationships among the DwC classes. I was a bit afraid when we started this thread that I would turn out to have some kind of fringe ideas. Now what I'm seeing is a lot of variation on how people choose to "collapse" the basic model to meet their individual needs, but not a lot of disagreement about what the basic model is.
I have to say, this has been about the most productive (if voluminous) list-discussion I've had in...well...maybe ever. It seems we've both been equally persuasive, and equally willing to concede. How rarely that happens in an internet forum! I'm not sure there's anything left that we disagree about. If the "diagram1" seems to resonate with everyone as the most "normalized" ER diagram we'll likely ever need, and if we can somehow accommodate flexibility in RDF for collapsing attributes to different classes (but only from the "one" side to the "many" side) -- then we might have achieved the elusive Holy Grail of biodiversity informatics: true consensus.
Thanks for your great feedback and for challenging my statements. I need that!
Likewise!
Aloha, Rich
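A small sketch of the "one" side to "many" side collapse mentioned above, assuming Events are the "one" side and Occurrences the "many" side. The dictionary layout and the function name flatten are my own; the keys reuse existing Darwin Core term names only as labels.

```python
# "One" side: a single Event record keyed by its identifier.
events = {"ev1": {"eventDate": "2010-10-20", "locationID": "loc1"}}

# "Many" side: Occurrences that each reference the Event.
occurrences = [
    {"occurrenceID": "occ1", "eventID": "ev1"},
    {"occurrenceID": "occ2", "eventID": "ev1"},
]


def flatten(occurrence_records, event_records):
    """Copy each Event's attributes onto every Occurrence that references it."""
    return [
        {**occ, **event_records[occ["eventID"]]}
        for occ in occurrence_records
    ]


flat = flatten(occurrences, events)
for row in flat:
    print(row)
```

Collapsing in this direction only repeats values, so nothing is lost. Going the other way (pushing Occurrence attributes up onto a shared Event) would force differing values into one slot, which is presumably why the collapse is restricted to one direction.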
Well, I also feel pretty good about most of that diagram, but I'm still struggling with the whole "token" thing. I feel the need to discuss basisOfRecord=LivingSpecimen which is the most complicated case and is also related to the previous discussion about dwc:establishmentMeans as well as my proposal to move it to the proposed Individual class. It is also related to another issue that I haven't broached here but which is discussed in my paper - "Occurrences" that aren't directly derived from an individual. I'm beginning to think that part of what I wrote there (in the paper) was wrong, but I'm not sure what the alternative is. That issue will probably come up if I comment about what Cam wrote in his email. So there may be more to hash out, but I can't handle it today because I've got too many other things to do. I've been mentally composing what I hope is a lucid presentation, but it hasn't hit the keyboard yet.
Steve
Sticking my head up with a few suggestions
1. (Social) I've only been a lurker in this historically long thread since all my formal Biology training came from Mr. Siegler at Red Bank, NJ High School in 1957. But I have noticed that dwc:establishmentMeans sounds like something for which I recall that the Invasive Species informatics community requires a fairly fine-grained vocabulary. Yet I've only noticed one participant (Jerry Cooper) who I recognize travels in that community. (But I don't know all the participants). TDWG in general, but perhaps not this list in particular, has increasingly strong connections to the Invasive Species world, but their use-cases will still need to be aggressively sought.
2. (Technical). The conversation often has words like "attribute", "class" and RDF in the same sentence. In my experience, when people begin to formalize this constellation using the RDF stack, the first thing they do is translate ER-like diagrams into triples that look, for example, like "dwc:establishmentMeans rdfs:domain dwc:Individual". IMO, this should not be done lightly, because in RDF it would entail that should someone choose to apply dwc:establishmentMeans to, say, a pqr:Population object P, then that pqr:Population P would necessarily be a dwc:Individual, which sounds naughty to me. It may well be that the intent is that dwc:establishmentMeans is meant to apply only to Individuals (though in my naiveté that would surprise me), but such decisions should not be taken lightly if there is any desire to have RDF as a basis for logical reasoning about Life, the Universe, and Everything---or at least about Life. Informal narrative like "move dwc:establishmentMeans to the proposed Individual class" could dig itself into the rdfs:domain hole...
--Bob Morris
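The entailment described above can be demonstrated with a toy reasoner. This is a sketch in plain Python rather than a real RDF library: triples are just tuples of strings, and only the single RDFS domain rule (rdfs2 in the RDF Semantics document) is applied. The subject ex:P and the value "introduced" are made up for the example.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def apply_domain_rule(triples):
    """Return the triples plus every rdf:type triple entailed by rdfs:domain.

    Rule: if (p rdfs:domain C) and (s p o), then (s rdf:type C).
    """
    domains = {(s, o) for s, p, o in triples if p == RDFS_DOMAIN}
    inferred = set(triples)
    for s, p, _o in triples:
        for prop, cls in domains:
            if p == prop:
                inferred.add((s, RDF_TYPE, cls))
    return inferred


triples = {
    # The schema statement under discussion.
    ("dwc:establishmentMeans", RDFS_DOMAIN, "dwc:Individual"),
    # Someone applies the property to a population (hypothetical data).
    ("ex:P", RDF_TYPE, "pqr:Population"),
    ("ex:P", "dwc:establishmentMeans", "introduced"),
}

closure = apply_domain_rule(triples)
print(("ex:P", RDF_TYPE, "dwc:Individual") in closure)
```

Note that nothing is rejected: the domain statement does not stop anyone from using the property on a Population, it simply makes a reasoner conclude that the Population is also an Individual.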
I would have to think about the specific details Bob brings up, but he mentions the kinds of issues that I think are being overlooked.
What might make sense when mapped to a relational table (XMLish), might not make sense when represented as triples.
- Pete
On Wed, Oct 20, 2010 at 10:00 PM, Bob Morris morris.bob@gmail.com wrote:
-- Robert A. Morris Emeritus Professor of Computer Science UMASS-Boston 100 Morrissey Blvd Boston, MA 02125-3390 Associate, Harvard University Herbaria email: morris.bob@gmail.com web: http://bdei.cs.umb.edu/ web: http://etaxonomy.org/mw/FilteredPush http://www.cs.umb.edu/~ram phone (+1) 857 222 7992 (mobile)
On Wed, Oct 20, 2010 at 9:25 PM, Steve Baskauf steve.baskauf@vanderbilt.edu wrote:
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
On Wed, Oct 20, 2010 at 8:00 PM, Bob Morris morris.bob@gmail.com wrote:
Sticking my head up with a few suggestions
- (Social) I've only been a lurker in this historically long thread since all my formal Biology training came from Mr. Siegler at Red Bank, NJ High School in 1957. But I have noticed that dwc:establishmentMeans sounds like something for which I recall that the Invasive Species informatics community requires a fairly fine-grained vocabulary. Yet I've only noticed one participant (Jerry Cooper) who I recognize travels in that community. (But I don't know all the participants). TDWG in general, but perhaps not this list in particular, has increasingly strong connections to the Invasive Species world, but their use-cases will still need to be aggressively sought.
- (Technical). The conversation often has words like "attribute", "class" and RDF in the same sentence. In my experience, when people begin to formalize this constellation using the RDF stack, the first thing they do is translate ER-like diagrams into triples that look, for example, like "dwc:establishmentMeans rdfs:domain dwc:Individual". IMO, this should not be done lightly, because in RDF it would entail that should someone choose to apply dwc:establishmentMeans to, say, a pqr:Population object P, then that pqr:Population P would necessarily be a dwc:Individual, which sounds naughty to me. It may well be that the intent is that dwc:establishmentMeans is meant to apply only to Individuals (though in my naiveté that would surprise me), but such decisions should not be taken lightly if there is any desire to have RDF as a basis for logical reasoning about Life, the Universe, and Everything---or at least about Life. Informal narrative like "move dwc:establishmentMeans to the proposed Individual class" could dig itself into the rdfs:domain hole...
It was exactly this observation that resulted in the removal of all rdfs:domain assignments in the Darwin Core as we have it today. Even without the domain assignments, I have trouble reconciling dwc:establishmentMeans as a property that describes an Individual. Instead, it seems to me to be a relationship between the Individual and an Event happening at a Location. An example that sends me down this line of reasoning is a rhino calf taken from the wild, an Event where the dwc:establishmentMeans would be 'wild', and placed in a zoo, where the dwc:establishmentMeans for the same Individual would be 'captive'.
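The rhino example can be written down as data to show why a single value per Individual does not fit. This is only a sketch: the Occurrence class here is a stand-in with made-up field names and identifiers, not a Darwin Core definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    individual_id: str
    event_id: str
    establishment_means: str


# Hypothetical records for the same animal at two different Events.
occurrences = [
    Occurrence("rhino-1", "capture-in-wild", "wild"),
    Occurrence("rhino-1", "zoo-census", "captive"),
]


def establishment_values(individual_id, records):
    """Collect every establishmentMeans value recorded for one Individual."""
    return {
        r.establishment_means
        for r in records
        if r.individual_id == individual_id
    }


print(sorted(establishment_values("rhino-1", occurrences)))
```

One Individual ends up with two values, so the property behaves like an attribute of the Individual-at-Event pairing rather than of the Individual alone.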
I see dwc:establishmentMeans as being very clearly a property of "Individual At Place" (again, scoping "Individual" up to at least population). The closest thing we have to that in the diagram1 is Occurrence. The only hitch is that Occurrence isn't exactly "Individual At Place", so much as "Individual At Event[=Place+Time]". Some people have suggested that dwc:establishmentMeans is a function of Time as well as Place, in which case it is very clearly (to me) a property of Occurrence.
Rich
________________________________
From: Steve Baskauf [mailto:steve.baskauf@vanderbilt.edu]
Sent: Wednesday, October 20, 2010 3:25 PM
To: Richard Pyle
Cc: tdwg-content@lists.tdwg.org
Subject: Re: [tdwg-content] practical details of recording a determination What is an Occurrence?
The "Guide to Best Practices for Georeferencing" (roughly 100 pages) can be found among the GBIF Training Manuals at http://www.gbif.org/participation/training/resources/gbif-training-manuals/ .
For those with less patience (or time, or need) for details, the primary source for the "Best Practices", the "MaNIS/HerpNet/ORNIS Georeferencing Guidelines" (roughly 28 printed pages) at http://www.gbif.org/participation/training/resources/gbif-training-manuals/ is a much shorter read with all of the fundamental salient points.
If you are really time-limited, or in need of a practical how-to guide, the "Georeferencing for Dummies" (seven printed pages in Excel spreadsheet form) at http://herpnet.org/documents/georeffordummy.xls is for you.
There are lots of other resources, and great tools for georeferencing, but be advised that excessive georeferencing can lead to unstable mental states. More than one of the graduates of the "Georeferencing Roadshow" (a series of more than twenty international georeferencing workshops) has subsequently required therapy when they ran out of records to georeference.
You've been warned. ;-)
On Wed, Oct 20, 2010 at 1:06 PM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:
[extensive snipping...]
Here I will confess the crime of ignorance. I'm still trying to understand the need for and uses of a number of the Darwin Core dcterms:Location class terms. I guess I need to spend some more time reading the Guide to Best Practices for Georeferencing (I was going to include the link here, but the link at http://www.biogeomancer.org/library.html is broken).
Sure -- but we can't really ignore the massive numbers of non-georeferenced datapoints that already exist. And even when they are georeferenced, we'll still want to keep the original location descriptors.
[more extensive snipping...]
Oops. Mistaken reference. The "MaNIS/HerpNet/ORNIS Georeferencing Guidelines" are at http://manisnet.org/GeorefGuide.html. Sorry about that.
On Wed, Oct 20, 2010 at 9:58 PM, John Wieczorek tuco@berkeley.edu wrote:
The "Guide to Best Practices for Georeferencing" (roughly 100 pages) can be found among the GBIF Training Manuals at http://www.gbif.org/participation/training/resources/gbif-training-manuals/ .
For those with less patience (or time, or need) for details, the primary source for the "Best Practices", the "MaNIS/HerpNet/ORNIS Georeferencing Guidelines" (roughly 28 printed pages) at http://www.gbif.org/participation/training/resources/gbif-training-manuals/ is a much shorter read with all of the fundamental salient points.
If you are really time-limited, or in need of a practical how-to guide, the "Georeferencing for Dummies" (seven printed pages in Excel spreadsheet form) at http://herpnet.org/documents/georeffordummy.xls is for you.
There are lots of other resources, and great tools for georeferencing, but be advised that excessive georeferencing can lead to unstable mental states. More than one of the graduates of the "Georeferencing Roadshow" (a series of more than twenty international georeferencing workshops) has subsequently required therapy when they ran out of records to georeference.
You've been warned. ;-)
Steve,
Our information system requirements here at CANB have resulted in a data model that looks pretty much like Rich's diagram 1 (http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif), except that we choose, as you suggest, to sink location into event (though for the reason that each event will normally result in a unique description of its locality). We do, however, have applications that prefer to view these data as you have modelled them in diagram 2 (http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif). Interestingly, the former is very close to the ASC data model (Blum 199?, which I could not find online) and the latter is very like ABCD (if you collapse individual to an instance of occurrence). Neither of these looks much like the Darwin Core Archive format, which requires one to collapse everything to properties of an occurrence.
In our system very few properties have definitions that correspond to Darwin Core terms. Even our classes have different definitions. When we interchange data between Australian herbaria we choose to stick with HISPID (http://hiscom.chah.org.au/wiki/HISPID5), which provides for more precise definitions and tighter controlled vocabularies than are possible with Darwin Core (eventually I would hope that these narrower terms might be mapped vertically with Darwin Core within a TDWG knowledge organisation system implemented using broader standards). This does not prevent us from mapping local vocabularies into Darwin Core for the purpose of delivering data to GBIF, the ALA or any network that chooses to accept data in a more generic state. In many cases, especially when we consider individual values for any given term, this results in information loss, but as this is essentially one-way traffic it's not our loss.
So for the benefit of those consumers of generic TDWG product we try to provide services delivering DwC, ABCD, HISPID, TCS, TDWG-RDF, etc. as expected. That's what our clients expect and it works for us because it simplifies the development of local API.
In our world occurrence is an abstraction and one of the hardest things to deliver using DwC. It simply does not exist as a distinct object within our information system. As a relation using gathering and taxon it is essentially an ephemeral thing. Sooner or later it will change … same identifiers for a different taxon at the same locality. These data are also highly repetitive and a dilemma we face in delivery is how to choose which values to omit when mapping to Darwin Core Occurrence.
The point is that in Darwin Core we have a standard for communication at a very generic level. The fact that one's "individual" maps to DwC as "occurrence", or that we have 200 million specimen annotations that cannot be mapped to concepts, or that I must choose one of 27 unique identifiers for objects resulting from a single gathering to construct a DwC record, does not prevent us from choosing to model the world in a way that best suits our particular requirements, or from delivering data into the Global network using Darwin Core.
When it comes to RDF representations there is a simple rule that we use well known forms wherever possible. But here we have the advantage of being able to incorporate generic vocabulary within more expressive content: to say what we mean without excluding consumers with only the core vocabulary.
greg
On 20 October 2010 02:35, Steve Baskauf steve.baskauf@vanderbilt.edu wrote:
Rich, Thanks for the great summary diagram and even more amazing that it was made under mushed brain conditions. Hopefully you've gotten sleep since then. Unfortunately, when I tried to look at it I had some problems with line breaks. I've tried to recreate your diagram at http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif (http://bioimages.vanderbilt.edu/pages/rich1.gif). Please correct me if I didn't get it right. My arrow-drawing utility put the arrow heads on the other end of the lines, but I think the arrows still maintain the "many to one" relationships you were trying to represent. I also replaced eventTime with eventDate since the latter is a broader term that also can include the time.
In principle, I agree with this diagram to the left of taxonNameUsage completely. (I still need clarification about a few things on the right end.) My main reason for using determination as a term rather than identification is because it is not ambiguous to refer to the person doing the identifying as the determiner, whereas referring to that person as the "identifier" creates confusion between that person and the identifying string for resources (as in "persistent identifier"). So if we agree that determination, annotation, and identification all mean the same thing (namely an instance of the dwc:Identification class), I'm happy to just use the term "identification". For the person doing it, I guess dwc:identifiedBy would be the best term although it's a bit awkward in regular speech so I may slip and still say "determiner".
Although I agree in principle that there can be many occurrences at an Event and many events at a Location, I think there are two practical reasons why it may be better to assign separate eventDate and Location metadata to each Occurrence. The first is that it makes the database structure simpler. As Markus has already noted, we really would prefer for the database to be as "flat" as possible. When I look at the terms listed in the DwC term page (http://rs.tdwg.org/dwc/terms/index.htm) under Event, the most important one that I see which everyone should be providing is eventDate. The rest I would pretty much consider optional and as a shortcut Rich's diagram could be collapsed to make them direct properties of the Occurrence.
The second reason involves the practical matter of defining a Location. I will note that my thinking about this has been deeply influenced by a previous discussion on the topic from 2008-2009 summarized at http://www.sernec.org/files/summary-of-discussion.pdf on pp. 78-84. I don't think most people will want to wade through all of that text, so I'll just sum it up here. Somebody (I think it might have been Debbie Paul at Morphbank) suggested to me that we really have an intrinsically globally unique identifier for Location. It's the combination of dwc:decimalLatitude and dwc:decimalLongitude along with dwc:coordinateUncertaintyInMeters to establish precision and dwc:geodeticDatum to establish the reference system. (If we like geo:lat and geo:long, then the reference system is implied and we are down to three terms to unambiguously define a Location and its uncertainty. For the benefit of humans, a Locality description is probably also beneficial. Also, elevation and depth might be provided, although at least in theory elevation could be calculated with a sufficiently good digital elevation model.)
I will grant that we don't have this information for a lot of old records, but based on the massive efforts to geolocate specimens, I would say it's pretty clear that this is what we would like to have if we could get it. I certainly hope that there aren't any serious collectors, observers, and live organism photographers who aren't by this point trying to record this information as they establish new Occurrence records. If you look at all of the Location terms on the dwc list, most of the other terms are either concessions to the fact that we don't have what we want (e.g. the "verbatim" terms), things we could generate using a computer program if we were clever (like stateProvince, county, etc. - I know at least Mike Giddens has succeeded in doing this), ways of indicating how we got lat and long from old records (e.g. georeferenceSources), or methods to define larger scale Locations that aren't points (e.g. footprintWKT). I think it is safe to say that in the future (if not now already), many or most Events associated with Occurrences will have an associated button click (on a GPS receiver, camera phone, or GPS enabled camera) that will automatically generate dwc:eventDate, dwc:decimalLatitude, dwc:decimalLongitude (with geodeticDatum=WGS84) and maybe coordinateUncertaintyInMeters. Thus designing a system that requires that these time/space snapshots be grouped together into artificial "Locations" is really counterproductive when those data are now generated and can be associated with Occurrences automatically. I don't know if Greg Riccardi of Morphbank is following this thread or not. If so he may want to comment on this issue based on practical experience at Morphbank. When the Morphbank system was set up, it required the creation of a separate Location record which was assigned a unique Morphbank identifier. Specimens were then linked to this Location.
What ended up happening was that each Specimen having GPS metadata ended up being assigned to its own separate Location even if it was 20 meters from another specimen. In effect, each Occurrence record ended up having its own decimalLatitude/decimalLongitude record anyway. So the system ended up being more complicated than necessary.
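As a rough illustration of the "intrinsic identifier" idea described above, the following Python sketch treats the four Darwin Core terms as a composite key and groups records by it. The record values, the `occurrenceID` strings, and the helper names (`LocationKey`, `location_key`, `group_by_location`) are hypothetical, and normalising the datum to upper case is an assumption made only for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class LocationKey(NamedTuple):
    """The four Darwin Core terms that together pin down a point Location."""
    decimal_latitude: float
    decimal_longitude: float
    coordinate_uncertainty_in_meters: float
    geodetic_datum: str


def location_key(record: dict) -> LocationKey:
    """Build a hashable key from a flat record keyed by Darwin Core term names."""
    return LocationKey(
        float(record["decimalLatitude"]),
        float(record["decimalLongitude"]),
        float(record["coordinateUncertaintyInMeters"]),
        record["geodeticDatum"].strip().upper(),  # assumption: datum names compare case-insensitively
    )


def group_by_location(records: list) -> dict:
    """Map each distinct LocationKey to the occurrenceIDs recorded there."""
    groups = defaultdict(list)
    for record in records:
        groups[location_key(record)].append(record["occurrenceID"])
    return dict(groups)


# Hypothetical records: occ-1 and occ-2 share a point, occ-3 is about 20 m away.
records = [
    {"occurrenceID": "occ-1", "decimalLatitude": "36.14470", "decimalLongitude": "-86.80270",
     "coordinateUncertaintyInMeters": "10", "geodeticDatum": "WGS84"},
    {"occurrenceID": "occ-2", "decimalLatitude": "36.1447", "decimalLongitude": "-86.8027",
     "coordinateUncertaintyInMeters": "10", "geodeticDatum": "wgs84"},
    {"occurrenceID": "occ-3", "decimalLatitude": "36.1449", "decimalLongitude": "-86.8027",
     "coordinateUncertaintyInMeters": "10", "geodeticDatum": "WGS84"},
]
groups = group_by_location(records)
```

With GPS-derived coordinates, most keys would end up with a single Occurrence each, which is the Morphbank experience described above: the separate Location record adds an identifier without adding grouping.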
As I said, I agree in principle with the left side of Rich's diagram. Taking the practical considerations I just mentioned into account, I would simplify the diagram as http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif Superficially, it looks more complicated, but I've gotten rid of several "one to many" relationships and enthroned Occurrence at its accustomed place in the center of the universe (or at least the center of the left side of the diagram). I don't have any philosophical objections to people structuring their data according to Rich's original diagram and the existing Darwin Core terms certainly make it possible to do so (well except for the Individual thing). However, I submit that many people will find it simpler (and easier to use tools like Darwin Core Archives) if they use the flatter structure that I have in the revised diagram.
I will save my questions about the right side of Rich's diagram for later. Steve
Richard Pyle wrote:
All,
I'm in Stockholm, and right now it's 10am in Hawaii, and I've effectively been awake since 7pm Hawaii time -- so my brain is a bit mush. But I'll take a chance and comment anyway.
I will leave it up to the taxonomy people to decide which things would be connected to the species concept and how all of their lines would be connected.
In my mind the "fully-normalised" (sensu Döring) relationship graph is something like this (notation is [One]--<[Many]; [One]--[One]) (Be sure to view as a fixed-width font, like Courier):
                                                      [identifiedBy]
                                                            |
[Location]--<[Event]--<[Occurrence]>--[Individual]--<[Identification]--[TaxonNameUsage]>--[nameAccordingTo]
                |                                           |                 |
           [eventTime]                               [dateIdentified]  [scientificName]
I'm following what I *think* Steve defined for [Individual], which is that it can be either a single individual organism or a defined set of organisms (e.g., up to at least a population).
So, an Occurrence is the intersection of an Individual and an Event. An Event is a Location+Time[+other metadata]. Each Event may have multiple Occurrences (i.e., one for each distinct Individual at the same Location+Time). Also, an Individual may have multiple Occurrences (one for each Event at which the same Individual was documented).
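The left side of that graph can be sketched in Python as follows. This is only one possible reading of the relationships described above, not an agreed schema; the class layout, field names, and identifier strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    location_id: str


@dataclass(frozen=True)
class Event:
    """A Location plus a time (other Event metadata omitted)."""
    event_id: str
    location: Location
    event_date: str


@dataclass(frozen=True)
class Individual:
    """A single organism or a defined set of organisms."""
    individual_id: str


@dataclass(frozen=True)
class Occurrence:
    """The intersection of one Individual and one Event."""
    individual: Individual
    event: Event


def occurrences_at(event: Event, occurrences: list) -> list:
    """All Occurrences documented at one Event (one per distinct Individual)."""
    return [o for o in occurrences if o.event == event]


def occurrences_of(individual: Individual, occurrences: list) -> list:
    """All Occurrences of one Individual (one per Event that documented it)."""
    return [o for o in occurrences if o.individual == individual]


# Hypothetical data: two Events at one Location, two Individuals.
site = Location("loc-1")
visit_1 = Event("evt-1", site, "2008-01-01")
visit_2 = Event("evt-2", site, "2009-06-15")
tree_a, tree_b = Individual("ind-a"), Individual("ind-b")
occurrences = [
    Occurrence(tree_a, visit_1),
    Occurrence(tree_b, visit_1),
    Occurrence(tree_a, visit_2),
]
```

Here `visit_1` has two Occurrences and `tree_a` has two Occurrences, matching the two one-to-many relationships that meet at Occurrence.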
An Individual may have multiple Identifications. I make no distinction between "Identification" and "Determination" (nor do I make a distinction between the first identification and subsequent identifications). I slightly prefer "Identification", because "Determination" seems to imply that there is a correct answer, whereas "Identification" (to me, anyway), implies an opinion. Steve, I didn't quite follow how you were distinguishing these two terms -- so if you have a clear reason for distinguishing them, I'd like to understand it better.
A single Identification should, in my mind, always join a single individual with a single "TaxonNameUsage" instance. I'm not 100% sure how TaxonNameUsage maps in DwC. I *think* it's an instance of a dwc:Taxon, as most of the core attributes of a TNU (acceptedNameUsage[ID], parentNameUsage[ID], originalNameUsage[ID], scientificName, taxonRank) are represented as terms in the Taxon Class. But I'm a little fuzzy on whether a "taxonID" maps directly to a TNUID, or if a TNUID more correctly maps to taxonConceptID.
The determination would have any of the properties that are terms listed in the dwc:Identification class (identifiedBy, dateIdentified, identificationReferences, identificationRemarks, identificationQualifier, and typeStatus). Some properties like dateIdentified and identificationReferences would be string literals and others (especially identifiedBy) should probably be GUIDs but could be literals if they had to be.
I agree with what Steve wrote above. However, I'm uncomfortable with Markus' suggestion of treating dwc:nameAccordingTo as a property of an Identification -- even as a shortcut. I think this is a bit dangerous. If there is no TaxonID instance (aka "TaxonNameUsage" in my diagram above) available to link the Identification to, then I would suggest using identificationReferences as the shortcut. But that would still force you to attach scientificName directly to the Identification instance, which I think is also unwise. I'd rather the Best Practice be to "manufacture" a place-holder dwc:Taxon instance (if a proper one doesn't already exist in the content source), and apply the scientificName property to that Taxon instance, rather than directly to an Identification. I know it's often short-hand to attach the scientificName directly to the Occurrence instance; but I actually feel less uneasy about that, because it is much more obviously a shortcut. But if you're going to the trouble to provide an instantiated "Identification", then you ought to anchor it to a Taxon instance (manufactured or real).
But, I guess as Greg said in his post, it may not really matter, as in the long run, we'll probably be able to make inferences about the proper Individual<-->TaxonConcept mapping, even when it's not explicitly documented.
- The original label identifies the species as Juncus diffusissimus. However, there is no indicator as to who originally identified it or when. My assumption is that it was the collector (Glen N. Montz) but I don't really know that. Do I assume that, or list the original determiner as "unknown"?
I would make no assumptions about who was the identifiedBy person. Instead, I handle these cases by either going with "Unspecified", or, in some cases (when I have confidence), something like "Bishop Museum Staff Member". Often I can deduce the identifier with some degree of confidence, but usually I don't have the time to do this. The dateIdentified can either not be provided, or set as some range (e.g., at the very worst, on or after the eventDate/eventTime, and before today).
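The "very worst" range for an undated identification could be computed along these lines. This is a minimal sketch: the function names and the example dates are hypothetical, and writing the range as a start/end pair separated by a slash (ISO 8601 interval notation) is an assumption about how one might choose to serialize it.

```python
from datetime import date
from typing import Optional, Tuple


def date_identified_bounds(event_date: date, today: Optional[date] = None) -> Tuple[date, date]:
    """Widest defensible range: no earlier than the Event, no later than today."""
    upper = today if today is not None else date.today()
    if event_date > upper:
        raise ValueError("eventDate lies after the upper bound")
    return event_date, upper


def as_interval(bounds: Tuple[date, date]) -> str:
    """Render the bounds as 'start/end' (assumed ISO 8601 interval form)."""
    start, end = bounds
    return f"{start.isoformat()}/{end.isoformat()}"


# Hypothetical specimen collected on 2008-01-01 and catalogued on 2010-10-20.
bounds = date_identified_bounds(date(2008, 1, 1), today=date(2010, 10, 20))
interval = as_interval(bounds)
```

Passing `today` explicitly keeps the result reproducible; leaving it out falls back to the current date.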
This is why I think that identification tags ("annotations" sensu Baskauf) can be "documentation sources" for TNUs.
In the web example given by Steve, we have an identification as follows:
Juncus diffusissimus Buckl. Determined by: L. Urbatsch Determination date: 2009
Completely independently of the specimen itself, we can infer from the tag that:
- Sometime between 1 Jan 2009 and 31 Dec 2009, L. Urbatsch regarded the genus "Juncus" as valid.
- Sometime between 1 Jan 2009 and 31 Dec 2009, L. Urbatsch regarded the species epithet "diffusissimus" [of Buckl.] as a valid species, placed within the genus "Juncus".
Thus, we have at least two implied TNUs from this identification, which was documented on a piece of paper that happens to be fixed to LSU-BR 39823.
The Identification instance would link the Individual (manifest as a specimen, in this case) to the TNU of "[Juncus] diffusissimus Buckl. sec L. Urbatsch 2009". The nameAccordingTo would be "L. Urbatsch 2009". This may seem redundant to have "L. Urbatsch 2009" in both the nameAccordingTo attribute of the Taxon instance, and in the identifiedBy & dateIdentified attributes of the Identification instance -- but the fact remains they are fundamentally different pieces of information. One establishes an instance of an (implied) taxon concept, and the other establishes the placement of LSU-BR 39823 within that taxon concept circumscription.
Eventually, a third party may be able to deduce (perhaps through a suite of other, external information) a RelationshipAssertion that maps the TNU "[Juncus] diffusissimus Buckl. sec L. Urbatsch 2009" to some other, perhaps published and well-defined taxon concept (of the same or different name). Also, if there are 100 specimens in the collection that L. Urbatsch identified as "Juncus diffusissimus Buckl." in 2009, then anchoring all 100 Identification instances to the one TNU, allows all of those specimens to inherit the mapping of the one "[Juncus] diffusissimus Buckl. sec L. Urbatsch 2009" TNU instance to some other better-defined taxon concept.
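The benefit of anchoring many Identifications to one TNU can be shown with a small Python sketch. The class and field names, the specimen identifiers, and the concept label are hypothetical; the only values taken from the discussion are the name, the determiner, and the year.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaxonNameUsage:
    scientific_name: str
    name_according_to: str
    # Filled in later, e.g. by a third party's RelationshipAssertion.
    mapped_concept: Optional[str] = None

    @property
    def label(self) -> str:
        return f"{self.scientific_name} sec {self.name_according_to}"


@dataclass
class Identification:
    specimen: str
    identified_by: str
    date_identified: str
    tnu: TaxonNameUsage  # a shared reference, not a copy of the name


# One TNU implied by the annotation label.
tnu = TaxonNameUsage("[Juncus] diffusissimus Buckl.", "L. Urbatsch 2009")

# 100 hypothetical specimens, each with its own Identification of the same TNU.
identifications = [
    Identification(f"specimen-{n}", "L. Urbatsch", "2009", tnu)
    for n in range(1, 101)
]

# Mapping the TNU once is visible through every Identification anchored to it.
tnu.mapped_concept = "hypothetical-concept-1"
mapped = [i.tnu.mapped_concept for i in identifications]
```

Had the scientificName been attached to each Identification directly, the same mapping would need to be repeated, or inferred, 100 times.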
I know this is a lot of stuff to keep in one's head at the same time -- but as cumbersome as it seems, I am convinced it can be packaged into a relatively straightforward and intuitive UI, and modelling it this way improves the utility of the data (maybe dramatically) in the long run.
- Do we draw a distinction between the initial identification and subsequent annotations?
I think the answer should be "no" and that's why I refer to both generically as "determinations".
I agree.
- There is really no indication given on the annotation labels as to many of the things that we would like to know, such as the concept they had in mind, any source they used (if any), or the reason why they did the annotation. So how does one connect the name that they applied to the determination when there is no indication of the concept?
As I said in an earlier post, the single most important way to reduce taxonomic ambiguity is to try to capture (or confidently deduce) the source (=mapping to taxon concept). But if it can't be done, then it can't be done -- so I'm inclined to establish a "place-holder" dwc:Taxon instance, with no nameAccordingTo, and no other metadata besides the scientificName.
Is this just something we can't do for old annotations and just something that we try to do from this point forward?
Probably.
- The last question is one that I really want some opinions about. It seems to me that there are a number of reasons why one would apply a determination.
Hmmm....I don't think this is really useful information. I don't understand how you would use this information in a machine-processing sort of way. An Identification is an Identification. In some cases, the Identifier may not even be aware of the previous identification, and so we cannot necessarily infer there was a particular "reason". And even if there is a reason, how do we use that information? What if there is more than one reason (i.e., if we are restricted to a controlled vocabulary)?
As far as I'm concerned, the Identifications should stand as they are. If needed people can annotate the Identification instances; but I don't see the value in machine-processing these things.
Also:
Finally, a single determiner might apply several determinations to one individual and indicate in each determination the concept intended (i.e. if you subscribe to Cronquist, you'd call it X; if you like Radford's book, you'd call it Y; if you like Weakley's treatment, you'd call it Z).
YIKES! I don't like the idea of loading all that information on an Identification instance. If the person wants to make this sort of assertion, then they should establish the appropriate relationshipAssertion instances among the various taxonConcepts cited.
Damn. Now my head is really tired. And so is the rest of me....
Aloha, and g'night..
Rich
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
Dear Steve and Rich,
Encouraged by your discussion of models of Occurrences and Individuals, and by Steve's related Biodiv. Informatics paper, I have modeled a real example of an individual plant and some of its various occurrences in RDF, using Steve's sernec terms to provide the predicates that are missing from DwC. As I did so, a number of questions came up relating to choices of terms, and I would greatly appreciate your input on these choices. The following includes all the choices considered, and so may not be semantically correct. The questions (Q1-9) are interspersed with the RDF (serialized as Turtle).
@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix dwcvoc: <http://rs.tdwg.org/ontology/voc/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix sernec: <http://bioimages.vanderbilt.edu/rdf/terms#> .

<http://phylodiversity.net/xmalesia/indiv/9>
    a sernec:Individual ;  # Q1 - Is this an Individual ...
    a dwc:individualID ;   # Q1 - ... or an individualID (Baskauf 2010 app'x)?

    # Specimen
    sernec:derivativeOccurrence [
        # Q2 : Use generic Occurrence from dwc:Occurrence ...
        a dwc:Occurrence ;
        dwc:basisOfRecord "PreservedSpecimen" ;
        dwc:recordNumber "Webb 5008" ;
        dwc:recordedBy "Cam Webb" ;
        # Q2 : ... or treat directly as a Specimen?
        a dwcvoc:Specimen ;
        dwcvoc:collectorsFieldNumber "5008" ;
        dwcvoc:collector "Cam Webb" ;
        # Q3 : Add the dwc:eventDate here as suggested by Baskauf?
        dwc:eventDate "2008-01-01" ;
        # Q4 : Treat occurrence as generic resource, using dc metadata?
        dcterms:creator "Cam Webb" ;
        dcterms:created "2008-01-01" ;
        # Q5 : Add dwc location data for Occurrence...
        dwc:coordinateUncertaintyInMeters "100" ;
        dwc:decimalLatitude "-1.25530" ;
        dwc:decimalLongitude "109.95371" ;
        dwc:geodeticDatum "WGS84" ;
        dwc:locality "Sukadana" ;
        # Q5 : ... or a Location.
        dcterms:spatial _:blank1 ;
    ] ;

    # Photo:
    sernec:derivativeOccurrence [
        # Q6 : a dwc:Occurrence...
        a dwc:Occurrence ;
        # Q6 : ... or a dwcvoc:TaxonOccurrence. Which is better?
        a dwcvoc:TaxonOccurrence ;
        # Q7 : Again, use dwc terms...
        dwc:occurrenceID <http://phylodiversity.net/xmalimg/cw_28617.400px.jpg> ;
        dwc:basisOfRecord "DigitalStillImage" ;
        dwc:recordedBy "Cam Webb" ;
        dwc:eventDate "2008-01-01" ;
        # Q7 : or dc terms?
        dcterms:identifier <http://phylodiversity.net/xmalimg/cw_28617.400px.jpg> ;
        dcterms:creator "Cam Webb" ;
        dcterms:created "2008-01-01" ;
        dcterms:type <http://purl.org/dc/dcmitype/StillImage> ;
        # Q8 : Spatial data, same issue as above
        dcterms:spatial _:blank1 ;
    ] .

# Determination
[] a dwc:Identification ;
    sernec:identifiesIndividual <http://phylodiversity.net/xmalesia/indiv/9> ;
    dwc:identifiedBy "Ferry Slik" ;
    dwc:taxonConceptID <urn:lsid:ubio.org:namebank:5963772> ;
    dwc:dateIdentified "2009-02-22" ;
    # Q9 : Use dwc:identificationReferences or...
    dwc:identificationReferences <http://phylodiversity.net/xmalimg/cw_28617.400px.jpg> ;
    # Q9 : ... sernec:basedOnOccurrence ?
    sernec:basedOnOccurrence <http://phylodiversity.net/xmalimg/cw_28617.400px.jpg> .

# Location data for photo and specimen
_:blank1 a dcterms:Location ;
    geo:long "109.95371" ;
    geo:lat "-1.25530" ;
    dwc:locality "Sukadana, on Tanah Merah road to beach" ;
    dwc:coordinateUncertaintyInMeters "100" .
I realize that for LOD applications the blank nodes should eventually have GUIDs. Now, here is a slimmed down version of the above with my own choices. In general, I went with dcterms over dwc, where appropriate. You can also see the network (via dot) at: http://phylodiversity.net/cwebb/img/indiv9-slim.jpg or http://linkeddata.uriburner.com/ode/?uri=http://phylodiversity.net/cwebb/tmp...
@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix dwcvoc: <http://rs.tdwg.org/ontology/voc/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix sernec: <http://bioimages.vanderbilt.edu/rdf/terms#> .

<http://phylodiversity.net/xmalesia/indiv/9>
    a sernec:Individual ;
    sernec:derivativeOccurrence [
        # Specimen
        a dwc:Occurrence ;
        dwc:basisOfRecord "PreservedSpecimen" ;
        dcterms:identifier "Webb 5008" ;
        dcterms:creator "Cam Webb" ;
        dcterms:created "2008-01-01" ;
        dcterms:spatial _:blank1 ;
    ] ;
    sernec:derivativeOccurrence [
        # Photo
        a dwc:Occurrence ;
        dwc:basisOfRecord "DigitalStillImage" ;
        dcterms:identifier <http://phylodiversity.net/xmalimg/cw_28617.400px.jpg> ;
        dcterms:creator "Cam Webb" ;
        dcterms:created "2008-01-01" ;
        dcterms:spatial _:blank1 ;
    ] .

[] a dwc:Identification ;
    sernec:identifiesIndividual <http://phylodiversity.net/xmalesia/indiv/9> ;
    dwc:identifiedBy "Ferry Slik" ;
    dwc:taxonConceptID <urn:lsid:ubio.org:namebank:5963772> ;
    dwc:dateIdentified "2009-02-22" ;
    sernec:basedOnOccurrence <http://phylodiversity.net/xmalimg/cw_28617.400px.jpg> .

_:blank1 a dcterms:Location ;
    geo:long "109.95371" ;
    geo:lat "-1.25530" ;
    dwc:locality "Sukadana, on Tanah Merah road to beach" ;
    dwc:coordinateUncertaintyInMeters "100" .
I didn't think this could be done without creating new terms, so I'm very pleased to be getting closer to my goal of a LOD representation of our data that maintains the Individuals as base entities.
Many thanks in advance for any thoughts.
Best,
Cam
Because of actual work that I've had to get done, I haven't had time yet to carefully read Rich's response and to carefully go through Cam and Pete's posts to digest them. But I also am encouraged by this discussion because it seems like most people are agreeing on the basic conceptual arrangement of entities in Rich's diagram. In some cases people choose to "collapse" the more general model when some of the components have only one-to-one connections (e.g. leave out individuals because all individuals in a database have only one occurrence, leave out event because every occurrence in a database has a separate event defined by atomized lat/long/time) but there seems to be a general agreement that those omitted components exist conceptually and that it is convenient for other users to include them when they are needed as nodes for one-to-many relationships. This makes the creation of an eventual general template for RDF simpler because it means there will be less arguing about how entities in the RDF should be "connected" to each other (i.e. what are the appropriate classes of subjects and objects).
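The "collapse" described above, where Event and Location disappear as separate nodes and their properties become direct properties of each Occurrence, might look like this in Python. It is a sketch under assumptions: the `eventID`/`locationID` linking keys follow Darwin Core term names, but the identifier values and the function name are hypothetical, and the coordinate values are borrowed from Cam's example purely for illustration.

```python
def flatten_occurrence(occurrence: dict, events: dict, locations: dict) -> dict:
    """Copy Event and Location properties onto the Occurrence, dropping the link keys."""
    event = events[occurrence["eventID"]]
    location = locations[event["locationID"]]
    flat = {k: v for k, v in occurrence.items() if k != "eventID"}
    flat["eventDate"] = event["eventDate"]
    flat.update({k: v for k, v in location.items() if k != "locationID"})
    return flat


# Normalized form: one Location, one Event, two Occurrences (hypothetical IDs).
locations = {
    "loc-1": {
        "locationID": "loc-1",
        "decimalLatitude": "-1.25530",
        "decimalLongitude": "109.95371",
        "geodeticDatum": "WGS84",
        "coordinateUncertaintyInMeters": "100",
    }
}
events = {"evt-1": {"eventID": "evt-1", "locationID": "loc-1", "eventDate": "2008-01-01"}}
occurrences = [
    {"occurrenceID": "occ-specimen", "basisOfRecord": "PreservedSpecimen", "eventID": "evt-1"},
    {"occurrenceID": "occ-photo", "basisOfRecord": "DigitalStillImage", "eventID": "evt-1"},
]

flat_records = [flatten_occurrence(o, events, locations) for o in occurrences]
```

The flat records repeat the date and coordinates on every row, which is the price of the simpler structure; the omitted Event and Location still exist conceptually and can be rebuilt by grouping on those repeated values.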
As I said, I haven't yet looked carefully at Cam's example, but she made a comment about blank nodes. One of the things that's troubled me is how to have a consistent RDF template that can be used for both records generated by people who are "compressing" their databases as I described above and people who aren't "compressing". For example, if people have a database that only contains one specimen per individual, they are probably going to be generating GUIDs (i.e. URIs) for the specimens but not for the individuals that exist but weren't explicitly recognized in the database they are using to generate the RDF. According to the "Rich diagram1" general model, the dwc:Identification should be connected to the Individual and the Individual to the Occurrence, but since the specimen databaser didn't explicitly assign a URI to the Individual, the RDF would have a blank node for the Individual. The solution I settled on (illustrated in the http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf example with the non-actionable URIs, use view page source to see the underlying RDF) was to create a default URI for the assumed Individual by slapping a "#ind" onto the end of the URI for the specimen. I should probably do the same thing in RDF like http://bioimages.vanderbilt.edu/baskauf/51249.rdf where I ignore dwc:Location as an entity (i.e. I "collapse" Rich's model because all of my Occurrences have separate Lat/Long), i.e. I should probably enclose the RDF for the location metadata in a rdf:Description about http://bioimages.vanderbilt.edu/baskauf/51249#location element. That would make my RDF format consistent with that of others who connected multiple Occurrences to a single Location and would also make it possible for someone to "reuse" my Location identifier if they later wanted to assert that an event happened at the same location. (I got this idea by looking at Pete's RDF!)
This question of when one needs to apply a GUID to a resource came up in the draft Beginners Guide to Persistent Identifiers. In cases like I discussed above where there is only a single resource connected to another resource that has an explicitly assigned GUID, having a default method for creating "assumed" URIs would reduce the need to generate and maintain a lot of separate identifiers for entities that the creator of the GUID isn't really interested in.
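A default rule of that kind is easy to state in code. The following Python sketch appends a fixed fragment to a URI that already exists; the function name and the specimen URI are hypothetical, while the "#ind" and "#location" suffixes and the Bioimages URI come from the examples discussed above.

```python
from urllib.parse import urldefrag


def assumed_uri(known_uri: str, suffix: str) -> str:
    """Derive a default URI for an implied entity by appending a fragment."""
    # Drop any fragment already present so the suffix is applied only once.
    base, _existing_fragment = urldefrag(known_uri)
    return f"{base}#{suffix}"


# Hypothetical specimen URI; the Individual it implies gets "#ind".
specimen_uri = "http://example.org/specimen/0428"
individual_uri = assumed_uri(specimen_uri, "ind")

# The implied Location for an Occurrence that carries its own lat/long.
location_uri = assumed_uri("http://bioimages.vanderbilt.edu/baskauf/51249", "location")
```

Because the derived URI is a pure function of the assigned one, nothing extra has to be stored or maintained, and a later user can still point at the implied Individual or Location.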
Steve
Cam Webb wrote:
Dear Steve and Rich,
Encouraged by your discussion of models of Occurrences and Individuals, and by Steve's related Biodiv. Informatics paper, I have modeled a real example of an individual plant and some of its various occurrences in RDF, using Steve's sernec terms to provide the predicates that are missing from DwC. As I did so, a number of questions came up relating to choices of terms, and I would greatly appreciate your input on these choices. The following includes all the choices considered, and so may not be semantically correct. The questions (Q1-9) are interspersed with the RDF (serialized as Turtle).
@prefix dwc: http://rs.tdwg.org/dwc/terms/ . @prefix dwcvoc: http://rs.tdwg.org/ontology/voc/ . @prefix dcterms: http://purl.org/dc/terms/ . @prefix geo: http://www.w3.org/2003/01/geo/wgs84_pos# . @prefix sernec: http://bioimages.vanderbilt.edu/rdf/terms# .
<http://phylodiversity.net/xmalesia/indiv/9>
    a sernec:Individual ;    # Q1 - Is this an Individual ...
    a dwc:individualID ;     # Q1 - ... or an individualID (Baskauf 2010 app'x)?

    # Specimen
    sernec:derivativeOccurrence [
        # Q2 : Use generic Occurrence from dwc:Occurrence ...
        a dwc:Occurrence ;
        dwc:basisOfRecord "PreservedSpecimen" ;
        dwc:recordNumber "Webb 5008" ;
        dwc:recordedBy "Cam Webb" ;
        # Q2 : ... or treat directly as a Specimen?
        a dwcvoc:Specimen ;
        dwcvoc:collectorsFieldNumber "5008" ;
        dwcvoc:collector "Cam Webb" ;
        # Q3 : Add the dwc:eventDate here as suggested by Baskauf?
        dwc:eventDate "2008-01-01" ;
        # Q4 : Treat occurrence as generic resource, using dc metadata?
        dcterms:creator "Cam Webb" ;
        dcterms:created "2008-01-01" ;
        # Q5 : Add dwc location data for Occurrence...
        dwc:coordinateUncertaintyInMeters "100" ;
        dwc:decimalLatitude "-1.25530" ;
        dwc:decimalLongitude "109.95371" ;
        dwc:geodeticDatum "WGS84" ;
        dwc:locality "Sukadana" ;
        # Q5 : ... or a Location.
        dcterms:spatial _:blank1 ;
    ] ;

    # Photo:
    sernec:derivativeOccurrence [
        # Q6 : a dwc:Occurrence...
        a dwc:Occurrence ;
        # Q6 : ... or a dwcvoc:TaxonOccurrence. Which is better?
        a dwcvoc:TaxonOccurrence ;
        # Q7 : Again, use dwc terms...
        dwc:occurrenceID <http://phylodiversity.net/xmalimg/cw_28617.400px.jpg> ;
        dwc:basisOfRecord "DigitalStillImage" ;
        dwc:recordedBy "Cam Webb" ;
        dwc:eventDate "2008-01-01" ;
        # Q7 : or dc terms?
        dcterms:identifier <http://phylodiversity.net/xmalimg/cw_28617.400px.jpg> ;
        dcterms:creator "Cam Webb" ;
        dcterms:created "2008-01-01" ;
        dcterms:type <http://purl.org/dc/dcmitype/StillImage> ;
        # Q8 : Spatial data, same issue as above
        dcterms:spatial _:blank1 ;
    ] .

# Determination
[] a dwc:Identification ;
    sernec:identifiesIndividual <http://phylodiversity.net/xmalesia/indiv/9> ;
    dwc:identifiedBy "Ferry Slik" ;
    dwc:taxonConceptID <urn:lsid:ubio.org:namebank:5963772> ;
    dwc:dateIdentified "2009-02-22" ;
    # Q9 : Use dwc:identificationReferences or...
    dwc:identificationReferences <http://phylodiversity.net/xmalimg/cw_28617.400px.jpg> ;
    # Q9 : ... sernec:basedOnOccurrence ?
    sernec:basedOnOccurrence <http://phylodiversity.net/xmalimg/cw_28617.400px.jpg> .

# Location data for photo and specimen
_:blank1 a dcterms:Location ;
    geo:lon "109.95371" ;
    geo:lat "-1.25530" ;
    dwc:locality "Sukadana, on Tanah Merah road to beach" ;
    dwc:coordinateUncertaintyInMeters "100" .
I realize that for LOD applications the blank nodes should eventually have GUIDs. Now, here is a slimmed down version of the above with my own choices. In general, I went with dcterms over dwc, where appropriate. You can also see the network (via dot) at: http://phylodiversity.net/cwebb/img/indiv9-slim.jpg or http://linkeddata.uriburner.com/ode/?uri=http://phylodiversity.net/cwebb/tmp...
@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix dwcvoc: <http://rs.tdwg.org/ontology/voc/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix sernec: <http://bioimages.vanderbilt.edu/rdf/terms#> .

<http://phylodiversity.net/xmalesia/indiv/9>
    a sernec:Individual ;
    sernec:derivativeOccurrence [
        # Specimen
        a dwc:Occurrence ;
        dwc:basisOfRecord "PreservedSpecimen" ;
        dcterms:identifier "Webb 5008" ;
        dcterms:creator "Cam Webb" ;
        dcterms:created "2008-01-01" ;
        dcterms:spatial _:blank1 ;
    ] ;
    sernec:derivativeOccurrence [
        # Photo
        a dwc:Occurrence ;
        dwc:basisOfRecord "DigitalStillImage" ;
        dcterms:identifier <http://phylodiversity.net/xmalimg/cw_28617.400px.jpg> ;
        dcterms:creator "Cam Webb" ;
        dcterms:created "2008-01-01" ;
        dcterms:spatial _:blank1 ;
    ] .

[] a dwc:Identification ;
    sernec:identifiesIndividual <http://phylodiversity.net/xmalesia/indiv/9> ;
    dwc:identifiedBy "Ferry Slik" ;
    dwc:taxonConceptID <urn:lsid:ubio.org:namebank:5963772> ;
    dwc:dateIdentified "2009-02-22" ;
    sernec:basedOnOccurrence <http://phylodiversity.net/xmalimg/cw_28617.400px.jpg> .

_:blank1 a dcterms:Location ;
    geo:lon "109.95371" ;
    geo:lat "-1.25530" ;
    dwc:locality "Sukadana, on Tanah Merah road to beach" ;
    dwc:coordinateUncertaintyInMeters "100" .
I didn't think this could be done without creating new terms, so I'm very pleased to be getting closer to my goal of a LOD representation of our data that maintains the Individuals as base entities.
Many thanks in advance for any thoughts.
Best,
Cam
.
Cam, quickly, and confessing I haven't read it all: dwc and the older TDWG ontology were not meant to be mixed. There is obviously a huge overlap.

@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix dwcvoc: <http://rs.tdwg.org/ontology/voc/> .
Cam, I have finally taken the time to look carefully at your RDF example (I'm not used to the Turtle serialization but managed to translate it into XML, which is the way I "think" about RDF). I'm not going to try to comment on every question that you asked because, since this discussion began, I've changed my thinking somewhat about how I would model certain things. But you raise a number of important questions and I'll give opinions on a few. Whether those opinions are shared by others remains to be seen and should be part of the discussion if an RDF task group gets off the ground.
Q6. The question of how DwC resources should be rdf:type'd remains open. When I first tried to write RDF using DwC terms, I tried to type things using dwcvoc: . However, there were too many types of resources that didn't have terms there and when I looked at the ontology, I wasn't sure that some of the terms that were there actually meant what I thought they should. So I gave up and just decided to use the Darwin Core classes since they also qualified as "well known". The DwC type vocabulary is another possibility for typing since it includes both some of the DwC classes as well as other types, such as PreservedSpecimen, which we would need if the model of separating the "token" from the Occurrences were followed. However, the Identification class isn't included in the DwC type vocabulary (is that intentional or an oversight?). Also, tokens that are StillImages, Sounds, etc. would have to be typed using the Dublin Core type vocabulary. So at this point it seems like the rdf:type values would have to be drawn from at least three different sources to get the job done.
Q1. I think sernec:Individual would be right rather than dwc:individualID (as described above). I originally used dwc:individualID, but I now think that is not right and that the xxxxxID terms should be used to show the relationship among described resources.
Q2-4. If the "token" is separated from the Occurrence, then dwc:recordedBy is a property of the Occurrence and dcterms:created and dcterms:creator are properties of the token (if it's a create-able thing).
Q3 and Q5. I think that for the sake of a consistent RDF structure that a client could actually know how to "crawl" and "understand", it would be best to have nodes (having URI identifiers) for all of the resources that end up being in a consensus fully-normalized model like http://bioimages.vanderbilt.edu/pages/token-explicit.gif . As you know, I suggested the strategy of creating hash URIs to name nodes that the user doesn't care to maintain as separate database items. This has worked well for me in my experimentation.
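As a rough illustration of the hash-URI strategy, the blank nodes in Cam's slimmed-down example could be named relative to the Individual's URI. This is only a sketch: the fragment names (#specimen, #photo, #location) are made up here and are not taken from either author's data, and only a few of the original properties are repeated.

```turtle
@prefix dwc:     <http://rs.tdwg.org/dwc/terms/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix geo:     <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix sernec:  <http://bioimages.vanderbilt.edu/rdf/terms#> .

# Hypothetical fragment identifiers replace the anonymous [ ... ] nodes and _:blank1.
<http://phylodiversity.net/xmalesia/indiv/9>
    a sernec:Individual ;
    sernec:derivativeOccurrence
        <http://phylodiversity.net/xmalesia/indiv/9#specimen> ,
        <http://phylodiversity.net/xmalesia/indiv/9#photo> .

<http://phylodiversity.net/xmalesia/indiv/9#specimen>
    a dwc:Occurrence ;
    dwc:basisOfRecord "PreservedSpecimen" ;
    dcterms:spatial <http://phylodiversity.net/xmalesia/indiv/9#location> .

<http://phylodiversity.net/xmalesia/indiv/9#photo>
    a dwc:Occurrence ;
    dwc:basisOfRecord "DigitalStillImage" ;
    dcterms:spatial <http://phylodiversity.net/xmalesia/indiv/9#location> .

# The shared Location is now addressable from outside this file.
<http://phylodiversity.net/xmalesia/indiv/9#location>
    a dcterms:Location ;
    geo:lat "-1.25530" ;
    geo:lon "109.95371" .
```

Because every node now has a URI, other documents can refer to the specimen, photo, or location directly, which a blank node does not allow.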
Q7. I think we had a discussion in an earlier thread as to whether in RDF the xxxxxID terms should be used to identify the subject resource or just be reserved for indicating a reference to another related resource. It was suggested (and I agree) that since a tag like <dwc:Occurrence rdf:about="http://phylodiversity.net/xmalesia/occur/9-1"> already indicates that http://phylodiversity.net/xmalesia/occur/9-1 is a URI that identifies the occurrence, it's a bit redundant to also assert an identifier as an explicit property of the occurrence. But I suppose it doesn't hurt anything. Pete uses dcterms:identifier to do this as you did in your image example.
Q9. If dwc:identificationReferences is appropriate here, then sernec:basedOnOccurrence does not need to exist. Actually sernec:basedOnOccurrence probably shouldn't be used anyway if we separate tokens from their Occurrences (the appropriate term would then be basedOnToken or something like that).
Hmm. I guess I ended up commenting on most of the questions anyway. Two more general comments.

1. After considerable thought, I've decided that I don't want to use direct access URLs for images as their identifying URIs. There is nothing "wrong" with doing so, but once you use it as a GUID, you're stuck with keeping the image at that location forever. Also, that URI then refers to the specific pattern of bytes in the particular version of the image that you are serving from that URL, which also may not ever change (i.e. no editing). A lot of the image metadata applies to any sized version of the image, not just the one that you've identified using the URL. Then there are content negotiation issues with using a .jpg extension for a URI which I could discuss but won't get into here. For all of these reasons, I've decided for myself that I'd prefer to consider the image as a conceptual thing (non-information resource) and assign it an identifier with no extension which could then be subject to content negotiation. I then use MRTG service access class instances to provide the mrtg:accessURL's for whatever sizes of images I want to provide. Because the accessURLs are metadata and not themselves identifiers, I can change the access URLs without breaking any GUID rules. This gives you the option to move your high-res images to an image repository rather than serving them from the domain from which your RDF is being served. You can see an example of this approach at: http://bioimages.vanderbilt.edu/baskauf/10685.rdf

2. If one assigns URIs to each resource included in the RDF file (i.e. the Individual, the image, the Occurrence, the Event, etc.), the degree of nesting can be reduced and blank nodes eliminated. Of course you then need a way to connect the various resources. I have been using the xxxxxID terms for this, i.e. to say that the individual has a certain Occurrence I say

[individual] dwc:occurrenceID [occurrence]

I think this is within the spirit of what the xxxxxID terms were intended to do and if we can use them in this way, it greatly reduces the number of new terms that would have to be created to express DwC in RDF (i.e. we don't need to make up dwc:hasOccurrence). The downside to this is that few (none?) of the relationships that could be expressed by xxxxxID terms have inverse properties defined. I made up a few in the sernec: vocabulary, but the need for such terms would have to be discussed at some point in a future RDF task group. I don't know enough about how semantic clients work to know if just providing the properties in one direction would be good enough for the client to infer the inverse relationship and make use of it as necessary.
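A minimal Turtle sketch of this linking pattern, assuming the Occurrence has the URI used as an example under Q7. The ex: namespace and the ex:occurrenceOf property are placeholders invented for this illustration; they stand in for whichever inverse term a future task group might define and are not actual sernec: or dwc: terms.

```turtle
@prefix dwc:    <http://rs.tdwg.org/dwc/terms/> .
@prefix sernec: <http://bioimages.vanderbilt.edu/rdf/terms#> .
@prefix ex:     <http://example.org/terms#> .   # placeholder namespace (assumption)

# Forward link: the Individual points at its Occurrence via an xxxxxID term.
<http://phylodiversity.net/xmalesia/indiv/9>
    a sernec:Individual ;
    dwc:occurrenceID <http://phylodiversity.net/xmalesia/occur/9-1> .

# Reverse link: no inverse is defined in DwC, so a hypothetical property is used.
<http://phylodiversity.net/xmalesia/occur/9-1>
    a dwc:Occurrence ;
    ex:occurrenceOf <http://phylodiversity.net/xmalesia/indiv/9> .
```

With URIs on both resources, neither description needs to be nested inside the other, and each can live in its own file.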
Hope these comments are helpful. I am a novice RDF user, so take what I've said with a grain of salt. Steve
Hi Steve,
Thanks for taking the time to work through my example.
(I'm not used to the Turtle serialization but managed to translate it into XML which is the way I "think" about RDF).
[ Everyone has their favorite tools, but in case anyone is looking for RDF tools, I really recommend:
- Emacs, with n3-mode by Hugo Haas for writing Turtle, and nXML for working with RDF/XML
- Redland's (http://librdf.org/) rapper for Turtle-to-RDF/XML conversion, and roqet for SPARQL queries
- Rapper also converts Turtle to dot files, and dot from the Graphviz library can make JPEGs of the network ]
Q2-4. If the "token" is separated from the Occurrence, then dwc:recordedBy is a property of the Occurrence and dcterms:created and dcterms:creator are properties of the token (if it's a create-able thing).
Is there any benefit to specifying both OccurrenceX dwc:recordedBy PersonY and TokenX dcterms:creator PersonY? If not, which is the more `natural' home for the information? I think it would be TokenX dcterms:creator PersonY.
After considerable thought, I've decided that I don't want to use direct access URLs for images as their identifying URIs.
This is an important point - but does of course make things a bit more difficult from a programming point of view.
... if we can use them in this way, it greatly reduces the number of new terms that would have to be created to express DwC in RDF (i.e. we don't need to make up dwc:hasOccurrence). The downside to this is that few (none?) of the relationships that could be expressed by xxxxxxID terms have inverse properties defined. I made up a few in the sernec: vocabulary, but the need for such terms would have to be discussed at some point in a future RDF task group.
I'd actually be in favor of coining new terms like dwc:hasOccurrence, because a term xxxxxxID seems to imply that the object of the triple is not the URI of the resource itself, but some additional identifier.
I don't know enough about how semantic clients work to know if just providing the properties in one direction would be good enough for the client to infer the inverse relationship and make use of it as necessary.
I think the inverse relationships would be put into the OWL ontology that officially specified all the TDWG RDF terms, along with domain and ranges, etc.
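A sketch of what such a declaration might look like in Turtle, using standard OWL and RDFS terms. The property names ex:hasOccurrence and ex:occurrenceOf and the ex: namespace are hypothetical; no such terms are defined by TDWG in this thread.

```turtle
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dwc:    <http://rs.tdwg.org/dwc/terms/> .
@prefix sernec: <http://bioimages.vanderbilt.edu/rdf/terms#> .
@prefix ex:     <http://example.org/terms#> .   # placeholder namespace (assumption)

# Hypothetical property linking an Individual to one of its Occurrences.
ex:hasOccurrence
    a owl:ObjectProperty ;
    rdfs:domain sernec:Individual ;
    rdfs:range  dwc:Occurrence ;
    owl:inverseOf ex:occurrenceOf .

# Hypothetical inverse, pointing from the Occurrence back to the Individual.
ex:occurrenceOf
    a owl:ObjectProperty ;
    rdfs:domain dwc:Occurrence ;
    rdfs:range  sernec:Individual .
```

Given a declaration like this, a data provider would only need to state the relationship in one direction; an OWL-aware client that has loaded the ontology could derive the other.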
Best,
Cam
I think both dwc:recordedBy for the Occurrence and dcterms:created for some tokens should be provided. Depending on the situation, they might be different entities (I think John Wieczorek pointed this out in an earlier thread). dwc:recordedBy is specifically supposed to be a person whereas I think dcterms:creator could be a person or an institution. For example, if the token is a PreservedSpecimen, one might want to consider the specimen to have been created by the museum or herbarium rather than the collector. Also, there could be Occurrences without tokens (like human observations, unless you go with Rich's "memory" as a token), and images that are serving as tokens might be part of a database that includes images that are NOT serving as tokens for Occurrences (and in that case consumers who aren't interested in biodiversity information would like to know the dcterms:creator without having to look at the Occurrence record).
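To make the distinction concrete, here is a small sketch in which the Occurrence and its token are separate resources. The token URI, the ex:hasToken property, and the institution name are all placeholders invented for illustration; only the collector name, date, and Occurrence URI come from examples earlier in the thread.

```turtle
@prefix dwc:     <http://rs.tdwg.org/dwc/terms/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/terms#> .   # placeholder namespace (assumption)

# The Occurrence records who documented the organism in the field.
<http://phylodiversity.net/xmalesia/occur/9-1>
    a dwc:Occurrence ;
    dwc:recordedBy "Cam Webb" ;
    ex:hasToken <http://example.org/specimen/webb-5008> .   # hypothetical link and URI

# The token (a preserved specimen) carries its own creation metadata,
# which may name an institution rather than the collector.
<http://example.org/specimen/webb-5008>
    dcterms:creator "Example Herbarium" ;   # placeholder institution name
    dcterms:created "2008-01-01" .
```

A consumer interested only in the specimen or image can then read dcterms:creator without touching the Occurrence record.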
Steve
On 29/10/2010, at 12:41 AM, Steve Baskauf wrote:
I think both dwc:recordedBy for the Occurrence and dcterms:created for some tokens should be provided. Depending on the situation, they might be different entities (I think John Wieczorek pointed this out in an earlier thread). dwc:recordedBy is specifically supposed to be a person whereas I think dcterms:creator could be a person or an institution.
Perhaps it might be worthwhile leveraging the FOAF vocabulary (Friend of a Friend). It's mainly meant for social networking, but nevertheless it does contain terms such as Person, Organisation, Group and Project. (Project is interesting - collection activities perhaps are FOAF Projects).
The spec is here: http://xmlns.com/foaf/spec/
We can envisage the day where, by following links on taxonomic web pages, you could eventually find an Author's current twitter address, or ask the semantic web "find me all specimens of genus Tandanus collected by teams affiliated with the university of NSW between 2005 and 2007".
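As a rough sketch of how FOAF typing might look for the agents in this thread's examples: the agent URIs and the group name below are invented placeholders, while foaf:Person, foaf:Group, foaf:name, and foaf:member are terms from the FOAF specification linked above. Note that FOAF spells its class foaf:Organization.

```turtle
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# Hypothetical URI for the collector, typed as a FOAF person.
<http://example.org/agent/cam-webb>
    a foaf:Person ;
    foaf:name "Cam Webb" .

# Hypothetical collecting team, modelled as a FOAF group with one member.
<http://example.org/agent/field-team>
    a foaf:Group ;
    foaf:name "Example field team" ;
    foaf:member <http://example.org/agent/cam-webb> .

# The image from Cam's example, with the creator given as a URI instead of a literal.
<http://phylodiversity.net/xmalimg/cw_28617.400px.jpg>
    dcterms:creator <http://example.org/agent/cam-webb> .
```

Replacing the literal "Cam Webb" with an agent URI is what would make a query such as the one about specimens collected by affiliated teams answerable by following links.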
------ If you have received this transmission in error please notify us immediately by return e-mail and delete all copies. If this e-mail or any attachments have been sent to you in error, that error does not constitute waiver of any confidentiality, privilege or copyright in respect of information in the e-mail or attachments.
Please consider the environment before printing this email.
------
I was thoroughly delighted to learn recently that FOAF uses terms in almost exactly the same way that I had structured my "Agents" data (right down to the same exact terms, in most cases). I plan to move forward with the FOAF terms that are relevant (thanks to John W. for pointing this out to me at TDWG).
Rich
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Paul Murray Sent: Tuesday, November 02, 2010 4:18 PM To: Steve Baskauf Cc: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] Comments on Cam's RDF practical details of recording a determination What is an Occurrence? [SEC=UNCLASSIFIED]
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
If machine reasoning is a goal, I would be wary of FOAF. An OWL2-DL, or other OWL2 tractable reasoning profile, version remains a moving target, to the best of my knowledge. The reasons that http://xmlns.com/foaf/spec/ is not subject to tractable reasoning are relatively manageable, but I can no longer find the Zimmerman proposal for a FOAF DL version referenced in the thread ending at http://lists.w3.org/Archives/Public/public-lod/2010Jul/0378.html
Can someone point me at a DL version of FOAF and indication that it is actively under discussion somewhere?
Thanks
On Tue, Nov 2, 2010 at 10:38 PM, Richard Pyle deepreef@bishopmuseum.org wrote:
I was thoroughly delighted to learn recently that FOAF uses terms in almost exactly the same way that I had structured my "Agents" data (right down to the same exact terms, in most cases). I plan to move forward with the FOAF terms that are relevant (thanks to John W. for pointing this out to me at TDWG).
Rich
This is mostly over my head, but I do have a more general question along these lines:
To what extent are we likely to be implementing substantive machine reasoning for Agents within the context of biodiversity informatics? I can see some value in terms of de-duplication of literature citations, and maybe a few other things here and there such as copyright ownership. But I take the absence of an Agent class within DwC as an indication that our community does not have as much a need for semantic reasoning for Agents (compared to, say, taxa and localities, among others).
If I'm missing something here, I'd very much like to be informed.
Aloha, Rich
_____
From: Bob Morris [mailto:morris.bob@gmail.com] Sent: Tuesday, November 02, 2010 6:56 PM To: Richard Pyle Cc: Paul Murray; Steve Baskauf; tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] Comments on Cam's RDF practical details of recording a determination What is an Occurrence? [SEC=UNCLASSIFIED]
If machine reasoning is a goal, I would be wary of FOAF. An OWL2-DL, or other OWL2 tractable reasoning profile, version remains a moving target, to the best of my knowledge. The reasons that http://xmlns.com/foaf/spec/ is not subject to tractable reasoning are relatively manageable, but I can no longer find the Zimmerman proposal for a FOAF DL version referenced in the thread ending at http://lists.w3.org/Archives/Public/public-lod/2010Jul/0378.html
Can someone point me at a DL version of FOAF and indication that it is actively under discussion somewhere?
Thanks
On Wed, Nov 3, 2010 at 1:15 AM, Richard Pyle deepreef@bishopmuseum.org wrote:
This is mostly over my head, but I do have a more general question along these lines:
To what extent are we likely to be implementing substantive machine reasoning for Agents within the context of biodiversity informatics? I can see some value in terms of de-duplication of literature citations, and maybe a few other things here and there such as copyright ownership. But I take the absence of an Agent class within DwC as an indication that our community does not have as much a need for semantic reasoning for Agents (compared to, say, taxa and localities, among others).
If I'm missing something here, I'd very much like to be informed.
Aloha, Rich
My words "be wary of" were chosen intentionally. I do not mean "do not use in any circumstances". The following are roughly true:
It's not that hard to come up with arguably important use cases for reasoning on agents. For example, deciding whether two observation or specimen data records represent distinct Occurrences or the same one can hinge (given enough agreement on other values) on deciding whether the observers are the same or different people. Herbarium duplicate sheets often suffer from inconsistent misspellings of collector names due to data entry errors.
Including an intractable ontology in an otherwise tractable one can poison the latter.
[1] - [3] show that there are plenty of ways out of the current, not very deep, reasoning weaknesses that FOAF shares with many "OWL Full" ontologies. The risk is mainly in getting on board the wrong train of the many. For example, one way favors huge data but requires small class hierarchies, and another the reverse. It's not hard to imagine biodiversity data models that demand both large class hierarchies and large data. Homegrown hybrids might be possible, but then might require homegrown tools, etc. etc.
[3] is particularly interesting and probably readable with just a little exposure to formal ontologies, especially if you pretend that the acronyms and other stuff you don't understand don't matter very much to getting the big picture.
[1] http://www.w3.org/TR/2008/WD-owl2-profiles-20081202/#Introduction [2] http://www.semanticoverflow.com/questions/1210/owl-full-and-reasoning [3] Edward Thomas et al. Lightweight Reasoning and the Web of Data for Web Science, Web Science Conf. 2010, April 26-27, 2010, Raleigh, NC, USA http://journal.webscience.org/319/
Bob Morris
Robert A. Morris
Emeritus Professor of Computer Science
UMASS-Boston
100 Morrissey Blvd
Boston, MA 02125-3390
Associate, Harvard University Herbaria
email: morris.bob@gmail.com
web: http://bdei.cs.umb.edu/
web: http://etaxonomy.org/mw/FilteredPush
http://www.cs.umb.edu/~ram
phone (+1) 857 222 7992 (mobile)
*From:* Bob Morris [mailto:morris.bob@gmail.com] *Sent:* Tuesday, November 02, 2010 6:56 PM *To:* Richard Pyle *Cc:* Paul Murray; Steve Baskauf; tdwg-content@lists.tdwg.org
*Subject:* Re: [tdwg-content] Comments on Cam's RDF practical details of recording a determination What is an Occurrence? [SEC=UNCLASSIFIED]
If machine reasoning is a goal, I would be wary of FOAF. An OWL2-DL, or other OWL2 tractable reasoning profile, version remains a moving target, to the best of my knowledge. The reasons that http://xmlns.com/foaf/spec/ is not subject to tractable reasoning are relatively manageable, but I can no longer find the Zimmerman proposal for a FOAF DL version referenced in the thread ending at http://lists.w3.org/Archives/Public/public-lod/2010Jul/0378.html
Can someone point me at a DL version of FOAF and indication that it is actively under discussion somewhere?
Thanks
On Tue, Nov 2, 2010 at 10:38 PM, Richard Pyle deepreef@bishopmuseum.orgwrote:
I was thoroughly delighted to learn recently that FOAF uses terms in almost exactly the same way that I had structured my "Agents" data (right down to the same exat terms, in most cases). I plan to move forward with the FOAF terms that are relevant (thanks to John W. for pointing this out to me at TDWG).
Rich
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Paul Murray Sent: Tuesday, November 02, 2010 4:18 PM To: Steve Baskauf Cc: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] Comments on Cam's RDF practical details of recording a determination What is an Occurrence? [SEC=UNCLASSIFIED]
On 29/10/2010, at 12:41 AM, Steve Baskauf wrote:
I think both dwc:recordedBy for the Occurrence and dcterms:creator for some tokens should be provided. Depending on the situation, they might be different entities (I think John Wieczorek pointed this out in an earlier thread). dwc:recordedBy is specifically supposed to be a person whereas I think dcterms:creator could be a person or an institution.
Perhaps it might be worthwhile leveraging the FOAF vocabulary (Friend of a Friend). It's mainly meant for social networking, but nevertheless it does contain terms such as Person, Organisation, Group and Project. (Project is interesting - collection activities perhaps are FOAF Projects).
The spec is here: http://xmlns.com/foaf/spec/
We can envisage the day when, by following links on taxonomic web pages, you could eventually find an author's current Twitter address, or ask the semantic web "find me all specimens of genus Tandanus collected by teams affiliated with the University of NSW between 2005 and 2007".
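To make the shape of that kind of question concrete, here is a minimal sketch in plain Python. It is not a real semantic web query: the triples, the predicate names (genus, collectedBy, year, affiliatedWith) and the identifiers are all made up for illustration, and a real deployment would presumably use RDF with Darwin Core and FOAF terms behind a query language.

```python
# Toy triple store: (subject, predicate, object).
# Every identifier and predicate name here is hypothetical.
TRIPLES = {
    ("spec1", "genus", "Tandanus"),
    ("spec1", "collectedBy", "teamA"),
    ("spec1", "year", 2006),
    ("spec2", "genus", "Tandanus"),
    ("spec2", "collectedBy", "teamB"),
    ("spec2", "year", 2006),
    ("spec3", "genus", "Tandanus"),
    ("spec3", "collectedBy", "teamA"),
    ("spec3", "year", 2009),
    ("spec4", "genus", "Neosilurus"),
    ("spec4", "collectedBy", "teamA"),
    ("spec4", "year", 2005),
    ("teamA", "affiliatedWith", "UNSW"),
    ("teamB", "affiliatedWith", "OtherOrg"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return {o for (s, p, o) in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the store."""
    return {s for (s, p, o) in TRIPLES if p == predicate and o == obj}


def find_specimens(genus, organisation, first_year, last_year):
    """Specimens of a genus collected by a team affiliated with an
    organisation, within an inclusive range of years."""
    teams = subjects("affiliatedWith", organisation)
    hits = []
    for spec in subjects("genus", genus):
        by_team = objects(spec, "collectedBy") & teams
        in_range = any(first_year <= y <= last_year
                       for y in objects(spec, "year"))
        if by_team and in_range:
            hits.append(spec)
    return sorted(hits)


print(find_specimens("Tandanus", "UNSW", 2005, 2007))
```

The point of the sketch is only that the question needs a link from collector to team to organisation, which is the part a vocabulary like FOAF would supply.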
If you have received this transmission in error please notify us immediately by return e-mail and delete all copies. If this e-mail or any attachments have been sent to you in error, that error does not constitute waiver of any confidentiality, privilege or copyright in respect of information in the e-mail or attachments.
Please consider the environment before printing this email.
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
There are a few pages about this.
I am finding that the version available at http://xmlns.com/foaf/spec/index.rdf reads into Protege ok, and the HermiT reasoner does not complain about it - that being my usual test. The OWL consistency checker at http://www.mindswap.org/2003/pellet/demo.shtml highlights a couple of DL issues.
* Some of the social properties (icq, msn chat, etc) are inverse functional datatype properties (if two persons have the same ICQ number, then they are the same person). The problem is that there is no general way to check data literals for equality (the OWL spec, for instance, supports fractions, floats and doubles). But these properties can be blithely ignored in most cases by simple dint of not using them.
* name is declared to be a subproperty of rdfs:label, but rdfs:label is an annotation property
* the vocabulary does not import its dependencies, and so some declarations are implied
Annotation properties are a bit of a bugbear. As far as I can tell, if something is an annotation property then it should only be used to describe vocabulary terms. The actual subject matter of an ontology should be described with regular properties. Thus, declaring name to be a subproperty of label is simply wrong: a thing's name is ontology, not vocabulary. The machine-level issue is that RDF can't tell the difference between the two, so OWL/RDF cannot behave according to the OWL specification in certain respects. But dcterms title and description fit the bill admirably, anyway.
You can use the FOAF properties and classes without explicitly importing FOAF - you can just declare them. The OWL-DL incompatibilities are not serious. On the other hand ... you may dislike the fact that there are rules in FOAF at all. Maybe two different people happen to share an organisational ICQ address. FOAF at the moment would force you to create an Organisation or Group object for the team, and assign the ICQ address to that.
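A rough illustration of what the inverse functional rule does to data, as a sketch only: it is not how HermiT or Pellet work internally, the names and ICQ numbers are invented, and the values are compared as plain Python strings, which sidesteps the literal-equality problem mentioned above rather than solving it.

```python
def merge_by_inverse_functional(pairs):
    """pairs: (resource, value) statements for ONE inverse functional
    property. Resources sharing a value are grouped together, which is
    what a reasoner would conclude: they denote the same individual."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # value -> first resource seen with that value
    for resource, value in pairs:
        find(resource)  # register the resource even if it never merges
        if value in first_seen:
            union(resource, first_seen[value])
        else:
            first_seen[value] = resource

    groups = {}
    for resource in parent:
        groups.setdefault(find(resource), set()).add(resource)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical data: alice and bob share an organisational ICQ number.
icq = [("alice", "12345"), ("bob", "12345"), ("carol", "99999")]
print(merge_by_inverse_functional(icq))
```

With this data alice and bob end up in one group, which is exactly the unwanted conclusion when two different people share a team account; attaching the number to a Group or Organisation instead avoids it.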
And so yes, there are issues. The simple way to deal with it is to create TDWG terms mimicking the FOAF terms, declare them to be sameAs, and if that becomes unworkable then to break the equivalence.
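One way to picture the "mimic, equate, and break if needed" approach is a small lookup table, sketched below. The tdwg: names are hypothetical placeholders and not agreed TDWG terms; in an actual ontology the links would be stated as equivalence axioms rather than held in a Python dict.

```python
# Hypothetical TDWG-side terms mapped to the FOAF terms they mimic.
EQUIVALENT_TO = {
    "tdwg:Person": "foaf:Person",
    "tdwg:Group": "foaf:Group",
    "tdwg:Project": "foaf:Project",
}


def canonical(term, equivalences=EQUIVALENT_TO):
    """Return the FOAF term a TDWG term is currently equated with,
    or the term itself if no equivalence is declared."""
    return equivalences.get(term, term)


def break_equivalence(term, equivalences):
    """Return a copy of the mapping with one equivalence removed,
    leaving the TDWG term to stand on its own."""
    remaining = dict(equivalences)
    remaining.pop(term, None)
    return remaining


print(canonical("tdwg:Person"))
after_break = break_equivalence("tdwg:Person", EQUIVALENT_TO)
print(canonical("tdwg:Person", after_break))
print(canonical("tdwg:Group", after_break))
```

Data written against the TDWG-side names would not need to change if an equivalence were later dropped; only the mapping does.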
The OWL checker (http://www.mindswap.org/2003/pellet/demo.shtml) is outdated; see the message at the top of that page:
Pellet has been moved to http://pellet.owldl.com
*The information on this page is kept for historical/archival reasons. These pages are not updated any more. Please update your link.*
This is an OWL 2 validator:
http://owl.cs.manchester.ac.uk/validator/
- Pete
participants (16)
- "Markus Döring (GBIF)"
- Arlin Stoltzfus
- Bailly, Nicolas (WorldFish)
- Bob Morris
- Cam Webb
- Greg Whitbread
- greg whitbread
- Gregor Hagedorn
- John Wieczorek
- Mark Wilden
- Markus Döring
- Nico Franz
- Paul Murray
- Peter DeVries
- Richard Pyle
- Steve Baskauf