Re: [tdwg-content] taxonomy != identification
Collections contain things that do not map nicely to a single taxon name of any (or no) rank. It's not clear to me if this proposal will support those kinds of data or not. A few examples:
Uncertainty: http://arctos.database.museum/guid/KWP:Ento:1703
Composite specimens: http://arctos.database.museum/guid/UAM:Herb:12718
Hybrids: http://arctos.database.museum/guid/UAM:Mamm:3517
Things that aren't taxonomy at all: http://arctos.database.museum/guid/UAM:ES:3405
-D
On Wed, Nov 3, 2010 at 10:07 PM, Peter DeVries pete.devries@gmail.com wrote:
What I would recommend is that you treat a specimen that is identified to an order (Perciformes) with something like the following.
Species => Order Perciformes species undetermined.
The individual is still an instance of a species, however that species has yet to be determined.
What would work best is to have some standard way of writing the green string above.
This would allow the occurrences that are of individuals identified only to the Order Perciformes, to be interpreted as a species that falls somewhere within the Order Perciformes.
- Pete
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base http://www.taxonconcept.org/ / GeoSpecies Knowledge Base http://lod.geospecies.org/ About the GeoSpecies Knowledge Base http://about.geospecies.org/
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
On Thu, Nov 4, 2010 at 12:54 AM, Dusty dlmcdonald@alaska.edu wrote:
Collections contain things that do not map nicely to a single taxon name of any (or no) rank. It's not clear to me if this proposal will support those kinds of data or not. A few examples:
Uncertainty: http://arctos.database.museum/guid/KWP:Ento:1703 => This is a Genus Erebia species undetermined.
Composite specimens: http://arctos.database.museum/guid/UAM:Herb:12718 => This is one of those batches/jars
Hybrids: http://arctos.database.museum/guid/UAM:Mamm:3517 => Canis latrans Say, 1823 x Canis lupus familiaris Linnaeus, 1758 (HybridConcept)
Things that aren't taxonomy at all: http://arctos.database.museum/guid/UAM:ES:3405 => This is some other group's vocabulary / standards (Geology)
-D
Hi Dusty,
Collections contain things that do not map nicely to a single taxon name of any (or no) rank. It's not clear to me if this proposal will support those kinds of data or not. A few examples:
Uncertainty: http://arctos.database.museum/guid/KWP:Ento:1703
This is an excellent example of something I have to deal with occasionally, and was going to be part of my never-sent post on dealing with ambiguous identifications. In the context of DwC, my feeling is that this taxon should be represented as "Erebia" in dwc:scientificName, and the two possible species epithets included in dwc:identificationRemarks.
Composite specimens: http://arctos.database.museum/guid/UAM:Herb:12718
This one could be represented as "Bupleurum" for the Individual instance representing the sheet, but then I would be inclined to establish two "child" individuals (semantically related to the "parent" sheet), one each identified to the two different taxa.
I think a lot of data models (including GNUB) treat hybrid formulae as though they are separate "taxa", with the hybrid formula as the name. Although it doesn't seem to be addressed in the DwC documentation, I would put "Canis latrans x Canis lupus familiaris" in dwc:scientificName.
Now... this may be one of those semantics-breaking pseudo-conventions that the RDF'ers will pull their hair out over (along the lines of Bob's post concerning different kinds of aggregations), in which case we should probably have another thread on this topic.
Things that aren't taxonomy at all:
http://arctos.database.museum/guid/UAM:ES:3405
Outside the scope of DwC?
Aloha, Rich
On Thu, Nov 4, 2010 at 1:14 AM, Peter DeVries pete.devries@gmail.com wrote:
Uncertainty: http://arctos.database.museum/guid/KWP:Ento:1703 => This is a Genus Erebia species undetermined.
No, it isn't. We know more than that. It's not *Erebia embla* (http://arctos.database.museum/name/Erebia%20embla), for example.
On Thu, Nov 4, 2010 at 2:10 AM, Richard Pyle deepreef@bishopmuseum.org wrote:
Hi Dusty,
Collections contain things that do not map nicely to a single taxon name of any (or no) rank. It's not clear to me if this proposal will support those kinds of data or not. A few examples:
Uncertainty: http://arctos.database.museum/guid/KWP:Ento:1703
This is an excellent example of something I have to deal with occasionally, and was going to be part of my never-sent post on dealing with ambiguous identifications. In the context of DwC, my feeling is that this taxon should be represented as "Erebia" in dwc:scientificName, and the two possible species epithets included in dwc:identificationRemarks.
But that's not the data.
Composite specimens: http://arctos.database.museum/guid/UAM:Herb:12718
This one could be represented as "Bupleurum" for the Individual instance representing the sheet, but then I would be inclined to establish two "child" individuals (semantically related to the "parent" sheet), one each identified to the two different taxa.
So I picked an easy example. Here's a slightly harder one: http://arctos.database.museum/guid/MVZ:Egg:2355.
I think a lot of data models (including GNUB) treat hybrid formulae as though they are separate "taxa", with the hybrid formula as the name. Although it doesn't seem to be addressed in the DwC documentation, I would put "Canis latrans x Canis lupus familiaris" in dwc:scientificName.
Now... this may be one of those semantics-breaking pseudo-conventions that the RDF'ers will pull their hair out over (along the lines of Bob's post concerning different kinds of aggregations), in which case we should probably have another thread on this topic.
Things that aren't taxonomy at all:
http://arctos.database.museum/guid/UAM:ES:3405
Outside the scope of DwC?
Maybe so, but there it is: http://data.gbif.org/occurrences/242032297/. Excluding that would, I think, force you to exclude things like http://arctos.database.museum/guid/UAM:ES:3359 as well - it's all from the same administrative unit. I don't have or want any control over what Curators enter - any scope-limiting filter will have to happen elsewhere.
The point is simply that these are real data. We won't change them to some approximation of themselves or stuff them into a remarks field somewhere. They'll get more complicated before we're done. Anything that's to be useful to us must acknowledge the realities of collections data.
If anyone is interested, we accomplish the above by separating Identifications and Taxonomy. Arctos has roots deep in the ASC model discussed recently, but the link between specimens and taxonomy was one of our early divergences from that model. Assigning TaxonIDs directly to specimens is a no-win game - you either end up with the really valuable data buried in a remarks field somewhere, or you end up with an infinite list of strings that you must pretend are taxon names. Neither is acceptable. A fairly recent ER diagram can be had from http://arctos.googlecode.com/files/arctos_erd_20100129_single.pdf. Taxonomy and Identifications are in dark purple.
--D
Dusty, Nice thought-provoking examples. I think that it is safe to say that not everyone is going to want to use this broadly-defined concept of Individual. But it will be there for people who want (or in my case need) to use it. Those who want to use it in complex ways will have to bear the burden of figuring out how. A few comments inline:
Dusty wrote:
...
> Composite specimens: http://arctos.database.museum/guid/UAM:Herb:12718
This one could be represented as "Bupleurum" for the Individual instance representing the sheet, but then I would be inclined to establish two "child" individuals (semantically related to the "parent" sheet), one each identified to the two different taxa.
So I picked an easy example. Here's a slightly harder one: http://arctos.database.museum/guid/MVZ:Egg:2355.
How about something like this? Assign an Individual ID to the bird that built the nest, which would be an Occurrence (documentary evidence of the bird that built it). Assign an Individual ID to the brood-parasitic bird that laid the egg in the nest, which would also be an Occurrence. Use the DwC Resource Relationship terms (http://rs.tdwg.org/dwc/terms/index.htm#ResourceRelationship) to define the parasitism relationship between the two Occurrences. This would be a good opportunity for John to demonstrate how one does this - I've never been clear exactly how it is supposed to work.
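A minimal sketch of what such a pair of Occurrences and the relationship between them might look like, written as plain Python dictionaries. The dictionary keys are Darwin Core term names; the `urn:example:` identifiers, the `relate` helper, and the "brood parasite of" label are hypothetical and only illustrate the shape of the records. The two names are the ones given for this specimen elsewhere in the thread.

```python
# Sketch only: identifiers and the relationship label are made up for
# illustration; the keys are Darwin Core term names.
host_occurrence = {
    "occurrenceID": "urn:example:occ:host",       # hypothetical identifier
    "individualID": "urn:example:ind:host",       # hypothetical identifier
    "scientificName": "Pipilo aberti dumeticolus",
}
parasite_occurrence = {
    "occurrenceID": "urn:example:occ:parasite",   # hypothetical identifier
    "individualID": "urn:example:ind:parasite",   # hypothetical identifier
    "scientificName": "Molothrus ater obscurus",
}


def relate(subject, related, relationship):
    """Build a ResourceRelationship-style record linking two occurrences."""
    return {
        "resourceID": subject["occurrenceID"],
        "relatedResourceID": related["occurrenceID"],
        "relationshipOfResource": relationship,
    }


# The egg-layer is the subject; the nest-builder is the related resource.
parasitism = relate(parasite_occurrence, host_occurrence, "brood parasite of")
```

The relationship is directional, so a consumer reading the record needs to know which occurrence is the subject and which is the related resource.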
...
> Things that aren't taxonomy at all: http://arctos.database.museum/guid/UAM:ES:3405
Outside the scope of DwC?
Maybe so, but there it is: http://data.gbif.org/occurrences/242032297/. Excluding that would, I think, force you to exclude things like http://arctos.database.museum/guid/UAM:ES:3359 as well - it's all from the same administrative unit. I don't have or want any control over what Curators enter - any scope-limiting filter will have to happen elsewhere.
People are always going to misapply terms. I think sending records of rocks and minerals to GBIF as occurrences is an error (out of scope). So I don't feel any need to explain how to handle something like that.
The point is simply that these are real data. We won't change them to some approximation of themselves or stuff them into a remarks field somewhere. They'll get more complicated before we're done. Anything that's to be useful to us must acknowledge the realities of collections data.
True enough. Thanks for the challenge!
Steve
Quick comment:
How about something like this? Assign an Individual ID to the bird that built the nest, which would be an Occurrence (documentary evidence of the bird that built it). Assign an Individual ID to the brood-parasitic bird that laid the egg in the nest, which would also be an Occurrence. Use the DwC Resource Relationship terms (http://rs.tdwg.org/dwc/terms/index.htm#ResourceRelationship) to define the parasitism relationship between the two Occurrences. This would be a good opportunity for John to demonstrate how one does this - I've never been clear exactly how it is supposed to work.
I had originally assumed that the objects of interest were the eggs from two different species of birds, aggregated into a single nest. If that's the case, then I think how I described it is appropriate. But I also think what Steve describes above is appropriate, if that's a better context for how the record should be represented.
Aloha, Rich
This is an excellent example of something I have to deal with occasionally, and was going to be part of my never-sent post on dealing with ambiguous identifications. In the context of DwC, my feeling is that this taxon should be represented as "Erebia" in dwc:scientificName, and the two possible species epithets included in dwc:identificationRemarks.
But that's not the data.
I would argue that it's an *accurate* representation of the data, just not a completely *precise* representation. We all have data that cannot easily be represented in DwC (without resorting to some xxxxRemarks term) -- which is a necessary compromise of a practical data exchange system designed to work across highly heterogeneous datasets.
So I picked an easy example. Here's a slightly harder one: http://arctos.database.museum/guid/MVZ:Egg:2355.
Not harder at all. Two individuals (one identified as Pipilo aberti dumeticolus, and the other identified as Molothrus ater obscurus). Both are children of a parent Individual, which either doesn't have any taxon Identification associated with it (if the object consists of the nest itself, as well as the eggs), or has an Identification of "Passeriformes" associated with it (if the nest itself is considered extraneous material, and the eggs are the real object of interest).
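The parent/child arrangement described above can be sketched as a small tree. This is only an illustration under assumptions: the `Individual` class, its field names, and the `/a` and `/b` child identifiers are hypothetical, and the parent is shown in the "nest plus eggs" variant, with no taxon asserted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Individual:
    """Hypothetical Individual node; not an Arctos or DwC structure."""
    individual_id: str
    identification: Optional[str] = None
    children: List["Individual"] = field(default_factory=list)

    def add_child(self, child: "Individual") -> "Individual":
        self.children.append(child)
        return child

    def identifications(self) -> List[str]:
        """Collect every identification in this subtree, parent first."""
        found = [self.identification] if self.identification else []
        for child in self.children:
            found.extend(child.identifications())
        return found


# Parent: the nest with its eggs, carrying no taxon identification.
nest = Individual("MVZ:Egg:2355")
# Children: one Individual per taxon, each with exactly one identification.
nest.add_child(Individual("MVZ:Egg:2355/a", "Pipilo aberti dumeticolus"))
nest.add_child(Individual("MVZ:Egg:2355/b", "Molothrus ater obscurus"))
```

Setting `identification="Passeriformes"` on the parent would give the other variant, where the eggs alone are the object of interest.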
Maybe so, but there it is: http://data.gbif.org/occurrences/242032297/.
Well....I think this pushes (exceeds, really) the intended purpose of DwC. That it was picked up by GBIF is only a result of it having been presented by the content provider.
Excluding that would, I think, force you to exclude things like http://arctos.database.museum/guid/UAM:ES:3359 as well
- it's all from the same administrative unit.
Just because it's from the same administrative unit doesn't mean that it has to be, or not be, considered within scope for DwC. I think a fossil is a legitimate within-scope record for DwC. The other information can, perhaps, be presented within the GeologicalContext class (or maybe not). But DwC is a data exchange system for information about organisms.
I don't have or want any control over what Curators enter - any scope-limiting filter will have to happen elsewhere.
That seems to me to be a question of database management within an institution -- not about what subset of that information gets exposed as DwC records. If the database is capable of filtering out the non-biological-relevant stuff at the time the records are generated for packaging within DwC, then such a filter should be applied accordingly. If this is not possible, then consumers will have to deal with the occasional out-of-scope records. I suspect the ratio of in-scope to out-of-scope records is such that the value of the former vastly exceeds the cost of the latter.
The point is simply that these are real data. We won't change them to some approximation of themselves or stuff them into a remarks field somewhere. They'll get more complicated before we're done. Anything that's to be useful to us must acknowledge the realities of collections data.
Fair enough; but as a collection wishing to present data for sharing via the DwC standard, the content provider needs to decide the relative costs/benefits of either filtering out-of-scope records out of the exposed DwC datasets, or accepting some small fraction of out-of-scope records being misinterpreted by consumers/users as in-scope records.
If anyone is interested, we accomplish the above by separating Identifications and Taxonomy. Arctos has roots deep in the ASC model discussed recently, but the link between specimens and taxonomy was one of our early divergences from that model. Assigning TaxonIDs directly to specimens is a no-win game - you either end up with the really valuable data buried in a remarks field somewhere, or you end up with an infinite list of strings that you must pretend are taxon names. Neither is acceptable. A fairly recent ER diagram can be had from http://arctos.googlecode.com/files/arctos_erd_20100129_single.pdf. Taxonomy and Identifications are in dark purple.
This seems to be a very standard way of representing Identifications and taxon names. I'm not sure I understand the issue here. The only part that I'm not clear on is the meaning of the "VARIABLE" attribute of the IDENTIFICATION_TAXONOMY entity. Is this how you enable identifications such as "Erebia youngi or Erebia lafontainei"?
But am I to understand correctly that there is a record in the TAXONOMY table where FULL_TAXON_NAME is populated with "Dark grey shale", with an INFRASPECIFIC_RANK of "Subspecies"? Wouldn't it then be worthwhile to add a field for "IS_BIOLOGICAL" to this table, to allow filtering out such taxa? Or, at least making an effort to put some standard term like "Non-Biological" within the TAXON_REMARKS field?
Getting back to your example identified as "Erebia youngi or Erebia lafontainei". I don't actually see this as breaking the rule I tried to articulate in a previous post, which asserted that a single Individual can have only one legitimate taxon identification. Here's what I wrote:
My proposed solution is to rigidly maintain that an instance of "Individual" can not be partitioned to have multiple separate but concurrently legitimate Identifications associated with it. It can have multiple Identifications, but they would be considered to either be competing with each other (when different taxa are asserted) or reinforcing each other (when the same taxon is asserted).
So, although I maintain that my "accurate but less precise" method of presenting this record in DwC is still legitimate, perhaps a better way to represent identifications for your specimen http://arctos.database.museum/guid/KWP:Ento:1703 is as follows:
identificationID: 1
individualID: http://arctos.database.museum/guid/KWP:Ento:1703
taxonID: http://arctos.database.museum/name/Erebia%20youngi
identifiedBy: Kenelm W. Philip
dateIdentified: 1974-07-04
identificationQualifier: Alternate
identificationRemarks: Erebia youngi/lafontainei

identificationID: 2
individualID: http://arctos.database.museum/guid/KWP:Ento:1703
taxonID: http://arctos.database.museum/name/Erebia%20lafontainei
identifiedBy: Kenelm W. Philip
dateIdentified: 1974-07-04
identificationQualifier: Alternate
identificationRemarks: Erebia youngi/lafontainei
The only part I made up here is the dwc:identificationQualifier term of "Alternate". Perhaps when someone proposes a controlled vocabulary for dwc:identificationQualifier, something like "Alternate" could be included, with the meaning that it is one of multiple possible identifications.
The important point is that those multiple possible identifications are still mutually exclusive (and competitive), and hence conform to the rule I proposed for only one concurrent legitimate identification per Individual.
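As a rough sketch of how a consumer might read records like these, the snippet below groups the competing identifications by Individual. It assumes the made-up qualifier value "Alternate" discussed above; the `alternatives_by_individual` function is hypothetical, and only a subset of the terms is carried along.

```python
from collections import defaultdict

SPECIMEN = "http://arctos.database.museum/guid/KWP:Ento:1703"

identifications = [
    {
        "identificationID": "1",
        "individualID": SPECIMEN,
        "taxonID": "http://arctos.database.museum/name/Erebia%20youngi",
        "identificationQualifier": "Alternate",  # assumed vocabulary term
    },
    {
        "identificationID": "2",
        "individualID": SPECIMEN,
        "taxonID": "http://arctos.database.museum/name/Erebia%20lafontainei",
        "identificationQualifier": "Alternate",  # assumed vocabulary term
    },
]


def alternatives_by_individual(records):
    """Map each individualID to the taxa asserted as competing alternatives."""
    grouped = defaultdict(list)
    for record in records:
        if record.get("identificationQualifier") == "Alternate":
            grouped[record["individualID"]].append(record["taxonID"])
    return dict(grouped)


competing = alternatives_by_individual(identifications)
```

Each Individual maps to a list of mutually exclusive candidates, so a consumer that wants a single safe name would have to fall back to what the candidates share (here the genus).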
Aloha, Rich
On Thu, Nov 4, 2010 at 12:07 PM, Richard Pyle deepreef@bishopmuseum.org wrote:
This is an excellent example of something I have to deal with occasionally, and was going to be part of my never-sent post on dealing with ambiguous identifications. In the context of DwC, my feeling is that this taxon should be represented as "Erebia" in dwc:scientificName, and the two possible species epithets included in dwc:identificationRemarks.
But that's not the data.
I would argue that it's an *accurate* representation of the data, just not a completely *precise* representation. We all have data that cannot easily be represented in DwC (without resorting to some xxxxRemarks term) -- which is a necessary compromise of a practical data exchange system designed to work across highly heterogeneous datasets.
My interest in this conversation is largely to point out that I believe defensible collections management practices should drive informatics, not the other way around. I generally agree with the above, and I think we're all willing to make compromises for DWC. We expect to have to concatenate stuff, exclude details, and omit auxiliary data. Mangling taxon assertions to fit some data model is another thing altogether. Excluding the sticky parts is not an appropriate way of dealing with heterogeneous data.
So I picked an easy example. Here's a slightly harder one: http://arctos.database.museum/guid/MVZ:Egg:2355.
Not harder at all. Two individuals (one identified as Pipilo aberti dumeticolus, and the other identified as Molothrus ater obscurus). Both are children of a parent Individual, which either doesn't have any taxon Identification associated with it (if the object consists of the nest itself, as well as the eggs), or has an Identification of "Passeriformes" associated with it (if the nest itself is considered extraneous material, and the eggs are the real object of interest).
Fine: Not harder, but certainly less precise. Get a botanist interested in the grass in the nest and we're down to something like "Eukaryota." We can and should do better than that.
Maybe so, but there it is: http://data.gbif.org/occurrences/242032297/.
Well....I think this pushes (exceeds, really) the intended purpose of DwC. That it was picked up by GBIF is only a result of it having been presented by the content provider.
Excluding that would, I think, force you to exclude things like http://arctos.database.museum/guid/UAM:ES:3359 as well
- it's all from the same administrative unit.
Just because it's from the same administrative unit doesn't mean that it has to be, or not be, considered within scope for DwC. I think a fossil is a legitimate within-scope record for DwC. The other information can, perhaps, be presented within the GeologicalContext class (or maybe not). But DwC is a data exchange system for information about organisms.
I don't have or want any control over what Curators enter - any scope-limiting filter will have to happen elsewhere.
That seems to me to be a question of database management within an institution -- not about what subset of that information gets exposed as DwC records. If the database is capable of filtering out the non-biological-relevant stuff at the time the records are generated for packaging within DwC, then such a filter should be applied accordingly. If this is not possible, then consumers will have to deal with the occasional out-of-scope records. I suspect the ratio of in-scope to out-of-scope records is such that the value of the former vastly exceeds the cost of the latter.
I suspect that alleged non-biological or "out of scope" records often aren't quite as boring or left-field as they seem. The paleontologists probably picked up that rock because it is fossiliferous, but that information has never been confirmed/entered. Knowing that it's a such-and-such rock from such-and-such a place could be the key that leads someone to go looking for those fossils in the specimen. We're working on putting ethnological objects in the same system. Those will seem even more biologically irrelevant on the surface, but they're also the best place to find things like pre-industrial walrus ivory. I'd consider that important and relevant, and I certainly wouldn't want the job of excluding records that don't seem important to my interests.
The point is simply that these are real data. We won't change them to some approximation of themselves or stuff them into a remarks field somewhere. They'll get more complicated before we're done. Anything that's to be useful to us must acknowledge the realities of collections data.
Fair enough; but as a collection wishing to present data for sharing via the DwC standard, the content provider needs to decide the relative costs/benefits of either filtering out-of-scope records out of the exposed DwC datasets, or accepting some small fraction of out-of-scope records being misinterpreted by consumers/users as in-scope records.
If anyone is interested, we accomplish the above by separating Identifications and Taxonomy. Arctos has roots deep in the ASC model discussed recently, but the link between specimens and taxonomy was one of our early divergences from that model. Assigning TaxonIDs directly to specimens is a no-win game - you either end up with the really valuable data buried in a remarks field somewhere, or you end up with an infinite list of strings that you must pretend are taxon names. Neither is acceptable. A fairly recent ER diagram can be had from http://arctos.googlecode.com/files/arctos_erd_20100129_single.pdf. Taxonomy and Identifications are in dark purple.
This seems to be a very standard way of representing Identifications and taxon names. I'm not sure I understand the issue here. The only part that I'm not clear on is the meaning of the "VARIABLE" attribute of the IDENTIFICATION_TAXONOMY entity. Is this how you enable identifications such as "Erebia youngi or Erebia lafontainei"?
But am I to understand correctly that there is a record in the TAXONOMY table where FULL_TAXON_NAME is populated with "Dark grey shale", with an INFRASPECIFIC_RANK of "Subspecies"? Wouldn't it then be worthwhile to add a field for "IS_BIOLOGICAL" to this table, to allow filtering out such taxa? Or, at least making an effort to put some standard term like "Non-Biological" within the TAXON_REMARKS field?
No, and that's the power of separating Identifications from Taxonomy. There are two scientific names involved. Identification.Scientific_Name may be things like "Canis latrans," "Sorex sp.," or "little squishy thing." Each of those things has a relationship to a Taxonomy.Scientific_Name record - "Canis latrans," "Sorex," and "unidentifiable" (our only non-biological Taxonomy.Scientific_Name), respectively. The goal is to put only taxonomy (which we define more or less as strings that can be traced back to publications) in Taxonomy, while allowing most anything in Identification. The VARIABLE mentioned above, in conjunction with TAXA_FORMULA, lets us form more than one relationship between Identification and Taxonomy, e.g., for hybrids.
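To make the separation concrete, here is a minimal sketch of the idea, not the actual Arctos schema. The `Taxon` and `Identification` classes and the `identify` helper are hypothetical; the assumption is that a formula uses single capital letters (A, B, ...) as variables, each bound to one Taxonomy record.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Taxon:
    """Hypothetical Taxonomy record: a name traceable to a publication."""
    scientific_name: str


@dataclass
class Identification:
    """Hypothetical Identification: free text plus one or more taxon links."""
    scientific_name: str
    taxa: List[Taxon]


def identify(taxa_formula: str, variables: Dict[str, Taxon]) -> Identification:
    """Fill the single-letter variables of a formula with taxon names."""
    def fill(match):
        return variables[match.group(0)].scientific_name

    # \b[A-Z]\b matches a lone capital letter, so the "x" in "A x B" and the
    # words "or" and "sp." are left untouched.
    name = re.sub(r"\b[A-Z]\b", fill, taxa_formula)
    return Identification(name, list(variables.values()))


hybrid = identify(
    "A x B",
    {"A": Taxon("Canis latrans"), "B": Taxon("Canis lupus familiaris")},
)
unsure = identify(
    "A or B",
    {"A": Taxon("Erebia youngi"), "B": Taxon("Erebia lafontainei")},
)
shrew = identify("A sp.", {"A": Taxon("Sorex")})
# Anything at all can be an identification string; it still links to Taxonomy.
vague = Identification("little squishy thing", [Taxon("unidentifiable")])
```

The identification string stays whatever the determiner asserted, while every link still points at a published name, which is the property the paragraph above is arguing for.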
Getting back to your example identified as "Erebia youngi or Erebia lafontainei". I don't actually see this as breaking the rule I tried to articulate in a previous post, which asserted that a single Individual can have only one legitimate taxon identification. Here's what I wrote:
My proposed solution is to rigidly maintain that an instance of "Individual" can not be partitioned to have multiple separate but concurrently legitimate Identifications associated with it. It can have multiple Identifications, but they would be considered to either be competing with each other (when different taxa are asserted) or reinforcing each other (when the same taxon is asserted).
So, although I maintain that my "accurate but less precise" method of presenting this record in DwC is still legitimate, perhaps a better way to represent identifications for your specimen http://arctos.database.museum/guid/KWP:Ento:1703 is as follows:
identificationID: 1
individualID: http://arctos.database.museum/guid/KWP:Ento:1703
taxonID: http://arctos.database.museum/name/Erebia%20youngi
identifiedBy: Kenelm W. Philip
dateIdentified: 1974-07-04
identificationQualifier: Alternate
identificationRemarks: Erebia youngi/lafontainei

identificationID: 2
individualID: http://arctos.database.museum/guid/KWP:Ento:1703
taxonID: http://arctos.database.museum/name/Erebia%20lafontainei
identifiedBy: Kenelm W. Philip
dateIdentified: 1974-07-04
identificationQualifier: Alternate
identificationRemarks: Erebia youngi/lafontainei
The only thing stopping that is our fairly arbitrary idea that only one Identification may be "Accepted." My inclination is that the "A or B" method is more intuitive and easier for people to immediately grasp, and it's certainly more flexible. We could easily (by entering one record in a code table) create a TAXA_FORMULA of "A x ((B x C) x D)" to deal with some 3rd-generation hybrid, for example.
I don't suggest that DWC should be able to deal with the formulae - that's a data creation thing - but accepting something like the following might be appropriate.
<identification>
  <IdString>Erebia youngi or Erebia lafontainei</IdString>
  <taxon>http://arctos.database.museum/name/Erebia%20youngi</taxon>
  <taxon>http://arctos.database.museum/name/Erebia%20lafontainei</taxon>
  <otherStuff>bla bla bla</otherStuff>
</identification>
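For what it's worth, a consumer could read a fragment like that with very little code. The sketch below uses Python's standard `xml.etree.ElementTree`; the element names are the illustrative ones from the fragment above, not an existing DwC schema, and `read_identification` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

snippet = """
<identification>
  <IdString>Erebia youngi or Erebia lafontainei</IdString>
  <taxon>http://arctos.database.museum/name/Erebia%20youngi</taxon>
  <taxon>http://arctos.database.museum/name/Erebia%20lafontainei</taxon>
  <otherStuff>bla bla bla</otherStuff>
</identification>
"""


def read_identification(xml_text):
    """Return the identification string and every linked taxon URL."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id_string": root.findtext("IdString"),
        "taxa": [taxon.text for taxon in root.findall("taxon")],
    }


parsed = read_identification(snippet)
```

The consumer gets the string as asserted plus a list of taxa of any length, without needing to understand the formula that produced the string.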
-D
The only part I made up here is the dwc:identificationQualifier term of "Alternate". Perhaps when someone proposes a controlled vocabulary for dwc:identificationQualifier, something like "Alternate" could be included, with the meaning that it is one of multiple possible identifications.
The important point is that those multiple possible identifications are still mutually exclusive (and competitive), and hence conform to the rule I proposed for only one concurrent legitimate identification per Individual.
Aloha, Rich
participants (4)
- Dusty
- Peter DeVries
- Richard Pyle
- Steve Baskauf