On Thu, Nov 4, 2010 at 12:07 PM, Richard Pyle deepreef@bishopmuseum.org wrote:
This is an excellent example of something I have to deal with occasionally, and was going to be part of my never-sent post on dealing with ambiguous identifications. In the context of DwC, my feeling is that this taxon should be represented as "Erebia" in dwc:scientificName, and the two possible species epithets included in dwc:identificationRemarks.
But that's not the data.
I would argue that it's an *accurate* representation of the data, just not a completely *precise* representation. We all have data that cannot easily be represented in DwC (without resorting to some xxxxRemarks term) -- which is a necessary compromise of a practical data exchange system designed to work across highly heterogeneous datasets.
My interest in this conversation is largely to point out that I believe defensible collections management practices should drive informatics, not the other way around. I generally agree with the above, and I think we're all willing to make compromises for DWC. We expect to have to concatenate stuff, exclude details, and omit auxiliary data. Mangling taxon assertions to fit some data model is another thing altogether. Excluding the sticky parts is not an appropriate way of dealing with heterogeneous data.
So I picked an easy example. Here's a slightly harder one: http://arctos.database.museum/guid/MVZ:Egg:2355.
Not harder at all. Two individuals (one identified as Pipilo aberti dumeticolus, and the other identified as Molothrus ater obscurus). Both are children of a parent Individual, which either doesn't have any taxon Identification associated with it (if the object consists of the nest itself, as well as the eggs), or has an Identification of "Passeriformes" associated with it (if the nest itself is considered extraneous material, and the eggs are the real object of interest).
Fine: Not harder, but certainly less precise. Get a botanist interested in the grass in the nest and we're down to something like "Eukaryota." We can and should do better than that.
Maybe so, but there it is: http://data.gbif.org/occurrences/242032297/.
Well....I think this pushes (exceeds, really) the intended purpose of DwC. That it was picked up by GBIF is only a result of it having been presented by the content provider.
Excluding that would, I think, force you to exclude things like http://arctos.database.museum/guid/UAM:ES:3359 as well
- it's all from the same administrative unit.
Just because it's from the same administrative unit doesn't mean that it has to be, or not be, considered within scope for DwC. I think a fossil is a legitimate within-scope record for DwC. The other information can, perhaps, be presented within the GeologicalContext class (or maybe not). But DwC is a data exchange system for information about organisms.
I don't have or want any control over what Curators enter - any scope-limiting filter will have to happen elsewhere.
That seems to me to be a question of database management within an institution -- not about what subset of that information gets exposed as DwC records. If the database is capable of filtering out the non-biological-relevant stuff at the time the records are generated for packaging within DwC, then such a filter should be applied accordingly. If this is not possible, then consumers will have to deal with the occasional out-of-scope records. I suspect the ratio of in-scope to out-of-scope records is such that the value of the former vastly exceeds the cost of the latter.
I suspect that alleged non-biological or "out of scope" records often aren't quite as boring or left-field as they seem. The paleontologists probably picked up that rock because it is fossiliferous, but that information has never been confirmed/entered. Knowing that it's a such-and-such rock from such-and-such a place could be the key that leads someone to go looking for those fossils in the specimen. We're working on putting ethnological objects in the same system. Those will seem even more biologically irrelevant on the surface, but they're also the best place to find things like pre-industrial walrus ivory. I'd consider that important and relevant, and I certainly wouldn't want the job of excluding records that don't seem important to my interests.
The point is simply that these are real data. We won't change them to some approximation of themselves or stuff them into a remarks field somewhere. They'll get more complicated before we're done. Anything that's to be useful to us must acknowledge the realities of collections data.
Fair enough; but as a collection wishing to present data for sharing via the DwC standard, the content provider needs to decide the relative costs/benefits of either filtering out-of-scope records out of the exposed DwC datasets, or accepting some small fraction of out-of-scope records being misinterpreted by consumers/users as in-scope records.
If anyone is interested, we accomplish the above by separating Identifications and Taxonomy. Arctos has roots deep in the ASC model discussed recently, but the link between specimens and taxonomy was one of our early divergences from that model. Assigning TaxonIDs directly to specimens is a no-win game - you either end up with the really valuable data buried in a remarks field somewhere, or you end up with an infinite list of strings that you must pretend are taxon names. Neither is acceptable. A fairly recent ER diagram can be had from http://arctos.googlecode.com/files/arctos_erd_20100129_single.pdf. Taxonomy and Identifications are in dark purple.
This seems to be a very standard way of representing Identifications and taxon names. I'm not sure I understand the issue here. The only part that I'm not clear on is the meaning of the "VARIABLE" attribute of the IDENTIFICATION_TAXONOMY entity. Is this how you enable identifications such as "Erebia youngi or Erebia lafontainei"?
But am I to understand correctly that there is a record in the TAXONOMY table where FULL_TAXON_NAME is populated with "Dark grey shale", with an INFRASPECIFIC_RANK of "Subspecies"? Wouldn't it then be worthwhile to add a field for "IS_BIOLOGICAL" to this table, to allow filtering out such taxa? Or, at least making an effort to put some standard term like "Non-Biological" within the TAXON_REMARKS field?
No, and that's the power of separating Identifications from Taxonomy. There are two scientific names involved. Identification.Scientific_Name may be things like "Canis latrans," "Sorex sp.," or "little squishy thing." Each of those things has a relationship to a Taxonomy.Scientific_Name record - "Canis latrans," "Sorex," and "unidentifiable" (our only non-biological Taxonomy.Scientific_Name), respectively. The goal is to put only taxonomy (which we define more or less as strings that can be traced back to publications) in Taxonomy, while allowing most anything in Identification. The VARIABLE mentioned above, in conjunction with TAXA_FORMULA, lets us form more than one relationship between Identification and Taxonomy, e.g., for hybrids.
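As a rough illustration of how a formula plus per-variable links could yield the display string, here is a minimal Python sketch. It is not Arctos code: the class names, the field names, and the substitution rule (each standalone capital letter in the formula is a placeholder) are assumptions made for the example; only TAXA_FORMULA and VARIABLE come from the ER diagram discussed above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentificationTaxonomy:
    """One link from an Identification to a Taxonomy record (hypothetical shape)."""
    variable: str    # placeholder used in the formula, e.g. "A"
    taxon_name: str  # the Taxonomy.Scientific_Name this link points to


@dataclass
class Identification:
    taxa_formula: str  # e.g. "A", "A or B", "A x B"
    links: list        # IdentificationTaxonomy rows for this identification

    def scientific_name(self) -> str:
        """Build the identification string by filling in the formula."""
        names = {link.variable: link.taxon_name for link in self.links}
        # Assumption: every standalone capital letter is a placeholder.
        # The formula is scanned once, so inserted names are not re-scanned.
        return re.sub(r"\b[A-Z]\b", lambda m: names[m.group(0)], self.taxa_formula)


ambiguous = Identification(
    taxa_formula="A or B",
    links=[
        IdentificationTaxonomy("A", "Erebia youngi"),
        IdentificationTaxonomy("B", "Erebia lafontainei"),
    ],
)
print(ambiguous.scientific_name())
```

The same substitution handles a nested formula such as "A x ((B x C) x D)", since only the single-letter placeholders are replaced and the operators and parentheses pass through unchanged.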
Getting back to your example identified as "Erebia youngi or Erebia lafontainei". I don't actually see this as breaking the rule I tried to articulate in a previous post, which asserted that a single Individual can have only one legitimate taxon identification. Here's what I wrote:
My proposed solution is to rigidly maintain that an instance of "Individual" cannot be partitioned to have multiple separate but concurrently legitimate Identifications associated with it. It can have multiple Identifications, but they would be considered to either be competing with each other (when different taxa are asserted) or reinforcing each other (when the same taxon is asserted).
So, although I maintain that my "accurate but less precise" method of presenting this record in DwC is still legitimate, perhaps a better way to represent identifications for your specimen http://arctos.database.museum/guid/KWP:Ento:1703 is as follows:
identificationID: 1
individualID: http://arctos.database.museum/guid/KWP:Ento:1703
taxonID: http://arctos.database.museum/name/Erebia%20youngi
identifiedBy: Kenelm W. Philip
dateIdentified: 1974-07-04
identificationQualifier: Alternative
identificationRemarks: Erebia youngi/lafontainei

identificationID: 2
individualID: http://arctos.database.museum/guid/KWP:Ento:1703
taxonID: http://arctos.database.museum/name/Erebia%20lafontainei
identifiedBy: Kenelm W. Philip
dateIdentified: 1974-07-04
identificationQualifier: Alternative
identificationRemarks: Erebia youngi/lafontainei
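To show how a consumer might read the two records above as competing alternatives for one Individual, here is a minimal Python sketch. It assumes the records arrive as plain dictionaries keyed by the term names shown, and that "Alternative" is the qualifier value; that value is not from any established vocabulary, and the function name is made up for the example.

```python
from collections import defaultdict

GUID = "http://arctos.database.museum/guid/KWP:Ento:1703"

# The two Identification records from the message, reduced to the fields used here.
records = [
    {
        "identificationID": "1",
        "individualID": GUID,
        "taxonID": "http://arctos.database.museum/name/Erebia%20youngi",
        "identificationQualifier": "Alternative",
    },
    {
        "identificationID": "2",
        "individualID": GUID,
        "taxonID": "http://arctos.database.museum/name/Erebia%20lafontainei",
        "identificationQualifier": "Alternative",
    },
]


def alternatives_by_individual(rows):
    """Map each individualID to the taxonIDs flagged as alternative identifications."""
    grouped = defaultdict(list)
    for row in rows:
        # Assumption: the qualifier string marks mutually exclusive candidates.
        if row.get("identificationQualifier") == "Alternative":
            grouped[row["individualID"]].append(row["taxonID"])
    return dict(grouped)


candidates = alternatives_by_individual(records)
print(candidates[GUID])
```

A consumer that only understands one Identification per record would still have to choose how to collapse the list, for example by falling back to the shared genus.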
The only thing stopping that is our fairly arbitrary idea that only one Identification may be "Accepted." My inclination is that the "A or B" method is more intuitive and easier for people to immediately grasp, and it's certainly more flexible. We could easily (by entering one record in a code table) create a TAXA_FORMULA of "A x ((B x C) x D)" to deal with some 3rd-generation hybrid, for example.
I don't suggest that DWC should be able to deal with the formulae - that's a data creation thing - but accepting something like the following might be appropriate.
<identification>
  <IdString>Erebia youngi or Erebia lafontainei</IdString>
  <taxon>http://arctos.database.museum/name/Erebia%20youngi</taxon>
  <taxon>http://arctos.database.museum/name/Erebia%20lafontainei</taxon>
  <otherStuff>bla bla bla</otherStuff>
</identification>
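For what it's worth, a structure like this is easy for a consumer to read. The following is a minimal Python sketch using the standard library's ElementTree; the element names are the illustrative ones from the fragment above, not an existing DwC schema, and the function name is made up for the example.

```python
import xml.etree.ElementTree as ET

# The proposed fragment, copied from the message; not a real DwC schema.
FRAGMENT = """<identification>
  <IdString>Erebia youngi or Erebia lafontainei</IdString>
  <taxon>http://arctos.database.museum/name/Erebia%20youngi</taxon>
  <taxon>http://arctos.database.museum/name/Erebia%20lafontainei</taxon>
  <otherStuff>bla bla bla</otherStuff>
</identification>"""


def parse_identification(xml_text):
    """Return the display string and every linked taxon from one identification."""
    root = ET.fromstring(xml_text)
    return {
        "id_string": root.findtext("IdString"),
        # One identification may link to any number of taxon elements.
        "taxa": [taxon.text for taxon in root.findall("taxon")],
    }


parsed = parse_identification(FRAGMENT)
print(parsed["id_string"])
print(parsed["taxa"])
```

The point of the design is that the human-readable string and the machine-readable taxon links travel together, so nothing has to be buried in a remarks field.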
-D
The only part I made up here is the dwc:identificationQualifier term of "Alternative". Perhaps when someone proposes a controlled vocabulary for dwc:identificationQualifier, something like "Alternative" could be included, with the meaning that it is one of multiple possible identifications.
The important point is that those multiple possible identifications are still mutually exclusive (and competitive), and hence conform to the rule I proposed for only one concurrent legitimate identification per Individual.
Aloha, Rich