[tdwg-content] taxonomy != identification

Dusty dlmcdonald at alaska.edu
Fri Nov 5 13:36:52 CET 2010


On Thu, Nov 4, 2010 at 12:07 PM, Richard Pyle <deepreef at bishopmuseum.org> wrote:

> > > This is an excellent example of something I have to deal with
> > > occasionally, and was going to be part of my never-sent post on dealing
> > > with ambiguous identifications.  In the context of DwC, my feeling is
> > > that this taxon should be represented as "Erebia" in dwc:scientificName,
> > > and the two possible species epithets included in
> > > dwc:identificationRemarks.
> >
> > But that's not the data.
>
> I would argue that it's an *accurate* representation of the data, just not
> a completely *precise* representation.  We all have data that cannot easily
> be represented in DwC (without resorting to some xxxxRemarks term) -- which
> is a necessary compromise of a practical data exchange system designed to
> work across highly heterogeneous datasets.
>

My interest in this conversation is largely to point out that I believe
defensible collections management practices should drive informatics, not
the other way around. I generally agree with the above, and I think we're
all willing to make compromises for DWC. We expect to have to concatenate
stuff, exclude details, and omit auxiliary data. Mangling taxon assertions
to fit some data model is another thing altogether. Excluding the sticky
parts is not an appropriate way of dealing with heterogeneous data.


>
> > So I picked an easy example. Here's a slightly
> > harder one: http://arctos.database.museum/guid/MVZ:Egg:2355.
>
> Not harder at all.  Two individuals (one identified as Pipilo aberti
> dumeticolus, and the other identified as Molothrus ater obscurus). Both are
> children of a parent Individual, which either doesn't have any taxon
> Identification associated with it (if the object consists of the nest
> itself, as well as the eggs), or has an Identification of "Passeriformes"
> associated with it (if the nest itself is considered extraneous material,
> and the eggs are the real object of interest).
>

Fine: Not harder, but certainly less precise. Get a botanist interested in
the grass in the nest and we're down to something like "Eukaryota." We can
and should do better than that.


> > Maybe so, but there it is: http://data.gbif.org/occurrences/242032297/.
>
> Well....I think this pushes (exceeds, really) the intended purpose of DwC.
> That it was picked up by GBIF is only a result of it having been presented
> by the content provider.
>
> > Excluding that would, I think, force you to exclude things
> > like http://arctos.database.museum/guid/UAM:ES:3359 as well
> > - it's all from the same administrative unit.
>
> Just because it's from the same administrative unit doesn't mean that it
> has to be, or not be, considered within scope for DwC.  I think a fossil is a
> legitimate within-scope record for DwC. The other information can, perhaps,
> be presented within the GeologicalContext class (or maybe not).  But DwC is
> a data exchange system for information about organisms.
>
> > I don't have or want any control over what Curators
> > enter - any scope-limiting filter will have to happen
> > elsewhere.
>
> That seems to me to be a question of database management within an
> institution -- not about what subset of that information gets exposed as
> DwC records.  If the database is capable of filtering out the
> non-biological-relevant stuff at the time the records are generated for
> packaging within DwC, then such a filter should be applied accordingly.  If
> this is not possible, then consumers will have to deal with the occasional
> out-of-scope records.  I suspect the ratio of in-scope to out-of-scope
> records is such that the value of the former vastly exceeds the cost of the
> latter.
>

I suspect that alleged non-biological or "out of scope" records often aren't
quite as boring or left-field as they seem. The paleontologists probably
picked up that rock because it is fossiliferous, but that information has
never been confirmed/entered. Knowing that it's a such-and-such rock from
such-and-such a place could be the key that leads someone to go looking for
those fossils in the specimen. We're working on putting ethnological objects
in the same system. Those will seem even more biologically irrelevant on the
surface, but they're also the best place to find things like pre-industrial
walrus ivory. I'd consider that important and relevant, and I certainly
wouldn't want the job of excluding records that don't seem important to my
interests.


>
> > The point is simply that these are real data. We won't
> > change them to some approximation of themselves or stuff
> > them into a remarks field somewhere. They'll get more
> > complicated before we're done. Anything that's to be
> > useful to us must acknowledge the realities of
> > collections data.
>
> Fair enough; but as a collection wishing to present data for sharing via
> the DwC standard, the content provider needs to decide the relative
> costs/benefits of either filtering out-of-scope records out of the exposed
> DwC datasets, or accepting some small fraction of out-of-scope records
> being misinterpreted by consumers/users as in-scope records.
>
> > If anyone is interested, we accomplish the above by
> > separating Identifications and Taxonomy. Arctos has
> > roots deep in the ASC model discussed recently, but
> > the link between specimens and taxonomy was one of
> > our early divergences from that model. Assigning
> > TaxonIDs directly to specimens is a no-win game -
> > you either end up with the really valuable data
> > buried in a remarks field somewhere, or you end up
> > with an infinite list of strings that you must
> > pretend are taxon names. Neither is acceptable.
> > A fairly recent ER diagram can be had from
> > http://arctos.googlecode.com/files/arctos_erd_20100129_single.pdf.
> > Taxonomy and Identifications are in dark purple.
>
> This seems to be a very standard way of representing Identifications and
> taxon names.  I'm not sure I understand the issue here.  The only part that
> I'm not clear on is the meaning of the "VARIABLE" attribute of the
> IDENTIFICATION_TAXONOMY entity. Is this how you enable identifications such
> as "Erebia youngi or Erebia lafontainei"?
>
> But am I to understand correctly that there is a record in the TAXONOMY
> table where FULL_TAXON_NAME is populated with "Dark grey shale", with an
> INFRASPECIFIC_RANK of "Subspecies"?  Wouldn't it then be worthwhile to add
> a field for "IS_BIOLOGICAL" to this table, to allow filtering out such taxa?
> Or, at least making an effort to put some standard term like
> "Non-Biological" within the TAXON_REMARKS field?
>

No, and that's the power of separating Identifications from Taxonomy. There
are two scientific names involved. Identification.Scientific_Name may be
things like "Canis latrans," "Sorex sp.," or "little squishy thing." Each of
those things has a relationship to a Taxonomy.Scientific_Name record -
"Canis latrans," "Sorex," and "unidentifiable" (our only non-biological
Taxonomy.Scientific_Name), respectively. The goal is to put only taxonomy
(which we define more or less as strings that can be traced back to
publications) in Taxonomy, while allowing most anything in Identification.
The VARIABLE mentioned above, in conjunction with TAXA_FORMULA, lets us form
more than one relationship between Identification and Taxonomy, e.g., for
hybrids.
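
A minimal sketch of that separation, in Python rather than SQL. The class and
field names here are illustrative assumptions, not the actual Arctos schema;
only the example strings come from the paragraph above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Taxon:
    """A Taxonomy record: a name that can be traced back to a publication."""
    scientific_name: str


@dataclass
class Identification:
    """An Identification: nearly any string, linked to one or more Taxon records."""
    scientific_name: str                      # what the identifier actually asserted
    taxa: dict = field(default_factory=dict)  # hypothetical: VARIABLE ("A", "B", ...) -> Taxon


coyote = Identification("Canis latrans", {"A": Taxon("Canis latrans")})
shrew = Identification("Sorex sp.", {"A": Taxon("Sorex")})
blob = Identification("little squishy thing", {"A": Taxon("unidentifiable")})

# One Identification may link to several Taxonomy records.
erebia = Identification(
    "Erebia youngi or Erebia lafontainei",
    {"A": Taxon("Erebia youngi"), "B": Taxon("Erebia lafontainei")},
)
```

The point of the shape is that the Identification string is never forced to be
a taxon name; it only has to point at one or more of them.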


> Getting back to your example identified as "Erebia youngi or Erebia
> lafontainei".  I don't actually see this as breaking the rule I tried to
> articulate in a previous post, which asserted that a single Individual can
> have only one legitimate taxon identification.  Here's what I wrote:
>
> > My proposed solution is to rigidly maintain that an
> > instance of "Individual" can not be partitioned to
> > have multiple separate but concurrently legitimate
> > Identifications associated with it. It can have
> > multiple Identifications, but they would be considered
> > to either be competing with each other (when different
> > taxa are asserted) or reinforcing each other (when the
> > same taxon is asserted).
>
> So, although I maintain that my "accurate but less precise" method of
> presenting this record in DwC is still legitimate, perhaps a better way to
> represent identifications for your specimen
> http://arctos.database.museum/guid/KWP:Ento:1703 is as follows:
>
> identificationID: 1
> individualID: http://arctos.database.museum/guid/KWP:Ento:1703
> taxonID: http://arctos.database.museum/name/Erebia%20youngi
> identifiedBy: Kenelm W. Philip
> dateIdentified: 1974-07-04
> identificationQualifier: Alternative
> identificationRemarks: Erebia youngi/lafontainei
>
> identificationID: 2
> individualID: http://arctos.database.museum/guid/KWP:Ento:1703
> taxonID: http://arctos.database.museum/name/Erebia%20lafontainei
> identifiedBy: Kenelm W. Philip
> dateIdentified: 1974-07-04
> identificationQualifier: Alternative
> identificationRemarks: Erebia youngi/lafontainei
>

The only thing stopping that is our fairly arbitrary idea that only one
Identification may be "Accepted." My inclination is that the "A or B" method is
more intuitive and easier for people to immediately grasp, and it's
certainly more flexible. We could easily (by entering one record in a code
table) create a TAXA_FORMULA of "A x ((B x C) x D)" to deal with some
3rd-generation hybrid, for example.
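
As a rough illustration of how such a formula could be expanded into a display
string: the function below is a sketch under the assumption that each VARIABLE
is a single capital letter, and the hybrid taxon names are placeholders. It is
not how Arctos actually implements this.

```python
import re


def render_formula(taxa_formula, taxa):
    """Replace each single-letter VARIABLE in the formula with its taxon name."""
    def substitute(match):
        return taxa[match.group(0)]

    # \b[A-Z]\b matches standalone capital letters only, so the lowercase
    # "x" (hybrid sign) and the word "or" are left untouched.
    return re.sub(r"\b[A-Z]\b", substitute, taxa_formula)


either = render_formula(
    "A or B",
    {"A": "Erebia youngi", "B": "Erebia lafontainei"},
)

# Placeholder names, purely for illustration of the nested hybrid case.
hybrid = render_formula(
    "A x ((B x C) x D)",
    {"A": "Aus bus", "B": "Aus cus", "C": "Aus dus", "D": "Aus eus"},
)
```

Under those assumptions, adding a new kind of identification means adding a
formula string, not changing the structure.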

I don't suggest that DWC should be able to deal with the formulae - that's a
data creation thing - but accepting something like the following might be
appropriate.

<identification>
  <IdString>Erebia youngi or Erebia lafontainei</IdString>
  <taxon>http://arctos.database.museum/name/Erebia%20youngi</taxon>
  <taxon>http://arctos.database.museum/name/Erebia%20lafontainei</taxon>
  <otherStuff>bla bla bla</otherStuff>
</identification>
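
A consumer could read a record of that shape with very little code. This is
only a sketch using Python's standard xml.etree.ElementTree; the element names
are the made-up ones from the fragment above, not an existing DwC schema.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<identification>
  <IdString>Erebia youngi or Erebia lafontainei</IdString>
  <taxon>http://arctos.database.museum/name/Erebia%20youngi</taxon>
  <taxon>http://arctos.database.museum/name/Erebia%20lafontainei</taxon>
</identification>
"""


def read_identification(xml_text):
    """Return the identification string and every linked taxon URI."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id_string": root.findtext("IdString"),
        "taxa": [element.text for element in root.findall("taxon")],
    }


record = read_identification(FRAGMENT)
```

The identification string stays intact, and the taxon links remain available
to anyone who wants to search by name.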


-D


> The only part I made up here is the dwc:identificationQualifier term of
> "Alternative".  Perhaps when someone proposes a controlled vocabulary for
> dwc:identificationQualifier, something like "Alternative" could be included,
> with the meaning that it is one of multiple possible identifications.
>
> The important point is that those multiple possible identifications are
> still mutually exclusive (and competitive), and hence conform to the rule
> I proposed for only one concurrent legitimate identification per Individual.
>
> Aloha,
> Rich
>
>
>