[tdwg-content] Darwin Core vs. Simple Darwin Core

Wed Jul 27 01:05:49 CEST 2011

Joel,

Not sure if you saw my reply over the weekend on the vernacularName thread (http://lists.tdwg.org/pipermail/tdwg-content/2011-July/002686.html).  As we expand beyond Simple DwC (interpreted as completely non-repeating, flat DwC), we need to ensure that consumers can reliably and consistently derive the best Simple DwC record for any Occurrence.  

Darwin Core is addressing a range of use cases.  Taxonomists and collection managers want to be able to retrieve as much information as possible about each specimen (or observation).  Class-based DwC, like ABCD, will allow publication of very rich specimen data, with a complete history of identifications, collectors, etc.  On the other hand, we also have many users (including software systems) that really need to know, as reliably and efficiently as possible, 1) to what species the specimen is currently assigned, 2) where it was collected, 3) when it was collected, and 4) how much evidence there is for these assertions.  In a sense this may be a serious simplification, but this precise level of detail is important for ecologists, planning agencies, software indexes, etc.

That means that we should rigorously define how repeating elements can be included in DwC while allowing users to derive this core subset unambiguously.  My belief is that repeating vernacularName poses no problem in this case.  A consumer can choose to take all, one or none of the supplied vernacularName values without serious harm.  A much bigger problem is the one addressed in Peter DeVries' message.  I really want to know the language associated with a vernacularName.  In cases where there are multiple vernacular names in the same language, I'd also like to know if one of them is considered by the provider to be the "preferred" name.  This implies that naked vernacularNames without further metadata may not be as useful as they should be.  I am, however, a little uneasy about Peter's suggested solution.  It works perfectly when we use RDF/OWL, but the semantics get lost in other contexts.  DwC is intended to be as neutral as possible on encodings.  There is some advantage to semantics being a secondary layer built on top of naked terms.  Using dwc:vernacularName_en, etc. would compromise that.  It may be that the benefits outweigh the disadvantages, but the trade-off should be considered.
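
To illustrate why the language and a "preferred" marker matter to a consumer, here is a minimal sketch.  The VernacularName structure, its language and preferred fields, and the pick_vernacular function are all hypothetical (none of them is a DwC term or an existing API), and the names are placeholders rather than real data.

```python
from typing import NamedTuple, Optional, Sequence


class VernacularName(NamedTuple):
    # Hypothetical structure: "language" and "preferred" are assumed
    # metadata, not existing Darwin Core terms.
    name: str
    language: Optional[str] = None  # e.g. a language code such as "en"
    preferred: bool = False


def pick_vernacular(names: Sequence[VernacularName], language: str) -> Optional[str]:
    """Pick one name in the requested language, favouring a preferred one."""
    in_language = [v for v in names if v.language == language]
    for v in in_language:
        if v.preferred:
            return v.name
    # No preferred name: fall back to the first one supplied, if any.
    return in_language[0].name if in_language else None


# Placeholder values only.
names = [
    VernacularName("naked name"),                      # no language supplied
    VernacularName("english name A", "en"),
    VernacularName("english name B", "en", preferred=True),
    VernacularName("spanish name A", "es"),
]
```

In this sketch the naked name can never be selected for a language-specific request, which is the sense in which a vernacularName without metadata is less useful than it should be.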

The bigger problem for consumers is the more general issue of cases where more complex DwC does not clearly indicate which values would be the best to select for the Simple DwC what-species-occurred-when-and-where question.  Multiple identifications without a preferred identification is the real problem case.  Take the first example under "Classes and Containment" at http://rs.tdwg.org/dwc/terms/guides/xml/index.htm - this shows a specimen with the following identifications:

1 - Ctenomys sp. by Richard Sage in 2000
2 - Ctenomys sociabilis by James L Patton on 14 September 2001

There is no indication whether one of these is preferred (ABCD used an attribute to indicate this).  How should a consumer needing Simple DwC (e.g. GBIF) interpret this?  Is it safe to assume that the most recent identification is preferred?  That may normally be correct, but there are good reasons why it could be a mistaken inference.  In the absence of further detail, should the consumer simply treat this as a Ctenomys of unknown species (in other words, select the narrowest taxon that includes all taxa referenced by the identifications)?  This seems really unfortunate.
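
The conservative fallback described above can be sketched as follows.  This is only an illustration under stated assumptions: the Identification structure, its lineage tuple (broadest to narrowest taxon) and its preferred flag are hypothetical and are not part of the DwC XML guide.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Identification:
    # Hypothetical structure; "lineage" runs from broadest to narrowest taxon.
    lineage: Tuple[str, ...]
    identified_by: str
    date_identified: str
    preferred: bool = False  # assumed flag, similar in spirit to ABCD's attribute


def narrowest_common_taxon(idents: Sequence[Identification]) -> Optional[str]:
    """Return the deepest taxon shared by every identification's lineage."""
    common = None
    for names in zip(*(i.lineage for i in idents)):
        if len(set(names)) != 1:
            break
        common = names[0]
    return common


def simple_taxon(idents: Sequence[Identification]) -> Optional[str]:
    """Use the single preferred identification if there is one, else fall back."""
    preferred = [i for i in idents if i.preferred]
    if len(preferred) == 1:
        return preferred[0].lineage[-1]
    return narrowest_common_taxon(idents)


sage = Identification(("Ctenomys",), "Richard Sage", "2000")
patton = Identification(("Ctenomys", "Ctenomys sociabilis"),
                        "James L Patton", "2001-09-14")
patton_preferred = Identification(patton.lineage, patton.identified_by,
                                  patton.date_identified, preferred=True)
```

With no preferred flag, simple_taxon([sage, patton]) yields only "Ctenomys"; with the flag set on the second identification it yields "Ctenomys sociabilis".  The sketch deliberately does not guess from dates.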

There are various ways to solve the problem, but I believe the value of DwC will best be maintained and enhanced by our ensuring this issue is handled in the specification.

By the way, I find something else really puzzling about this example from the XML Guide.  Why, oh why, does the Taxon object link back to the Identification object rather than the other way around?  This seems to me to seriously compromise the idea that we can reuse a DwC Taxon class in a semantically consistent fashion across collection data and species checklists.

Thanks,

Donald

Donald Hobern, Director, Atlas of Living Australia
CSIRO Ecosystem Sciences, GPO Box 1700, Canberra, ACT 2601
Phone: (02) 62464352 Mobile: 0437990208 
Email: Donald.Hobern at csiro.au
Web: http://www.ala.org.au/ 

-----Original Message-----
From: tdwg-content-bounces at lists.tdwg.org [mailto:tdwg-content-bounces at lists.tdwg.org] On Behalf Of joel sachs
Sent: Wednesday, 27 July 2011 1:48 AM
To: tdwg-content at lists.tdwg.org
Subject: [tdwg-content] Darwin Core vs. Simple Darwin Core

Darwin Core is one of my favourite things. It's simple, elegant, and flexible. I wasn't there at design time, so I don't know if it was designed with the semantic web in mind, but it looks like it. It is, as John put it, primarily a collection of terms [and their definitions]. So if two people/agents use the same terms, they will share the same semantics. (This is why I think that a "more semantic Darwin Core" is not the appropriate goal for a Darwin Core/rdf working group.)

I'm concerned that there's so much confusion concerning DwC, since confusion is (typically) a barrier to adoption.

One source of confusion is Simple Darwin Core. A huge fraction of DwC records can be expressed as spreadsheets. Since *all* Simple DwC records can be expressed as spreadsheets, many people think

Simple Darwin Core = spreadsheet-expressible Darwin Core

(which isn't true). This means that if they want to express their data as a spreadsheet, they think they need to conform to Simple Darwin Core.

The requirement of Simple Darwin Core is that there be no repeated elements. But the requirement for spreadsheet-expressible Darwin Core is that there be no repeated nested elements. I previously argued (http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002220.html) in favour of using subscripts to represent elements in repeated nests (thereby permitting their use in spreadsheets). Even if we don't permit that, I'm not sure that the benefits of maintaining a separate Simple Darwin Core standard, in addition to the regular Darwin Core standard, are greater than the costs in terms of giving people wrong ideas. (I prefer the presentation at http://rs.tdwg.org/dwc/terms/guides/xml/index.htm, where Simple DwC is presented as simply one of several XML schemas for Darwin Core.)
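
As a rough illustration of the subscript idea, the sketch below flattens a record containing a repeated nested element into a single spreadsheet row.  The column naming pattern ("identification[1].scientificName") and the flatten function are assumptions made for this example; they are not the notation proposed in the January message, and the catalogNumber value is a placeholder.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any]) -> Dict[str, Any]:
    """Turn repeated nested elements into subscripted spreadsheet columns."""
    row: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, list):
            # Repeated nest: give each repetition its own 1-based subscript.
            for n, item in enumerate(value, start=1):
                for sub_key, sub_value in item.items():
                    row[f"{key}[{n}].{sub_key}"] = sub_value
        else:
            row[key] = value
    return row


record = {
    "catalogNumber": "EXAMPLE-1",  # placeholder value
    "identification": [
        {"scientificName": "Ctenomys sp.", "identifiedBy": "Richard Sage"},
        {"scientificName": "Ctenomys sociabilis", "identifiedBy": "James L Patton"},
    ],
}
row = flatten(record)
```

Each key of the resulting row is one column header, so the record still fits on one spreadsheet line even though the identification element repeats.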

I *think* I see the motivation for Simple DwC. Suppose X wants to use Darwin Core, but doesn't know much about databases, and just wants to put all his data in a spreadsheet. He might not know what a repeated, nested data structure is. So it's easiest to just say to him "don't repeat any elements, and you'll be fine - your records will be spreadsheet-expressible". I agree that that's a benefit. Are there others?

Thanks -
Joel.
