Re: [tdwg-content] Darwin Core vs. Simple Darwin Core

4 Aug 2011

Hi Joel,

When I wrote that "using dwc:vernacularName_en, etc. would compromise" the naked terms (the primary layer, in the multi-layer model being suggested), you are quite correct that we could add a bunch of vernacularName_xx terms to DwC, but this would involve 1) an enormous expansion of defined DwC terms, and 2) a need for consumers either to use a secondary semantic layer or special ad-hoc string manipulation to recognise that these can all be treated as subproperties of vernacularName.  It's just a major adjustment to the basic simplicity of consuming DwC.  On the other hand, I guess that vernacular names are really rather a secondary label field for most application uses and the loss might not be too drastic.
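
As a rough illustration of the ad-hoc string manipulation a consumer would need, here is a minimal Python sketch. It assumes a flat record held as a dict and a hypothetical vernacularName_xx naming pattern with a two-letter language suffix. Neither that pattern nor the function name collect_vernacular_names is part of Darwin Core, and the name values are placeholders.

```python
import re

# Hypothetical naming pattern: "vernacularName_" plus a two-letter language code.
# This suffix convention is an assumption for illustration, not a defined DwC term.
SUFFIXED = re.compile(r"^vernacularName_([a-z]{2})$")


def collect_vernacular_names(record):
    """Return {language: name} for every vernacularName_xx key in a flat record.

    A plain "vernacularName" key (no suffix) is returned under the key None.
    """
    names = {}
    for key, value in record.items():
        if key == "vernacularName":
            names[None] = value
            continue
        match = SUFFIXED.match(key)
        if match:
            names[match.group(1)] = value
    return names


# Placeholder values; only the key pattern matters here.
record = {
    "scientificName": "Ctenomys sociabilis",
    "vernacularName_en": "example English name",
    "vernacularName_es": "example Spanish name",
}
names = collect_vernacular_names(record)
```

Note that the result holds only one value per language, so a consumer written this way still cannot represent two vernacular names in the same language.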

If the perceived issue relates to the repeated use of a single DwC term within a record, I also think the proposed solution may not meet the case.  A species can just as easily have multiple vernacular names in use within a single language.  Consumers which might have had a problem consuming >1 vernacularName element are likely still to have the same problem even if they are distinguished as coming from different languages.

Donald

Donald Hobern, Director, Atlas of Living Australia
CSIRO Ecosystem Sciences, GPO Box 1700, Canberra, ACT 2601
Phone: (02) 62464352 Mobile: 0437990208 
Email: Donald.Hobern@csiro.au
Web: http://www.ala.org.au/ 

-----Original Message-----
From: joel sachs [mailto:jsachs@csee.umbc.edu] 
Sent: Thursday, 4 August 2011 12:10 AM
To: Hobern, Donald (CES, Black Mountain)
Cc: tdwg-content@lists.tdwg.org
Subject: RE: [tdwg-content] Darwin Core vs. Simple Darwin Core

Hi Donald,

Sorry for the delayed response (we had a holiday in Canada). A few responses ...
...
...
I am however a little uneasy about Peter's suggested solution.  It 
works perfectly when we use RDF/OWL, but the semantics get lost in 
other contexts.  DwC is intended to be as neutral as possible on 
encodings.
Yes, although if we could do everything we wanted without the semantic web, then we wouldn't need the semantic web. The need to distinguish between DwC and DwC/RDF came up last week both in the exchange between Pete and Markus, and also in the one between Steve and Bob. I'm very interested in squeezing semantics out of non-rdf data, and so see value in distinguishing between use cases for rdf and non-rdf representations. On the other hand, I find this to be sometimes difficult. For example, any time you introduce the notion of a "class" (as Darwin Core does), the notion of "subclass" is pretty natural.
...
...
There is some advantage to semantics being a secondary layer built on 
top of naked terms.
I agree. I like the idea of layered semantics, with not just secondary layers, but tertiary and quaternary. Users start with the base layer, and then import the further layers that their use case demands.
...
...
Using dwc:vernacularName_en, etc. would compromise that.
I'm not sure why. Would it be wrong to add a bunch of vernacularName_x terms to Darwin Core? (As well as adding, at one of the higher semantic layers, a bunch of "vernacularName_x subPropertyOf vernacularName" statements.)
...
...
1 - Ctenomys sp. by Richard Sage in 2000
2 - Ctenomys sociabilis by James L Patton on 14 September 2001
...
...
There is no indication whether one of these is preferred (ABCD used 
an attribute to indicate this).  How should a consumer needing Simple 
DwC (e.g. GBIF) interpret this?
The proposed "identificationVerificationStatus" term is necessary, but not sufficient to address this, right?

Regards,
Joel.

On Wed, 27 Jul 2011, Donald.Hobern@csiro.au wrote:
...
Joel,
Not sure if you saw my reply over the weekend on the vernacularName thread (http://lists.tdwg.org/pipermail/tdwg-content/2011-July/002686.html).  As we expand beyond Simple DwC (interpreted as completely non-repeating, flat DwC), we need to ensure that consumers can reliably and consistently derive the best Simple DwC record for any Occurrence.
Darwin Core is addressing a range of use cases.  We have the interests of taxonomists and collection managers to be able to retrieve as much information as possible about each specimen (or observation).  Class-based DwC, like ABCD, will allow publication of very rich specimen data, with a complete history of identifications, collectors, etc.  On the other hand, we also have many users (including software systems) which really need to know as reliably and efficiently as possible 1) to what species the specimen is currently assigned, 2) where it was collected, 3) when it was collected, and 4) how much evidence there is for these assertions.  In a sense this may be a serious simplification, but this precise level of detail is important for ecologists, planning agencies, software indexes, etc.
That means that we should rigorously define how repeating elements can be included in DwC while allowing users unambiguously to derive this core subset.  My belief is that repeating vernacularName poses no problem in this case.  A consumer can choose to take all, one or none of the supplied vernacularName values without serious harm.  A much bigger problem is that addressed in Peter DeVries' message.  I really want to know the language associated with a vernacularName.  In cases where there are multiple vernacular names in the same language, I'd also like to know if one of them is considered by the provider to be the "preferred" name.  This implies that naked vernacularNames without further metadata may not be as useful as they should be.  I am however a little uneasy about Peter's suggested solution.  It works perfectly when we use RDF/OWL, but the semantics get lost in other contexts.  DwC is intended to be as neutral as possible on encodings.  There is some advantage to semantics being a secondary layer built on top of naked terms.  Using dwc:vernacularName_en, etc. would compromise that.  It may be that the benefits outweigh the disadvantages but it should be considered.
The bigger problem for consumers is the more general issue of cases where more complex DwC does not clearly indicate which values would be the best to select for the Simple DwC what-species-occurred-when-and-where question.  Multiple identifications without a preferred identification is the real problem case.  Take the first example under "Classes and Containment" at http://rs.tdwg.org/dwc/terms/guides/xml/index.htm - this shows a specimen with the following identifications:
1 - Ctenomys sp. by Richard Sage in 2000
2 - Ctenomys sociabilis by James L Patton on 14 September 2001
There is no indication whether one of these is preferred (ABCD used an attribute to indicate this).  How should a consumer needing Simple DwC (e.g. GBIF) interpret this?  Is it safe to assume that the most recent identification is preferred?  That may normally be correct but there are good reasons why it could be a mistaken inference.  In the absence of further detail, should the consumer simply treat this as a Ctenomys of unknown species (in other words, select the narrowest taxon including all taxa referenced by identifications)?  This seems really unfortunate.
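
The options in the paragraph above can be sketched in Python as follows. This is only an illustration under stated assumptions: each identification carries its lineage as a list from broadest to narrowest taxon, dates are ISO 8601 strings, and the `preferred` flag is hypothetical (modelled on the ABCD attribute mentioned above, not on any existing DwC term).

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Identification:
    lineage: List[str]        # broadest to narrowest taxon name
    identified_by: str
    date_identified: str      # ISO 8601 string, e.g. "2000" or "2001-09-14"
    preferred: bool = False   # hypothetical flag, not a DwC term


def most_recent(idents: Sequence[Identification]) -> Identification:
    """Pick the latest identification (ISO date strings sort chronologically;
    a bare year sorts before any fuller date in that year)."""
    return max(idents, key=lambda i: i.date_identified)


def narrowest_common_taxon(idents: Sequence[Identification]) -> Optional[str]:
    """Return the narrowest taxon shared by every identification's lineage."""
    common = []
    for ranks in zip(*(i.lineage for i in idents)):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return common[-1] if common else None


def choose_taxon(idents: Sequence[Identification]) -> Optional[str]:
    """Use the single preferred identification if one is flagged,
    otherwise fall back to the narrowest common taxon."""
    flagged = [i for i in idents if i.preferred]
    if len(flagged) == 1:
        return flagged[0].lineage[-1]
    return narrowest_common_taxon(idents)


idents = [
    Identification(["Ctenomys"], "Richard Sage", "2000"),
    Identification(["Ctenomys", "Ctenomys sociabilis"], "James L Patton", "2001-09-14"),
]
conservative = choose_taxon(idents)           # no flag set: common taxon
latest = most_recent(idents).lineage[-1]      # most-recent-wins inference
```

With no flag set, the conservative choice and the most-recent choice disagree for this specimen, which is exactly the ambiguity a consumer cannot resolve from the data alone.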
There are various ways to solve the problem, but I believe the value of DwC will best be maintained and enhanced by our ensuring this issue is handled in the specification.
By the way, I find something else really puzzling about this example from the XML Guide.  Why, oh why, does the Taxon object link back to the Identification object rather than the other way around????  This seems to me seriously to compromise the idea that we can reuse a DwC Taxon class in a semantically consistent fashion across collection data and species checklists.
Thanks,
Donald
Donald Hobern, Director, Atlas of Living Australia CSIRO Ecosystem 
Sciences, GPO Box 1700, Canberra, ACT 2601
Phone: (02) 62464352 Mobile: 0437990208
Email: Donald.Hobern@csiro.au
Web: http://www.ala.org.au/
-----Original Message-----
From: tdwg-content-bounces@lists.tdwg.org 
[mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of joel sachs
Sent: Wednesday, 27 July 2011 1:48 AM
To: tdwg-content@lists.tdwg.org
Subject: [tdwg-content] Darwin Core vs. Simple Darwin Core
Darwin Core is one of my favourite things. It's simple, elegant, and 
flexible. I wasn't there at design time, so I don't know if it was 
designed with the semantic web in mind, but it looks like it. It is, 
as John put it, primarily a collection of terms [and their 
definitions]. So if two people/agents use the same terms, they will 
share the same semantics. (This is why I think that a "more semantic 
Darwin Core" is not the appropriate goal for a Darwin Core/rdf working 
group.)
I'm concerned that there's so much confusion concerning DwC, since confusion is (typically) a barrier to adoption.
One source of confusion is Simple Darwin Core. A huge fraction of DwC 
records can be expressed as spreadsheets. Since *all* Simple DwC 
records can be expressed as spreadsheets, many people think
Simple Darwin Core = spreadsheet-expressible Darwin Core
(which isn't true). This means that if they want to express their data as a spreadsheet, they think they need to conform to Simple Darwin Core.
The requirement of Simple Darwin Core is that there be no repeated 
elements. But the requirement for spreadsheet-expressible Darwin Core 
is that there be no repeated nested elements. I previously argued
(http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002220.html)
in favour of using subscripts to represent elements in repeated 
nests (thereby permitting their use in spreadsheets). Even if we don't 
permit that, I'm not sure that the benefits of maintaining a separate 
Simple Darwin Core standard, in addition to the regular Darwin Core 
standard, are greater than the costs in terms of giving people wrong 
ideas. (I prefer the presentation at 
http://rs.tdwg.org/dwc/terms/guides/xml/index.htm,
where Simple DwC is presented as simply one of several XML schemas for 
Darwin Core.)
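
To illustrate the subscript idea, here is a minimal Python sketch of flattening one level of repeated nested elements into spreadsheet columns. The "term_N" column naming, the "identifications" key, the function name, and the catalogNumber value are all assumptions made for this example; the exact subscript syntax proposed in the earlier message is not reproduced here.

```python
def flatten_with_subscripts(record):
    """Flatten one level of repeated nested elements into subscripted columns.

    A list of dicts under any key becomes columns named "<term>_<n>",
    numbered from 1 in list order. Other values are copied unchanged.
    """
    flat = {}
    for key, value in record.items():
        if isinstance(value, list):
            for index, item in enumerate(value, start=1):
                for term, term_value in item.items():
                    flat[f"{term}_{index}"] = term_value
        else:
            flat[key] = value
    return flat


record = {
    "catalogNumber": "example-0001",  # made-up value for illustration
    "identifications": [
        {"scientificName": "Ctenomys sp.", "identifiedBy": "Richard Sage",
         "dateIdentified": "2000"},
        {"scientificName": "Ctenomys sociabilis", "identifiedBy": "James L Patton",
         "dateIdentified": "2001-09-14"},
    ],
}
row = flatten_with_subscripts(record)
```

Each record then fits on a single spreadsheet row, although the number of columns grows with the longest repeated list in the data set.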
I *think* I see the motivation for Simple DwC. Suppose X wants to use Darwin Core, but doesn't know much about databases, and just wants to put all his data in a spreadsheet. He might not know what a repeated, nested data structure is. So it's easiest to just say to him "don't repeat any elements, and you'll be fine - your records will be spreadsheet-expressible". I agree that that's a benefit. Are there others?
Thanks -
Joel.
_______________________________________________
tdwg-content mailing list
tdwg-content@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content
