[tdwg-content] Darwin Core vs. Simple Darwin Core
RichardsK at landcareresearch.co.nz
Thu Aug 4 06:11:23 CEST 2011
I agree with Donald here. If you start trying to deal with the complexity of vernacular names, other aspects come into scope as well, such as the spatial and chronological extent of each vernacular name. You can't handle all these cases nicely in a simple, flat DwC structure. When trying to use "complex" data with "simple" structures, I suspect it is better to go with a concatenated field, e.g. a comma-separated list of vernacular names in one field. This does not solve the whole issue, but you are going to get these sorts of issues if you are adamant about using a very simple, flat structure.
Personally, I don't see it being much harder to use the GBIF-style star schema approach, so why deal with the hassles?
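A minimal sketch of the concatenated-field workaround mentioned above, assuming a record held as a Python dict. The function names, the "vernacularNames" key, and the sample names are illustrative assumptions, not part of Darwin Core.

```python
# Hypothetical in-memory record: one taxon with several vernacular names.
record = {
    "scientificName": "Puma concolor",
    "vernacularNames": ["cougar", "puma", "mountain lion"],
}


def flatten_vernacular_names(rec, separator=", "):
    """Return a flat row with all vernacular names joined into one field."""
    return {
        "scientificName": rec["scientificName"],
        "vernacularName": separator.join(rec["vernacularNames"]),
    }


def split_vernacular_names(value, separator=","):
    """Recover the individual names from the concatenated field."""
    return [name.strip() for name in value.split(separator) if name.strip()]


flat_row = flatten_vernacular_names(record)
names_again = split_vernacular_names(flat_row["vernacularName"])
```

The round trip only works while no name contains the separator itself, and language, spatial extent, and date information are lost, which is the limitation described above.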
From: tdwg-content-bounces at lists.tdwg.org [mailto:tdwg-content-bounces at lists.tdwg.org] On Behalf Of Donald.Hobern at csiro.au
Sent: Thursday, 4 August 2011 4:00 p.m.
To: jsachs at csee.umbc.edu
Cc: tdwg-content at lists.tdwg.org
Subject: Re: [tdwg-content] Darwin Core vs. Simple Darwin Core
When I wrote that "using dwc:vernacularName_en, etc. would compromise" the naked terms (the primary layer, in the multi-layer model being suggested), you are quite correct that we could add a bunch of vernacularName_xx terms to DwC, but this would involve 1) an enormous expansion of defined DwC terms, and 2) a need for consumers either to use a secondary semantic layer or special ad-hoc string manipulation to recognise that these can all be treated as subproperties of vernacularName. It's just a major adjustment to the basic simplicity of consuming DwC. On the other hand, I guess that vernacular names are really rather a secondary label field for most application uses and the loss might not be too drastic.
If the perceived issue relates to the repeated use of a single DwC term within a record, I also think the proposed solution may not meet the case. A species can just as easily have multiple vernacular names in use within a single language. Consumers which might have had a problem consuming >1 vernacularName element are likely still to have the same problem even if they are distinguished as coming from different languages.
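To illustrate the "ad-hoc string manipulation" a consumer would need, here is a rough sketch that groups vernacularName_xx keys under the base term. The suffix convention, the function name, and the sample row are assumptions made for the example; no such terms are defined in DwC.

```python
from collections import defaultdict

BASE_TERM = "vernacularName"


def group_vernacular_names(row):
    """Map language code (None for the naked term) to a list of names.

    Any key of the form 'vernacularName_<lang>' is treated as if it were
    a subproperty of 'vernacularName'.
    """
    grouped = defaultdict(list)
    for term, value in row.items():
        if term == BASE_TERM:
            grouped[None].append(value)
        elif term.startswith(BASE_TERM + "_"):
            language = term[len(BASE_TERM) + 1:]
            grouped[language].append(value)
    return dict(grouped)


# Hypothetical flat row using language-suffixed terms.
row = {
    "scientificName": "Puma concolor",
    "vernacularName_en": "cougar",
    "vernacularName_es": "puma",
}
names_by_language = group_vernacular_names(row)
```

Because a flat row holds one value per key, this layout still cannot carry two English names for the same species, which is the second objection raised above.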
Donald Hobern, Director, Atlas of Living Australia CSIRO Ecosystem Sciences, GPO Box 1700, Canberra, ACT 2601
Phone: (02) 62464352 Mobile: 0437990208
Email: Donald.Hobern at csiro.au
From: joel sachs [mailto:jsachs at csee.umbc.edu]
Sent: Thursday, 4 August 2011 12:10 AM
To: Hobern, Donald (CES, Black Mountain)
Cc: tdwg-content at lists.tdwg.org
Subject: RE: [tdwg-content] Darwin Core vs. Simple Darwin Core
Sorry for the delayed response (we had a holiday in Canada). A few responses ...
>> I am however a little uneasy about Peter's suggested solution. It
>> works perfectly when we use RDF/OWL, but the semantics get lost in
>> other contexts. DwC is intended to be as neutral as possible on
Yes, although if we could do everything we wanted without the semantic web, then we wouldn't need the semantic web. The need to distinguish between DwC and DwC/RDF came up last week both in the exchange between Pete and Markus, and also in the one between Steve and Bob. I'm very interested in squeezing semantics out of non-rdf data, and so see value in distinguishing between use cases for rdf and non-rdf representations. On the other hand, I find this to be sometimes difficult. For example, any time you introduce the notion of a "class" (as Darwin Core does), the notion of "subclass" is pretty natural.
>> There is some advantage to semantics being a secondary layer built on
>> top of naked terms.
I agree. I like the idea of layered semantics, with not just secondary layers, but tertiary and quaternary. Users start with the base layer, and then import the further layers that their use case demands.
>> Using dwc:vernacularName_en, etc. would compromise that.
I'm not sure why. Would it be wrong to add a bunch of vernacularName_x terms to Darwin Core? (As well as adding, at one of the higher semantic layers, a bunch of "vernacularName_x subPropertyOf vernacularName" statements?)
>> 1 - Ctenomys sp. by Richard Sage in 2000
>> 2 - Ctenomys sociabilis by James L Patton on 14 September 2001
>> There is no indication whether one of these is preferred (ABCD used
>> an attribute to indicate this). How should a consumer needing Simple
>> DwC (e.g. GBIF) interpret this?
The proposed "identificationVerificationStatus" term is necessary, but not sufficient to address this, right?
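One possible way a consumer could derive a single Simple DwC identification from the two-identification example quoted above, sketched under stated assumptions: the "preferred" flag is hypothetical (it is not an existing DwC term), and the fallback is the "narrowest taxon including all taxa" rule Donald describes, limited here to genus and species.

```python
def derive_simple_identification(identifications):
    """Return (genus, specificEpithet) for a flat record.

    1. If exactly one identification carries the hypothetical
       'preferred' flag, use it.
    2. Otherwise fall back to the narrowest taxon shared by all
       identifications (genus plus epithet only if all agree).
    """
    preferred = [i for i in identifications if i.get("preferred")]
    if len(preferred) == 1:
        chosen = preferred[0]
        return (chosen["genus"], chosen["specificEpithet"])

    genera = {i["genus"] for i in identifications}
    if len(genera) != 1:
        return (None, None)  # no common genus (or no identifications)

    genus = genera.pop()
    epithets = {i["specificEpithet"] for i in identifications}
    if len(epithets) == 1:
        return (genus, epithets.pop())
    return (genus, None)  # agreed genus, unknown species


# The example from the XML Guide; None stands for "sp.".
identifications = [
    {"genus": "Ctenomys", "specificEpithet": None,
     "identifiedBy": "Richard Sage"},
    {"genus": "Ctenomys", "specificEpithet": "sociabilis",
     "identifiedBy": "James L Patton"},
]
result = derive_simple_identification(identifications)
```

Without a preference marker the result is a Ctenomys of unknown species. Marking the second identification as preferred would yield Ctenomys sociabilis instead.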
On Wed, 27 Jul 2011, Donald.Hobern at csiro.au wrote:
> Not sure if you saw my reply over the weekend on the vernacularName thread (http://lists.tdwg.org/pipermail/tdwg-content/2011-July/002686.html). As we expand beyond Simple DwC (interpreted as completely non-repeating, flat DwC), we need to ensure that consumers can reliably and consistently derive the best Simple DwC record for any Occurrence.
> Darwin Core is addressing a range of use cases. We have the interests of taxonomists and collection managers to be able to retrieve as much information as possible about each specimen (or observation). Class-based DwC, like ABCD, will allow publication of very rich specimen data, with a complete history of identifications, collectors, etc. On the other hand, we have also many users (including software systems) which really need to know as reliably and efficiently as possible 1) to what species the specimen is currently assigned, 2) where it was collected, 3) when it was collected, and 4) how much evidence there is for these assertions. In a sense this may be a serious simplification, but this precise level of detail is important for ecologists, planning agencies, software indexes, etc.
> That means that we should rigorously define how repeating elements can be included in DwC while allowing users unambiguously to derive this core subset. My belief is that repeating vernacularName poses no problem in this case. A consumer can choose to take all, one or none of the supplied vernacularName values without serious harm. A much bigger problem is that addressed in Peter DeVries' message. I really want to know the language associated with a vernacularName. In cases where there are multiple vernacular names in the same language, I'd also like to know if one of them is considered by the provider to be the "preferred" name. This implies that naked vernacularNames without further metadata may not be as useful as they should be. I am however a little uneasy about Peter's suggested solution. It works perfectly when we use RDF/OWL, but the semantics get lost in other contexts. DwC is intended to be as neutral as possible on encodings. There is some advantage to semantics being a secondary layer built on top of naked terms. Using dwc:vernacularName_en, etc. would compromise that. It may be that the benefits outweigh the disadvantages but it should be considered.
> The bigger problem for consumers is the more general issue of cases where more complex DwC does not clearly indicate which values would be the best to select for the Simple DwC what-species-occurred-when-and-where question. Multiple identifications without a preferred identification is the real problem case. Take the first example under "Classes and Containment" at http://rs.tdwg.org/dwc/terms/guides/xml/index.htm - this shows a specimen with the following identifications:
> 1 - Ctenomys sp. by Richard Sage in 2000
> 2 - Ctenomys sociabilis by James L Patton on 14 September 2001
> There is no indication whether one of these is preferred (ABCD used an attribute to indicate this). How should a consumer needing Simple DwC (e.g. GBIF) interpret this? Is it safe to assume that the most recent identification is preferred? That may normally be correct but there are good reasons why it could be a mistaken inference. In the absence of further detail, should the consumer simply treat this as a Ctenomys of unknown species (in other words select the narrowest taxon including all taxa referenced by identifications). This seems really unfortunate.
> There are various ways to solve the problem, but I believe the value of DwC will best be maintained and enhanced by our ensuring this issue is handled in the specification.
> By the way, I find something else really puzzling about this example from the XML Guide. Why, oh why, does the Taxon object link back to the Identification object rather than the other way around???? This seems to me seriously to compromise the idea that we can reuse a DwC Taxon class in a semantically consistent fashion across collection data and species checklists.
> Donald Hobern, Director, Atlas of Living Australia CSIRO Ecosystem
> Sciences, GPO Box 1700, Canberra, ACT 2601
> Phone: (02) 62464352 Mobile: 0437990208
> Email: Donald.Hobern at csiro.au
> Web: http://www.ala.org.au/
> -----Original Message-----
> From: tdwg-content-bounces at lists.tdwg.org
> [mailto:tdwg-content-bounces at lists.tdwg.org] On Behalf Of joel sachs
> Sent: Wednesday, 27 July 2011 1:48 AM
> To: tdwg-content at lists.tdwg.org
> Subject: [tdwg-content] Darwin Core vs. Simple Darwin Core
> Darwin Core is one of my favourite things. It's simple, elegant, and
> flexible. I wasn't there at design time, so I don't know if it was
> designed with the semantic web in mind, but it looks like it. It is,
> as John put it, primarily a collection of terms [and their
> definitions]. So if two people/agents use the same terms, they will
> share the same semantics. (This is why I think that a "more semantic
> Darwin Core" is not the appropriate goal for a Darwin Core/rdf working
> I'm concerned that there's so much confusion concerning DwC, since confusion is (typically) a barrier to adoption.
> One source of confusion is Simple Darwin Core. A huge fraction of DwC
> records can be expressed as spreadsheets. Since *all* Simple DwC
> records can be expressed as spreadsheets, many people think
> Simple Darwin Core = spreadsheet-expressible Darwin Core
> (which isn't true). This means that if they want to express their data as a spreadsheet, they think they need to conform to Simple Darwin Core.
> The requirement of Simple Darwin Core is that there be no repeated
> elements. But the requirement for spreadsheet-expressible Darwin Core
> is that there be no repeated nested elements. I previously argued
> ) in favour of using subscripts to represent elements in repeated
> nests (thereby permitting their use in spreadsheets). Even if we don't
> permit that, I'm not sure that the benefits of maintaining a separate
> Simple Darwin Core standard, in addition to the regular Darwin Core
> standard, are greater than the costs in terms of giving people wrong
> ideas. (I prefer the presentation at
> where Simple DwC is presented as simply one of several XML schemas for
> Darwin Core.)
> I *think* I see the motivation for Simple DwC. Suppose X wants to use Darwin Core, but doesn't know much about databases, and just wants to put all his data in a spreadsheet. He might not know what a repeated, nested data structure is. So it's easiest to just say to him "don't repeat any elements, and you'll be fine - your records will be spreadsheet-expressible". I agree that that's a benefit. Are there others?
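As a rough illustration of the subscript idea discussed above, the following sketch flattens repeated nested elements into subscripted column names so that the record fits a single spreadsheet row. The "<element>_<n>_<term>" naming pattern, the function name, and the catalogNumber value are assumptions for the example, not a convention defined by Darwin Core.

```python
def flatten_with_subscripts(record, repeated_key):
    """Turn a list of nested dicts into subscripted flat columns."""
    flat = {k: v for k, v in record.items() if k != repeated_key}
    for index, nested in enumerate(record.get(repeated_key, []), start=1):
        for term, value in nested.items():
            flat[f"{repeated_key}_{index}_{term}"] = value
    return flat


# Hypothetical occurrence with a repeated, nested identification element.
occurrence = {
    "catalogNumber": "12345",
    "identification": [
        {"scientificName": "Ctenomys sp.", "identifiedBy": "Richard Sage"},
        {"scientificName": "Ctenomys sociabilis",
         "identifiedBy": "James L Patton"},
    ],
}
flat_row = flatten_with_subscripts(occurrence, "identification")
```

Each key of flat_row can serve as a spreadsheet column header, although the number of columns then depends on the record with the most repeats.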
> Thanks -