[tdwg-content] [IPT] Reverting the process of DwC standardization

Shorthouse, David david.shorthouse at umontreal.ca
Wed Oct 28 18:57:39 CET 2015


All,

Is part of the issue being expressed here that the raw ecological data
sets we're discussing are small-ish matrices rather than occurrence
records, with site codes as columns, taxa as rows and measures of
density/abundance as cells (and similarly for environmental variables)?
Such structures are often used as input for software that executes, e.g.,
ordinations, classification & regression trees, and species richness
estimates. The shortcoming of such a structure is the inherently
idiosyncratic nature of "site codes" and their variable number, i.e. an
arbitrary number of columns. I doubt it was ever designed for ease of
dataset integration, but rather for ease of computation. Representing this
structure as an Event core requires significant transposition, with
potential for error if done manually. OpenRefine is one tool that could
permit bi-directional transpositions (DwC -> matrix and then matrix ->
DwC), but it is still clunky and its accommodation of extensions is
virtually non-existent. But perhaps OpenRefine recipes and guides would get
us one step closer to finding a balance between the need for standardized
representation & efficient transport (DwC) and end-users who want matrices
for ease of computation.
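
To make the transposition concrete, here is a minimal sketch in plain
Python of both directions. It is only an illustration under my own
assumptions: the function names are hypothetical, the site code is mapped
to eventID and the cell value to organismQuantity, and a real Event core
would also need organismQuantityType plus any measurementOrFact rows.

```python
def matrix_to_long(header, rows):
    """Unpivot a taxa-by-site matrix into one record per (site, taxon) cell.

    header: label of the taxon column followed by the site codes.
    rows:   one list per taxon: [taxon, value at site 1, value at site 2, ...].
    """
    site_codes = header[1:]
    records = []
    for row in rows:
        taxon, values = row[0], row[1:]
        for site, value in zip(site_codes, values):
            records.append({
                "eventID": site,            # assumption: site code used as eventID
                "scientificName": taxon,
                "organismQuantity": value,  # assumption: cell holds the abundance
            })
    return records


def long_to_matrix(records, taxon_label="taxon"):
    """Pivot the long records back into a taxa-by-site matrix."""
    site_codes, taxa, cells = [], [], {}
    for rec in records:
        site, taxon = rec["eventID"], rec["scientificName"]
        # Keep first-seen order so a round trip restores the original layout.
        if site not in site_codes:
            site_codes.append(site)
        if taxon not in taxa:
            taxa.append(taxon)
        cells[(taxon, site)] = rec["organismQuantity"]
    header = [taxon_label] + site_codes
    # Cells with no record come back as None rather than a guessed zero.
    rows = [[t] + [cells.get((t, s)) for s in site_codes] for t in taxa]
    return header, rows


# Made-up example data: two taxa counted at two sites.
header = ["taxon", "S1", "S2"]
rows = [["Carabus nemoralis", 3, 0], ["Pterostichus melanarius", 5, 2]]
records = matrix_to_long(header, rows)
assert long_to_matrix(records) == (header, rows)
```

Note that the sketch keeps zero cells as explicit records, so the matrix
can be rebuilt exactly; whether zeros should be published at all is a
separate decision.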

David P. Shorthouse

On Tue, Oct 27, 2015 at 7:36 AM, David Valentim Dias <dvdias at sibbr.gov.br>
wrote:

> Hi again,
>
> I think the problem concerns both. DwC, because it is a solution to one
> problem that creates another for researchers less "skilled" in table
> manipulation. Ecological data with occurrences results in three tables,
> and manipulating them gets harder as the number of cores or extensions
> used grows.
> Two possible solutions come to mind: create a new term describing the
> original layout of the columns (so we can use csvjoin, as Menashe
> suggests), or give the IPT an option to store the original table
> associated with the resource.
> We can always use external links in the EML and save the file somewhere,
> but this means creating another service and managing more logins (i.e.
> resource costs and new problems).
>
> I think any solution will need IPT changes.
>
> 2015-10-27 9:08 GMT-02:00 Menashe' Eliezer <menashe.eliezer at gmail.com>:
>
>> Hi Tim,
>> I believe that the IPT feature I've requested long ago could be helpful
>> for David: https://github.com/gbif/ipt/issues/1165
>> Consumers, and also data providers, don't have a DwC-A viewer, so they
>> need to join the separate CSV files to get one table in a worksheet.
>> Web applications like the one on the OBIS website do let end users
>> download one big table.
>>
>> Best regards,
>> Menashè
>>
>>
>> 2015-10-27 9:53 GMT+01:00 Tim Robertson <trobertson at gbif.org>:
>>
>>> Hi David
>>> (CC’ing the IPT list as this might be an IPT specific thread -
>>> http://lists.gbif.org/mailman/listinfo/ipt)
>>>
>>> For clarification - is your question specific to the DwC-A standard
>>> (where this is possible, as Alex says), or is it specific to the IPT
>>> tool?
>>>
>>> Do you imagine a scenario where you’d effectively map the same extension
>>> 2 times - once to interpreted and once to verbatim - or do you envisage a
>>> different data schema for each?
>>>
>>> Thanks,
>>> Tim
>>>
>>>
>>>
>>> On 23 Oct 2015, at 16:00, Alex Thompson <godfoder at acis.ufl.edu> wrote:
>>>
>>> David,
>>>
>>> It's certainly possible, within the context of a Darwin Core Archive, to
>>> include other files within the ZIP file that lie outside the schema of the
>>> archive. Both GBIF and iDigBio do this when generating downloads for
>>> various reasons (RIGHTS & LICENSE files, additional EML metadata, etc).
>>> However, I do not believe it is possible to do this within IPT. You might
>>> submit an issue on the IPT issue tracker (
>>> https://github.com/gbif/ipt/issues) for potential inclusion of this
>>> feature in a future version of IPT.
>>>
>>> There are workarounds you can use to include additional data in Darwin
>>> Core archives, but none of them will exactly match your old format. For
>>> instance, you could include an additional Occurrence file with the
>>> values as JSON in dynamicProperties, or in some other verbatim format in
>>> the occurrenceRemarks field. Both of those would at least give some
>>> method of single-row access (vs joining multiple measurementOrFacts to a
>>> single event id) if that is the primary concern, even if they would
>>> require additional parsing steps to be useful.
>>>
>>> Alex Thompson
>>> iDigBio Infrastructure
>>>
>>>
>>> On 10/23/2015 09:40 AM, David Valentim Dias wrote:
>>>
>>> Dear colleagues,
>>>
>>> Here at SiBBr we're using the new eventCore and measurementOrFacts, and
>>> after standardizing to DwC and publishing, we think some
>>> users/researchers will want the "original" table format for a variety of
>>> reasons.
>>>
>>> Is it possible to have a verbatimTable, or some place where we can store
>>> the original table/column format?
>>>
>>> Regards
>>>
>>>