Re: [tdwg-content] [IPT] Reverting the process of DwC standardization
Hello, Resending the same message due to a subscription problem. -- Menashè 2015-10-29 12:15 GMT+01:00 Menashe' Eliezer <menashe.eliezer@gmail.com>:
Hello, Please see my updated suggestion at https://github.com/gbif/ipt/issues/1165 IMHO Open Refine is not the right tool. One can simply use org.apache.poi in a Java application to read all the information from the different files inside the DwC-A and create an ODS file with the combined matrix, which also takes any parentEventID into account. I'm sorry I don't have time to do it myself. I hope it's clear. -- Menashè
2015-10-28 18:57 GMT+01:00 Shorthouse, David < david.shorthouse@umontreal.ca>:
All,
Is part of the issue being expressed here that the raw ecological data sets we're discussing are small-ish matrices rather than occurrences, with site codes as columns, taxa as rows and measures of density/abundance as cells (and similar for environmental variables)? Such structures are often used as input for software that executes, e.g., ordinations, classification & regression trees, and species richness estimates. The shortcoming of such a structure is the inherently idiosyncratic nature of "site codes", with variable numbers of them, i.e. an arbitrary number of columns. I doubt it was ever designed for ease of dataset integration, but rather for ease of computation. Representing this structure as an Event core requires significant transposition, with potential for error if done manually. Open Refine is one tool that could permit bi-directional transpositions (DwC -> matrix and then matrix -> DwC), but it is still clunky and its accommodation of extensions is virtually non-existent. But perhaps Open Refine recipes and guides get us one step closer to finding a balance between the need for standardized representation & efficient transport (DwC) and end-users who want matrices for ease of computation.
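To make the bi-directional transposition described above concrete, here is a minimal Python sketch using only the standard library. It is not an Open Refine recipe and not anything the IPT provides; the function names, the sample site codes and the placeholder taxon names are assumptions made for illustration. It also assumes an absent site/taxon pair means an abundance of 0, which loses the distinction between "not found" and "not sampled".

```python
from collections import defaultdict


def long_to_matrix(records):
    """Pivot (site, taxon, value) records into a taxa-by-site matrix.

    Returns (site_order, rows): rows maps each taxon to a list of values
    aligned with the sorted site codes. Missing combinations become 0
    (an assumption, see the note above).
    """
    cells = defaultdict(dict)
    sites = set()
    for site, taxon, value in records:
        cells[taxon][site] = value
        sites.add(site)
    site_order = sorted(sites)
    rows = {
        taxon: [by_site.get(site, 0) for site in site_order]
        for taxon, by_site in sorted(cells.items())
    }
    return site_order, rows


def matrix_to_long(site_order, rows):
    """Unpivot the matrix back into (site, taxon, value) records, skipping zeros."""
    return [
        (site, taxon, value)
        for taxon, values in rows.items()
        for site, value in zip(site_order, values)
        if value != 0
    ]


# Hypothetical sample data: one record per site/taxon observation.
records = [("S1", "Taxon A", 4), ("S2", "Taxon A", 1), ("S2", "Taxon B", 7)]
sites, matrix = long_to_matrix(records)
round_trip = matrix_to_long(sites, matrix)
```

The long form corresponds to one row per event and taxon, as in an Event core with an occurrence extension; the matrix is the layout ordination software usually expects.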
David P. Shorthouse
On Tue, Oct 27, 2015 at 7:36 AM, David Valentim Dias <dvdias@sibbr.gov.br> wrote:
Hi again,
I think the problem concerns both. DwC, because it solves one problem while creating another for researchers less "skilled" in table manipulation. Ecological data with occurrences results in three tables, and manipulating them gets harder as the number of cores or extensions grows. Two possible solutions come to mind: create a new term describing the original layout of the columns (so we can use csvjoin, as Menashe suggests), or give the IPT an option to store the original table with the resource. We can always use external links in the EML and save the file somewhere else, but that means creating another service and managing more logins (i.e. resource costs and new problems).
I think any solution will need ipt changes.
2015-10-27 9:08 GMT-02:00 Menashe' Eliezer <menashe.eliezer@gmail.com>:
Hi Tim, I believe that the IPT feature I requested long ago could be helpful for David: https://github.com/gbif/ipt/issues/1165 Consumers, and also the data providers, don't have a DwC-A viewer, so they need to join the separate csv files to get one table in a worksheet. Web applications like the one on the OBIS website do let end users download one big table.
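As an illustration of the join described above, the following Python sketch flattens an event core and an occurrence extension into one table keyed on eventID. It is only a sketch under stated assumptions: the helper names and the inline sample data are hypothetical, the files are assumed to be tab-delimited with a header row, and a real archive's file names and delimiters should be read from its meta.xml.

```python
import csv
import io


def read_rows(text, delimiter="\t"):
    """Parse delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def join_on_event_id(events, occurrences):
    """Return one flat row per occurrence, with its event's columns attached.

    Occurrences whose eventID has no matching event keep only their own columns.
    """
    events_by_id = {event["eventID"]: event for event in events}
    joined = []
    for occurrence in occurrences:
        row = dict(events_by_id.get(occurrence["eventID"], {}))
        row.update(occurrence)
        joined.append(row)
    return joined


# Hypothetical sample content standing in for the files inside a DwC-A.
EVENT_TXT = "eventID\teventDate\tlocationID\nE1\t2015-10-01\tS1\n"
OCCURRENCE_TXT = (
    "occurrenceID\teventID\tscientificName\tindividualCount\n"
    "O1\tE1\tTaxon A\t4\n"
    "O2\tE1\tTaxon B\t7\n"
)

flat = join_on_event_id(read_rows(EVENT_TXT), read_rows(OCCURRENCE_TXT))
```

Each dict in `flat` can then be written out with `csv.DictWriter` to produce the single worksheet-style table.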
Best regards, Menashè
2015-10-27 9:53 GMT+01:00 Tim Robertson <trobertson@gbif.org>:
Hi David (CC’ing the IPT list as this might be an IPT specific thread - http://lists.gbif.org/mailman/listinfo/ipt)
For clarification: is your question specific to the DwC-A standard (where this is possible, as Alex says), or is it specific to the IPT tool?
Do you imagine a scenario where you’d effectively map the same extension 2 times - once to interpreted and once to verbatim - or do you envisage a different data schema for each?
Thanks, Tim
On 23 Oct 2015, at 16:00, Alex Thompson <godfoder@acis.ufl.edu> wrote:
David,
It's certainly possible, within the context of a Darwin Core Archive, to include other files within the ZIP file that lie outside the schema of the archive. Both GBIF and iDigBio do this when generating downloads for various reasons (RIGHTS & LICENSE files, additional EML metadata, etc). However, I do not believe it is possible to do this within IPT. You might submit an issue on the IPT issue tracker ( https://github.com/gbif/ipt/issues) for potential inclusion of this feature in a future version of IPT.
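For what it's worth, adding such a file to an already generated archive outside of IPT can be sketched in a few lines of Python with the standard zipfile module. The file name verbatim_table.csv and the helper name are hypothetical, and the sketch assumes the extra file is simply left unreferenced by meta.xml, so archive readers that follow the schema will ignore it.

```python
import io
import zipfile


def add_file_to_archive(archive_bytes, name, content):
    """Return a copy of the ZIP archive with one extra member appended."""
    buffer = io.BytesIO(archive_bytes)
    with zipfile.ZipFile(buffer, mode="a") as archive:
        archive.writestr(name, content)
    return buffer.getvalue()


# Build a tiny stand-in archive in memory (placeholder contents only).
original = io.BytesIO()
with zipfile.ZipFile(original, mode="w") as archive:
    archive.writestr("meta.xml", "<archive/>")
    archive.writestr("event.txt", "eventID\nE1\n")

# Append the original matrix under a hypothetical file name.
updated = add_file_to_archive(
    original.getvalue(), "verbatim_table.csv", "taxon,S1,S2\nTaxon A,4,1\n"
)
with zipfile.ZipFile(io.BytesIO(updated)) as archive:
    names = archive.namelist()
```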
There are workarounds you can use to include additional data in Darwin Core archives, but none of them will exactly match your old format. For instance, including an additional Occurrence file with the values as JSON in dynamicProperties or in some other verbatim format in the occurrenceRemarks field. Both of those would at least give some method of single-row access (vs joining multiple measurementOrFacts to a single event id) if that is the primary concern, even if they would require additional parsing steps to be useful.
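A rough Python sketch of the dynamicProperties workaround follows: it collapses several measurementOrFact rows into one JSON string per event, giving the single-row access mentioned above. The function name and the sample measurement types and values are made up for illustration; only eventID, measurementType, measurementValue and dynamicProperties are actual Darwin Core terms.

```python
import json
from collections import defaultdict


def pack_measurements(measurements):
    """Group measurementOrFact rows by eventID and serialize each group as JSON.

    Returns a dict mapping eventID to a JSON string suitable for a
    dynamicProperties column. A later row with the same measurementType
    overwrites an earlier one for that event.
    """
    grouped = defaultdict(dict)
    for row in measurements:
        grouped[row["eventID"]][row["measurementType"]] = row["measurementValue"]
    return {
        event_id: json.dumps(properties, sort_keys=True)
        for event_id, properties in grouped.items()
    }


# Hypothetical measurementOrFact rows for one sampling event.
measurements = [
    {"eventID": "E1", "measurementType": "temperature", "measurementValue": "12.5"},
    {"eventID": "E1", "measurementType": "pH", "measurementValue": "6.8"},
]
packed = pack_measurements(measurements)
```

Consumers would still need `json.loads` on the packed value, which is the additional parsing step noted above.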
Alex Thompson iDigBio Infrastructure
On 10/23/2015 09:40 AM, David Valentim Dias wrote:
Dear colleagues,
Here at SiBBr we're using the new Event core and measurementOrFacts. After standardizing to DwC and publishing, we think some users/researchers will want the "original" table format, for various reasons.
Is it possible to have a verbatimTable, or some place where we can store the original table/column format?
Regards
IPT mailing list IPT@lists.gbif.org http://lists.gbif.org/mailman/listinfo/ipt