[tdwg-tapir] Fwd: Tapir protocol - Harvest methods? [SEC=UNCLASSIFIED]

Javier de la Torre jatorre at gmail.com
Wed May 14 16:19:32 CEST 2008


This discussion is starting to remind me of another one in the Google
App Engine discussion group, where they talk about different ways to
bulk upload data into their BigTable database.

http://groups.google.com/group/google-appengine/browse_thread/thread/18d246b30e267da4

So far I have read:
-XML
-CSV
-RDF
-JSON
-AMF
-SQL
-OOXML
-TSV

Uff, so many ideas...

I would take whatever Google finally decides, as it will probably
become a de facto standard :D

The discussion is funny :D

Cheers.


On Wed, May 14, 2008 at 4:04 PM, Dave Vieglais <vieglais at ku.edu> wrote:
> Perhaps it could be put into some form of XML to preserve the
> relational model?  Maybe a mechanism could be developed so that others
> could access the XML as well.  How about even adding some sort of
> subsetting mechanism so that entire data sets need not be retrieved?
>
> just a thought...
>
>
> On Wed, May 14, 2008 at 9:25 AM, Aaron D. Steele <eightysteele at gmail.com> wrote:
>> for preserving relational data, we could also just dump tapirlink
>>  resources to an sqlite database file (http://www.sqlite.org), zip it
>>  up, and again make it available via the web service. we use sqlite
>>  internally for many projects, and it's both easy to use and well
>>  supported by jdbc, php, python, etc.
>>
>>  would something like this be a useful option?
>>
>>  thanks,
>>  aaron
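>>
>>  p.s. a minimal sketch of what that dump step might look like (the
>>  input file, table and column names here are invented for
>>  illustration, not what our prototype actually uses):
>>
>>  import csv
>>  import sqlite3
>>
>>  # read rows previously dumped from a tapirlink resource
>>  # (hypothetical two-column csv: record id, scientific name).
>>  with open("resource_dump.csv") as f:
>>      rows = list(csv.reader(f))
>>
>>  # write them into a single-file sqlite database, ready to be
>>  # zipped up and served via the web service.
>>  db = sqlite3.connect("resource.sqlite")
>>  db.execute("CREATE TABLE IF NOT EXISTS dwc (id TEXT, scientific_name TEXT)")
>>  db.executemany("INSERT INTO dwc VALUES (?, ?)", ((r[0], r[1]) for r in rows))
>>  db.commit()
>>  db.close()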
>>
>>  On Wed, May 14, 2008 at 2:21 AM, Markus Döring <mdoering at gbif.org> wrote:
>>  > Interesting that we all come to the same conclusions...
>>  >  The trouble I had with just a simple flat CSV file is repeating
>>  >  properties like multiple image URLs. ABCD clients don't use ABCD
>>  >  because it's complex, but because they want to transport this
>>  >  relational data. We were considering two solutions for extending
>>  >  the CSV approach. The first would be a single large denormalised
>>  >  CSV file with many rows for the same record. It would require
>>  >  knowledge about the related entities, though, and could grow in
>>  >  size rapidly. The second idea, which we are thinking of adopting,
>>  >  is to allow a single level of 1-many related entities. It is
>>  >  basically a "star" design, with the core DwC table in the center
>>  >  and any number of extension tables around it. Each "table", i.e.
>>  >  CSV file, will have the record ID as the first column, so the
>>  >  files can be related easily, and only a single identifier is
>>  >  needed per record, not one per extension entity. This would give
>>  >  a lot of flexibility while keeping things pretty simple to deal
>>  >  with. It would even satisfy the ABCD needs, as I haven't yet seen
>>  >  anyone requiring two levels of related tables (other than lookup
>>  >  tables). The extensions could even be a simple 1-1 relation but
>>  >  would still keep things semantically together, just like an XML
>>  >  namespace. The Darwin Core extensions would be a good example.
>>  >
>>  >  So we could have a gzipped set of files, maybe with a simple metafile
>>  >  indicating the semantics of the columns for each file.
>>  >  An example could look like this:
>>  >
>>  >
>>  >  # darwincore.csv
>>  >  102    Aster alpinus subsp. parviceps    ...
>>  >  103    Polygala vulgaris    ...
>>  >
>>  >  # curatorial.csv
>>  >  102    Kew Herbarium
>>  >  103    Reading Herbarium
>>  >
>>  >  # identification.csv
>>  >  102    2003-05-04    Karl Marx    Aster alpinus L.
>>  >  102    2007-01-11    Mark Twain    Aster korshinskyi Tamamsch.
>>  >  102    2007-09-13    Roger Hyam    Aster alpinus subsp. parviceps Novopokr.
>>  >  103    2001-02-21    Steve Bekow    Polygala vulgaris L.
>>  >
>>  >
>>  >
>>  >  I know this looks old-fashioned, but it is just so simple and
>>  >  gives us so much flexibility.
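>>  >
>>  >  A rough sketch of how a consumer might stitch such files back
>>  >  together (tab-delimited files named as in the example above;
>>  >  everything else is illustrative):
>>  >
>>  >  import csv
>>  >  from collections import defaultdict
>>  >
>>  >  # Load the core DwC table keyed by record ID (first column).
>>  >  with open("darwincore.csv") as f:
>>  >      core = {row[0]: row[1:] for row in csv.reader(f, delimiter="\t")}
>>  >
>>  >  # Attach each extension file as a 1-many list per core record.
>>  >  extensions = {}
>>  >  for name in ("curatorial.csv", "identification.csv"):
>>  >      ext = defaultdict(list)
>>  >      with open(name) as f:
>>  >          for row in csv.reader(f, delimiter="\t"):
>>  >              ext[row[0]].append(row[1:])
>>  >      extensions[name] = ext
>>  >
>>  >  # Example: all identifications attached to record 102.
>>  >  print(extensions["identification.csv"]["102"])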
>>  >  Markus
>>  >
>>  >
>>  >
>>  >
>>  >  On 14 May, 2008, at 24:39, Greg Whitbread wrote:
>>  >
>>  >  > We have used a very similar protocol to assemble the latest AVH cache.
>>  >  > It should be noted that this is an as-well-as protocol that only works
>>  >  > because we have an established semantic standard (hispid/abcd).
>>  >  >
>>  >  > greg
>>  >  >
>>  >  > trobertson at gbif.org wrote:
>>  >  >> Hi All,
>>  >  >>
>>  >  >> This is very interesting to me, as I reached the same
>>  >  >> conclusion while harvesting for GBIF.
>>  >  >>
>>  >  >> As a "harvester of all records", the problem is best described with an example:
>>  >  >>
>>  >  >> - Complete inventory of ScientificNames: 7 minutes at the
>>  >  >>   limited 200 records per page
>>  >  >> - Complete harvesting of records:
>>  >  >>   - 260,000 records
>>  >  >>   - 9 hours harvesting duration
>>  >  >>   - 500MB TAPIR+DwC XML returned (DwC 1.4 with geospatial and
>>  >  >>     curatorial extensions)
>>  >  >> - Extraction of DwC records from harvested XML: <2 minutes
>>  >  >>   - Resulting file size 32MB, GZipped to <3MB
>>  >  >>
>>  >  >> I spun hard drives for 9 hours, and took up bandwidth that is
>>  >  >> paid for, to retrieve something that could have been generated
>>  >  >> provider-side in minutes and transferred in seconds (3MB).
>>  >  >>
>>  >  >> I sent a proposal to TDWG last year, termed "datamaps", which
>>  >  >> was effectively what you are describing; I based it on the
>>  >  >> Sitemaps protocol, but got nowhere with it. With Markus, we are
>>  >  >> making more progress: I have spoken with several GBIF data
>>  >  >> providers about a proposed new standard for full dataset
>>  >  >> harvesting, and it has been well received. So Markus and I have
>>  >  >> started a new proposal under the working name 'Localised DwC
>>  >  >> Index' file generation (it is an index if you have more than
>>  >  >> DwC data, and DwC is still standards compliant), which is
>>  >  >> really a GZipped tab file dump of the data and is slightly
>>  >  >> extensible. The document is not ready to circulate yet, but the
>>  >  >> benefits section currently reads as follows (a sketch of the
>>  >  >> index generation follows the list):
>>  >  >>
>>  >  >> - Provider database load is reduced, allowing it to serve real
>>  >  >>   distributed queries rather than "full datasource" harvesters
>>  >  >> - Providers can choose to publish their index as it suits them,
>>  >  >>   giving control back to the provider
>>  >  >> - Localised index generation can be built into tools not yet
>>  >  >>   capable of integrating with TDWG protocol networks such as GBIF
>>  >  >> - Harvesters receive a full dataset view in one request, making
>>  >  >>   it very easy to determine which records are eligible for
>>  >  >>   deletion
>>  >  >> - It becomes very simple to write clients that consume entire
>>  >  >>   datasets, e.g. data cleansing tools that the provider can run:
>>  >  >>   - "Give me ISO country codes for my dataset"
>>  >  >>     - The application pulls down the provider's index file,
>>  >  >>       generates ISO country codes, and returns a simple table
>>  >  >>       keyed by the provider's own identifiers
>>  >  >>   - "Check my names for spelling mistakes"
>>  >  >>     - The application skims over the records and returns a list
>>  >  >>       of names not known to the application
>>  >  >> - Providers such as the UK NBN cannot efficiently serve 20
>>  >  >>   million records to the GBIF index using the existing protocols
>>  >  >>   - They do, however, have the ability to generate a localised
>>  >  >>     index
>>  >  >> - Harvesters can very quickly build up searchable indexes, and
>>  >  >>   it is easy to create large indices
>>  >  >>   - The Node Portal can easily aggregate index data files
>>  >  >> - A true index to the data, not the illusion of a cache; more
>>  >  >>   like Google Sitemaps
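>>  >  >>
>>  >  >> As a strawman, generating such an index file could be as simple
>>  >  >> as this (the database path, table and column names are
>>  >  >> placeholders for illustration, not part of the proposal):
>>  >  >>
>>  >  >> import csv
>>  >  >> import gzip
>>  >  >> import sqlite3
>>  >  >>
>>  >  >> # Dump the provider's occurrence table straight to a GZipped
>>  >  >> # tab-delimited file, run on the provider's own schedule.
>>  >  >> conn = sqlite3.connect("provider.db")
>>  >  >> cur = conn.execute("SELECT id, scientific_name, country FROM occurrences")
>>  >  >> with gzip.open("dwc_index.txt.gz", "wt", newline="") as out:
>>  >  >>     writer = csv.writer(out, delimiter="\t")
>>  >  >>     writer.writerow(["id", "ScientificName", "Country"])
>>  >  >>     writer.writerows(cur)
>>  >  >> conn.close()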
>>  >  >>
>>  >  >> It is the ease with which one can offer tools to data providers
>>  >  >> that really interests me. The technical threshold required to
>>  >  >> produce services that offer reporting tools on people's data is
>>  >  >> really very low with this mechanism. That, and the fact that
>>  >  >> large datasets will be harvestable - we have even considered the
>>  >  >> likes of BitTorrent for the largest ones, although I think this
>>  >  >> is overkill.
>>  >  >>
>>  >  >> As a consumer, therefore, I fully support this move as a
>>  >  >> valuable addition to the wrapper tools.
>>  >  >>
>>  >  >> Cheers
>>  >  >>
>>  >  >> Tim
>>  >  >> (I wrote the GBIF harvesting, and am new to this list)
>>  >  >>
>>  >  >>
>>  >  >>>
>>  >  >>> Begin forwarded message:
>>  >  >>>
>>  >  >>>> From: "Aaron D. Steele" <eightysteele at gmail.com>
>>  >  >>>> Date: 13 de mayo de 2008 22:40:09 GMT+02:00
>>  >  >>>> To: tdwg-tapir at lists.tdwg.org
>>  >  >>>> Cc: Aaron Steele <asteele at berkeley.edu>
>>  >  >>>> Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?
>>  >  >>>>
>>  >  >>>> at berkeley we've recently prototyped a simple php program
>>  >  >>>> that uses an existing tapirlink installation to periodically
>>  >  >>>> dump tapir resources into a csv file. the solution is totally
>>  >  >>>> generic and can dump darwin core (and technically abcd schema,
>>  >  >>>> although that's currently untested). the resulting csv files
>>  >  >>>> are zip archived and made accessible through a web service.
>>  >  >>>> it's a simple approach that has proven, at least internally,
>>  >  >>>> to be quite reliable and useful.
>>  >  >>>>
>>  >  >>>> for example, several of our caching applications use the web
>>  >  >>>> service to harvest csv data from tapirlink resources using the
>>  >  >>>> following process:
>>  >  >>>> 1) download the latest csv dump for a resource using the web service.
>>  >  >>>> 2) flush all locally cached records for the resource.
>>  >  >>>> 3) bulk load the latest csv data into the cache.
>>  >  >>>>
>>  >  >>>> in this way, cached data are always synchronized with the
>>  >  >>>> resource, and there's no need to track new, deleted, or
>>  >  >>>> changed records. as an aside, each time these cached data are
>>  >  >>>> queried by the caching application or selected in the user
>>  >  >>>> interface, log-only search requests are sent back to the
>>  >  >>>> resource.
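>>  >  >>>>
>>  >  >>>> a minimal sketch of that three-step sync (the dump url and
>>  >  >>>> cache schema are made up for illustration):
>>  >  >>>>
>>  >  >>>> import csv
>>  >  >>>> import io
>>  >  >>>> import sqlite3
>>  >  >>>> import urllib.request
>>  >  >>>> import zipfile
>>  >  >>>>
>>  >  >>>> # 1) download the latest csv dump for the resource.
>>  >  >>>> data = urllib.request.urlopen("http://example.org/dump/resource1.zip").read()
>>  >  >>>> with zipfile.ZipFile(io.BytesIO(data)) as z:
>>  >  >>>>     text = z.read(z.namelist()[0]).decode("utf-8")
>>  >  >>>>
>>  >  >>>> cache = sqlite3.connect("cache.db")
>>  >  >>>> cache.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, payload TEXT)")
>>  >  >>>>
>>  >  >>>> # 2) flush all locally cached records for the resource.
>>  >  >>>> cache.execute("DELETE FROM records")
>>  >  >>>>
>>  >  >>>> # 3) bulk load the latest csv data into the cache.
>>  >  >>>> rows = csv.reader(io.StringIO(text))
>>  >  >>>> cache.executemany("INSERT INTO records VALUES (?, ?)",
>>  >  >>>>                   ((r[0], "\t".join(r[1:])) for r in rows))
>>  >  >>>> cache.commit()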
>>  >  >>>>
>>  >  >>>> after discussion with renato giovanni and john wieczorek,
>>  >  >>>> we've decided that merging this functionality into the
>>  >  >>>> tapirlink codebase would benefit the broader community. csv
>>  >  >>>> generation support would be declared through capabilities.
>>  >  >>>> although incremental harvesting wouldn't be immediately
>>  >  >>>> implemented, we could certainly extend the service to include
>>  >  >>>> it later.
>>  >  >>>>
>>  >  >>>> i'd like to pause here to gauge the consensus, thoughts,
>>  >  >>>> concerns, and ideas of others. anyone?
>>  >  >>>>
>>  >  >>>> thanks,
>>  >  >>>> aaron
>>  >  >>>>
>>  >  >>>> 2008/5/5 Kevin Richards <RichardsK at landcareresearch.co.nz>:
>>  >  >>>>>
>>  >  >>>>> I think I agree here.
>>  >  >>>>>
>>  >  >>>>> The harvesting "procedure" is really defined outside the
>>  >  >>>>> TAPIR protocol, is it not? So it is really an agreement
>>  >  >>>>> between the harvester and the harvestees.
>>  >  >>>>>
>>  >  >>>>> So what is really needed here is a standard procedure for
>>  >  >>>>> maintaining a "harvestable" dataset and a standard procedure
>>  >  >>>>> for harvesting that dataset.
>>  >  >>>>> We have a general rule at Landcare that we never delete
>>  >  >>>>> records from our datasets - they are either deprecated in
>>  >  >>>>> favour of another record, in which case resolving the old
>>  >  >>>>> record points to the new one, or they are set to a state of
>>  >  >>>>> "deleted" but are still kept in the dataset and can be
>>  >  >>>>> resolved (which would indicate the deleted state).
>>  >  >>>>>
>>  >  >>>>> Kevin
>>  >  >>>>>
>>  >  >>>>>
>>  >  >>>>>>>> "Renato De Giovanni" <renato at cria.org.br> 6/05/2008 7:33 a.m.
>>  >  >>>>>>>> >>>
>>  >  >>>>>
>>  >  >>>>> Hi Markus,
>>  >  >>>>>
>>  >  >>>>> I would suggest creating new concepts for incremental
>>  >  >>>>> harvesting, either in the data standards themselves or in
>>  >  >>>>> some new extension. In the case of TAPIR, GBIF could easily
>>  >  >>>>> check the mapped concepts before deciding between incremental
>>  >  >>>>> and full harvesting.
>>  >  >>>>>
>>  >  >>>>> Actually, it could be just one new concept, such as
>>  >  >>>>> "recordStatus" or "deletionFlag". Or you might also want to
>>  >  >>>>> create your own definition of dateLastModified indicating
>>  >  >>>>> which set of concepts should be considered when deciding
>>  >  >>>>> whether something has changed, but I guess that level of
>>  >  >>>>> granularity would be difficult to support.
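>>  >  >>>>>
>>  >  >>>>> A minimal sketch of that check on the harvester side (the
>>  >  >>>>> concept names below are only candidates, nothing is
>>  >  >>>>> standardized yet):
>>  >  >>>>>
>>  >  >>>>> # Concepts a provider would need to map before incremental
>>  >  >>>>> # harvesting is worthwhile: a change date plus a
>>  >  >>>>> # deletion/status flag.
>>  >  >>>>> INCREMENTAL_CONCEPTS = {"dateLastModified", "recordStatus"}
>>  >  >>>>>
>>  >  >>>>> def choose_strategy(mapped_concepts):
>>  >  >>>>>     # mapped_concepts: concepts parsed from the provider's
>>  >  >>>>>     # TAPIR capabilities response.
>>  >  >>>>>     if INCREMENTAL_CONCEPTS <= set(mapped_concepts):
>>  >  >>>>>         return "incremental"
>>  >  >>>>>     return "full"
>>  >  >>>>>
>>  >  >>>>> print(choose_strategy(["dateLastModified", "recordStatus"]))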
>>  >  >>>>>
>>  >  >>>>> Regards,
>>  >  >>>>> --
>>  >  >>>>> Renato
>>  >  >>>>>
>>  >  >>>>> On 5 May 2008 at 11:24, Markus Döring wrote:
>>  >  >>>>>
>>  >  >>>>>> Phil,
>>  >  >>>>>> incremental harvesting is not implemented on the GBIF side
>>  >  >>>>>> as far as I am aware, and I don't think it will be a simple
>>  >  >>>>>> thing to implement on the current system. Also, even if we
>>  >  >>>>>> can detect only the records changed since the last
>>  >  >>>>>> harvesting via dateLastModified, we still have no
>>  >  >>>>>> information about deletions. We could have an arrangement
>>  >  >>>>>> saying that you keep deleted records as empty records with
>>  >  >>>>>> just the ID and nothing else (I vaguely remember LSIDs were
>>  >  >>>>>> supposed to work like this too). But that would also need to
>>  >  >>>>>> be supported on your side, never entirely removing any
>>  >  >>>>>> record. I will discuss this with the others at GBIF.
>>  >  >>>>>>
>>  >  >>>>>> Markus
>>  >  >
>>  >  > --
>>  >  >
>>  >  > Australian Centre for Plant Biodiversity Research<------------------+
>>  >  > National            greg whitBread             voice: +61 2 62509 482
>>  >  > Botanic Integrated Botanical Information System  fax: +61 2 62509 599
>>  >  > Gardens                      S........ I.T. happens.. ghw at anbg.gov.au
>>  >  > +----------------------------------------->GPO Box 1777 Canberra 2601
>>  >  >
>>  >  >
>>  >  >
>>  >
>>  >
>>
>


