[tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]

John R. WIECZOREK tuco at berkeley.edu
Thu May 15 17:59:54 CEST 2008


I second that.

On Thu, May 15, 2008 at 5:11 AM, Markus Döring <mdoering at gbif.org> wrote:

> that's right. So they need to be escaped if providers really want to
> have control characters in their dumps.
>
> But this is no different from escaping XML or any other document. It
> would just be nice to keep the number of escaped characters to a
> minimum. For this reason I personally prefer tab files, since escaping
> line returns and the delimiting tab is rather little work.
>
>
> Markus
>
>
> On 15 May, 2008, at 13:40, Holetschek, Jörg wrote:
>
> > Hi guys,
> >
> > sorry for the late reaction, but I put off reading all the mails
> > until today.
> >
> > Using CSV and tab-delimited files will cause problems when the dumps
> > contain free-text data, e.g. locality descriptions or notes. When I
> > pushed our BioCASE cache (50 million occurrence records) between
> > different DBMS using tab-delimited files, I found that people are very
> > eager to use tabs and new lines in free-text fields. Whatever character
> > you choose as a delimiter, you will find it in free-text fields...
> >
> > Cheers from Berlin,
> > Jörg
> >
> > -----Original Message-----
> > From: tdwg-tapir-bounces at lists.tdwg.org
> > [mailto:tdwg-tapir-bounces at lists.tdwg.org] On behalf of Markus Döring
> > Sent: Wednesday, 14 May 2008 15:35
> > To: Aaron D. Steele
> > Cc: TAPIR mailing list
> > Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest
> > methods?[SEC=UNCLASSIFIED]
> >
> >
> > it would keep the relations, but we don't really want any relational
> > structure to be served up. And using SQLite binaries for the DwC star
> > scheme would not be easier to work with than plain text files. Text
> > files can even be loaded into Excel straight away, versioned with SVN,
> > and so on. If there is a geospatial extension file with the GUID in
> > the first column, applications might grab that directly and not even
> > touch the central core file if they only want location data.
> >
> > I'd prefer to stick with a CSV or tab-delimited file.
> > The simpler the better. And it also can't get corrupted as easily.
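> >
> > (For instance, an application that only wants coordinates could read
> > just such an extension file; a minimal sketch, where the file name and
> > column layout are assumptions for illustration only:)
> >
> > import csv
> >
> > def read_locations(path="geospatial.txt"):
> >     # GUID in the first column, then latitude and longitude (assumed layout)
> >     with open(path, newline="", encoding="utf-8") as fh:
> >         return {row[0]: (float(row[1]), float(row[2]))
> >                 for row in csv.reader(fh, delimiter="\t")}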
> >
> > Markus
> >
> >
> >
> > On 14 May, 2008, at 15:25, Aaron D. Steele wrote:
> >
> >> for preserving relational data, we could also just dump tapirlink
> >> resources to an sqlite database file (http://www.sqlite.org), zip it
> >> up, and again make it available via the web service. we use sqlite
> >> internally for many projects, and it's both easy to use and well
> >> supported by jdbc, php, python, etc.
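> >>
> >> a minimal sketch of the idea (the table layout and column names below
> >> are made up, just to show the shape of it):
> >>
> >> import sqlite3, zipfile
> >>
> >> def dump_to_sqlite(records, db_path="resource.sqlite"):
> >>     # records: iterable of (guid, scientific_name, locality) tuples
> >>     con = sqlite3.connect(db_path)
> >>     con.execute("CREATE TABLE IF NOT EXISTS darwincore "
> >>                 "(guid TEXT PRIMARY KEY, scientific_name TEXT, locality TEXT)")
> >>     con.executemany("INSERT OR REPLACE INTO darwincore VALUES (?, ?, ?)", records)
> >>     con.commit()
> >>     con.close()
> >>     # zip it up so the web service can offer a single compressed download
> >>     with zipfile.ZipFile(db_path + ".zip", "w", zipfile.ZIP_DEFLATED) as zf:
> >>         zf.write(db_path)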
> >>
> >> would something like this be a useful option?
> >>
> >> thanks,
> >> aaron
> >>
> >> On Wed, May 14, 2008 at 2:21 AM, Markus Döring <mdoering at gbif.org>
> >> wrote:
> >>> Interesting that we all come to the same conclusions...
> >>> The trouble I had with just a simple flat CSV file is repeating
> >>> properties, like multiple image URLs. ABCD clients don't use ABCD
> >>> because it's complex, but because they want to transport this
> >>> relational data. We were considering two solutions for extending the
> >>> CSV approach. The first would be a single large denormalised CSV file
> >>> with many rows for the same record; it would require knowledge about
> >>> the related entities, though, and could grow in size rapidly. The
> >>> second idea, which we are thinking of adopting, is to allow a single
> >>> level of 1-to-many related entities. It is basically a "star" design
> >>> with the core DwC table in the centre and any number of extension
> >>> tables around it. Each "table", i.e. CSV file, has the record id as
> >>> its first column, so the files can be related easily and only a
> >>> single identifier is needed per record, not per extension entity.
> >>> This would give a lot of flexibility while keeping things pretty
> >>> simple to deal with. It would even satisfy the ABCD needs, as I
> >>> haven't yet seen anyone requiring two levels of related tables
> >>> (other than lookup tables). Those extensions could even be a simple
> >>> 1-to-1 relation, but would keep things semantically together, just
> >>> like an XML namespace. The Darwin Core extensions would be a good
> >>> example.
> >>>
> >>> So we could have a gzipped set of files, maybe with a simple metafile
> >>> indicating the semantics of the columns for each file.
> >>> An example could look like this:
> >>>
> >>>
> >>> # darwincore.csv
> >>> 102    Aster alpinus subsp. parviceps    ...
> >>> 103    Polygala vulgaris    ...
> >>>
> >>> # curatorial.csv
> >>> 102    Kew Herbarium
> >>> 103    Reading Herbarium
> >>>
> >>> # identification.csv
> >>> 102    2003-05-04    Karl Marx    Aster alpinus L.
> >>> 102    2007-01-11    Mark Twain    Aster korshinskyi Tamamsch.
> >>> 102    2007-09-13    Roger Hyam    Aster alpinus subsp. parviceps Novopokr.
> >>> 103    2001-02-21    Steve Bekow    Polygala vulgaris L.
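> >>>
> >>> (The metafile could be equally plain. The following is only a
> >>> hypothetical sketch of declaring the column semantics per file, not
> >>> an agreed format:)
> >>>
> >>> # meta.txt
> >>> darwincore.csv:      RecordId  ScientificName  ...
> >>> curatorial.csv:      RecordId  InstitutionName
> >>> identification.csv:  RecordId  DateIdentified  IdentifiedBy  ScientificName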
> >>>
> >>>
> >>>
> >>> I know this looks old fashioned, but it is just so simple and gives
> >>> us so much flexibility.
> >>> Markus
> >>>
> >>>
> >>>
> >>>
> >>> On 14 May, 2008, at 24:39, Greg Whitbread wrote:
> >>>
> >>>> We have used a very similar protocol to assemble the latest AVH
> >>>> cache.
> >>>> It should be noted that this is an as-well-as protocol that only
> >>>> works
> >>>> because we have an established semantic standard (hispid/abcd).
> >>>>
> >>>> greg
> >>>>
> >>>> trobertson at gbif.org wrote:
> >>>>> Hi All,
> >>>>>
> >>>>> This is very interesting to me, as I came to the same conclusion
> >>>>> while harvesting for GBIF.
> >>>>>
> >>>>> As a "harvester of all records" it is best described with an
> >>>>> example:
> >>>>>
> >>>>> - Complete inventory of ScientificNames: 7 minutes at the limited
> >>>>>   200 records per page
> >>>>> - Complete harvesting of records:
> >>>>>   - 260,000 records
> >>>>>   - 9 hours harvesting duration
> >>>>>   - 500MB TAPIR+DwC XML returned (DwC 1.4 with geospatial and
> >>>>>     curatorial extensions)
> >>>>> - Extraction of DwC records from harvested XML: <2 minutes
> >>>>> - Resulting file size 32MB, gzipped to <3MB
> >>>>>
> >>>>> I spun hard drives for 9 hours, and took up bandwidth that is paid
> >>>>> for, to
> >>>>> retrieve something that could have been generated provider side in
> >>>>> minutes
> >>>>> and transferred in seconds (3MB).
> >>>>>
> >>>>> I sent a proposal to TDWG last year, termed "datamaps", which was
> >>>>> effectively what you are describing; I based it on the Sitemaps
> >>>>> protocol but got nowhere with it.  With Markus we are now making
> >>>>> more progress: I have spoken with several GBIF data providers about
> >>>>> a proposed new standard for full dataset harvesting, and it has
> >>>>> been received well.  So Markus and I have started a new proposal
> >>>>> with the working name of 'Localised DwC Index' file generation (it
> >>>>> is an index if you have more than DwC data, and DwC is still
> >>>>> standards compliant), which is really a gzipped tab file dump of
> >>>>> the data and is slightly extensible.  The document is not ready to
> >>>>> circulate yet, but the benefits section currently reads:
> >>>>>
> >>>>> - Provider database load is reduced, allowing it to serve real
> >>>>>   distributed queries rather than "full datasource" harvesters
> >>>>> - Providers can choose to publish their index as it suits them,
> >>>>>   giving control back to the provider
> >>>>> - Localised index generation can be built into tools not yet
> >>>>>   capable of integrating with TDWG protocol networks such as GBIF
> >>>>> - Harvesters receive a full dataset view in one request, making it
> >>>>>   very easy to determine which records are eligible for deletion
> >>>>> - It becomes very simple to write clients that consume entire
> >>>>>   datasets, e.g. data cleansing tools that the provider can run
> >>>>>   (see the sketch after this list):
> >>>>>   - "Give me ISO country codes for my dataset": the application
> >>>>>     pulls down the provider's index file, generates ISO country
> >>>>>     codes, and returns a simple table using the provider's own
> >>>>>     identifiers
> >>>>>   - "Check my names for spelling mistakes": the application skims
> >>>>>     over the records and returns a list of names not known to it
> >>>>> - Providers such as the UK NBN cannot serve 20 million records to
> >>>>>   the GBIF index efficiently using the existing protocols; they do,
> >>>>>   however, have the ability to generate a localised index
> >>>>> - Harvesters can very quickly build up searchable indexes, and it
> >>>>>   is easy to create large indices
> >>>>> - The Node Portal can easily aggregate index data files
> >>>>> - A true index to the data, not the illusion of a cache; more like
> >>>>>   Google Sitemaps
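> >>>>>
> >>>>> (To illustrate how low that threshold could be, a rough sketch of
> >>>>> the ISO country code tool; the index URL, the column layout and the
> >>>>> lookup function are all assumptions, not part of the proposal:)
> >>>>>
> >>>>> import csv, gzip, urllib.request
> >>>>>
> >>>>> def country_codes_for_dataset(index_url, lookup_iso_code):
> >>>>>     # index_url points at the provider's gzipped tab file (assumed
> >>>>>     # layout: record id in column 0, verbatim country in column 5)
> >>>>>     table = []
> >>>>>     with urllib.request.urlopen(index_url) as resp:
> >>>>>         with gzip.open(resp, mode="rt", encoding="utf-8") as fh:
> >>>>>             for record in csv.reader(fh, delimiter="\t"):
> >>>>>                 table.append((record[0], lookup_iso_code(record[5])))
> >>>>>     # a simple table keyed by the provider's own identifiers
> >>>>>     return table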
> >>>>>
> >>>>> It is the ease with which one can offer tools to data providers
> >>>>> that really interests me.  The technical threshold required to
> >>>>> produce services that offer reporting tools on people's data is
> >>>>> really very low with this mechanism.  Add to that the fact that
> >>>>> large datasets become harvestable; we have even considered the
> >>>>> likes of BitTorrent for the largest ones, although I think that is
> >>>>> overkill.
> >>>>>
> >>>>> As a consumer therefore I fully support this move as a valuable
> >>>>> addition
> >>>>> to the wrapper tools.
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> Tim
> >>>>> (wrote the GBIF harvesting, and new to this list)
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Begin forwarded message:
> >>>>>>
> >>>>>>> From: "Aaron D. Steele" <eightysteele at gmail.com>
> >>>>>>> Date: 13 May 2008 22:40:09 GMT+02:00
> >>>>>>> To: tdwg-tapir at lists.tdwg.org
> >>>>>>> Cc: Aaron Steele <asteele at berkeley.edu>
> >>>>>>> Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?
> >>>>>>>
> >>>>>>> at berkeley we've recently prototyped a simple php program that
> >>>>>>> uses
> >>>>>>> an existing tapirlink installation to periodically dump tapir
> >>>>>>> resources into a csv file. the solution is totally generic and
> >>>>>>> can
> >>>>>>> dump darwin core (and technically abcd schema, although it's
> >>>>>>> currently
> >>>>>>> untested). the resulting csv files are zip archived and made
> >>>>>>> accessible using a web service. it's a simple approach that has
> >>>>>>> proven
> >>>>>>> to be, at least internally, quite reliable and useful.
> >>>>>>>
> >>>>>>> for example, several of our caching applications use the web
> >>>>>>> service
> >>>>>>> to harvest csv data from tapirlink resources using the following
> >>>>>>> process:
> >>>>>>> 1) download latest csv dump for a resource using the web
> >>>>>>> service.
> >>>>>>> 2) flush all locally cached records for the resource.
> >>>>>>> 3) bulk load the latest csv data into the cache.
> >>>>>>>
> >>>>>>> in this way, cached data are always synchronized with the
> >>>>>>> resource and
> >>>>>>> there's no need to track new, deleted, or changed records. as an
> >>>>>>> aside, each time these cached data are queried by the caching
> >>>>>>> application or selected in the user interface, log-only search
> >>>>>>> requests are sent back to the resource.
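> >>>>>>>
> >>>>>>> for illustration, the whole cycle fits in a few lines (the dump
> >>>>>>> url, table layout and column names here are assumptions, not our
> >>>>>>> actual code):
> >>>>>>>
> >>>>>>> import csv, io, sqlite3, urllib.request, zipfile
> >>>>>>>
> >>>>>>> def refresh_cache(dump_url, resource_id, cache_db="cache.sqlite"):
> >>>>>>>     # 1) download the latest csv dump for the resource
> >>>>>>>     data = urllib.request.urlopen(dump_url).read()
> >>>>>>>     archive = zipfile.ZipFile(io.BytesIO(data))
> >>>>>>>     text = archive.read(archive.namelist()[0]).decode("utf-8")
> >>>>>>>     rows = [(resource_id, r[0], r[1]) for r in csv.reader(io.StringIO(text))]
> >>>>>>>     con = sqlite3.connect(cache_db)
> >>>>>>>     con.execute("CREATE TABLE IF NOT EXISTS cache "
> >>>>>>>                 "(resource TEXT, guid TEXT, scientific_name TEXT)")
> >>>>>>>     # 2) flush all locally cached records for the resource
> >>>>>>>     con.execute("DELETE FROM cache WHERE resource = ?", (resource_id,))
> >>>>>>>     # 3) bulk load the latest csv data into the cache
> >>>>>>>     con.executemany("INSERT INTO cache VALUES (?, ?, ?)", rows)
> >>>>>>>     con.commit()
> >>>>>>>     con.close()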
> >>>>>>>
> >>>>>>> after discussion with renato giovanni and john wieczorek, we've
> >>>>>>> decided that merging this functionality into the tapirlink
> >>>>>>> codebase
> >>>>>>> would benefit the broader community. csv generation support
> >>>>>>> would
> >>>>>>> be
> >>>>>>> declared through capabilities. although incremental harvesting
> >>>>>>> wouldn't be immediately implemented, we could certainly extend
> >>>>>>> the
> >>>>>>> service to include it later.
> >>>>>>>
> >>>>>>> i'd like to pause here to gauge the consensus, thoughts,
> >>>>>>> concerns, and
> >>>>>>> ideas of others. anyone?
> >>>>>>>
> >>>>>>> thanks,
> >>>>>>> aaron
> >>>>>>>
> >>>>>>> 2008/5/5 Kevin Richards <RichardsK at landcareresearch.co.nz>:
> >>>>>>>>
> >>>>>>>> I think I agree here.
> >>>>>>>>
> >>>>>>>> The harvesting "procedure" is really defined outside the Tapir
> >>>>>>>> protocol, is
> >>>>>>>> it not?  So it is really an agreement between the harvester and
> >>>>>>>> the
> >>>>>>>> harvestees.
> >>>>>>>>
> >>>>>>>> So what is really needed here is the standard procedure for
> >>>>>>>> maintaining a
> >>>>>>>> "harvestable" dataset and the standard procedure for harvesting
> >>>>>>>> that
> >>>>>>>> dataset.
> >>>>>>>> We have a general rule at Landcare that we never delete records
> >>>>>>>> in our datasets - they are either deprecated in favour of
> >>>>>>>> another record, in which case the resolution of that record
> >>>>>>>> would point to the new record, or they are set to a state of
> >>>>>>>> "deleted" but are still kept in the dataset and can be resolved
> >>>>>>>> (which would indicate a state of deleted).
> >>>>>>>>
> >>>>>>>> Kevin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> >>> "Renato De Giovanni" <renato at cria.org.br> 6/05/2008 7:33 a.m. >>>
> >>>>>>>>
> >>>>>>>> Hi Markus,
> >>>>>>>>
> >>>>>>>> I would suggest creating new concepts for incremental
> >>>>>>>> harvesting,
> >>>>>>>> either in the data standards themselves or in some new
> >>>>>>>> extension. In
> >>>>>>>> the case of TAPIR, GBIF could easily check the mapped concepts
> >>>>>>>> before
> >>>>>>>> deciding between incremental or full harvesting.
> >>>>>>>>
> >>>>>>>> Actually it could be just one new concept, such as
> >>>>>>>> "recordStatus" or "deletionFlag". Or perhaps you might also want
> >>>>>>>> to create your own definition for dateLastModified indicating
> >>>>>>>> which set of concepts should be considered to see whether
> >>>>>>>> something has changed or not, but I guess this level of
> >>>>>>>> granularity would be difficult to support.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> --
> >>>>>>>> Renato
> >>>>>>>>
> >>>>>>>> On 5 May 2008 at 11:24, Markus Döring wrote:
> >>>>>>>>
> >>>>>>>>> Phil,
> >>>>>>>>> incremental harvesting is not implemented on the GBIF side as
> >>>>>>>>> far as I am aware, and I don't think it will be a simple thing
> >>>>>>>>> to implement on the current system. Also, even if we can detect
> >>>>>>>>> only the records changed since the last harvesting via
> >>>>>>>>> dateLastModified, we still have no information about deletions.
> >>>>>>>>> We could have an arrangement saying that you keep deleted
> >>>>>>>>> records as empty records with just the ID and nothing else (I
> >>>>>>>>> vaguely remember LSIDs were supposed to work like this too).
> >>>>>>>>> But that then also needs to be supported on your side, never
> >>>>>>>>> entirely removing any record. I will have a discussion with the
> >>>>>>>>> others at GBIF about that.
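> >>>>>>>>>
> >>>>>>>>> (In a dump, a deleted record could then appear as a row
> >>>>>>>>> carrying nothing but its identifier; purely an illustration,
> >>>>>>>>> not an agreed convention, with record 104 deleted here:)
> >>>>>>>>>
> >>>>>>>>> 102    Aster alpinus subsp. parviceps    ...
> >>>>>>>>> 104
> >>>>>>>>> 103    Polygala vulgaris    ...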
> >>>>>>>>>
> >>>>>>>>> Markus
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> greg whitbread
> >>>> Australian Centre for Plant Biodiversity Research
> >>>> National Botanic Gardens - Integrated Botanical Information System
> >>>> voice: +61 2 62509 482    fax: +61 2 62509 599
> >>>> ghw at anbg.gov.au
> >>>> GPO Box 1777 Canberra 2601
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
>
>