[tdwg-tapir] Fwd: Tapir protocol - Harvest methods? [SEC=UNCLASSIFIED]

Markus Döring mdoering at gbif.org
Wed May 14 16:35:52 CEST 2008


... and because of appengine we were considering using YAML for a
very simple metafile describing the conceptual binding, instead of
having column header rows.
http://code.google.com/appengine/docs/configuringanapp.html
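
For illustration, a minimal sketch of what such a YAML metafile might
look like when read from Python (the layout and the concept URIs are
made up for the example; nothing here is a settled format):

# requires PyYAML: pip install pyyaml
import yaml

META = """
darwincore.csv:
  - http://example.org/terms/catalogNumber
  - http://example.org/terms/scientificName
curatorial.csv:
  - http://example.org/terms/catalogNumber
  - http://example.org/terms/collectionCode
"""

# map each data file to the ordered list of concepts bound to its columns
bindings = yaml.safe_load(META)
for filename, concepts in bindings.items():
    print(filename, '->', concepts)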


On 14 May, 2008, at 16:19, Javier de la Torre wrote:

> This discussion is starting to remind me of another one in the Google
> Appengine discussion group. They talk about different ways to bulk
> upload data to their Big Table database.
>
> http://groups.google.com/group/google-appengine/browse_thread/thread/18d246b30e267da4
>
> So far I have read:
> -XML
> -CSV
> -RDF
> -JSON
> -AMF
> -SQL
> -OOXML
> -TSV
>
> Uff so many ideas...
>
> I would take whatever Google finally decides, as it will probably
> become a de facto standard :D
>
> The discussion is funny :D
>
> Cheers.
>
>
> On Wed, May 14, 2008 at 4:04 PM, Dave Vieglais <vieglais at ku.edu> wrote:
>> Perhaps it could be put into some form of XML to preserve the
>> relational model?  Maybe a mechanism could be developed so that
>> others could access the XML as well.  How about even adding some sort
>> of subsetting mechanism so that entire data sets need not be
>> retrieved.
>>
>> just a thought...
>>
>>
>> On Wed, May 14, 2008 at 9:25 AM, Aaron D. Steele <eightysteele at gmail.com> wrote:
>>> for preserving relational data, we could also just dump tapirlink
>>> resources to an sqlite database file (http://www.sqlite.org), zip it
>>> up, and again make it available via the web service. we use sqlite
>>> internally for many projects, and it's both easy to use and well
>>> supported by jdbc, php, python, etc.
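>>>
>>> a rough python sketch of the idea (the table layout and file names
>>> are invented for the example, assuming a two-column csv dump):
>>>
>>> import csv, sqlite3, zipfile
>>>
>>> # load the csv dump of a resource into a sqlite database file
>>> conn = sqlite3.connect('resource.sqlite')
>>> conn.execute('create table if not exists dwc (id text, name text)')
>>> with open('resource.csv') as f:
>>>     conn.executemany('insert into dwc values (?, ?)', csv.reader(f))
>>> conn.commit()
>>> conn.close()
>>>
>>> # zip it up so the web service can offer a single download
>>> with zipfile.ZipFile('resource.zip', 'w', zipfile.ZIP_DEFLATED) as z:
>>>     z.write('resource.sqlite')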
>>>
>>> would something like this be a useful option?
>>>
>>> thanks,
>>> aaron
>>>
>>> On Wed, May 14, 2008 at 2:21 AM, Markus Döring <mdoering at gbif.org> wrote:
>>>> Interesting that we all come to the same conclusions...
>>>> The trouble I had with just a simple flat CSV file is repeating
>>>> properties like multiple image URLs. ABCD clients don't use ABCD
>>>> because it's complex, but because they want to transport this
>>>> relational data. We were considering two solutions for extending
>>>> this CSV approach. The first would be to have a single large
>>>> denormalised CSV file with many rows for the same record. It would
>>>> require knowledge about the related entities though, and could grow
>>>> in size rapidly. The second idea, which we are thinking of adopting,
>>>> is to allow a single level of 1-to-many related entities. It is
>>>> basically a "star" design, with the core DwC table in the centre and
>>>> any number of extension tables around it. Each "table", i.e. CSV
>>>> file, will have the record id as the first column, so the files can
>>>> be related easily, and it only needs a single identifier per record
>>>> and not one for each extension entity. This would give a lot of
>>>> flexibility while keeping things pretty simple to deal with. It
>>>> would even satisfy the ABCD needs, as I haven't yet seen anyone
>>>> requiring two levels of related tables (other than lookup tables).
>>>> Those extensions could even be a simple 1-to-1 relation, but would
>>>> keep things semantically together, just like an XML namespace. The
>>>> Darwin Core extensions would be a good example.
>>>>
>>>> So we could have a gzipped set of files, maybe with a simple
>>>> metafile indicating the semantics of the columns for each file.
>>>> An example could look like this:
>>>>
>>>>
>>>> # darwincore.csv
>>>> 102    Aster alpinus subsp. parviceps    ...
>>>> 103    Polygala vulgaris    ...
>>>>
>>>> # curatorial.csv
>>>> 102    Kew Herbarium
>>>> 103    Reading Herbarium
>>>>
>>>> # identification.csv
>>>> 102    2003-05-04    Karl Marx    Aster alpinus L.
>>>> 102    2007-01-11    Mark Twain    Aster korshinskyi Tamamsch.
>>>> 102    2007-09-13    Roger Hyam    Aster alpinus subsp. parviceps
>>>> Novopokr.
>>>> 103    2001-02-21    Steve Bekow    Polygala vulgaris L.
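>>>>
>>>> To show how simple such a star layout is to consume, here is a
>>>> rough Python sketch (assuming the files above are tab-delimited):
>>>>
>>>> import csv
>>>> from collections import defaultdict
>>>>
>>>> # collect the 1-to-many extension rows, keyed on the record id
>>>> # that every file carries in its first column
>>>> identifications = defaultdict(list)
>>>> with open('identification.csv') as f:
>>>>     for row in csv.reader(f, delimiter='\t'):
>>>>         identifications[row[0]].append(row[1:])
>>>>
>>>> # walk the core table and attach the related extension records
>>>> with open('darwincore.csv') as f:
>>>>     for row in csv.reader(f, delimiter='\t'):
>>>>         print(row[0], row[1], identifications[row[0]])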
>>>>
>>>>
>>>>
>>>> I know this looks old fashioned, but it is just so simple and gives
>>>> us so much flexibility.
>>>> Markus
>>>>
>>>>
>>>>
>>>>
>>>> On 14 May, 2008, at 24:39, Greg Whitbread wrote:
>>>>
>>>>> We have used a very similar protocol to assemble the latest AVH
>>>>> cache. It should be noted that this is an as-well-as protocol that
>>>>> only works because we have an established semantic standard
>>>>> (hispid/abcd).
>>>>>
>>>>> greg
>>>>>
>>>>> trobertson at gbif.org wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> This is very interesting to me, as I came to the same conclusion
>>>>>> while harvesting for GBIF.
>>>>>>
>>>>>> As a "harvester of all records" it is best described with an  
>>>>>> example:
>>>>>>
>>>>>> - Complete Inventory of ScientificNames: 7 minutes @ the  
>>>>>> limited 200
>>>>>> records per page
>>>>>> - Complete Harvesting of records:
>>>>>> - 260,000 records
>>>>>> - 9 hours harvesting duration
>>>>>> - 500MB TAPIR+DwC XML returned (DwC 1.4 with geospatial and
>>>>>> curatorial
>>>>>> extensions)
>>>>>> - Extraction of DwC records from harvested XML: <2 minutes
>>>>>> - Resulting file size 32MB, Gzipped to <3MB
>>>>>>
>>>>>> I spun hard drives for 9 hours, and took up bandwidth that is
>>>>>> paid for, to retrieve something that could have been generated
>>>>>> provider-side in minutes and transferred in seconds (3MB).
>>>>>>
>>>>>> I sent a proposal to TDWG last year termed "datamaps", which was
>>>>>> effectively what you are describing; I based it on the Sitemaps
>>>>>> protocol, but I got nowhere with it.  With Markus we are making
>>>>>> more progress: I have spoken with several GBIF data providers
>>>>>> about a proposed new standard for full dataset harvesting and it
>>>>>> has been received well.  So Markus and I have started a new
>>>>>> proposal, with the working name of 'Localised DwC Index' file
>>>>>> generation (it is an index if you have more than DwC data, and
>>>>>> plain DwC is still standards compliant). It is really a gzipped
>>>>>> tab file dump of the data, which is slightly extensible.  The
>>>>>> document is not ready to circulate yet, but the benefits section
>>>>>> currently reads:
>>>>>>
>>>>>> - Provider database load is reduced, allowing it to serve real
>>>>>>   distributed queries rather than "full datasource" harvesters
>>>>>> - Providers can choose to publish their index as it suits them,
>>>>>>   giving control back to the provider
>>>>>> - Localised index generation can be built into tools not yet
>>>>>>   capable of integrating with TDWG protocol networks such as GBIF
>>>>>> - Harvesters receive a full dataset view in one request, making it
>>>>>>   very easy to determine which records are eligible for deletion
>>>>>>   (see the sketch after this list)
>>>>>> - It becomes very simple to write clients that consume entire
>>>>>>   datasets, e.g. data cleansing tools that the provider can run:
>>>>>>   - "Give me ISO country codes for my dataset": the application
>>>>>>     pulls down the provider's index file, generates the ISO
>>>>>>     country codes, and returns a simple table keyed on the
>>>>>>     provider's own identifiers
>>>>>>   - "Check my names for spelling mistakes": the application skims
>>>>>>     over the records and lists those it does not recognise
>>>>>> - Providers such as the UK NBN cannot serve 20 million records to
>>>>>>   the GBIF index efficiently using the existing protocols, but
>>>>>>   they can generate a localised index
>>>>>> - Harvesters can very quickly build up searchable indexes, and it
>>>>>>   is easy to create large indices
>>>>>> - The Node Portal can easily aggregate index data files
>>>>>> - It is a true index to the data, not the illusion of a cache;
>>>>>>   more like Google Sitemaps
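>>>>>>
>>>>>> As a sketch of the "full dataset view" point above, producing and
>>>>>> consuming such a gzipped tab file is only a few lines of Python
>>>>>> (the file name and columns are invented for the example):
>>>>>>
>>>>>> import csv, gzip
>>>>>>
>>>>>> # provider side: dump the data as a gzipped tab file
>>>>>> with gzip.open('dwc_index.txt.gz', 'wt', newline='') as f:
>>>>>>     out = csv.writer(f, delimiter='\t')
>>>>>>     out.writerow(['102', 'Aster alpinus subsp. parviceps'])
>>>>>>     out.writerow(['103', 'Polygala vulgaris'])
>>>>>>
>>>>>> # harvester side: one download yields the complete id set, so
>>>>>> # any cached record absent from it is eligible for deletion
>>>>>> with gzip.open('dwc_index.txt.gz', 'rt', newline='') as f:
>>>>>>     ids = {row[0] for row in csv.reader(f, delimiter='\t')}
>>>>>> print(ids)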
>>>>>>
>>>>>> It is the ease with which one can offer tools to data providers
>>>>>> that really interests me.  The technical threshold required to
>>>>>> produce services that offer reporting tools on people's data is
>>>>>> really very low with this mechanism.  That, and the fact that
>>>>>> large datasets will be harvestable - we have even considered the
>>>>>> likes of BitTorrent for the largest ones, although I think this
>>>>>> is overkill.
>>>>>>
>>>>>> As a consumer, therefore, I fully support this move as a valuable
>>>>>> addition to the wrapper tools.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> Tim
>>>>>> (wrote the GBIF harvesting, and new to this list)
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Begin forwarded message:
>>>>>>>
>>>>>>>> From: "Aaron D. Steele" <eightysteele at gmail.com>
>>>>>>>> Date: 13 May 2008 22:40:09 GMT+02:00
>>>>>>>> To: tdwg-tapir at lists.tdwg.org
>>>>>>>> Cc: Aaron Steele <asteele at berkeley.edu>
>>>>>>>> Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?
>>>>>>>>
>>>>>>>> at berkeley we've recently prototyped a simple php program that
>>>>>>>> uses an existing tapirlink installation to periodically dump
>>>>>>>> tapir resources into a csv file. the solution is totally generic
>>>>>>>> and can dump darwin core (and technically abcd schema, although
>>>>>>>> that's currently untested). the resulting csv files are zip
>>>>>>>> archived and made accessible through a web service. it's a
>>>>>>>> simple approach that has proven to be, at least internally,
>>>>>>>> quite reliable and useful.
>>>>>>>>
>>>>>>>> for example, several of our caching applications use the web
>>>>>>>> service to harvest csv data from tapirlink resources using the
>>>>>>>> following process:
>>>>>>>> 1) download the latest csv dump for a resource using the web
>>>>>>>> service.
>>>>>>>> 2) flush all locally cached records for the resource.
>>>>>>>> 3) bulk load the latest csv data into the cache.
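>>>>>>>>
>>>>>>>> in python the whole cycle is only a few lines (the url and the
>>>>>>>> table layout are invented for this sketch):
>>>>>>>>
>>>>>>>> import csv, io, sqlite3, urllib.request, zipfile
>>>>>>>>
>>>>>>>> # 1) download the latest zipped csv dump via the web service
>>>>>>>> url = 'http://example.org/dumps/resource.zip'  # hypothetical
>>>>>>>> data = urllib.request.urlopen(url).read()
>>>>>>>> with zipfile.ZipFile(io.BytesIO(data)) as z:
>>>>>>>>     rows = csv.reader(io.TextIOWrapper(z.open('resource.csv')))
>>>>>>>>     cache = sqlite3.connect('cache.sqlite')
>>>>>>>>     cache.execute(
>>>>>>>>         'create table if not exists records (id text, name text)')
>>>>>>>>     cache.execute('delete from records')  # 2) flush old records
>>>>>>>>     # 3) bulk load the latest csv data into the cache
>>>>>>>>     cache.executemany('insert into records values (?, ?)', rows)
>>>>>>>>     cache.commit()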
>>>>>>>>
>>>>>>>> in this way, cached data are always synchronized with the
>>>>>>>> resource and there's no need to track new, deleted, or changed
>>>>>>>> records. as an aside, each time these cached data are queried by
>>>>>>>> the caching application or selected in the user interface,
>>>>>>>> log-only search requests are sent back to the resource.
>>>>>>>>
>>>>>>>> after discussion with renato giovanni and john wieczorek, we've
>>>>>>>> decided that merging this functionality into the tapirlink
>>>>>>>> codebase would benefit the broader community. csv generation
>>>>>>>> support would be declared through capabilities. although
>>>>>>>> incremental harvesting wouldn't be immediately implemented, we
>>>>>>>> could certainly extend the service to include it later.
>>>>>>>>
>>>>>>>> i'd like to pause here to gauge the consensus, thoughts,
>>>>>>>> concerns, and
>>>>>>>> ideas of others. anyone?
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>> aaron
>>>>>>>>
>>>>>>>> 2008/5/5 Kevin Richards <RichardsK at landcareresearch.co.nz>:
>>>>>>>>>
>>>>>>>>> I think I agree here.
>>>>>>>>>
>>>>>>>>> The harvesting "procedure" is really defined outside the Tapir
>>>>>>>>> protocol, is it not?  So it is really an agreement between the
>>>>>>>>> harvester and the harvestees.
>>>>>>>>>
>>>>>>>>> So what is really needed here is a standard procedure for
>>>>>>>>> maintaining a "harvestable" dataset and a standard procedure
>>>>>>>>> for harvesting that dataset.
>>>>>>>>> We have a general rule at Landcare that we never delete records
>>>>>>>>> in our datasets - they are either deprecated in favour of
>>>>>>>>> another record, in which case resolution of the old record
>>>>>>>>> points to the new one, or they are set to a state of "deleted"
>>>>>>>>> but kept in the dataset and can still be resolved (which would
>>>>>>>>> indicate the deleted state).
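>>>>>>>>>
>>>>>>>>> A rough sketch of that resolution behaviour (the field names
>>>>>>>>> are invented, not our actual schema):
>>>>>>>>>
>>>>>>>>> def resolve(records, record_id):
>>>>>>>>>     rec = records[record_id]
>>>>>>>>>     if rec.get('status') == 'deprecated':
>>>>>>>>>         # follow the deprecation pointer to the current record
>>>>>>>>>         return resolve(records, rec['replacedBy'])
>>>>>>>>>     return rec  # 'deleted' records still resolve, flagged as such
>>>>>>>>>
>>>>>>>>> records = {
>>>>>>>>>     '1': {'status': 'deprecated', 'replacedBy': '2'},
>>>>>>>>>     '2': {'status': 'current', 'name': 'Aster alpinus'},
>>>>>>>>>     '3': {'status': 'deleted'},
>>>>>>>>> }
>>>>>>>>> print(resolve(records, '1'))  # follows the pointer to record 2
>>>>>>>>> print(resolve(records, '3'))  # resolvable, but marked deleted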
>>>>>>>>>
>>>>>>>>> Kevin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> "Renato De Giovanni" <renato at cria.org.br> 6/05/2008 7:33  
>>>>>>>>>>>> a.m.
>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Markus,
>>>>>>>>>
>>>>>>>>> I would suggest creating new concepts for incremental
>>>>>>>>> harvesting, either in the data standards themselves or in some
>>>>>>>>> new extension. In the case of TAPIR, GBIF could easily check
>>>>>>>>> the mapped concepts before deciding between incremental and
>>>>>>>>> full harvesting.
>>>>>>>>>
>>>>>>>>> Actually it could be just one new concept, such as
>>>>>>>>> "recordStatus" or "deletionFlag". Or perhaps you might also
>>>>>>>>> want to create your own definition of dateLastModified
>>>>>>>>> indicating which set of concepts should be considered when
>>>>>>>>> deciding whether something has changed, but I guess this level
>>>>>>>>> of granularity would be difficult to support.
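>>>>>>>>>
>>>>>>>>> A minimal sketch of what a consumer could do with such a
>>>>>>>>> concept (the "recordStatus" name is only the suggestion above,
>>>>>>>>> nothing standardised):
>>>>>>>>>
>>>>>>>>> def apply_increment(cache, harvested):
>>>>>>>>>     for record in harvested:
>>>>>>>>>         if record.get('recordStatus') == 'deleted':
>>>>>>>>>             cache.pop(record['id'], None)  # drop deleted records
>>>>>>>>>         else:
>>>>>>>>>             cache[record['id']] = record   # insert or update
>>>>>>>>>
>>>>>>>>> cache = {}
>>>>>>>>> apply_increment(cache, [{'id': '102', 'name': 'Aster alpinus'}])
>>>>>>>>> apply_increment(cache, [{'id': '102', 'recordStatus': 'deleted'}])
>>>>>>>>> print(cache)  # empty again: 102 was removed by the second pass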
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> --
>>>>>>>>> Renato
>>>>>>>>>
>>>>>>>>> On 5 May 2008 at 11:24, Markus Döring wrote:
>>>>>>>>>
>>>>>>>>>> Phil,
>>>>>>>>>> incremental harvesting is not implemented on the GBIF side as
>>>>>>>>>> far as I am aware, and I don't think it would be a simple
>>>>>>>>>> thing to implement on the current system. Also, even if we
>>>>>>>>>> can detect only the records changed since the last harvesting
>>>>>>>>>> via dateLastModified, we still have no information about
>>>>>>>>>> deletions. We could have an arrangement saying that you keep
>>>>>>>>>> deleted records as empty records with just the ID and nothing
>>>>>>>>>> else (I vaguely remember LSIDs were supposed to work like
>>>>>>>>>> this too). But that would also need to be supported on your
>>>>>>>>>> side, never entirely removing any record. I will discuss this
>>>>>>>>>> with the others at GBIF.
>>>>>>>>>>
>>>>>>>>>> Markus
>>>>>
>>>>> --
>>>>> Australian Centre for Plant BIodiversity Research<--------------+
>>>>> National         greg whitBread           voice: +61 2 62509 482
>>>>> Botanic  Integrated Botanical Information System fax: +61 2 62509 599
>>>>> Gardens           S........ I.T. happens..      ghw at anbg.gov.au
>>>>> +---------------------------------------->GPO Box 1777 Canberra 2601
> _______________________________________________
> tdwg-tapir mailing list
> tdwg-tapir at lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg-tapir
>



