[tdwg-tapir] Fwd: Tapir protocol - Harvest methods? [SEC=UNCLASSIFIED]
Javier de la Torre
jatorre at gmail.com
Wed May 14 16:19:32 CEST 2008
This discussion is starting to remind me of another one in the Google
App Engine discussion group. They talk about different ways to bulk
upload data to their BigTable database.
http://groups.google.com/group/google-appengine/browse_thread/thread/18d246b30e267da4
I have read so far:
-XML
-CSV
-RDF
-JSON
-AMF
-SQL
-OOXML
-TSV
Uff so many ideas...
I would take whatever Google finally decides, as it will probably
become a de facto standard :D
The discussion is funny :D
Cheers.
On Wed, May 14, 2008 at 4:04 PM, Dave Vieglais <vieglais at ku.edu> wrote:
> Perhaps it could be put into some form of xml to preserve the
> relational model? Maybe a mechanism could be developed so that others
> could access the xml as well. How about even putting some sort of
> subsetting mechanism so that entire data sets need not be retrieved.
>
> just a thought...
>
>
> On Wed, May 14, 2008 at 9:25 AM, Aaron D. Steele <eightysteele at gmail.com> wrote:
>> for preserving relational data, we could also just dump tapirlink
>> resources to an sqlite database file (http://www.sqlite.org), zip it
>> up, and again make it available via the web service. we use sqlite
>> internally for many projects, and it's both easy to use and well
>> supported by jdbc, php, python, etc.
>>
>> would something like this be a useful option?
>>
>> thanks,
>> aaron
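[A minimal sketch of the SQLite idea above, in Python: load a resource already
exported as CSV into a single-file SQLite database and zip it for the web
service. File and table names here are hypothetical, not part of any proposal.]

```python
import csv
import sqlite3
import zipfile

def dump_to_sqlite(csv_path, db_path):
    """Load a CSV export into a one-table SQLite database file."""
    conn = sqlite3.connect(db_path)
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join('"%s"' % c for c in header)
        marks = ", ".join("?" for _ in header)
        conn.execute("CREATE TABLE records (%s)" % cols)
        conn.executemany("INSERT INTO records VALUES (%s)" % marks, reader)
    conn.commit()
    conn.close()

def zip_file(path, zip_path):
    """Zip the database file so it can be served for download."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as z:
        z.write(path)
```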
>>
>> On Wed, May 14, 2008 at 2:21 AM, Markus Döring <mdoering at gbif.org> wrote:
>> > Interesting that we all come to the same conclusions...
>> > The trouble I had with just a simple flat CSV file is repeated
>> > properties like multiple image URLs. ABCD clients don't use ABCD
>> > because it's complex, but because they want to transport this
>> > relational data. We were considering 2 solutions for extending the
>> > CSV approach. The first would be a single large denormalised CSV
>> > file with many rows for the same record. It would require knowledge
>> > about the related entities, though, and could grow in size rapidly.
>> > The second idea, which we intend to adopt, is allowing a single
>> > level of 1-many related entities. It is basically a "star" design
>> > with the core DwC table in the centre and any number of extension
>> > tables around it. Each "table", aka CSV file, will have the record
>> > id as its first column, so the files can be related easily and only
>> > a single identifier is needed per record, not one per extension
>> > entity. This would give a lot of flexibility while keeping things
>> > pretty simple to deal with. It would even satisfy the ABCD needs,
>> > as I haven't yet seen anyone requiring 2 levels of related tables
>> > (other than lookup tables). Those extensions could even be a simple
>> > 1-1 relation, but would keep things semantically together, just
>> > like an XML namespace. The Darwin Core extensions would be a good
>> > example.
>> >
>> > So we could have a gzipped set of files, maybe with a simple metafile
>> > indicating the semantics of the columns for each file.
>> > An example could look like this:
>> >
>> >
>> > # darwincore.csv
>> > 102 Aster alpinus subsp. parviceps ...
>> > 103 Polygala vulgaris ...
>> >
>> > # curatorial.csv
>> > 102 Kew Herbarium
>> > 103 Reading Herbarium
>> >
>> > # identification.csv
>> > 102 2003-05-04 Karl Marx Aster alpinus L.
>> > 102 2007-01-11 Mark Twain Aster korshinskyi Tamamsch.
>> > 102 2007-09-13 Roger Hyam Aster alpinus subsp. parviceps Novopokr.
>> > 103 2001-02-21 Steve Bekow Polygala vulgaris L.
>> >
>> >
>> >
>> > I know this looks old fashioned, but it is just so simple and gives us
>> > so much flexibility.
>> > Markus
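[A consumer of such a star-shaped dump only has to group extension rows under
the id in the first column. A sketch in Python, using tab-separated data taken
from the example above; `read_star` and its layout are illustrative only:]

```python
import csv
from collections import defaultdict

# Core file and one extension file, first column is the record id.
core = "102\tAster alpinus subsp. parviceps\n103\tPolygala vulgaris\n"
identification = (
    "102\t2003-05-04\tKarl Marx\tAster alpinus L.\n"
    "102\t2007-01-11\tMark Twain\tAster korshinskyi Tamamsch.\n"
    "103\t2001-02-21\tSteve Bekow\tPolygala vulgaris L.\n"
)

def read_star(core_text, extensions):
    """Group extension rows under the matching core record by id."""
    records = {}
    for row in csv.reader(core_text.splitlines(), delimiter="\t"):
        records[row[0]] = {"core": row[1:], "ext": defaultdict(list)}
    for name, text in extensions.items():
        for row in csv.reader(text.splitlines(), delimiter="\t"):
            records[row[0]]["ext"][name].append(row[1:])
    return records

records = read_star(core, {"identification": identification})
print(len(records["102"]["ext"]["identification"]))  # 2
```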
>> >
>> >
>> >
>> >
>> > On 14 May, 2008, at 24:39, Greg Whitbread wrote:
>> >
>> > > We have used a very similar protocol to assemble the latest AVH cache.
>> > > It should be noted that this is an as-well-as protocol that only works
>> > > because we have an established semantic standard (hispid/abcd).
>> > >
>> > > greg
>> > >
>> > > trobertson at gbif.org wrote:
>> > >> Hi All,
>> > >>
>> > >> This is very interesting to me, as I reached the same conclusion
>> > >> while harvesting for GBIF.
>> > >>
>> > >> As a "harvester of all records" it is best described with an example:
>> > >>
>> > >> - Complete inventory of ScientificNames: 7 minutes @ the limited
>> > >>   200 records per page
>> > >> - Complete harvesting of records:
>> > >>   - 260,000 records
>> > >>   - 9 hours harvesting duration
>> > >>   - 500MB TAPIR+DwC XML returned (DwC 1.4 with geospatial and
>> > >>     curatorial extensions)
>> > >> - Extraction of DwC records from harvested XML: <2 minutes
>> > >> - Resulting file size 32MB, gzipped to <3MB
>> > >>
>> > >> I spun hard drives for 9 hours, and took up bandwidth that is paid
>> > >> for, to
>> > >> retrieve something that could have been generated provider side in
>> > >> minutes
>> > >> and transferred in seconds (3MB).
>> > >>
>> > >> I sent a proposal to TDWG last year termed "datamaps", which was
>> > >> effectively what you are describing; I based it on the Sitemaps
>> > >> protocol, but got nowhere with it. With Markus, we are making more
>> > >> progress: I have spoken with several GBIF data providers about a
>> > >> proposed new standard for full dataset harvesting and it has been
>> > >> well received. So Markus and I have started a new proposal with
>> > >> the working name of 'Localised DwC Index' file generation (it is
>> > >> an index if you have more than DwC data, and plain DwC is still
>> > >> standards compliant), which is really a gzipped tab file dump of
>> > >> the data, slightly extensible. The document is not ready to
>> > >> circulate yet, but the benefits section currently reads:
>> > >>
>> > >> - Provider database load is reduced, allowing it to serve real
>> > >>   distributed queries rather than "full datasource" harvesters
>> > >> - Providers can choose to publish their index as it suits them,
>> > >>   giving control back to the provider
>> > >> - Localised index generation can be built into tools not yet
>> > >>   capable of integrating with TDWG protocol networks such as GBIF
>> > >> - Harvesters receive a full dataset view in one request, making
>> > >>   it very easy to determine which records are eligible for
>> > >>   deletion
>> > >> - It becomes very simple to write clients that consume entire
>> > >>   datasets, e.g. data cleansing tools that the provider can run:
>> > >>   - "Give me ISO country codes for my dataset": the application
>> > >>     pulls down the provider's index file, generates ISO country
>> > >>     codes, and returns a simple table using the provider's own
>> > >>     identifiers
>> > >>   - "Check my names for spelling mistakes": the application skims
>> > >>     over the records and returns a list of names not known to it
>> > >> - Providers such as the UK NBN cannot efficiently serve 20
>> > >>   million records to the GBIF index using the existing protocols,
>> > >>   but they can generate a localised index
>> > >> - Harvesters can very quickly build up searchable indexes, and it
>> > >>   is easy to create large indices
>> > >> - The Node Portal can easily aggregate index data files
>> > >> - A true index to the data, not the illusion of a cache; more
>> > >>   like Google Sitemaps
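[Producing such an index file would be trivial on the provider side. A sketch
in Python of writing one gzipped, tab-separated file with the provider's own
record id in the first column; the column set is purely illustrative:]

```python
import gzip

def write_index(path, header, rows):
    """Write a gzipped tab-separated index file, one row per record."""
    with gzip.open(path, "wt") as f:
        f.write("\t".join(header) + "\n")
        for row in rows:
            f.write("\t".join(row) + "\n")

write_index(
    "index.txt.gz",
    ("id", "scientificName", "countryCode"),
    [("102", "Aster alpinus subsp. parviceps", "AT"),
     ("103", "Polygala vulgaris", "GB")],
)
```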
>> > >>
>> > >> It is the ease with which one can offer tools to data providers
>> > >> that really interests me. The technical threshold required to
>> > >> produce services that offer reporting tools on people's data is
>> > >> really very low with this mechanism. That, and the fact that
>> > >> large datasets become harvestable - we have even considered the
>> > >> likes of BitTorrent for the largest ones, although I think that
>> > >> is overkill.
>> > >>
>> > >> As a consumer therefore I fully support this move as a valuable
>> > >> addition
>> > >> to the wrapper tools.
>> > >>
>> > >> Cheers
>> > >>
>> > >> Tim
>> > >> (wrote the GBIF harvesting, and new to this list)
>> > >>
>> > >>
>> > >>>
>> > >>> Begin forwarded message:
>> > >>>
>> > >>>> From: "Aaron D. Steele" <eightysteele at gmail.com>
>> > >>>> Date: 13 de mayo de 2008 22:40:09 GMT+02:00
>> > >>>> To: tdwg-tapir at lists.tdwg.org
>> > >>>> Cc: Aaron Steele <asteele at berkeley.edu>
>> > >>>> Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?
>> > >>>>
>> > >>>> at berkeley we've recently prototyped a simple php program that
>> > >>>> uses
>> > >>>> an existing tapirlink installation to periodically dump tapir
>> > >>>> resources into a csv file. the solution is totally generic and can
>> > >>>> dump darwin core (and technically abcd schema, although it's
>> > >>>> currently
>> > >>>> untested). the resulting csv files are zip archived and made
>> > >>>> accessible using a web service. it's a simple approach that has
>> > >>>> proven
>> > >>>> to be, at least internally, quite reliable and useful.
>> > >>>>
>> > >>>> for example, several of our caching applications use the web
>> > >>>> service
>> > >>>> to harvest csv data from tapirlink resources using the following
>> > >>>> process:
>> > >>>> 1) download latest csv dump for a resource using the web service.
>> > >>>> 2) flush all locally cached records for the resource.
>> > >>>> 3) bulk load the latest csv data into the cache.
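[The three steps above can be sketched in Python against a hypothetical local
cache table `cache(resource_id, record_id, data)`; the dump is assumed to be a
zip archive containing one CSV file whose first column is the record id. All
names here are illustrative, not part of the TapirLink code:]

```python
import csv
import sqlite3
import urllib.request
import zipfile

def sync_resource(cache_db, resource_id, dump_url):
    # 1) download the latest csv dump for the resource
    zip_path, _ = urllib.request.urlretrieve(dump_url)
    with zipfile.ZipFile(zip_path) as z:
        name = z.namelist()[0]
        with z.open(name) as f:
            rows = list(csv.reader(line.decode("utf-8") for line in f))
    conn = sqlite3.connect(cache_db)
    # 2) flush all locally cached records for the resource
    conn.execute("DELETE FROM cache WHERE resource_id = ?", (resource_id,))
    # 3) bulk load the latest csv data into the cache
    conn.executemany(
        "INSERT INTO cache (resource_id, record_id, data) VALUES (?, ?, ?)",
        ((resource_id, r[0], "\t".join(r[1:])) for r in rows),
    )
    conn.commit()
    conn.close()
```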
>> > >>>>
>> > >>>> in this way, cached data are always synchronized with the
>> > >>>> resource and
>> > >>>> there's no need to track new, deleted, or changed records. as an
>> > >>>> aside, each time these cached data are queried by the caching
>> > >>>> application or selected in the user interface, log-only search
>> > >>>> requests are sent back to the resource.
>> > >>>>
>> > >>>> after discussion with renato giovanni and john wieczorek, we've
>> > >>>> decided that merging this functionality into the tapirlink codebase
>> > >>>> would benefit the broader community. csv generation support would
>> > >>>> be
>> > >>>> declared through capabilities. although incremental harvesting
>> > >>>> wouldn't be immediately implemented, we could certainly extend the
>> > >>>> service to include it later.
>> > >>>>
>> > >>>> i'd like to pause here to gauge the consensus, thoughts,
>> > >>>> concerns, and
>> > >>>> ideas of others. anyone?
>> > >>>>
>> > >>>> thanks,
>> > >>>> aaron
>> > >>>>
>> > >>>> 2008/5/5 Kevin Richards <RichardsK at landcareresearch.co.nz>:
>> > >>>>>
>> > >>>>> I think I agree here.
>> > >>>>>
>> > >>>>> The harvesting "procedure" is really defined outside the Tapir
>> > >>>>> protocol, is
>> > >>>>> it not? So it is really an agreement between the harvester and
>> > >>>>> the
>> > >>>>> harvestees.
>> > >>>>>
>> > >>>>> So what is really needed here is the standard procedure for
>> > >>>>> maintaining a
>> > >>>>> "harvestable" dataset and the standard procedure for harvesting
>> > >>>>> that
>> > >>>>> dataset.
>> > >>>>> We have a general rule at Landcare that we never delete
>> > >>>>> records in our datasets - they are either deprecated in favour
>> > >>>>> of another record, so that resolution of the old record points
>> > >>>>> to the new one, or they are set to a state of "deleted" but
>> > >>>>> kept in the dataset, and can still be resolved (which would
>> > >>>>> indicate the deleted state).
>> > >>>>>
>> > >>>>> Kevin
>> > >>>>>
>> > >>>>>
>> > >>>>>>>> "Renato De Giovanni" <renato at cria.org.br> 6/05/2008 7:33 a.m. >>>
>> > >>>>>
>> > >>>>> Hi Markus,
>> > >>>>>
>> > >>>>> I would suggest creating new concepts for incremental harvesting,
>> > >>>>> either in the data standards themselves or in some new
>> > >>>>> extension. In
>> > >>>>> the case of TAPIR, GBIF could easily check the mapped concepts
>> > >>>>> before
>> > >>>>> deciding between incremental or full harvesting.
>> > >>>>>
>> > >>>>> Actually it could be just one new concept, such as
>> > >>>>> "recordStatus" or "deletionFlag". Or perhaps you could also
>> > >>>>> want to create your own definition for dateLastModified,
>> > >>>>> indicating which set of concepts should be considered when
>> > >>>>> deciding whether something has changed, but I guess this level
>> > >>>>> of granularity would be difficult to support.
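[In code, the check Renato describes is just an inspection of the provider's
mapped concepts before choosing a harvest strategy. A sketch; the concept
names are the hypothetical new ones, not existing TAPIR concepts:]

```python
# Concepts a provider must map for incremental harvesting to be safe:
# change detection plus deletion flagging.
INCREMENTAL_CONCEPTS = {"dateLastModified", "deletionFlag"}

def choose_harvest_mode(mapped_concepts):
    """Pick incremental harvesting only when the provider maps the
    concepts needed to detect both changes and deletions."""
    if INCREMENTAL_CONCEPTS.issubset(mapped_concepts):
        return "incremental"  # changes and deletions are detectable
    return "full"  # fall back to harvesting the whole dataset

print(choose_harvest_mode({"scientificName", "dateLastModified", "deletionFlag"}))  # incremental
print(choose_harvest_mode({"scientificName", "dateLastModified"}))  # full
```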
>> > >>>>>
>> > >>>>> Regards,
>> > >>>>> --
>> > >>>>> Renato
>> > >>>>>
>> > >>>>> On 5 May 2008 at 11:24, Markus Döring wrote:
>> > >>>>>
>> > >>>>>> Phil,
>> > >>>>>> incremental harvesting is not implemented on the GBIF side,
>> > >>>>>> as far as I am aware, and I don't think it will be a simple
>> > >>>>>> thing to implement on the current system. Also, even if we
>> > >>>>>> can detect only the records changed since the last harvesting
>> > >>>>>> via dateLastModified, we still have no information about
>> > >>>>>> deletions. We could have an arrangement saying that you keep
>> > >>>>>> deleted records as empty records with just the ID and nothing
>> > >>>>>> else (I vaguely remember LSIDs were supposed to work like
>> > >>>>>> this too). But that also needs to be supported on your side
>> > >>>>>> then, never entirely removing any record. I will discuss this
>> > >>>>>> with the others at GBIF.
>> > >>>>>>
>> > >>>>>> Markus
>> > >>>>> _______________________________________________
>> > >>>>> tdwg-tapir mailing list
>> > >>>>> tdwg-tapir at lists.tdwg.org
>> > >>>>> http://lists.tdwg.org/mailman/listinfo/tdwg-tapir
>> > >>>
>> > >>
>> > >>
>> > >
>> > > --
>> > >
>> > > Australian Centre for Plant Biodiversity Research<------------------+
>> > > National greg whitBread voice: +61 2 62509 482
>> > > Botanic Integrated Botanical Information System fax: +61 2 62509 599
>> > > Gardens S........ I.T. happens.. ghw at anbg.gov.au
>> > > +----------------------------------------->GPO Box 1777 Canberra 2601
>> > >
>> > >
>> > >
>> > >