Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]

14 May 2008

      Another interesting problem you touch on...

Take the GBIF Index.  People want a country "slice" of the data.  The SQL to
slice up the data on occurrences is fine, but then what about the taxonomy
stuff - do you throw out the stuff that is not relevant to the sliced
region? What about sub-selecting only the regional common names, etc.?

I think you will be unlikely to generically come up with subsets of DB dumps
without specific model knowledge, but I'd be interested to hear if you do!!!
I think you'd have to basically do an interceptor that does a pre-select -
probably also a chained up sequence of post-SQL's  - no?
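
To make the "pre-select plus chained post-SQL" idea concrete, here is a minimal sketch in Python with sqlite3. The three-table schema, the column names and the sample rows are all assumptions made up for illustration; they are not the real GBIF Index model. It only shows that each follow-up query needs to know how the tables relate, which is the model knowledge mentioned above.

```python
import sqlite3

# Hypothetical schema and sample rows, invented for this sketch only.
SCHEMA_AND_SAMPLE_DATA = """
CREATE TABLE taxon (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE occurrence (id INTEGER PRIMARY KEY, taxon_id INTEGER, iso_country TEXT);
CREATE TABLE common_name (taxon_id INTEGER, name TEXT, iso_country TEXT);
INSERT INTO taxon VALUES (1, 'Aster alpinus'), (2, 'Polygala vulgaris');
INSERT INTO occurrence VALUES (10, 1, 'DE'), (11, 2, 'GB'), (12, 2, 'GB');
INSERT INTO common_name VALUES
    (1, 'Alpen-Aster', 'DE'), (2, 'Common milkwort', 'GB'), (2, 'Kreuzblume', 'DE');
"""


def slice_country(conn, iso_country):
    """Return the occurrence slice plus only the taxonomy it still needs."""
    # Pre-select: the occurrences that fall inside the requested country.
    occurrences = conn.execute(
        "SELECT id, taxon_id FROM occurrence WHERE iso_country = ? ORDER BY id",
        (iso_country,),
    ).fetchall()
    # Post-SQL 1: keep only taxa referenced by the sliced occurrences.
    taxa = conn.execute(
        "SELECT id, name FROM taxon WHERE id IN "
        "(SELECT taxon_id FROM occurrence WHERE iso_country = ?) ORDER BY id",
        (iso_country,),
    ).fetchall()
    # Post-SQL 2: keep only the regional common names of those taxa.
    common_names = conn.execute(
        "SELECT taxon_id, name FROM common_name "
        "WHERE iso_country = ? AND taxon_id IN "
        "(SELECT taxon_id FROM occurrence WHERE iso_country = ?) "
        "ORDER BY taxon_id, name",
        (iso_country, iso_country),
    ).fetchall()
    return {"occurrences": occurrences, "taxa": taxa, "common_names": common_names}


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA_AND_SAMPLE_DATA)
gb_slice = slice_country(conn, "GB")
```

Every extra table in the model would need its own post-select written by someone who knows the foreign keys, so this does not generalise without that knowledge.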

-----Original Message-----
From: tdwg-tapir-bounces@lists.tdwg.org
[mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Dave Vieglais
Sent: Wednesday, May 14, 2008 4:05 PM
To: Aaron D. Steele
Cc: Aaron Steele; tdwg-tapir@lists.tdwg.org
Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest
methods?[SEC=UNCLASSIFIED]

Perhaps it could be put into some form of xml to preserve the relational
model?  Maybe a mechanism could be developed so that others could access the
xml as well.  How about even adding some sort of subsetting mechanism so
that entire data sets need not be retrieved?

just a thought...

On Wed, May 14, 2008 at 9:25 AM, Aaron D. Steele <eightysteele@gmail.com>
wrote:
...
for preserving relational data, we could also just dump tapirlink  
resources to an sqlite database file (http://www.sqlite.org), zip it  
up, and again make it available via the web service. we use sqlite  
internally for many projects, and it's both easy to use and well  
supported by jdbc, php, python, etc.
would something like this be a useful option?
thanks,
 aaron
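
A rough sketch of what such an sqlite dump could look like, using only the Python standard library. The table layout, the file names and the helper names (`dump_to_sqlite_zip`, `count_identifications`) are assumptions for illustration and are not part of tapirlink; the web service that would serve the zip is left out.

```python
import os
import sqlite3
import tempfile
import zipfile


def dump_to_sqlite_zip(records, identifications, db_path, zip_path):
    """Write two related tables to an SQLite file, then zip that file."""
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(
            """
            CREATE TABLE record (id INTEGER PRIMARY KEY, scientific_name TEXT);
            CREATE TABLE identification (
                record_id INTEGER REFERENCES record(id),
                identified_on TEXT,
                name TEXT
            );
            """
        )
        conn.executemany("INSERT INTO record VALUES (?, ?)", records)
        conn.executemany(
            "INSERT INTO identification VALUES (?, ?, ?)", identifications
        )
        conn.commit()
    finally:
        conn.close()
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(db_path, arcname=os.path.basename(db_path))


def count_identifications(zip_path, work_dir):
    """Consumer side: unzip the dump and query the relations with plain SQL."""
    os.makedirs(work_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        name = archive.namelist()[0]
        archive.extract(name, work_dir)
    conn = sqlite3.connect(os.path.join(work_dir, name))
    try:
        return conn.execute(
            "SELECT r.scientific_name, COUNT(i.record_id) "
            "FROM record r LEFT JOIN identification i ON i.record_id = r.id "
            "GROUP BY r.id ORDER BY r.id"
        ).fetchall()
    finally:
        conn.close()


def demo():
    records = [(102, "Aster alpinus subsp. parviceps"), (103, "Polygala vulgaris")]
    identifications = [
        (102, "2003-05-04", "Aster alpinus L."),
        (102, "2007-01-11", "Aster korshinskyi Tamamsch."),
        (103, "2001-02-21", "Polygala vulgaris L."),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "resource.sqlite")
        zip_path = os.path.join(tmp, "resource.zip")
        dump_to_sqlite_zip(records, identifications, db_path, zip_path)
        return count_identifications(zip_path, os.path.join(tmp, "unpacked"))


summary = demo()
```

The one-to-many relation survives the round trip, so a consumer can join the tables directly instead of re-assembling them from flat rows.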
On Wed, May 14, 2008 at 2:21 AM, Markus Döring <mdoering@gbif.org> wrote:
...
Interesting that we all come to the same conclusions...

The trouble I had with just a simple flat csv file is repeating properties
like multiple image urls. ABCD clients don't use ABCD just because it's
complex, but because they want to transport this relational data. We were
considering 2 solutions to extending this csv approach. The first would be
to have a single large denormalised csv file with many rows for the same
record. It would require knowledge about the related entities though and
could grow in size rapidly. The second idea which we think to adopt is
allowing a single level of 1-many related entities. It is basically a
"star" design with the core dwc table in the center and any number of
extension tables around it.

Each "table" aka csv file will have the record id as the first column, so
the files can be related easily and it only needs a single identifier per
record and not for the extension entities. This would give a lot of
flexibility while keeping things pretty simple to deal with. It would even
satisfy the ABCD needs as I haven't yet seen anyone requiring 2 levels of
related tables (other than lookup tables). Those extensions could even be a
simple 1-1 relation, but would keep things semantically together just like
an xml namespace. The darwin core extensions would be good for example.

So we could have a gzipped set of files, maybe with a simple metafile
indicating the semantics of the columns for each file.

An example could look like this:
# darwincore.csv
 102    Aster alpinus subsp. parviceps    ...
 103    Polygala vulgaris    ...
# curatorial.csv
 102    Kew Herbarium
 103    Reading Herbarium
# identification.csv
 102    2003-05-04    Karl Marx    Aster alpinus L.
 102    2007-01-11    Mark Twain    Aster korshinskyi Tamamsch.
 102    2007-09-13    Roger Hyam    Aster alpinus subsp. parviceps Novopokr.
 103    2001-02-21    Steve Bekow    Polygala vulgaris L.
I know this looks old fashioned, but it is just so simple and gives us
so much flexibility.
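
A small sketch of how a client might relate the files in this "star" layout, keyed on the record id in the first column. It assumes the files are tab-delimited and has the sample rows inline as strings; the function names and the dictionary shape are made up for illustration and are not part of any agreed format.

```python
import csv
import io

# Sample rows taken from the example above, assumed to be tab-delimited.
CORE = "102\tAster alpinus subsp. parviceps\n103\tPolygala vulgaris\n"
EXTENSIONS = {
    "curatorial": "102\tKew Herbarium\n103\tReading Herbarium\n",
    "identification": (
        "102\t2003-05-04\tKarl Marx\tAster alpinus L.\n"
        "102\t2007-01-11\tMark Twain\tAster korshinskyi Tamamsch.\n"
        "102\t2007-09-13\tRoger Hyam\tAster alpinus subsp. parviceps Novopokr.\n"
        "103\t2001-02-21\tSteve Bekow\tPolygala vulgaris L.\n"
    ),
}


def read_rows(text):
    """Parse one tab-delimited file into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def join_star(core_text, extensions):
    """Attach every extension row to its core record via the first column."""
    records = {}
    for row in read_rows(core_text):
        records[row[0]] = {"core": row[1:]}
        for name in extensions:
            records[row[0]][name] = []
    for name, text in extensions.items():
        for row in read_rows(text):
            # Extension rows without a matching core record are skipped.
            if row[0] in records:
                records[row[0]][name].append(row[1:])
    return records


records = join_star(CORE, EXTENSIONS)
```

Record 102 ends up with three identification rows and one curatorial row, without the extension rows needing identifiers of their own.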
 Markus
On 14 May, 2008, at 24:39, Greg Whitbread wrote:
...
We have used a very similar protocol to assemble the latest AVH cache.
It should be noted that this is an as-well-as protocol that only works
because we have an established semantic standard (hispid/abcd).
greg
trobertson@gbif.org wrote:
...
Hi All,

This is very interesting to me, as I came up with the same conclusion
while harvesting for GBIF.

As a "harvester of all records" it is best described with an example:

- Complete Inventory of ScientificNames: 7 minutes @ the limited 200
  records per page
- Complete Harvesting of records:
  - 260,000 records
  - 9 hours harvesting duration
  - 500MB TAPIR+DwC XML returned (DwC 1.4 with geospatial and curatorial
    extensions)
- Extraction of DwC records from harvested XML: <2 minutes
  - Resulting file size 32MB, Gzipped to <3MB

I spun hard drives for 9 hours, and took up bandwidth that is paid for, to
retrieve something that could have been generated provider side in minutes
and transferred in seconds (3MB).

I sent a proposal to TDWG last year termed "datamaps" which was
effectively what you are describing, and I based it on the Sitemaps
protocol, but I got nowhere with it.  With Markus, we are making more
progress and I have spoken with several GBIF data providers about a
proposed new standard for full dataset harvesting and it has been received
well.  So Markus and I have started a new proposal and have a working name
of 'Localised DwC Index' file generation (it is an index if you have more
than DwC data, and DwC is still standards compliant) which is really a
GZipped Tab file dump of the data, which is slightly extensible.  The
document is not ready to circulate yet but the benefits section reads
currently:

- Provider database load reduced, allowing it to serve real distributed
  queries rather than "full datasource" harvesters
- Providers can choose to publish their index as it suits them, giving
  control back to the provider
- Localised index generation can be built into tools not yet capable of
  integrating with TDWG protocol networks such as GBIF
- Harvesters receive a full dataset view in one request, making it very
  easy to determine what records are eligible for deletion
- It becomes very simple to write clients that consume entire datasets.
  E.g. data cleansing tools that the provider can run:
  - Give me ISO Country Codes for my dataset
    - The application pulls down the provider's index file, generates ISO
      country code, returns a simple table using the provider's own
      identifier
  - Check my names for spelling mistakes
    - Application skims over the records and provides a list that are not
      known to the application
- Providers such as UK NBN cannot serve 20 million records to the GBIF
  index using the existing protocols efficiently.
  - They have the ability to generate a localised index however
- Harvesters can very quickly build up searchable indexes and it is easy
  to create large indices.
  - Node Portal can easily aggregate index data files
- true index to data, not an illusion of a cache. More like Google
  sitemaps

It is the ease with which one can offer tools to data providers that
really interests me.  The technical threshold required to produce services
that offer reporting tools on people's data is really very low with this
mechanism.  This and the fact that large datasets will be harvestable - we
have even considered the likes of bit-torrent for the large ones although
I think this is overkill.
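
To illustrate the deletion point from the benefits list: once the harvester holds the full dump, finding deleted records is a set difference on identifiers. This is only a sketch under assumptions: the dump is taken to be a gzipped tab file with the record id in the first column, and the function names are invented here.

```python
import gzip


def ids_in_dump(gzipped_bytes):
    """Collect the first column (record id) of a gzipped tab-delimited dump."""
    text = gzip.decompress(gzipped_bytes).decode("utf-8")
    return {line.split("\t", 1)[0] for line in text.splitlines() if line}


def diff_cache(cached_ids, dump_ids):
    """Compare the harvester's cached ids with the ids in the latest dump."""
    cached, dumped = set(cached_ids), set(dump_ids)
    return {
        "deleted": sorted(cached - dumped),  # in the cache, gone from the dump
        "added": sorted(dumped - cached),    # in the dump, not yet cached
    }


# Stand-in for a downloaded index file; a real one would come from the provider.
dump = gzip.compress(
    b"102\tAster alpinus subsp. parviceps\n103\tPolygala vulgaris\n"
)
changes = diff_cache({"101", "102"}, ids_in_dump(dump))
```

No deletion flag or tombstone is needed on the provider side for this to work, because absence from a complete dump is itself the signal.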

As a consumer therefore I fully support this move as a valuable addition
to the wrapper tools.

Cheers
Tim
(wrote the GBIF harvesting, and new to this list)
...
...
>>> Begin forwarded message:
...
From: "Aaron D. Steele" <eightysteele@gmail.com>
Date: 13 May 2008 22:40:09 GMT+02:00
To: tdwg-tapir@lists.tdwg.org
Cc: Aaron Steele <asteele@berkeley.edu>
Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?

at berkeley we've recently prototyped a simple php program that uses an
existing tapirlink installation to periodically dump tapir resources into
a csv file. the solution is totally generic and can dump darwin core (and
technically abcd schema, although it's currently untested). the resulting
csv files are zip archived and made accessible using a web service. it's a
simple approach that has proven to be, at least internally, quite reliable
and useful.

for example, several of our caching applications use the web service to
harvest csv data from tapirlink resources using the following process:
1) download latest csv dump for a resource using the web service.
2) flush all locally cached records for the resource.
3) bulk load the latest csv data into the cache.
in this way, cached data are always synchronized with the resource and
there's no need to track new, deleted, or changed records. as an aside,
each time these cached data are queried by the caching application or
selected in the user interface, log-only search requests are sent back to
the resource.
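
the three steps above could be sketched roughly as follows. this is not the berkeley php code: it is a python illustration with an assumed cache table, assumed column names and an in-memory zip standing in for the download from the web service (in practice step 1 would fetch the zip over http). the log-only search requests are not shown.

```python
import csv
import io
import sqlite3
import zipfile


def make_dump(rows):
    """Stand-in for step 1: build the zipped csv the web service would serve."""
    text = io.StringIO()
    csv.writer(text).writerows(rows)
    payload = io.BytesIO()
    with zipfile.ZipFile(payload, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("resource.csv", text.getvalue())
    return payload.getvalue()


def refresh_cache(conn, resource, zip_bytes):
    """Steps 2 and 3: flush the resource's cached rows, then bulk load the dump."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        text = archive.read(archive.namelist()[0]).decode("utf-8")
    rows = [(resource, r[0], r[1]) for r in csv.reader(io.StringIO(text)) if r]
    with conn:  # one transaction, so a failed load leaves the old cache intact
        conn.execute("DELETE FROM cache WHERE resource = ?", (resource,))
        conn.executemany("INSERT INTO cache VALUES (?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (resource TEXT, record_id TEXT, scientific_name TEXT)")
conn.execute("INSERT INTO cache VALUES ('other', '1', 'Untouched record')")

# First harvest, then a later harvest in which 101 was deleted and 103 added.
refresh_cache(conn, "herbarium", make_dump([["101", "Old name"], ["102", "Aster alpinus"]]))
loaded = refresh_cache(
    conn, "herbarium", make_dump([["102", "Aster alpinus"], ["103", "Polygala vulgaris"]])
)
cached_ids = [
    row[0]
    for row in conn.execute(
        "SELECT record_id FROM cache WHERE resource = 'herbarium' ORDER BY record_id"
    )
]
other_rows = conn.execute("SELECT COUNT(*) FROM cache WHERE resource = 'other'").fetchone()[0]
```

doing the flush and the load in one transaction is a design choice of this sketch, so readers of the cache never see it empty halfway through a refresh.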

after discussion with renato giovanni and john wieczorek, we've decided
that merging this functionality into the tapirlink codebase would benefit
the broader community. csv generation support would be declared through
capabilities. although incremental harvesting wouldn't be immediately
implemented, we could certainly extend the service to include it later.

i'd like to pause here to gauge the consensus, thoughts, concerns, and
ideas of others. anyone?

thanks,
aaron
2008/5/5 Kevin Richards <RichardsK@landcareresearch.co.nz>:
>
> I think I agree here.
>
> The harvesting "procedure" is really defined outside the Tapir protocol, is
> it not?  So it is really an agreement between the harvester and the
> harvestees.
>
> So what is really needed here is the standard procedure for maintaining a
> "harvestable" dataset and the standard procedure for harvesting that
> dataset.
>
> We have a general rule at Landcare, that we never delete records in our
> datasets - they are either deprecated in favour of another record, and so
> the resolution of that record would point to the new record, or they are
> set to a state of "deleted", but are still kept in the dataset, and can be
> resolved (which would indicate a state of deleted).
>
> Kevin
>
>
> >>> "Renato De Giovanni" <renato@cria.org.br> 6/05/2008 7:33 a.m. >>>
>
> Hi Markus,
>
> I would suggest creating new concepts for incremental harvesting, either
> in the data standards themselves or in some new extension. In the case of
> TAPIR, GBIF could easily check the mapped concepts before deciding between
> incremental or full harvesting.
>
> Actually it could be just one new concept such as "recordStatus" or
> "deletionFlag". Or perhaps you could also want to create your own
> definition for dateLastModified indicating which set of concepts should be
> considered to see if something has changed or not, but I guess this level
> of granularity would be difficult to be supported.
>
> Regards,
> --
> Renato
>
> On 5 May 2008 at 11:24, Markus Döring wrote:
>
>> Phil,
>> incremental harvesting is not implemented on the GBIF side as far as I
>> am aware. And I don't think that will be a simple thing to implement on
>> the current system. Also, even if we can detect only the changed records
>> since the last harvesting via dateLastModified we still have no
>> information about deletions. We could have an arrangement saying that you
>> keep deleted records as empty records with just the ID and nothing else
>> (I vaguely remember LSIDs were supposed to work like this too). But that
>> also needs to be supported on your side then, never entirely removing any
>> record. I will have a discussion with the others at GBIF about that.
>>
>> Markus
> _______________________________________________
> tdwg-tapir mailing list
> tdwg-tapir@lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg-tapir

--
Australian Centre for Plant BIodiversity Research<------------------+
National            greg whitBread             voice: +61 2 62509 482
Botanic Integrated Botanical Information System  fax: +61 2 62509 599
Gardens                      S........ I.T. happens.. ghw@anbg.gov.au
+----------------------------------------->GPO Box 1777 Canberra 2601

Tim Robertson