[tdwg-tapir] Fwd: Tapir protocol - Harvest methods?

Markus Döring mdoering at gbif.org
Thu May 15 11:00:27 CEST 2008

I am worried about the duplication and maintenance of multiple
metadata files and formats. TAPIR has one, NCD exists, FGDC, EML,
Dublin Core and many more. So maybe we should just add a URL to the
metadata and not even specify the format, just recommend that it be
compatible with Dublin Core? It could resolve to an RDF document, a
TAPIR metadata response or an HTML page with embedded Dublin Core
data. Then the dwc index metafile would be a true static technical
description and could be created once, if we settle on the HTTP
header approach.

Btw, with HTTP you can even specify "If-Modified-Since" in a request
header to get a "304 Not Modified" response for files that haven't
changed since that date. The HTTP 1.1 spec requires web servers to
support this. So the HTTP response could always indicate the
last-modified date, and the index file would only be returned if it
had been modified since the last request. That's pretty much all we
want, isn't it?



On 15 May, 2008, at 10:04, Tim Robertson wrote:

> Locally generated / localised DwC index files?
> (if you have rich data behind an LSID, then this file is an index
> that allows searching of those rich data using DwC fields)
> I would like to see the data file accompanied by a compulsory
> metafile that details rights, citation, contacts etc.  Whether this
> file needs the data generation timestamp I am not so sure, and the
> HTTP header approach does sound good.  It means you can do a
> one-time metafile crafting and then just cron the dump generation,
> as sketched below... This would be for institutions with IT
> resources - e.g. the UK NBN with 20M records.
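> As an aside, the kind of thing cron would run could be as simple as
> this sketch (table and file names invented):
>
> # crontab: 0 2 * * * python3 dump_index.py
> import csv, gzip, sqlite3   # any DB-API driver would do here
>
> conn = sqlite3.connect("specimens.db")   # stand-in for the real RDBMS
> rows = conn.execute(
>     "SELECT guid, scientific_name, locality FROM specimens")
> with gzip.open("dwc_index.txt.gz", "wt", newline="") as out:
>     csv.writer(out, delimiter="\t").writerows(rows)
> # the metafile is never touched - it was crafted once at set-up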
> For Joe Bloggs with a data set, if we included it in the wrapper
> tools, then it is easy to rewrite the metafile seamlessly anyway,
> so they don't care.
> Cheers,
> Tim
> -----Original Message-----
> From: Roger Hyam [mailto:rogerhyam at mac.com]
> Sent: Thursday, May 15, 2008 9:12 AM
> To: Markus Döring
> Cc: trobertson at gbif.org; tdwg-tapir at lists.tdwg.org
> Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?
> I imagine a lot of these CSV files (we need a name for them) will
> be generated by an SQL query run as a scheduled task or a cron job.
> This is good and pretty easy to automate.
> It increases the complexity of the dump process greatly if it also
> needs to update a metadata file with the new modified date every
> time. In fact it moves the set-up of the process from just being a
> configuration job in most RDBMSs to needing actual scripts to run
> and change the metadata files. The structure of the CSV file is
> constant, so the metadata file should really only be created once,
> when the process is set up.
> Could we use the modified / created dates in the HTTP headers for
> the files instead? The client just has to call a HEAD to see if the
> file has changed and get its size before deciding to download it.
> (It is amazing what you can do with good old HTTP.)
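> A sketch of that HEAD call, with a made-up URL:
>
> import urllib.request
>
> req = urllib.request.Request("http://example.org/dwc_index.txt.gz",
>                              method="HEAD")
> with urllib.request.urlopen(req) as resp:
>     print(resp.headers.get("Last-Modified"))    # has it changed?
>     print(resp.headers.get("Content-Length"))   # size in bytes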
> The only thing that is lost doing it this way is that we don't know
> the number of rows in the file, but we do know its size in bytes.
> What we gain is the ability for non-script-writing system admins to
> set up the system.
> Just a thought,
> Roger
> On 15 May 2008, at 00:00, Markus Döring wrote:
>> I agree with Tim that it would be better to keep this proposal/
>> specification separate from TAPIR. That said, it could still be
>> included in the TAPIR capabilities to indicate this feature. But an
>> important reason to have these files is to get more providers on
>> board, so they should also be able to implement this without the
>> TAPIR overhead.
>> A separate metafile would certainly also hold the timestamp of the
>> last generation of the file, so keeping that separate has
>> additional advantages.
>> Markus
>> On 14 May, 2008, at 21:44, trobertson at gbif.org wrote:
>>> Hi Renato,
>>> Do you think this really goes under the TAPIR spec?
>>> Sure, we want the wrappers to produce it, but it's just a document
>>> on a URL and can be described in such a simple way that loads of
>>> other people could incorporate it without getting into the TAPIR
>>> specs; nor can they claim any TAPIR compliance just because they
>>> can do a 'select to outfile'.
>>> I would also request that the headers aren't in the data file but
>>> in the metafile.  It is way easier to dump a big DB to this
>>> 'document standard' without needing to worry about how to get
>>> headers into a 20-gig file.
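>>> Purely to make that concrete, a made-up metafile - no format has
>>> been agreed yet, so all keys and values below are invented:
>>>
>>> file: dwc_index.txt.gz
>>> rights: CC-BY
>>> citation: Example Museum specimen index
>>> contact: data at example.org
>>> columns: GlobalUniqueIdentifier, ScientificName, Locality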
>>> Just some more thoughts
>>> Cheers
>>> Tim
>>>> I agree with Markus about using a simple data format. Relational
>>>> database dumps would require standard database structures or
>>>> would expose specific things that are already encapsulated by
>>>> abstraction layers (conceptual schemas).
>>>> I'm not sure about the best way to represent complex data
>>>> structures like ABCD, but for simpler providers such as TapirLink/
>>>> DwC, the idea was to create a new script responsible for dumping
>>>> all mapped concepts of a specific data source into a single file.
>>>> Providers could periodically call this script from a cron job to
>>>> regenerate the dump. The first line in the dump file would
>>>> indicate the concept identifiers (GUIDs) associated with each
>>>> column, to make it a generic solution (and more compatible with
>>>> existing applications). Content could be tab-delimited and
>>>> compressed at the end.
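>>>> A sketch of how a harvester might read such a dump (the file name
>>>> and concept GUID are examples only):
>>>>
>>>> import csv, gzip
>>>>
>>>> with gzip.open("dump.txt.gz", "rt", newline="") as f:
>>>>     reader = csv.reader(f, delimiter="\t")
>>>>     concepts = next(reader)   # first line: one GUID per column
>>>>     for row in reader:
>>>>         record = dict(zip(concepts, row))
>>>>         # e.g. record["http://example.org/dwc/ScientificName"]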
>>>> Harvesters could use this "seed" file for the initial data
>>>> import, and then potentially use incremental harvesting to update
>>>> the cache. But in this case it would be necessary to know when
>>>> the dump file was generated.
>>>> To use the existing TAPIR infrastructure, we would also need to
>>>> know which providers support the dump files. Aaron's idea, when
>>>> he first discussed it with me, was to use a new custom operation.
>>>> This makes sense to me, but it would require a small change in
>>>> the protocol to add a custom slot in the operations section of
>>>> capabilities responses.
>>>> Curiously, this approach would allow the existence of TAPIR
>>>> "static providers" - the simplest possible category, even simpler
>>>> than TapirLite. They would not support inventories, searches or
>>>> query templates, but would make the dump file available through
>>>> the new custom operation. Metadata, capabilities and ping could
>>>> be just static files served by a very simple script.
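>>>> Such a static provider really could be tiny. A sketch, assuming
>>>> the usual "op" query parameter and three pre-made XML files:
>>>>
>>>> import http.server, urllib.parse
>>>>
>>>> STATIC = {"ping": "ping.xml", "metadata": "metadata.xml",
>>>>           "capabilities": "capabilities.xml"}   # written once
>>>>
>>>> class Handler(http.server.BaseHTTPRequestHandler):
>>>>     def do_GET(self):
>>>>         query = urllib.parse.urlparse(self.path).query
>>>>         op = urllib.parse.parse_qs(query).get("op", ["metadata"])[0]
>>>>         name = STATIC.get(op)
>>>>         if name is None:
>>>>             self.send_error(404)   # no searches, no inventories
>>>>             return
>>>>         body = open(name, "rb").read()
>>>>         self.send_response(200)
>>>>         self.send_header("Content-Type", "text/xml")
>>>>         self.send_header("Content-Length", str(len(body)))
>>>>         self.end_headers()
>>>>         self.wfile.write(body)
>>>>
>>>> http.server.HTTPServer(("", 8080), Handler).serve_forever()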
>>>> If this approach makes sense, I think these are the points that
>>>> still need to be addressed:
>>>> 1) Decide how to indicate the timestamp associated with the dump
>>>> file.
>>>> 2) Change the TAPIR schema (or figure out another solution to
>>>> advertise the new capability, but always remembering that in the
>>>> TAPIR context a single provider instance can host multiple data
>>>> sources that are usually distinguished by a query parameter in
>>>> the URL, so I'm not sure how a sitemaps approach could be used).
>>>> 3) Decide how to represent complex data such as ABCD (if using
>>>> multiple files, I would suggest compressing them together and
>>>> serving them as a single file).
>>>> 4) Write a short specification to describe the new custom
>>>> operation and the data format.
>>>> I'm happy to change the schema if there's consensus about this.
>>>> Best Regards,
>>>> --
>>>> Renato
>>>>> it would keep the relations, but we don't really want any
>>>>> relational structure to be served up.
>>>>> And using sqlite binaries for the DwC star schema would not be
>>>>> easier to work with than plain text files. They can even be
>>>>> loaded into Excel straight away, can be versioned with svn, and
>>>>> so on. If there is a geospatial extension file which has the
>>>>> GUID in the first column, applications might grab that directly
>>>>> and not even touch the central core file if they only want
>>>>> location data.
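>>>>> e.g., with a made-up geospatial extension file keyed by GUID:
>>>>>
>>>>> import csv
>>>>>
>>>>> with open("geospatial_extension.txt", newline="") as f:
>>>>>     for guid, latitude, longitude in csv.reader(f, delimiter="\t"):
>>>>>         pass   # location data only; the core file is never opened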
>>>>> I'd prefer to stick with a csv or tab-delimited file.
>>>>> The simpler the better. And it also can't get corrupted as
>>>>> easily.
>>>>> Markus
>>>>> On 14 May, 2008, at 15:25, Aaron D. Steele wrote:
>>>>>> for preserving relational data, we could also just dump
>>>>>> tapirlink resources to an sqlite database file
>>>>>> (http://www.sqlite.org), zip it up, and again make it available
>>>>>> via the web service. we use sqlite internally for many
>>>>>> projects, and it's both easy to use and well supported by jdbc,
>>>>>> php, python, etc.
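>>>>>> a rough sketch, with an invented table name:
>>>>>>
>>>>>> import sqlite3, zipfile
>>>>>>
>>>>>> conn = sqlite3.connect("resource.db")
>>>>>> conn.execute("CREATE TABLE IF NOT EXISTS records "
>>>>>>              "(guid TEXT PRIMARY KEY, scientific_name TEXT)")
>>>>>> conn.commit()
>>>>>> conn.close()
>>>>>> with zipfile.ZipFile("resource.zip", "w",
>>>>>>                      zipfile.ZIP_DEFLATED) as z:
>>>>>>     z.write("resource.db")   # then serve resource.zip on the web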
>>>>>> would something like this be a useful option?
>>>>>> thanks,
>>>>>> aaron
