[tdwg-tapir] Fwd: Tapir protocol - Harvest methods?

Markus Döring mdoering at gbif.org
Thu May 15 11:00:27 CEST 2008

I am worried about the duplication and maintenance of multiple
metadata files and formats. TAPIR has one, NCD exists, FGDC, EML,
Dublin Core and many more. So maybe we should just add a URL to the
metadata and not even specify the format, just recommend that it be
compatible with Dublin Core? It could resolve to an RDF document, a
TAPIR metadata response or an HTML page with embedded Dublin Core
data. Then the dwc index metafile would be a true static technical
description and could be created once, if we settle on the HTTP
header approach.

Btw, with HTTP you can even specify "If-Modified-Since" in a request
header to get a "304 Not Modified" response for files that haven't
changed since that date. The HTTP 1.1 spec requires web servers to
support this. So the HTTP response could always indicate the
last-modified date, and the index file would only be returned if it
had been modified since the last request. That's pretty much all we
want, isn't it?



On 15 May, 2008, at 10:04, Tim Robertson wrote:

> Locally generated / localised DwC index files?
> (if you have rich data behind an LSID, then this file is an index
> that allows searching of those rich data using DwC fields)
> I would like to see the data file accompanied by a compulsory
> metafile that details rights, citation, contacts etc.  Whether this
> file needs the data generation timestamp I am not so sure, and the
> HTTP header approach does sound good.  It means you can do a
> one-time metafile crafting and then just cron the dump generation,
> as sketched below... This would be for institutions with IT
> resources - e.g. the UK NBN with 20M records.
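> As an aside, the kind of thing cron would run could be as simple as
> this sketch (table and file names invented):
>
> # crontab: 0 2 * * * python3 dump_index.py
> import csv, gzip, sqlite3   # any DB-API driver would do here
>
> conn = sqlite3.connect("specimens.db")   # stand-in for the real RDBMS
> rows = conn.execute(
>     "SELECT guid, scientific_name, locality FROM specimens")
> with gzip.open("dwc_index.txt.gz", "wt", newline="") as out:
>     csv.writer(out, delimiter="\t").writerows(rows)
> # the metafile is never touched - it was crafted once at set-up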
> For Joe Bloggs with a data set, if we included it in the wrapper
> tools, then it is easy to rewrite the metafile seamlessly anyway,
> so they don't care.
> Cheers,
> Tim
> -----Original Message-----
> From: Roger Hyam [mailto:rogerhyam at mac.com]
> Sent: Thursday, May 15, 2008 9:12 AM
> To: Markus Döring
> Cc: trobertson at gbif.org; tdwg-tapir at lists.tdwg.org
> Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?
> I imagine a lot of these CSV files (we need a name for them) will
> be generated by an SQL query run as a scheduled task or a cron job.
> This is good and pretty easy to automate.
> It increases the complexity of the dump process greatly if it also
> needs to update a metadata file with the new modified date every
> time. In fact it moves the set-up of the process from just being a
> configuration job in most RDBMSs to needing actual scripts to run
> and change the metadata files. The structure of the CSV file is
> constant, so the metadata file should really only be created once,
> when the process is set up.
> Could we use the modified / created dates in the HTTP headers for
> the files instead? The client just has to call a HEAD to see if the
> file has changed and get its size before deciding to download it.
> (It is amazing what you can do with good old HTTP.)
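> A sketch of that HEAD call, with a made-up URL:
>
> import urllib.request
>
> req = urllib.request.Request("http://example.org/dwc_index.txt.gz",
>                              method="HEAD")
> with urllib.request.urlopen(req) as resp:
>     print(resp.headers.get("Last-Modified"))    # has it changed?
>     print(resp.headers.get("Content-Length"))   # size in bytes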
> The only thing that is lost doing it this way is that we don't know
> the number of rows in the file, but we do know its size in bytes.
> What we gain is the ability for non-script-writing system admins to
> set up the system.
> Just a thought,
> Roger
> On 15 May 2008, at 00:00, Markus Döring wrote:
>> I agree with Tim that it would be better to keep this proposal/
>> specification separate from TAPIR. That said, it could still be
>> included in the TAPIR capabilities to indicate this feature. But an
>> important reason to have these files is to get more providers on
>> board, so they should also be able to implement this without the
>> TAPIR overhead.
>> A separate metafile would certainly also hold the timestamp of the
>> last generation of the file, so keeping that separate has
>> additional advantages.
>> Markus
>> On 14 May, 2008, at 21:44, trobertson at gbif.org wrote:
>>> Hi Renato,
>>> Do you think this really goes under the TAPIR spec?
>>> Sure, we want the wrappers to produce it, but it's just a document
>>> on a URL and can be described in such a simple way that loads of
>>> other people could incorporate it without getting into the TAPIR
>>> specs; nor can they claim any TAPIR compliance just because they
>>> can do a 'select to outfile'.
>>> I would also request that the headers aren't in the data file but
>>> in the metafile.  It is way easier to dump a big DB to this
>>> 'document standard' without needing to worry about how to get
>>> headers into a 20-gig file.
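>>> Purely to make that concrete, a made-up metafile - no format has
>>> been agreed yet, so all keys and values below are invented:
>>>
>>> file: dwc_index.txt.gz
>>> rights: CC-BY
>>> citation: Example Museum specimen index
>>> contact: data at example.org
>>> columns: GlobalUniqueIdentifier, ScientificName, Locality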
>>> Just some more thoughts
>>> Cheers
>>> Tim
>>>> I agree with Markus about using a simple data format. Relational
>>>> database dumps would require standard database structures or
>>>> would expose specific things that are already encapsulated by
>>>> abstraction layers (conceptual schemas).
>>>> I'm not sure about the best way to represent complex data
>>>> structures like ABCD, but for simpler providers such as TapirLink/
>>>> DwC, the idea was to create a new script responsible for dumping
>>>> all mapped concepts of a specific data source into a single file.
>>>> Providers could periodically call this script from a cron job to
>>>> regenerate the dump. The first line in the dump file would
>>>> indicate the concept identifiers (GUIDs) associated with each
>>>> column, to make it a generic solution (and more compatible with
>>>> existing applications). Content could be tab-delimited and
>>>> compressed at the end.
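>>>> A sketch of how a harvester might read such a dump (the file name
>>>> and concept GUID are examples only):
>>>>
>>>> import csv, gzip
>>>>
>>>> with gzip.open("dump.txt.gz", "rt", newline="") as f:
>>>>     reader = csv.reader(f, delimiter="\t")
>>>>     concepts = next(reader)   # first line: one GUID per column
>>>>     for row in reader:
>>>>         record = dict(zip(concepts, row))
>>>>         # e.g. record["http://example.org/dwc/ScientificName"]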
>>>> Harvesters could use this "seed" file for the initial data
>>>> import, and then potentially use incremental harvesting to update
>>>> the cache. But in this case it would be necessary to know when
>>>> the dump file was generated.
>>>> To use the existing TAPIR infrastructure, we would also need to
>>>> know which providers support the dump files. Aaron's idea, when
>>>> he first discussed it with me, was to use a new custom operation.
>>>> This makes sense to me, but it would require a small change in
>>>> the protocol to add a custom slot in the operations section of
>>>> capabilities responses.
>>>> Curiously, this approach would allow the existence of TAPIR
>>>> "static providers" - the simplest possible category, even simpler
>>>> than TapirLite. They would not support inventories, searches or
>>>> query templates, but would make the dump file available through
>>>> the new custom operation. Metadata, capabilities and ping could
>>>> be just static files served by a very simple script.
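>>>> Such a static provider really could be tiny. A sketch, assuming
>>>> the usual "op" query parameter and three pre-made XML files:
>>>>
>>>> import http.server, urllib.parse
>>>>
>>>> STATIC = {"ping": "ping.xml", "metadata": "metadata.xml",
>>>>           "capabilities": "capabilities.xml"}   # written once
>>>>
>>>> class Handler(http.server.BaseHTTPRequestHandler):
>>>>     def do_GET(self):
>>>>         query = urllib.parse.urlparse(self.path).query
>>>>         op = urllib.parse.parse_qs(query).get("op", ["metadata"])[0]
>>>>         name = STATIC.get(op)
>>>>         if name is None:
>>>>             self.send_error(404)   # no searches, no inventories
>>>>             return
>>>>         body = open(name, "rb").read()
>>>>         self.send_response(200)
>>>>         self.send_header("Content-Type", "text/xml")
>>>>         self.send_header("Content-Length", str(len(body)))
>>>>         self.end_headers()
>>>>         self.wfile.write(body)
>>>>
>>>> http.server.HTTPServer(("", 8080), Handler).serve_forever()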
>>>> If this approach makes sense, I think these are the points that
>>>> still need to be addressed:
>>>> 1) Decide how to indicate the timestamp associated with the dump
>>>> file.
>>>> 2) Change the TAPIR schema (or figure out another solution to
>>>> advertise the new capability, but always remembering that in the
>>>> TAPIR context a single provider instance can host multiple data
>>>> sources that are usually distinguished by a query parameter in
>>>> the URL, so I'm not sure how a sitemaps approach could be used).
>>>> 3) Decide how to represent complex data such as ABCD (if using
>>>> multiple files, I would suggest compressing them together and
>>>> serving them as a single file).
>>>> 4) Write a short specification to describe the new custom
>>>> operation and the data format.
>>>> I'm happy to change the schema if there's consensus about this.
>>>> Best Regards,
>>>> --
>>>> Renato
>>>>> it would keep the relations, but we don't really want any
>>>>> relational structure to be served up.
>>>>> And using sqlite binaries for the DwC star schema would not be
>>>>> easier to work with than plain text files. They can even be
>>>>> loaded into Excel straight away, can be versioned with svn, and
>>>>> so on. If there is a geospatial extension file which has the
>>>>> GUID in the first column, applications might grab that directly
>>>>> and not even touch the central core file if they only want
>>>>> location data.
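>>>>> e.g., with a made-up geospatial extension file keyed by GUID:
>>>>>
>>>>> import csv
>>>>>
>>>>> with open("geospatial_extension.txt", newline="") as f:
>>>>>     for guid, latitude, longitude in csv.reader(f, delimiter="\t"):
>>>>>         pass   # location data only; the core file is never opened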
>>>>> I'd prefer to stick with a csv or tab-delimited file.
>>>>> The simpler the better. And it also can't get corrupted as
>>>>> easily.
>>>>> Markus
>>>>> On 14 May, 2008, at 15:25, Aaron D. Steele wrote:
>>>>>> for preserving relational data, we could also just dump
>>>>>> tapirlink resources to an sqlite database file
>>>>>> (http://www.sqlite.org), zip it up, and again make it available
>>>>>> via the web service. we use sqlite internally for many
>>>>>> projects, and it's both easy to use and well supported by jdbc,
>>>>>> php, python, etc.
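>>>>>> a rough sketch, with an invented table name:
>>>>>>
>>>>>> import sqlite3, zipfile
>>>>>>
>>>>>> conn = sqlite3.connect("resource.db")
>>>>>> conn.execute("CREATE TABLE IF NOT EXISTS records "
>>>>>>              "(guid TEXT PRIMARY KEY, scientific_name TEXT)")
>>>>>> conn.commit()
>>>>>> conn.close()
>>>>>> with zipfile.ZipFile("resource.zip", "w",
>>>>>>                      zipfile.ZIP_DEFLATED) as z:
>>>>>>     z.write("resource.db")   # then serve resource.zip on the web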
>>>>>> would something like this be a useful option?
>>>>>> thanks,
>>>>>> aaron
