[tdwg-tapir] Fwd: Tapir protocol - Harvest methods?

Markus Döring mdoering at gbif.org
Fri May 16 10:29:51 CEST 2008


Renato,
I was thinking along those lines too. It would be nice for TAPIR
services to announce the availability of the index files. I wouldn't
mind even adding it to the regular TAPIR schema once it has proven to
work with the custom slot approach you have proposed.

Regarding star-shaped data, I would prefer to agree on one format
instead of allowing different ones, to save consumers from this pain.
There is a straightforward XML serialisation for this scheme that we
could use instead of tab files:

<record uri="">
   <dwc:property1 />
   <dwc:property2 />
   <extA:record>
     <extA:property1 />
     <extA:property2 />
   </extA:record>
   <extB:record>
     <extB:property1 />
     <extB:property2 />
   </extB:record>
</record>


The advantage is that it can be produced by TAPIR software, and XML
serialisation is required for many services anyway, e.g. RSS.
But then again, the whole point of the index files is that they are
easy to generate and consume. On the other hand, this XML structure is
pretty simple to process and can be generated directly from databases
such as SQL Server that have built-in XML output, without the need for
scripting.
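
To illustrate how little is needed to consume such a structure, here is a
minimal Python sketch that splits one star-shaped record into its core
properties and its extension rows. The namespace URIs, the sample values
and the helper names (split_tag, flatten) are assumptions made up for this
example; nothing here is part of TAPIR or of an agreed format.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs and values, for illustration only.
SAMPLE = """\
<record uri="urn:example:1"
        xmlns:dwc="http://example.org/dwc"
        xmlns:extA="http://example.org/extA">
  <dwc:property1>core value</dwc:property1>
  <extA:record>
    <extA:property1>a1</extA:property1>
  </extA:record>
  <extA:record>
    <extA:property1>a2</extA:property1>
  </extA:record>
</record>
"""

def split_tag(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag

def flatten(record):
    """Return (uri, core properties, list of (extension namespace, row))."""
    core, extension_rows = {}, []
    for child in record:
        namespace, local = split_tag(child.tag)
        if local == "record":
            # A nested <ext:record> is one row of that extension.
            row = {split_tag(p.tag)[1]: (p.text or "") for p in child}
            extension_rows.append((namespace, row))
        else:
            # Anything else directly under <record> is a core property.
            core[local] = child.text or ""
    return record.get("uri"), core, extension_rows

uri, core, extension_rows = flatten(ET.fromstring(SAMPLE))
```

Each extension row keeps the namespace of its extension, so a consumer
could write the rows of different extensions to separate tables.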

That touches on a different issue I am facing with the star scheme, by
the way. I have created an identification extension for Darwin Core that
holds the historical list of identification events and their outcomes.
This is the YAML section of the metafile describing the columns of this
extension through fully qualified concepts à la TAPIR:

identification:
   - http://rs.tdwg.org/dwc/dwcore/ScientificName
   - http://rs.tdwg.org/dwc/dwcore/AuthorYearOfScientificName
   - http://rs.tdwg.org/dwc/dwcore/Family
   - http://rs.tdwg.org/dwc/dwcore/IdentificationQualifier
   - http://rs.tdwg.org/dwc/curatorial/DateIdentified
   - http://rs.tdwg.org/dwc/curatorial/IdentifiedBy

When creating this I realised that pretty much all the concepts I was
interested in already exist in Darwin Core or the curatorial
extension. Wouldn't it be wise to reuse those concepts? Or are they
strictly tied to the idea of a current identification and therefore
can't be used for historical ones? This is probably more of a Darwin
Core question than a TAPIR one, but we are all on this list anyway ...

The XML in that case would look something like this:

<record uri="http://mygarden.com/specimen/plants/54321-423-43-54-6-3-24-44">
   <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
   ...
   <ident:record>
     <dwc:ScientificName>Aster alpinus</dwc:ScientificName>
     <dwc:AuthorYearOfScientificName>L.</dwc:AuthorYearOfScientificName>
     <dwc:Family>Asteraceae</dwc:Family>
     <cur:DateIdentified>1913-03-12</cur:DateIdentified>
     <cur:IdentifiedBy>Karl Marx</cur:IdentifiedBy>
   </ident:record>
   <ident:record>
     <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
     <dwc:AuthorYearOfScientificName>Novopokr.</dwc:AuthorYearOfScientificName>
     <dwc:Family>Asteraceae</dwc:Family>
     <cur:DateIdentified>2003-09-07</cur:DateIdentified>
     <cur:IdentifiedBy>Keith Richards</cur:IdentifiedBy>
   </ident:record>
</record>
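
As a rough sketch of how such a record could be generated from
(concept, value) pairs keyed by the fully qualified concepts of the
metafile, something like the following Python would do. Treating the part
of each concept URI up to the last slash as the XML namespace, the
namespace chosen for the ident prefix, and the helper names (concept_tag,
build_record) are all assumptions made for this example, not an agreed
convention.

```python
import xml.etree.ElementTree as ET

DWC = "http://rs.tdwg.org/dwc/dwcore/"      # assumed to double as namespace URI
CUR = "http://rs.tdwg.org/dwc/curatorial/"  # assumed to double as namespace URI
IDENT = "http://example.org/ident/"         # hypothetical extension namespace

# Ask ElementTree to use readable prefixes when serialising.
ET.register_namespace("dwc", DWC)
ET.register_namespace("cur", CUR)
ET.register_namespace("ident", IDENT)

def concept_tag(concept):
    """Turn a fully qualified concept URI into an ElementTree tag name."""
    namespace, local = concept.rsplit("/", 1)
    return "{%s/}%s" % (namespace, local)

def build_record(uri, core, identifications):
    """Build a <record> with core properties and one ident row per event."""
    record = ET.Element("record", {"uri": uri})
    for concept, value in core:
        ET.SubElement(record, concept_tag(concept)).text = value
    for identification in identifications:
        row = ET.SubElement(record, "{%s}record" % IDENT)
        for concept, value in identification:
            ET.SubElement(row, concept_tag(concept)).text = value
    return record

record = build_record(
    "http://mygarden.com/specimen/plants/54321-423-43-54-6-3-24-44",
    [(DWC + "ScientificName", "Aster alpinus subsp. parviceps")],
    [
        [(DWC + "ScientificName", "Aster alpinus"),
         (CUR + "DateIdentified", "1913-03-12")],
        [(DWC + "ScientificName", "Aster alpinus subsp. parviceps"),
         (CUR + "DateIdentified", "2003-09-07")],
    ],
)
xml_text = ET.tostring(record, encoding="unicode")
```

Because the same dwc and cur concepts are used both at the core level and
inside the ident rows, the sketch also shows what reusing the existing
concepts for historical identifications would mean in practice.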


Markus


On 15 May, 2008, at 20:42, Renato De Giovanni wrote:

> Right. I agree there's no particular reason to expose the dump file
> through a typical TAPIR URL. Headers could also be in a separate file.
> However, from a TAPIR service perspective, I think it's still important
> to somehow advertise the availability of a dump file in capabilities
> (even if GBIF doesn't use this). There's a slot at the end of a
> capabilities response that could be used for this purpose:
>
> ...
> <custom>
>  <ext:dump baseurl="http://somehost/somepath/"/>
> </custom>
> ...
>
> Providers that only want to see their data being served through GBIF
> could simply make the dump files available somewhere, without the need
> to install and maintain a web service. TAPIR providers that have other
> reasons to exist could decide if they want to register the TAPIR
> endpoint or just the base URL of the dump file in GBIF's registry.
>
> HTTP headers ("If-Modified-Since" and "Last-Modified") seem to solve
> the timestamp issue in an elegant way.
>
> Regarding complex data, I would be inclined to propose some compact XML
> representation compatible with TAPIR so that existing wrapper
> functionalities could be used to generate the dump file. I suppose this
> could save considerable time. Another advantage is that it would be a
> generic solution, not restricted to one-level relationships. Since
> TAPIR output models can map XML nodes to a concatenation of concepts
> and literals, it's also possible to have a single record element with
> some sort of CSV content inside. I'm just not sure how to escape any
> separators that could be present in real content.
>
> We could also provide more information about the format in the new
> dump element:
>
> <ext:dump baseurl="http://somehost/somepath/" format="csv"/>
>
> or
>
> <ext:dump baseurl="http://somehost/somepath/" format="xml"
> outputModel="some_url"/>
>
> Regards,
> --
> Renato
>
>
>> Hi Renato,
>>
>> Do you think this really goes under the TAPIR spec?
>>
>> Sure, we want the wrappers to produce it, but it's just a document on a
>> URL and can be described in such a simple way that loads of other
>> people could incorporate it without getting into TAPIR specs, nor can
>> they claim any TAPIR compliance just because they can do a 'select to
>> outfile'.
>>
>> I would also request that the headers aren't in the data file but in
>> the metafile. It is way easier to dump a big DB to this 'document
>> standard' without needing to worry about how to get headers into a
>> 20 GB file.
>>
>> Just some more thoughts
>>
>> Cheers
>>
>> Tim