Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?

16 May 2008

Renato,
I was thinking along those lines too. It would be nice for TAPIR
providers to announce the availability of the index files. I wouldn't
mind adding it even to the regular TAPIR schema once it has proven to
work with the custom slot approach you have given.

Regarding star-shaped data, I would prefer to agree on one format
instead of allowing different ones, to save consumers from this pain.
There is a straightforward XML serialisation for this scheme that we
could use instead of tab files:

<record uri="">
   <dwc:property1 />
   <dwc:property2 />
   <extA:record>
     <extA:property1 />
     <extA:property2 />
   </extA:record>
   <extB:record>
     <extB:property1 />
     <extB:property2 />
   </extB:record>
</record>

The advantage is that it can be produced by TAPIR software, and XML
serialisation is required for many services anyway, e.g. RSS.
But then again, the whole point of the index files is that they are
easy to generate and consume. On the other hand, this XML structure is
pretty simple to process and can be generated from databases like
SQL Server that have XML output straight away, without the need for
scripting.
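
To give an idea of how little it takes to consume such a record, here is a minimal Python sketch using only the standard library. The extA namespace URI, the record URI and all values are placeholders made up for illustration; the serialisation itself has not been agreed on, so treat this as a sketch under those assumptions.

```python
import xml.etree.ElementTree as ET

# Sample record; the extA namespace and all values are made-up placeholders.
SAMPLE = """\
<record xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/"
        xmlns:extA="http://example.org/extA/"
        uri="http://example.org/specimen/1">
  <dwc:property1>core value 1</dwc:property1>
  <dwc:property2>core value 2</dwc:property2>
  <extA:record>
    <extA:property1>a1</extA:property1>
    <extA:property2>a2</extA:property2>
  </extA:record>
  <extA:record>
    <extA:property1>b1</extA:property1>
    <extA:property2>b2</extA:property2>
  </extA:record>
</record>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_star_record(xml_text):
    """Split one star-shaped record into core properties and extension rows."""
    root = ET.fromstring(xml_text)
    core = {}
    extensions = {}
    for child in root:
        if local_name(child.tag) == "record":
            # A nested <ext:record> is one row of an extension table,
            # grouped here by the namespace URI of the extension.
            ns_uri = child.tag[1:].split("}", 1)[0]
            row = {local_name(p.tag): (p.text or "") for p in child}
            extensions.setdefault(ns_uri, []).append(row)
        else:
            core[local_name(child.tag)] = child.text or ""
    return root.get("uri"), core, extensions


uri, core, extensions = parse_star_record(SAMPLE)
print(uri)
print(core)
print(extensions)
```

The core properties end up in one dictionary and each extension becomes a list of rows, which is essentially the star scheme again.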

That touches on a different issue I am facing with the star scheme, by
the way. I have created an identification extension for Darwin Core
that holds the historical list of identification events and their
outcomes. This is a YAML section of the metafile describing the columns
for this extension through fully qualified concepts à la TAPIR:

identification:
   - http://rs.tdwg.org/dwc/dwcore/ScientificName
   - http://rs.tdwg.org/dwc/dwcore/AuthorYearOfScientificName
   - http://rs.tdwg.org/dwc/dwcore/Family
   - http://rs.tdwg.org/dwc/dwcore/IdentificationQualifier
   - http://rs.tdwg.org/dwc/curatorial/DateIdentified
   - http://rs.tdwg.org/dwc/curatorial/IdentifiedBy
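
A minimal Python sketch of how a consumer might apply such a column list to a header-less, tab-delimited extension file. The column list is copied from the YAML section; the sample rows and the assumption that the data file carries exactly these columns in this order are mine, for illustration only.

```python
import csv
import io

# Column concepts copied from the YAML section, in file order.
IDENTIFICATION_COLUMNS = [
    "http://rs.tdwg.org/dwc/dwcore/ScientificName",
    "http://rs.tdwg.org/dwc/dwcore/AuthorYearOfScientificName",
    "http://rs.tdwg.org/dwc/dwcore/Family",
    "http://rs.tdwg.org/dwc/dwcore/IdentificationQualifier",
    "http://rs.tdwg.org/dwc/curatorial/DateIdentified",
    "http://rs.tdwg.org/dwc/curatorial/IdentifiedBy",
]

# Made-up, header-less tab-delimited data standing in for the extension file.
SAMPLE = (
    "Aster alpinus\tL.\tAsteraceae\t\t1913-03-12\tKarl Marx\n"
    "Aster alpinus subsp. parviceps\tNovopokr.\tAsteraceae\t\t2003-09-07\tKeith Richards\n"
)


def read_rows(handle, columns):
    """Yield one dict per data line, keyed by the fully qualified concept."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, fields in enumerate(reader, start=1):
        if len(fields) != len(columns):
            raise ValueError(
                "line %d has %d fields, expected %d"
                % (line_no, len(fields), len(columns))
            )
        yield dict(zip(columns, fields))


rows = list(read_rows(io.StringIO(SAMPLE), IDENTIFICATION_COLUMNS))
for row in rows:
    print(row["http://rs.tdwg.org/dwc/curatorial/DateIdentified"],
          row["http://rs.tdwg.org/dwc/dwcore/ScientificName"])
```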

When creating this I realised that pretty much all the concepts I was
interested in already existed in Darwin Core or the curatorial
extension. Wouldn't it be wise to reuse those concepts? Or are they
strictly tied to the idea of a current identification and therefore
can't be used for historical ones? This is probably more of a Darwin
Core question than a TAPIR one, but we are all on this list anyway ...

The XML in that case would look something like this:

<record uri="http://mygarden.com/specimen/plants/54321-423-43-54-6-3-24-44">
   <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
   ...
   <ident:record>
     <dwc:ScientificName>Aster alpinus</dwc:ScientificName>
     <dwc:AuthorYearOfScientificName>L.</dwc:AuthorYearOfScientificName>
     <dwc:Family>Asteraceae</dwc:Family>
     <cur:DateIdentified>1913-03-12</cur:DateIdentified>
     <cur:IdentifiedBy>Karl Marx</cur:IdentifiedBy>
   </ident:record>
   <ident:record>
     <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
     <dwc:AuthorYearOfScientificName>Novopokr.</dwc:AuthorYearOfScientificName>
     <dwc:Family>Asteraceae</dwc:Family>
     <cur:DateIdentified>2003-09-07</cur:DateIdentified>
     <cur:IdentifiedBy>Keith Richards</cur:IdentifiedBy>
   </ident:record>
</record>
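
A minimal Python sketch of how such a record could be generated from rows keyed by the fully qualified concepts. The dwc and cur namespace URIs are derived from the concept URIs in the metafile section; the namespace URI for the ident prefix is a placeholder I made up, since none has been defined for the extension yet.

```python
import xml.etree.ElementTree as ET

DWC = "http://rs.tdwg.org/dwc/dwcore/"
CUR = "http://rs.tdwg.org/dwc/curatorial/"
IDENT = "http://example.org/ident/"  # hypothetical namespace for the extension

# Register prefixes so the output uses dwc:, cur: and ident: instead of ns0: etc.
ET.register_namespace("dwc", DWC)
ET.register_namespace("cur", CUR)
ET.register_namespace("ident", IDENT)


def concept_tag(concept_uri):
    """Turn 'http://.../dwcore/Family' into ElementTree's '{http://.../dwcore/}Family'."""
    namespace, _, name = concept_uri.rpartition("/")
    return "{%s/}%s" % (namespace, name)


def build_record(uri, core, identifications):
    """Build one <record> with core properties and nested <ident:record> rows."""
    record = ET.Element("record", {"uri": uri})
    for concept, value in core.items():
        ET.SubElement(record, concept_tag(concept)).text = value
    for identification in identifications:
        row = ET.SubElement(record, "{%s}record" % IDENT)
        for concept, value in identification.items():
            ET.SubElement(row, concept_tag(concept)).text = value
    return record


record = build_record(
    "http://mygarden.com/specimen/plants/54321-423-43-54-6-3-24-44",
    {DWC + "ScientificName": "Aster alpinus subsp. parviceps"},
    [
        {
            DWC + "ScientificName": "Aster alpinus",
            DWC + "AuthorYearOfScientificName": "L.",
            DWC + "Family": "Asteraceae",
            CUR + "DateIdentified": "1913-03-12",
            CUR + "IdentifiedBy": "Karl Marx",
        },
        {
            DWC + "ScientificName": "Aster alpinus subsp. parviceps",
            DWC + "AuthorYearOfScientificName": "Novopokr.",
            DWC + "Family": "Asteraceae",
            CUR + "DateIdentified": "2003-09-07",
            CUR + "IdentifiedBy": "Keith Richards",
        },
    ],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Letting the library write the elements also rules out unbalanced closing tags, which are easy to get wrong by hand.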

Markus

On 15 May, 2008, at 20:42, Renato De Giovanni wrote:
...
Right. I agree there's no particular reason to expose the dump file
through a typical TAPIR URL. Headers could also be in a separate file.
However, from a TAPIR service perspective, I think it's still important
to somehow advertise the availability of a dump file in capabilities
(even if GBIF doesn't use this). There's a slot in the end of a
capabilities response that could be used for this purpose:
...
<custom>
 <ext:dump baseurl="http://somehost/somepath/"/>
</custom>
...
Providers that only want to see their data being served through GBIF
could simply make the dump files available somewhere, without the need
to install and maintain a web service. TAPIR providers that have other
reasons to exist could decide if they want to register the TAPIR
endpoint or just the base URL of the dump file in GBIF's registry.
HTTP headers ("If-Modified-Since" and "Last-Modified") seem to solve
the timestamp issue in an elegant way.
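
A minimal Python sketch of how a harvester might use those two headers, assuming the dump is a plain file served over HTTP. The file name "dump.txt" under the base URL is a hypothetical example, and the download function is only defined here, not called, since it needs a real server.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_request(url, last_seen):
    """Build a GET request that asks for the file only if it changed since last_seen."""
    # HTTP dates are always expressed in GMT; last_seen must be a UTC datetime.
    stamp = format_datetime(last_seen, usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_modified(url, last_seen):
    """Return (body, last_modified), or (None, None) if the server answers 304."""
    try:
        with urllib.request.urlopen(conditional_request(url, last_seen)) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: the previously harvested copy is current
            return None, None
        raise


# Hypothetical dump location and time of the previous harvest.
last_harvest = datetime(2008, 5, 15, 20, 42, tzinfo=timezone.utc)
request = conditional_request("http://somehost/somepath/dump.txt", last_harvest)
# urllib stores header names capitalised as "If-modified-since".
print(request.get_header("If-modified-since"))
```

The harvester would store the "Last-Modified" value it receives and send it back as "If-Modified-Since" on the next run.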
Regarding complex data, I would be inclined to propose some compact XML
representation compatible with TAPIR so that existing wrapper
functionalities could be used to generate the dump file. I suppose this
could save considerable time. Another advantage is that it would be a
generic solution, not restricted to one level relationships. Since
TAPIR output models can map XML nodes to a concatenation of concepts
and literals, it's also possible to have a single record element with
some sort of csv content inside. I'm just not sure how to escape
eventual separators that could be present in real content.
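
One common answer to the escaping question is the usual CSV quoting convention: a field containing the separator or a quote is wrapped in double quotes, and embedded quotes are doubled. A minimal Python sketch of that convention follows, with made-up values. Note the assumption that the csv text is produced by a script; a TAPIR output model that merely concatenates concepts and literals would not apply this quoting by itself.

```python
import csv
import io


def to_csv_line(values):
    """Serialise one record as a single CSV line, quoting only where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(values)
    return buffer.getvalue()[:-1]  # drop the trailing line terminator


def from_csv_line(line):
    """Parse one CSV line back into its fields."""
    return next(csv.reader([line]))


# Made-up values; the second field contains both the separator and quotes.
values = ["Aster alpinus", 'Smith, J. "Jim"', "Asteraceae"]
line = to_csv_line(values)
print(line)
print(from_csv_line(line) == values)
```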
We could also provide more information about the format in the new
dump element:
<ext:dump baseurl="http://somehost/somepath/" format="csv"/>
or
<ext:dump baseurl="http://somehost/somepath/" format="xml"
outputModel="some_url"/>
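
A minimal Python sketch of how a client might read the proposed dump element out of a capabilities response. The namespace URI bound to the ext prefix is a placeholder made up for illustration, and the sample document is a bare fragment, not a full TAPIR capabilities response.

```python
import xml.etree.ElementTree as ET

EXT_NS = "http://example.org/tapir/ext/"  # hypothetical namespace for ext:dump

# Bare fragment standing in for the custom slot of a capabilities response.
SAMPLE = """\
<custom xmlns:ext="http://example.org/tapir/ext/">
  <ext:dump baseurl="http://somehost/somepath/" format="csv"/>
  <ext:dump baseurl="http://somehost/somepath/" format="xml"
            outputModel="some_url"/>
</custom>
"""


def find_dumps(capabilities_xml):
    """Return the attributes of every ext:dump element found in the document."""
    root = ET.fromstring(capabilities_xml)
    dumps = []
    # iter() searches the whole tree, so the element is found wherever the
    # custom slot sits inside a real response.
    for element in root.iter("{%s}dump" % EXT_NS):
        dumps.append({
            "baseurl": element.get("baseurl"),
            "format": element.get("format"),
            "outputModel": element.get("outputModel"),  # None when absent
        })
    return dumps


dumps = find_dumps(SAMPLE)
for dump in dumps:
    print(dump)
```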
Regards,
--
Renato
...
Hi Renato,
Do you think this really goes under the TAPIR spec?
Sure, we want the wrappers to produce it, but it's just a document on a
URL and can be described in such a simple way that loads of other
people could incorporate it without getting into TAPIR specs, nor can
they claim any TAPIR compliance just because they can do a 'select to
outfile'.
I would also request that the headers aren't in the data file but in
the metafile. It is way easier to dump a big DB to this 'document
standard' without needing to worry about how to get headers into a
20 GB file.
Just some more thoughts
Cheers
Tim


Markus Döring