Renato, I was thinking along those lines too. It would be nice for TAPIRs to announce the availability of the index files. I wouldn't mind adding it even to the regular TAPIR schema once it has proven to work with the custom slot approach you have described.
Regarding star-shaped data, I would prefer to agree on one format instead of allowing different ones, to save consumers from this pain. There is a straightforward XML serialisation for this scheme that we could use instead of tab files:
<record uri="">
  <dwc:property1 />
  <dwc:property2 />
  <extA:record>
    <extA:property1 />
    <extA:property2 />
  </extA:record>
  <extB:record>
    <extB:property1 />
    <extB:property2 />
  </extB:record>
</record>
The advantage is that it can be produced by TAPIR software, and XML serialisation is required for many services anyway, e.g. RSS. But then again, the whole point of the index files is that they are easy to generate and consume. On the other hand, this XML structure is pretty simple to process and can be generated straight away from databases like SQL Server that have XML output, without the need for scripting.
That touches on a different issue I am facing with the star scheme, by the way. I have created an identification extension for Darwin Core that holds the historical list of identification events and their outcomes. This is a YAML section of the metafile describing the columns for this extension through fully qualified concepts à la TAPIR:
identification: - - - - - -
When creating this I realised that pretty much all the concepts I was interested in already existed in Darwin Core or the curatorial extension. Wouldn't it be wise to reuse those concepts? Or are they strictly tied to the idea of a current identification and therefore can't be used for historical ones? This is probably more of a Darwin Core question than a TAPIR one, but we are all on this list anyway ...
The XML in that case would look something like this:
<record uri=" ">
  <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
  ...
  <ident:record>
    <dwc:ScientificName>Aster alpinus</dwc:ScientificName>
    <dwc:AuthorYearOfScientificName>L.</dwc:AuthorYearOfScientificName>
    <dwc:Family>Asteraceae</dwc:Family>
    <cur:DateIdentified>1913-03-12</cur:DateIdentified>
    <cur:IdentifiedBy>Karl Marx</cur:IdentifiedBy>
  </ident:record>
  <ident:record>
    <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
    <dwc:AuthorYearOfScientificName>Novopokr.</dwc:AuthorYearOfScientificName>
    <dwc:Family>Asteraceae</dwc:Family>
    <cur:DateIdentified>2003-09-07</cur:DateIdentified>
    <cur:IdentifiedBy>Keith Richards</cur:IdentifiedBy>
  </ident:record>
</record>
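To illustrate how simple such a record is to consume, here is a minimal Python sketch that pulls the identification history out of one record. The namespace URIs, the record URI and the function name are placeholders of mine, not anything defined by Darwin Core or TAPIR; real files would declare the proper schema namespaces.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions); real ones come from the schemas.
NS = {
    "dwc": "urn:example:dwc",
    "cur": "urn:example:curatorial",
    "ident": "urn:example:identification",
}

SAMPLE = """<record uri="urn:example:specimen:1"
    xmlns:dwc="urn:example:dwc"
    xmlns:cur="urn:example:curatorial"
    xmlns:ident="urn:example:identification">
  <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
  <ident:record>
    <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
    <cur:DateIdentified>2003-09-07</cur:DateIdentified>
    <cur:IdentifiedBy>Keith Richards</cur:IdentifiedBy>
  </ident:record>
  <ident:record>
    <dwc:ScientificName>Aster alpinus</dwc:ScientificName>
    <cur:DateIdentified>1913-03-12</cur:DateIdentified>
    <cur:IdentifiedBy>Karl Marx</cur:IdentifiedBy>
  </ident:record>
</record>"""


def identification_history(record_xml):
    """Return the identification events of one record, oldest first."""
    record = ET.fromstring(record_xml)
    history = []
    for ident in record.findall("ident:record", NS):
        history.append({
            "name": ident.findtext("dwc:ScientificName", namespaces=NS),
            "date": ident.findtext("cur:DateIdentified", namespaces=NS),
            "by": ident.findtext("cur:IdentifiedBy", namespaces=NS),
        })
    # ISO 8601 dates sort correctly as plain strings.
    return sorted(history, key=lambda h: h["date"])


for event in identification_history(SAMPLE):
    print(event["date"], event["name"], event["by"])
```

Note that the reused dwc: and cur: concepts are distinguished from the current identification only by their position inside ident:record, which is exactly the reuse question raised above.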
On 15 May, 2008, at 20:42, Renato De Giovanni wrote:
Right. I agree there's no particular reason to expose the dump file through a typical TAPIR URL. Headers could also be in a separate file. However, from a TAPIR service perspective, I think it's still important to somehow advertise the availability of a dump file in capabilities (even if GBIF doesn't use this). There's a slot at the end of a capabilities response that could be used for this purpose:
<custom> <ext:dump baseurl="http://somehost/somepath/"/> </custom> ...
Providers that only want to see their data being served through GBIF could simply make the dump files available somewhere, without the need to install and maintain a web service. TAPIR providers that have other reasons to exist could decide if they want to register the TAPIR endpoint or just the base URL of the dump file in GBIF's registry.
HTTP headers ("If-Modified-Since" and "Last-Modified") seem to solve the timestamp issue in an elegant way.
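As a rough sketch of the conditional fetch, a harvester could send "If-Modified-Since" and treat a 304 answer as "nothing new". This uses only the Python standard library; the function names and the dump file name in the usage note are hypothetical.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def conditional_request(url, last_fetch_epoch):
    """Build a GET that asks for the dump only if it changed since last_fetch_epoch."""
    req = urllib.request.Request(url)
    # HTTP dates are always expressed in GMT.
    req.add_header("If-Modified-Since", formatdate(last_fetch_epoch, usegmt=True))
    return req


def fetch_if_modified(url, last_fetch_epoch):
    """Return (body, Last-Modified header), or (None, None) if the server answers 304."""
    try:
        with urllib.request.urlopen(conditional_request(url, last_fetch_epoch)) as resp:
            return resp.read(), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: the copy we already have is current
            return None, None
        raise
```

A harvester would call something like fetch_if_modified("http://somehost/somepath/dump.txt", time_of_last_harvest) and store the returned "Last-Modified" value for the next run; "dump.txt" is just an invented file name.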
Regarding complex data, I would be inclined to propose some compact XML representation compatible with TAPIR so that existing wrapper functionalities could be used to generate the dump file. I suppose this could save considerable time. Another advantage is that it would be a generic solution, not restricted to one-level relationships. Since TAPIR output models can map XML nodes to a concatenation of concepts and literals, it's also possible to have a single record element with some sort of CSV content inside. I'm just not sure how to escape any separators that could be present in real content.
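One possible answer to the separator question is the usual CSV quoting convention: wrap any field that contains the separator (or a quote) in double quotes and double the embedded quotes. This is only a sketch of that convention using Python's csv module, not something a TAPIR output model does by itself; the function names and sample values are made up.

```python
import csv
import io


def write_rows(rows, delimiter="\t"):
    """Serialise rows, quoting only the fields that contain the delimiter or a quote."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter,
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def read_rows(text, delimiter="\t"):
    """Parse text produced by write_rows back into lists of fields."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


rows = [["Aster alpinus", "collected near\tthe ridge", 'label says "type"']]
text = write_rows(rows)
print(text)
print(read_rows(text) == rows)
```

The catch is that a wrapper would have to apply this quoting while concatenating concepts and literals, which plain concatenation in an output model would not give us.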
We could also provide more information about the format in the new dump element:
<ext:dump baseurl="http://somehost/somepath/" format="csv"/>
<ext:dump baseurl="http://somehost/somepath/" format="xml" outputModel="some_url"/>
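For what it's worth, a client could discover these elements with a few lines of code. The sketch below assumes a placeholder namespace URI for the ext prefix and a simplified capabilities document; neither is taken from the TAPIR schema, and find_dumps is a name I made up.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace for the "ext" prefix (assumption, not from the TAPIR schema).
EXT_NS = "urn:example:tapir-ext"

# Simplified stand-in for a capabilities response carrying the custom slot.
SAMPLE = """<capabilities xmlns:ext="urn:example:tapir-ext">
  <custom>
    <ext:dump baseurl="http://somehost/somepath/" format="csv"/>
    <ext:dump baseurl="http://somehost/somepath/" format="xml" outputModel="some_url"/>
  </custom>
</capabilities>"""


def find_dumps(capabilities_xml):
    """Return one dict per ext:dump element found anywhere in the document."""
    root = ET.fromstring(capabilities_xml)
    dumps = []
    for el in root.iter("{%s}dump" % EXT_NS):
        dumps.append({
            "baseurl": el.get("baseurl"),
            "format": el.get("format"),            # None when not given
            "outputModel": el.get("outputModel"),  # None when not given
        })
    return dumps


for dump in find_dumps(SAMPLE):
    print(dump)
```

Because the lookup walks the whole tree, it does not depend on where exactly the custom slot sits in the response.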
Hi Renato,
Do you think this really goes under the TAPIR spec?
Sure we want the wrappers to produce it but it's just a document on a URL and can be described in such a simple way that loads of other people could incorporate it without getting into TAPIR specs, nor can they claim any TAPIR compliance just because they can do a 'select to outfile'.
I would also request that the headers go in the metafile rather than in the data file. It is way easier to dump a big DB to this 'document standard' without needing to worry about how to get headers into a 20 GB file.
Just some more thoughts
