The notion of star schemas fits very nicely with what I had in mind for the RDF vocabularies. It would be good if each CSV file in the star corresponded to a class in the vocabulary and the columns in the CSV files mapped to properties in the vocabulary (or some other common vocabulary such as VCARD or DC). It would then be trivial to map the star to a semantic representation (such as the RDF returned from an LSID) or vice versa.
We can evolve the vocabularies to help this along.
This is probably all obvious but worth stating.
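As a minimal sketch of the mapping described above, the following Python turns each CSV row into subject/predicate/object triples, with one class per file and one property per column. The class URI, the "uri" identifier column, the sample row and the column-to-property table are assumptions made for illustration; they are not part of any agreed vocabulary.

```python
import csv
import io

DWC = "http://rs.tdwg.org/dwc/dwcore/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical class URI for rows of this CSV file (assumption).
RECORD_CLASS = DWC + "Record"

# Hypothetical mapping from CSV column names to vocabulary properties.
COLUMN_TO_PROPERTY = {
    "ScientificName": DWC + "ScientificName",
    "Family": DWC + "Family",
}


def csv_to_triples(text, id_column="uri"):
    """Yield (subject, predicate, object) tuples for every non-empty cell."""
    for row in csv.DictReader(io.StringIO(text)):
        subject = row[id_column]
        # One CSV file corresponds to one class in the vocabulary.
        yield (subject, RDF_TYPE, RECORD_CLASS)
        for column, prop in COLUMN_TO_PROPERTY.items():
            value = row.get(column)
            if value:
                yield (subject, prop, value)


# Made-up sample data, not taken from any real provider.
SAMPLE = (
    "uri,ScientificName,Family\n"
    "http://example.org/specimen/1,Aster alpinus,Asteraceae\n"
)
triples = list(csv_to_triples(SAMPLE))
```

Going the other way (RDF back to CSV) would use the same table in reverse, which is why a one-to-one column/property correspondence keeps the mapping trivial.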
All the best,
Roger
On 20 May 2008, at 16:36, Markus Döring wrote:
Renato, complex data can also be represented by tab files, with a file for each extension that has a pointer in the first column. That is what we originally had in mind with the star scheme.
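A rough sketch of how a consumer might reassemble such a star: one core tab file plus one tab file per extension, where the first column of each extension row points at the core record. The file contents, the identifier "s1" and the extension name "identification" are invented for the example.

```python
import csv
import io
from collections import defaultdict


def read_tab(text):
    """Read tab-separated text into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def join_star(core_text, extensions):
    """Attach extension rows to core records via the first-column pointer.

    extensions maps an extension name to its tab-separated text.
    """
    records = {}
    for row in read_tab(core_text):
        records[row[0]] = {"core": row[1:], "extensions": defaultdict(list)}
    for name, text in extensions.items():
        for row in read_tab(text):
            record = records.get(row[0])
            if record is not None:  # ignore rows whose pointer matches no core record
                record["extensions"][name].append(row[1:])
    return records


# Made-up sample files.
CORE = "s1\tAster alpinus subsp. parviceps\n"
IDENT = (
    "s1\tAster alpinus\t1913-03-12\n"
    "s1\tAster alpinus subsp. parviceps\t2003-09-07\n"
)
star = join_star(CORE, {"identification": IDENT})
```

Each core record ends up with a list of rows per extension, so one-to-many data such as repeated identifications does not need a nested format.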
Markus
On 20 May, 2008, at 17:16, Renato De Giovanni wrote:
Hi Markus,
Since DarwinCore is a generic list of elements that can be used by any application schema, I think it's OK to use them in the new schema that you're suggesting.
I agree that ideally we should try to define and use a common format for index files, although it seems that we will have at least two: csv for simple data and probably another one in XML for complex data, right?
Regarding the XML for complex data, if you manage to find a generic schema that can be used in different contexts (not only biodiversity data) then I agree we could avoid extra attributes in the respective capabilities element. Otherwise, I would prefer to see some extra attribute (such as "outputModel") giving more information about the XML. Since TAPIR was designed to be generic, this should not be a problem because clients and networks are already free to decide and to mandate specific TAPIR capabilities. This doesn't mean that there will be lots of formats for index files. It's a matter of agreeing on a common format but still keeping the protocol generic to allow different uses by other communities.
I also agree we could advertise the index file through some new TAPIR element instead of using the custom slot.
Best Regards,
Renato
On 16 May 2008 at 10:29, Markus Döring wrote:
Renato, I was thinking along those lines too. It would be nice for TAPIR providers to announce the availability of the index files. I wouldn't mind adding it even to the regular TAPIR schema once it has proven to work with the custom slot approach you have given.
Regarding star-shaped data, I would prefer to agree on one format instead of allowing different ones, to save consumers from this pain. There is a straightforward XML serialisation for this scheme that we could use instead of tab files:
<record uri=""> <dwc:property1 /> <dwc:property2 /> <extA:record> <extA:property1 /> <extA:property2 /> </extA:record> <extB:record> <extB:property1 /> <extB:property2 /> <extB:record> <record>
The advantage is that it can be produced by TAPIR software, and XML serialisation is required for many services anyway, e.g. RSS. But then again, the whole point of the index files is that they are easy to generate and consume. On the other hand, this XML structure is pretty simple to process and can be generated straight away from databases like SQL Server that have XML output, without the need for scripting.
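To give an idea of how little code the XML serialisation needs, here is a sketch using Python's standard xml.etree.ElementTree. The extA namespace URI, the record URI and the property values are placeholders I made up; only the overall record/extension nesting follows the structure above.

```python
import xml.etree.ElementTree as ET

DWC = "http://rs.tdwg.org/dwc/dwcore/"
EXT_A = "http://example.org/extA/"  # hypothetical extension namespace (assumption)

# Ask ElementTree to use readable prefixes when serialising.
ET.register_namespace("dwc", DWC)
ET.register_namespace("extA", EXT_A)


def build_record(uri, core, extension_rows):
    """Build one <record> element with core properties and nested extension records."""
    record = ET.Element("record", {"uri": uri})
    for name, value in core.items():
        ET.SubElement(record, f"{{{DWC}}}{name}").text = value
    for row in extension_rows:
        ext = ET.SubElement(record, f"{{{EXT_A}}}record")
        for name, value in row.items():
            ET.SubElement(ext, f"{{{EXT_A}}}{name}").text = value
    return record


# Placeholder values for illustration only.
record = build_record(
    "http://example.org/specimen/1",
    {"property1": "a", "property2": "b"},
    [{"property1": "x"}, {"property1": "y"}],
)
xml_text = ET.tostring(record, encoding="unicode")
```

The same function works for any number of extension rows, so the XML form carries the one-to-many structure of the star in a single document per record.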
That touches a different issue I am facing with the star scheme, by the way. I have created an identification extension for Darwin Core that holds the historical list of identification events and their outcomes. This is a YAML section of the metafile describing the columns for this extension through fully qualified concepts à la TAPIR:
identification:
- http://rs.tdwg.org/dwc/dwcore/ScientificName
- http://rs.tdwg.org/dwc/dwcore/AuthorYearOfScientificName
- http://rs.tdwg.org/dwc/dwcore/Family
- http://rs.tdwg.org/dwc/dwcore/IdentificationQualifier
- http://rs.tdwg.org/dwc/curatorial/DateIdentified
- http://rs.tdwg.org/dwc/curatorial/IdentifiedBy
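As a sketch of how a consumer could use such a metafile section, the Python below reads the concept list and labels the values of one extension row with it. The parser only handles the simple "name:" plus "- item" shape shown above and is not a general YAML parser. The sample row is made up, and it is an assumption that the pointer column has already been stripped so that the remaining values line up with the six concepts.

```python
METAFILE = """\
identification:
- http://rs.tdwg.org/dwc/dwcore/ScientificName
- http://rs.tdwg.org/dwc/dwcore/AuthorYearOfScientificName
- http://rs.tdwg.org/dwc/dwcore/Family
- http://rs.tdwg.org/dwc/dwcore/IdentificationQualifier
- http://rs.tdwg.org/dwc/curatorial/DateIdentified
- http://rs.tdwg.org/dwc/curatorial/IdentifiedBy
"""


def parse_metafile(text):
    """Parse 'section:' headers followed by '- concept' items into a dict of lists."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("- "):
            # List item: a fully qualified concept for the current section.
            sections[current].append(stripped[2:].strip())
        elif stripped.endswith(":"):
            # Section header naming an extension.
            current = stripped[:-1]
            sections[current] = []
    return sections


def label_row(concepts, row):
    """Pair each tab-separated value with the concept describing its column."""
    return dict(zip(concepts, row.split("\t")))


concepts = parse_metafile(METAFILE)["identification"]
# Made-up row; the empty field is the IdentificationQualifier.
labelled = label_row(concepts, "Aster alpinus\tL.\tAsteraceae\t\t1913-03-12\tKarl Marx")
```

In practice a proper YAML library would be the safer choice for reading the metafile; the hand-written parser is only there to keep the example free of dependencies.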
When creating this I realised that pretty much all the concepts I was interested in already existed in Darwin Core or the curatorial extension. Wouldn't it be wise to reuse those concepts? Or are they strictly tied to the idea of a current identification, and therefore can't be used for historical ones? This is probably more of a Darwin Core question than a TAPIR one, but we are all on this list anyway ...
The XML in that case would look something like this:
<record uri="http://mygarden.com/specimen/plants/54321-423-43-54-6-3-24-44">
  <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
  ...
  <ident:record>
    <dwc:ScientificName>Aster alpinus</dwc:ScientificName>
    <dwc:AuthorYearOfScientificName>L.</dwc:AuthorYearOfScientificName>
    <dwc:Family>Asteraceae</dwc:Family>
    <cur:DateIdentified>1913-03-12</cur:DateIdentified>
    <cur:IdentifiedBy>Karl Marx</cur:IdentifiedBy>
  </ident:record>
  <ident:record>
    <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
    <dwc:AuthorYearOfScientificName>Novopokr.</dwc:AuthorYearOfScientificName>
    <dwc:Family>Asteraceae</dwc:Family>
    <cur:DateIdentified>2003-09-07</cur:DateIdentified>
    <cur:IdentifiedBy>Keith Richards</cur:IdentifiedBy>
  </ident:record>
</record>
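For what it is worth, a consumer could read the identification history out of such a record along the following lines. This is only a sketch: the namespace declarations are added by me because the example above omits them, the ident namespace URI is a made-up placeholder, and the dwc and cur namespace URIs are assumed from the concept URIs in the metafile.

```python
import xml.etree.ElementTree as ET

NS = {
    "dwc": "http://rs.tdwg.org/dwc/dwcore/",       # assumed from the concept URIs
    "cur": "http://rs.tdwg.org/dwc/curatorial/",   # assumed from the concept URIs
    "ident": "http://example.org/identification/",  # hypothetical placeholder
}

SAMPLE = """\
<record xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/"
        xmlns:cur="http://rs.tdwg.org/dwc/curatorial/"
        xmlns:ident="http://example.org/identification/"
        uri="http://mygarden.com/specimen/plants/54321-423-43-54-6-3-24-44">
  <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
  <ident:record>
    <dwc:ScientificName>Aster alpinus subsp. parviceps</dwc:ScientificName>
    <cur:DateIdentified>2003-09-07</cur:DateIdentified>
    <cur:IdentifiedBy>Keith Richards</cur:IdentifiedBy>
  </ident:record>
  <ident:record>
    <dwc:ScientificName>Aster alpinus</dwc:ScientificName>
    <cur:DateIdentified>1913-03-12</cur:DateIdentified>
    <cur:IdentifiedBy>Karl Marx</cur:IdentifiedBy>
  </ident:record>
</record>
"""


def identification_history(xml_text):
    """Return the identification events of one record, oldest first."""
    root = ET.fromstring(xml_text)
    history = []
    for rec in root.findall("ident:record", NS):
        history.append({
            "name": rec.findtext("dwc:ScientificName", namespaces=NS),
            "date": rec.findtext("cur:DateIdentified", namespaces=NS),
            "by": rec.findtext("cur:IdentifiedBy", namespaces=NS),
        })
    # ISO 8601 dates sort correctly as plain strings; missing dates sort first.
    return sorted(history, key=lambda item: item["date"] or "")


history = identification_history(SAMPLE)
latest = history[-1]
```

Treating the most recent event as the current identification is one possible convention; whether the reused concepts should carry that meaning is exactly the open question above.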
Markus
tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir