There is a subtext running in this discussion - whether part of our scope is the creation of lexicons or standard name-spaces - that to me is causing confusion.
OK, I'll clarify what I meant by the following two quotes - there is a difference, but it may not be clear.
Defining, and agreeing upon these standard notations/descriptions are A FUNDAMENTAL PART (my caps) of specifying this new format, and one that isn't solved simply by deciding to use XML (for example). It's part of the fundamental design and modelling, and is therefore something that should be addressed early on.
Here I'm referring to standard(ised) ways of recording the data that gets stored in the format - e.g. "leaf shapes will be selected according to the IAPT plain shape chart", "all dates will be in dd-mmm-yyyy format". To make the data as interoperable as possible, agreeing on standard techniques and measures will help. This is not related to the definition of the format itself - it just defines how it's used.
My main point being that any data format (new or otherwise) isn't beneficial by itself, unless you know how the data was collected. If that can rely on standard methods then fine. If not then the method of collection needs to be expressed.
I'd say that the DELTA approach - of avoiding domain (i.e. zoology, virology, etc) specific notations in the format has worked well. And I think this is the level that any initial work should be pitched at. i.e. the data format should encode taxonomic *data* - just as DELTA does. Any domain specific schema can be layered on top of this, or include it. Begin with capturing the relevant data just as DELTA does, and then progress from there.
Here I agree with the points you make below - DELTA and the other formats work because they don't attempt to enforce anything lexically. Let's pitch the effort at that level. It would be a mammoth and probably impossible task to capture all possible data usages.
From that starting point particular groups can then take that standard and say "when we're recording leaf shapes they will be selected according to the IAPT plain shape chart". This is a layer above a standard data format. Here we're defining its usage. This is where the interoperability problem (sharing data between groups) will be solved.
As I say, it may not be possible to make bold statements like "use the IAPT plain shape chart" in every domain, but it may be possible in some domains, or in parts of some domains.
This is the layer at which the interesting stuff begins; without it you've just got a file format which happens to look the same for everyone. With agreed usage guidelines, it becomes possible to share data, and make the kind of inferences and extrapolations that we're led to believe RDF, etc. will be able to give us.
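To make the two layers a bit more concrete, here is a rough sketch in Python. Everything in it is made up for illustration: the `Observation` and `UsageGuideline` names are hypothetical, and the list of shape terms is a placeholder, not the actual IAPT chart. The general format only fixes the structure of a record; the guideline sits on top and says which states a group has agreed to use.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


# Layer 1: a general, domain-neutral record. It fixes the structure only,
# not which character names or state values are allowed.
@dataclass
class Observation:
    taxon: str
    character: str
    state: str


# Layer 2: usage guidelines agreed by a particular group (hypothetical).
@dataclass
class UsageGuideline:
    name: str
    allowed_states: Dict[str, Set[str]] = field(default_factory=dict)

    def check(self, obs: Observation) -> bool:
        """Return True if the observation follows this guideline.

        Characters the guideline says nothing about are accepted as-is.
        """
        allowed = self.allowed_states.get(obs.character)
        return allowed is None or obs.state in allowed


# Placeholder vocabulary, NOT the real IAPT plain shape chart.
botany = UsageGuideline(
    name="example botany guideline",
    allowed_states={"leaf shape": {"ovate", "lanceolate", "linear"}},
)

a = Observation("Taxon A", "leaf shape", "ovate")
b = Observation("Taxon B", "leaf shape", "sort of egg-shaped")
c = Observation("Taxon C", "flower colour", "red")
```

All three records are valid in the general format; only `a` and `c` pass `botany.check(...)`, because `b` uses a state the group has not agreed on.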
So do we or don't we? Am I misinterpreting these that they seem to say opposite things?
Bottom line - I'm talking about two different layers:
- a general format for data capture
- (possibly domain specific) standardised usage guidelines
Does that help?
The only problem with competing hierarchies is if we are trying to standardise and resolve the conflicts. If every worker resolves for their own project what to call bracts, this is not a problem for us.
Exactly - each worker can resolve their own hierarchy. But if a hierarchy could be agreed between groups then you start to get more use from your data. However, any agreement isn't likely to occur between domains (zoology, bryology, virology), but possibly within a domain.
For the record, all the current systems (DELTA, LucID, NEXUS etc) enforce nothing lexically, they merely enforce a particular way of representing data. Two data sets for similar groups of plants may contain entirely different characters, or the same characters worded in different ways, or the same characters resolved into states in different ways, or (occasionally) identical characters. Comparing and combining datasets automatically is thus impossible. This seems such a shame, but is it perhaps unavoidable?
Possibly - but the data format doesn't solve that. It's how you use it.
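A small, hypothetical illustration of that point in Python (the data sets, the `shared_characters` helper and the synonym table are all invented for this example): two data sets in exactly the same format have nothing in common as far as a program can tell, until someone agrees how the character names map onto each other.

```python
# Two data sets in the same format; only the character wording differs.
dataset_1 = {"Taxon A": {"leaf shape": "ovate"}}
dataset_2 = {"Taxon B": {"shape of leaves": "ovate"}}


def shared_characters(d1, d2, synonyms=None):
    """Character names found in both data sets, after optional renaming."""
    synonyms = synonyms or {}

    def names(d):
        # Collect every character name, mapped through the agreed synonyms.
        return {synonyms.get(ch, ch) for chars in d.values() for ch in chars}

    return names(d1) & names(d2)


# Without any agreement the format alone finds no overlap.
print(shared_characters(dataset_1, dataset_2))  # set()

# With an agreed mapping (a usage guideline) the characters line up.
agreed = {"shape of leaves": "leaf shape"}
print(shared_characters(dataset_1, dataset_2, agreed))  # {'leaf shape'}
```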
So, I'd like to suggest that we try to develop a standardised data representation, but put no constraints on character definitions whatsoever.
We're in agreement here - I'm just suggesting that while there are no constraints codified in the standard, they still need to be expressed (if only in usage guidelines) if you want to be able to share/exchange data meaningfully.
Cheers,
L.