(GEN) Lexicons

Thu Dec 9 09:34:58 CET 1999

> There is a subtext running in this discussion - whether part of
> our scope is the creation of lexicons or standard name-spaces - that to me
> is causing confusion.

OK, I'll clarify what I meant by the following two quotes - there is
a difference, but it may not be clear.

> >Defining, and agreeing upon these standard notations/descriptions are
> >A FUNDAMENTAL PART (my caps) of specifying this new format, and
> >one that isn't
> >solved simply by deciding to use XML (for example). Its part of the
> >fundamental design and modelling, and is therefore something
> >that should be addressed early on.

Here I'm referring to standard(ised) ways of recording the data that gets
stored in the format - e.g. "leaf shapes will be selected
according to the IAPT plain shape chart", "all dates will be in dd-mmm-yyyy
format". To make the data as interoperable as possible, agreeing on
standard techniques and measures will help. This is not related to
the definition of the format itself - it just defines how it's used.
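
To make the dates example concrete, here is a rough Python sketch of
checking a value against an agreed convention. It assumes the convention
means a two-digit day, an English three-letter month and a four-digit year;
the names AGREED_DATE_FORMAT, parse_agreed_date and follows_convention are
made up for illustration and are not part of any existing format.

```python
from datetime import date, datetime

# Assumed reading of the agreed convention: day, English month
# abbreviation, four-digit year (e.g. "09-Dec-1999").
AGREED_DATE_FORMAT = "%d-%b-%Y"


def parse_agreed_date(text: str) -> date:
    """Parse a date recorded under the agreed convention."""
    return datetime.strptime(text, AGREED_DATE_FORMAT).date()


def follows_convention(text: str) -> bool:
    """Report whether a recorded value follows the agreed convention."""
    try:
        parse_agreed_date(text)
    except ValueError:
        return False
    return True
```

The format itself would accept either string; only the usage convention
says which one is acceptable.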

My main point is that any data format (new or otherwise) isn't
beneficial by itself unless you know how the data was collected.
If collection can rely on standard methods then fine. If not, then the
method of collection needs to be expressed.

> >I'd say that the DELTA approach - of avoiding domain (i.e. zoology,
> >virology, etc) specific notations in the format has worked well.
> >And I think this is the level that any initial work should be pitched at.
> >i.e. the data format should encode taxonomic *data* - just as DELTA
> >does. Any domain specific schema can be layered on top of this, or
> >include it. Begin with capturing the relevant data just as DELTA does,
> >and then progress from there.

Here I agree with the points you make below - DELTA and the other formats
work because they don't attempt to enforce anything lexically. Let's pitch
the effort at that level. It would be a mammoth and probably impossible
task to capture all possible data usages.

From that starting point particular groups can then take that standard
and say "when we're recording leaf shapes they will be selected according to
the IAPT plain shape chart". This is a layer above a standard data format.
Here we're defining its usage. This is where the interoperability problem
(sharing data between groups) will be solved.

As I say, it may not be possible to make bold statements like
"use the IAPT plain shape chart" in every domain, but it may be possible
in some domains, or in parts of some domains.

This is the layer at which the interesting stuff begins; without it
you've just got a file format which happens to look the same for everyone.
With agreed usage guidelines, it becomes possible to share data, and make
the kind of inferences and extrapolations that we're led to believe
RDF, etc. will be able to give us.

> So do we or don't we? Am I misinterpreting these that they seem to say
> opposite things?

Bottom line - I'm talking about two different layers:

- a general format for data capture
- (possibly domain specific) standardised usage guidelines

Does that help?
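
As a rough illustration of the two layers (a Python sketch only; the class
names, the "example leaf shape chart" and its permitted states are all
hypothetical, not taken from DELTA or any real chart): the first layer
stores whatever the worker recorded, and the second layer is a separate,
optional check against an agreed vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Layer 1: general data capture. No vocabulary is enforced here."""
    taxon: str
    characters: dict[str, str] = field(default_factory=dict)


@dataclass
class UsageGuideline:
    """Layer 2: an agreed (possibly domain specific) set of permitted states."""
    name: str
    allowed: dict[str, set[str]]

    def problems(self, record: Record) -> list[str]:
        """List the recorded values that fall outside the agreed vocabulary."""
        issues = []
        for character, permitted in self.allowed.items():
            value = record.characters.get(character)
            if value is not None and value not in permitted:
                issues.append(
                    f"{record.taxon}: {character}={value!r} not in {self.name}"
                )
        return issues


# Hypothetical guideline and records, for illustration only.
leaf_chart = UsageGuideline(
    "example leaf shape chart",
    {"leaf shape": {"ovate", "elliptic", "oblong"}},
)
taxon_a = Record("Taxon A", {"leaf shape": "ovate"})
taxon_b = Record("Taxon B", {"leaf shape": "sort of egg-shaped"})
```

Both records are valid as far as the format is concerned; only the
guideline flags the second one, and a group that hasn't agreed to the
guideline simply never applies it.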

> The only problem with competing hierarchies is if we are trying to
> standardise and resolve the conflicts. If every worker resolves for their
> own project what to call bracts, this is not a problem for us.

Exactly - each worker can resolve their own hierarchy. But if a hierarchy
could be agreed between groups then you start to get more use from your
data. However, any agreement isn't likely to occur between domains (zoology,
bryology, virology), but possibly within a domain.

> For the record, all the current systems (DELTA, LucID, NEXUS etc) enforce
> nothing lexically, they merely enforce a particular way of representing
> data. Two data sets for similar groups of plants may contain entirely
> different characters, or the same characters worded in different ways, or
> the same characters resolved into states in different ways, or
> (occasionally) identical characters. Comparing and combining datasets
> automatically is thus impossible. This seems such a shame, but is
> it perhaps unavoidable?

Possibly - but the data format doesn't solve that. It's how you use it.
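
A small Python sketch of what I mean (the datasets, character names and
the agreed_names mapping are all invented for illustration): two datasets
in the same format share very little until their character names are
mapped onto an agreed wording.

```python
def shared_characters(first: dict, second: dict) -> set:
    """Characters that can be compared directly, i.e. identically named."""
    return set(first) & set(second)


def normalise(dataset: dict, mapping: dict) -> dict:
    """Rename characters to the agreed wording; unknown names are kept as-is."""
    return {mapping.get(name, name): state for name, state in dataset.items()}


# Two hypothetical datasets recording the same character in different words.
set_one = {"leaf shape": "ovate", "petal colour": "white"}
set_two = {"shape of leaves": "egg-shaped", "petal colour": "white"}

# Hypothetical usage guideline: the agreed name for each known variant.
agreed_names = {"shape of leaves": "leaf shape"}

before = shared_characters(set_one, set_two)
after = shared_characters(set_one, normalise(set_two, agreed_names))
```

Here before contains only "petal colour", while after also contains
"leaf shape". Note that the states ("ovate" vs "egg-shaped") still
disagree, so the states would need the same kind of agreement as the names.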

> So, I'd like to suggest that we try to develop a standardised data
> representation, but put no constraints on character definitions
> whatsoever.

We're in agreement here - I'm just suggesting that while there
are no constraints codified in the standard, they still need to
be expressed (if only in usage guidelines) if you want to be
able to share/exchange data meaningfully.

Cheers,

L.