[Tdwg-guid] (Fwd) Fwd: [TDWG] Announce: Proposal for "microformat" for mar

Tue Sep 26 19:29:07 CEST 2006

Hi Roger,

Supporting many representation formats would be really cool, but I have 
doubts as to whether the benefit of such a system will outweigh the costs. 

The initial goal behind modular schemata was that, if we had them, we 
could build a network of data providers and consumers that could carry 
any type of data (type independence).  In essence we would build a data 
network that would allow anyone to talk about anything.  This by itself 
is not an easy thing to do. 

Then the issue of representation language cropped up; first XML or RDF 
and now different types of XML, different RDF ontology languages, 
microformats, and semantic tags (why not JSON, SQL tables, serialized 
Java objects, C structs, and any other representation people might 
want?).  To resolve this issue without restricting representation 
language requires a huge increase in the scope of work; not only type 
independence, but independence of representation; building a data 
network that allows anyone to talk about anything in any language.

On the one hand you're absolutely right that such a system, if we could 
build it, might work as a bridge between different technologies.  But I 
worry that it will be a massively difficult and expensive undertaking 
that might not ever work.  I'll list a few of my concerns.

The first is whether or not it will support automatic translation:

1.) If the system does not do automatic translation between 
representation languages, then it's more like a schema repository.  In 
my view, schema repositories don't help to integrate tools that use 
different representation languages.  Instead each representation 
language becomes a silo.  The schema repository helps to document what 
has to be done when people need to write code that will cut across silos 
for a one-time task, but it doesn't actually encourage people to do so.

2.) If the system does automatic translation between representations 
then it adds a layer of complexity and a large processing and transport 
cost to each transaction on the network.  Imagine that you want to do 
some niche modeling.  Assume you have some taxonomic group in mind.  
First you'd have to find the names for this group, including synonyms.  
Next you'd have to get specimens and observations for these names.  So, 
two large sets of transactions are necessary to acquire the data you 
need.  Each name and observation provider might be using a different 
representation language.  When you contact them you have to figure out 
what representation they've given you and ship the data off to a 
translation service before you can merge the results.  This adds a large 
(at best linear) cost to acquiring data.  Additionally, someone has to 
pay for the huge amount of bandwidth used by the translation service.  
We can propose to use a local library instead of a remote service to do 
the translation, but this adds a burden on the developers of all 
software, requires that the library is updated often as new types and 
representation languages are adopted, and requires that the library 
exists in, or has bindings for, many programming languages; in short this is a 
software maintenance nightmare.

My second set of concerns is about the representations themselves:

3.) Each representation will require some effort to construct and 
maintain.  If the system will provide guidelines (rules expressed in 
natural language) for how to translate each representation into other 
representations, the cost (in effort, time, and money) will increase.  
If the system will provide automatic translation, the cost will increase 
further.  However, not all representations will be used equally.  If 
there are only two people who want TCS in format X, then is it worth the 
expense of providing it to them?  Who decides whether or not a 
particular representation format has enough demand to justify the work 
involved in supporting it?

4.) If the goal is to provide guidelines or automatic services for 
translation between representations of a given data type, then we have 
to map X * (X-1) * Y possible translations, where X is the number of 
allowed representations for a given data type and Y is the number of 
data types.  The TDWG biodiversity informatics ontology may end up with 
30 classes.  If we support 5 representations (maybe OWL, RDFS, semantic 
tags, XML metadata, and GML Feature Types) that's 5 * 4 * 30 = 600 
possible translation mappings to create and maintain.  Each time we have 
a new representation or a new data type we have to update the set of 
translation mappings.
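
The arithmetic in point 4 can be checked with a few lines of Python.  This 
is only a sketch of the counting argument; the function name 
translation_mappings is made up for illustration and assumes one directed 
mapping per ordered pair of representations, per data type.

```python
def translation_mappings(representations: int, data_types: int) -> int:
    """Count directed pairwise mappings: X * (X - 1) * Y."""
    return representations * (representations - 1) * data_types


# The case from the text: 5 representations, 30 classes.
print(translation_mappings(5, 30))  # 600

# Adding a sixth representation, or a 31st class, grows the set to maintain.
print(translation_mappings(6, 30))  # 900
print(translation_mappings(5, 31))  # 620
```

Note that adding one representation costs far more than adding one data 
type, because the count grows quadratically in X and only linearly in Y.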

My final set of concerns regards knowledge representation, modeling, and 
the expressive power of representation languages:

5.) Different representation languages have different language features 
and expressive powers.  For instance, there are things you can do with 
OWL that you can't do with semantic tags.  This is because OWL has 
language features for representing inheritance, property-value 
constraints, etc. that simply don't exist in the world of semantic 
tagging.  If we have to be able to represent the platonic ideal of our 
data types (as defined in the TDWG ontology) in any representation 
language and also have to be able to translate between representations, 
we run into a dilemma.

If we use all the features of a particular representation language we 
benefit from them when using that particular format.  The software that 
is constructed to natively consume that representation can use all of 
the available language features to automate tasks on behalf of the 
user.  However, translation becomes very difficult.  Imagine translating 
OWL-style inheritance into microformats or XML-Schema data type 
constraints into a system of semantic tags.  It's simply not possible.  
Translating between languages of differing expressive powers can be 
problematic.  The alternative approach is to use only those language 
features that are common to all representation languages.  In practice 
this usually means using only those features that exist in the most 
weakly-expressive language.  If our bag of representation languages 
includes both semantic tagging and OWL, then we're not really using the 
power of OWL.  In fact, if we have to use only the common features of 
the two, we might as well implement our OWL ontology so that there is 
only one type of class with a single property called "tagvalue".
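
The lowest-common-denominator effect can be sketched as a set 
intersection.  The feature inventories below are hypothetical and heavily 
simplified (real languages differ in far more detail), and the names 
FEATURES, common_features, and lost_in_translation are invented for this 
illustration only.

```python
# Hypothetical, simplified feature inventories (assumptions, not a survey).
FEATURES = {
    "OWL": {"labels", "classes", "inheritance", "property_constraints"},
    "XML Schema": {"labels", "datatype_constraints", "nesting"},
    "semantic tags": {"labels"},
}


def common_features(languages):
    """Features usable if every listed language must represent the data."""
    return set.intersection(*(FEATURES[name] for name in languages))


def lost_in_translation(source, target):
    """Features of the source language with no counterpart in the target."""
    return FEATURES[source] - FEATURES[target]


print(sorted(common_features(["OWL", "semantic tags"])))
# ['labels']
print(sorted(lost_in_translation("OWL", "semantic tags")))
# ['classes', 'inheritance', 'property_constraints']
```

Under these assumptions, supporting OWL and semantic tags together leaves 
only the "labels" feature, which is the "tagvalue" situation described 
above.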

6.) Different representation languages enable different functionality in 
the software that consumes them.  For instance, client software that 
consumes RDFS or OWL instances often expands searches to encompass 
instances of superclasses.  In other words, software designed to use 
semantic web technologies can do some of the work a human user might 
otherwise have to do by exploiting the features of semantic web 
languages.  Software designed to use semantic tags often doesn't do much 
more than search and statistical correlation between tag instances.  
This is quite powerful in its own way, but because semantic tags were 
designed to indicate the context of a document, not necessarily its 
contents, semantic tagging really only helps a user to locate documents 
of interest.  A document with tags is ultimately read by a human, not a 
machine.  Every representation language carries with it assumptions 
about how "documents" that are instances of that language will be used. 
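
As a rough illustration of point 6, here is a toy comparison of a 
hierarchy-aware search with a plain tag search.  It shows the common 
RDFS-style case in which a query for a class also returns instances of its 
subclasses.  The class names, records, and function names are all 
hypothetical and are not taken from the TDWG ontology.

```python
# Hypothetical toy hierarchy: child class -> parent class.
SUBCLASS_OF = {
    "Specimen": "Occurrence",
    "Observation": "Occurrence",
}

# Hypothetical records: identifier -> (class name, free-text tags).
INSTANCES = {
    "rec1": ("Specimen", {"specimen"}),
    "rec2": ("Observation", {"observation"}),
    "rec3": ("TaxonName", {"name"}),
}


def is_a(cls, target):
    """Walk up the hierarchy to see whether cls is target or a descendant."""
    while cls is not None:
        if cls == target:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def search_by_class(target):
    """Hierarchy-aware search: the software follows subclass links."""
    return sorted(k for k, (cls, _) in INSTANCES.items() if is_a(cls, target))


def search_by_tag(tag):
    """Tag search: plain matching, with no knowledge of any hierarchy."""
    return sorted(k for k, (_, tags) in INSTANCES.items() if tag in tags)


print(search_by_class("Occurrence"))  # ['rec1', 'rec2']
print(search_by_tag("occurrence"))    # []
print(search_by_tag("specimen"))      # ['rec1']
```

In the tag case the user has to know, and search for, each narrower term 
by hand; in the class case the software does that work.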

To navigate you need a fixed point.  To move the world you need a 
fulcrum.  Because representation languages provide different features 
and make different assumptions about how their instances will be used, 
it makes sense to use representation language as the fixed point of our 
designs and leave data types and service interfaces free to vary.  Some 
have argued that the TDWG ontology is the fixed point in our 
constellation of services, but I disagree.  It is the umbrella under 
which data integration will occur; there will always be extensions to 
the core ontology and it too will change over time as it is expanded.

Overall I think it's a laudable goal to support as many representation 
languages as possible, but there are so many headaches and compromises 
involved that we may end up with an expensive solution that, because it 
only supports the lowest common denominator of functionality, doesn't 
really work right for anybody.  A case in point is the current 
discussion of namespaces.  In order to make namespaces work across the 
widest range of representation languages, it's been proposed that they 
can no longer be used as packages to logically partition the larger 
ontology.  This makes it harder to manage extensions to the ontology and 
makes it likely that we'll end up using 
veryLongClassAndPropertyNamesToTryToAvoidNamespaceClashes.  And you 
still can't represent namespaces in semantic tags.

It's hard enough to write software that can cope with any data type and 
I'd rather spend energy, time, and money on getting it right with only 
one or two feature-rich representations.  What I'd really like to see is 
a network of heterogeneously typed, highly integrated data objects and a 
rich set of services that operate on them.  Once this is built, the real 
fun can begin, creating software that uses these data to answer 
important scientific questions.

-Steve

Roger Hyam wrote:
>
> Thanks for forwarding this Sally.
>
> What I am proposing at St Louis - though I seem to have been having to 
> propose it long before - is that we have an application for managing 
> the ontology that will expose the underlying semantics in multiple 
> 'formats', i.e. as RDFS or OWL ontologies, as GML application schemas, 
> as custom XML Schemas, as OBO ontologies, etc. I see no other way of 
> integrating multiple technologies. (Suggested alternatives welcome).
>
> One of the things on my list is micro formats along with tagging. It 
> seems crazy to define a 'specificEpithet' in a TDWG ontology and then 
> not use exactly the same concept in a micro format or as a tag.
>
> So this is timely. I just can't act on it very well before St Louis. 
> I'll add something to the wiki page to flag my/our interest.
>
> Thanks,
>
> Roger
>
>
> Sally Hinchcliffe wrote:
>> Hi all
>>
>> This is probably on the wrong list (Maybe TAG?) but it strikes me 
>> that what this guy needs is an ontology that he can use in his 
>> microformats ...
>>
>> Possibly an example of a real world need for ontologies ?
>>
>> Sally
>>
>> ------- Forwarded message follows -------
>> Date sent:          Tue, 26 Sep 2006 09:34:04 -0000
>> To:                 <sh00kg at rbgkew.org.uk>
>> Subject:            Fwd: [TDWG] Announce: Proposal for "microformat" 
>> for marking-up taxonomic names in HTML: comments and contributions 
>> sought
>> From:               <M.Jackson at kew.org>
>> Send reply to:      M.Jackson at rbgkew.org.uk
>>
>> Sally,
>>
>> Do you think you might respond to this? Just curious what you think.
>>
>> Mark
>> ----
>> Forwarded From: Andy Mabbett <andy at pigsonthewing.org.uk>
>>
>>  
>>> Hello - my first post to this mailing list.
>>>
>>> I'm not a taxonomist, but I've been told by one that you might be
>>> interested in recent proposals for a formula (a "microformat"
>>> <http://microformats.org>) for marking-up, in HTML, the names of
>>> species (and other ranks, varieties, hybrids, etc.).
>>>
>>> Microformats are a way of adding additional, simple markup to
>>> human-readable data items on web pages, using common and open HTML
>>> standards, so that the information can be extracted by software and
>>> indexed, searched for, saved, cross-referenced or aggregated.
>>> Microformats are also open standards, freely available for anyone to
>>> use.
>>>
>>> The proposed format respects all existing biological taxonomies, and is
>>> not intended to change or supplant any of them - it merely provides
>>> webmasters with a method of either:
>>>
>>>    1)   marking-up a taxonomical name (or taxon-common name pair) in
>>>         such a way that its components can be recognised by computers
>>>
>>> or
>>>
>>>    2)   marking up a common name, so as to associate with it a
>>>         taxonomical name, in such a way that the latter's components
>>>         can be recognised by computers
>>>
>>> For instance, if I mark up a list of common names on a page I maintain:
>>>
>>>    <http://www.westmidlandbirdclub.com/staffs/tittesworth/latest.htm>
>>>
>>> using that microformat, a visitor might have a browser tool which lists
>>> all the species on the page, sorted into alphabetical order within
>>> taxonomic class, or in taxonomic order, and then creates links to, say
>>> (for Joe Public) their entries in Wikipedia, or the British Trust for
>>> Ornithology, or (for scientists) some academic database of the user's
>>> choosing.
>>>
>>> Early thoughts on the format are on an editable "wiki", here:
>>>
>>>         <http://microformats.org/wiki/species>
>>>
>>> Please feel free to participate - the proposal needs both messages of
>>> support (particularly from people or organisations who have websites on
>>> which they might use them) and, especially, comments and constructive
>>> criticisms - does the proposal understand and use taxonomy correctly; is
>>> the terminology right, are there any omissions or overlooked, unusual
>>> naming conventions?
>>>
>>> You can use the above wiki, or the microformats mailing list:
>>>
>>>         <http://microformats.org/wiki/mailing-lists>
>>>
>>> and/ or please feel free to pass this e-mail to other interested
>>> parties.
>>>
>>> Thank you.
>>>
>>> -- 
>>> Andy Mabbett
>>> Birmingham, England
>>>
>>> _______________________________________________
>>> TDWG mailing list
>>> TDWG at mailman.nhm.ku.edu
>>> http://mailman.nhm.ku.edu/mailman/listinfo/tdwg