[tdwg-tag] TAG Road Map

Steve Perry smperry at ku.edu
Wed Aug 9 18:23:13 CEST 2006


Hi Roger & Javier,

Roger, thanks for sending the link to the technical roadmap.  Reading 
that document, followed by Javier's comments and your responses, made me 
think of a few questions and comments of my own:

>> -Have you considered something about controlled vocabularies in the
>> semantic hub? Or maybe what I prefer to call managed controlled
>> vocabularies. For example, in ABCD there is a concept called
>> "KindOfRecord"; it is not a controlled vocabulary, it is just free
>> text. It would be too difficult to provide a fixed list of terms for
>> it, so it would be great if the list could somehow be created by the
>> community. Let's say that I want to map my database to this field: I
>> could get a list of proposed terms already in use, and if none of them
>> satisfies me then I can create my own. It is a little bit like tagging
>> in a controlled way. I love the del.icio.us example: they propose tags
>> and most of the time I use them, and by doing this the data is much
>> more accessible because the tags have not exploded. The opposite is
>> what is happening now in ABCD: everybody uses a different term for the
>> same thing, and the unified data becomes useless.
>>
> There will be instances of classes in Tonto. I should have mentioned 
> that. Chatting to Rob Gales about it, it seems a good way of doing 
> controlled vocabularies. They will be extensible because Tonto will 
> always be capable of change. Also, in some languages - OWL etc. - you 
> could always define your own instances outside of Tonto - but it would 
> depend on how we do the coding for the XML Schema based renderings of 
> the semantics.
>
> Populating drop-down menus with information out of Tonto when someone 
> is mapping a data source is the ultimate goal - the dream!
First, a silly point.  The term "controlled vocabulary" is used by some 
as a synonym for "ontology".  Can we find or coin a term for concepts 
with a range of enumerated values that isn't already overloaded?

I agree with Rob that any "concept" with enumerated values ought to have 
those values represented as instances with assigned GUIDs.  Some values 
will be blessed by virtue of being stored in the core model, but people 
will be free to invent their own without breaking existing software or 
mapping rules (at the expense of losing interoperability with software 
that only understands the approved values).
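
To make that concrete, here is a minimal sketch (in Python with rdflib; 
the LSIDs, namespace, and class name are all hypothetical, not part of 
any approved model) of how an enumerated value could be published as a 
first-class instance with its own GUID, so that locally coined values 
look structurally identical to blessed ones:

    from rdflib import Graph, Literal, Namespace, RDF, RDFS, URIRef

    # Hypothetical namespace for the core model; the real one is TBD.
    TDWG = Namespace("http://example.org/tdwg/ontology#")

    g = Graph()

    # A "blessed" value stored in the core model, identified by a GUID.
    blessed = URIRef("urn:lsid:example.org:kindofrecord:specimen")
    g.add((blessed, RDF.type, TDWG.KindOfRecord))
    g.add((blessed, RDFS.label, Literal("specimen", lang="en")))

    # A locally coined value: same pattern, different naming authority.
    local = URIRef("urn:lsid:myinstitution.org:kindofrecord:tissue")
    g.add((local, RDF.type, TDWG.KindOfRecord))
    g.add((local, RDFS.label, Literal("tissue sample", lang="en")))

    # Software that only trusts the approved list can filter on the
    # authority part of the GUID; everything else just sees two
    # instances of the same class.
    print(g.serialize(format="turtle"))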
>> -In the implementation section you say something like "Data Providers
>> must map their data to these views", referring to views from the
>> semantic hub. This is actually what we are trying to avoid. TAPIR was
>> created from the beginning with the vision of data providers mapping
>> their databases once and being accessible through different views that
>> are explicitly declared in the request. We have changed the name now
>> and we are calling them outputModels.
>> On the other hand, you know that WASABI and PyWrapper are now becoming
>> multi-protocol. That means that we want providers to map their
>> databases once and make the data available through different protocols.
>>
> The plan is that data mapping only has to be done to one of the views 
> that Tonto has onto its internal semantics; the other 
> views/representations could then be used as outputModels (or custom 
> output models could be created by clients, etc.). The goal is 
> definitely to map only once - a single set of semantics - but then 
> represent it in multiple ways. Tonto could provide a view of the 
> semantics that a graphical tool could then pick up to help someone 
> build a mapping file - the dream...
This ties in with a few questions I have about the semantic hub and Tonto.

First, since some of us are also involved in creating data models that 
may one day be added to the semantic hub, is there a defined list of the 
common subset of modeling constructs (between UML, XSD, OWL, and RDFS) 
and suggestions about how to implement them?  Are there discussions 
about which constructs will be dropped and about the trade-offs of 
different implementations?

For example, it could be argued that N-ary associations could be 
implemented in RDFS and OWL (and perhaps in XML Schemas that can 
describe directed labeled graphs through the use of GUIDs), but the 
implementation the research community recommends for N-ary associations 
in RDF-based systems is reification.  As implementors of systems that 
work with RDF-based data, we feel that reification is not the way to go, 
and that it may be better to drop support for N-ary associations than 
to put in place a "flawed work-around" like reification.  Just off the 
top of my head, there are also issues with:

-modeling arbitrary cardinality, and cardinality on one or both sides 
of an association;
-primitive type mapping and data type promotion;
-modeling aggregation and/or composition (sequences and bags mean 
anonymous nodes, which don't play nice in a system that uses GUIDs to 
name resources);
-implementing many of these constructs in XSD, which was designed to 
describe trees (not graphs) and has no built-in notion of global 
identifiers.
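
To show why we feel that way about reification (using Python and rdflib 
again; every identifier and property name below is invented for the 
example), compare a three-way relationship - a specimen identified as a 
taxon by an agent - expressed via RDF reification versus via an 
intermediate resource that carries its own GUID:

    from rdflib import Graph, Namespace, RDF, URIRef

    EX = Namespace("http://example.org/terms#")  # hypothetical terms

    specimen = URIRef("urn:lsid:example.org:specimen:123")
    taxon = URIRef("urn:lsid:example.org:taxon:456")
    agent = URIRef("urn:lsid:example.org:agent:789")

    # Option 1: reification. The statement itself becomes four triples,
    # and the third participant hangs awkwardly off the rdf:Statement.
    g1 = Graph()
    stmt = URIRef("urn:lsid:example.org:statement:1")
    g1.add((stmt, RDF.type, RDF.Statement))
    g1.add((stmt, RDF.subject, specimen))
    g1.add((stmt, RDF.predicate, EX.identifiedAs))
    g1.add((stmt, RDF.object, taxon))
    g1.add((stmt, EX.identifiedBy, agent))

    # Option 2: an intermediate "Identification" resource. One node
    # with a GUID represents the association; each participant is an
    # ordinary property, and clients can resolve or cite the
    # association directly.
    g2 = Graph()
    ident = URIRef("urn:lsid:example.org:identification:1")
    g2.add((ident, RDF.type, EX.Identification))
    g2.add((ident, EX.ofSpecimen, specimen))
    g2.add((ident, EX.toTaxon, taxon))
    g2.add((ident, EX.byAgent, agent))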

I don't mean to get bogged down in detail or to make trouble, but 
creation of the concrete data models (in XSD, RDFS, OWL, etc.) from the 
abstract semantic core will depend on sorting all these issues out.  Is 
there a place where these discussions are happening, and is there some 
way that implementors can feed back into the decisions the technical 
infrastructure group makes on these issues?

Once these modeling recommendations have been formalized, I think it 
will be important to make them available to the community in a 
document.  One idea behind the semantic core is that it can grow over 
time.  As the community models new areas of biodiversity informatics, 
there has to be a way for new data models to be incorporated into the 
semantic core (after being blessed by some TDWG body).  This will be 
easier if the people creating new data models understand which modeling 
constructs are available to them.

>> So, again, within TAPIR itself and across protocols we need something
>> like the semantic hub you are proposing. And we are doing it right
>> now, but in a very primitive way. I am working on the implementation
>> of the BioMOBY protocol inside PyWrapper. I have created a mapping
>> file between TDWG schemas and the MOBY data type registry so that I
>> can resolve questions like:
>> -OK, if I have these TDWG concepts mapped, which MOBY services could 
>> I create?
>> -How can I create these MOBY types using TDWG concepts?
>>
> This is definitely where everyone is heading. The idea with the hub is 
> to start by clearly defining the semantic constructs we are going to 
> use (classes, properties, instances, literals, ranges) so that we can 
> be sure that we can represent the semantics in different 
> 'representations'. It is no good using some UML or OWL construct that 
> doesn't have a good representation in XML Schema or GML, for example.
>
> This plan will not answer all the questions on the first day. The 
> existing schemas will need to be mapped into Tonto so they are 
> represented in a uniform way, and this will take time - but months, 
> not years, I hope. It is not always clear in the existing schemas what 
> the classes and properties are, for example, so this is not an 
> automatic process but will need thinking through - and there are 
> issues around cardinality and multiple inheritance that will need to 
> be discussed.
>
Is there a more complete description of Tonto's functionality?  I'm 
still not clear about what kind of interface it will provide to the 
network of services (if any).  Is it correct to think of Tonto as a 
schema repository, a collaborative schema editor, or a tool that will 
be used by the technical infrastructure group to generate concrete 
implementations of the semantic core in various typing systems (XSD, 
RDFS, OWL, etc.)?  It may be lack of sleep, but I took the wording in 
section 5.2 to suggest all three at different points.
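
If it is the third, I imagine something like the following (a rough 
Python sketch; the class and method names are mine, not Tonto's actual 
design): a small neutral model of classes and properties from which 
renderings in each typing system are generated, which also makes the 
lossy spots visible.

    from dataclasses import dataclass, field

    # A deliberately tiny, neutral metamodel: classes with typed,
    # cardinality-constrained properties. All names are invented.
    @dataclass
    class Property:
        name: str
        range_: str          # a primitive type name like "string"
        min_occurs: int = 0
        max_occurs: int = 1  # -1 for unbounded

    @dataclass
    class ModelClass:
        name: str
        properties: list = field(default_factory=list)

        def to_xsd(self) -> str:
            """Render as an XML Schema complex type."""
            lines = [f'<xs:complexType name="{self.name}">',
                     "  <xs:sequence>"]
            for p in self.properties:
                max_ = "unbounded" if p.max_occurs == -1 else p.max_occurs
                lines.append(
                    f'    <xs:element name="{p.name}" type="xs:{p.range_}"'
                    f' minOccurs="{p.min_occurs}" maxOccurs="{max_}"/>')
            lines += ["  </xs:sequence>", "</xs:complexType>"]
            return "\n".join(lines)

        def to_rdfs(self) -> str:
            """Render the same class as RDFS declarations. Note that
            the cardinality constraints simply vanish here - exactly
            the kind of trade-off I am asking about above."""
            lines = [f"ex:{self.name} a rdfs:Class ."]
            for p in self.properties:
                lines.append(
                    f"ex:{p.name} a rdf:Property ; "
                    f"rdfs:domain ex:{self.name} ; "
                    f"rdfs:range xsd:{p.range_} .")
            return "\n".join(lines)

    unit = ModelClass("Unit", [Property("kindOfRecord", "string", 0, 1)])
    print(unit.to_xsd())
    print(unit.to_rdfs())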
>> As I said, this is now being implemented in a simple flat file that
>> will be available on the Internet for all data providers, but I am not
>> accessing it as a service and I have to do all the handling on the
>> client. The semantic hub you are proposing is exactly what we need and
>> want in order to do this more properly.
>>
> I am glad to hear that. I hope that we can generate the file you need 
> to do the mapping automatically in future.
>> So... summarizing, from our side I can imagine that we now need:
>>
>> -the semantic hub must expose the concepts in a way that we can use
>> them in the data providers' configuration tools to allow mapping
>> a database there.
> Just specify the way and we will make it do it.
>> -the semantic hub must expose the different views, or outputModels as
>> we call them in TAPIR, so that providers software can produce them.
>>
> Just specify them and we will write a script to produce them from Tonto.
>> There is a full list of other requirements I would love to see there,
>> which can be found in Dave Thau's work on a schema repository - do you
>> remember?
>> http://ww3.bgbm.org/schemarepowiki
>>
>> Especially, there were things like:
>> -Give me an XSLT that transforms ABCD 1.2 into 2.06
> Ah ha! This is where I have to say no!
>
> What I am proposing is that we have a central semantic model that can 
> be presented in a multitude of ways. This is *very* different from a 
> service that can automatically transform one existing schema into 
> another. That is far more ambitious and may not be possible in what 
> remains of human history. We should build the central semantic model 
> out of the existing schemas, but that is a manual process where 
> decisions will have to be made about what was meant by the different 
> constructs in the existing schemas. Think about the different ways 
> inclusion, adjacency, cardinality and type extension/restriction are 
> used within XML Schema documents and what they 'mean' in terms of GML 
> feature types or RDFS classes and properties. In general there just 
> isn't a mapping.
>
> When they built GML they started with the model of feature types and 
> properties and then decided how they would represent this using XML 
> and control it with XML Schema. We need to take a similar approach: 
> decide on what our modeling technique is, then decide how we will 
> represent the model in different technologies, including GML and 
> semantic web stuff.
>
I agree completely.

On the idea that we can generate GML feature types directly from the 
semantic hub, there is a tangential but related point that hasn't 
received much discussion.  It centers on how GUIDs will work within 
GML application schemas.  GML app schemas probably shouldn't contain 
many LSIDs, because GIS apps can't resolve them to get at the underlying 
data (and LSIDs aren't very informative when they appear as labels for 
features on maps).  So any direct translation of semantic hub classes 
into GML app schemas may be of limited value.  Instead, GML app schemas 
will probably have to be composed by selecting a set of properties with 
literal values, even if that means dereferencing LSIDs into their 
associated objects to do so.  But that's a discussion that can be held 
later; below is just a rough sketch of the kind of flattening I mean.
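
In this sketch every element name, namespace prefix, and LSID is 
hypothetical, and a fake resolver stands in for real LSID resolution:

    # Flatten a record into a GML-style feature, dereferencing any
    # LSID-valued properties into display strings so that GIS clients
    # see literals rather than opaque identifiers.

    def resolve_lsid(lsid: str) -> str:
        """Stand-in for a real LSID resolver. A real implementation
        would fetch the object's metadata and pick a human-readable
        label (e.g. a taxon name) out of it."""
        return "Puma concolor"  # canned answer for the example

    def to_gml_feature(record: dict) -> str:
        parts = ["<ex:Unit>"]  # "ex:" is an invented app-schema prefix
        for name, value in record.items():
            if isinstance(value, str) and value.startswith("urn:lsid:"):
                value = resolve_lsid(value)  # dereference to a literal
            parts.append(f"  <ex:{name}>{value}</ex:{name}>")
        parts.append("</ex:Unit>")
        return "\n".join(parts)

    print(to_gml_feature({
        "scientificName": "urn:lsid:example.org:taxon:456",
        "country": "US",
    }))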
> A crazy analogy would be to say it is like getting a machine to write 
> a story in different languages based on the same plot. This might be 
> achievable because the plot can be encoded in some machine-readable 
> way and the machine can just use rules to bolt together stored 
> sentences. Adding another language is just a matter of new output 
> rules and new stored sentences. It wouldn't produce great literature 
> but it would work.
>
> This is very different from asking a machine to read a book in one 
> language, understand the plot and print out the story in a completely 
> different language. For a start there may be things expressed in the 
> first language for which there is no direct equivalent. Just ask a 
> babel fish!
>
> I think this is the main way the Semantic Hub proposal differs from 
> the Schema Repository approach. I am trying to make something happen 
> by restricting the scope - if you remember the talk I gave on managing 
> scope and resources...
>
> I hope this is OK.
>> -Give me labels for this concept
> This should be easy. Internationalization won't be in the first 
> implementation but should be in subsequent versions.
>>
>> -Give me examples of values for this concept
> This should be easy but isn't in the highest priority.
This could be difficult when the value of a concept is not a single 
string value, but one of several different types of objects.
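
For instance (a purely hypothetical response shape, just to show the 
problem), "give me examples of values" is easy when every example is a 
string, but harder when a concept's values are objects of more than one 
type:

    # Easy case: every example value for the concept is a plain string.
    examples_easy = {
        "concept": "KindOfRecord",
        "examples": ["specimen", "observation", "living"],
    }

    # Hard case: the concept's values are structured objects of several
    # types. What should the service return here - whole serialized
    # objects, just their GUIDs, or a label plus a type name?
    examples_hard = {
        "concept": "Identification",
        "examples": [
            {"type": "TaxonName", "guid": "urn:lsid:example.org:taxon:456"},
            {"type": "InformalName", "label": "unknown felid"},
        ],
    }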


-Steve



