Hi Roger & Javier,
Roger, thanks for sending the link to the technical roadmap. Reading that document, followed by Javier's comments and your responses, raised a few questions and comments of my own:
-Have you considered something about controlled vocabularies in the semantic hub? Or maybe what I prefer to call managed controlled vocabularies. For example, in ABCD there is a concept called "KindOfRecord"; it is not a controlled vocabulary, just free text. It would be too difficult to provide a fixed list of terms for it, so it would be great if the list could somehow be created by the community. Say that I want to map my database to this field: I could get a list of proposed terms already in use, and if none of them satisfies me I can create my own. It is a little bit like tagging in a controlled way. I love the del.icio.us example: they propose tags, and most of the time I use them, and by doing this the data is much more accessible because the tags have not exploded. The opposite is what is happening now in ABCD: everybody uses a different term for the same thing and the unified data becomes useless.
There will be instances of classes in Tonto. I should have mentioned that. Chatting with Rob Gales about it, it seems a good way of doing controlled vocabularies. They will be extensible because Tonto will always be capable of change. Also, in some languages (OWL etc.) you could always define your own instances outside of Tonto, but it would depend on how we do the coding for the XML Schema based renderings of the semantics.
Populating drop-down menus with information out of Tonto when someone is mapping a data source is the ultimate goal - the dream!
First, a silly point. The term "controlled vocabulary" is used by some as a synonym for "ontology". Can we find or coin a new term for concepts with a range of enumerated values that isn't overloaded?
I agree with Rob that any "concept" with enumerated values ought to have those values represented as instances with assigned GUIDs. Some values will be blessed by the fact that they are stored in the core model but people will be free to invent their own without breaking existing software or mapping rules (at the expense of losing interoperability with software that only understands the approved values).
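To make the "managed vocabulary" idea concrete, here is a minimal Python sketch of the behaviour Javier describes: approved terms carry GUIDs from the core model, the tool suggests terms already in use (the drop-down case), and a provider can still coin a new one. Every name and GUID below is invented for illustration; this is not the real ABCD term list.

    # Minimal sketch of a managed controlled vocabulary: approved terms
    # carry GUIDs blessed by the core model, but providers may coin new ones.
    # All names and GUIDs here are invented for illustration.
    import uuid

    class ManagedVocabulary:
        def __init__(self, approved):
            # approved: mapping of term label -> GUID stored in the core model
            self.terms = dict(approved)

        def suggest(self, prefix):
            """Propose terms already in use, like del.icio.us tag suggestions."""
            return [t for t in self.terms if t.lower().startswith(prefix.lower())]

        def add_term(self, label):
            """Coin a new term with its own GUID; software that only understands
            the approved values will simply ignore it."""
            guid = "urn:uuid:" + str(uuid.uuid4())
            self.terms[label] = guid
            return guid

    # Illustrative "KindOfRecord" values - not the real ABCD list.
    kind_of_record = ManagedVocabulary({
        "PreservedSpecimen": "urn:uuid:11111111-1111-1111-1111-111111111111",
        "Observation": "urn:uuid:22222222-2222-2222-2222-222222222222",
    })
    print(kind_of_record.suggest("pre"))      # ['PreservedSpecimen']
    kind_of_record.add_term("TissueSample")   # extension without breaking others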
-In the implementation section you say something like "Data providers must map their data to these views", referring to views from the semantic hub. This is actually what we are trying to avoid. TAPIR was created from the beginning with the vision of data providers mapping their databases once and being accessible through different views that are explicitly declared in the request. We have since renamed these "outputModels". On the other hand, you know that WASABI and PyWrapper are now becoming multi-protocol. That means that we want providers to map their databases once and make the data available in different protocols.
The plan is that data mapping only has to be done to one of the views that Tonto has onto its internal semantics; the other views/representations could then be used as outputModels (or custom output models could be created by clients etc.). The goal is definitely to map only once - a single set of semantics - but then represent it in multiple ways. Tonto could provide a view of the semantics that a graphical tool could then pick up to help someone build a mapping file - the dream...
This ties in with a few questions I have about the semantic hub and Tonto.
First, since some of us are also involved in working to create data models that one day may be added to the semantic hub, is there a defined list of the common subset of modeling constructs (between UML, XSD, OWL, and RDFS) and suggestions about how to implement them? Are there discussions about what constructs will be dropped and the trade-offs of different implementations?
For example, it could be argued that N-ary associations could be implemented in RDFS and OWL (and perhaps in XML Schemas that can describe directed labeled graphs through the use of GUIDs), but the implementation the research community recommends for N-ary associations in RDF-based systems is reification. As implementors of systems that work with RDF-based data, we feel that reification is not the way to go, and that it may be better to drop support for N-ary associations than to put in place a "flawed work-around" like reification. Off the top of my head, there are also issues with modeling arbitrary cardinality, cardinality on one or both sides of an association, primitive type mapping and data type promotion, and aggregation and/or composition (sequences and bags mean anonymous nodes, which don't play nicely in a system that uses GUIDs to name resources), as well as the question of how to implement many of these modeling constructs in XSD, which was designed to describe trees (not graphs) and has no built-in notion of global identifiers.
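To make the reification worry concrete, here is a minimal sketch (assuming the third-party rdflib library; the namespace and all term names are invented) of the alternative pattern we would prefer: modelling an N-ary association as a first-class, GUID-named node rather than as a reified statement.

    # Sketch of an N-ary association (specimen, determiner, taxon) modelled
    # as an intermediate node with its own GUID, instead of RDF reification.
    # The namespace and all term names are invented for illustration.
    from rdflib import Graph, Namespace, Literal, RDF

    EX = Namespace("http://example.org/tdwg-sketch#")
    g = Graph()

    det = EX["determination-guid-123"]   # GUID-named, unlike a blank node
    g.add((det, RDF.type, EX.Determination))
    g.add((det, EX.specimen, EX["specimen-guid-456"]))
    g.add((det, EX.determinedBy, EX["agent-guid-789"]))
    g.add((det, EX.toTaxon, Literal("Puma concolor")))

    print(g.serialize(format="turtle"))

Because the association node has its own GUID, it can be referenced, extended with further properties, and exchanged between systems without the interoperability problems reified statements cause.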
I don't mean to get bogged down in detail or to make trouble, but creation of the concrete data models (in XSD, RDFS, OWL, etc.) from the abstract semantic core will depend on sorting all these issues out. Is there a place where these discussions are happening and is there some way that implementors can feed back into the decisions made on these issues by the technical infrastructure group?
Once they have been formalized, I think it may be important that these modeling recommendations be made available to the community in a document. One idea behind the semantic core is that it can grow over time. As the community models new areas of biodiversity informatics there has to be a way for the new data models to be incorporated into the semantic core (after being blessed by some TDWG body). This will be easier if the people creating new data models understand which modeling constructs they have available to them.
So, again, within TAPIR itself and within protocols we need something like the semantic hub you are proposing. And we are doing it right now, but in a very primitive way. I am working on the implementation of the BioMOBY protocol inside PyWrapper. I have created a mapping file between the TDWG schemas and the MOBY data types registry so that I can resolve questions like the following (sketched just below):
-OK, if I have these TDWG concepts mapped, which MOBY services could I create?
-How can I create these MOBY types using TDWG concepts?
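Roughly, the flat file and the two lookups it has to answer might look like this in Python (the concept and type names are placeholders, not real registry entries):

    # Rough sketch of a flat TDWG-to-MOBY mapping file and the two lookups
    # it must answer. Concept and type names are placeholders.
    TDWG_TO_MOBY = {
        "DarwinCore/ScientificName": ["MOBY:TaxonName"],
        "DarwinCore/DecimalLatitude": ["MOBY:GeoCoordinate"],
        "DarwinCore/DecimalLongitude": ["MOBY:GeoCoordinate"],
    }

    def moby_services_possible(mapped_concepts):
        """If I have these TDWG concepts mapped, which MOBY types (and hence
        services) could I offer?"""
        types = set()
        for concept in mapped_concepts:
            types.update(TDWG_TO_MOBY.get(concept, []))
        return types

    def concepts_needed(moby_type):
        """How can I create this MOBY type using TDWG concepts?"""
        return [c for c, types in TDWG_TO_MOBY.items() if moby_type in types]

    print(moby_services_possible(["DarwinCore/ScientificName"]))
    print(concepts_needed("MOBY:GeoCoordinate"))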
This is definitely where everyone is heading. The idea with the hub is to start by clearly defining the semantic constructs we are going to use (classes, properties, instances, literals, ranges) so that we can be sure that we can represent the semantics in different 'representations'. It is no good using some UML or OWL construct that doesn't have a good representation in XML Schema or GML for example.
This plan will not answer all the questions on the first day. The existing schemas will need to be mapped into Tonto so they are represented in a uniform way, and this will take time - but months, not years, I hope. It is not always clear in the existing schemas what the classes and properties are, for example, so this is not an automatic process but will need thinking through - and there are issues around cardinality and multiple inheritance that will need to be discussed.
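To pin down what that restricted construct set might look like, here is a hypothetical Python sketch (every class and field name is invented) of classes, properties with ranges and explicit cardinality, and instances - the idea being that anything expressible with only these should render cleanly into XML Schema, RDFS or OWL:

    # Hypothetical sketch of a restricted construct set for the hub:
    # classes, properties (with ranges and explicit cardinality), instances
    # and literals - nothing a target representation cannot express.
    from dataclasses import dataclass, field

    @dataclass
    class HubClass:
        guid: str
        label: str
        parents: list = field(default_factory=list)  # inheritance kept simple

    @dataclass
    class HubProperty:
        guid: str
        label: str
        domain: HubClass
        range: object      # a HubClass, or a literal type such as str
        min_card: int = 0
        max_card: int = 1  # explicit, so every rendering can honour it

    @dataclass
    class HubInstance:
        guid: str
        of_class: HubClass
        values: dict = field(default_factory=dict)  # property guid -> value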
Is there a more complete description of Tonto functionality? I'm still not clear about what kind of interface it will provide to the network of services (if any). Is it correct to think of Tonto as a schema repository, a collaborative schema editor, or a tool that will be used by technical infrastructure group to generate concrete implementations of the semantic core in various typing systems (XSD, RDFS, OWL, etc)? It may be lack of sleep, but I took the wording in section 5.2 to suggest all three at different points.
As I said, this is now being implemented in a simple flat file that will be available on the Internet for all data providers, but I am not accessing it as a service and I have to do all the handling on the client. The semantic hub you are proposing is exactly what we need and want in order to do this more properly.
I am glad to hear that. I hope that we can generate the file you need to do the mapping automatically in future.
So... summarizing, from our side I can now see that we need:
-the semantic hub must expose the concepts in a way that we can use them in the data providers' configuration tools to allow mapping a database there.
Just specify the way and we will make Tonto do it.
-the semantic hub must expose the different views, or outputModels as we call them in TAPIR, so that provider software can produce them.
Just specify them and we will write a script to produce them from Tonto - a rough sketch of what both calls might look like follows.
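Here is a hypothetical sketch of both requirements against an imagined hub endpoint; the URL, the JSON shape and the generated element names are all invented:

    # Hypothetical sketch of the two requirements above. The URL, the JSON
    # shape and the generated element names are invented for illustration.
    import json
    from urllib.request import urlopen

    HUB = "http://example.org/semantic-hub"  # placeholder address

    def concepts_for_mapping_tool():
        """Requirement 1: fetch the concept list so a provider's
        configuration tool can offer it when mapping a database."""
        with urlopen(HUB + "/concepts.json") as resp:
            return json.load(resp)  # e.g. a list of {"guid": ..., "label": ...}

    def render_output_model(view_name):
        """Requirement 2: turn one of the hub's views into something a
        TAPIR outputModel could use (here a stub XML Schema)."""
        with urlopen(HUB + "/views/" + view_name + ".json") as resp:
            view = json.load(resp)
        parts = ['<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">']
        for concept in view["concepts"]:
            parts.append('  <xs:element name="%s" type="xs:string"/>'
                         % concept["label"])
        parts.append('</xs:schema>')
        return "\n".join(parts)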
There is a full list of other requirements I would love to see there; it can be found in Dave Thau's work on a schema repository - do you remember? http://ww3.bgbm.org/schemarepowiki
Especially, there were things like:
-Give an XSLT that transforms ABCD 1.2 into 2.06
Ah ha! This is where I have to say no!
What I am proposing is that we have a central semantic model that can be presented in a multitude of ways. This is *very* different from a service that can automatically transform one existing schema into another. That is far more ambitious and may not be possible in what remains of human history. We should build the central semantic model out of the existing schemas, but that is a manual process where decisions will have to be made about what was meant by the different constructs in the existing schemas. Think about the different ways inclusion, adjacency, cardinality and type extension/restriction are used within XML Schema documents and what they 'mean' in terms of GML feature types or RDFS classes and properties. In general there just isn't a mapping.
When they built GML they started with the model of feature types and properties and then decided how they would represent this using XML and control it with XML Schema. We need to take a similar approach: decide what our modeling technique is, then decide how we will represent the model in different technologies, including GML and semantic web stuff.
I agree completely.
On the idea that we can generate GML feature types directly from the semantic hub, there is a tangential but related point that hasn't received much discussion. It centers around how GUIDs will work within GML application schemas. GML app schemas probably shouldn't contain many LSIDs because GIS apps can't resolve them to get at the underlying data (and LSIDs aren't very informative when they appear as labels for features on maps). So any direct translation of semantic hub classes into GML app schemas may be of limited value. Instead, GML app schemas will probably have to be composed by selecting a set of properties with literal values, even if that means dereferencing LSIDs into their associated objects to do so. But that's a discussion that can be held later.
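To illustrate that composition step (everything here is hypothetical, including the stand-in resolver), the idea is that LSID-valued properties get dereferenced and replaced by literal labels before a GML feature is emitted:

    # Hypothetical sketch: flatten LSID-valued properties into literals
    # before emitting a GML feature, since GIS clients cannot resolve LSIDs.
    def resolve_lsid(lsid):
        """Stand-in for a real LSID resolver; returns the object's properties."""
        fake_store = {
            "urn:lsid:example.org:names:42": {"label": "Puma concolor"},
        }
        return fake_store.get(lsid, {"label": lsid})  # fall back to raw LSID

    def flatten_for_gml(feature):
        """Replace LSID references with literal labels fit for map display."""
        flat = {}
        for prop, value in feature.items():
            if isinstance(value, str) and value.startswith("urn:lsid:"):
                flat[prop] = resolve_lsid(value)["label"]
            else:
                flat[prop] = value
        return flat

    print(flatten_for_gml({"name": "urn:lsid:example.org:names:42",
                           "lat": -12.5}))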
A crazy analogy would be to say it is like getting a machine to write a story in different languages based on the same plot. This might be achievable because the plot can be encoded in some machine readable way and the machine can just use rules to bolt together stored sentences. Adding another language is just a matter of new output rules and new stored sentences. It wouldn't produce great literature but it would work.
This is very different from asking a machine to read a book in one language, understand the plot and print out the story in a completely different language. For a start, there may be things expressed in the first language for which there is no direct equivalent. Just ask a babel fish!
I think this is the main way the Semantic Hub proposal differs from the Schema Repository approach. I am trying to make something happen by restricting the scope - if you remember the talk I gave on managing scope and resources...
I hope this is OK.
-Give me labels for this concept
This should be easy. Internationalization won't be in the first implementation but should be in subsequent versions.
-Give me examples of values for this concept
This should be easy but isn't the highest priority.
This could be difficult when the value of a concept is not a single string value, but one of several different types of objects - see the sketch below.
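As a last hypothetical sketch, these two calls might look like the following, with each example value carrying its kind so the object-valued case isn't forced into a bare string (all GUIDs and labels invented; the 'es' labels anticipate the internationalization work):

    # Hypothetical sketch of the last two requirements. GUIDs and labels
    # are invented for illustration.
    def labels_for(concept_guid, lang="en"):
        """'Give me labels for this concept' - internationalised later on."""
        labels = {"urn:uuid:example-concept": {"en": "Kind of record",
                                               "es": "Tipo de registro"}}
        return labels.get(concept_guid, {}).get(lang)

    def example_values_for(concept_guid):
        """'Give me examples of values' - each example carries its kind,
        since a value may be an object rather than a single string."""
        examples = {
            "urn:uuid:example-concept": [
                ("literal", "PreservedSpecimen"),
                ("instance", "urn:uuid:example-instance"),
            ],
        }
        return examples.get(concept_guid, [])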
-Steve