Hi Roger & Javier,
Roger, thanks for sending the link to the technical roadmap. Reading that document, followed by Javier's comments and your responses, raised a few questions and comments of my own:
-Have you considered something about controlled vocabularies in the semantic hub? Or maybe what I prefer to call managed controlled vocabularies. For example, in ABCD there is a concept called "KindOfRecord"; it is not a controlled vocabulary, just free text. It would be too difficult to provide a fixed list of terms for it, so it would be great if the list could somehow be created by the community. Say that I want to map my database to this field: I could get a list of proposed terms already in use, and if none of them satisfies me I can create my own. It is a little bit like tagging in a controlled way. I love the del.icio.us example: they propose tags, and most of the time I use them, and by doing this the data is much more accessible because the tags have not exploded. The opposite is what is happening now in ABCD: everybody uses a different term for the same thing and the unified data becomes useless.
There will be instances of classes in Tonto. I should have mentioned that. Chatting with Rob Gales about it, it seems a good way of doing controlled vocabularies. They will be extensible because Tonto will always be capable of change. Also, in some languages (OWL etc.) you could always define your own instances outside of Tonto, but it would depend on how we do the coding for the XML Schema based renderings of the semantics.
Populating drop-down menus with information out of Tonto when someone is mapping a data source is the ultimate goal - the dream!
First, a silly point. The term "controlled vocabulary" is used by some as a synonym for "ontology". Can we find or coin a new term for concepts with a range of enumerated values that isn't overloaded?
I agree with Rob that any "concept" with enumerated values ought to have those values represented as instances with assigned GUIDs. Some values will be blessed by the fact that they are stored in the core model but people will be free to invent their own without breaking existing software or mapping rules (at the expense of losing interoperability with software that only understands the approved values).
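To make the "managed vocabulary" idea concrete, here is a minimal Python sketch of the behaviour Javier describes: approved terms carry GUIDs from the core model, the tool suggests terms already in use (the drop-down case), and a provider can still coin a new one. Every name and GUID below is invented for illustration; this is not the real ABCD term list.

    # Minimal sketch of a managed controlled vocabulary: approved terms
    # carry GUIDs blessed by the core model, but providers may coin new ones.
    # All names and GUIDs here are invented for illustration.
    import uuid

    class ManagedVocabulary:
        def __init__(self, approved):
            # approved: mapping of term label -> GUID stored in the core model
            self.terms = dict(approved)

        def suggest(self, prefix):
            """Propose terms already in use, like del.icio.us tag suggestions."""
            return [t for t in self.terms if t.lower().startswith(prefix.lower())]

        def add_term(self, label):
            """Coin a new term with its own GUID; software that only understands
            the approved values will simply ignore it."""
            guid = "urn:uuid:" + str(uuid.uuid4())
            self.terms[label] = guid
            return guid

    # Illustrative "KindOfRecord" values - not the real ABCD list.
    kind_of_record = ManagedVocabulary({
        "PreservedSpecimen": "urn:uuid:11111111-1111-1111-1111-111111111111",
        "Observation": "urn:uuid:22222222-2222-2222-2222-222222222222",
    })
    print(kind_of_record.suggest("pre"))      # ['PreservedSpecimen']
    kind_of_record.add_term("TissueSample")   # extension without breaking others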
-In the implementation section you say something like "Data providers must map their data to these views", referring to views from the semantic hub. This is actually what we are trying to avoid. TAPIR was created from the beginning with the vision of data providers mapping their databases once and being accessible through different views that are explicitly declared in the request. We have since renamed these "outputModels". On the other hand, you know that WASABI and PyWrapper are now becoming multi-protocol. That means that we want providers to map their databases once and make the data available in different protocols.
The plan is that data mapping only has to be done to one of the views that Tonto has onto its internal semantics; the other views/representations could then be used as outputModels (or custom output models could be created by clients etc.). The goal is definitely to map only once - a single set of semantics - but then represent it in multiple ways. Tonto could provide a view of the semantics that a graphical tool could then pick up to help someone build a mapping file - the dream...
This ties in with a few questions I have about the semantic hub and Tonto.
First, since some of us are also involved in working to create data models that one day may be added to the semantic hub, is there a defined list of the common subset of modeling constructs (between UML, XSD, OWL, and RDFS) and suggestions about how to implement them? Are there discussions about what constructs will be dropped and the trade-offs of different implementations?
For example, it could be argued that N-ary associations could be implemented in RDFS and OWL (and perhaps in XML Schemas that can describe directed labeled graphs through the use of GUIDs), but the implementation the research community recommends for N-ary associations in RDF-based systems is reification. As implementors of systems that work with RDF-based data, we feel that reification is not the way to go, and that it may be better to drop support for N-ary associations than to put in place a "flawed work-around" like reification. Off the top of my head, there are also issues with modeling arbitrary cardinality, cardinality on one or both sides of an association, primitive type mapping and data type promotion, and aggregation and/or composition (sequences and bags mean anonymous nodes, which don't play nicely in a system that uses GUIDs to name resources), as well as the question of how to implement many of these modeling constructs in XSD, which was designed to describe trees (not graphs) and has no built-in notion of global identifiers.
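To make the reification worry concrete, here is a minimal sketch (assuming the third-party rdflib library; the namespace and all term names are invented) of the alternative pattern we would prefer: modelling an N-ary association as a first-class, GUID-named node rather than as a reified statement.

    # Sketch of an N-ary association (specimen, determiner, taxon) modelled
    # as an intermediate node with its own GUID, instead of RDF reification.
    # The namespace and all term names are invented for illustration.
    from rdflib import Graph, Namespace, Literal, RDF

    EX = Namespace("http://example.org/tdwg-sketch#")
    g = Graph()

    det = EX["determination-guid-123"]   # GUID-named, unlike a blank node
    g.add((det, RDF.type, EX.Determination))
    g.add((det, EX.specimen, EX["specimen-guid-456"]))
    g.add((det, EX.determinedBy, EX["agent-guid-789"]))
    g.add((det, EX.toTaxon, Literal("Puma concolor")))

    print(g.serialize(format="turtle"))

Because the association node has its own GUID, it can be referenced, extended with further properties, and exchanged between systems without the interoperability problems reified statements cause.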
I don't mean to get bogged down in detail or to make trouble, but creation of the concrete data models (in XSD, RDFS, OWL, etc.) from the abstract semantic core will depend on sorting all these issues out. Is there a place where these discussions are happening and is there some way that implementors can feed back into the decisions made on these issues by the technical infrastructure group?
Once they have been formalized, I think it may be important that these modeling recommendations be made available to the community in a document. One idea behind the semantic core is that it can grow over time. As the community models new areas of biodiversity informatics there has to be a way for the new data models to be incorporated into the semantic core (after being blessed by some TDWG body). This will be easier if the people creating new data models understand which modeling constructs they have available to them.
So, again, within TAPIR itself and within protocols we need something like the semantic hub you are proposing. And we are doing it right now, but in a very primitive way. I am working on the implementation of the BioMOBY protocol inside PyWrapper. I have created a mapping file between the TDWG schemas and the MOBY data types registry so that I can resolve questions like the following (sketched just below):
-OK, if I have these TDWG concepts mapped, which MOBY services could I create?
-How can I create these MOBY types using TDWG concepts?
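Roughly, the flat file and the two lookups it has to answer might look like this in Python (the concept and type names are placeholders, not real registry entries):

    # Rough sketch of a flat TDWG-to-MOBY mapping file and the two lookups
    # it must answer. Concept and type names are placeholders.
    TDWG_TO_MOBY = {
        "DarwinCore/ScientificName": ["MOBY:TaxonName"],
        "DarwinCore/DecimalLatitude": ["MOBY:GeoCoordinate"],
        "DarwinCore/DecimalLongitude": ["MOBY:GeoCoordinate"],
    }

    def moby_services_possible(mapped_concepts):
        """If I have these TDWG concepts mapped, which MOBY types (and hence
        services) could I offer?"""
        types = set()
        for concept in mapped_concepts:
            types.update(TDWG_TO_MOBY.get(concept, []))
        return types

    def concepts_needed(moby_type):
        """How can I create this MOBY type using TDWG concepts?"""
        return [c for c, types in TDWG_TO_MOBY.items() if moby_type in types]

    print(moby_services_possible(["DarwinCore/ScientificName"]))
    print(concepts_needed("MOBY:GeoCoordinate"))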
This is definitely where everyone is heading. The idea with the hub is to start by clearly defining the semantic constructs we are going to use (classes, properties, instances, literals, ranges) so that we can be sure that we can represent the semantics in different 'representations'. It is no good using some UML or OWL construct that doesn't have a good representation in XML Schema or GML for example.
This plan will not answer all the questions on the first day. The existing schemas will need to be mapped into Tonto so they are represented in a uniform way, and this will take time - but months, not years, I hope. It is not always clear in the existing schemas what the classes and properties are, for example, so this is not an automatic process but will need thinking through - and there are issues around cardinality and multiple inheritance that will need to be discussed.
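To pin down what that restricted construct set might look like, here is a hypothetical Python sketch (every class and field name is invented) of classes, properties with ranges and explicit cardinality, and instances - the idea being that anything expressible with only these should render cleanly into XML Schema, RDFS or OWL:

    # Hypothetical sketch of a restricted construct set for the hub:
    # classes, properties (with ranges and explicit cardinality), instances
    # and literals - nothing a target representation cannot express.
    from dataclasses import dataclass, field

    @dataclass
    class HubClass:
        guid: str
        label: str
        parents: list = field(default_factory=list)  # inheritance kept simple

    @dataclass
    class HubProperty:
        guid: str
        label: str
        domain: HubClass
        range: object      # a HubClass, or a literal type such as str
        min_card: int = 0
        max_card: int = 1  # explicit, so every rendering can honour it

    @dataclass
    class HubInstance:
        guid: str
        of_class: HubClass
        values: dict = field(default_factory=dict)  # property guid -> value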
Is there a more complete description of Tonto functionality? I'm still not clear about what kind of interface it will provide to the network of services (if any). Is it correct to think of Tonto as a schema repository, a collaborative schema editor, or a tool that will be used by technical infrastructure group to generate concrete implementations of the semantic core in various typing systems (XSD, RDFS, OWL, etc)? It may be lack of sleep, but I took the wording in section 5.2 to suggest all three at different points.
As I said, this is now being implemented in a simple flat file that will be available on the Internet for all data providers, but I am not accessing it as a service and I have to do all the handling on the client. The semantic hub you are proposing is exactly what we need and want in order to do this more properly.
I am glad to hear that. I hope that we can generate the file you need to do the mapping automatically in future.
So... summarizing, from our side I can now see that we need:
-the semantic hub must expose the concepts in a way that we can use them in the data providers' configuration tools to allow mapping a database there.
Just specify the way and we will make Tonto do it.
-the semantic hub must expose the different views, or outputModels as we call them in TAPIR, so that provider software can produce them.
Just specify them and we will write a script to produce them from Tonto - a rough sketch of what both calls might look like follows.
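Here is a hypothetical sketch of both requirements against an imagined hub endpoint; the URL, the JSON shape and the generated element names are all invented:

    # Hypothetical sketch of the two requirements above. The URL, the JSON
    # shape and the generated element names are invented for illustration.
    import json
    from urllib.request import urlopen

    HUB = "http://example.org/semantic-hub"  # placeholder address

    def concepts_for_mapping_tool():
        """Requirement 1: fetch the concept list so a provider's
        configuration tool can offer it when mapping a database."""
        with urlopen(HUB + "/concepts.json") as resp:
            return json.load(resp)  # e.g. a list of {"guid": ..., "label": ...}

    def render_output_model(view_name):
        """Requirement 2: turn one of the hub's views into something a
        TAPIR outputModel could use (here a stub XML Schema)."""
        with urlopen(HUB + "/views/" + view_name + ".json") as resp:
            view = json.load(resp)
        parts = ['<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">']
        for concept in view["concepts"]:
            parts.append('  <xs:element name="%s" type="xs:string"/>'
                         % concept["label"])
        parts.append('</xs:schema>')
        return "\n".join(parts)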
There is a full list of other requirements I would love to see there; it can be found in Dave Thau's work on a schema repository - do you remember? http://ww3.bgbm.org/schemarepowiki
Especially, there were things like:
-Give an XSLT that transforms ABCD 1.2 into 2.06
Ah ha! This is where I have to say no!
What I am proposing is that we have a central semantic model that can be presented in a multitude of ways. This is *very* different from a service that can automatically transform one existing schema into another. That is far more ambitious and may not be possible in what remains of human history. We should build the central semantic model out of the existing schemas, but that is a manual process where decisions will have to be made about what was meant by the different constructs in the existing schemas. Think about the different ways inclusion, adjacency, cardinality and type extension/restriction are used within XML Schema documents and what they 'mean' in terms of GML feature types or RDFS classes and properties. In general there just isn't a mapping.
When they built GML they started with the model of feature types and properties and then decided how they would represent this using XML and control it with XML Schema. We need to take a similar approach: decide what our modeling technique is, then decide how we will represent the model in different technologies, including GML and semantic web stuff.
I agree completely.
On the idea that we can generate GML feature types directly from the semantic hub, there is a tangential but related point that hasn't received much discussion. It centers around how GUIDs will work within GML application schemas. GML app schemas probably shouldn't contain many LSIDs because GIS apps can't resolve them to get at the underlying data (and LSIDs aren't very informative when they appear as labels for features on maps). So any direct translation of semantic hub classes into GML app schemas may be of limited value. Instead, GML app schemas will probably have to be composed by selecting a set of properties with literal values, even if that means dereferencing LSIDs into their associated objects to do so. But that's a discussion that can be held later.
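To illustrate that composition step (everything here is hypothetical, including the stand-in resolver), the idea is that LSID-valued properties get dereferenced and replaced by literal labels before a GML feature is emitted:

    # Hypothetical sketch: flatten LSID-valued properties into literals
    # before emitting a GML feature, since GIS clients cannot resolve LSIDs.
    def resolve_lsid(lsid):
        """Stand-in for a real LSID resolver; returns the object's properties."""
        fake_store = {
            "urn:lsid:example.org:names:42": {"label": "Puma concolor"},
        }
        return fake_store.get(lsid, {"label": lsid})  # fall back to raw LSID

    def flatten_for_gml(feature):
        """Replace LSID references with literal labels fit for map display."""
        flat = {}
        for prop, value in feature.items():
            if isinstance(value, str) and value.startswith("urn:lsid:"):
                flat[prop] = resolve_lsid(value)["label"]
            else:
                flat[prop] = value
        return flat

    print(flatten_for_gml({"name": "urn:lsid:example.org:names:42",
                           "lat": -12.5}))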
A crazy analogy would be to say it is like getting a machine to write a story in different languages based on the same plot. This might be achievable because the plot can be encoded in some machine readable way and the machine can just use rules to bolt together stored sentences. Adding another language is just a matter of new output rules and new stored sentences. It wouldn't produce great literature but it would work.
This is very different from asking a machine to read a book in one language, understand the plot and print out the story in a completely different language. For a start, there may be things expressed in the first language for which there is no direct equivalent. Just ask a babel fish!
I think this is the main way the Semantic Hub proposal differs from the Schema Repository approach. I am trying to make something happen by restricting the scope - if you remember the talk I gave on managing scope and resources...
I hope this is OK.
-Give me labels for this concept
This should be easy. Internationalization won't be in the first implementation but should be in subsequent versions.
-Give me examples of values for this concept
This should be easy but isn't the highest priority.
This could be difficult when the value of a concept is not a single string value, but one of several different types of objects - see the sketch below.
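As a last hypothetical sketch, these two calls might look like the following, with each example value carrying its kind so the object-valued case isn't forced into a bare string (all GUIDs and labels invented; the 'es' labels anticipate the internationalization work):

    # Hypothetical sketch of the last two requirements. GUIDs and labels
    # are invented for illustration.
    def labels_for(concept_guid, lang="en"):
        """'Give me labels for this concept' - internationalised later on."""
        labels = {"urn:uuid:example-concept": {"en": "Kind of record",
                                               "es": "Tipo de registro"}}
        return labels.get(concept_guid, {}).get(lang)

    def example_values_for(concept_guid):
        """'Give me examples of values' - each example carries its kind,
        since a value may be an object rather than a single string."""
        examples = {
            "urn:uuid:example-concept": [
                ("literal", "PreservedSpecimen"),
                ("instance", "urn:uuid:example-instance"),
            ],
        }
        return examples.get(concept_guid, [])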
-Steve