I agree with Bob that our data model specifications should be decoupled from possible representation schemes. In my opinion, these specifications should take the form of UML static structures with accompanying explanatory documents. The use of BNF grammars is a good idea, but I worry that they might become difficult to manage as they grow and that non-CS people in the community would find them hard to understand.
I also think that the technical architecture group should not be concerned with the data models themselves. Instead we have to worry about how to map existing data sets into the shared models, how to link instances of different models together, how to locate one or more data objects that meet certain criteria, how to merge collections of data objects from one or more models, how to visualize trees or graphs of data objects, how to serialize and deserialize data objects into different representations, etc. In short, we have to design a network of services that allow us to work with data objects and collections of data objects in a fairly generic fashion and leave the actual creation of the models up to the subject matter experts (though we might supply a bit of KR advice).
These services and processes will also require documentation and might be specified with the same combination of UML (sequence or activity diagrams) and explanatory documentation. I think all of us agree that these ought to be designed in a language-independent manner and be built upon a small stack of existing technologies like HTTP and XML for message transport.
At some point, though, we have to agree on a representation format. If we're talking about building a set of distributed services that will allow us to locate, acquire, and work with biodiversity data, then I think we need to propose an architecture that has a few fixed points, one of which should be the representation format. I for one don't want to have to design tools that can ingest both XML Schema instances and RDF described by ontologies.
The representation format we select ought to be flexible enough to accommodate the data models described by the subject matter experts. It should also minimize the burden on the software engineers and developers who have to design and maintain the processes, tools, and services that satisfy the above use cases (mapping, serializing/deserializing, searching, merging, visualizing, etc.). Ideally the representation format should allow us to choose from a collection of existing tools and frameworks when implementing (because no one has the time or money to create all this from scratch). This means that we need to evaluate each candidate representation format with the above use cases in mind. Every representation scheme (RDF, XML, Java classes, etc.) has its strengths and weaknesses, and talking through each use case with respect to implementing it over each of the representation formats will allow us to better understand the trade-offs of selecting one format over another.
Here's an example of the types of discussions I'd like to see from TAG:
Portions of the TCS data model describe specimens, publications, and other things that are not names or concepts. In a perfect world, TCS would not define its own data model for specimens but would instead use an existing model designed by the curators of collections (perhaps with Darwin Core as a starting point). The same is true for publications. Instances of TCS should then use GUIDs to point to instances of Specimen and Publication.
Now, imagine a hypothetical system that provides for visualization of TCS: a Taxon Concept Browser that allows researchers to search for and view TCS instances in order to select a set of concepts to use in their own work. At some point, this system will need an in-memory graph of data objects of different types, including TCS, Specimen, and Publication. This graph could be constructed in a variety of ways, but the most likely method will start by parsing a TCS document that contains several taxon concept instances. Each instance will be examined for references to other objects named by LSID. Each of these LSIDs will be resolved (with care so as not to create cycles), resulting in a chunk of serialized data which will be turned into an in-memory instance and inserted into the graph. This is at heart the merge case I was talking about above.
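To make this concrete, the graph composer might encounter a chunk of TCS that looks roughly like the following (the element names, namespace, and LSIDs are invented for illustration and are not taken from the actual TCS schema):

    <tcs:TaxonConcept xmlns:tcs="http://example.org/tcs"
                      id="urn:lsid:example.org:concepts:1234">
      <tcs:Name>Aus bus L.</tcs:Name>
      <!-- a pointer, not an embedded object: resolving this LSID yields the Specimen -->
      <tcs:VoucherSpecimen>urn:lsid:example.org:specimens:5678</tcs:VoucherSpecimen>
    </tcs:TaxonConcept>

The composer would extract the specimen LSID, resolve it, deserialize the result, and splice the resulting Specimen object into the graph alongside the concept.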
So, to satisfy this case with XML Schema, TCS's specimen element might be implemented with a simpleType that has a restriction base of string and a pattern match constraint designed to allow validation of LSID URNs. This effectively decouples Specimen from TCS and allows instances of them to refer to each other without having to import each other's schema. The Taxon Concept Browser's instance graph composer would know about the schemas for TCS and Specimen. It would take the resolved chunk of XML from specimen LSIDs and attempt to deserialize (unmarshal) it before merging it into the graph that will be visualized.
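A minimal sketch of what I mean (the type name and pattern are mine, and a production-quality pattern would need more care):

    <xs:simpleType name="LsidType">
      <xs:restriction base="xs:string">
        <!-- urn:lsid:authority:namespace:object, with an optional :revision -->
        <xs:pattern value="urn:lsid:[^:]+:[^:]+:[^:]+(:[^:]+)?"/>
      </xs:restriction>
    </xs:simpleType>

    <xs:element name="VoucherSpecimen" type="LsidType"/>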
At first glance this appears to work fairly well; however, there are a few issues with this design. First, it precludes the direct embedding of Specimen instances in TCS instances. There are many reasons why one might want to do this. One reason is to avoid unnecessary LSID resolution calls (which add latency) in the case where specimen and taxon concept objects are coming from the same server. Another reason for embedding specimen instances in taxon concept instances is to make things easier on a user who might want to download an entire taxon concept graph to their local machine for processing by a desktop application; without embedding they may be forced to download several different files. We could fix this by changing the definition of the specimen element in TCS so it can be either a subtree or an LSID, but then the TCS schema would have to import Specimen and vice versa. This is of course impossible, so we'd have to do one of three things: make the TCS specimen element xsd:any, derive both TCS and Specimen from the same base XML Schema which minimally defines an LSID element, or design a complicated scheme for embedding both instances and schema (akin to how WFS works) in a single instance document. Each of these has its own drawbacks: xsd:any makes it difficult or impossible to use most XML-to-Object binding tools; schema inheritance is difficult, can be accomplished only with a social agreement between everyone in the community, and allows for only a weak form of validation; and the embedded-schema approach is burdensome to developers.
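For what it's worth, the "subtree or LSID" version of the element would look something like this, which is where the circular import comes from (again, all names invented):

    <!-- TCS would now have to import the Specimen schema... -->
    <xs:element name="VoucherSpecimen">
      <xs:complexType>
        <xs:choice>
          <xs:element name="SpecimenLsid" type="LsidType"/>
          <xs:element ref="spec:Specimen"/>
          <!-- ...and Specimen would have to import TCS for its own back-references -->
        </xs:choice>
      </xs:complexType>
    </xs:element>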
An additional problem with the XML solution is that it is relatively brittle when it comes to changes in data models over time. Darwin Core has a high adoption rate because it is very simple. However, the simplicity that drove its adoption also encouraged different parts of the community to customize it to fit their needs; I know of at least three variants of Darwin Core in common use and there could be many more that I haven't encountered. Some of these variants were declared in their own namespaces, but others were not. This has made it quite difficult to write code that can ingest all variants of Darwin Core to extract even the minimal set of common elements such as ScientificName. Finally, it is not possible to validate most variants of Darwin Core (for a variety of reasons), which makes them poor candidates for XML-to-Object binding tools. In our example above, if a new variant of the Specimen schema were introduced, the XML-to-Object binding code that backs the deserialization of Specimen instances into the graph would most likely not be able to handle the new version. So, in order to use the Taxon Concept Browser in a heterogeneous network that has more than one version of the Specimen schema, even if the goal is simply to display the minimal set of elements common to each version, we would have to release a new version of the tool every time we deploy a new schema.
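To illustrate, here are two made-up Darwin Core variants of the sort I have in mind - the conceptual content is identical, but generic code cannot rely on any particular namespace or nesting being present:

    <!-- variant A: declared in its own namespace -->
    <dwc:record xmlns:dwc="http://example.org/darwin/variantA">
      <dwc:ScientificName>Aus bus L.</dwc:ScientificName>
    </dwc:record>

    <!-- variant B: no namespace, different nesting -->
    <record>
      <taxon>
        <ScientificName>Aus bus L.</ScientificName>
      </taxon>
    </record>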
That's not to say that RDF will solve all the problems. While it might make the design of flexible, modular data models and the software that uses them a bit easier, no one has ever proved that it will scale. Additionally, there is the temptation with RDF to catch what I call Ontology Fever. In terminal cases, this disease results in an obsession with using OWL Full to model the entire universe reductively at the level of the laws of physics. Any distributed data network afflicted by this disease is destined to die. That's why I prefer RDF Schema to OWL (though I think OWL may eventually play some role if we move towards RDF). For the same reason I think the primary use case is not inference over OWL-described RDF, but search over flexible RDF Schema-described data models. I personally think that RDF might make some use cases, especially the merge case, easier to handle (a sketch of why is below). So I'd like to see further discussion of the use cases above for both XML Schema and RDF.
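Here is the kind of thing I mean about merging: two chunks of RDF about the same LSIDs, returned by different servers, combine by simple statement union with no schema imports involved (the property names and URIs are invented):

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:ex="http://example.org/terms#">
      <!-- statements returned by the taxon concept server -->
      <rdf:Description rdf:about="urn:lsid:example.org:concepts:1234">
        <ex:voucherSpecimen rdf:resource="urn:lsid:example.org:specimens:5678"/>
      </rdf:Description>
      <!-- statements returned by the specimen server; the merged graph is just the union -->
      <rdf:Description rdf:about="urn:lsid:example.org:specimens:5678">
        <ex:scientificName>Aus bus L.</ex:scientificName>
      </rdf:Description>
    </rdf:RDF>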
In summary, the design of our shared data models is more a social process than a technical one and I agree with Bob that it should be carried out using a representation-agnostic modeling language. The technically difficult bit is designing the network of services that will allow one to use the data models. We have an intuitive idea of what the use cases are for such a system, but I'd like to see more discussion on that topic. Roger has started this off nicely by considering the differences between resolution and search but I'd like to continue the discussion into the other use cases like merging, visualization, etc. TAG seems like the best place to do so.
-Steve
Roger Hyam wrote:
Hi Bob,
I'm rushing off to the GISIN meeting in Agadir and might not have much time to respond further before midweek, or maybe even until I get back next week, but:
- I _wish_ this discussion were taking place in a wiki, with RSS or email notification, so it is easier to follow if you cannot keep up with the email
The way I was planning on running the TAG discussions was to have 'discussions' on the mailing list and summarize them to the wiki. The motivation behind this is to work towards the wiki being a readable document for the uninitiated. It should not be necessary for someone new to a field to have to read all the discussions that have taken place to reach a conclusion. These discussions should be available, but it is the job of an editor/facilitator to create a readable narrative from a possibly wandering dialog.
The wiki is here: http://www.tdwg.hyam.net/twiki/bin/view/TAG
The URL will change at some point in the next few months, but I will make sure all URLs forward to the appropriate place on the new server. There is no RSS feed on it at present; I'll see about setting one up either now or when we move it to the main server.
The mailing list archive is here: http://lists.tdwg.org/pipermail/tdwg-tag_lists.tdwg.org/ so any thread can be followed and resurrected at any time.
I take on board what you are saying, though, and will try to create links between the wiki and the list archive.
- I don't think specifications of high level things like "objects" should be done in a serialization constraint language such as RDF or XML Schema. Instead, it should have something more general as the normative definition and have _representations_ in one or more of such constraint languages. This is the usual W3C mechanism. Many (most?) W3C standards have a normative BNF definition, and one or more representations to allow implementers to actually do business. OMG favors UML for this, etc. There is nothing inherently normative about, say, RDF or XML Schema, for, say, TaxonConcepts. If you take the serialization language as the normative language, then in the future you just end up having to support several serialization languages when you find you want to extend your specification with something for which the chosen one is insufficiently expressive. This, in fact, is what is going on now with the cries for RDF over XML Schema. Put another way, if you choose language L as the normative language, you are not building a specification, but rather a set of constraints on applications written in L. Such things do not have as long a life as actual specifications do, and mature standards bodies do not seem to use serialization languages as the root specification language, as far as I can tell. My conclusion is that specifications should not be in anything like RDF or XML Schema, but in something else---BNF is probably adequate for most TDWG standards---with working subgroups responsible for publishing a serialization definition implementing the standard in languages useful for one or another purpose, e.g. LSID resolution.
Yes, I think you are right. We should be specifying our objects in a high level 'language' like UML (I'm not so sure about BNF, but then I am not very familiar with it). There has been talk about OWL Lite as a subset of UML. This was actually the next topic I was going to suggest and I'll kick off a thread on it soon if no one else does.
Can I take it from your reply that you think:
- There should be commonality between all TDWG 'objects', and that commonality should be their specification in UML/BNF/some other technology? (Yes to my question 1).
- There should be alternative ways to serialize these objects, and some of the serializations may support different aspects of the objects (Yes to my question 2).
- Neither XML Schema nor RDF/S is an appropriate way to define such objects.
Have I read this correctly?
Roger
Bob
On 2/17/06, *Roger Hyam* <roger@tdwg.org> wrote:
Hi All,

In a previous post I suggested definitions for Resolving, Searching and Querying from the point of view of the TAG. There has been a muted response, which I take to mean there aren't any strong objections to these definitions. We can come back to them later if need be. You can read the post here if you missed it: http://lists.tdwg.org/pipermail/tdwg-tag_lists.tdwg.org/2006-February/000009.html

I'd like to look at the implications of the first two definitions:
*Resolving.* This means to convert a pointer into a data object. Examples would be to resolve an LSID and get back either data or metadata, or to resolve a URL and get back a web page in HTML.
*Searching.* This means to select a set of objects (or their proxies) on the basis of the values of their properties. The objects are predefined (implicitly part of the call) and we are simply looking for them. An example would be finding pages on Google.
Both these definitions imply the existence of data 'Objects' or 'Structures' that are understood by the clients when they are received. The kinds of objects that jump to mind are Specimens, TaxonNames, TaxonConcepts, NaturalCollections, Collectors, Publications, People, Expeditions, etc. A piece of client software should be able to know what to do with an object when it gets one - how to display it to the user or map it to a database, etc.

My two leading questions are:

1. *Should there be commonality to all the objects?* If yes - what should it be? XML Schema location or OWL Class or something else? If no - then how should clients handle new objects dynamically - or shouldn't they be doing that kind of thing?

2. *Should we have multiple ways of representing the SAME objects?* e.g. should there be only one way to encode a Specimen, or should it be possible to have several encodings running in parallel? If there is only one way, how do we handle upgrades (where we have to run two types of encoding together during the roll out of the new one) AND how do we reach consensus on the 'perfect' way of encoding each and every object in our domain?

The answers I have for my leading questions are:

1. Yes - we should have some commonality between objects or it will be really difficult to write client code - but what that commonality is has to be decided.

2. Yes - the architecture has to handle multiple versions/ways of encoding any particular object type because any one version is not likely to be ideal for everyone forever.

Are the two conclusions I come to here reasonable? Is this too high level and not making any sense? I'd be grateful for your thoughts on this,

Roger
--
Roger Hyam Technical Architect Taxonomic Databases Working Group
http://www.tdwg.org roger@tdwg.org
+44 1578 722782