[Tdwg-tag] RDF instead of xml schema

Mon Mar 27 16:35:38 CEST 2006

Markus,

I see UBIF as a loose combination of two things.  

The first is a standard metadata envelope for returning TDWG data.  I see
this as a very good thing, although I think that as far as possible we must
make sure we make use of existing standards such as Dublin Core. 

The second is a set of approaches to modelling our data objects.  As I
suggested in my previous message, the work that TDWG has done with DiGIR and
Darwin Core could provide a really good basis for an RDF-like (GML-like)
approach based on XML Schema which would provide us with a standard
structure for modelling our objects.

I guess that the best way to handle the metadata in RDF would be to treat
everything (both data and metadata) as a retrievable RDF object and
therefore to link data objects to the objects representing the information
resource from which they come.
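
A minimal sketch of this linking idea, using plain Python tuples as stand-ins
for RDF triples.  The example.org URIs, the scientificName property and the
choice of dcterms:isPartOf as the link are assumptions made for illustration
only; nothing here has been agreed by TDWG.

```python
# Triples are (subject, predicate, object). All example.org URIs are hypothetical.
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

DATASET = "http://example.org/dataset/herbarium-1"
SPECIMEN = "http://example.org/specimen/42"

triples = [
    # Metadata about the information resource, expressed with Dublin Core.
    (DATASET, DC + "title", "Example Herbarium Dataset"),
    (DATASET, DC + "publisher", "Example Institution"),
    # A data object that links back to the resource it comes from.
    (SPECIMEN, "http://example.org/terms/scientificName", "Bellis perennis"),
    (SPECIMEN, DCTERMS + "isPartOf", DATASET),
]


def describe(subject, graph):
    """Return all (predicate, object) pairs for one retrievable object."""
    return [(p, o) for s, p, o in graph if s == subject]


def source_metadata(subject, graph):
    """Follow the isPartOf link and return the metadata of the source resource."""
    for predicate, obj in describe(subject, graph):
        if predicate == DCTERMS + "isPartOf":
            return dict(describe(obj, graph))
    return {}


metadata = source_metadata(SPECIMEN, triples)
print(metadata[DC + "title"])
```

Because the dataset is itself just another object, the same describe() call
serves both data and metadata.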

Thanks,

Donald

---------------------------------------------------------------
Donald Hobern (dhobern at gbif.org)
Programme Officer for Data Access and Database Interoperability 
Global Biodiversity Information Facility Secretariat 
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
---------------------------------------------------------------

-----Original Message-----
From: "Döring, Markus" [mailto:m.doering at BGBM.org] 
Sent: 27 March 2006 16:04
To: Donald Hobern; Tdwg-tag at lists.tdwg.org
Subject: Re: [Tdwg-tag] RDF instead of xml schema

Donald,

As you said, there are probably several ways of solving our current problems
(by the way, did we pin them down anywhere?). In this context I think we
should at least mention the UBIF idea based on XML Schema:

Inspired by SDD, there should be a common base schema for all other TDWG
schemas, the "Unified Biosciences Information Framework". It will hold
common and simple types. All major biodiversity objects (e.g. names, taxa,
specimens, people, bibliographic references) become more or less root-level
elements in the schema with a referenceable GUID. Each object would have an
extension slot (xsd:any) to carry any other data defined in extension schemas.

Does UBIF miss any of our requirements? The main trouble with UBIF was
agreeing on THE unified model and forcing the other standards to adopt it.
But that is something we will face in any solution, regardless of technology.
We will all have to come together to model the core entities at least. By the
way, the largest part of UBIF was about dataset metadata (metadata about the
information source). How would we share this data in RDF? Would every RDF
biodiversity object reference another metadata resource, based on Dublin Core?

Markus

-----Original Message-----
From: Tdwg-tag-bounces at lists.tdwg.org
[mailto:Tdwg-tag-bounces at lists.tdwg.org] On Behalf Of Donald Hobern
Sent: Saturday, 25 March 2006 10:18
To: Tdwg-tag at lists.tdwg.org
Subject: Re: [Tdwg-tag] RDF instead of xml schema

What I somehow still failed to say clearly at the end of my previous post
was that "RDF technologies are an excellent way to do this... BUT we may
well have other mechanisms we could use".  My real interests are not in
getting RDF adopted but in finding a really good solution to all our
modelling problems and answering the three needs I gave in my previous post.
I have two options that I see at this stage as plausible ways to do this.

In both cases, I would of course recommend that we model our objects using a
more neutral language such as UML and then generate the encoding models we
actually use.  After that here are the two options I see (not in any order
of priority).

- - - - -

1. The "new-style" RDF-based ontology approach

I see the direct use of plain RDF as the encoding as roughly equivalent to
using an assembler as our programming language.  We would be able to do
absolutely everything we would like to do but there is no real need for the
pain and ugliness of such a low-level representation.  I understand the
natural response of shock when people are presented with a mass of RDF
triples.

I would much rather adopt a higher-level standard (probably OWL Lite) to
allow us to represent the same information in a more familiar and friendly
way and make our underlying classes clearer.

I believe that the main problems from use of RDF will arise from attempts to
manage inferences based on the underlying triples rather than in the basic
modelling of objects and exchanging and consuming data encoded in RDF or
RDF-based languages.  For me, the main reasons for considering RDF are all
in this second (easier) set of functions, and therefore I'm not really
worried by the technology.  If the tools mature and we can subsequently use
inferential approaches (and exploit the power of SPARQL, etc.) then we will
be in an excellent position to benefit - otherwise nothing is lost.

Despite the doubts that met this approach last week, I still tend to think
that we should consider using an RDF (OWL Lite) approach while at the same
time supporting the use of XML schema models which conform to valid subsets
of our RDF models.  This doesn't seem hard (although I have been too busy
this week to look closely at what Roger has done in this area) and would
immediately mean that the less IT-oriented members of our community would be
able to continue working with the tools they have already got used to.

- - - - -

2. The modified "new-style" XML schema approach

In my opinion, the power present in the DiGIR family of protocols and
particularly TAPIR, when combined with conceptual schemas such as Darwin
Core, is enormous.  The developers of DiGIR and Darwin Core produced a model
which has most of the strengths I am looking for in my previous post, but
uses the XML schema concept of substitution groups.  It supports extension
in a way that is very like RDF.  Its biggest weakness is that it does not
provide any ontological underpinnings of any kind.  Neither DiGIR nor Darwin
Core makes any commitments regarding the class of object described in the
records returned.  DiGIR provides a generic query language.  Darwin Core
provides a set of useful descriptors.   I could use Darwin Core to encode
data on a collection of stamps illustrating plants and animals
(ScientificName, Country, CollectionDay, etc.).  We need the ability to
identify what sort of objects are being described.  (I gave a long and
confusing presentation at the TDWG meeting in Christchurch which was my
first attempt to explain this - I'm not sure anyone had a clue what I was
trying to say.)  See:

http://www.tdwg.org/2004meet/EV/TDWG_2004_Papers_Hobern_4.zip

We also had real problems applying DiGIR to ABCD because substitution groups
will not work with complex documents like ABCD.  
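
To make the point about untyped records concrete, here is a small sketch in
Python.  The descriptor names come from the stamp example above; the "class"
key, the class names and the values are assumptions invented for
illustration.

```python
# Two records built from the same descriptors: one describes a museum
# specimen, the other a postage stamp.  All values are invented.
specimen_record = {
    "ScientificName": "Apteryx australis",
    "Country": "New Zealand",
    "CollectionDay": 12,
}
stamp_record = {
    "ScientificName": "Apteryx australis",
    "Country": "New Zealand",
    "CollectionDay": 3,
}

# Nothing in the descriptors themselves tells the two kinds of object apart.
same_shape = specimen_record.keys() == stamp_record.keys()

# Hypothetical fix: every record declares the class of object it describes.
typed_records = [
    {"class": "Specimen", **specimen_record},
    {"class": "PostageStamp", **stamp_record},
]

# A consumer can now select only the objects it understands.
specimens = [r for r in typed_records if r["class"] == "Specimen"]
print(same_shape, len(specimens))
```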

However we could simply carry out some fairly simple modifications to our
existing XML schemas and solve these problems.  We would need to do the
following:

* Determine a basic ontology of biodiversity data objects (Specimen,
Locality, Character, etc.) and some of their fundamental relationships
(collectedAt, hasCharacter, etc.)
* Restructure our current schemas so that each schema is a collection of
descriptive properties for one of these classes (perhaps a substitution
group for properties for each class - like GML?) and a container element
representing an instance of the class (and holding a collection of
descriptive properties for the class).  Note that some property elements
would be RDF-like references to other objects (e.g. <collectedAt
ref="locality1"/>) or inline versions of such objects (e.g.
<collectedAt><Locality>...</Locality></collectedAt>).
* Enhance TAPIR so that each resource identifies itself as returning objects
from one of the standard classes (rather than untyped records)
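
The reference and inline forms of a property described above could be
consumed along these lines.  This is only a sketch: the <dataset> wrapper,
the id attribute and the <name> property are assumptions, and none of this
is taken from an existing TDWG schema.

```python
import xml.etree.ElementTree as ET

# One specimen points at a locality by reference, the other carries it inline.
DOC = """
<dataset>
  <Locality id="locality1"><name>Christchurch</name></Locality>
  <Specimen id="specimen1">
    <collectedAt ref="locality1"/>
  </Specimen>
  <Specimen id="specimen2">
    <collectedAt><Locality id="locality2"><name>Copenhagen</name></Locality></collectedAt>
  </Specimen>
</dataset>
"""


def resolve(prop, index):
    """Return the object a property element points to, by reference or inline."""
    ref = prop.get("ref")
    if ref is not None:
        return index[ref]  # RDF-like reference to an object defined elsewhere
    return prop[0]  # inline form: the object is the first child element


root = ET.fromstring(DOC)
# Index every identified object in the document, whatever its class.
index = {el.get("id"): el for el in root.iter() if el.get("id")}

localities = {}
for specimen in root.findall("Specimen"):
    locality = resolve(specimen.find("collectedAt"), index)
    localities[specimen.get("id")] = locality.findtext("name")

print(localities)
```

Both specimens end up with a Locality object, so code that consumes the data
does not need to care which form the provider chose.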

I suspect that the structure of TCS and SDD would make this really easy in
those cases.  ABCD would need to be split into classes such as Unit,
GatheringEvent, Locality, Collection and Collector rather than having
everything presented as nested properties of a Unit, but most of the work
would survive quite cleanly.

This approach would allow us to extend the properties for any class just as
easily as we can Darwin Core.  We could keep using TAPIR as our search
protocol almost unchanged.

- - - - -

I am sure that there are other options but each of these seems a fairly easy
and powerful development from where TDWG has already gone.  The second is
much closer to what has been done in the past (although it still represents
a more object-oriented approach instead of the current document-oriented
one).  However I am not sure what we gain at that stage from not simply
using OWL Lite for the models.

Anyway, I hope this clarifies where I am coming from.

Donald

---------------------------------------------------------------
Donald Hobern (dhobern at gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
---------------------------------------------------------------

Donald Hobern wrote:
> Gregor,
> 
> I can understand your angst, but I would like to suggest that XML 
> schema actually only really provides good support for some aspects of 
> OO modelling.  Extending classes is a real problem.
> 
> A data model encoded in RDF can still make use of an ontology language 
> to provide greater rigour in the way that objects are defined.
> 
> As was indicated in some of the earlier messages here, it is even 
> possible to put together a data model which looks fundamentally just 
> the same as one defined using XML schema but which is using RDF
> technologies under the covers and which consequently is easier to
> extend than XML schema.
> 
> For me however the biggest factors of importance in a revision of our 
> data models would be:
> 
> 1. A cleaner separation between different object classes (not all
> versioned in a single schema).
> 
> 2. A good model to support easy extension (using a multiple
> inheritance approach) so that different (potentially overlapping)
> communities can add extra information in the ways that best suit them.
> 
> 3. An underlying ontology that is sufficient for us at least to 
> identify the object class of each record.
> 
> RDF technologies are an excellent way to do this.  GML has managed to 
> produce many of the same features, but has probably done so largely by 
> replicating the essentials of RDF modelling.
> 
> Thanks,
> 
> Donald
>  
> ---------------------------------------------------------------
> Donald Hobern (dhobern at gbif.org)
> Programme Officer for Data Access and Database Interoperability
> Global Biodiversity Information Facility Secretariat
> Universitetsparken 15, DK-2100 Copenhagen, Denmark
> Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
> ---------------------------------------------------------------
> 
> 
> -----Original Message-----
> From: Tdwg-tag-bounces at lists.tdwg.org
> [mailto:Tdwg-tag-bounces at lists.tdwg.org] On Behalf Of Gregor Hagedorn
> Sent: 24 March 2006 18:37
> To: Tdwg-tag at lists.tdwg.org
> Subject: [Tdwg-tag] RDF instead of xml schema
> 
> Hi all,
> 
> RDF to me appears on a level of abstraction making it very hard for me 
> to follow the documentation and discussion. Most of the examples are 
> embedded in artificial intelligence / reasoning use cases that I 
> have no experience with.
> 
> I am a biologist and I feel comfortable with UML, ER-modeling, 
> xml-schema-modeling, and - surprise - relational databases. I believe 
> many others are as well - how many datastores are actually built upon 
> RDBMS technology?
> 
> To me xml-schema maps nicely to both UML-like OO-modeling and 
> Relational DBMS.
> I can guess about the advantages of opening this all up and seeing the
> world as a huge set of unstructured statement tuples. But it also
> scares me.
> 
> Angst is a bad advisor. But then, what if only a minority of the few 
> people currently involved can follow on the RDF abstraction level? A few 
> questions I have:
> 
> * Would we be first in line to try rdf for such complex models as 
> biodiversity informatics?
> 
> * Do Genbank/EMBL with their hundreds of employees and programmers use
> rdf? Internally/externally? Molecular bioinformatics is probably 1000 
> times larger than our biodiversity informatics.
> 
> * Why are GML, SVG etc. based on xml schema and not RDFS? Is this just 
> historical?
> 
> * Are there any tools around that let me import RDF into a relational 
> database? (Simple tools for xml-schema-based import/export are almost a
> standard part of databases now, or you can use comfortable graphical
> tools like Altova MapForce.)
> 
> -- I am just trying to test some tools to help me to visualize RDFS 
> productions (like Roger has sent around) on a level comparable with 
> the UML-like xml-schema editors (Spy, Stylus, Oracle, etc.). I will try 
> Altova SemanticWorks and Protege over the next week. The screenshots 
> seem to be about AI and the semantic web much more than about
> information models (those creatures where you try to simplify the
> world to make it manageable...).
> 
> Gregor
> ----------------------------------------------------------
> Gregor Hagedorn (G.Hagedorn at bba.de)
> Institute for Plant Virology, Microbiology, and Biosafety
> Federal Research Center for Agriculture and Forestry (BBA)
> Königin-Luise-Str. 19           Tel: +49-30-8304-2220
> 14195 Berlin, Germany           Fax: +49-30-8304-2203
> 
> 
> _______________________________________________
> Tdwg-tag mailing list
> Tdwg-tag at lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org
> 

-- 
Robert A. Morris
Professor of Computer Science
UMASS-Boston
http://www.cs.umb.edu/~ram
phone (+1)617 287 6466
