[tdwg-tapir] Interpretation of TAPIR filters


Renato De Giovanni

23 Nov 2007

01:53

Dear all,

I came across an issue that I know was discussed before, but I couldn't find anything about it in the spec, nor a message with a final agreement or solution.

The issue is: how should TAPIR filters be interpreted if they reference concepts that are not understood by providers, or if they reference parameters that were not passed?

Query templates may often include parameterized filters with many conditions, such as:

  <filter>
    <and>
      <equals>
        <concept id="http://rs.tdwg.org/dwc/dwcore/Genus"/>
        <parameter name="genus"/>
      </equals>
      <equals>
        <concept id="http://rs.tdwg.org/dwc/dwcore/SpecificEpithet"/>
        <parameter name="species"/>
      </equals>
      <equals>
        <concept id="http://rs.tdwg.org/dwc/dwcore/Country"/>
        <parameter name="country"/>
      </equals>
    </and>
  </filter>

And it's important to be able to use the same query template even when not all parameters are passed, allowing all of these calls:

  http://myprovider.org/tapir?op=search&t=mytemplate&genus=x
  http://myprovider.org/tapir?op=search&t=mytemplate&genus=x&species=y
  http://myprovider.org/tapir?op=search&t=mytemplate&country=z

and so on. Otherwise we would need a different template for each combination of parameters, which would certainly make clients very unhappy.

I've been discussing this with Markus and we're leaning towards the following suggestion about how to interpret filters:

1- Comparative operators (COPs) that reference unspecified parameters should simply be dropped from the filter.

2- COPs must evaluate to false if they reference a concept that is not understood by the provider, unless they also reference an unspecified parameter - in which case the COP must be dropped from the filter.

3- After dropping a COP, if it was part of a logical operator (LOP), then the LOP should also be dropped if there are no boolean operators left inside the LOP.

4- After dropping a LOP, if it was part of another LOP, then the parent LOP should also be dropped if there are no boolean operators left inside the parent LOP.

There's clearly a gap in the specification about this and we need to fix it, so please let me know if you have any thoughts. If there are no objections or better suggestions, then I'll probably use the proposal above to amend the spec.

Many thanks!
--
Renato
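The four rules above can be sketched in a few lines of Python. This is only an illustration of the proposal, not code from any TAPIR wrapper: the names `prune`, `interpret_filter`, `DROP` and `FILTER` are hypothetical, the filter is parsed without XML namespaces for brevity, and the set of LOPs is assumed to be and/or/not.

```python
import xml.etree.ElementTree as ET

LOPS = {"and", "or", "not"}   # assumed set of logical operators
DROP = object()               # sentinel: this node is removed from the filter


def prune(node, params, mapped):
    """Return a simplified condition, False, or DROP for one filter node."""
    if node.tag in LOPS:
        kept = [k for k in (prune(c, params, mapped) for c in node)
                if k is not DROP]
        # Rules 3 and 4: a LOP with nothing left inside is dropped as well.
        return (node.tag, kept) if kept else DROP
    # Anything else is treated as a COP such as <equals>.
    concepts = [c.get("id") for c in node.iter("concept")]
    names = [p.get("name") for p in node.iter("parameter")]
    if any(name not in params for name in names):
        return DROP           # rule 1, and the exception in rule 2
    if any(cid not in mapped for cid in concepts):
        return False          # rule 2: unmapped concept evaluates to false
    return (node.tag, concepts, [params[n] for n in names])


def interpret_filter(filter_xml, params, mapped):
    """Return the pruned filter, or None if no filter remains."""
    root = ET.fromstring(filter_xml)
    if len(root) == 0:
        return None
    result = prune(root[0], params, mapped)
    return None if result is DROP else result


DWC = "http://rs.tdwg.org/dwc/dwcore/"
FILTER = f"""
<filter><and>
  <equals><concept id="{DWC}Genus"/><parameter name="genus"/></equals>
  <equals><concept id="{DWC}SpecificEpithet"/><parameter name="species"/></equals>
  <equals><concept id="{DWC}Country"/><parameter name="country"/></equals>
</and></filter>
"""

# A provider that has not mapped Country, queried with only genus=x:
mapped = {DWC + "Genus", DWC + "SpecificEpithet"}
print(interpret_filter(FILTER, {"genus": "x"}, mapped))
```

With no parameters at all, every COP is dropped, the enclosing `and` is dropped with them, and `interpret_filter` returns `None`, meaning the search runs unfiltered.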


Roger Hyam

23 Nov 2007

15:12

On 23 Nov 2007, at 01:53, Renato De Giovanni wrote:

> 2- COPs must evaluate to false if they reference a concept that is not understood by the provider, unless they also reference an unspecified parameter - in which case the COP must be dropped from the filter.

Does this mean that if I have a filter that specifies long, lat and altitude, and a provider who has no notion of altitude:

1) requests that include the long, lat and alt parameters will return matches for records with those longs and lats (the alt COP will have been dropped from the equals block)

2) requests that include just long and lat parameters will match nothing because the alt COP will evaluate to false.

To get a result from a provider one has to define a parameter for a concept they don't map! If this is correct, isn't it counterintuitive? Perhaps I have interpreted it wrongly. (long lat is probably a bad example)

A related issue to this is sorting. It is not clear what the orderBy operator does because there are no data types in TAPIR concepts. Should the wrapper order results by the conceptual XML schema data type (if there is one) or the output model data type?

I can't see solutions for this because I can't see how a wrapper would implement sorting mechanisms other than the ones it knows, i.e. the types of its columns. Perhaps the capabilities should indicate the data type the concepts are treated as, e.g.

  <mappedConcept id="http://example.net/redlist/DataSourceCode" alias="DataProviderCode" as="xsd:string" />

meaning "We have mapped this concept and we treat it like a string".

This problem grows when we think of data types that may be serialized in different ways (date is the only one I can think of). If I want all the records before a certain date, how do I pass the date? One of the ISO standards? Unix timestamp? How do we handle granularity? etc. We don't define that anywhere, and if the wrapper just takes the string value it is given and passes it to the database we might get different results from different RDBMSs, or even errors.

Sorry to muddy the waters.

Roger

Renato De Giovanni

05:29

Hi Roger,

> Does this mean that if I have a filter that specifies long, lat and altitude and a provider who has no notion of altitude
>
> 1) requests that include the long, lat and alt parameters will return matches for records with those longs and lats (the alt COP will have been dropped from the equals block)

No. If a request contains the parameters long, lat and alt, it will only get records from providers that have mapped all three underlying concepts. If a provider didn't map alt, you'll get no results back because the alt condition will evaluate to false.

> 2) requests that include just long and lat parameters will match nothing because the alt COP will evaluate to false.

No. Requests containing only long and lat will return records from all providers that mapped the two underlying concepts, even if they didn't map alt (in this case the alt condition will be dropped).

> To get a result from a provider one has to define a parameter for a concept they don't map! If this is correct isn't it counter intuitive? Perhaps I have interpreted it wrongly. (long lat is probably a bad example)

Well, that proposal still makes sense to me, unless I'm missing something. The only thing, still using your example, is that if no parameters are passed, all conditions will be dropped, which means there will be no filter and you'll get all records back (truncated by the maximum element repetitions, of course). But I don't think this is a problem.
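The three cases discussed here (all parameters passed, only long and lat passed, nothing passed) can be checked with a small sketch for a flat AND of equals conditions. The function name `active_conditions` and the concept names `Longitude`, `Latitude` and `Altitude` are placeholders chosen for illustration, not real concept identifiers.

```python
def active_conditions(template, params, mapped):
    """Apply the proposed rules to a flat AND of equals conditions.

    template maps parameter name -> concept name.
    Returns None when no filter remains (all records match),
    False when the filter can never match,
    or a dict of concept -> value conditions still to be evaluated.
    """
    kept = {}
    hit_unmapped = False
    for param, concept in template.items():
        if param not in params:
            continue                 # unspecified parameter: condition dropped
        if concept not in mapped:
            hit_unmapped = True      # unmapped concept: condition is false
        else:
            kept[concept] = params[param]
    if hit_unmapped:
        return False                 # one false operand makes the AND false
    return kept or None


template = {"long": "Longitude", "lat": "Latitude", "alt": "Altitude"}
mapped = {"Longitude", "Latitude"}   # this provider has no notion of altitude

print(active_conditions(template, {"long": "10", "lat": "20", "alt": "5"}, mapped))
print(active_conditions(template, {"long": "10", "lat": "20"}, mapped))
print(active_conditions(template, {}, mapped))
```

The first call yields `False` (no results), the second keeps the two mapped conditions, and the third yields `None`, i.e. an unfiltered search.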

> A related issue to this is sorting. It is not clear what the orderBy operator does because there are no data types in TAPIR concepts.

Concept data types are specified by the respective conceptual schema. The data types for ABCD or DarwinCore concepts, for example, are all well-defined by their XML Schemas. But if the conceptual schema is just a simple list in a CNS configuration file, with no other external definitions, then I agree it may be an issue. Perhaps we should recommend more explicitly that concepts are better defined in a format which includes data types? (It could even be just a simple list of global elements in an XML Schema.)

> Should the wrapper order results by the conceptual XML schema data type (if there is one) or the output model data type?

It should order results by the data type defined in the conceptual schema.

> I can't see solutions for this because I can't see how a wrapper would implement alternative sorting mechanisms from the ones it knows i.e. the types of its columns. Perhaps the capabilities should indicate the data type the concepts are treated as e.g.
>
> <mappedConcept id="http://example.net/redlist/DataSourceCode" alias="DataProviderCode" as="xsd:string" />
>
> Meaning "We have mapped this concepts and we treat it like a string".

I like the idea, even if it will just repeat information that can be found in the conceptual schemas. Not sure what the others think about this. But I would make this new attribute optional, for backwards compatibility and also considering possible class concepts in the future.

> This problem grows when we think of data types that may be serialized in different ways (date is the only one I can think of). If I want all the records before a certain date how do I pass the date?

In this case you should pass a value compatible with the corresponding data type. If the data type somehow accepts values in different formats (for instance '2002-09-24Z' or '2002-09-24+06:00') then the provider should be able to understand all these formats when processing the query.

Best Regards,
--
Renato
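As a rough illustration of accepting both date forms mentioned above, a provider could normalise an xsd:date style value before handing it to the database. This is a minimal sketch, assuming a hypothetical helper `parse_xsd_date`; it handles only the plain `YYYY-MM-DD` form with an optional `Z` or `+hh:mm` / `-hh:mm` suffix and does not validate the offset range.

```python
import re
from datetime import date, timedelta, timezone

_XSD_DATE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})(Z|[+-]\d{2}:\d{2})?$")


def parse_xsd_date(text):
    """Split an xsd:date style string into (date, timezone or None)."""
    match = _XSD_DATE.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognised date value: {text!r}")
    year, month, day, tz = match.groups()
    day_value = date(int(year), int(month), int(day))  # rejects e.g. month 13
    if tz is None:
        return day_value, None            # no timezone given
    if tz == "Z":
        return day_value, timezone.utc
    sign = 1 if tz[0] == "+" else -1
    offset = timedelta(hours=int(tz[1:3]), minutes=int(tz[4:6]))
    return day_value, timezone(sign * offset)


for value in ("2002-09-24", "2002-09-24Z", "2002-09-24+06:00"):
    print(value, "->", parse_xsd_date(value))
```

Returning the date and the offset separately leaves it to the wrapper to decide how (or whether) the timezone affects the comparison in the underlying database.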

Roger Hyam

21:22

Hi Renato,

I think I follow you - but I'll probably believe it when I see it spelled out in the spec.

I think some of the sorting stuff should be in the specification. There are some things there that are implied but not expressed.

1) Schema concepts should/must only be mapped to columns of the same data type in the host database, or at least wrappers should act as if they are.

2) Values in queries (and parameters) should use the xsd serialization of date (and other things?). It is the wrapper's responsibility to present them correctly to the underlying database.

3) If the concept does not have a data type, is the behaviour of orderby, greaterthan and lessthan undefined? If we stick to XML Schema concept schemas then they default to string, so maybe the default should be to act as string.

What do you think?

Roger

_______________________________________________
tdwg-tapir mailing list
tdwg-tapir@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tapir
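To see why the data type matters for ordering, and what a "string by default" rule would mean in practice, consider the small sketch below. The `CASTS` table and the `order_by` function are hypothetical, and falling back to string for an unknown type is an assumption made here for simplicity.

```python
from decimal import Decimal

# Hypothetical mapping from a declared data type to a sort key.
CASTS = {"xsd:string": str, "xsd:int": int, "xsd:decimal": Decimal}


def order_by(values, datatype=None):
    """Sort raw string values according to the declared data type.

    An undeclared (or unrecognised) data type is treated as string.
    """
    cast = CASTS.get(datatype, str)
    return sorted(values, key=cast)


altitudes = ["10", "9", "100"]
print(order_by(altitudes))             # ['10', '100', '9']
print(order_by(altitudes, "xsd:int"))  # ['9', '10', '100']
```

The same three values come back in a different order depending on whether they are compared as strings or as integers, which is exactly the ambiguity a declared data type would remove.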

Renato De Giovanni

26 Nov 2007

01:18

Hi Roger,

Yes, I fully agree that the spec should be revised to be clearer about data types. I also like the idea of using "string" as a default data type. Let's just wait a bit more to see if there are other suggestions and opinions.

Best Regards,
--
Renato

Döring, Markus

4 Dec 2007

09:48

Hi,

Just catching up. I agree with Renato about all the filter issues, as you probably have guessed.

Regarding sorting now. In PyWrapper I have left the "how" it gets sorted to the underlying database type, and that might be very different from the conceptual or model one. But I wonder if it is really important to be specific about the sorting order? At least it is sorted in some stable way.

I agree it makes sense to announce that datatype in the capabilities, so you understand the sorting. That can easily be done using XML Schema datatypes (they should cover nearly all db types). The underlying datatype also affects which COPs you can use with it. You will get an error with PyWrapper, for example, if you do a LIKE on a date or integer type.

If we force people to adapt the datatypes in their underlying database, this is quite a burden. Every data provider would need a copy of their database and would never be able to use the original dataset. That might not be a problem, and in fact it allows you to do quite some data transformation in between, but for many providers I know this will be too much - or they will almost never update their data clone.

So for now I would suggest indicating the underlying db type in the concept capabilities and just using whatever there is for ordering. We should probably come up with a standard error for the CopNotSupportedByLocalDatatype case.

How do you deal with this in TapirLink, Renato?

Markus
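The kind of check described above (rejecting LIKE on a non-string column) might look roughly like this. It is not PyWrapper or TapirLink code: the exception class simply borrows the error name suggested in the message, and `check_cop`, `STRING_ONLY_COPS` and the local type names are assumptions for the sake of the example.

```python
class CopNotSupportedByLocalDatatype(Exception):
    """Hypothetical standard error: COP cannot be applied to the local type."""


# Assumption: only "like" is restricted to string-typed columns.
STRING_ONLY_COPS = {"like"}
STRING_TYPES = {"string"}


def check_cop(cop, concept_id, local_type):
    """Raise if the COP cannot be used with the locally mapped datatype."""
    if cop in STRING_ONLY_COPS and local_type not in STRING_TYPES:
        raise CopNotSupportedByLocalDatatype(
            f"'{cop}' is not supported for {concept_id} "
            f"(local datatype: {local_type})"
        )


check_cop("like", "http://rs.tdwg.org/dwc/dwcore/Genus", "string")  # passes
try:
    check_cop("like", "http://example.net/concepts/CollectingDate", "date")
except CopNotSupportedByLocalDatatype as err:
    print("error:", err)
```

Running the check before the query is translated would let a wrapper report the same error regardless of which database sits underneath.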

Renato De Giovanni

17:17

Hi Markus,

I think we could still recommend that providers share their data using the same datatypes defined by the conceptual schemas, but without being too strict. After all, it's better to share something using a different datatype than not to share anything. In the future, the TAPIR Tester could try to check this and raise warnings when the concept datatype is different from the mapped (declared) datatype.

So if we all agree, here's a summary of the necessary changes:

* Add an optional attribute (datatype) for each mapped concept in capabilities responses.

* Datatypes would come from XML Schema built-in datatypes and should be declared with the full URI, such as: http://www.w3.org/2001/XMLSchema#int

* Providers should declare the underlying datatype used when mapping the concept, which should preferably be the same datatype defined by the corresponding conceptual schema.

* The default datatype is http://www.w3.org/2001/XMLSchema#string

However, there's one remaining issue: should we handle custom datatypes? For example, what would be the corresponding datatype for DarwinCore/ABCD collecting dates? (Now it's a custom DateTimeISO.)

Regarding standard TAPIR errors, I certainly agree it would be interesting to define them. I wish I had more time to revise what we have and make a proposal.

About TapirLink, I think I used a similar approach. An error is raised if you try to use "like" with non-string datatypes. Also, the configurator doesn't check if the underlying datatype is compatible with the one defined by the conceptual schema.

Best Regards,
--
Renato
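A client could read the proposed optional datatype attribute along these lines. The capabilities fragment is a simplified, namespace-free stand-in rather than a real TAPIR response, the second concept id is made up, and `concept_datatypes` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"  # proposed default

# Simplified stand-in for the relevant part of a capabilities response.
CAPABILITIES = """
<concepts>
  <mappedConcept id="http://example.net/redlist/DataSourceCode"/>
  <mappedConcept id="http://example.net/redlist/RecordCount"
                 datatype="http://www.w3.org/2001/XMLSchema#int"/>
</concepts>
"""


def concept_datatypes(capabilities_xml):
    """Map each concept id to its declared datatype, defaulting to string."""
    root = ET.fromstring(capabilities_xml)
    return {
        element.get("id"): element.get("datatype", XSD_STRING)
        for element in root.iter("mappedConcept")
    }


for concept_id, datatype in concept_datatypes(CAPABILITIES).items():
    print(concept_id, "->", datatype)
```

Because the attribute is optional, older providers that omit it are simply read as declaring string, which keeps the change backwards compatible.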
