[Tdwg-guid] Annotation and Link Out
Kevin Richards
richardsk at landcareresearch.co.nz
Fri May 19 10:21:47 CEST 2006
My initial thought is that a FAN-type system is probably intended for
the uncommon/unhandled case where a third party needs to add
annotations to someone else's LSIDs. I would imagine that in most cases
linkages between data stored at different locations/authorities would be
specified in the RDF of each LSID. E.g., one LSID's metadata might have a
field indicating that it is a "correctedSpellingOf" another LSID.
Kevin
>>> Steven Perry <smperry at ku.edu> 05/19/06 4:14 PM >>>
I've been thinking about annotation, attribution, and LSID Foreign
Authority Notification and wanted to continue the discussions from last
month.
While I like the idea of notification of annotations, I'm not a fan of
FAN. In the first paragraph of the FAN proposal it states "An example
authority might return metadata service endpoints that contain metadata
(for example annotations) about LSIDs for which it is not the
authority".
I'll explore this idea using two example LSID authorities, A and B.
Assume that A has published some metadata identified by an LSID it
assigned (A is the original source) and that B wants to annotate it. B
could be doing so for one of three reasons: to suggest a revision to A's
metadata (correction), to propose additional information that
supplements A's metadata (annotation), or to intentionally confuse
clients and disrupt the system (spam).
A close reading of the FAN proposal makes me think that, according to
the authors, B will create assertions directly about A's metadata
(meaning B will use A's LSIDs as the subject of assertions). FAN
provides a method for B to notify A that it has done so and suggests a
method for telling clients of A's resolver to directly contact B (using
a modified form of resolution) to get additional information about A's
LSID. This leads me to believe that the authors think that a single
metadata object could be split between multiple authorities.
Because of this I see three problems with FAN: it increases the burden
on LSID resolution clients and perhaps on the LSID resolution system as
a whole, it makes attribution of assertions about an LSID indirect, and
it makes it easier to spam the network.
FAN-aware resolution is more intensive than normal resolution and
requires the following process: The client knows an LSID. It performs
normal resolution by querying DNS for the service record of the
authority, contacting the authority to find the correct endpoint, and
calling getMetadata() on that endpoint. Next it must parse this
metadata and look for the special predicate which indicates that a
foreign authority contains more information about an LSID. Since the
client does not know if it has retrieved the entire metadata object
(some pieces may exist on other authorities), it must perform a modified
getMetadata() call for each foreign authority. To do so, the proposal
suggests that the client again queries DNS for the foreign authority
(because only the authority name is returned in the first resolution),
gets its endpoint, and calls getMetadata() with the original LSID
(though it was assigned by a different authority). The implied final
step is that the metadata from the foreign authority is merged with the
metadata from the authoritative source before being handed back to
calling code.
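
To make the cost concrete, here is a rough sketch of the FAN-aware
resolution steps described above. It is only an illustration under
assumptions: the predicate name hasForeignAuthority, the in-memory
AUTHORITIES table (standing in for DNS lookup, authority contact, and
getMetadata()), and all function names are hypothetical and are not taken
from the FAN proposal or from any LSID library.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical predicate name; the FAN proposal's real predicate may differ.
FOREIGN_AUTHORITY = "hasForeignAuthority"

# Toy stand-in for the network: authority name -> LSID -> metadata triples.
AUTHORITIES: Dict[str, Dict[str, List[Triple]]] = {
    "A": {
        "urn:lsid:A:names:1": [
            ("urn:lsid:A:names:1", "fullScientificName",
             "Salmo saler, Linnaeus, 1758"),
            ("urn:lsid:A:names:1", FOREIGN_AUTHORITY, "B"),
        ]
    },
    "B": {
        # B holds assertions about an LSID that A issued.
        "urn:lsid:A:names:1": [
            ("urn:lsid:A:names:1", "hasSpecimen", "urn:lsid:B:specimens:3"),
        ]
    },
}


def authority_of(lsid: str) -> str:
    """Return the authority part of urn:lsid:<authority>:<namespace>:<object>."""
    return lsid.split(":")[2]


def get_metadata(authority: str, lsid: str, log: List[str]) -> List[Triple]:
    """Simulate one full round trip: lookup, endpoint discovery, getMetadata()."""
    log.append(f"lookup+getMetadata {authority}")
    return AUTHORITIES.get(authority, {}).get(lsid, [])


def resolve_fan_aware(lsid: str, log: List[str]) -> List[Triple]:
    """Resolve against the issuing authority, then against every foreign one."""
    merged = list(get_metadata(authority_of(lsid), lsid, log))
    foreign = [o for (_, p, o) in merged if p == FOREIGN_AUTHORITY]
    for authority in foreign:
        # The client cannot skip this: it does not know what the foreign
        # authority holds until it has asked.
        merged.extend(get_metadata(authority, lsid, log))
    return merged
```

Calling resolve_fan_aware("urn:lsid:A:names:1", log) makes one round trip
per authority, so the number of lookups grows with the number of foreign
authorities that have registered themselves, whether or not the caller
wants their data.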
The assumption that a metadata object *could* be split across multiple
authorities forces the client to query all foreign authorities, even if
those authorities only contain annotations that the client is not
interested in. To look at an example case of names and specimens, it
might be that we want notification of usage of names by specimens so we
can immediately retrieve all specimens for a name. Assume A issues a
name and B issues a specimen.
Under FAN, the B authority could add an assertion to their data store
that says [urn:lsid:A:names:1 hasSpecimen urn:lsid:B:specimens:3]. I
consider this an annotation (rather than a revision, modification, or
correction) because the fact that B linked a specimen to a name does not
change the semantic meaning of the name object (it merely provides
additional information). If you're interested in just the name and you
don't know by resolving it against A that you've got the complete set of
name metadata, you're forced to get a bunch of information about
specimens that you're not interested in (and there could be a great deal
of this kind of information).
To complicate matters, C might want to propose a correction to A. C
might believe that A misspelled the name. Consider this simple example:
A's original assertion:
    urn:lsid:A:names:1 fullScientificName "Salmo saler, Linnaeus, 1758"
C's proposed modification:
    urn:lsid:A:names:1 fullScientificName "Salmo salar, Linnaeus, 1758"
Under FAN, C would propose its modification and notify A that it did
so. When a FAN-aware client goes to resolve A:names:1, it will get back
some data that changes the meaning of A's original object (C's
assertion) and a lot of data that's not directly about it (all the
specimen assertions). What's worse, it now has both of these assertions
in the same model:
    urn:lsid:A:names:1 fullScientificName "Salmo saler, Linnaeus, 1758"
    urn:lsid:A:names:1 fullScientificName "Salmo salar, Linnaeus, 1758"
If the names class is defined with fullScientificName as a functional
property (an instance can have only one value for the property) we're in
trouble: we no longer have a valid OWL instance. So the FAN merge
process has to know both that this is a problem, by understanding the
ontology for names, and how to fix it, by deciding whether or not to
trust C and selecting one assertion or the other, but not both.
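
As a rough sketch of the check a merging client would need, the snippet
below scans merged triples for a subject that carries more than one
distinct literal value for a property declared functional. The FUNCTIONAL
set and the function name are assumptions made for illustration; a real
client would have to read the property declarations from the ontology
rather than hard-code them, and this only covers literal-valued properties.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]

# Assumption: the names ontology declares this property as functional.
FUNCTIONAL = {"fullScientificName"}


def functional_violations(
    triples: Iterable[Triple],
) -> Dict[Tuple[str, str], Set[str]]:
    """Map (subject, property) to its values wherever a functional
    property has more than one distinct literal value."""
    values: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate in FUNCTIONAL:
            values[(subject, predicate)].add(obj)
    return {key: vals for key, vals in values.items() if len(vals) > 1}


# The merged model from the example: A's assertion plus C's.
MERGED = [
    ("urn:lsid:A:names:1", "fullScientificName", "Salmo saler, Linnaeus, 1758"),
    ("urn:lsid:A:names:1", "fullScientificName", "Salmo salar, Linnaeus, 1758"),
]
```

Running functional_violations(MERGED) reports the conflict on
urn:lsid:A:names:1, but it cannot say which value to keep. That decision
still depends on whether C is trusted.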
How does the client decide to trust or distrust C? The FAN proposal
suggests that the client gets a semantic description of C and uses that
to make the decision. I think it more likely that the code that calls
the client on behalf of the user should decide whether or not to trust
C.
So to fix all this we could propose that the FAN-aware client has to be
triply smart. In addition to having to know how to deal with foreign
authorities it should provide 1.) some mechanism for getting information
about a foreign authority and deciding whether to trust it or not, 2.)
some mechanism for conditionally merging the foreign metadata it cares
about (proposed modifications for example) while discarding other
information (specimens), and 3.) the ability to know whether or not its
merge process violates the integrity of the metadata. This seems to be
quite a burden. I think it will add not only complexity to LSID
resolution, but also slow it down tremendously. LSID resolution is a
low-level data identification and access mechanism. Since so many other
services will build upon it, it ought to be fast.
That brings up the idea of attribution. In FAN, the attribution of an
assertion is not directly tied to the authority of the LSID that acts as
the subject of the assertion. Instead, since foreign authorities are
allowed to make statements directly about an LSID issued by someone
else, the resolution client has to deduce attribution of each assertion
during the merge process (using the foreign authority predicate as
evidence). After the merge process is complete, downstream processes
will have lost this attribution information.
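
A small sketch of how the attribution disappears: if the client merges
everything into plain triples, nothing in the result says who asserted
what. Keeping the source authority alongside each triple is one possible
way to preserve it. Both functions and the SOURCES table are hypothetical
illustrations, not part of FAN or of any LSID client.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Quad = Tuple[str, str, str, str]  # triple plus the authority that served it

# What each authority returned for urn:lsid:A:names:1 in the example.
SOURCES: Dict[str, List[Triple]] = {
    "A": [("urn:lsid:A:names:1", "fullScientificName",
           "Salmo saler, Linnaeus, 1758")],
    "C": [("urn:lsid:A:names:1", "fullScientificName",
           "Salmo salar, Linnaeus, 1758")],
}


def merge_plain(sources: Dict[str, List[Triple]]) -> Set[Triple]:
    """Merge as a FAN client implicitly would: the source is dropped."""
    merged: Set[Triple] = set()
    for triples in sources.values():
        merged.update(triples)
    return merged


def merge_attributed(sources: Dict[str, List[Triple]]) -> Set[Quad]:
    """Merge while recording which authority served each assertion."""
    return {
        (s, p, o, authority)
        for authority, triples in sources.items()
        for (s, p, o) in triples
    }
```

In the plain merge both assertions share the subject urn:lsid:A:names:1,
so a downstream process would naturally attribute both to A. The
attributed merge keeps C visible, but only if every consumer downstream
agrees to carry the extra field.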
Finally, FAN opens the door to spammers because attribution is not
directly tied to the authority name. If I trust A and I resolve the
name id above I could get back information from C. It could easily be
the case that I don't trust C (because I believe it to be a spammer) but
that my LSID resolution client doesn't know this fact. The end user of
the client may be using a FAN-enabled resolver or not and never know.
The FAN proposal suggests that clients be notified, but what is your
average piece of software that calls a resolution client on behalf of a
user to do with this information?
I propose we take an alternative approach. I like the notification
system in FAN, but I dislike the assumption about distributed objects
and indirect attribution. I would rather we agree on an annotation and
linking system that holds the following to be true:
1.) An authority is never allowed to make an assertion whose subject is
an LSID that it is not authoritative for.
2.) LSIDs authoritatively attribute an assertion to the organization
that owns the fully qualified authority domain (this includes
subdomains, to support the case of hosted authority services).
3.) Objects are not distributed across authorities but instead link to
other first-class objects that may exist within other authorities.
4.) Annotations, corrections, etc. are valuable, and as such should be
first class objects that are attributed to their issuers by the normal
process of 2 above. This is annotation as linking; the subject of the
annotation is the LSID of the annotation, it is attributed to the
authority of the issuer, and an agreed upon predicate is used to define
the object that acts as domain of an annotation (what it applies to).
This guarantees correct attribution of annotations throughout the
network.
5.) We cannot stop spammers from publishing malicious assertions but we
can validate a given assertion acquired from some source external to the
LSID system by resolving the LSID and checking the resulting metadata
against the assertion. Spammers will find it difficult to circumvent
this method because it requires hijacking the DNS system. If you trust
that a source gave you a complete and full copy of an object you don't
have to resolve its LSID.
6.) We will often have to use systems that are external to LSID in
order to find the data/metadata we're interested in (at the very least
to identify LSIDs to resolve). These other systems have to understand
how to attribute assertions to their owners and can do so using the
rules above without relying on metadata about a foreign authority.
7.) We don't have to assume (as FAN does) that an authority is
responsible for providing all information in the universe that relates
in any way to its LSIDs.
8.) If you get back invalid metadata from an authority (not incorrect,
but metadata that, for example, violates a functional property
restriction), you automatically know who is at fault (the organization
that owns the authority name).
9.) A FAN-style notification system could still be used for logging and
tracking the use of one's data, or for proposing corrections to an
issuer who might want to approve and incorporate those suggestions into
their published data.
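
To illustrate rules 1, 2, 4 and 5, here is a minimal sketch of annotation
as linking. C's correction becomes its own object with an LSID that C
issued, and attribution is read straight from the authority part of each
subject LSID. The predicate names annotates and
proposedFullScientificName, the annotation LSID, and the helper functions
are all hypothetical; the rules above only require that some agreed-upon
predicate exists. Subdomain handling from rule 2 is left out.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]
Quad = Tuple[str, str, str, str]


def authority_of(lsid: str) -> str:
    """Return the authority part of urn:lsid:<authority>:<namespace>:<object>."""
    return lsid.split(":")[2]


def foreign_subject_triples(authority: str, triples: List[Triple]) -> List[Triple]:
    """Rule 1: list triples whose subject the publishing authority did not issue."""
    return [t for t in triples if authority_of(t[0]) != authority]


def attribute(triples: List[Triple]) -> List[Quad]:
    """Rule 2: attribute each assertion to the authority of its subject LSID."""
    return [(s, p, o, authority_of(s)) for (s, p, o) in triples]


def is_confirmed(claim: Triple, resolved: List[Triple]) -> bool:
    """Rule 5: a claim obtained elsewhere is confirmed only if resolving
    its subject LSID returns the same assertion."""
    return claim in resolved


# Rule 4: C's correction as a first-class object that links to A's name.
C_ANNOTATION: List[Triple] = [
    ("urn:lsid:C:annotations:7", "annotates", "urn:lsid:A:names:1"),
    ("urn:lsid:C:annotations:7", "proposedFullScientificName",
     "Salmo salar, Linnaeus, 1758"),
]

# The same correction as FAN would allow C to publish it.
FAN_STYLE: List[Triple] = [
    ("urn:lsid:A:names:1", "fullScientificName", "Salmo salar, Linnaeus, 1758"),
]
```

Here foreign_subject_triples("C", C_ANNOTATION) is empty, while
foreign_subject_triples("C", FAN_STYLE) flags the assertion that C makes
about A's LSID. Every triple in C_ANNOTATION is attributed to C without
consulting any metadata about foreign authorities.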
So we might use a FAN-like system, but only for foreign annotation
notification, not foreign authority notification.
Any comments? Have I misunderstood the assumptions behind FAN?
-Steve