Re: [tdwg-guid] Need for citation information in GUID metadata

9 Nov 2007

Ryan,

My get out is that I don't know much about METS/MODS but I'll try and  
express why we are not *just* picking them up or any other XML-based
format. I hope this doesn't come across as a flame - I am just  
running over old arguments that I probably have said too often. I  
appreciate you didn't suggest we use METS but I think it needs  
justifying again.

We could use METS for digital objects, embed MODS for bibliographic  
stuff and make up our own schemas for each of our domains  
(entomology, botany, molecular phylogenetics, functional ecology, you
name it) and we would have integration of data at the application  
level but not at the semantic level. Effectively each of our domains  
would have its own XML silo and mixing stuff together would be a  
complete pain. The attraction of RDF is that it allows the mixing of  
concepts across domains so we only define things once at a very fine  
level and can be explicit about what we "mean".

I'll see if I can illustrate this in a naive way by picking one  
element from the example you give:

<mods:identifier displayLabel="Acquisition number"  
type="local">27309</mods:identifier>

Do different displayLabel attribute values affect the meaning (i.e.
where I put it in my database or calculation) of the value in the
element, or does the value in the element only mean "mods:identifier"
no matter what is in the attribute?
So if I put displayLabel="National Insurance Number" or
displayLabel="Barcode" my application may do something different with
27309. How do we do multiple languages for the displayLabel?

The URI for the QName mods:identifier (namespace plus local name) from
the document would be

http://www.loc.gov/mods/v3identifier

which doesn't resolve. There would normally be a slash or hash on the  
end of the namespace so that we would get

http://www.loc.gov/mods/v3/
http://www.loc.gov/mods/v3/identifier

but neither of these resolve to anything useful either.
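
To make the concatenation concrete, here is a minimal Python sketch using the standard library's ElementTree. It assumes the element from Ryan's example with the namespace declaration added so that it parses on its own; the variable names are mine and nothing here attempts to resolve the URIs.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"  # namespace URI that MODS documents declare

# The element from the example, with xmlns:mods added so it is parseable alone.
snippet = (
    '<mods:identifier xmlns:mods="' + MODS_NS + '" '
    'displayLabel="Acquisition number" type="local">27309</mods:identifier>'
)

element = ET.fromstring(snippet)

# ElementTree reports the expanded name in the form "{namespace}localname".
expanded_name = element.tag
namespace, local_name = expanded_name[1:].split("}")

# Gluing namespace and local name together gives the first URI above.
naive_uri = namespace + local_name

# A namespace ending in a slash would give a URI with a visible boundary.
slash_uri = namespace + "/" + local_name

print(expanded_name)
print(naive_uri)
print(slash_uri)
```

The parser is perfectly happy with the namespace as an opaque string; whether any of these URIs leads to a definition is a separate matter that XML itself does not address.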

All this may be in MODS documentation but only humans read  
documentation and then only rarely! Each time we come across a new  
XML standard some poor human has to go off and read all the PDFs  
involved before we can get started.

In a semantic-web-type world all the elements should resolve to their
definitions, and at that point we can define things like the
relationship of this concept to other things and some display labels
in different languages, etc. There is an outside chance that a
machine could do something "meaningful" with the information.
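
As a rough illustration of labels hanging off a concept rather than off each document, here is a small sketch in plain Python. The concept URI under example.org and the French and German labels are made up for illustration; only the rdfs:label URI is a real term. Statements are held as simple (subject, predicate, object) tuples, not in a real RDF library.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical concept URI; in practice it would resolve to its definition.
CONCEPT = "http://example.org/terms/acquisitionNumber"

# Each statement is (subject, predicate, (text, language tag)).
# The labels are illustrative values, not taken from any real vocabulary.
triples = [
    (CONCEPT, RDFS_LABEL, ("Acquisition number", "en")),
    (CONCEPT, RDFS_LABEL, ("Numéro d'acquisition", "fr")),
    (CONCEPT, RDFS_LABEL, ("Akzessionsnummer", "de")),
]


def label_for(concept, lang, default_lang="en"):
    """Return the label for a concept in the requested language, else the default."""
    labels = {
        lang_tag: text
        for subj, pred, (text, lang_tag) in triples
        if subj == concept and pred == RDFS_LABEL
    }
    return labels.get(lang, labels.get(default_lang))


print(label_for(CONCEPT, "fr"))
print(label_for(CONCEPT, "ja"))  # no Japanese label, so falls back to English
```

The data document then only needs to carry the concept URI and the value; the display label in any language is looked up, not repeated in every record.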

Really all XML brought the world is the ability to parse transfer
files easily. In the old days when things were space delimited one
would have to write a parser to get the documents into memory. Now we
can use a generic parser to get them into memory. But XML does not
tell us what to do once it is in memory. XML is just a serialization.
All the interesting problems are in what is serialized. This is why
we lean to RDF/OWL.
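
A small Python sketch of the same point, reusing the mods:identifier element discussed earlier. The generic parser hands back the same structure whatever the displayLabel says; the table that decides what to do with the value is something a person has to write. The column names and the rule table are hypothetical, invented for this example.

```python
import xml.etree.ElementTree as ET

NS_DECL = 'xmlns:mods="http://www.loc.gov/mods/v3"'


def parse_identifier(display_label):
    """Parse one mods:identifier element and return (label, type, value)."""
    xml = (
        f'<mods:identifier {NS_DECL} displayLabel="{display_label}" '
        f'type="local">27309</mods:identifier>'
    )
    element = ET.fromstring(xml)
    return element.get("displayLabel"), element.get("type"), element.text


# Hypothetical application rules: which database column each label goes to.
# Nothing in the XML supplies this mapping; someone read the documentation.
COLUMN_FOR_LABEL = {
    "Acquisition number": "acquisition_no",
    "Barcode": "barcode",
}


def column_for(display_label):
    """Decide where the value goes, based only on the human-readable label."""
    label, _id_type, _value = parse_identifier(display_label)
    return COLUMN_FOR_LABEL.get(label, "unknown")


for label in ("Acquisition number", "Barcode", "National Insurance Number"):
    print(label, "->", column_for(label))
```

Parsing succeeds identically in all three cases; only the hand-written table distinguishes them, and a label it has never seen falls through to "unknown".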

I hope this enlightens without putting you off. I expect/hope Bob  
will have a correction somewhere in what I have said :)

All the best,

Roger

On 8 Nov 2007, at 18:27, Ryan Scherle wrote:
...
Disclaimer: I don't fully understand all of the issues involved  
here, as I've only been looking at the biology standards for a few  
months. I may be misinterpreting some of the points being made.  
However, I have a good understanding of related standards in the  
library world, so I hope my comments may be of use.
In my opinion, if you try to put too many external semantics on DC  
data, you're going to run into many problems in the future, when  
you interact with groups that have "regular" DC data. It is  
possible to solve these issues with explicit metadata relationships  
using existing metadata standards. Here is the metadata for an  
image object in a repository I built recently:
http://fedora.dlib.indiana.edu:8080/fedora/get/iudl:20008/METADATA
At first glance it looks long and complex, but it's relatively easy  
to pick apart. The outer layer of metadata follows the METS schema,  
which is a wrapper format for collecting together different types  
of metadata.
The first inner object is a MODS record. MODS holds essentially the  
same type of information as DC, but it allows for more detailed  
descriptions. This record always describes the artifact in the  
image, not the image itself. And a big advantage of MODS is that it  
allows specification of the thumbnail URL that Greg was originally  
asking about (in the <mods:url access="preview">). Note: It is  
possible to include a DC representation as well as a MODS  
representation within a single METS document.
After the MODS record are MIX records containing detailed technical  
information about each of the image files.
Finally, the mets:fileSec and mets:structMap sections specify  
relationships between the metadata sections and the actual files.  
In this case, the hrefs are relative URLs, but they could easily be  
full URLs or LSIDs.
Now, I'm not advocating that you dump RDF in favor of METS. My main  
point is that explicitly separating the different types of metadata  
may be useful. If you would like more information about the  
specifics, let me know.
--- Ryan Scherle
--- Digital Data Repository Architect
--- NESCent
On Nov 6, 2007, at 8:48 PM, Ricardo Scachetti Pereira wrote:
...
Please see my comments in line below.
Bob Morris wrote:
...
The problem that nobody will take a position on is this:
Is the metadata on an image file, or on an image?
I don't want to dismiss this as a simple problem. We've been  
trying to knock it down for a long time now. However, I keep  
wondering why can't we just include information from both (image  
and image file) in the metadata by using different predicates in  
each case. See an example below.
...
Even---or especially if---you stick to DC, you have a problem about
what things are part of a description.  If the metadata is about the
file, then it is reasonable to express, e.g. that it has 1200x800
pixels, encoded as jpeg but perhaps not that it is a picture of a
flea biting a dog.  If the image is being described, the reverse might
hold.
Couldn't we say the following about an image?
<rdf:RDF>
   <tdwg:Image rdf:about="urn:lsid:example.com:image:1234">
      <dc:title>Picture of my dog Scratchy</dc:title>
      <dc:subject>A picture of a flea biting my dog.</dc:subject>
      <dc:description>A description of a flea biting my dog. You
get the idea, but an image is worth a thousand words...</dc:description>
      <dc:identifier>urn:lsid:example.com:image:1234</dc:identifier>
      <dc:format>image/jpeg</dc:format>
      <tdwg:imageDimensions>1200x800</tdwg:imageDimensions>
   </tdwg:Image>
</rdf:RDF>
Even though I bet the RDF isn't valid, I hope you get the point  
that each predicate refers to either the file or the image, but  
not both.
If some of these predicates aren't suitable, we can always use  
some other vocabularies (EXIF?). If you want to refer to what's in  
the picture, we can somehow point to our familiar biodiversity  
information objects: taxon name, observation, specimen, etc.
Is there a case where this can't be done?
...
... rendering clients probably
desperately need the pixel size and also information about where to
find other sizes of the "same" image.
That's a different problem. We had agreed that LSIDs can't be used  
if the number of representations of an image is infinite or just  
very large. Should we be looking at OpenURL or just Web services  
(and WSDL)?? But that's a little advanced for our simple  
discussion thread, isn't it?
So, is this a feasible solution, or is there a class of counter  
examples that I'm missing completely?
Cheers,
Ricardo
_______________________________________________
tdwg-guid mailing list
tdwg-guid@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-guid

Roger Hyam