versioning for data versus metadata

12 Oct 2005

In my previous email I referred to problems in distinguishing data from
metadata.  I'd like to elaborate here.

Traditional definitions of metadata are 'data about data' or 'data
documentation'.  These are good definitions from a pragmatic standpoint,
but they become less helpful when trying to build real working systems
that use both data and metadata and try to preserve replicability of
analyses through versioning.  A simple example will illustrate.

Sometimes people record information that repeats across data records as
separate metadata (for example, the date on which data were collected).
Other times, they include that information directly in their data model.

Take two entities, A and B:

Entity A:
---------
Metadata:
AttributeLabels = Site,Date,Abundance

Data:
Foo  20041010  19
Bar  20050712  20
Foo  20051010  20

Entity B:
-----------
Metadata:
AttributeLabels = Site,Abundance
CollectionDate = 20011002

Data:
Foo  24.3
Bar  21.3
Baz  20.4

Note that both entities contain the same kinds of information (site,
collection date, and abundance), but the second records the date of
collection as a metadata property, while the first puts it in the data
model.  If one were to integrate these data entities to produce a
time-series plot of abundance by site, one would need to extract the
CollectionDate information from the metadata of entity B before
proceeding.  Thus, for the purpose of an integrated analysis,
"CollectionDate" is really data.
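
To make the integration step concrete, here is a minimal Python sketch.
The dict layout and the function name to_rows are my own assumptions for
illustration, not part of any existing system; the values are copied
from the two entities above.

```python
# Entities transcribed from the example above; the dict layout is an assumption.
entity_a = {
    "metadata": {"AttributeLabels": ["Site", "Date", "Abundance"]},
    "data": [("Foo", "20041010", 19), ("Bar", "20050712", 20), ("Foo", "20051010", 20)],
}
entity_b = {
    "metadata": {"AttributeLabels": ["Site", "Abundance"], "CollectionDate": "20011002"},
    "data": [("Foo", 24.3), ("Bar", 21.3), ("Baz", 20.4)],
}


def to_rows(entity):
    """Return (site, date, abundance) tuples for one entity.

    The date comes from the data when a Date column exists; otherwise it
    is pulled out of the CollectionDate metadata property.
    """
    meta = entity["metadata"]
    labels = meta["AttributeLabels"]
    rows = []
    for record in entity["data"]:
        fields = dict(zip(labels, record))
        date = fields.get("Date", meta.get("CollectionDate"))
        rows.append((fields["Site"], date, fields["Abundance"]))
    return rows


# One combined series, ordered by site and then by date.
series = sorted(to_rows(entity_a) + to_rows(entity_b), key=lambda r: (r[0], r[1]))
```

Once the rows are combined, the date from entity B's metadata is
indistinguishable from the dates that came from entity A's data column.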

Which way people model the information is somewhat arbitrary, but it
typically comes down to looking at 1) the rate of change of the
information across tuples, and 2) the intended use of the final data.
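
As a rough illustration of the first criterion, the sketch below finds
columns whose value never changes across tuples, which are the ones
people tend to move into metadata.  The function name constant_columns
and the layout of entity B with the date repeated in every tuple are
hypothetical.

```python
def constant_columns(labels, rows):
    """Return {label: value} for every column whose value is identical in all rows."""
    constants = {}
    for index, label in enumerate(labels):
        values = {row[index] for row in rows}
        if len(values) == 1:
            constants[label] = values.pop()
    return constants


# Entity B with the date written into every tuple (a hypothetical layout).
labels = ["Site", "Date", "Abundance"]
rows = [
    ("Foo", "20011002", 24.3),
    ("Bar", "20011002", 21.3),
    ("Baz", "20011002", 20.4),
]
hoistable = constant_columns(labels, rows)
```

Here only Date is constant, so it is the only candidate for becoming a
metadata property such as CollectionDate.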

So, to bring this back to the identifier issue: if one were to assign
an identifier to Entity A and another to Entity B, resolving each
identifier should allow one to retrieve the data.  But in these two
cases the data returned will have different schemas and different
dependencies on the metadata.  To do integrated analyses of the two
entities, one really needs to be able to use the metadata and the data
together and be assured that the two are consistent.

LSIDs require that the data retrieved from an LSID never change, in
order to guarantee replicability and persistence, but they allow the
metadata to change.  Clearly, if the 'CollectionDate' metadata for
entity B were changed, any analyses that were performed using the
original metadata could no longer be replicated.  This causes a lot of
trouble for analytical systems that emphasize provenance and lineage
for derived data products.  There is therefore a strong case that
metadata should be versioned as well, and that both the metadata and
data associated with an identifier should be immutable with respect to
that identifier.  Any change to data or metadata should require a new
revision of the identifier.
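
A minimal sketch of what that could look like, in Python.  This is not
how LSID resolution works; the Registry class, the name "entityB", and
the integer revision scheme are all assumptions made up for
illustration.  The point is only that the fingerprint covers the
metadata as well as the data, so a change to either forces a new
revision while old revisions stay resolvable.

```python
import copy
import hashlib
import json


def fingerprint(data, metadata):
    """Digest over data AND metadata, so a change to either is detectable."""
    payload = json.dumps({"data": data, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Registry:
    """Hypothetical in-memory store: each (name, revision) maps to one snapshot."""

    def __init__(self):
        self._snapshots = {}  # (name, revision) -> (data, metadata, digest)
        self._latest = {}     # name -> latest revision number

    def publish(self, name, data, metadata):
        """Store a snapshot and return its revision; bump only if something changed."""
        digest = fingerprint(data, metadata)
        rev = self._latest.get(name, 0)
        if rev and self._snapshots[(name, rev)][2] == digest:
            return rev  # nothing changed; keep the existing revision
        rev += 1
        # Copy on the way in so later edits by the caller cannot alter the snapshot.
        self._snapshots[(name, rev)] = (
            copy.deepcopy(data), copy.deepcopy(metadata), digest)
        self._latest[name] = rev
        return rev

    def resolve(self, name, revision):
        """Return copies of the data and metadata stored for one revision."""
        data, metadata, _ = self._snapshots[(name, revision)]
        return copy.deepcopy(data), copy.deepcopy(metadata)


registry = Registry()
data_b = [["Foo", 24.3], ["Bar", 21.3], ["Baz", 20.4]]
meta_b = {"AttributeLabels": ["Site", "Abundance"], "CollectionDate": "20011002"}

rev1 = registry.publish("entityB", data_b, meta_b)
meta_b["CollectionDate"] = "20011003"  # hypothetical metadata-only correction
rev2 = registry.publish("entityB", data_b, meta_b)
```

An analysis that recorded ("entityB", rev1) can still retrieve the
original CollectionDate, even though only the metadata was edited.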

Matt
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Matt Jones
jones@nceas.ucsb.edu                         Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
UC Santa Barbara     http://www.nceas.ucsb.edu/ecoinformatics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~