versioning for data versus metadata
In my previous email I referred to problems in distinguishing data and metadata. I'd like to elaborate here.
Traditional definitions of metadata are 'data about data' or 'data documentation'. These are good definitions from a pragmatic standpoint, but they become less helpful when one tries to build real working systems that use both data and metadata and that try to preserve the replicability of analyses through versioning. A simple example will illustrate.
Sometimes people record repeating information about data as separate metadata (for example, the date on which data were collected). Other times, they might include that information directly in their data model.
Take two entities, A and B:
Entity A:
---------
Metadata:
  AttributeLabels = Site,Date,Abundance
Data:
  Foo 20041010 19
  Bar 20050712 20
  Foo 20051010 20

Entity B:
-----------
Metadata:
  AttributeLabels = Site,Abundance
  CollectionDate = 20011002
Data:
  Foo 24.3
  Bar 21.3
  Baz 20.4
Note that both entities contain the same kinds of information (site, collection date, and abundance), but the second places the date of collection in a metadata property, while the first puts it in the data model. If one were to integrate these data entities to produce a time-series plot of abundance by site, one would need to extract the CollectionDate information from the metadata of entity B before proceeding. Thus, for the purpose of an integrated analysis, "CollectionDate" is really data.
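The integration step described above can be sketched roughly as follows. This is only an illustration: the dictionary layout and the to_records helper are hypothetical, and the only values taken from the example are the attribute labels, the CollectionDate, and the data rows.

```python
# Hypothetical in-memory form of the two entities from the example above.
entity_a = {
    "metadata": {"AttributeLabels": ["Site", "Date", "Abundance"]},
    "data": [("Foo", "20041010", 19), ("Bar", "20050712", 20), ("Foo", "20051010", 20)],
}
entity_b = {
    "metadata": {"AttributeLabels": ["Site", "Abundance"], "CollectionDate": "20011002"},
    "data": [("Foo", 24.3), ("Bar", 21.3), ("Baz", 20.4)],
}


def to_records(entity):
    """Return (site, date, abundance) tuples for one entity.

    If the rows carry no Date column, the date is taken from the
    CollectionDate metadata property instead (hypothetical convention).
    """
    meta = entity["metadata"]
    labels = meta["AttributeLabels"]
    records = []
    for row in entity["data"]:
        fields = dict(zip(labels, row))
        date = fields.get("Date", meta.get("CollectionDate"))
        if date is None:
            raise ValueError("no date in either the data or the metadata")
        records.append((fields["Site"], date, fields["Abundance"]))
    return records


# One time series, ordered by site and then by date (YYYYMMDD sorts as text).
combined = sorted(to_records(entity_a) + to_records(entity_b),
                  key=lambda r: (r[0], r[1]))
```

Every record derived from entity B depends on a metadata value, so a later edit to CollectionDate would silently change the integrated result.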
How people model the information is a somewhat arbitrary decision, but it typically comes down to looking at 1) the rate of change of the information across tuples, and 2) the intended use of the final data.
So, to bring this back to the identifier issue: if one were to assign an identifier to Entity A and another to Entity B, resolving each identifier should allow one to retrieve the data. But in these two cases the data that are returned will have different schemas and different dependencies on the metadata. To do integrated analyses of the two entities, one really needs to be able to use the metadata and the data together and be assured that both are consistent.
LSIDs require that the data retrieved from an LSID never change, to guarantee replicability and persistence, but allow the metadata to change. Clearly, if the 'CollectionDate' metadata for entity B were to be changed, any analyses that were performed using the original metadata could no longer be replicated. This causes a lot of trouble for analytical systems that emphasize provenance and lineage for derived data products. It indicates that there is a strong case to be made that metadata should be versioned as well, and that both the metadata and the data associated with an identifier really should be immutable with respect to the identifier. Any change to data or metadata should require an update to the identifier revision.
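A minimal sketch of that proposal, assuming a toy in-memory registry: the Registry class, the fingerprint helper, and the example.org identifier are all hypothetical, and this is not how any actual LSID resolver works. The revision is appended as a trailing component of the identifier, and a new revision is minted whenever either the data or the metadata differ from the previous revision.

```python
import copy
import hashlib
import json


def fingerprint(data, metadata):
    """Hash data and metadata together so a change to either is detected."""
    payload = json.dumps({"data": data, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Registry:
    """Toy registry: each revision of an identifier is stored once and never altered."""

    def __init__(self):
        self._revisions = {}  # base identifier -> list of (fingerprint, data, metadata)

    def publish(self, base_id, data, metadata):
        revs = self._revisions.setdefault(base_id, [])
        fp = fingerprint(data, metadata)
        if not revs or revs[-1][0] != fp:
            # Data or metadata changed: store a private copy as a new revision.
            revs.append((fp, copy.deepcopy(data), copy.deepcopy(metadata)))
        return f"{base_id}:{len(revs)}"

    def resolve(self, identifier):
        base_id, _, rev = identifier.rpartition(":")
        _, data, metadata = self._revisions[base_id][int(rev) - 1]
        # Hand back copies so callers cannot alter the stored revision.
        return copy.deepcopy(data), copy.deepcopy(metadata)


registry = Registry()
base = "urn:lsid:example.org:datasets:entityB"  # hypothetical identifier
data = [["Foo", 24.3], ["Bar", 21.3], ["Baz", 20.4]]
meta = {"AttributeLabels": ["Site", "Abundance"], "CollectionDate": "20011002"}

v1 = registry.publish(base, data, meta)
# Changing only the metadata still yields a new revision.
v2 = registry.publish(base, data, dict(meta, CollectionDate="20011003"))
```

An analysis that recorded v1 can still resolve exactly the data and metadata it used, even after the CollectionDate was corrected in v2.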
Matt

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Matt Jones
jones@nceas.ucsb.edu
Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
UC Santa Barbara
http://www.nceas.ucsb.edu/ecoinformatics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~