Thursday, October 10, 2013

Metadata – a problem that doesn’t go away

Last month I came across yet another case study that exposed the spatial metadata standard as the primary cause of problems with a delivered data archiving, cataloguing and dissemination system. It is another statistic on a long list of failures caused by reliance on a flawed concept. I am an avid critic of the current metadata standard for spatial data and have written extensively about the reasons in my earlier posts, so I am not going to repeat those arguments here - you can find links to those posts at the end of this article. Criticism, however constructive, only helps to expose the problems, not to solve them. Therefore today I would like to share with you my thoughts on a better, more pragmatic approach to creating metadata for spatial information.

# The history

First, briefly, the anatomy of the problem (maybe a bit overdramatised, but with the good intention of highlighting where things went wrong). I vividly remember a presentation on spatial data interoperability that I attended about a decade ago. One of the key messages was that metadata is “a glue that will allow it all to happen”. No problem with that, but when I asked a few questions about practical implementation, I was quickly hushed. One of the presenters did approach me after the event and admitted that they were aware of the potential problems but did not want to alienate the audience by bringing those issues to the forefront at that point in time. As he put it, there were enough benefits in the proposed approach to warrant overlooking potential issues in order to get the maximum buy-in from key stakeholders. This spin approach succeeded. In the years that followed you only got to hear about “good things”, and quite a number of people made a career out of selling the “metadata success story” from one continent to the other. And the myth of “just follow the standard and everything will be ok” was perpetuated… never mind the truth.

# Background on metadata and standards

Let me set the record straight - I do not dismiss the need for metadata. On the contrary – metadata is a very, very important aspect of any data creation, maintenance and dissemination activity. But I question the usefulness of current metadata standards to support those tasks adequately.

A “standard” is just a set of conventions that a community of practice agrees to follow. It could be designed by a committee (pun intended) or arrived at by wide acceptance of a common practice (i.e. when “someone’s way” of doing things is liked and followed by others). But standards should never be treated as a formula for success… Unfortunately, this is how the ISO 19115 Geographic Information - Metadata standard has been sold to the GIS community…

At a theoretical level, the concept of metadata is very simple. It is just “data about the data”… But this is where the simplicity ends and the complexity begins, because you quickly discover that metadata is useful not only for spatial data but also for spatial information in any format (i.e. printed as well as electronic, from a single point to compilations of hundreds of layers, vectors as well as rasters and grids… and point clouds… one-off or dynamically generated on the fly… and whatever else you want to class as “spatial”).

# Why metadata implementation projects fail

A typical metadata implementation project goes like this - an organisation accumulates more and more data, to the extent that the IT department struggles to manage it. So, a project is initiated to “catalogue what we have”. And in order to catalogue systematically what you have, you need to describe it in a consistent way - you need metadata. The obvious first step is to check whether there are any standards that will help to deal with the issue, rather than reinventing the wheel.

It does not take much effort to find the documentation of the metadata standard for spatial data. So, you start reading and quickly realise that you cannot make any sense of the gobbledygook of the official documentation. A thought inevitably crosses your mind - “I need an expert!”. And of course you look for… “the expert in the ISO 19115 metadata standard”. In the end, you get what you asked for - an expert in the standard, not an expert in solving your kind of problem. That expert can give you no other advice than to “follow the standard and you will be right”.

The expert usually brings a set of recommended tools (which of course are built around the ISO metadata standard) and you also get help in implementing your catalogue/system “by the book”. All good, great success… Until you realise that the classic “garbage in / garbage out” principle applies here as well… what a surprise!

You see, the failure is built into the solution (it is the metadata standard!) so, it is very rare that this approach delivers. You don’t believe me? Just talk to those who have to use the information contained in metadata records (not those who implement metadata standards and build systems!)…  

Ok, to be fair, there is one exception where you can achieve an acceptable outcome (with the caveat that it depends on how you define “acceptable”). It will happen only if you rule with an iron fist over what information goes into your metadata. However, it can easily get out of hand if you have a lot of data to deal with (so you take shortcuts to process it all in bulk), a lot of different people writing metadata content (so there are different views of what is important and needs to be recorded - because the standard allows this), and/or a lot of different types of data or data that grows rapidly (lack of consistency of information or the sheer volume of data makes it impossible to record all the useful details). And let me stress this again: there is no guarantee that the content of your metadata will be of any use outside of your organisation or immediate community of interest, because it may lack the information perceived by others as vital (see my earlier posts for more explanation).

If this all sounds too melodramatic and over-generalised, I do not apologise for it. My intention is to shake readers’ belief in the infallibility of the “follow the standard” mantra. Enough money has been wasted on this so far. Do not trust “the experts” in a flawed concept any more - they do not know better than you do. Applying common sense to create your metadata will yield better outcomes than “following the standard” can ever deliver.

# A better approach - for better outcome

Now that you understand why “following the standard” is not a recipe for creating useful metadata records, let’s review how to make it all work for you.

First and foremost, you have to define precisely WHY you need the metadata in the first place. The information you need to capture will differ depending on whether the metadata is for internal use in data production and maintenance tasks or just to make the data easily discoverable by others.

For example, if you are building “just a catalogue”, why bother with the complexity of the ISO standard? The majority of users of your data simply want to know a few basic things about it:
  • what is it (basic description – including list of features for compilation products, spatial accuracy, geographic extents, when created/time reference and version if more than one created),
  • where to get it from and how (online or shopfront, order hardcopy or download electronic format),
  • how much it costs (free or paid - how much!),
  • how to access it (ie. access constraint, e.g. “none” or “login” or “restricted” plus relevant classification level) and what can be done with it (eg. “internal use only”, or “non commercial use only” or “republish with attribution”, etc.) – important to separate the two to make it clear!

As simple as that – nothing more and nothing less. Capture this information in a succinct metadata record and you have already done better than “following the standard” - even if this is only a paragraph of free text. And if you add a consistent structure to record that information for all your data, you will achieve more than any expert in the ISO 19115 standard can ever do for you.
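
As a minimal sketch of such a consistent structure, the basic items listed above could be held in one small record type. The field names, the sample values and the `missing_fields` helper are my own illustrative assumptions, not taken from any standard or tool:

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class DiscoveryRecord:
    """One catalogue entry. Field names are illustrative, not from any standard."""
    title: str
    description: str         # what it is, incl. features in a compilation
    extent: tuple            # (west, south, east, north) in decimal degrees
    time_reference: str      # when created / period covered
    version: str
    source: str              # where to get it from and how
    cost: str                # "free" or the actual price
    access_constraint: str   # how to access it: "none", "login", "restricted"
    use_constraint: str      # what can be done with it
    spatial_accuracy: Optional[str] = None


def missing_fields(record: DiscoveryRecord) -> list:
    """Return the names of required fields that were left empty."""
    return [name for name, value in asdict(record).items()
            if name != "spatial_accuracy" and value in ("", None, ())]


# Hypothetical sample record with the cost deliberately left blank.
record = DiscoveryRecord(
    title="Road network (sample)",
    description="Major and minor roads, compiled from survey data",
    extent=(112.9, -43.7, 153.6, -9.1),
    time_reference="2013-06",
    version="2.1",
    source="download from the publisher's website",
    cost="",
    access_constraint="login",
    use_constraint="republish with attribution",
)
print(missing_fields(record))
```

Note that access and use constraints are kept as two separate fields, so the two are never confused, and that a blank cost is reported as a gap rather than silently accepted.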

If you are thinking that “most of these items are specified in the metadata standard anyway”, have a look at the documentation in detail. The key point is that either these vital information items are optional categories (so they may or may not be included in a metadata record created “by the book”), or the choice of options in the mandatory categories does not allow anything meaningful to be recorded.

# Divide to conquer – metadata hierarchy

The key issue I have with the ISO 19115 metadata standard is that it tries to be all things to all people. The result is that it is too specific for most cataloguing purposes, yet not detailed enough for capturing the really important details about the data for reuse, production and maintenance purposes. It also tries to be applicable to any spatial data, which compounds its uselessness by bringing it all down to a “least common denominator”. In reality, you have to capture and store different information for different data types and for different purposes, depending on the intended use of that information. Therefore, in a complex production environment you will need:

  • Metadata describing source inputs (ie. to define lineage of your data);
  • Metadata for production datasets (which describes interim data versions at various stages of the process of transformation of inputs/source data into finished product);
  • Metadata for all output formats of the finished product (format conversion will inevitably alter the data in some way, so it needs to be documented that “what you see”, eg. on a slippy map demonstrating the data, is not the same as what you get in a data file in format x, which in turn differs from what you get in format y; this is due to generalisations and other inherent alterations of the original inputs in the process of spatial conversion);
  • Metadata for discovery (ie. solely for cataloguing purposes).

This or a similar metadata hierarchy should be adopted to capture relevant information as data progresses through the various stages of the production process. In an ideal environment you should maintain all that information and make it available to the end users of your data, because it describes the product from end to end. Also, if you are engaged in a continuous production process, and that process changes over time, it is important to preserve all the relevant information for future perusal. However, for many this will be overkill, as all they really need is the last level, metadata for discovery, and in a very simplified format.
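
One way to picture the four levels is as records that each point back to the records they were derived from. The sketch below is only an illustration under my own assumptions: the level names, the `derived_from` links and the `lineage` helper are hypothetical and not part of any standard:

```python
from dataclasses import dataclass, field

# The four levels described above; the short names are illustrative.
LEVELS = ("source", "production", "output", "discovery")


@dataclass
class MetadataRecord:
    record_id: str
    level: str                                        # one of LEVELS
    details: dict = field(default_factory=dict)
    derived_from: list = field(default_factory=list)  # ids of upstream records

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"unknown metadata level: {self.level}")


def lineage(record_id: str, catalogue: dict) -> list:
    """Follow upstream links from a record back to its source inputs."""
    seen, stack, order = set(), [record_id], []
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        stack.extend(catalogue[current].derived_from)
    return order


# Hypothetical records, one per level.
records = [
    MetadataRecord("src-survey", "source", {"acquired": "2012-03"}),
    MetadataRecord("prod-v2", "production", {"step": "generalised"}, ["src-survey"]),
    MetadataRecord("out-tiles", "output", {"format": "map tiles"}, ["prod-v2"]),
    MetadataRecord("disc-roads", "discovery", {"title": "Road network"}, ["out-tiles"]),
]
catalogue = {r.record_id: r for r in records}
print(lineage("disc-roads", catalogue))
```

An organisation that only needs a catalogue would keep just the discovery records; the links only matter once the lower levels are maintained as well.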

# Metadata granularity

In an ideal world there should be metadata about every single piece of information you use or manage. In the case of spatial data - about every single point, or line segment, or network node, or grid cell, as well as about their respective attributes. The hierarchy of metadata documentation outlined in the section above allows you to manage the granularity of the information you maintain. So, if you do need point/cell or segment metadata, you can maintain that information in a lower level metadata construct (eg. production level metadata), while more generic information about your data can be captured in higher level metadata. For example, information about the source and acquisition date of a road segment may be stored in production level metadata, and your data licensing details in discovery level metadata. And one is linked to the other via the hierarchy structure.

As you can see, this approach is flexible enough to allow storing relevant information about online applications and whole data collections with hundreds of layers, but also about individual data layers, individual features within those layers, and down to the smallest spatial data construct – a point or grid cell in space, if you need it. You will never be able to achieve this level of granularity with ISO 19115 standard metadata.
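
The linkage between levels can be sketched as a simple lookup that starts at the most detailed record and walks up until it finds an answer. The `MetadataNode` class and all the sample names and values below are hypothetical, shown only to illustrate the idea:

```python
from typing import Any, Optional


class MetadataNode:
    """A node in a metadata hierarchy: collection, layer, feature or point."""

    def __init__(self, name: str, parent: Optional["MetadataNode"] = None,
                 **fields: Any):
        self.name = name
        self.parent = parent
        self.fields = fields

    def lookup(self, key: str) -> Any:
        """Return the most specific value for key, walking up to the root."""
        node = self
        while node is not None:
            if key in node.fields:
                return node.fields[key]
            node = node.parent
        raise KeyError(key)


# Licensing lives at the top; source and acquisition date live lower down.
collection = MetadataNode("roads-collection",
                          licence="republish with attribution")
layer = MetadataNode("major-roads", parent=collection, source="state survey")
segment = MetadataNode("segment-1042", parent=layer,
                       source="GPS trace", acquired="2012-03-14")

print(segment.lookup("source"))   # value recorded on the segment itself
print(segment.lookup("licence"))  # value inherited from the collection
```

The design choice here is that detail is recorded only where it differs: a segment carries its own source and date, while everything it does not override is answered by the layer or collection above it.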

# A word about naming your data/spatial information

There are no universally accepted and followed conventions for naming spatial data. Generally, creators try to give descriptive names and pack as much extra information into the title or file name as possible, so that humans can quickly ascertain what the data is about by looking at just the file name or title. Data which is disseminated in “chunks”, like satellite imagery scenes or various grid structures, usually incorporates basic metadata in its names - like satellite name, sensor, time of acquisition, resolution and grid/path references. This approach is handy if you need to interact with a small number of data files manually, but it is a totally unnecessary complication if you have thousands of files. Your metadata should be a window to all your data, and there should be no need to interact with the data via a convoluted naming convention. This is where the ISO 19115 metadata concept falls short again: because it is inadequate for complex data filtering purposes, you have no choice but to interact with the data manually, based on file names, rather than via purpose-built data query tools. That kind of innovation has not been possible to date.

For all practical purposes I suggest sticking to a minimum when naming your files. That is, give your file a descriptive name and a version/date id to make it unique. Information about everything else relating to your data should be in a proper metadata file.
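
To illustrate, here is a rough sketch of that split: the file name carries only a descriptive title and a date id, while sensor, resolution and the like sit in metadata that can be queried. The `minimal_file_name` and `select` helpers, the field names and the sample values are all assumptions made up for this example:

```python
import re
from datetime import date


def minimal_file_name(title: str, version: date, extension: str) -> str:
    """Build a file name from a descriptive title and a date id, nothing else."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}_{version:%Y%m%d}.{extension}"


def select(records: list, **criteria) -> list:
    """Return file names whose metadata matches every criterion exactly."""
    return [r["file"] for r in records
            if all(r.get(key) == value for key, value in criteria.items())]


# Everything beyond title and date is stored as metadata, not in the name.
records = [
    {"file": minimal_file_name("Coastal scene", date(2013, 5, 2), "tif"),
     "sensor": "sensor-a", "resolution_m": 30},
    {"file": minimal_file_name("Coastal scene", date(2013, 9, 18), "tif"),
     "sensor": "sensor-b", "resolution_m": 10},
]

print(select(records, resolution_m=10))
```

With thousands of files, the same `select` call keeps working unchanged, whereas decoding sensor and resolution from file names would need a parser for every naming convention in use.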

# What about ISO 19115 then?

You may still need to publish metadata in the ISO 19115 format, for example to deal with the limitations of cataloguing tools or to accommodate the requirements of some of your less sophisticated clients. If you design your metadata content correctly, it will take your programmer just a few minutes to map that content to the mandatory ISO categories and to turn your metadata into an “ISO 19115 compliant” XML structure. The key point is to treat the ISO metadata standard as an output format, one of many formats that may be required by different users, and not as a foundation for creating the metadata in the first place.
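
As a hedged sketch of what “ISO as an output format” could look like, the snippet below maps two fields of an in-house record to XML. The element names and namespaces follow my understanding of the ISO 19139 XML encoding of ISO 19115; the fragment is deliberately partial, has not been validated against the schema, and the input field names are hypothetical:

```python
import xml.etree.ElementTree as ET

# Namespaces of the ISO 19139 XML encoding (as I understand them).
GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
ET.register_namespace("gmd", GMD)
ET.register_namespace("gco", GCO)


def _text_element(parent: ET.Element, tag: str, text: str) -> ET.Element:
    """Add a gmd element wrapping a gco:CharacterString under parent."""
    wrapper = ET.SubElement(parent, f"{{{GMD}}}{tag}")
    ET.SubElement(wrapper, f"{{{GCO}}}CharacterString").text = text
    return wrapper


def to_iso_fragment(record: dict) -> str:
    """Map an in-house record to a partial, unvalidated ISO-style fragment."""
    root = ET.Element(f"{{{GMD}}}MD_Metadata")
    info = ET.SubElement(root, f"{{{GMD}}}identificationInfo")
    ident = ET.SubElement(info, f"{{{GMD}}}MD_DataIdentification")
    citation_wrapper = ET.SubElement(ident, f"{{{GMD}}}citation")
    citation = ET.SubElement(citation_wrapper, f"{{{GMD}}}CI_Citation")
    _text_element(citation, "title", record["title"])
    _text_element(ident, "abstract", record["description"])
    return ET.tostring(root, encoding="unicode")


xml_text = to_iso_fragment({
    "title": "Road network (sample)",
    "description": "Illustrative record only",
})
print(xml_text)
```

A real export would need the remaining mandatory elements (contact, date stamp and so on) and validation against the schema, but the mapping stays a thin layer on top of metadata that was designed for your own purposes first.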

# Closing remarks

If you are wondering whether any of the above makes sense and why nobody else is raising the issue… Well, please consider this: those active in the OGC and spatial standards arena have quietly recognised the problem. There are already a number of initiatives under way to develop more metadata standards for specific data formats (like Metadata for Observations and Measurements) and for “easier discovery” of spatial information (eg. the Earth Observation extension of the OpenSearch specification)… But true, no one has yet publicly admitted that the old approach failed and that the spatial community should have another go at solving the metadata problem - in a holistic rather than piecemeal way.

The current approach of introducing more and more standards is a lost cause, as it is just an attempt to patch things up rather than to address the issue properly. Dividing and splitting the problem without acknowledging that it exists in the first place will only lead to more problems down the track - it will result in more chaos and confusion for the end users. These new initiatives are not about creating a hierarchy of metadata standards - just about more standards. If it was so difficult to successfully implement one standard, just imagine the trouble of trying to deal with 3 or more! The obvious question will be: “Which one do I use???” If you choose the OpenSearch approach, chances are that your data cannot be catalogued, because traditional spatial cataloguing tools require the ISO 19115 structure. And if your data formats happen to be something other than “observations and measurements” or "image/grid", you may be waiting another decade for a proper standard to be published…

Persisting with the current approach to solving the metadata problem will not succeed. As with the Gordian Knot, there is only one way to solve this problem quickly… cut your losses and start afresh. This is what I am doing. I will share my further thoughts and experiences in the not too distant future.

Related Posts:
Why standards should be ignored 
GIS metadata standard deficiencies 
GIS standards dilemma
Ed Parsons on Spatial Data Infrastructure
Data overload makes SDI obsolete
