Thursday, March 21, 2013

GIS metadata standard deficiencies

My recent post on the GIS standards dilemma generated quite a bit of interest so, as a follow-up, today I am publishing a post that explains my position in more detail and illustrates the deficiencies of one of the standards with concrete examples.

The conception of a GIS metadata standard was a long-awaited breakthrough. It raised the hopes of the entire spatial community that it would finally be possible to describe, in a consistent way, not only the vast amount of geographic information created over the years but also the new data generated on a daily basis. The expected benefits of the standard were far-reaching because it would allow consistent cataloguing, discovery and sharing of all that information. In other words, if successfully implemented, it would deliver a great economic benefit for all. The concept of Spatial Data Infrastructure (SDI) was born… That was more than a decade ago.

Fast forward to 2013. The creation of an SDI has been a holy grail of the GIS community for quite a long time so why, despite all the good intentions and millions of dollars poured into various initiatives, do we still not have one in Australia? Why don’t we even have in place the first element of that infrastructure – a single catalogue of spatial data? In my opinion the answer is simple – because we are trying to act on a flawed concept.

At the core of the problem is a conceptual flaw in the underlying metadata standard that makes it impossible to successfully implement any nationwide or international SDI. In other words, the SDI concept will never work beyond a closely controlled community of interest, with a “dictatorship-like” implementation of rules that go far beyond the loosely defined standards. Until that flaw is widely acknowledged we cannot move forward. Any attempt to build a national SDI, or even a simple catalogue, based on the flawed ISO 19115 standard is bound to fail and is a total waste of money. The reason why follows...

For years many were led to believe “follow the standard and everything will take care of itself”. But a reality check paints a totally different picture. For a start, it took years to formalise the Australian profile of the ISO 19115 standard. Then everybody started working on their own extensions because it turned out to be quite hard to implement the standard in a meaningful way for all data types, as well as for historical data for which many details are missing. But the true nature of the problem lies somewhere else...

You see, the standard prescribes the structure of the metadata record, that is, what information should be included, but to a large degree it does not mandate the content. The result is a “free text”-like entry for almost everything that is included in a metadata record. Just to illustrate, access constraints are specified as “legal” and “use” related, and both are limited to the following categories: “copyright, patent, patentPending, trademark, license, intellectualPropertyRights, restricted, otherRestrictions”. But the information is optional, so that metadata element may also be empty. Now consider the case of a user who tries to find free data… impossible.
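
To make the “free data” problem concrete, here is a minimal Python sketch. The record layout, the dataset titles, the free-text wording and the `find_free_data` helper are all made up for illustration; they are not part of ISO 19115 or of any catalogue software. Only the restriction codes used come from the list above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, heavily simplified metadata records (not a real ISO 19115 schema).
@dataclass
class Record:
    title: str
    access_constraint: Optional[str]  # a restriction code, or None if left empty
    other_constraints: str = ""       # free text, worded however the author liked

RECORDS = [
    Record("Roads A", "license", "Creative Commons Attribution"),
    Record("Roads B", "copyright", "No fee for non-commercial use"),
    Record("Roads C", None),  # optional element left empty
    Record("Roads D", "restricted", "Contact the custodian"),
]

def find_free_data(records):
    """Return records that can be shown to be free from the coded field alone."""
    # Assumption: no code in the list expresses "free", and an empty element
    # only means "unknown", so there is nothing reliable to match on.
    free_codes = set()
    return [r for r in records if r.access_constraint in free_codes]

print(find_free_data(RECORDS))  # → []
```

In this made-up collection two of the datasets are arguably free, but only a person (or a text search) reading the free-text field would ever find that out.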

Inclusion of so much free text in metadata means the key benefit of creating a structured metadata record in the first place is almost entirely lost. Yes, the record describes the dataset it refers to, but in a totally unique way, which means searching a collection of records is limited to very generic criteria: in practice, only time and location (i.e. a bounding box for the dataset) can be searched with any certainty. The problem is compounded if you start looking across different collections of metadata records, created and maintained by different individuals, each with a different logic of what is important and what is not… But don’t blame the creators of metadata records for this – the standard does not prescribe the content in the first place!

The second problem is that the current metadata standard is applied primarily to collections (like, for example, TOPO-250K Series 3 topographic vector data for Australia or its raster representation) but it is generally not applied to individual data layers within a collection (which, in the case of TOPO-250K Series 3 data, would be any of the 92 layers that comprise the collection). Therefore, a simple search for, say, “road vector data in Australia” will not yield any results unless you resort to the free text search option and “roads” happens to be specifically mentioned somewhere within the metadata record (more on this below).
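
A small Python sketch of that situation follows. The abstract and keywords are invented for the example (they are not the real TOPO-250K metadata), and both search helpers are hypothetical; the point is only that a record describing the collection as a whole says nothing about the layers inside it.

```python
# Hypothetical collection-level record: the layers are not described individually.
# The abstract and keywords are invented, not copied from the real metadata.
collection = {
    "title": "TOPO-250K Series 3",
    "abstract": "Topographic vector data for Australia at 1:250 000 scale.",
    "keywords": ["topography", "vector", "Australia"],
}

def keyword_search(record, term):
    """Structured search: exact, case-insensitive match against keywords."""
    return term.lower() in (k.lower() for k in record["keywords"])

def free_text_search(record, term):
    """Fallback: case-insensitive substring match over all the text."""
    text = " ".join([record["title"], record["abstract"], *record["keywords"]])
    return term.lower() in text.lower()

# The collection does contain a roads layer, yet neither search can know that.
print(keyword_search(collection, "roads"))    # → False
print(free_text_search(collection, "roads"))  # → False
```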

Not to mention that it would be almost impossible, from a practical point of view, to apply metadata in the existing format to individual features or to the points making up a feature. This aspect of information about spatial data, especially important for data originators and maintainers, has been totally overlooked by the creators of the metadata standard.

Then there is the data user perspective. The key benefit of a comprehensive metadata record is that it provides all the relevant information enabling the user to, firstly, find the data and, secondly, decide whether it is fit for the intended purpose. In the most general terms, users apply “when, where and what” criteria to find the data (not necessarily in that specific order). In particular, they specify the reference date (relatively well defined in metadata records, so the least of the problems), the location (which is limited to a bounding box, although the data footprint concept is also addressed within the existing metadata standard) and some characteristics of the dataset … and this is where things are not so great, because each data type will have its own set of characteristics and these are mostly optional in an ISO 19115 compliant metadata record (so may not be implemented at all by data providers).

Take for example cases where users are interested in a “2m accuracy roads dataset for Bendigo, Vic”… or “imagery over Campbelltown, NSW acquired no later than 3 months ago and with under 1m resolution”. It is virtually impossible to specify search criteria in this way, so users have to fit their criteria to the information that is captured in metadata. That is, location becomes the bounding box constraint, the time criterion becomes a date constraint (either specific or as a range from – to) and the characteristics of datasets can only be specified as keywords…
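
Here is a rough Python sketch of what the Campbelltown request turns into once it has been squeezed into those three constraints. Everything in it is an assumption made for illustration: the record structure, the titles, the dates and the approximate coordinates are invented, and the `search` function is not any catalogue’s real API.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Record:
    title: str
    bbox: tuple      # (west, south, east, north) in decimal degrees
    ref_date: date
    keywords: list = field(default_factory=list)

def bbox_intersects(a, b):
    """True if two (west, south, east, north) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(records, bbox, date_from, date_to, keywords):
    """Bounding box + date range + keywords: all a typical catalogue offers."""
    wanted = {k.lower() for k in keywords}
    hits = []
    for r in records:
        have = {k.lower() for k in r.keywords}
        if (bbox_intersects(r.bbox, bbox)
                and date_from <= r.ref_date <= date_to
                and wanted <= have):
            hits.append(r.title)
    return hits

# Invented records; the coordinates are only roughly around Campbelltown, NSW.
RECORDS = [
    Record("Imagery 2013 (0.5 m)", (150.7, -34.2, 150.9, -34.0),
           date(2013, 2, 1), ["imagery", "aerial"]),
    Record("Imagery 2013 (10 m)", (150.0, -35.0, 152.0, -33.0),
           date(2013, 1, 15), ["imagery", "satellite"]),
    Record("Imagery 2011", (150.7, -34.2, 150.9, -34.0),
           date(2011, 6, 1), ["imagery"]),
]

# "No later than 3 months ago" becomes a fixed date range;
# "under 1m resolution" has nowhere to go, so it is silently dropped.
hits = search(RECORDS, (150.75, -34.15, 150.85, -34.05),
              date(2012, 12, 21), date(2013, 3, 21), ["imagery"])
print(hits)  # → ['Imagery 2013 (0.5 m)', 'Imagery 2013 (10 m)']
```

The 10 m dataset comes back alongside the 0.5 m one because the resolution requirement could not be expressed in the query at all.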

And this leads me to the final point: the need for ISO 19115 compliant metadata in the first place. Since the only truly comprehensive way to find what you are looking for is to conduct a free text search, the structured content of the metadata record is obsolete. The result would be exactly the same if the information were compiled into “a few paragraphs of text”. That is the essence of the argument that Ed Parsons, Geospatial Technologist at Google, presented to the Australian spatial community as far back as 2009 but which remains mostly ignored to this day…

There is only one practical use for all the metadata records already created. You can dump the entire content of the catalogues, the ones that contain information about the data you care about, onto your own server and reprocess it to your liking into something more meaningful, or just expose it to Google robots so that the content can be indexed and becomes discoverable via a standard Google search. Unfortunately, this totally defeats another implied benefit of SDI: that metadata records will be maintained and updated at the source and that there will be no need for duplication of information…
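
As a sketch of what such reprocessing could look like, the Python snippet below flattens one dumped record into plain text that any full-text index could pick up. The XML is a made-up, simplified stand-in (real ISO 19115 records use namespaced elements and are far more verbose), and `flatten` is a hypothetical helper, not part of any catalogue toolkit.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified record; real catalogue XML is namespaced and much longer.
SAMPLE = """<record>
  <title>Example roads dataset</title>
  <abstract>Road centrelines for an example region.</abstract>
  <constraints><other>No fee for non-commercial use</other></constraints>
</record>"""

def flatten(xml_text):
    """Collapse a metadata record into one line of searchable plain text."""
    root = ET.fromstring(xml_text)
    parts = (t.strip() for t in root.itertext())
    return " ".join(p for p in parts if p)  # drop whitespace-only fragments

text = flatten(SAMPLE)
print(text)
# → Example roads dataset Road centrelines for an example region. No fee for non-commercial use
print("no fee" in text.lower())  # → True
```

Once flattened like this, the record’s structure plays no part in whether it is found, which is exactly the point made above.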

I believe it is time to close the chapter on a national SDI and move on. Another failed attempt to create “an infrastructure that will serve all users in Australia” cannot be reasonably justified. The bar has to be lowered to cater only for the needs of your own community of practice. This also means you have to do it all by yourself and according to your own rules (i.e. most likely creating your own metadata standard). That’s the only way to move forward.

Related Posts:
Ed Parsons on Spatial Data Infrastructure
Data overload makes SDI obsolete
GIS standards dilemma
