Tuesday, October 13, 2009

Ed Parsons on Spatial Data Infrastructure

I recently attended the Surveying & Spatial Sciences Institute Biennial International Conference in Adelaide and was privileged to see Ed Parsons’ presentation. For those who don’t know Ed, his bio describes him as “… the Geospatial Technologist of Google, with responsibility for evangelising Google's mission to organise the world's information using geography, and tools including Google Earth, Google Maps and Google Maps for Mobile.” He delivered a very enlightening and truly evangelistic presentation outlining his views on the best approach to building Spatial Data Infrastructures (SDI). The following paragraphs summarise the key, thought-provoking points from the presentation, with some comments from my perspective.

The essence of Ed’s position is that the currently favoured approach of building highly structured “digital libraries”, complex to the n-th degree, to manage spatial information is very inefficient and simply does not work. There is a much better framework to use, the web, which is readily available and can deliver exactly what the community needs, in a gradual and evolutionary fashion rather than as a pre-designed and rigid solution.

I could quote many examples of failed or less than optimal implementations of SDI initiatives in support of Ed’s views. There is no doubt that there are many problems with the current approach. New initiatives are continuously launched to overcome the limitations of previous attempts to catalogue collections of spatial information, and it is more than likely that none of the implementations is compatible with the others. The problem is that metadata standards are too complex and inflexible, and data cataloguing software is not intelligent enough to work with less than perfectly categorised information. I recently had first-hand experience of this. I tried to use approved metadata standards for my map catalogue project, hoping it would make the task easier and the application fully interoperable, but in the end I reverted to adding my own “interpretations and extensions” (proving, at least to myself, that a “one-size-fits-all” approach is almost impossible). I will not even mention the software issues…

Ed argued that most SDI initiatives are public sector driven and, since solution providers are primarily interested in “selling the product”, by default it all centres on the data management aspect of the projects. In other words, the focus is on producers rather than users, on Service Oriented Architecture (SOA) rather than on “discoverability” of relevant information. All in all, current SDI solutions are built on the classic concept of a library, where information about the data (metadata) is separated from the actual data. It works exactly as in a local library, where you have an electronic or card-based catalogue with book titles and respective index numbers, and rows of shelves with books organised according to those catalogue index numbers. For small, static collections of spatial data this approach may work, but not in the truly digital age, where new datasets are produced in terabytes, with a myriad of versions (e.g. temporal datasets), formats and derivations. And this is why most SDI initiatives do not deliver what is expected of them at the start of the project.

Ed made the point that it is much better to follow an evolutionary approach (similar to how the web developed over time) rather than a strict, “documentation driven” process, as is the case with most current SDI projects. The simple reason is that you don’t have to understand everything up-front to build your SDI. The capabilities may evolve as needs expand, and you can adjust your “definitions” as you discover more and more about the data you deal with, in an evolutionary rather than prescriptive way. It is a very valid argument, since it is very hard to categorise data according to strict rules, especially if you cannot predict how the data will evolve over time.

[source: Ed Parsons, Google Geospatial Technologist]

The above table contrasts the two approaches. On one side you have traditional SDIs with strict OGC/ISO metadata standards and web portals with search functionality - all built on Service Oriented Architecture (SOA) principles and with SOAP (Simple Object Access Protocol) services as the main conduit of information. Actually, the whole setup is much more complex as, in order to work properly, it requires a formalised “discovery” module - a registry that follows the Universal Description, Discovery and Integration (UDDI) protocol and a “common language” for describing available services (that is, Web Services Description Language, or WSDL for short). And IF you can access the data (a big “if”, because most public access SDI projects do not go as far), it will most likely be in the “heavy duty” Geography Markup Language (GML) format (conceived over a decade ago but still mostly misunderstood by software vendors as well as potential users). No wonder that building an SDI based on such complex principles poses a major challenge. And even in this day and age the performance of an SDI constructed this way may not be up to scratch, as it involves very inefficient processes (“live” multidimensional queries, multiple round trips of packets of data, etc).

On the other side you have the best of the web, developed in an evolutionary fashion over the last 15 years: unstructured text search capabilities delivered by Google and other search engines (dynamically indexed and heavily optimised for performance), simple yet efficient RESTful services (according to Ed Parsons, not many are choosing to use SOAP these days) and simpler and lighter data delivery formats like KML, GeoRSS or GeoJSON (which have a major advantage: the content can be indexed by search engines, making the datasets discoverable!). As this is a much simpler setup, it is gaining widespread popularity amongst “lesser geeks”. The US government portal data.gov is the best example of where this approach is proving its worth.
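To make the point about lightweight, indexable formats a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not something shown in the presentation: the dataset, the publisher name and the `matches` helper are all made up. It shows a GeoJSON feature that carries its descriptive information in its own `properties`, so that even a naive text search over the serialised file can find it.

```python
import json

# Hypothetical dataset: the descriptive fields (title, publisher, updated)
# travel inside the GeoJSON "properties" instead of in a separate catalogue.
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [138.6007, -34.9285],  # GeoJSON order: longitude, latitude
    },
    "properties": {
        "title": "Adelaide city centre reference point",
        "publisher": "Example Mapping Agency",  # made-up name
        "updated": "2009-10-13",
    },
}

# The serialised form is plain text that any crawler can read and index.
document = json.dumps(feature, indent=2)


def matches(text: str, query: str) -> bool:
    """Naive unstructured search: every query word must appear in the text."""
    lowered = text.lower()
    return all(word in lowered for word in query.lower().split())


print(matches(document, "adelaide reference"))
print(matches(document, "sydney harbour"))
```

A real search engine does far more than substring matching, of course, but the principle is the same: because the description sits in the same plain-text document as the geometry, no separate registry is needed to discover it.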

The key lesson is: if you want to get it working, keep it simple and, to allow easy discovery of the information, do not separate metadata from your data. And let the community of interest define what is important rather than prescribe a rigid solution upfront. The bottom line is that Google's strength is in making sense of the chaos that is in cyberspace, so it should be no surprise that Ed is advocating a similar approach to dealing with the chaos of spatial data. But can the solution really be so simple?

The key issue is that most of us, especially scientists, would like to have a definite answer when we search for the right information. That is: “There are 3 data sets matching your search criteria” rather than: “There are 30,352 datasets found, first 100 closest matches are listed below…” (i.e. the “Google way”). There is always that uncertainty: “Is there something better/more appropriate out there, or should I accept what Google is serving as the top search result?... What if I choose an incomplete or outdated version of the dataset?”… So the need for a highly structured approach to classifying and managing spatial information is understandable, but it comes at a heavy cost (both time and money) and in the end it can serve only the needs of a small and well defined group of users. “The web” approach can certainly bring quick results and open up otherwise inaccessible stores of spatial information to the masses, but I doubt it can easily address the issue of “the most authoritative source” that is so important with spatial information. In the end, the optimal solution will probably be a hybrid of the two approaches, but one thing is certain: we will arrive at that optimal solution by evolution and not by design!
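As a rough illustration of what such a hybrid might look like, here is a small Python sketch. Everything in it is an assumption of mine for the sake of the example: the `Dataset` record, the `authoritative` flag and the word-count scoring are hypothetical and not taken from any existing SDI or search engine. It ranks datasets by simple free-text relevance (the “web” way) while letting a structured attribute either break ties or act as a strict filter (the “library” way).

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    title: str
    description: str
    authoritative: bool  # hypothetical flag set by the data custodian


def score(ds: Dataset, query: str) -> int:
    """Count how many query words occur in the title or description."""
    text = f"{ds.title} {ds.description}".lower()
    return sum(word in text for word in query.lower().split())


def search(catalogue, query, authoritative_only=False):
    """Rank by text relevance; optionally keep only authoritative sources."""
    hits = [ds for ds in catalogue if score(ds, query) > 0]
    if authoritative_only:
        hits = [ds for ds in hits if ds.authoritative]
    # Highest score first; among equal scores, authoritative sources first.
    return sorted(
        hits,
        key=lambda ds: (score(ds, query), ds.authoritative),
        reverse=True,
    )


catalogue = [
    Dataset("Road centrelines", "Community-traced road network", False),
    Dataset("Road centrelines", "Official state road network", True),
    Dataset("Rainfall grids", "Monthly rainfall surfaces", True),
]

for ds in search(catalogue, "road network"):
    print(ds.description)
```

With `authoritative_only=True` the same query returns only the official dataset, which is closer to the definite answer a scientist would want, while the default call keeps the broader, ranked list.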


Jeroen Ticheler said...

Nice summary and excellent conclusion! Indeed, the hybrid solution is the way to go in my opinion. Structured ways of describing and finding data are needed by scientists and governments. Unstructured ways of finding are required by the general web audience, although they too could benefit from a structured description. That description could be in any form and can very well have its original form in e.g. ISO metadata format. We would be ignorant to build systems that only serve one audience at a time. Jeroen Ticheler (PSC chair GeoNetwork opensource)

Anonymous said...

I like that you are bringing attention to something that just a few years ago was a dull and boring topic for most. But now, today, with so much geospatial data and services available, people finally see the value in search and discovery.

I disagree that metadata is not flexible enough. However, metadata is often misused and misunderstood, and there are way too many standards, or extended profiles of the standards, to work with, to the point that it has taken on its own search, discovery and categorization problem.

Your best point and one that’s sorely needed is a way to indicate authoritative source. But, the politics behind declaring stewardship, ownership or authority of a specific source, I have found, is the challenge to overcome before that capability is achieved.

Rod Erickson, GeoCGI

Arek said...

Thank you, gentlemen, for your comments and for contributing your views to extend insights into this topic. The issues are certainly complex and it will take the collective wisdom of many practitioners to address them properly over time.

There is a new challenge in Australia that will most likely bring heightened scrutiny of various solutions for managing data on a large scale. In particular, the Government 2.0 Taskforce is aiming to free up all public sector information… a monster of a task!

Jeff Harrison said...

A services-based SDI approach does not always include SOAP, WSDL or UDDI, or preclude REST, GeoRSS, etc.

For example, there's an open source dashboard for geodata.gov and the US NSDI that was recently released - it uses REST and GeoRSS (and even Bing Maps) to access metadata feeds and then connect directly to standards-based services (WMS/WFS). These services can then be combined with OpenStreetMap, Yahoo! Maps, etc.

In the end, the optimal solution will probably be a hybrid of the two approaches.

Marten said...

The topic of verbose metadata versus YouTube-level metadata (a title and a video) is not new. Even the publishers struggle with the verbose metadata standards that have been created over the years.

I'm glad to hear Ed's positive review of the data.gov initiative. It illustrates how the old can support the new! The geospatial content in data.gov (over 99% of what you can find there) is actually served through the Geospatial One-Stop portal (http://www.geodata.gov). This 5-year-old site provides both standards-based OGC CSW 2.0.2 and web-oriented REST interfaces.

Data.gov makes use of the CSW 2.0.2 interface to access the rich content available in geodata.gov while providing a modern user interface to non-GIS users.

I'm looking forward to a discussion on how thinking about SDI can evolve as the Web evolves. To that point I posted some thoughts and examples on my blog: http://martenhogeweg.blogspot.com/2009/10/sdi-for-everyone.html

Marten Hogeweg, ESRI Inc.