Open Data – It’s More Than Publishing a CSV
Today, the status quo for governments sharing public data is the CSV file. It’s the de facto standard by which most government organizations publish their data for consumption. The CSV is no longer an acceptable format for data publishing. It’s time to raise the bar on how government open data sites like data.gov and DataSF.org share our public data.
Know Your Customer
For data to truly be open, it must be equally accessible to all three distinct data-consuming constituencies:
Scientists, Researchers, Analysts, Economists and the Media. This group requires bulk, downloadable, machine-readable access to data in order to pore over it and combine it with other data sources. The CSV meets this group’s needs the best, but not adequately. In addition to CSV, governments should offer the data in more self-describing, schematized formats including JSON and XML.
Programmers. Contrary to conventional wisdom, programmers creating civic apps seldom want bulk data as it requires the programmer to set up a database and develop systems and processes for keeping the data up to date. Programmers much prefer a standards-based, open data API, which points to a real-time source of the data. The API should offer the programmer the flexibility to request how the data is returned, for example in JSON, XML or CSV or even as open linked data in RDF.
Non-Technically Trained but Interested Citizens. This group has casual needs for data. They don’t want to download 350,000 White House visitor records. They merely want to search through the dataset to see if Oprah Winfrey has visited, or they want to sort it by frequency to see who visited the most. This group’s needs are best met when data is available online in a consistent, interactive format allowing sorting, searching, filtering and visualizing the data. Requiring a bulk CSV download creates an unnecessary access barrier for a significant percentage of citizens.
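To make the three needs above concrete, here is a minimal Python sketch of one dataset serving all of them: the same records rendered as JSON, CSV or XML on request, plus the search and sort-by-frequency operations a casual citizen would want. The sample records, field names and function names are all hypothetical and invented for illustration; this is not the API of data.gov or of any real platform.

```python
import csv
import io
import json
from collections import Counter
from xml.etree import ElementTree as ET

# Hypothetical sample records; a real platform would read from its data store.
FIELDS = ["visitor", "visit_date"]
RECORDS = [
    {"visitor": "Ada Lovelace", "visit_date": "2010-01-04"},
    {"visitor": "Grace Hopper", "visit_date": "2010-01-05"},
    {"visitor": "Ada Lovelace", "visit_date": "2010-02-11"},
]


def render(records, fmt):
    """Serialize the same records in the format the caller asked for."""
    if fmt == "json":
        return json.dumps(records)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue()
    if fmt == "xml":
        root = ET.Element("records")
        for rec in records:
            row = ET.SubElement(root, "record")
            for key in FIELDS:
                ET.SubElement(row, key).text = rec[key]
        return ET.tostring(root, encoding="unicode")
    raise ValueError(f"unsupported format: {fmt}")


def search(records, name):
    """Case-insensitive substring search on the visitor name."""
    return [r for r in records if name.lower() in r["visitor"].lower()]


def most_frequent(records, n=1):
    """Return the n most frequent visitors with their visit counts."""
    return Counter(r["visitor"] for r in records).most_common(n)
```

A programmer would call something like render(RECORDS, "json"), while an interactive page would sit on top of search and most_frequent so that a citizen never has to download the full file.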
Publishers – Don’t Ignore Your Self-Interests
While it’s important and worthwhile to examine these benefits from the data consumer’s perspective, change is likely to come more quickly once data publishers see the benefits to themselves.
Reach and Amplification. There are a vast number of devices, machines, programs and websites where people will directly and indirectly consume and use government data. The easier governments make it to link, embed, email, share and socialize their data into these devices, machines, programs and websites, the more broadly people will access the data – perhaps without even knowing it. Publishing via CSV reduces the likelihood that your data will be discovered or shared; that people will discuss, collaborate and engage around your data; or that people will create visualizations – charts, graphs and maps – each of which helps your data tell a story.
Feedback. Once a CSV has been downloaded, what happens with it? Nobody knows. Don’t you want to know by whom, how and where your data is being used? Of course you do. That’s the civic engagement feedback loop you’re lacking. Sure, you can see the number of page views and count the number of downloads on sites like data.gov, but that’s all you see. You can’t measure any of the indirect activity. How many times has it been tweeted? How many times has it been discussed on Facebook? How many times has it been embedded on all sorts of websites and blogs across the Internet? How many applications are embedding it?
Cost Savings. There are real costs associated with sharing public data. Two of the direct costs are the cost of storage and the cost of bandwidth to deliver the data to someone who wants it. When an agency publishes a CSV, it bears the transmission cost to deliver the entire file to everyone who downloads it, even if the people who download it simply look up one value and discard it. Allowing data consumers to selectively access only the records they want reduces the amount of data transferred and, with it, the bandwidth costs. How do you let people selectively choose the relevant records? API enablement allows apps, widgets and controls like the Socrata Social Data Player to stream data out in small chunks or in response to explicit search requests or filters.
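The bandwidth argument can be illustrated with a rough Python sketch that compares the size of a full download against one filtered page of results. The dataset, the field names and the query function are hypothetical stand-ins and do not describe Socrata's or any other vendor's actual API; the byte counts are simply the length of the JSON serialization.

```python
import json

# Hypothetical dataset standing in for a large published table.
DATASET = [
    {"id": i, "agency": "parks" if i % 2 else "transit", "amount": i * 10}
    for i in range(1000)
]


def query(records, where=None, offset=0, limit=50):
    """Return one page of records matching an optional equality filter."""
    where = where or {}
    matched = [
        r for r in records
        if all(r.get(key) == value for key, value in where.items())
    ]
    return matched[offset:offset + limit]


# Bytes sent when every consumer downloads the whole file.
full_size = len(json.dumps(DATASET).encode("utf-8"))

# Bytes sent when a consumer asks only for the ten records they need.
page = query(DATASET, where={"agency": "parks"}, limit=10)
page_size = len(json.dumps(page).encode("utf-8"))

print(f"full download: {full_size} bytes, filtered page: {page_size} bytes")
```

The exact savings depend on the dataset and on how selective the request is, but the publisher pays only for the records that were actually requested.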
It’s Time to Raise the Bar
Consider how the web itself would look today if it were nothing more than loosely connected directories of text (.txt) files, lightly described and inadequately linked. It would hardly be the World Wide Web we now know and enjoy.
Emerging, pioneering Open Data sites like data.utah.gov, data.ca.gov and Portland’s CivicApps are laying the foundation for the web of data. Unless the status quo for publishing data changes, we’ll end up with loosely connected directories of CSV and XLS files, lightly schematized and inadequately reusable. We’ll all lose out on the real and lasting social and economic benefits that open data offers.
The time has come to raise the bar on how public data is shared. A CSV is no longer acceptable. Public data must be online, interactive, machine readable, embeddable, linkable and API enabled.
It’s time for Open Data sites to become Open Data platforms.