Practical Guide for Publishing Tabular Data to CSV Files

Summary

Today we have more and more data sources at our fingertips. According to the European Data Portal, the economic impact of the open data market could reach up to 334 billion euros and generate around 2 million jobs by 2025 ("The Economic Impact of Open Data: Opportunities for value creation in Europe", 2020).

Paradoxically, even though data is more accessible than ever, the possibilities for reusing it are still quite limited. Potential users often face multiple barriers that hinder access and use. Quality problems that make data hard to reuse can arise in many areas: metadata that is poorly descriptive or not standardized, the choice of license, the choice of format, inappropriate use of formats, or deficiencies in the data itself. Many initiatives try to measure the quality of data sets based on their metadata (date and frequency of update, license, formats used, and so on). Examples include the metadata quality scorecard in the European Data Portal and the quality dimension of the Open Data Maturity Index.

But these analyses are insufficient, because most of the time quality deficiencies can only be identified once the reuse process has started. The work required to clean and prepare the data thus becomes a significant burden that in many cases the open data user cannot afford. This causes frustration and loss of interest among reusers in the data offered by public organizations, damages the credibility of the publishing institutions, and considerably lowers the expected return and value generated from the reuse of open data.

These problems can be tackled because, to a large extent, they have been observed to stem from the publisher not knowing how to express the data correctly in the chosen format.

For all these reasons, and with the aim of helping to improve the quality of open data, at datos.gob.es we have decided to create a collection of guides to help publishers make proper use of the formats and means of access most commonly used in the field of open data.

The collection of guides begins here by focusing on the CSV format. The choice of this format is based on its popularity in the field of open data, its simplicity, and how lightweight it is when expressing data in table form. It is the most common format in open data catalogs; specifically, in datos.gob.es it represents 20% of the distributions, coexisting with other formats such as XLS or XLSX that could also be expressed as CSV. Furthermore, it is a format that we can call hybrid because it combines ease of automated processing with the possibility of being explored directly by people using a simple text editor.
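To illustrate the hybrid nature described above, here is a minimal sketch in Python using only the standard library. The sample data, column names, and the read_rows helper are hypothetical, invented for this illustration. The same text that a person can read in any editor is parsed by a program into structured rows.

```python
import csv
import io

# Hypothetical sample data, invented for illustration only.
# A person can read this text directly in any plain-text editor.
SAMPLE_CSV = (
    "municipality,year,population\r\n"
    "Alpha,2020,1500\r\n"
    '"Beta, North",2020,320\r\n'  # quoted field containing a comma
)


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    # newline="" leaves line endings untouched so the csv module handles them.
    return list(csv.DictReader(io.StringIO(text, newline="")))


rows = read_rows(SAMPLE_CSV)

# Values are read as strings; convert explicitly before doing arithmetic.
total = sum(int(row["population"]) for row in rows)

for row in rows:
    print(row["municipality"], row["population"])
print("Total population:", total)
```

Note that the quoted field keeps its embedded comma as part of a single value, which is one reason a CSV parser is generally preferable to splitting lines on commas by hand.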

This guide covers the basic characteristics of this type of format and a compendium of guidelines for correctly publishing tabular data, especially in CSV. The guidelines are accompanied by suggestions for free tools that stand out for how easy they make working with CSV files and for the extra functionality they provide. In addition, a summary of the guidelines in the guide is available as a cheat sheet for ease of use and consultation.