Eurostat is the main entry point for European statistics. This article helps you with the steep learning curve for using the data in Python.
Eurostat is the statistical office of the European Union. Their mission is to provide statistics and data on Europe. All the data is available for browsing and download. You can browse their complete database here. The final code of this article is available on my GitHub as easy_eurostat package.
Datasets
All tables can be downloaded by their identification code. This code is shown in the database browser between brackets behind the name of the dataset. See the red circle in the picture below The downloaded files are in a gzipped tar format, containing a tab separated datafile.
With the dataset idetification code the dataset can be download from an URL with the form https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing
with the parameter file=data/dataset.tsv.gz
:
import os
import gzip
import urllib.requestdef download_eurostat_dataset(dataset):
url = "https://ec.europa.eu/eurostat/" +
"estat-navtree-portlet-prod/" +
"BulkDownloadListing?file=data/" +
dataset + ".tsv.gz"
out_file = os.path.join("cache", dataset + ".tsv")
if os.path.exists(out_file):
return
try:
with urllib.request.urlopen(url) as resp:
with gzip.GzipFile(fileobj=resp) as data:
file_content = data.read()# Write to file in binary mode 'wb'
with open(out_file, 'wb') as f:
f.write(file_content)
except Exception as e:
print(e)download_eurostat_dataset("teilm020")
This example downloads the dataset teilm020
(unemployment numbers) and store the tab separated file in the cache directory. The dataset is saved to disk so it can be used as a cached version of the eurostat database. The gzipped file is loaded in memory and decompressed before writing. In this case, working in memory is no problem due to the limited size of the dataset.
The stored file is in a tab separated format, except for the first columns which are comma separated (let’s not make it to easy). Diving into the working of the CSV import function, a complex way can be found to accept comma’s and then tabs as separator. So to load the dataset to a Pandas dataframe, we can use the following code:
def get_eurostat_dataset(dataset):
download_eurostat_dataset(dataset)
df = pd.read_csv(os.path.join("cache", dataset + ".tsv"),
sep=",|t| [^ ]?t", na_values=":",
engine="python")
df.columns = [x.split('')[0].strip(' ') for x in df.columns]
return dfdf = get_eurostat_dataset("teilm020")
The column names are cleaned from surrounding spaces (sometimes there is a space at the end of a column name) and the last column has a postfix ‘time’, which indicates the following columns are the timeseries. This postfix is also removed.
The resulting dataframe is as follows:
Dictionaries
The first five columns are codes, referring to code lists. These code lists (dictionaries) can be found here on the Eurostat site. The column name equals the name of the .dic
file at this location. This file contains key/value pairs separated by tabs. The key is the abbreviation and the value is the full text. In the dataset, the coded columns are the first columns that are not tab separated but comma separated. Iterating over these column names we can replace them one by one with the text values.
In order to replace the codes with their values, we need to load the dictionary file from Eurostat and import it into a dictionary:
def download_eurostat_dict(dictionary):
dictionary = dictionary.lower()
url = "https://ec.europa.eu/eurostat/" +
"estat-navtree-portlet-prod/BulkDownloadListing" +
"?sort=1&downfile=dic%2Fen%2F" +
dictionary + ".dic"
out_file = os.path.join("cache", dictionary + ".dic")
if os.path.exists(out_file):
return
try:
with urllib.request.urlopen(url) as resp:
file_content = resp.read().decode('utf-8')
with open(out_file, 'w') as f:
f.write(file_content)
except Exception as e:
print(e)def get_eurostat_dictionary(dictionary, inverse=False):
download_eurostat_dict(dictionary)
filename = os.path.join("cache", dictionary + ".dic")
try:
with open(filename) as f:
d = {}
for line in f:
if len(line) > 1:
row = line.split('t')
d[row[0]] = row[1].strip()
if inverse:
d = {v: k for k, v in d.items()}
return d
except:
return {}
As with the datasets, a copy is downloaded to a local folder functioning as cache. The difference is that the dictionary file is not compressed and can be stored as downloaded.
The dictionary file contains per line a key/value pair separated by a tab character. Some dictionary files contain an empty line between each key/value pair so the check on line length is required to skip the empty lines. The value is stripped from the ‘n’ character.
With the dictionary available, the codes in the dataframe can be replaced. The Panda dataframe has a method replace()
specific for this functionality. When it is called on a column with a dictionary as parameter, all values in this column will be replaced with the value associated with the current column value as key.
Replacing the codes
The get_eurostat_dataset
method is exended with the functionality to replace the code columns keys with values:
def get_eurostat_dataset(dataset):
download_eurostat_dataset(dataset)
df = pd.read_csv(os.path.join("cache", dataset + ".tsv"),
sep=",|t| [^ ]?t", na_values=":",
engine="python")
df.columns = [x.split('')[0].strip(' ') for x in df.columns]
# Now get the dictionary columns (first line, comma sep)
with open(os.path.join("cache", dataset + ".tsv")) as f:
first_line = f.readline()
codes = first_line.split('t')[0].split('')[0].split(',')
# Replace codes with value
for c in codes:
code_list = get_eurostat_dictionary(c)
df[c] = df[c].replace(code_list)
df = df.set_index(codes).transpose()
return df
Before returning, the dataframe is transposed and all code columns are part of a multilevel column index. At first , this seems complicated, but the dataframe becomes easier to use.
Reading the same dataset now results in:
From dataset to graph
Accessing and using the MultiIndex columns works as follows:
>>> # Get names of all levels
>>> print(df.columns.names)['s_adj', 'age', 'sex', 'unit', 'geo']>>> # Obtaining all unique values for level 2 ('sex')
>>> print(df.columns.get_level_values(1).unique())Index(['Females', 'Males', 'Total'], dtype='object', name='sex')>>> #Selecting unemployment data for all sexes combined
>>> df.loc[:, ('Seasonally adjusted data, ...',
'Total',
'Total',
'Percentage of population in the labour force')]
So, as an example to create a bar plot with the unemployment data for Austria:
d = df.loc[:, ('Seasonally adjusted data, not calendar adjusted' +
' data','Total', 'Total',
'Percentage of population in the labour force')]
ax = d['Austria'].plot.bar()
ax.set_title('Unemployment in Austria')
ax.set_xlabel('Month')
_ = ax.set_ylabel('Unemployment percentage')
resulting in
So now we know how to download and prepare data from Eurostat and to use it for creating graphs. All datasets contain data per country so making a choropleth map is the next step.
Choropleth map
Eurostat also provides geodata. This geodata contains country borders, region borders, cities, etc. The level of detail is called NUTS and NUTS level 0 is country level. By increasing this number, more details are added but for now the level 0 is sufficient:
import geopandas as gpddef download_url(url, filename, binary=False):
if os.path.exists(filename):
return
try:
with urllib.request.urlopen(url) as resp:
file_content = resp.read()
if not binary:
file_content = file_content.decode('utf-8')
with open(filename, 'wb' if binary else 'w') as f:
f.write(file_content)
except Exception as e:
print(e)def download_eurostat_geodata():
url = "https://gisco-services.ec.europa.eu/distribution/v2/" +
"nuts/shp/NUTS_RG_20M_2021_3035.shp.zip"
out_file = os.path.join('data',
'NUTS_RG_20M_2021_3035.shp.zip')
download_url(url, out_file, binary=True)
We use the RG_20M
file wich has a scale of 1:20mio. If more details are required, there are also 1:10mio, 1:3mio and 1:1mio files available. Replace 20M
in the filename to download another scale, e.g. 10M
for the 1:10mio scale.
The downloaded zip-file contains a geospatial shapefile. This file format from ESRI is the most common format for vectorized geospatial data. It consists of points, lines and polygons representing geospatial items (countries, roads, etc). All items can have attributes associated with them.
The GeoPandas package is an excellent Python package for handling geospatial data. It is an extension of Pandas dataframe that adds geospatial functions. It can be installed with pip or Conda :
pip install geopandasorconda install -c conda-forge geopandas
GeoPandas is capable of opening the shapefile in zip format:
def get_eurostat_geodata(lvl=0):
download_eurostat_geodata()
borders = gpd.read_file(os.path.join('data',
'NUTS_RG_20M_2021_3035.shp.zip'))
return borders[borders['LEVL_CODE'] == lvl]
The resulting dataframe is shown below. The geometry
column contains the geospatial items, in this example all off them are polygons (MULTIPOLYGON
). The other columns are the attributes. The country code is in the NUTS_ID
and CNTR_CODE
columns. The country code column is the same for all elements of a country, including e.g. roads, rivers and cities. The NUTS_ID
is an unique identification, when the a country code appears in this column the row contains the country borders.
To check the contents of the dataframe we can plot it with the built-in plot()
method.
countries = get_eurostat_geodata()ax = countries.plot()
ax.set_xlim(2000000, 8000000)
ax.set_ylim(1000000, 5500000)
This results in a plot of Europe:
The x and y limits crop away the distant parts like the Canary Islands.
A choropleth map is created when the plot()
parameter column
is specified. This column is used to color the countries by this columns value. First, we have to combine the unemployment dataframe with the country dataframe. The country dataframe uses the countries abbreviation as identification. This abbreviation was removed during the import phase of the unemployment data which contains the countries full names. The geo dictionary is used for the reverse translation.
d = df.loc[:, ('Seasonally adjusted data, not calendar adjusted' +
' data','Total', 'Total',
'Percentage of population in the labour force')]
d = d[d.index == '2022M08'].transpose().reset_index()
code_list = get_eurostat_dictionary('geo', inverse=True)
d['geo'] = d['geo'].replace(code_list)
d = countries.merge(d, right_on='geo', left_on='NUTS_ID')
First, unemployment data is filtered like before (total numbers for all sexes combined). The month 2022M08 is selected since it is the last row containing data for all countries. After this, the dataframe is transposed resulting in
The geo
dictionary is used to translate the countries name back to their abbreviation.
The unemployment data and countries data are now ready to be merged. Make sure to merge the unemployment data into the country data to ensure the resulting dataframe is indeed a GeoPandas dataframe capable of plotting maps. The countrie codes are in the column NUTS_ID
of the countries
dataframe. The final code to plot the choropleth map is
ax = d.plot(column='2022M08', legend=True)
ax.set_xlim(2000000, 7000000)
ax.set_ylim(1000000, 6000000)
ax.set_title('Unemployment in the European Union')
resulting in
Congratulations! You now have all the tools to use the Eurostat open data, including the creation of graphs and maps.
The easy_eurostat package
I have created a package based on the code above. Code is better reused and some method parameters for easy usage have been added. One example is de transpose
parameter for the download_eurostat_data
method. The code above showed that during the download the table as transposed but it needed another transpose, undoing the first one, to use it for mapping purposes. The boolean transpose
parameter definers whether or not the transpose is performed in the method.
All code is available on my GitHub as easy_eurostat package. Feel free to use as required.
The examples of this articles implemented using this package:
from easy_eurostat import get_eurostat_geodata, get_eurostat_dataset# Plotting the Austrian unemployment numbers over time
df = get_eurostat_dataset("teilm020")
d= df.loc[:, ('Seasonally adjusted data, not calendar adjusted' +
' data','Total', 'Total',
'Percentage of population in the labour force')]
ax = d['Austria'].plot.bar()
ax.set_title('Unemployment in Austria')
ax.set_xlabel('Month')
_ = ax.set_ylabel('Unemployment percentage')# Making a choropleth map of unemployment numbers in Europe
df = get_eurostat_dataset("teilm020", replace_codes=True,
transpose=False, keep_codes=['geo'])
df = df[(df['age'] == 'Total') & (df['sex'] == 'Total')]df = pd.merge(countries, df, left_on='NUTS_ID', right_on='geo')
ax = df.plot(column='2022M08', legend=True)
ax.set_xlim(2000000, 7000000)
ax.set_ylim(1000000, 6000000)
ax.set_title('Unemployment in the European Union')
I hope you enjoyed this article. For more inspiration, check some of my other articles:
All stories can be found in my profile. If you like this article or any other of my articles, please hit the Follow button!
Disclaimer: The views and opinions included in this article belong only to the author.
Using Eurostat statistical data on Europe with Python Republished from Source https://towardsdatascience.com/using-eurostat-statistical-data-on-europe-with-python-2d77c9b7b02b?source=rss—-7f60cf5620c9—4 via https://towardsdatascience.com/feed
<!–
–>