- Wed 28 August 2019
- Data Science
- #python, #data-science
Introduction¶
The objective is to map the traffic information for May 2019. For that, we will need to get the traffic information per street section, and to join it with the dataset that contains the geometry of the street sections. We can download the traffic information for the last month (May 2019) from this url
We can also download the geometry of the street sections (described in terms of latitude and longitude) from this url:
https://opendata-ajuntament.barcelona.cat/data/en/dataset/transit-relacio-trams
The final goal is to have a static (e.g. an image) or interactive map, which displays the traffic state intensity by street sections, over a context background map (e.g.: OpenStreetMap).
A short disclaimer before we begin: in Catalan, tram means something like segment (of a street, in this context *ahem*). It's not a rail tram.
Hands on¶
We will try to get a quick sketch using only a subset of the data first, and I will be adding comments when necessary.
We first note that the files we will be using are really big, and we can't afford to download them from a notebook cell each time. Since this notebook was created using the Google Colab Platform, we will have to find a way to make these files available from somewhere. Luckily, Colab's API provides a Google Drive extension that solves this problem.
from google.colab import drive
drive.mount('/content/drive')
TRAFFIC_MAY_URL="/content/drive/My Drive/Colab Notebooks/trams-barcelona-data/2019_05_Maig_TRAMS_TRAMS.csv"
STREETS_MAPPING_URL="/content/drive/My Drive/Colab Notebooks/trams-barcelona-data/transit_relacio_trams.csv"
STREETS_MAPPING_URL_LONG="/content/drive/My Drive/Colab Notebooks/trams-barcelona-data/transit_relacio_trams_format_long.csv"
import pandas as pd
import matplotlib.pyplot as plt
# define default size of plots
plt.rcParams["figure.figsize"] = (20,3)
trf = pd.read_csv(TRAFFIC_MAY_URL, sep=",")
trf.head(10)
trf.info()
old_size, _ = trf.shape
Some information about this dataset can be found in
https://opendata-ajuntament.barcelona.cat/data/en/dataset/trams
Here, we see that the columns EstatActual (current state) and EstatPrevist (forecasted state) have 7 possible values:
Value | Meaning
---|---
0 | No data
1 | Very fluid (Light)
2 | Fluid
3 | Dense
4 | Very Dense
5 | Congestion
6 | Line closed
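For later reference, the table above can be captured as a plain Python mapping (the labels are taken verbatim from the dataset documentation):

```python
# Traffic state codes and their meanings, as listed in the dataset docs
ESTAT_LABELS = {
    0: "No data",
    1: "Very fluid (Light)",
    2: "Fluid",
    3: "Dense",
    4: "Very Dense",
    5: "Congestion",
    6: "Line closed",
}

print(ESTAT_LABELS[3])  # Dense
```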
Our first task is to decide what to do with the rows whose value is 0 or 6. For the sake of this example, we will drop them. In your case, you might want to work only with those rows, or something in between. Your call.
# drop rows where either state column is 0 (no data) or 6 (line closed)
trf = trf[
~(
trf['estatActual'].isin([0, 6]) | trf['estatPrevist'].isin([0, 6])
)
]
trf.info()
new_size, _ = trf.shape
# print the amount of rows that were dropped
print(f"{old_size - new_size} of {old_size} rows were dropped ({(old_size-new_size)/old_size:.0%})")
streets = pd.read_csv(STREETS_MAPPING_URL, sep=",")
streets.head(10)
streets.dtypes
Manipulating WKT data with shapely¶
streets["Coordenades"].iloc[1]
Here we note that the coordinates come as chunks of text, and no encoding is stated explicitly (I didn't find any in the official documentation). We need to do some research.
But before diving in, let's check for missing data.
# we check for possible missing data
streets["Coordenades"].isna().sum()
streets["Coordenades"].isnull().sum()
expanded_coords = streets["Coordenades"].str.split(",", expand=True)
expanded_coords.head()
len(expanded_coords.columns)
We note that the Coordenades column could be expanded into ~76 new columns, if we consider sep="," as a separator. From looking at this data, it's probably in the WKT format (Well-Known Text representation of geometry), and each row represents a line, according to the LINESTRING representation.

We'll need to convert each row to match this WKT format, using string manipulation. But we don't want to reinvent the wheel, so we will be using a Python package called Shapely, developed specifically for these kinds of tasks.
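As a rough sketch of the target format (with made-up coordinates), a LINESTRING is just the keyword followed by space-separated x y pairs, joined by commas:

```python
# Build a WKT LINESTRING string from coordinate pairs (hypothetical values)
pairs = [(2.15, 41.39), (2.16, 41.4), (2.17, 41.41)]
wkt_line = "LINESTRING (" + ", ".join(f"{x} {y}" for x, y in pairs) + ")"
print(wkt_line)  # LINESTRING (2.15 41.39, 2.16 41.4, 2.17 41.41)
```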
!pip install shapely
sample = streets["Coordenades"].str.split(",")[0]
sample
Reshaping the strings¶
A LINESTRING is defined in terms of pairs of values. We need to take a list of length $n$ (assuming $n$ is even) and reshape it as a list of tuples:

[a, b, c, d, e, f] -> [(a, b), (c, d), (e, f)]
import typing as T
from itertools import zip_longest
# not tested
def reshape_line(line: T.List[str], chunksize: int, fillvalue: int) -> T.Iterable[T.Tuple[float]]:
"""Reshape a line to match the LineString WKT format
This is based on `zip_longest`, read more in
<https://docs.python.org/3/library/itertools.html#itertools.zip_longest> and
in the StackOverflow solution posted in <https://stackoverflow.com/a/434411/5819113>
zip_longest('ABCD', 'xy', fillvalue='-') --> Ax By C- D-
"""
args = [iter(float(el) for el in line)] * chunksize
return zip_longest(*args, fillvalue=fillvalue)
list(reshape_line(sample, chunksize=2, fillvalue=0))
With this, we can get our very first drawing. Baby steps.
from shapely.geometry import LineString
sample_wkt = LineString(reshape_line(sample, 2, 0))
sample_wkt
The next step is to apply this process to every row of the column Coordenades. We will implement a helper function for this purpose.
def convert_line_to_wkt(line: str) -> LineString:
    splitted_line = line.split(",")
    # use 0 as fillvalue (as in the sample above): a None in a coordinate
    # pair would make LineString fail on a line with an odd number of values
    reshaped_line = reshape_line(splitted_line, chunksize=2, fillvalue=0)
    return LineString(reshaped_line)
# let's process the whole column
coords_wkt = streets["Coordenades"].apply(lambda line: convert_line_to_wkt(line))
# access the __str__ of the first element
print(coords_wkt[1])
With the values in WKT format, we could use an online WKT viewer like http://arthur-e.github.io/Wicket/sandbox-gmaps3.html to see how this is rendered on a map.
First plot using geopandas¶
That was a first approach. Now for our next step: plot everything at once. Whatever gets displayed will be considered a success.
!pip install geopandas
from geopandas import GeoDataFrame
crs = {'init': 'epsg:4326'}
gdf = GeoDataFrame(streets, crs=crs, geometry=coords_wkt)
ax = gdf.plot(cmap="viridis")
Note that here, the colormap viridis is not supposed to make sense: it is not giving any useful information out of the box. But we'll get to that in a minute.
Tweaking the plot¶
Hooray! We got a map. But it's ugly and it doesn't say much.
At this point we could start tweaking the plot using the matplotlib API. We could add legends, meaningful colours, different line styles, etc. You get it. And if you have played with matplotlib long enough, you'll know that almost everything is possible, but not much of it is easy to get.

In our case we don't need to produce beautiful plots at this point; we need an efficient way to explore the data visually.

For maps, the best tool to do this IMHO is leaflet, but we don't want to write JavaScript (I don't want to). We will be using Folium, which is a library for using leaflet from Python.
Plotting routes on top of Barcelona using Folium¶
!pip install folium
import os
import folium
print(folium.__version__)
barcelona_map = folium.Map([41.3947,2.1557], zoom_start=12.4, tiles='cartodbpositron')
folium.GeoJson(gdf).add_to(barcelona_map)
# plot map
barcelona_map
Plotting traffic data¶
This map is plain and boring, because it doesn't contain any information from the traffic dataset. Let's fix that.
We would like to achieve the following:
- To draw the lines from Coord_wkt on top of a map of Barcelona
- To plot the values from the trf dataframe, showing how busy each line is at a given moment, defined by the column data

We note that the traffic data has a temporal component in data, which will benefit greatly from using the correct datatype, i.e. we will try to use the DateTime type over the Object type.
trf["data"].head()
We can convert this column to datetime (https://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.to_datetime.html) following the strftime format specifications:
# https://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.to_datetime.html
pd.to_datetime(trf["data"].astype(str)[:5], format='%Y%m%d%H%M%S')
trf["data_datetime"] = pd.to_datetime(trf["data"].astype(str), format='%Y%m%d%H%M%S')
trf.columns
trf.head(5)
Quick survey on the data¶
How busy are trams compared to the expected traffic, on average, for the month of May?¶
As an extra, we'll try to assess the forecasts: were they right? We'll consider a simple rule of thumb
If the forecasted traffic is greater than the current traffic, then the forecast is right
This is one approach, and I agree it is rough. There are multiple ways of assessing the error in predictions, but this isn't the place to discuss them.
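As a toy example of this rule with made-up state values for a single segment:

```python
# Hypothetical observed and forecasted states for one tram segment
estat_actual, estat_previst = 3.0, 4.0

# load as a percentage of the forecast; <= 100 means the forecast held
diff_perc = estat_actual / estat_previst * 100
was_forecast_right = diff_perc <= 100

print(diff_perc, was_forecast_right)  # 75.0 True
```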
# caveat: this is an *aggregated* measure of the whole month, but it provides
# a rough measure
monthly_traffic = trf.groupby(by=['idTram'], as_index=True).mean()
monthly_traffic["diff_perc"] = monthly_traffic["estatActual"]/monthly_traffic["estatPrevist"]*100
monthly_traffic["was_forecast_right"] = monthly_traffic["diff_perc"] <= 100
monthly_traffic.head()
How busy were trams on average, on an hour basis, per tram line, during the month of May?¶
# line continuation according to PEP8
# see https://stackoverflow.com/a/17630918/5819113
trams_hourly = (
# drop unwanted columns
trf.drop(columns=['data', 'idTram'])
# group by desired columns
    # group by tram id and hour of the day (the heading asks for hourly averages)
    .groupby(by=[trf['idTram'], trf["data_datetime"].dt.hour], as_index=True)
# show the mean of the aggregated values
.mean()
)
trams_hourly["diff_perc"] = trams_hourly["estatActual"]/trams_hourly["estatPrevist"]*100
trams_hourly["was_forecast_right"] = trams_hourly["diff_perc"] <= 100
trams_hourly.head(24) # first 24 hours
Baking everything together¶
Let's merge dataframes and see what we get. This is my first time using these libraries, so my code has a lot of comments that hopefully explain what I'm doing. I've taken a lot from the examples given in the folium repo, so please go there to see what else can be done!
# join dataframes, to *see* what happens. But make a copy first.
_trf_to_merge = trf.copy()
_streets_to_merge = streets.copy()
# merge both dataframes using the id of the tram as common column
_merged = _trf_to_merge.merge(_streets_to_merge, left_on="idTram", right_on="Tram")
# drop unnecessary columns (raw date, duplicated id, raw coordinates)
_merged = _merged.drop(columns=["data", "Tram", "Coordenades"])
_merged.head()
# `estatActual` and `estatPrevist` vary from 1 to 5
_merged.describe()
_merged.info()
Plotting combined data¶
Grouping information by day¶
from geopandas import GeoDataFrame
crs = {'init': 'epsg:4326'}
merged_gdf = GeoDataFrame(_merged, crs=crs, geometry=_merged['geometry'])
# create a colormap using branca
# following the examples from
# https://nbviewer.jupyter.org/github/python-visualization/folium/blob/master/examples/plugin-Search.ipynb
import branca
colorscale = branca.colormap.linear.viridis.scale(1, 5)
colorscale.caption="Tram usage level"
def style_function(feature):
estat_actual = int(feature['properties']['estatActual']) or None
return {
'alpha': 0.5,
'weight': 3,
        'color': 'black' if estat_actual is None else colorscale(estat_actual)
}
# we will manipulate the dataframe to plot data using time scales,
# since plotting the whole dataframe is too expensive memory-wise
first_10_am = merged_gdf[(merged_gdf["data_datetime"].dt.day == 1) & (merged_gdf["data_datetime"].dt.hour == 10)]
# we need to drop this column because datetime values are not JSON serializable
first_10_am = first_10_am.drop(columns="data_datetime", inplace=False)
# plot the map
barca_map_colors = folium.Map([41.3947,2.1557], zoom_start=12.4, tiles='cartodbpositron')
(
folium.GeoJson(
first_10_am,
style_function=style_function)
.add_to(barca_map_colors)
)
# add legend, see <https://github.com/python-visualization/folium/issues/528>
barca_map_colors.add_child(colorscale)
# plot map
barca_map_colors
# we will manipulate the dataframe to plot data using time scales
gdf_hourly = (
# group by desired columns
merged_gdf.groupby(
by=[merged_gdf['idTram'],
merged_gdf["data_datetime"].dt.day, # here is where the grouping happens
],
as_index=True)
# show the mean of the aggregated values
.mean()
.reset_index()
)
gdf_hourly["diff_perc"] = gdf_hourly["estatActual"]/gdf_hourly["estatPrevist"]*100
gdf_hourly["was_forecast_right"] = gdf_hourly["diff_perc"] <= 100
gdf_hourly_w_geometry = gdf_hourly.merge(streets, left_on="idTram", right_on="Tram")
gdf_hourly_w_geometry.tail()
Plot daily average values¶
- We are using a discrete colorscale with 6 colors, scaled from 1 to 5, to indicate the mean of estatActual for each Tram line on May 27 (the color selection is arbitrary)
- We are drawing the lines whose capacity forecast was right (i.e. the actual capacity was at or below the forecasted value) with a dashed line, and those whose forecast was wrong (i.e. the forecast fell short that day) with a continuous line. This is because of the limited styles that leaflet provides out of the box; any work on improving the style would have to be done with CSS, which is out of the scope of this challenge.
import branca
from folium.plugins import Search
from geopandas import GeoDataFrame
# produce the geopandas dataframe
crs = {'init': 'epsg:4326'}
gdf_hourly = GeoDataFrame(
gdf_hourly_w_geometry[gdf_hourly_w_geometry['data_datetime']==27], # here is where the filtering happens
crs=crs,
geometry=gdf_hourly_w_geometry['geometry']
)
# create colorscale
colorscale = branca.colormap.step.Accent_06.scale(1, 5)
colorscale.caption="estatActual (Daily average)"
# define a method for styling lines
def style_function_capacity(feature):
not_exceeded = int(feature['properties']['was_forecast_right']) or None
estat_actual = int(feature['properties']['estatActual']) or None
# see for other options https://leafletjs.com/reference-1.5.0.html#path-option
return {
'alpha': 0.5,
'dashArray': '5,10' if not_exceeded else '0', # dashed line unless forecast is wrong
'weight': 3,
        'color': 'black' if estat_actual is None else colorscale(estat_actual),
}
# plot the map of barcelona
hourly_map = folium.Map([41.3947,2.1557], zoom_start=12.4, tiles='cartodbpositron')
# plot the tram lines
tram_map = (
folium.GeoJson(
gdf_hourly,
name="Barcelona Tram Capacity (27 May 2019)",
style_function=style_function_capacity,
tooltip=folium.GeoJsonTooltip(
fields=['idTram', 'estatActual', 'estatPrevist', 'diff_perc', 'Descripció'],
aliases=['Tram ID', 'Observed Capacity', 'Forecasted Capacity', 'Load (%)', 'Description'],
localize=True)
)
.add_to(hourly_map)
)
# add legend, see <https://github.com/python-visualization/folium/issues/528>
hourly_map.add_child(colorscale)
# add a searchbox bound to the Description of lines
statesearch = Search(
layer=tram_map,
geom_type='Line',
placeholder='Search for a Tram Description',
collapsed=False,
search_label='Descripció',
weight=3
).add_to(hourly_map)
# plot map
hourly_map
Final words¶
Thanks for reading until here! Regarding the data, I've explored manually isolated cases, so my conclusions need a bit of work. I'll refer only to the last map being shown (May 27).
- It's not common to observe capacity levels over 3
- Traffic loads range from 90 to 110%. Forecasts are really tight, but it seems there's no need to account for more.
- It may be worth exploring plots of diff_perc and other error measurements.
It's easy to extend these maps to any time of the day, using the dt accessor and the data_datetime column as a pivot for indexing.
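For instance, a minimal sketch of that kind of filtering, using a tiny made-up dataframe in place of the merged one:

```python
import pandas as pd

# toy dataframe with a datetime column, mimicking data_datetime
df = pd.DataFrame({
    "data_datetime": pd.to_datetime([
        "2019-05-01 10:05:00",
        "2019-05-01 18:00:00",
        "2019-05-27 10:30:00",
    ])
})

# select rows for day 27 at 10 AM using the dt accessor
mask = (df["data_datetime"].dt.day == 27) & (df["data_datetime"].dt.hour == 10)
print(mask.sum())  # 1
```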
My intention with this notebook is to show the power of leaflet and python, and the libraries available to manipulate geocoded data, like geopandas and shapely. I've found that there is an application that uses this data; check https://com-shi-va.barcelona.cat/en/transit-
Until next time.