A step-by-step guide to analysing Strava data with Python
A couple of years ago I taught myself how to use R by analysing my Strava data. It was a fun way to learn and combined two of my passions – data and running (well, sport in general).
Now, I’m doing the same in Python. In this post I’ll be sharing the steps I took in performing some data wrangling and Exploratory Data Analysis (EDA) of my Strava data, whilst hopefully pulling out some interesting insights and sharing useful tips for other Python beginners. I am relatively new to Python but I find that documenting the process helps me learn and might help you too!
Downloading your Strava data
First off, we need to get our dataset. I downloaded my data as a CSV from the Strava website, but you can also connect directly to the Strava API.
To get your data from the Strava website, navigate to your profile by clicking your icon on the top right hand side of the page. Navigate to ‘My Account’ and hit ‘Get Started’ at the bottom of the page.
On the next page you'll see three options. Underneath option 2, 'Download Request', hit 'Request Your Archive'. Shortly after (usually under an hour, depending on the size of the archive), you'll receive a zip file at the email address associated with your Strava account. The file you want is called activities.csv; I tend to ignore the rest of the files in there.
Navigate to your Jupyter Notebook and upload the CSV by hitting the upload files button, beneath ‘View’.
You should now see your CSV in the file browser, on the left hand side of your Jupyter notebook.
Import Libraries
Now we need to import the libraries we'll be using for the analysis.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from datetime import datetime as dt
Next we need to read the CSV we've uploaded into a table format using the pandas library, which is a very powerful and popular framework for data analysis and manipulation. (I've renamed my activities.csv file, hence the filename below.)
df = pd.read_csv('strava_oct_22.csv') #read in csv
df.columns=df.columns.str.lower() #change columns to lower case
Data Wrangling
I know from past experience that the activities download from Strava includes a whole host of nonsensical and irrelevant fields. Therefore, I want to cleanse the dataset a little before getting stuck in.
Below are a couple of functions which help you get a feel for your dataset:
df.shape
(818, 84)
The shape attribute tells you the number of rows and columns within your dataset. I can see I have 818 rows of data (each row is unique to an individual activity) and 84 columns (also known as variables or features). I know the majority of these columns will be useless to me, as they're either NaN values (Not a Number) or just not useful, but to confirm we can use the .head() function, which returns the first 5 rows of the dataset.
We can also use .info() to bring back all the column titles for the variables within the dataset.
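Both are one-liners:
df.head() # preview the first 5 rows of the dataframe
df.info() # column names, dtypes and non-null counts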
You can also see the 'Non-Null Count' against each variable in the .info() output, which gives you an idea of how many rows of that variable are populated. Let's get rid of the unnecessary noise by selecting only the columns we think will be relevant.
#Create new dataframe with only columns I care about
cols = ['activity id', 'activity date', 'activity type', 'elapsed time', 'moving time', 'distance',
'max heart rate', 'elevation gain', 'max speed', 'calories'
]
df = df[cols]
df
That looks better, but the dataset is still missing some key variables that I want in my analysis and which aren't part of the extract by default, for example average pace and km per hour, so I need to compute them myself. Before I do that, I need to double-check the datatypes in my new data frame to ensure each field is in the format I need it to be in. We can use .dtypes to do this.
df.dtypes
I can see that the majority of the fields in the dataset are numeric (int64 and float64 are both numeric types; float64 holds decimals). However, activity date is an object, which I will need to convert to a datetime datatype when I come to look at time series analysis. Similarly, distance is an object, which I need to convert to a numeric value.
I can use pandas' to_datetime function to convert activity date from an object to a datetime datatype. We can also create some additional variables from activity date, i.e. pulling out the month, year and time, which we will use later in the analysis.
#Break date into start time and date
df['activity_date'] = pd.to_datetime(df['activity date'])
df['start_time'] = df['activity_date'].dt.time
df['start_date_local'] = df['activity_date'].dt.date
df['month'] = df['activity_date'].dt.month_name()
df['year'] = df['activity_date'].dt.year
df['year'] = df['year'].astype(object) #change year from numeric to object (np.object is deprecated in recent NumPy versions)
df['dayofyear'] = df['activity_date'].dt.dayofyear
df['dayofyear'] = pd.to_numeric(df['dayofyear'])
df.head(3)
Next we need to convert distance to a numeric value using the to_numeric function from the pandas library; passing errors='coerce' replaces any value that can't be parsed as a number with NaN rather than raising an error. This will allow us to create the new variables needed for the analysis.
#convert distance from object to numeric
df['distance'] = pd.to_numeric(df['distance'], errors = 'coerce')
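To illustrate with some made-up values (not from the Strava extract), anything that can't be parsed becomes NaN:
pd.to_numeric(pd.Series(['5.2', '10.0', '--']), errors='coerce')
#0     5.2
#1    10.0
#2     NaN
#dtype: float64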
Now that distance is a numeric value, I can create some additional variables to include in my data frame.
#Create extra columns for metrics which aren't in the dataset already
df['elapsed minutes'] = df['elapsed time'] / 60 # elapsed time is recorded in seconds
df['km per hour'] = df['distance'] / (df['elapsed minutes'] / 60)
df['avg pace'] = df['elapsed minutes'] / df['distance'] # minutes per km
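As a quick sanity check on those formulas, with made-up numbers rather than my data: a 5km run in 25 minutes should come out at 12 km/h and a pace of 5 minutes per km.
elapsed_minutes = 25 # hypothetical run: 25 minutes
distance_km = 5 # hypothetical run: 5 km
print(distance_km / (elapsed_minutes / 60)) # 12.0 km per hour
print(elapsed_minutes / distance_km) # 5.0 minutes per km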
Since I have added and amended some variables, let’s use .dtypes again to check the data frame is in the right format.
That looks much better. Lastly, before we move on to the EDA, I want my dataset to include runs only, as I know the majority of my activities are of this type. To confirm, I can count the number of activities of each activity type using .value_counts().
df['activity type'].value_counts()
As you can see, the vast majority of my activities are runs, so I'm going to make a new data frame called 'runs' and focus exclusively on that. I also know there are a few erroneous entries in the data, where I may have forgotten to stop my watch or Strava was having a meltdown, so I'm also going to filter out some extreme results.
runs = df.loc[df['activity type'] == 'Run']
runs = runs.loc[runs['distance'] <= 500] # drop implausibly long distances
runs = runs.loc[runs['elevation gain'] <= 750] # drop implausible elevation gains
runs = runs.loc[runs['elapsed minutes'] <= 300] # drop runs longer than 5 hours
runs = runs.loc[runs['year'] >= 2018] # keep 2018 onwards
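As an aside, the same filtering can be written as one boolean mask; a sketch of the equivalent version (the .copy() also sidesteps pandas' SettingWithCopyWarning when we add new columns to runs later):
mask = (
    (df['activity type'] == 'Run')
    & (df['distance'] <= 500)
    & (df['elevation gain'] <= 750)
    & (df['elapsed minutes'] <= 300)
    & (df['year'] >= 2018)
)
runs = df.loc[mask].copy()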
Exploratory Data Analysis
We now have a cleansed dataset that we can start to visualise and pull out interesting insights from. A good first step in EDA is to create a pairs plot, which quickly allows you to see both distribution of single variables and relationships between two variables. This is a great method to identify trends for follow-up analysis. If you have lots of variables in your dataset this can get messy, so I’m going to pick a few to focus on.
pp_df = runs[['distance', 'elevation gain', 'km per hour', 'max heart rate', 'calories']]
sns.pairplot(pp_df);
That one line of code is pretty powerful, and we can already pull out some useful insights. The histograms on the diagonal allow us to see the distribution of a single variable, whilst the scatter plots on the upper and lower triangles show the relationship (or lack thereof) between two variables. I can see that there is a positive correlation between distance and calories. I can see on the distance histogram that there is a right (positive) skew, meaning more of my runs are of a shorter distance, with a long tail of longer ones. It's also interesting that the calories distribution is skewed the opposite way to km per hour and max heart rate: pushing yourself harder doesn't necessarily mean more calories burned.
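To put numbers on those relationships, we can compute the pairwise correlations for the same columns:
pp_df.corr().round(2) # Pearson correlation matrix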
The pairs plot is a nice way to visualise your data, but you might also want some summary statistics for your dataset, to get an idea of the mean, median, standard deviation and so on. To do this, use the .describe() function.
runs.describe().round(0)
I really like the describe function; it's a quick way to get a summary of your data. My first reaction when I saw the above output was surprise that, for distance, the median (or 50th percentile) of my runs is just 6km! But then again, I think I've ramped up my distance only in the last year or so. The beauty of EDA is that as you perform initial investigations on data, you are able to discover patterns, spot anomalies, test hypotheses and check assumptions, so let's dig a little deeper into the distance variable.
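One quick way to do that is to run describe on the distance column alone:
runs['distance'].describe().round(1)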
I'm going to visualise the spread of distance by year using a boxplot, to see if my hypothesis that I have increased my distance in the last year is correct. A boxplot is a great way of showing a visual summary of the data, enabling us to identify median values, the dispersion of the dataset, and signs of skewness.
fig, ax = plt.subplots()
sns.set(style="whitegrid", font_scale=1)
sns.boxplot(x="year", y="distance", hue="year", data=runs)
ax.legend_.remove() # hue mirrors the x-axis, so the legend is redundant
plt.gcf().set_size_inches(9, 6)
As I thought, my runs have indeed increased in distance in the last couple of years, with the median distance increasing slightly each year from 2018 until 2022 when it increased much more considerably.
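We can confirm this numerically by grouping on year:
runs.groupby('year')['distance'].median().round(1) # median run distance per year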
Let's break out the years by month to see how distance covered varies throughout the year. We can do this using a bar plot from the seaborn library.
sns.set_style('white')
sns.barplot(x='month', y='distance', data=runs, hue='year', ci=None, estimator=np.sum, palette='hot',
            order=["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"])
plt.gcf().set_size_inches(17, 6)
plt.legend(loc='upper center')
It is clear from the above bar plot that my distance drops in the summer months. Taking 2022 as an example, in January, February and March 2022 I ran over 150km, notably more than I cover in April to August. This trend is evident in other years too.
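If you want the chart's numbers in table form, a pivot table gives the total kilometres by month and year:
runs.pivot_table(index='month', columns='year', values='distance', aggfunc='sum').round(0)
# note: the month index sorts alphabetically; reindex with the month order used above if needed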
Let’s create a new variable called season by grouping months into their respective season using .isin().
runs['season'] = 'unknown'
runs.loc[(runs["month"].isin(["March", "April", "May"])), 'season'] = 'Spring'
runs.loc[(runs["month"].isin(["June", "July", "August"])), 'season'] = 'Summer'
runs.loc[(runs["month"].isin(["September", "October", "November"])), 'season'] = 'Autumn'
runs.loc[(runs["month"].isin(["December", "January", "February"])), 'season'] = 'Winter'
We can now create another boxplot to visualise distance by season.
ax = sns.boxplot(x="season", y="distance", palette="Set2",
data=runs,
order =["Spring", 'Summer', 'Autumn', 'Winter'])
plt.gcf().set_size_inches(9, 7)
For those not familiar with boxplots, the bold line represents the median, the box represents the interquartile range (the middle 50% of the data), the whiskers extend to the rest of the distribution (by default up to 1.5 times the interquartile range beyond the box), and the dots beyond them represent outliers. I find it really interesting to see how my behaviour changes by season: it appears that I really am not a fan of long summer runs, with 75% of my summer runs coming in under 7km.
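You can read the same numbers off a grouped describe, which prints the quartiles behind each box:
runs.groupby('season')['distance'].describe().round(1)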
End Notes
We've covered quite a bit in this introduction to Python, including how to read a file into pandas, clean the dataset down to the relevant columns, and create additional variables to include in our analysis. We've generated some summary statistics to better understand our data, and started to visualise our dataset using matplotlib and seaborn.
Now that the dataset is cleansed and better understood, I'm looking forward to conducting further analysis on the data and bringing in some data science techniques. For example, I plan to build a linear regression model to predict my next 5k time, depending on variables such as the day, season and time of day, the elevation of the route, or even the temperature and wind speed if I include some weather data.
Thanks for reading, until the next one!