Xavier Bas
Posted on January 2, 2019
Have you ever wondered how you rank within your Strava club? Do you want to know the level of your friend's club? Would you like to find the riders in your Strava club whose riding performance is similar to yours?
Well, I asked myself similar questions and decided to investigate the Strava API and run a small analysis to group the members of a Strava club into n clusters based on riding performance 🚴♀️🚴 🚴🏿🚴🏻♀️🚴🏼♀️ 🚴🏼♂️🚴🏽♂️🚴🏾♀️🚴🏾♂️🚴🏽♀️🚴🏿♀️🚴♂️🚴🏻♂️💨💨💨 what a nice club ride this is!
Here I would like to focus on the clustering process rather than on the use of the Strava API, because I think the latter topic is widely covered out there. If you are not familiar with the API, you might want to have a look at the official documentation.
Let's make a start, shall we?
What is a club activity?
I feel like I have to start here: definitions.
To me, a club activity, and more specifically a club ride, is what happens on a typical Sunday: you wake up early, have a nice breakfast, dress yourself up in lycra and off you go for a bunch of hours with the guys from the club.
Well, it turns out that definition is not shared within the Strava world. Stay with me here. Strava considers the club activities of a specific club to be all the activities of all the members of that club. In plain English, if you join a club, all your activities will be listed as club activities of that club.
The plan
The plan is simple - the simpler the better, they say. First and foremost, we will make sure we have supplies of your favorite hot beverage, in my case Earl Grey tea 🍵. IMHO this should always be the first step before attempting anything glorious. OK, moving forward..
The idea is to retrieve as much data as possible about the rides of the club of interest and do some data cleaning. Once we are happy with the data we will get excited, as we will be ready to cluster the club rides.
Cracking on
My kettle is on, my Earl Grey is almost ready ☑️
It's time to look at some code:
import requests
ACCESS_TOKEN = 'your_access_token_here'
n_clubs = 30
endpoint = "https://www.strava.com/api/v3/athlete/clubs?page=1&per_page={}&access_token={}"
r = requests.get(endpoint.format(n_clubs,ACCESS_TOKEN))
my_clubs = r.json()
I think this step was not detailed in the plan 🙃 Anyway, what this does is get a list of all the Strava clubs you have joined. In there you should be able to find the key id for each of your clubs. Once you have identified the club you are interested in, make a note of its id. From now on we will call this club_id.
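If you would rather pick the id out in code than by eye, something like this could work. It is only a sketch: it assumes each entry in my_clubs is a dictionary with id and name keys, and both find_club_id and the sample clubs below are made up for illustration.

```python
def find_club_id(clubs, name):
    """Return the id of the first club whose name matches, ignoring case."""
    for club in clubs:
        if club.get("name", "").lower() == name.lower():
            return club["id"]
    raise ValueError(f"No club named {name!r} in the list")


# Made-up sample standing in for the my_clubs list returned by the API
sample_clubs = [
    {"id": 101, "name": "Village Wheelers"},
    {"id": 202, "name": "Sunday Lycra Society"},
]

club_id = find_club_id(sample_clubs, "sunday lycra society")
print(club_id)
```

With the real data you would call find_club_id(my_clubs, ...) with the name of your own club.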
Note that you need an access token here. If you know how to get one, that is good news for you. If not, I'm afraid I won't be covering it here, sorry. I believe other people who communicate far better than me have already posted how to get yours.
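A side note on the token: the snippets in this post pass it in the URL, but the Strava API also accepts it in an Authorization: Bearer header, which keeps it out of URLs and logs. Here is a minimal sketch of that variant. build_request is a hypothetical helper of mine, not part of any library, and it only assembles the pieces without sending anything.

```python
def build_request(path, access_token, **params):
    """Assemble URL, query parameters and headers for a Strava API call."""
    url = "https://www.strava.com/api/v3/" + path.lstrip("/")
    headers = {"Authorization": "Bearer " + access_token}
    return url, params, headers


url, params, headers = build_request(
    "athlete/clubs", "your_access_token_here", page=1, per_page=30
)
# These can then be handed to requests:
#   r = requests.get(url, params=params, headers=headers)
```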
Now, as planned, we will use the club_id to retrieve as much data as we are allowed to:
import pandas as pd
endpoint = "https://www.strava.com/api/v3/clubs/{}/activities?&page={}&per_page={}&access_token={}"
df = None
for ii in range(2):
    r = requests.get(endpoint.format(str(club_id),str(ii+1),'100',ACCESS_TOKEN))
    club_activity = r.json()
    df = pd.concat([df, pd.DataFrame(club_activity)])
df = df.reset_index(drop=True)
# Unpack the nested athlete dictionary into columns
# (Series.iteritems() was removed in pandas 2.0, so we build the frame from a list)
df = pd.concat([df, pd.DataFrame(df['athlete'].tolist())], axis=1)
df.drop(['athlete','resource_state'],axis=1,inplace=True)
df['full_name'] = df.firstname + ' ' + df.lastname
This should result in the generation of a DataFrame with basic information about the club activities of the club of interest.
In[]: df.head()
Out[]:
distance elapsed_time moving_time name
0 55359.5 6709 6709 Afternoon Ride
1 23363.7 5911 5595 Afternoon Ride
2 28746.8 4961 4823 Afternoon Ride
3 64576.7 13551 10647 Afternoon Ride
4 24094.0 2712 2712 Morning Ride
total_elevation_gain type workout_type full_name(*)
0 816.0 Ride 10.0 Sanglier
1 427.0 Ride NaN Julius Pompilius
2 724.0 Ride NaN Moralélastix
3 1343.7 Ride NaN Amnésix
4 146.0 VirtualRide NaN Sténograf
(*) For privacy reasons I will display Astérix characters instead of the actual names.
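The fetching loop above asks for exactly two pages of 100 activities. If you prefer to keep going until the API runs out, one option is to stop at the first empty page. This is a sketch under assumptions: fetch_page is a hypothetical callable I introduce so the logic can be tried without a network connection (in practice it would wrap the requests.get call), and I am assuming that an exhausted listing comes back as an empty list.

```python
def fetch_all_pages(fetch_page, max_pages=10):
    """Collect items page by page until a page comes back empty."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break  # nothing left to read
        items.extend(batch)
    return items


# Fake API with two pages of data, used only to exercise the loop
fake_pages = {1: [{"name": "Morning Ride"}], 2: [{"name": "Afternoon Ride"}]}
activities = fetch_all_pages(lambda page: fake_pages.get(page, []))
print(len(activities))
```

The max_pages cap is there so a misbehaving fetcher cannot loop forever.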
A few notes here:
- Thankfully, units seem to be in SI (distance in metres, time in seconds). That's a nice touch! 🙌🙌
- We have two features describing time in the data above: elapsed_time should include breaks, whereas moving_time should be what the name describes. If that were the case, I would expect elapsed time to always be higher than moving time. As you can see, this is not the case: some rides show identical values, which makes me think they were not logged with autopause turned on 🤦♂️ argh, come on guys!
- Average speed is not shown, so we compute it with
df['speed_kph'] = df.distance/df.moving_time*3.6
Sorry for those folks who don't autopause, as their speed will be reduced.
- It appears that not everybody is hitting the road: Sténograf was pretty comfortable doing an early session at home!
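To make the two time columns concrete, here is a small check in plain Python on the five rows from df.head() above. Rides whose elapsed and moving times are identical are the ones I suspect were recorded without autopause. The tuple layout and the helper name are my own choices for this sketch.

```python
# (distance in m, elapsed_time in s, moving_time in s) from df.head() above
rides = [
    (55359.5, 6709, 6709),
    (23363.7, 5911, 5595),
    (28746.8, 4961, 4823),
    (64576.7, 13551, 10647),
    (24094.0, 2712, 2712),
]


def speed_kph(distance_m, moving_time_s):
    """Average speed: metres per second converted to km/h."""
    return distance_m / moving_time_s * 3.6


speeds = [speed_kph(d, moving) for d, _, moving in rides]

# Row positions where no pause was recorded at all
no_pause = [i for i, (_, elapsed, moving) in enumerate(rides) if elapsed == moving]
print(no_pause)  # [0, 4]
```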
Rearranging the data
We would like to have the data indexed by athlete. One way to achieve this is to use the groupby method chained with the mean statistic.
summary = df[df.type=='Ride'].groupby('full_name')[['distance','total_elevation_gain','speed_kph']].mean()
In[]: summary.head()
Out[]:
distance total_elevation_gain speed_kph
full_name
Abraracourcix 34325.1 502.1 15.2
Absolumentexclus 50507.7 796.8 23.7
Amnésix 48812.6 981.7 21.5
Amonbofis 54889.8 1014.0 20.5
Aplusbégalix 92074.0 956.0 27.5
So you see, this gives us the 3 features - distance, total elevation gain and speed - for each rider. Note that we disregard virtual rides by filtering on the type of activity. Next, we will use precisely this data to cluster the riders into groups.
Clustering
I'm running low on tea.. hold on a minute, this section deserves a bit more than just tea. I think biscuits will do 😋
A good practice when dealing with machine learning algorithms is rescaling your data. In this case we will preprocess the data with the minmax scaler. This scales each feature so that its values fall within a given range, typically between 0 and 1. Then we will feed these values to the K-means clustering algorithm.
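For intuition, min-max scaling maps each value x of a feature to (x - min) / (max - min), one column at a time. Here is a minimal sketch in plain Python using the five average speeds from summary.head() above. minmax is my own little helper for illustration, not the scikit-learn function.

```python
def minmax(values):
    """Rescale values linearly so the smallest is 0 and the largest is 1.

    Assumes at least two distinct values, otherwise we would divide by zero.
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


speeds_kph = [15.2, 23.7, 21.5, 20.5, 27.5]
scaled = minmax(speeds_kph)
print([round(s, 2) for s in scaled])
```

The slowest rider ends up at 0, the fastest at 1 and everybody else in between, so no feature dominates just because its raw numbers are bigger.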
import numpy as np
from sklearn.preprocessing import minmax_scale
from sklearn.cluster import KMeans
X = minmax_scale(np.array(summary))
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
summary['cluster'] = kmeans.labels_
Pretty quick, isn't it? Well, let's have a look at the results before jumping to conclusions. I would like to plot the athletes' performance through our 3 features while displaying the groups we've just made.
import matplotlib.pyplot as plt
import seaborn as sns
_= plt.figure()
_= plt.subplots_adjust(hspace=0,wspace=0)
_= plt.subplot(221)
_= sns.scatterplot(x=summary.distance/1000,y='total_elevation_gain',data=summary,hue=summary.cluster,legend=False)
_= plt.subplot(223)
_= sns.scatterplot(x=summary.distance/1000,y='speed_kph',data=summary,hue=summary.cluster,legend=False)
_= plt.subplot(224)
plt.yticks([])
_= sns.scatterplot(x='total_elevation_gain',y='speed_kph',data=summary,hue=summary.cluster,legend=False)
Nice plot, but I'm not entirely satisfied with it. Sure, we have succeeded in clustering the riders into 3 groups, or say 3 teams. Hold the champagne for now. It is good news that the riders are well grouped by the distance they cover, but looking a bit closer, some of these teams are quite unbalanced in terms of speed 😓. Look at the bottom subplots - distance vs speed and total elevation gain vs speed - and pay attention to the blue team. Their range in speed is huge, and remember this is average speed!! I personally wouldn't like to be in the blue team: if you are a top rider you do nothing but wait for the rest, and if you are the slowest rider there.. what a nightmare this has to be!!
We need a second attempt.
We would like to have a smaller range in speed within each group so that all riders can easily keep up with the pace of the group. This means the speed feature needs to matter more than the rest. How do you implement this concept? The key is in the scaling. After the minmax scaling we will multiply the speed by a factor of 2 and leave the other features as they are. This will do the trick.
X_weighted = np.multiply(X, np.tile([1,1,2], (len(X), 1)))
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_weighted)
summary['cluster2'] = kmeans.labels_
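Why does multiplying by 2 help? K-means assigns riders to groups by Euclidean distance, so stretching one axis makes gaps along that axis count for more. Here is a minimal sketch with three made-up riders in scaled units. The riders and the weighted_distance helper are mine, purely for illustration.

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance after multiplying each feature by its weight."""
    return math.sqrt(sum((w * (x - y)) ** 2 for x, y, w in zip(a, b, weights)))


# Made-up riders as (distance, elevation gain, speed), already scaled to 0..1
rider_a = (0.5, 0.5, 0.2)
rider_b = (0.5, 0.5, 0.6)  # same as rider_a except faster
rider_c = (0.9, 0.5, 0.2)  # same as rider_a except longer rides

plain = [1, 1, 1]
speed_heavy = [1, 1, 2]

speed_gap_plain = weighted_distance(rider_a, rider_b, plain)
dist_gap_plain = weighted_distance(rider_a, rider_c, plain)
speed_gap_heavy = weighted_distance(rider_a, rider_b, speed_heavy)
dist_gap_heavy = weighted_distance(rider_a, rider_c, speed_heavy)
print(speed_gap_plain, dist_gap_plain, speed_gap_heavy, dist_gap_heavy)
```

With equal weights both gaps are about 0.4. With the speed weight at 2, the speed gap doubles to about 0.8 while the distance gap stays put, so riders with different speeds are less likely to land in the same group.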
And now we create the same figure:
_= plt.figure()
_= plt.subplots_adjust(hspace=0,wspace=0)
_= plt.subplot(221)
_= sns.scatterplot(x=summary.distance/1000,y='total_elevation_gain',data=summary,hue=summary.cluster2,legend=False)
_= plt.subplot(223)
_= sns.scatterplot(x=summary.distance/1000,y='speed_kph',data=summary,hue=summary.cluster2,legend=False)
_= plt.subplot(224)
plt.yticks([])
_= sns.scatterplot(x='total_elevation_gain',y='speed_kph',data=summary,hue=summary.cluster2,legend=False)
This looks much, much better now: riders are grouped by the distance they cover, how high they climb and how fast they ride, while the spread in average speed within each group is kept low.
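If you want a number to go with the eyeball check, one option is to compare the speed range inside each group for the two labelings. This is a sketch on made-up values rather than the real club, and speed_range_by_group is a hypothetical helper of mine.

```python
from collections import defaultdict


def speed_range_by_group(speeds, labels):
    """Return max minus min speed for each cluster label."""
    groups = defaultdict(list)
    for speed, label in zip(speeds, labels):
        groups[label].append(speed)
    return {label: max(vals) - min(vals) for label, vals in groups.items()}


# Made-up average speeds (km/h) and two made-up labelings of the same riders
speeds = [15.0, 27.0, 21.0, 22.0, 16.0, 26.0]
first_try = [0, 0, 1, 1, 2, 2]
second_try = [0, 2, 1, 1, 0, 2]

widest_first = max(speed_range_by_group(speeds, first_try).values())
widest_second = max(speed_range_by_group(speeds, second_try).values())
print(widest_first, widest_second)
```

On the real data the same helper should accept summary.speed_kph together with summary.cluster or summary.cluster2, since it only iterates over the values.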
And there you go, how to cluster your Strava club rides. Time to open the bottle of champagne 🍾