Ingesting API data to Microsoft Fabric with low-code Dataflows: A Practical Example using the Strava API
Jon Stjernegaard Vöge
Posted on September 18, 2024
Ingesting API data to Microsoft Fabric with low-code Dataflows: A Practical Example using the Strava API
Introduction
Working with API data is often a task reserved for pro-coders. But in Microsoft Fabric, we can leverage the low-code Dataflows Gen2 to connect to any API and output the data to our Fabric Data Lakehouse or Data Warehouse.
The article below takes you through the entire process, step by step, of connecting to Stravas API, setting up automatic refresh of Authorization Tokens, Downloading and transforming all Athlete Activities, and outputting them to a Microsoft Fabric data destination.
Strava prerequisites
For this guide, I am using the Strava API, as it contains data that I as a hobby cyclist have a personal interest in extracting and visualizing. However, many of the principles described will apply to other APIs from applications you use. I may do a Spotify example in a future blogpost.
To get started with the Strava API, simply register an account or login to Strava.com. Then, go to Strava | API and create an App. Then extract the following pieces of information and keep them somewhere safe:
Client ID
Client Secret
Don’t bother with the Authorization Token and Refresh Token as we will get these programatically.
User Authorization
Next, you’ll need to Authorize your newly created app to access the data of one or more Strava users.
This is done by sharing the following link with the person(s) you wish to extract data from, and have them login using their Strava Account, and press Authorize:
https://www.strava.com/oauth/authorize?client_id={InsertYourClientIDHere}&response_type=code&redirect_uri=http://localhost/exchange_token&approval_prompt=force&scope=read,activity:read
After clicking ‘Authorize’, you will be redirected to a localhost URL, and you should now copy the ‘Code’ part of the URL displayed:
Example: http://localhost/exchange_token?state=&code=f97ce75c0113eba8926629369e5843242cc8e9be&scope=read,activity:read
Copy this part: f97ce75c0113eba8926629369e5843242cc8e9be
Note that this code is shortlived, and will need to be used within a few hours for proper API authorization. Luckily, we will do that right away with Postman.
Initial Token Exchange using Postman
Next, we need to do the initial Token Exchange by using the code we just retrieved. I prefer doing this with Postman, but you can use whichever tool you wish.
In Postman, create a new POST request, adding “https://www.strava.com/oauth/token" to the URL. Then add the following four parameters:
client_id: Add the Client ID you retrieved from your Strava API app
client_secret: Add the Client Secret you retrieved from your Strava API app
code: Add the code you got from the user authorization URL in the previous step
grant_type: Add the value “authorization_code”
In Postman, these parameters will automatically be added to your POST request URL.
With all the details inputted, press Send, and you should receive back a Status Code 200 “OK” response, with a JSON response that includes a Bearer + Refresh Token valid for the User in question.
Below you can see my sample request + result, retrieving a Bearer + Refresh token for my own Strava User Profile:
Write down the Refresh_Token and Access_Token in the same place as your Client ID and Secret. We will need it in the following step.
ALSO write down the Athlede ID, as we need that as well.
Creating our Dataflow + Setting up automatic Token Refresh
Now its finally time to jump into Microsoft Fabric.
First, find a suitable Workspace, and create a Warehouse or Lakehouse to be used as Data Destination for the Strava Data. Then, create a new Dataflow Gen2 to be used for the Strava API:
Next, we need to create a query in our Dataflow to refresh our Access Token, so that we don’t accidentally try to download data with an expired token.
In your Dataflow create a new blank query, and paste the following into the Advanced Editor:
let
URL = "https://www.strava.com/api/v3/oauth/token",
Headers = [#"Content-Type"="application/json"],
Body = "{""client_id"": ""InsertYourClientIDHere"", ""client_secret"": ""InsertYourClientSecretHere"", ""grant_type"": ""refresh_token"", ""refresh_token"": ""InsertYourRefreshTokenHere""}",
Source = Json.Document(Web.Contents(URL, [Headers = Headers, Content = Text.ToBinary(Body)])),
#"Converted to table" = Record.ToTable(Source),
#"Filtered rows" = Table.SelectRows(#"Converted to table", each ([Name] = "access_token")),
#"Drill down" = #"Filtered rows"{0}[Value]
in
#"Drill down"
The code above creates a Query to make a POST Request to the https://www.strava.com/api/v3/oauth/token URL just like we did in Postman, but this type the grant_type is “Refresh Token”, meaning we wish to refresh our access and get a new Access Token.
POST requests are notoriously difficult to work with in Power Query / Dataflows, but a few pointers if you find yourself working with them:
The difference between a GET and POST request in Dataflows / Power Query, is that a POST request will include a [Content = Text.ToBinary(InsertBodyHere)] parameter.
The Body of the Content parameter almost always needs to be wrapped in a Text.ToBinary() formula to be sent correctly to the API.
The Body should be formatted in JSON, which is also difficult to do. For this request, the syntax used is: “{“”parameterkey1"”: “”parametervalue1"”, “”parameterkey2"”: “”parametervalue2"”}”. Notice that all keys and values are wrapped in double quotes, and that the full curly bracket is also encased in quotes.. This is not a mistake, but very intentional, as this is the only format which will work
If you did everything correctly, the result should be a query which returns a new Refresh and Access Token, exactly like the Postman request. The difference is, that we no longer need that User Authorization URL code! Now we just need the refresh token, to get new access tokens.
Hence, we can call this query every time we wish to extract new data, to ensure we have the latest token:
Extracting Strava Data
It is finally time to extract some data from the API.
In my case, I want to retrieve Activity data for my user profile, as documented in the Strava API: Strava Developers
Downloading all my data involves some level of complexity, as the API includes pagination, allowing me to only retrieve 100 activities per API call. Hence, I first need to create one query with an API call to identify the number of Activities to download, and then another query to call each Page of Items to be downloaded from the API.
Number of Pages to download
First create a New Text Parameter, “AthleteID” and set the default value equal to the ID of the Athlete you authenticated earlier, which you wrote down after the initial token exchange.
Then, create a new blank query, “fxGetNumberOfPages”, and paste the following code:
(Token as text) =>
let
RelativePath = "/" & AthleteID & "/stats",
Payload = [RelativePath = RelativePath, Headers=[Authorization = Token]],
Source = Web.Contents("https://www.strava.com/api/v3/athletes", Payload),
JSON = Json.Document(Source)
in
JSON
This query is a reusable Function which will take an Authorization Token as input, and we will use that in another query.
Create another new query, “GetPages”, and paste the following code:
let
Source = fxGetNumberOfPages("Bearer " & GetRefreshToken),
#"Converted to table" = Record.ToTable(Source),
#"Filtered rows" = Table.SelectRows(#"Converted to table", each ([Name] = "all_ride_totals" or [Name] = "all_run_totals" or [Name] = "all_swim_totals")),
#"Expanded Value" = Table.ExpandRecordColumn(#"Filtered rows", "Value", {"count"}, {"Value.count"}),
#"Changed column type" = Table.TransformColumnTypes(#"Expanded Value", {{"Value.count", Int64.Type}}),
#"Calculated sum" = List.Sum(#"Changed column type"[Value.count]),
#"Divided value" = #"Calculated sum" / 100,
#"Rounded up" = Number.RoundUp(#"Divided value"),
#"Converted to table 1" = #table(1, {{#"Rounded up"}}),
#"Added custom" = Table.AddColumn(#"Converted to table 1", "StartPage", each 1),
#"Added custom 1" = Table.AddColumn(#"Added custom", "Pages", each {[StartPage]..[Column1]}),
#"Expanded Custom" = Table.ExpandListColumn(#"Added custom 1", "Pages"),
#"Removed columns" = Table.RemoveColumns(#"Expanded Custom", {"Column1", "StartPage"}),
#"Changed column type 1" = Table.TransformColumnTypes(#"Removed columns", {{"Pages", Text.Type}}) in
#"Changed column type 1"
This code will run the function we just defined, using the refreshed Access Token from ‘GetRefreshToken’ as the token input. The result is a JSON response from the API, which the subsequent steps in the query formats into a simple one column table, with one row for each page of items to be downloaded:
Downloading the pages
Create a new blank query, and paste in the following code:
let
Source = GetPages,
#"Added custom" = Table.TransformColumnTypes(Table.AddColumn(Source, "Token", each "Bearer " & GetRefreshToken), {{"Token", type text}}),
#"Added custom 1" = Table.AddColumn(#"Added custom", "RelativePath", each "/activities?page=" & [Pages] & "&per_page=100"),
#"Changed column type" = Table.TransformColumnTypes(#"Added custom 1", {{"RelativePath", type text}}),
#"Added custom 3" = Table.AddColumn(#"Changed column type", "Custom", each Web.Contents("https://www.strava.com/api/v3/athlete", [RelativePath=[RelativePath], Headers=[Authorization = [Token]]])),
CombineBinaries = Table.AddColumn(#"Added custom 3", "Custom2", each Json.Document([Custom]))
in
#"Added custom 3"
The query references the “GetPages” query to start from a one column table of pages to download. Then we add a column for the Access Token as well as a Column which constructs the Relative part of the URL.
Finally, the API GET request is performed by using Web.Contents(), and the contents are combined in one table:
I struggled quite a bit getting this to work well in Fabric. I was often able to get something working inside the Dataflow Editor, but most of the time the automatic refresh would fail. However, the above code, with separation of URL and Relative Path seems to work for me.
After a bit of transformation and filtering, you end up with a final table of all the athletes activities, and you can add your data destination of choice (in my case a Warehouse) before publishing your dataflow:
Summary
My Strava data now lives in my Fabric Data Warehouse, is automatically updated on a schedule, and can be used for any number of things. I can use it in my Semantic Models, create new Views and Queries, and use the data in Power BI reports as I please.
Working with APIs in Dataflows is not the easiest, and due to the editor sometimes being able to produce in-view results which are not achievable with a scheduled flow, they can even be hard to debug.
However, with a bit of persistence, trial and error, it is possible to use Dataflows to get data from any API into your Microsoft Fabric Data Platform.
Posted on September 18, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
September 18, 2024