Edoardo Sanna
Posted on January 13, 2023
Since I started pushing my code to GitHub, I've really liked the Top Languages stats in the repository's properties (as well as anuraghazra's beautiful github-readme-stats cards): I was just amazed that some hidden mechanism was able to tell the language of almost every file in the repository.
As I later found out, GitHub uses the Linguist library to measure how much of each repository is written in each language... which is still pretty magic 🪄.
🎯 Motivation
What surprised me is that there's no historical data behind these statistics: I was wondering if I could produce a nice line chart race showing the progress I've made over time in the languages I've been working on, similarly to the Top Programming Languages section of the Octoverse, maybe also adding some post-processing analytics, such as:
- what's the fastest growing language
- what's the language I've been working on most lately
- what's the least used language, or the 'sleeping' ones
⚠️ Warning: "Top Languages" don't indicate any skill level! But it's definitely interesting to use the number of lines of code written as an approximate metric of progress, even for simple motivational purposes! Wouldn't it be nice to see a line chart showing the amazing progress you've made across all your languages over, let's say, 10-20 years?
👶🏻 Baby steps
Apparently the GitHub REST Repositories API only has a list request (GET /repos/{owner}/{repo}/languages) returning the current list of languages used in a specific repo, but there's no place in GitHub where such historical data are stored.
What if I used this listing feature to take a daily snapshot of the languages used in all my repositories, and persisted it in a history table in a database?
Since I will use it to track my languages' progress, let's call it LangTracker.
Therefore, below I will try to:
- set up the starter background service
- add the authentication to GitHub API
- connect the service to a database
- deploy and run it
We will need the following ingredients:
- a background worker service
- a database
- a GitHub API client
🔄 Background Service
Since I'm learning .NET, I will create it as a background Worker Service.
Let's create the project using the worker template with the .NET CLI:
# Create new project from template
dotnet new worker --name LangTracker
# Create new solution
dotnet new sln --name LangTracker
# Add project to solution
dotnet sln ./LangTracker.sln add ./LangTracker/LangTracker.csproj
The basic Program.cs is quite simple:
// Program.cs
using LangTracker;

IHost host = Host.CreateDefaultBuilder(args)
    .ConfigureServices(services =>
    {
        services.AddHostedService<Worker>();
    })
    .Build();

await host.RunAsync();
It just creates a host and registers the hosted service Worker in the Dependency Injection container; it then builds the host and finally runs the application.
The worker's behavior is defined in the Worker.cs class, specifically in a while loop which runs indefinitely until the CancellationToken is triggered:
// Worker.cs
namespace LangTracker;

public class Worker : BackgroundService
{
    private readonly ILogger<Worker> _logger;

    public Worker(ILogger<Worker> logger)
    {
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Do something
            // ...

            // Wait 1000 ms
            await Task.Delay(1000, stoppingToken);
        }
    }
}
You may want to stretch the execution interval a little: to take a daily snapshot, I set it to 24 hours:
await Task.Delay(1000*3600*24, stoppingToken); // 👈🏻 <-- or whatever you prefer
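Task.Delay also has an overload taking a TimeSpan, which makes the intent a bit more explicit if you prefer:
// Equivalent, but arguably more readable
await Task.Delay(TimeSpan.FromHours(24), stoppingToken);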
🗄️ Database
I expect each language record to be saved in a single table with a DateTime property and a Size property (i.e. size in KB).
For the sake of simplicity, I'll be using PostgreSQL as RDBMS.
To manage the database, we'll first install the required packages:
dotnet add package Microsoft.EntityFrameworkCore
dotnet add package Microsoft.EntityFrameworkCore.Tools
dotnet add package Npgsql.EntityFrameworkCore.PostgreSQL
Let's then get started by modeling our GithubLanguage record with the following class:
// Models/GitHubLanguage.cs
using System.ComponentModel.DataAnnotations; // 👈🏻 <-- needed for the [Key] attribute

namespace LangTracker.Models
{
    public class GithubLanguage
    {
        [Key]
        public int Id { get; set; }
        public DateTime Date { get; set; }
        public string? Repo { get; set; }
        public string? Language { get; set; }
        public double Size { get; set; }
    }
}
We'll then scaffold a Database Context class with all the details of our database:
// Data/DbContext.cs
using Microsoft.EntityFrameworkCore;
using LangTracker.Models;

public class LangTrackerContext : DbContext // 👈🏻 <-- our context class
{
    private readonly IConfiguration? _configuration;
    public LangTrackerContext(IConfiguration configuration) => _configuration = configuration;

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        if (_configuration is not null)
        {
            optionsBuilder.UseNpgsql(_configuration.GetConnectionString("PostgreSQL"));
        }
        base.OnConfiguring(optionsBuilder);
    }
}
And, in the same class, map the model to the database:
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    // Map the entity to the "Languages" table in schema "dbschema"
    modelBuilder.Entity<GithubLanguage>().ToTable("Languages", "dbschema");
    modelBuilder.Entity<GithubLanguage>(entity =>
    {
        entity.HasKey(e => e.Id);
    });
    base.OnModelCreating(modelBuilder);
}
Remember to add the connection string in the appsettings.json file:
"ConnectionStrings": {
"PostgreSQL": "User ID=YOUR_POSTGRES_ID;Password=YOUR_POSTGRES_PASSWORD;Host=POSTGRES_HOSTNAME;Port=POSTGRES_PORT;Database=LangTracker;Pooling=true;"
},
⚠️ Npgsql 6+ expects UTC values when saving DateTimes as timestamp with time zone, and will throw when handed local datetimes, hence remember to add at the beginning of Program.cs:
// Pgsql-specific configuration for datetimes
AppContext.SetSwitch("Npgsql.EnableLegacyTimestampBehavior", true);
AppContext.SetSwitch("Npgsql.DisableDateTimeInfinityConversions", true);
Finally, let's create the database using Entity Framework Core's migrations, e.g. via the .NET CLI (note that dotnet ef requires the dotnet-ef global tool, installable with dotnet tool install --global dotnet-ef, plus the Microsoft.EntityFrameworkCore.Design package):
# Create the first migration, called InitialCreate
dotnet ef migrations add InitialCreate
# Update the database
dotnet ef database update
💻 GitHub client
Let's proceed with getting a personal access token for the API: you may generate one in your GitHub settings, under "Developer settings" > "Personal access tokens".
Then test it via CLI, for instance requesting the details of your user:
# bash
user="YOUR-GITHUB-USERNAME"
token="YOUR-GITHUB-PERSONAL-ACCESS-TOKEN"
curl -i -u "$user:$token" https://api.github.com/users/$user
The -i switch displays the HTTP headers: notice the content-type header (it should be application/json) and the x-ratelimit-limit header (the maximum number of requests available per hour; it should be 5000 for authenticated requests, with usage tracked by the x-ratelimit-remaining header).
In order not to hardcode our GitHub username and token, we will save them as environment variables and pass them to our background service.
So now, back to our Worker, we'll have to inject the configuration:
private readonly IConfiguration _configuration; // 👈🏻 <-- new field

public Worker(ILogger<Worker> logger, IConfiguration configuration) // 👈🏻 <-- add configuration here
{
    _logger = logger;
    _configuration = configuration; // 👈🏻 <-- and here
}
The default host builder already loads all the standard .NET configuration sources, which include non-prefixed environment variables and user secrets.
Hence, by adding a new set of environment variables:
GITHUB_LOGIN=[YOUR-GITHUB-USERNAME-HERE]
GITHUB_TOKEN=[YOUR-GITHUB-TOKEN-HERE]
We will be able to retrieve them just by adding this to our while loop:
// Read credentials from env variables
string? login = _configuration["GITHUB_LOGIN"];
string? token = _configuration["GITHUB_TOKEN"];
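Both will be null if the variables aren't set, so it may be worth failing fast; for example:
// Stop the worker if credentials are missing
if (string.IsNullOrEmpty(login) || string.IsNullOrEmpty(token))
{
    _logger.LogError("GITHUB_LOGIN and GITHUB_TOKEN must be set");
    return;
}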
To interact with the GitHub API, let's install the Octokit.NET library with dotnet add package Octokit, then instantiate a client as simply as:
// Instantiate Github Client
var client = new GitHubClient(new ProductHeaderValue("lang-tracker"));
// Authenticate with token
var tokenAuth = new Credentials(login, token);
client.Credentials = tokenAuth;
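As a side note, Octokit also accepts a token-only Credentials object, which works just as well for personal access tokens:
// Equivalent: authenticate with the token alone
client.Credentials = new Credentials(token);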
Fetch data
Once the client is authenticated with a specific login on the GitHub API, we should be able to retrieve all the repositories for a specific user:
// Get all repos, public & private, from current login
var repos = await client.Repository.GetAllForCurrent();
Then, loop over the found repositories (Octokit transparently handles the API pagination here), saving the current snapshot of all the languages' sizes in KB:
foreach (Repository repo in repos)
{
    // Filter away company repos & forked repos
    if (repo.Owner.Login == login && !repo.Fork)
    {
        // All languages in current repo
        var langs = await client.Repository.GetAllLanguages(repo.Id);
        foreach (var lang in langs)
        {
            // New language record
            var githubLangRecord = new GithubLanguage
            {
                Date = DateTime.Now,
                Repo = repo.Name,
                Language = lang.Name,
                Size = lang.NumberOfBytes / 1024.00
            };
        }
    }
}
The code above creates a new GithubLanguage object from the retrieved language, the corresponding repository and the size in KB, and labels it with the current datetime. I've skipped the company repositories and the forked ones, as I only want to keep track of my personal projects.
Save to database
Now, let's just add the operations required to save each record to the database:
// Assuming we've instantiated our context, e.g.
// using var dbContext = new LangTrackerContext(_configuration);
foreach (var lang in langs)
{
    var githubLangRecord = new GithubLanguage
    {
        // ...
    };
    dbContext.Add(githubLangRecord); // 👈🏻 <-- Add entity
}
dbContext.SaveChanges(); // 👈🏻 <-- Save into db
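To recap, here's roughly how the pieces fit together inside ExecuteAsync (a sketch assembled from the snippets above; LangTrackerContext is the context class from the Database section):
// Worker.cs -- sketch of the complete loop
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    while (!stoppingToken.IsCancellationRequested)
    {
        // Read credentials from env variables
        string? login = _configuration["GITHUB_LOGIN"];
        string? token = _configuration["GITHUB_TOKEN"];

        // Authenticated GitHub client
        var client = new GitHubClient(new ProductHeaderValue("lang-tracker"));
        client.Credentials = new Credentials(login, token);

        using (var dbContext = new LangTrackerContext(_configuration))
        {
            var repos = await client.Repository.GetAllForCurrent();
            foreach (Repository repo in repos)
            {
                // Skip company repos & forks
                if (repo.Owner.Login != login || repo.Fork) continue;

                var langs = await client.Repository.GetAllLanguages(repo.Id);
                foreach (var lang in langs)
                {
                    dbContext.Add(new GithubLanguage
                    {
                        Date = DateTime.Now,
                        Repo = repo.Name,
                        Language = lang.Name,
                        Size = lang.NumberOfBytes / 1024.00
                    });
                }
            }
            dbContext.SaveChanges(); // persist the daily snapshot
        }

        // Wait 24 hours
        await Task.Delay(1000 * 3600 * 24, stoppingToken);
    }
}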
🚀 Deployment
Now, I chose the Worker Service template because it can be easily deployed as a systemd daemon or a Windows service with a simple addition to Program.cs:
// Program.cs
using LangTracker;

IHost host = Host.CreateDefaultBuilder(args)
    .UseWindowsService() // 👈🏻 <-- here
    .UseSystemd() // 👈🏻 <-- and here
    .ConfigureServices(services =>
    {
        services.AddHostedService<Worker>();
    })
    .Build();

await host.RunAsync();
If you want to try it out, remember to install the packages:
dotnet add package Microsoft.Extensions.Hosting
dotnet add package Microsoft.Extensions.Hosting.Systemd
dotnet add package Microsoft.Extensions.Hosting.WindowsServices
Nevertheless, deploying as a containerized service is so much more convenient, as it gives me the chance to specify the installation instructions in a single docker-compose.yaml file and run everything with a single command.
Therefore, we'll be using
- the official ASP.NET Core runtime image as base image, and
- the official .NET SDK image as the build image.
The Dockerfile is the standard one automatically built by Visual Studio:
# base image
FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS base
WORKDIR /app
# build image: restore and build
FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
WORKDIR /src
COPY ["LangTracker.csproj", "."]
RUN dotnet restore "./LangTracker.csproj"
COPY . .
WORKDIR "/src/."
RUN dotnet build "LangTracker.csproj" -c Release -o /app/build
# publish image: publish
FROM build AS publish
RUN dotnet publish "LangTracker.csproj" -c Release -o /app/publish
# run image: run
FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "LangTracker.dll"]
The next step is to provide a docker-compose.yaml recipe to start two container services:
- langtracker_app, containing the app (based on the Dockerfile above), and
- langtracker_db, containing the database (based on the official postgres image).
To ensure connectivity between the two containers, remember to:
- Add the two environment variables POSTGRES_PASSWORD and POSTGRES_PORT, respectively the password for the postgres user and the local port to be mapped to the container's default PostgreSQL port 5432
- Change your connection string in appsettings.json from Host=POSTGRES_HOSTNAME to Host=langtracker_db, taking advantage of the default bridge network driver, which provides automatic DNS resolution between containers.
Here's the docker-compose file I've been using:
services:
  db:
    image: postgres
    container_name: langtracker_db
    restart: always
    ports:
      - "${POSTGRES_PORT}:5432"
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - postgres-data:/var/lib/postgresql/data
  app:
    container_name: langtracker_app
    build:
      context: .
      dockerfile: ./Dockerfile
    depends_on:
      - db
    environment:
      - GITHUB_LOGIN=${GITHUB_LOGIN}
      - GITHUB_TOKEN=${GITHUB_TOKEN}
volumes:
  postgres-data:
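Compose resolves the ${...} placeholders from the shell environment or from a .env file placed next to the docker-compose.yaml; for example (placeholder values):
# .env
POSTGRES_PORT=5432
POSTGRES_PASSWORD=[YOUR-POSTGRES-PASSWORD-HERE]
GITHUB_LOGIN=[YOUR-GITHUB-USERNAME-HERE]
GITHUB_TOKEN=[YOUR-GITHUB-TOKEN-HERE]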
Let's just run it:
docker-compose up -d
Depending on your network, it may require a few minutes to download the base images and start the two containers, but in the end both should be up and running.
Now, when you access the database in your langtracker_db container (even with a simple sudo -u postgres psql -h localhost -p YOUR_POSTGRES_LOCAL_PORT), you may perform a nice aggregating query like:
select "Date"::date as Day, "Language", sum("Size") as TotalSizeKB from "dbschema"."Languages"
group by "Date"::date, "Language"
order by Day desc, TotalSizeKB desc;
Giving you the final result of your daily snapshot of languages (below, a couple of days from last summer):
day | Language | totalsizekb
------------+------------+----------------
2022-07-20 | C# | 166.75390625
2022-07-20 | PowerShell | 166.16015625
2022-07-20 | HTML | 151.7119140625
2022-07-20 | Python | 94.58984375
2022-07-20 | TSQL | 23.9765625
2022-07-20 | JavaScript | 23.9599609375
2022-07-20 | CSS | 9.599609375
2022-07-20 | Shell | 2.54296875
2022-07-20 | Dockerfile | 2.412109375
2022-07-19 | PowerShell | 166.16015625
2022-07-19 | HTML | 148.2822265625
2022-07-19 | C# | 143.2822265625
2022-07-19 | Python | 94.58984375
2022-07-19 | TSQL | 23.9765625
2022-07-19 | JavaScript | 23.7392578125
2022-07-19 | CSS | 8.5712890625
2022-07-19 | Shell | 2.54296875
2022-07-19 | Dockerfile | 1.56640625
Small and progressive steps in C# and HTML, apparently 😄...
We're good to go! 🎉✨ I'll just run it for a few months, then maybe I'll adjust it to track the language stats only once a month or so, and see the results.
💡 Next steps
The background service does what it's supposed to do, but it's far from perfect. Here are some random ideas popping up:
🪲 Fixes:
- If the only purpose is to show aggregated data, maybe it'd be a good idea to save only the aggregated form in the database
- The service works from the moment you start it, but it cannot retrieve past data.
  - Maybe we could replace the /repos/{owner}/{repo}/languages call with the /repos/{owner}/{repo}/commits request
  - Even the GitHub tree API looks interesting for this purpose
  - Then retrieve all the files in each commit, and invoke the above-mentioned Linguist library to get the list of languages
  - This looks like a very time-consuming effort, as perfectly described by @ahegiy in their note on a GitLab issue on the same topic.
  - I particularly like the idea of doing this 'on demand' when asked by the user
💼 Features:
- Create a frontend application to show a nice line chart (maybe with ChartJS?)
- Explore the GitHub GraphQL API instead of the REST API!
Maybe I'll try to deep dive into one of these points in the next month or so, we'll see.
Meanwhile, have a great time and keep coding! 👋🏻