Analyzing GitHub Stars - Extracting and analyzing data from GitHub using Apache NiFi®, Apache Kafka® and Apache Druid®
Vijay
Posted on January 12, 2023
As part of the developer relations team at Imply, I thought it would be interesting to extract data about users who have starred the apache/druid repository. Stars don't just tell us how many people find Druid interesting; they also give insight into what other repositories those people find interesting. That is really important to me as an advocate – it helps me work out what topics people might want to know more about in my articles and at Druid meetups.
Spencer Kimball (now CEO at CockroachDB) wrote an interesting article on this topic in 2021, where he created spencerkimball/stargazers based on a Python script. So I started thinking: could I create a data pipeline using NiFi and Kafka (two OSS tools often used with Druid) to get the API data into Druid, and then use SQL to do the analytics? The answer was yes, and I have documented the outcome below. Here's my analytical pipeline for GitHub stars data using NiFi, Kafka and Druid.
Sources - the Github API
GitHub provides an API (/repos/{owner}/{repo}/stargazers) for extracting stargazer data. It returns 30 users per page, with the results spread across multiple pages. Each page is an array like the one below:
[
{
"starred_at": "2012-10-23T19:08:07Z",
"user": {
"login": "user1",
"id": 45,
"node_id": "MDQ6VXNlcjQ1",
"avatar_url": "https://avatars.githubusercontent.com/u/45?v=4",
"gravatar_id": "",
"url": "https://api.github.com/users/user1",
"html_url": "https://github.com/user1",
"followers_url": "https://api.github.com/users/user1/followers",
"following_url": "https://api.github.com/users/user1/following{/other_user}",
"gists_url": "https://api.github.com/users/user1/gists{/gist_id}",
"starred_url": "https://api.github.com/users/user1/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/user1/subscriptions",
"organizations_url": "https://api.github.com/users/user1/orgs",
"repos_url": "https://api.github.com/users/user1/repos",
"events_url": "https://api.github.com/users/user1/events{/privacy}",
"received_events_url": "https://api.github.com/users/user1/received_events",
"type": "User",
"site_admin": false
}
},
{
"starred_at": "2012-10-23T19:08:07Z",
"user": {
"login": "user2",
"id": 168,
"node_id": "MDQ6VXNlcjE2OA==",
"avatar_url": "https://avatars.githubusercontent.com/u/168?v=4",
"gravatar_id": "",
"url": "https://api.github.com/users/user2",
"html_url": "https://github.com/user2",
"followers_url": "https://api.github.com/users/user2/followers",
"following_url": "https://api.github.com/users/user2/following{/other_user}",
"gists_url": "https://api.github.com/users/user2/gists{/gist_id}",
"starred_url": "https://api.github.com/users/user2/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/user2/subscriptions",
"organizations_url": "https://api.github.com/users/user2/orgs",
"repos_url": "https://api.github.com/users/user2/repos",
"events_url": "https://api.github.com/users/user2/events{/privacy}",
"received_events_url": "https://api.github.com/users/user2/received_events",
"type": "User",
"site_admin": false
}
}]
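If you want to try the raw call yourself, a single page can be fetched with curl. This is just a sketch: GITHUB_TOKEN is a placeholder for your own API token, and the Accept header requests GitHub's star+json media type, which is what adds the starred_at timestamp to each record:
curl -H "Accept: application/vnd.github.star+json" \
     -H "Authorization: Bearer $GITHUB_TOKEN" \
     "https://api.github.com/repos/apache/druid/stargazers?per_page=30&page=1"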
This is the pipeline that I decided to build:
- NiFi - fetches the JSON results from the multiple pages returned by the API, then splits each page into multiple JSONs - one for each star.
- Kafka - provides reliable delivery from NiFi to Druid and is the source that Druid ingests from.
- Druid - ingests the JSON, lets me use JSON paths in SQL queries, and lets me do analytics along the timeline.
As this was a bit of an experiment, I decided to build up two tables: "blog3" containing the users who have starred the Druid repository on GitHub, and "blog4" containing the organisation names for those users. In Kafka, I would create two topics with the same names.
Druid expects newline-delimited JSON – it doesn't support a JSON array at the top level (arrays inside each record are fine). To get this data into Druid easily, I decided to break each [...] array up into one JSON record per element, then publish those records to a Kafka topic.
As for the schema, Druid needs a __time column. This was easy to work out – I would use the starred_at datetime from the JSON, i.e. when the repository was starred. I'd also have a field "user": a nested JSON object containing the properties of the user who did the starring.
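After splitting, each message published to the blog3 topic is a single line of JSON – something like the record below, trimmed here to a few of the fields the later queries actually use:
{"starred_at": "2012-10-23T19:08:07Z", "user": {"login": "user1", "id": 45, "site_admin": false}}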
Install and configure Kafka
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Kafka is a natural fit with Druid – Druid has an out-of-the-box Kafka consumer that guarantees exactly-once ingestion and that I could scale up and down quickly thanks to Druid's architecture.
You can install Kafka from https://kafka.apache.org/quickstart.
Because Druid and Kafka both use Apache ZooKeeper, I opted to use the ZooKeeper deployment that comes with Druid, and didn't start the one bundled with Kafka.
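In practice that just means skipping the ZooKeeper start step from the Kafka quickstart and starting only the broker – the default config/server.properties already points at localhost:2181, which is where Druid's bundled ZooKeeper listens. Run from Kafka's bin directory, matching the commands below:
./kafka-server-start.sh ../config/server.properties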
Once the broker was running, I created the two topics for NiFi to post the data into, and for Druid to ingest from:
./kafka-topics.sh --create --bootstrap-server localhost:9092 --topic blog3 --replication-factor 1
./kafka-topics.sh --create --bootstrap-server localhost:9092 --topic blog4 --replication-factor 1
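As a quick sanity check once NiFi starts publishing (see the next section), you can tail a topic with the console consumer that ships with Kafka:
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic blog3 --from-beginning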
Install and configure NiFi
Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
NiFi is very useful when data needs to be loaded from different sources. In this case, I will use NiFi to access the GitHub API, as it makes it very easy to call an HTTP endpoint repeatedly and get data from multiple pages.
You can see what I did by downloading NiFi yourself and then adding my template from the Druid Datasets repo:
https://github.com/implydata/druid-datasets/blob/main/githubstars/github_stars.xml
Here’s a screenshot of the flow.
1. GenerateFlowFile: generates dummy content to trigger the flow.
2. UpdateAttribute: sets an attribute "p3" that tracks the current page number, to handle the multiple pages returned by the GitHub endpoint.
3. InvokeHTTP: calls the GitHub stargazers endpoint (https://api.github.com/repos/apache/druid/stargazers?page=${p3}).
4. SplitJson: splits each returned page – an array of stargazer records – into individual JSONs.
5. MergeContent: merges the split JSONs with a newline separator.
6. PublishKafka_2_6: posts the newline-delimited JSON to the blog3 topic in Kafka.
7. EvaluateJsonPath: extracts loginid and then…
8. InvokeHTTP: …uses it to invoke this API: https://api.github.com/users/${loginid}/orgs
9. SplitJson: splits the returned array of organisations.
10. EvaluateJsonPath: extracts orgid from the JSON.
11. AttributesToJSON: creates a JSON with loginid and orgid.
12. PublishKafka_2_6: posts the JSON to the blog4 topic in Kafka.
Steps 8 – 12 use the loginid retrieved in step 7 to get the orgid associated with each starring user, by calling the corresponding endpoint in the GitHub API – the result is the kind of record sketched below.
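As an illustration (the values here are made up), a single message on the blog4 topic ends up as a small flat JSON with just those two attributes:
{"loginid": "user1", "orgid": "apache"}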
For the InvokeHTTP processors, I set my GitHub API key so that the calls are authenticated and get the higher rate limit.
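If you want to see what the organisations endpoint from step 8 returns for a given user, the equivalent curl call is below – user1 is just a placeholder, and GITHUB_TOKEN is your own API token:
curl -H "Authorization: Bearer $GITHUB_TOKEN" \
     "https://api.github.com/users/user1/orgs"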
Install and configure Druid
Apache Druid is a real-time database to power modern analytics applications.
This is how I would look at the data from the APIs, using SQL.
Druid can be downloaded from https://druid.apache.org/docs/latest/tutorials/index.html - I just started it up with the default configuration.
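For reference, getting a local Druid running from the quickstart tarball is roughly the following – the exact script name varies by version (older releases use bin/start-micro-quickstart instead of bin/start-druid):
tar -xzf apache-druid-<version>-bin.tar.gz
cd apache-druid-<version>
./bin/start-druid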
The Druid console is on http://localhost:8888 by default, so I could quickly get into the ingestion setup wizard and connect to Kafka. The wizard creates a JSON version of the ingestion specification – you can see mine here:
- https://github.com/implydata/druid-datasets/blob/main/githubstars/blog3_ingest.json
- https://github.com/implydata/druid-datasets/blob/main/githubstars/blog4_ingest.json
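The full specs are at the links above; stripped right down, the blog3 spec looks roughly like the sketch below. The important parts are that starred_at drives __time and that user is ingested as a nested JSON column so that JSON_VALUE can be used on it later (this is a trimmed illustration, not the exact spec):
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "topic": "blog3",
      "consumerProperties": { "bootstrap.servers": "localhost:9092" },
      "inputFormat": { "type": "json" },
      "useEarliestOffset": true
    },
    "dataSchema": {
      "dataSource": "blog3",
      "timestampSpec": { "column": "starred_at", "format": "iso" },
      "dimensionsSpec": { "dimensions": [ { "type": "json", "name": "user" } ] },
      "granularitySpec": { "rollup": false }
    },
    "tuningConfig": { "type": "kafka" }
  }
}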
You can submit the specifications yourself in the Supervisors pane under ingestion:
It’ll show you a preview before you submit:
Querying Druid
As soon as the Kafka ingestion supervisors were running, I could see the two datasources in Druid's query tab: blog3 and blog4.
In blog3 I can see the users who have starred the Druid repository on GitHub.
And in blog4 I can see the organization names for each login.
I could straight away do some SQL querying on the incoming Kafka data. Some examples are below.
- Number of stargazers who are site admins:
select JSON_VALUE("user", '$.site_admin'), count(*)
from blog3
group by 1
- Stargazers added by month:
select TIME_FLOOR(__time, 'P1M'), count(*)
from blog3
group by 1
- Number of users by org who have starred the Druid repo:
select orgid, count(*)
from blog4
group by 1
order by count(*) desc
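Because user is a nested JSON column, the same JSON_VALUE trick works for any of its properties. For example, a quick look at the most recent stargazers (a hypothetical extra query, not one from the original set):
select __time, JSON_VALUE("user", '$.login') as login
from blog3
order by __time desc
limit 10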
Conclusion
I started this wanting to get some insights to help me engage with the community. So where am I on that?
From the last query above, I get the top ten orgs below:
Clearly I should think about gaming-focussed content. Maybe I should see if there are Trino meetups I can present at. I should also try reaching out to FOSSASIA.
What else can I see? When going through the GitHub API, I realized that the endpoint https://api.github.com/users/USERNAME/starred allows me to fetch the other repositories starred by the users who starred the Druid repository.
I enhanced the NiFi template to add this new endpoint (https://github.com/implydata/druid-datasets/blob/main/githubstars/nifi_other_repos.xml) and used the Druid supervisor spec (https://github.com/implydata/druid-datasets/blob/main/githubstars/nifi_other_repos.xml) to ingest this into the same datasource (blog4), and then ran this query:
select repo, APPROX_COUNT_DISTINCT_DS_THETA(loginid)
from blog4
where repo not in ('apache/druid') and repo <> ''
group by 1
order by 2 desc
limit 20
to get:
This clearly tells me that I should look at content related to React (ant-design is a UI framework using React), Superset, TensorFlow, Flink, Kubernetes, Metabase and Spark. I should also try to engage with these communities to help them use Druid alongside these other products.
Using NiFi, Kafka and Druid, I've put together the beginnings of a real-time modern analytics application. The pipeline fetches data from the GitHub API and helps me analyze the users that have starred the Druid repository on GitHub. All three products – NiFi, Kafka and Druid – are capable of handling large data volumes, can be run as clusters, and are horizontally scalable.
Next step – a UI to sit on top of the data! Watch this space for a follow-up post.
If you have questions on using Druid, please go to the community link below and sign up, or come to the POC clinic.
Learn more
https://druid.apache.org/community – connect with other Apache Druid users
https://learn.imply.io/ - free Druid courses with hands-on labs
Kafka Ingestion tutorial on Druid docs
https://kafka.apache.org/ - all things Kafka
https://nifi.apache.org/ - all things Nifi