When You Merge Pull Requests You Lose Knowledge
Marcelo Sousa
Posted on August 24, 2021
What happens to comments and discussions of a pull request when it is merged? Isn’t that knowledge valuable for the team? Why isn’t it readily accessible to future pull requests?
Pull requests (PRs) are a mainstream way of proposing changes to a codebase. One of the most important features around pull requests is the ability to perform a code review, allowing developers to inspect and comment on the proposed changes.
Frequently, the knowledge in these comments is invaluable to the team. Reviewers, which are familiar with the codebase and related technologies, will notice bad patterns, suggest optimisations, enforce team guidelines and best practices, etc.
However, as soon as you merge that pull request on GitHub, Gitlab, or Bitbucket, all this information is lost into the bag of closed PRs that you will rarely (if ever) inspect.
We all have had that déjà vu feeling when writing a comment in a code review: “Didn’t I write this comment already?” When preparing a PR sometimes we also have that feeling: “I think I’m forgetting something in this line of code.” Or better yet: “Didn’t someone already thought of this before?”
Let’s look at three scenarios.
1. Knowledge useful to a broader audience
Knowledge that could be beneficial to developers not involved in this codebase. For instance, this comment could be useful for other developers modifying similar code.
2. Knowledge about design decisions
Teams with a lot of contributors can suffer from problems associated with the lack of context in design decisions. For instance, when a decision was taken on a discussion on a previous pull request. Consider the following discussion:
Assuming that this PR is merged, when someone else modifies this code in the future, it would be useful to retrieve this discussion. In this case, that would be particularly relevant if this person starts the support for a decrementing path.
3. Review déjà vu
Senior developers that are active reviewers also experience what we call review déjà vus. These are situations where you are repeating similar comments in different PRs, like this one:
These situations also occur when there are some hidden assumptions that do not make sense to document in the code.
Wouldn’t it be nice if we could re-use the comments from past reviews into future PRs without having to maintain them?
Enter Reviewpad
Reviewpad builds semantic diffs of changes in pull requests and associates metadata such as comments to these semantic objects. This is a big shift from the current behavior of existing code hosts such as Github, Gitlab, and Bitbucket. They associate comments to a line of a file in a particular git diff – as soon as the contents of the line change, this comment is marked as outdated.
We’ll showcase Reviewpad with the Web Crawler exercise of A Tour of Go using the linked GitHub repository. In this exercise of A Tour of Go, we are given a first implementation of a fake web crawler – the exercise is to parallelise it and prevent fetching the same URL twice.
Semantic git diffs
As a first step, we used Reviewpad to create a pull request with this first implementation.
Reviewpad performs a relational static analysis (in the style of [1] and [2]) over the git diff associated with the pull request to build semantic diffs. These semantic diffs form what we call the Explore Tree in Reviewpad.
An explore tree is an annotated file tree of the modified files. For each modified file, we present an annotated semantic tree of the changes. For example, the file fakeFetcher.go
as presented on GitHub requires an entire scan to understand the main symbols added: 1) the type fakeFetcher
, 2) the struct fakeResult
with two fields body
and urls
and 3) the method Fetch
. Note that the method Fetch
is shown as a child of the type fakeFetcher
as it is the receiver type. At the moment, our analysers focus on types and methods – this is why we don’t show variable fetcher
. The color in the beginning of the symbol indicates the type of modification: green for added, red for deleted and yellow for modified.
We show the explore tree on the left of the description and general conversation so that developers can quickly match the description and discussions to code changes.
Each element of the explore tree is clickable to its representation in the git diff. For example, clicking on the fakeFetcher
type directs the user to the beginning of the symbol in the diff:
Note: For those you noticed the coloured icon next to the file go.mod
– this means that the file is automatically hidden from the review mode. Each Reviewpad user can tweak their file view in their review settings to exclude files.
In this PR, I did a first review to document parts of the code. For example, while the original code for Crawl
is:
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
// TODO: Fetch URLs in parallel.
// TODO: Don't fetch the same URL twice.
// This implementation doesn't do either:
...
}
I chose to comment on the function as a code diff comment. These review comments are inserted into the knowledge base maintained by Reviewpad:
Note that the comment in the explore tree appears as a child of Crawl
symbol. This means that the comment was made in the definition of the symbol (not just in its signature).
Comments from past reviews
After the PR with the initial code was squash and merged, I created a new PR with a solution presented by the GitHub user harryhare.
The explore tree again shows the overview of the solution which involve the introduction of a SafeCounter
struct with a map v
to check if an URL was visited, a mutex mux
and a method checkvisited
that checks if URL was already visited or not.
The rest of the solution involves changes to the functions Crawl
and main
to fetch URLs in parallel. In the Crawl
method, Reviewpad presents the comment from the past PR that documents the method. This way, developers can retrieve discussions from previous PRs that changed this method.
A user review raised an issue with the checkvisited
which is fixed in the commit [fix from review](https://github.com/reviewpad/dejavus/pull/3/commits/a629a38fddad2305a13f2ce07da6a001c263920b)
.
A final review approves the PR with the request to remove the TODOs in the code.
In the final commit of this PR, we add the reference to the solution as a code comment in the Crawl
method and remove the TODOs.
Hovering on the function Crawl
in the explore tree displays the code documentation associated with this function for readability.
Keeping track of comments and changes over time
At this point you might be wondering:
- What happens if the comment was made in a PR that was not the exact previous one?
- What happens if I start refactoring the code?
To answer these questions, let’s go over the two open PRs in our project.
The first is an improvement proposed by another Github user that uses wait groups.
The comment shown in this pull request was the one made in the first PR. Reviewpad tracks comments regardless of their time difference. For example in the following PR from the google/error-prone project:
The PR was opened in September 2020 and it is changing class methods that were commented in January 2020.
The second open pull request demonstrates a beta version of our refactoring analyser:
In this PR, we introduced two commits, the first that changes the signature of the function Crawl
by renaming the variable url
to reqUrl
and the second renames the file.
Reviewpad understands that it is the same function and presents the comment made in the first PR. Note that we are squash and merging these pull requests – the commit associated with the comment is not an ancestor of the commits in this PR.
Cool. How do I check this out by myself?
We have a public beta version of Reviewpad available at https://reviewpad.com/get-started/. You will need to create a new account and once you log in for the first time, you will see the following page to connect to a code host:
You can connect to GitHub through our OAuth app or manually add a personal access token. The OAuth requires minimal scopes to be able to read and comment on public repositories.
And that’s it – you will be able to give Reviewpad a spin on public repositories! We will be adding more and more public repositories in the upcoming days – feel free to reach us on our community Slack with requests!
References
[1]: Cartesian hoare logic for verifying k-safety properties.
Posted on August 24, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.