Improving Hasura's Internal PR Review process

noriste

Stefano Magni

Posted on December 6, 2022

Improving Hasura's Internal PR Review process

During 2022, thanks to Haystack, we realized our Pull Requests stayed open, on average, for seven days.

Haystack reporting Hasura's PRs stay open for 171.3 hours on average!

As you can imagine, this is troublesome because:

  1. It slows down the Software Development Life Cycle, preventing us from merging PRs and releasing new features for our users.
  2. It requires a long back and forth between the PR author and the reviewers, sometimes resulting in the need to reload the context in people's minds again and again.
  3. The PR’s code becomes outdated, often requiring repeated merges/rebases and resolving conflicts.

We all see the problem, but what causes it? Inspired by GitLab's model, we created a dedicated internal Working Group (inspired by GitLab's Working Groups) to dig into the human aspects of this problem.

How the Working Group proceeded

Despite this being only the second internal Working Group, we already have a standard way to proceed:

  1. State the problem and explain why you think it's important
  2. Look for a sponsor among all the engineering managers
  3. Define the exit criteria
  4. Create a dedicated Slack channel
  5. Get people involved

When speaking about the specific "PR Review Working Group", we decided to:

  1. Schedule some early meetings: people interact the most during sync meetings compared to async documents
  2. Elaborate on the problem and the possible solutions
  3. Request feedback
  4. Interview some internal engineers: we opted for interviews instead of sending out a survey
  5. Find some best practices
  6. Present the best practices internally

So, Rob Dominguez, Daniel Chambers, and I (Stefano Magni) started working on the Exit Criteria, drafted the first notes and the first opinionated solutions, asked for feedback, etc.

The exit criteria

Defining the exit criteria of the Working Group is crucial to work toward a goal. We determined the following indicators:

  1. Identifying some best practices to follow to get PRs merged as soon as possible.
  2. Identifying the best practices in newly-opened PRs.
  3. Significantly shortening the average lifecycle of a PR.
  4. Improving the relationship between author and reviewer on both ends as measured by a monthly survey.

Over the next several months, we'll monitor the progress and adoption of points 2 through 4. For the remainder of this article, we'll focus on how we identified these best practices.

The root causes the Working Group found out

We all know that a slow PR review process is terrible, but what are the root causes for this slowness? What happens between a developer creating the PR and getting the PR merged? Well, let’s list them in order of when they occur during the end-to-end PR creation and review process:

  1. Missing or not enough detailed PR description: Reviewing a PR is mostly about understanding the author’s context and validating their changes against this context. When a description is missing or too short, it deprives the reviewers of context, making it more difficult to understand and adding a lot of drag to the whole process.

  2. Big PRs: PR size directly impacts the review time. Working for a full sprint and then opening a single PR guarantees the PR will not reviewed promptly.

  3. Code Owners: GitHub Code Owners help to define the specific people or groups that own part of the codebase. Until recently, PR reviews from code owners were mandatory, which prevents merging a PR quickly.

  4. Lack of ownership: When GitHub assigns reviews to a code owners group instead of an individual, nobody feels the review is their responsibility because they believe someone else will pick it up. An author won't waste time waiting for someone to review; they'll naturally move on to their next task, which, as we'll see later, causes them to lose their context on their work.

  5. Notifications missing: The modern developer has so many services competing for their attention, and, in the deluge of notifications, it can be difficult to parse what is essential and what is simply noise. We determined that important messages notifying us of pending reviews or feedback on PRs went unnoticed too often.

  6. PR review time not explicitly allocated: Reviewing PRs takes time, primarily because of the complexity of Hasura's system. When planning the sprints, estimating features, etc., we must never forget to dedicate part of the day to reviewing each other people's PRs.

  7. Timezone overlaps: As a fully-distributed, remote-only company, Hasura's engineers work worldwide. Short overlap adds some delays when you need to interact with people on the opposite side of the world.

  8. CI issues: CI is a gift and a curse, and it can make our review process more streamlined and give reviewers more confidence when used correctly. However, automated steps exceeded the scope of our WG; another group recently began work on this task in conjunction with our Infrastructure team to improve the developer experience.

The best practices

One by one, let's check the best practices we identified for every mentioned problem.

PR description missing or insufficiently detailed

PR authors must try to walk in the reviewer's shoes. And the reviewers need to know a lot of things like

  • What has changed?
  • Why did you choose this solution? More importantly, why did you not select other solutions?
  • What is the dependency graph of the changes? Where should I start reviewing the changes?
  • How should I test/verify?

In one word: context! The PR author must describe the PR context in detail, adding images, videos, Code Tours, diagrams, pseudo-code; whatever helps reviewers understand the PR changes.

It's important to understand that this activity could take one hour or more, but every minute dedicated to describing the PR saves every reviewer two minutes. Also, the description acts as documentation for posterity and could help the PR authors when they have to return to the PR after some time.

Thanks to a detailed description, reviewers can dedicate their time by checking that the developers’ code reflects their intentions instead of guessing what was in the developers’ minds when they wrote the code.

Tests are also an essential part of the PR; they should be present and reflect the changes introduced by the PR. Tests serve as a quick summary answering the question, “What should the code do?”

Big PRs

Authors need to create reasonably-sized PRs. All projects - or even subsets of codebases - are different, but the size of a PR exponentially impacts the time required to review them. The best way to create easily-parsed PRs is... writing a detailed description! Why? Because if you are describing your PR in detail, and you realize that you are spending hours describing it, chances are the PR is too big.

Code Owners

Before August 2022, we organized Hasura into stack-oriented teams (the Server team, the Console team, etc.), and code-owners groups reflected this internal structure. Since this created too many cross-team dependencies (resulting in slowing down E2E features implementation), we reorganized into feature-focused teams that include all the necessary people to implement and release features independently.

As a result, if GitHub automatically assigns your PR to the Console code-owners group, there's a high probability that you have at least a couple of Console developers on your team! And since they participate in your standup and align with the team goals, they are the best candidates to review your Console-related PR.

The suggestion is always to use Code Owners as a hint to understand what knowledge people should have to review your PR, then find the people with that knowledge in your team and assign the PR to them.

In the future, code owners at Hasura will reflect the new teams’ structure, but this is only partially possible at the time of writing.

Lack of ownership

Lack of ownership happens in two different moments:

  1. When no one reviews a PR
  2. When the PR authors forget about their PR

We can automatically resolve the former case with what we shared in the previous point via assigning a specific person to review a PR. In contrast, the latter happens when the PR authors become tired of waiting for a review and start working on new tasks.

Whenever possible - aka, when getting the PR merged is a matter of hours - the PR authors should not start working on new tasks, but they should opt for getting the existing PR merged. Working on new tasks means forgetting and de-prioritizing the current PR, while striving to get it merged to ensure it will bring to completion.

What happens when a review takes too long? Simply put: authors get stressed. However, this response is critical to get them to become self-advocates and draw attention to the issue:

  • Raise their hands.
  • Send messages.
  • Ask for help.
  • Report the problem to the managers and in the retros.
  • Escalate if the problem persists.

We know this process could be better, but it also works towards resolving the issue. Working on new tasks, instead, buries the problem under the rug.

Notifications missing

We solved this problem by:

Firstly, enabling GitHub real-time alerts since it's the best way to receive only the important notifications in Slack
GitHub showing all the options for Slack real-time alerts

Secondly, we enabled Haystack daily reports in Slack, so that teams never forget their waiting, stale or large PRs.
GitHub real-time alerts for Slack

PR review time not explicitly allocated

We suggested that:

  • Individuals should enable GitHub real-time alerts, since it is the best way to receive only important notifications in Slack.
  • Teams should prioritize reviewing PRs (who cares if a team creates a lot of PRs without getting them merged?)
  • Teams should raise the problem of lack of review time during retros
  • Teams should set correct expectations in terms of workload the team can afford, also considering the code reviews
  • Team managers always remember that soft code reviews often result in bugs hitting our users in production

Timezone overlaps

We reduced this problem almost to zero naturally by assigning PR reviews to peers from the same team instead of code owners. We built feature teams with timezone compatibility in mind.

What happens if we do not follow the above best practices?

Simply put:

  • We start working on many features, but we release only a few.
  • We release features at a slower pace.
  • We end up with a backlog of stale PRs that haven't seen activity in months and wither away, neglected on GitHub's servers.

The presentation

Instead of opting for the standard slide deck, we made a more engaging presentation focused on a story: a Hasura developer opening a PR and facing all the PR review-related problems. We wanted attendees to garner empathy - a company core value - for devs on either side of this process.

To make the story more engaging, we relied heavily on gifs and exaggerated - only slightly! - some of the impediments we all face in this industry.

We created some vertical slides to have the slides side-by-side with the presenter (using a virtual camera and OBS, you can find all the instructions in this tweet about how to do the same).

Here is a picture of the presentation layout:
The presenter side-by-side with the vertical slides of the presentation

What's next?

Once we identified the best practices, we immediately put the "Starting to review PRs" topic in our onboarding docs to ensure new hires understand the importance of these best practices within their first few weeks.

Over the next few months, we'll continue analyzing metrics via Haystack to understand if the average number of days a PR stays open decreases. If not, we will try to understand the developer's difficulties applying the best practices mentioned.

Special thanks

I (Stefano Magni) volunteered to create and drive the Working Group, but I want to thank Rob Dominguez for raising the quality of our work by doing a lot of under-the-hood things. His presence and energy always motivated me too!

Thanks a lot also to Daniel Chambers. Daniel got involved in all the discussions, provided a lot of feedback, and volunteered to interview some of our developers. Thanks, Daniel!

Finally, thanks to all the people we interviewed and those who participated in every async and sync discussion ❤️.

External resources

💖 💪 🙅 🚩
noriste
Stefano Magni

Posted on December 6, 2022

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related