I ran a ludicrously complex engineering project (and survived)
jcreenaune
Posted on December 12, 2018
Picture boring a new subway tunnel while the city above you goes about its business, oblivious to the work happening beneath the surface. That’s pretty much the project we completed earlier this year. Only we were working in the cloud.
Solid software engineering principles are the key to delivering massive, complicated projects – like rearchitecting Confluence and Jira to be stateless, multi-tenant cloud applications. It was the largest single program of work Atlassian has ever undertaken, clocking in at three calendar years, which equated to several hundred dev years. But we also did it without disrupting existing customers, most of whom weren’t even aware of the giant architectural change happening beneath them.
This was the type of project that can lead to ulcers and sleepless nights if not run properly. After we described the size, scope, and aggressive timeline to one of our cloud engineering managers, they responded with “I really like the idea, but it’s giving me vertigo”.
We did a bunch of things right to make this giant program work. Chief among them are the four engineering principles we used to plan and execute the project:
All-in – This was a massive program and it needed the entire company, from exec to intern, behind it.
Attack the riskiest assumption – Don’t think “MVP.” Think about what the riskiest part of the entire program is, and focus on that until it’s no longer the top risk.
Incremental, even when it hurts – Reduce risk by breaking the migration into as many small parts as possible, even when that increases dev time.
Sprint to 100% – In this program, the payoff was when we could completely stop deploying to the old infrastructure. Don’t slow down or lose focus until you are done, done, done, and done.
These, along with our company values (especially “don’t f#&k the customer” and “build with heart and balance”), formed the foundation for the entire project.
First, some background. Jira’s first lines of code were committed in 2002. Confluence followed in 2004. Very few customers were ready to host mission-critical data in the cloud back then, so it made sense to design them as single-tenant applications to be hosted on customers’ own servers. In 2007, we started offering Confluence and Jira SaaS options, which continued to evolve. But every evolution was a tweak on the same model: containerisation used to isolate that single-tenant architecture separately for each individual cloud customer. Fast-forward to 2015: more and more customers were moving to the cloud, and it was clear that this single-tenant architecture could not scale to their needs – especially if you project outward about 15 years. We needed to rearchitect Confluence and Jira to be true multi-tenant, stateless cloud applications – or slowly crumble under a mountain of technical debt.
How we would actually deliver on that vision was hotly debated over the years. Several approaches were discussed internally, some taking very cautious steps over many, many years. Let’s unpack these four engineering principles one by one as I tell you the story of “Project Vertigo”.
Engineering principle #1: All-in
At the start of 2015, we decided to do away with incremental steps and head as fast as we sensibly could to the final destination. We looked at what would happen if we just put the existing Confluence and Jira backend teams onto the project. It would have taken 4-5 years. That’s no good. No matter how much developers believe in the final destination, it’s hard to keep the same people excited about the same project for that length of time. Additionally, this was the kind of project where there is little incremental value until the entire project is done. That makes it even harder to maintain morale.
You also want to bring that big payoff as far forward as you can. As a company, we knew that going all-in on this – making disruptive team changes internally to get to the payoff faster – was the only sensible way forward.
However, you clearly can’t just go all-in on day one of a project. In reality, the timeline looked like this:
January 2015 – Architectural spikes and proofs of concept to show that the architecture could work and handle the required scale.
September 2015 – Warming up by landing some of the core work in earnest. A large part of the work required applying the same patterns (like removing tenanted data from in-memory caches) broadly across the entire codebase. In this phase of work, we coded the first few examples of those repeated patterns, including patterns for testing and continuous integration.
March 2016 – Really, truly all in. Every available hand was on deck to land the entire scope of work required to migrate the first customer to the new architecture.
December 2016 – We migrated the first customer!
December 2017 – Successfully migrated the last customer. By this time, we had also completed everything that was originally cut from scope before we migrated the first customer (both functionality and performance).
It’s exciting to reflect. We spent years knowing (and fearing) that this huge thing was on the horizon at some future point. Once we committed to tackling it head-on, the bulk of the work was done in nine months. The whole company worked together to get it done in the shortest time possible.
If you want to do such a massive transformation, it’s only going to work if you make tough choices. In our case, this meant breaking down pre-existing department barriers and working around pre-existing roadmaps and commitments. That, in turn, will only work with exec buy-in. With 20/20 hindsight, we could have gone even faster. I would have loved to compress the first 15 months of spiking, proof-of-concept, and groundwork even tighter. The turning point there was a new CTO, Sri Viswanath, who brought a higher level of exec buy-in and the confidence we needed to put all our chips behind Vertigo.
Changing the structure of your organisation and moving teams, or the individuals in those teams, can be hard. If the folks in your company have a personal attachment to the teams they are in and the work they do – and developers at Atlassian tend to have very high camaraderie and personal investment in their work – then changing team structures can be viewed as forced and unwelcome, and can reduce morale.
As a management group, you can’t overcome that unless the developers who are affected by change believe in the vision. Don’t underinvest in the work on internal blogs, presentations, speaking to folks 1:1, etc. Bring them along on the journey when it requires such significant internal change.
Engineering principle #2: Attack the riskiest assumption
Whether the project is a startup’s first prototype or a big cross-department program in a large org, advice on running software engineering programs will tell you to be lean and cut as much scope as is sensible to get working software into users’ hands. That’s great advice, but it’s just the start. You also need to avoid shipping increments that don’t teach you anything about the complexities of the problem in front of you.
My main tool here is thinking through the riskiest assumptions in your project. One of my favourite posts on this subject is The MVP is dead. Long live the RAT. That post is from the lean startup world, where risk is mostly around finding product-market fit. But its principles are equally applicable in a large engineering project like Vertigo. You need something to sharpen your focus and determine what should be in the first release – what is valuable to prove now vs. what is low-risk enough to push out to a later milestone.
How did this actually work in practice? Let’s look at three risky assumptions we encountered on the way to migrating the first customer.
Lots of stuff to fit together
In addition to the vast re-architecture of the Confluence and Jira monoliths, we were building 15 new services to handle things like provisioning a customer, accessing customer metadata, distributed scheduling, inbound and outbound email, and a completely new user authentication platform. Each of those services had a team working on it. In an environment with concurrent development of highly coupled services, one of your biggest risks is getting integration wrong: APIs drift in isolation, and when you put the pieces together, they don’t work. Then you’re into rounds of bug fixing that push out the program’s overall delivery date.
To mitigate that, take another lesson from the startup world: build a throwaway prototype. In our case, it was an app to integrate all those services, with an owner whose job it was to hit integration milestones. This prototype was completed 6 months before we migrated the first customer.
That wasn’t zero-effort. It required teams to change their priorities to commit to the integration milestones. There was also pushback from dev teams: if your head is in your own silo of work, integration milestones can feel like they’re slowing you down within that silo. Plus it’s extra work to coordinate, and extra code to write that won’t ever ship to customers. But bringing the inevitable “integration crunch” risk forward by six months gave the project momentum, confidence, and a lot more time to think about the even bigger risks described below.
Data leakage
Confluence and Jira are 14 and 16 years old, respectively. For all of those years, they operated under the assumption that there was only a single tenant in the system. When turning a single-tenant system into a multi-tenant one, the biggest risk is that you’ll leak data between customers. The impact of this, especially for systems holding highly sensitive data like Confluence and Jira, is catastrophic. Internally, we used the phrase “company-ending event” to describe what would happen if a serious data leakage bug were released to production. Alternatively, we’d say “this is what gets us on the front page of Hacker News” – and not for the right reasons.
The scope of things that could go wrong was extraordinarily broad. Consider that anything living in long-term memory (e.g., in the Java world, a static member or a member on a singleton) containing tenanted data was perfectly acceptable for 16 years, but would cause an egregious bug in a multi-tenant world. One of our developers stepped up as owner for multi-tenant data leakage and took on the job of developing strategies for discovering multi-tenancy violations proactively, so there’d be zero potential problems when we migrated the first customers.
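To make that bug class concrete, here’s a deliberately simplified, hypothetical Java sketch (not Vertigo code – the class, fields, and methods are invented for illustration) of the kind of long-lived state that is harmless in a single-tenant JVM but leaks data the moment multiple tenants share a process:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical example: a process-wide cache of rendered pages.
// In a single-tenant deployment there is only one customer, so keying by
// pageId alone is fine. In a multi-tenant JVM, tenant A's render can be
// served to tenant B -- exactly the "egregious bug" described above.
public class PageRenderCache {

    // Dangerous once multi-tenant: static, long-lived, keyed only by pageId.
    private static final Map<Long, String> RENDERED_PAGES = new ConcurrentHashMap<>();

    public static String getOrRender(long pageId) {
        return RENDERED_PAGES.computeIfAbsent(pageId, PageRenderCache::render);
    }

    // Safer shape: every key carries the tenant identity, so entries can
    // never cross tenants (externalising the cache entirely is better still).
    private static final Map<String, String> TENANT_SCOPED = new ConcurrentHashMap<>();

    public static String getOrRender(String tenantId, long pageId) {
        return TENANT_SCOPED.computeIfAbsent(tenantId + ":" + pageId,
                key -> render(pageId));
    }

    private static String render(long pageId) {
        return "<html>page " + pageId + "</html>"; // stand-in for real rendering
    }
}
```

Scoping the key by tenant stops the leak; pushing the cache out of the process entirely is where principle #3 below takes it.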
To give us confidence here, we used a combination of techniques, including static code analysis, runtime memory analysis (i.e., do a test run inserting known fields into UI / API endpoints, then search for the presence of those strings in memory after the test concludes), and runtime tracing of string reads/writes. We made a significant investment of time in building and running tooling to get that confidence. We definitely went down a few rabbit holes that produced a low signal-to-noise ratio, or too many false positives to be really useful.
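As a rough illustration of the runtime memory analysis idea – a sketch of the concept only, not the tooling we actually built – a test run inserts a unique canary string into the product via its UI or API, a heap dump is taken from the application node (the dump trigger below uses the standard HotSpot diagnostic MXBean), and the dump is scanned for the canary’s bytes:

```java
import com.sun.management.HotSpotDiagnosticMXBean;

import java.io.IOException;
import java.io.InputStream;
import java.lang.management.ManagementFactory;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: after a test run that inserted a known canary string into
// the product, grab a heap dump from the application node and scan it for the
// canary's bytes. Finding the canary once the test is over (and its data should
// no longer be resident) points at tenanted data retained in long-lived memory.
public class CanaryHeapScan {

    // Runs inside the application JVM (e.g., behind a diagnostics-only hook):
    // triggers a heap dump using the standard HotSpot diagnostic MXBean.
    public static void dumpHeap(Path target) throws IOException {
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diag.dumpHeap(target.toString(), true /* live objects only */);
    }

    // Runs in the test harness: scan the dump for the canary bytes.
    public static void main(String[] args) throws IOException {
        Path dump = Path.of(args[0]);   // e.g. /tmp/after-test.hprof
        String canary = args[1];        // e.g. CANARY-5f1c...
        boolean leaked = contains(dump, canary.getBytes(StandardCharsets.UTF_8));
        System.out.println(leaked
                ? "Canary found in heap dump: something is retaining tenanted data."
                : "Canary not found.");
    }

    // Streaming search for a byte pattern, carrying an overlap between buffers
    // so matches that straddle a buffer boundary are not missed.
    static boolean contains(Path file, byte[] pattern) throws IOException {
        byte[] buf = new byte[1 << 20];
        byte[] carry = new byte[0];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buf)) != -1) {
                byte[] window = new byte[carry.length + read];
                System.arraycopy(carry, 0, window, 0, carry.length);
                System.arraycopy(buf, 0, window, carry.length, read);
                if (indexOf(window, pattern) >= 0) {
                    return true;
                }
                int keep = Math.min(pattern.length - 1, window.length);
                carry = new byte[keep];
                System.arraycopy(window, window.length - keep, carry, 0, keep);
            }
        }
        return false;
    }

    private static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }
}
```

The real tooling has to cope with false positives (logs, request buffers, the dump mechanism itself) – which is exactly where the signal-to-noise rabbit holes mentioned above come from.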
But in the end, it worked out. Two years after migrating the first customer and one year after reaching 100% migration, we have steered clear of Hacker News infamy.
We might f#&k the first customer
A massive re-architecture of two monoliths, plus the introduction of many new services, is a high-risk dev activity. Once we had high confidence of zero data leakage, the most critical thing we could do to reduce this risk was to get a small set of customers onto the new platform as early as possible. As the owner of the entire program, I saw my job as pushing back on scope creep as much as possible. The first customers for both Confluence and Jira were migrated simultaneously in December 2016. That first release didn’t support any 3rd-party add-ons, any extensions on top of Confluence or Jira (like Jira Service Desk or Confluence Questions), or any full-site import/export. It supported less than 50% of JQL syntax and was only performant at very low scale.
Now, we needed to get those first few customers onto the platform without breaking our company value “don’t f#&k the customer”. We looked at which features, JQL syntax, and add-ons our customers were using, along with their performance characteristics, amounts of user-generated data, and usage patterns, to find customers who fit within that reduced scope with no impact on their user experience. We picked customers who had been with us a long time – the rationale being that a customer who has used a product for more than three years and not scaled significantly (or started using advanced functionality, or adopted new add-ons) is far less likely to suddenly start doing so than a new customer.
We added alerting to let us know if those customers did hit any unimplemented features, and built a single-button “reverse migrate” to get them immediately back onto the old platform if they tripped those alarms. Again, this wasn’t free. But it was worth it to get those first customers across and prove to ourselves and the whole company that we were on the right track.
We migrated 22 customers in December 2016, and history shows we picked well. None required migration back to the old platform.
“Go incrementally” or “ship working software in iterations” is easy to say. In any problem of sufficient complexity, it’s hard to work out what the most valuable next increment is. We used the riskiest assumption rule to guide us. And it’s a continual focus. After you knock over the first riskiest assumption, the next one might not be immediately obvious. For leaders, it requires constant vigilance, and sometimes tough conversations, to keep your team asking “what’s the biggest risk” rather than “what’s the easiest thing to ship next”.
Engineering principle #3: Incremental, even when it hurts
One way of phrasing the “all-in” approach described above is “run as fast as sensible” to the end state. But what did “sensible” mean? To explain that, I’m going to dive into the architectural differences between the old and new platform. Bear with me.
The program of work here was to take a single-tenant system and make it multi-tenant. We also use the term “zero affinity” to indicate that the compute nodes in the application cluster are never tied (i.e., have an affinity) to any specific customer, which means that any compute node can service any customer’s request. Getting from single-tenant to multi-tenant and zero-affinity is basically a process of taking every piece of tenant-related state in the application and externalising it.
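To make “zero affinity” concrete, here’s a minimal, hypothetical sketch of the target request path (the interface and names are invented for illustration, not Atlassian’s actual services): the node holds no tenant state of its own, so any node can serve any customer by resolving the tenant’s context per request from an external registry.

```java
import java.util.Optional;

// Hypothetical sketch of a zero-affinity request path: the node holds no
// tenant state of its own; everything tenant-specific is looked up per request
// from external systems (tenant registry, externalised data stores).
public final class ZeroAffinitySketch {

    // Tenant metadata fetched from an external registry, not baked into the node.
    record TenantContext(String tenantId, String databaseUrl, String searchIndex) {}

    interface TenantRegistry {
        // e.g. backed by an external "customer metadata" service like the one mentioned earlier
        Optional<TenantContext> lookupByHostname(String hostname);
    }

    static String handleRequest(TenantRegistry registry, String hostname, String path) {
        // 1. Identify the tenant from the request itself (the hostname here),
        //    because the node has no affinity to any particular customer.
        TenantContext tenant = registry.lookupByHostname(hostname)
                .orElseThrow(() -> new IllegalArgumentException("unknown tenant: " + hostname));

        // 2. Use only externalised, tenant-scoped resources from here on:
        //    the tenant's database, search index, caches, attachments, etc.
        return "served " + path + " for " + tenant.tenantId()
                + " from " + tenant.databaseUrl();
    }

    public static void main(String[] args) {
        TenantRegistry registry = hostname -> Optional.of(
                new TenantContext("acme", "jdbc:postgresql://db-42/acme", "acme-index"));
        System.out.println(handleRequest(registry, "acme.example.net", "/wiki/home"));
    }
}
```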
At some point, you need to migrate customers from the old platform to the new one. The riskiest and least sensible strategy here is to move all the data around and change all the code from old to new as part of one massive migration. “Sensible” for us meant making as many small, incremental changes as possible in the old platform, so that by the time we got to the last step (the actual migration), the delta between the old and new systems was minimal.
Some examples of incremental migration throughout the project are:
Files and attachments – Moving from the local filesystem to an external store. We were already in the process of externalising this to provide stronger disaster recovery (DR) mitigations for customers, so this program brought forward the urgency of that migration.
Identity and userbase – Moving authentication and userbase management to a single, external system. Again, this was already in progress across Atlassian, and this program made it more urgent.
Search – Moving from local Apache Lucene to external Elasticsearch.
Local caching – Moving from mutable local state to external caches (or, alternatively, removing the caches and optimising data access).
Ecosystem – Atlassian add-ons have completely different architectures for server vs. cloud. We’d partnered with a few successful vendors to allow them to offer their server add-ons to cloud customers. Vertigo required collaborating with them to transition to our cloud add-on architecture.
All these features were required by the new platform but implemented on the old platform, with customer data migrated, well before the final migration for a particular tenant. So at the time of migrating a customer from the old platform to the new one, all data that needed to be externalised had already been externalised. All coding patterns had been changed, released, and optimised. We kept the number of feature toggles between the old and new platforms to a minimum. The only thing left to move was the main database, with a few config changes after migration.
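As an illustration of what “land it on the old platform first” can look like for something like caching (a hypothetical sketch – the interface, classes, and toggle are invented, not our actual code): callers depend on one tenant-scoped cache abstraction, a transitional toggle selects the backing store while the external cache is rolled out on the old platform, and once the external path is the only path, the toggle and the local implementation get deleted – which is how the number of toggles left at migration time stays small.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of landing a caching change on the old platform first:
// callers talk to one tenant-scoped cache interface, and a toggle picks the
// backing store. Once the external store is proven, the toggle (and the local
// implementation) is deleted, so the final migration has nothing left to change.
public final class TenantCacheSketch {

    interface TenantCache {
        Optional<String> get(String tenantId, String key);
        void put(String tenantId, String key, String value);
    }

    // Old-world backing: an in-process map, acceptable while single-tenant.
    static final class LocalTenantCache implements TenantCache {
        private final Map<String, String> entries = new ConcurrentHashMap<>();
        public Optional<String> get(String tenantId, String key) {
            return Optional.ofNullable(entries.get(tenantId + ":" + key));
        }
        public void put(String tenantId, String key, String value) {
            entries.put(tenantId + ":" + key, value);
        }
    }

    // New-world backing: an external cache service; the network calls are
    // stubbed out here to keep the sketch self-contained.
    static final class ExternalTenantCache implements TenantCache {
        public Optional<String> get(String tenantId, String key) {
            return Optional.empty(); // would read from the external cache
        }
        public void put(String tenantId, String key, String value) {
            // would write to the external cache
        }
    }

    // A transitional toggle, flipped per environment during rollout and then removed.
    static TenantCache forEnvironment(Supplier<Boolean> useExternalCache) {
        return useExternalCache.get() ? new ExternalTenantCache() : new LocalTenantCache();
    }

    public static void main(String[] args) {
        TenantCache cache = forEnvironment(() -> false); // still on the local path here
        cache.put("acme", "page:42", "<html>page 42</html>");
        System.out.println(cache.get("acme", "page:42").orElse("miss"));
    }
}
```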
Having said that, some of the things that slowed us down during the year between the first and last customer migrations were the places where we’d cut corners on externalising. We needed to migrate automated jobs to a new platform, but felt it was low-risk enough to couple to the final migration… then we hit performance issues on the scheduler and had to pause once we started migrating at scale. We also changed the system timezone during the final migration (the old architecture set the system timezone to the customer’s local time zone) – again, we thought this was low-risk, but encountered bugs early on.
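For context on why that timezone change bit us, here’s a simplified sketch using standard JDK APIs (not the actual Vertigo code): a single-tenant node can get away with setting the process-wide default timezone to its one customer’s zone, but a zero-affinity node serves many customers from the same JVM, so the zone has to travel with the request or tenant and be applied explicitly.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.TimeZone;

public class TimezoneSketch {

    public static void main(String[] args) {
        Instant now = Instant.now();

        // Old, single-tenant habit: one customer per JVM, so the process-wide
        // default timezone could simply be set to that customer's local zone.
        TimeZone.setDefault(TimeZone.getTimeZone("Australia/Sydney"));
        System.out.println("Implicit default zone: " + ZonedDateTime.now());

        // Multi-tenant, zero-affinity world: the same JVM serves many customers,
        // so a process-wide default is wrong for all but one of them. The zone
        // has to be carried per tenant (or per user) and applied explicitly.
        System.out.println(formatFor(now, ZoneId.of("Australia/Sydney")));
        System.out.println(formatFor(now, ZoneId.of("Europe/Berlin")));
        System.out.println(formatFor(now, ZoneId.of("America/New_York")));
    }

    // Hypothetical helper: format a timestamp in a specific tenant's zone
    // instead of relying on whatever the JVM default happens to be.
    static String formatFor(Instant instant, ZoneId tenantZone) {
        return DateTimeFormatter.RFC_1123_DATE_TIME.format(instant.atZone(tenantZone));
    }
}
```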
In retrospect, I wish we’d done the work to get around that scheduler quirk and decouple scheduler migration from the final cutover. I wish we’d changed the timezones on the old system before migration. If you’re in this position, where you have a key principle you trust like “minimise delta between old and new”, don’t follow it 9 out of 10 times. Dial that shit up to 11.
Engineering principle #4: Sprint to 100%
I’ve done a few big migration projects in my life. And I’ve seen more than one migration project where you get through the bulk of the work, and that last 1% turns out to be so difficult that it’s another year before you’re fully complete. Often it’s a series of many small, different problems. We call them “snowflakes” because each is unique and umm, “beautiful” in its own way.
As long as we had a single customer on the old platform, we still needed to keep the code paths, build pipelines, tests and CI, deployment pipelines, and customer provisioning infrastructure up and running. This was a drag on devspeed for every Confluence and Jira cloud developer. We could not tolerate a long tail of tough issues to resolve before we could stop deploying to the old platform. We needed to run as fast as we could until the work was completed to 100.0%. Not almost done, not approximately 100%, but completely done.
In practice, this meant we were very disciplined about tackling and removing the snowflakes early in the program. We spent time auditing all the configurations for customers on the old platform: system configuration and internal Atlassian plugin configuration could vary per customer on the old architecture, but needed to be consistent on the new one. Even the deployed version of Confluence or Jira was consistent for 99.9% of customers, but could vary widely among the 0.1% of outliers. We had folks focused for many months on cleaning up the surrounding parts of the old platform – e.g., resolving inconsistencies between our purchasing systems and the corresponding tenant management systems in infrastructure – to ensure there weren’t any hidden tenants that could again push out the tail of migration.
Does that clash with the above advice to be laser-focused on chasing the riskiest assumptions? Yes! And we had robust debates on that topic internally. But assuming that “we can wrap up the last 1% in a matter of days, not months” is a risky assumption in itself. After proving out the platform with the first few customers, the long tail was clearly one of our top remaining risks. You need to start investing in this early to get on top of it.
It’s also a motivator for the team. By the time the last customer had migrated, some folks had been on this for two and a half years. If you’ve ever been in that situation, you’ll know how crucial (and difficult) it is to keep motivation high. Knowing that the end is actually the end – and not just another stone to turn over, revealing more issues – is a powerful motivator for dev and leadership teams.
The investment in removing snowflakes paid off. After the first customer, we started attacking those snowflakes while implementing the remaining features and performance work for all customers. By the time we hit the home stretch of migration – the largest and trickiest customers – we were 100% free from snowflakes. It was liberating and motivating for the team to know that once they had tackled the work for those large customers, we would be 100% there. No long tail, no weird configs to resolve. Done.
Finally
If you’re in the cloud world and you have happy and growing customers, then you need to be always improving, reinventing, and optimising your systems. And most of the time, this improvement is going to involve some level (from small to colossal) of architecture and data migration.
We’ve tried to make this post as open and no-bullshit as possible. I hope the engineering principles and lessons outlined here can help you navigate whatever vertigo-inducing project comes your way next!