The (non) Monorepo Mistake
Patrick Kelly
Posted on October 24, 2020
I messed up and am paying the price for it. Hopefully, others can learn from my mistake. Everything I've been working on should have been in a monorepo, and I didn't do that.
I did the opposite.
Let me just get this out there right now: this is not going to explain why you should use a monorepo. There's a good chance you should not. This is why I should have. And even more specifically, why I should have for only one project. I have other projects that should stay discrete repos.
The benefits of monorepos are well documented [1] [2] [3] [4]. So are the detriments [1].
Since I didn't, let's go over why I didn't, and then what I learned each time, following along with Matt Klein's excellent article:
Theoretical monorepo benefits...
Easier collaboration...
I actually disagreed with this for much the same reasons Klein disagrees with it. In fact, I still do. But I'd go further and add that very large repos are intimidating to collaborators and increase the burden of onboarding. There's part of what I learned that shoots a hole straight through this argument, though. Klein states "or to search through it using tools like grep ... must provide ... sophisticated source code indexing/searching/discovery capabilities as a service". Well, yeah, exactly! Any time I need to look through the .NET Core runtime, it's a huge PIA. It's hopeless to try to find your way and you have to rely on search tools. Break things apart and you can navigate manually far more easily. So this is a good thing, right?
The fallacy here is that discoverability is a function of modularity or the lack thereof, when it is in fact a function of the overall size of the codebase. Breaking things apart introduced an even bigger problem. With code split across repo boundaries, search tools would fail, and I was still left with navigation problems, just at a different level of abstraction. "What folder might this be in?" became "what repo might this be in?" I didn't solve anything, I just traded some problems for others. The problems I had before were already solved by in-editor tooling, and are therefore preferable.
Where he states "given that a developer will only access small portions ... is there any real difference between checking out a small portion or checking out multiple repositories...", I agree with this for the most part; it's more of a procedural than functional thing and has little significant effect on ops. He further goes on to state "In terms of source code indexing/...", which is absolutely correct. However, this requires contributors to be familiar with an additional tool, and even worse, a tool that exists outside of the editor they are familiar with, to navigate your overall project. With the monorepo, everything is searchable with the tooling you should already be familiar with, and, importantly, works the way you've gotten familiar with. You don't need additional tooling in this instance.
Single build...
No, he's entirely right here, from my experience. The build tools that work well for monorepos dgaf, and work just as well for submoduling or other polyrepo styles. Furthermore, there are also techniques to pull off a single build pipeline over multiple discrete repos in various DevOps solutions, as sketched below. This purported benefit is highly dubious.
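For what it's worth, here's a rough sketch of how a single Azure Pipelines build can check out several discrete repos in one job; the org, repo names, paths, and service connection here are hypothetical placeholders, not my actual setup.

# azure-pipelines.yml: one pipeline, several repos (sketch)
resources:
  repositories:
    - repository: Defender                  # alias used by the checkout step below
      type: github
      name: example-org/Defender            # hypothetical org/repo
      endpoint: example-github-connection   # hypothetical GitHub service connection
    - repository: Collectathon
      type: github
      name: example-org/Collectathon
      endpoint: example-github-connection

steps:
  - checkout: self                          # the repo that owns this pipeline
  - checkout: Defender                      # each repo lands in its own folder under $(Build.SourcesDirectory)
  - checkout: Collectathon
  - script: dotnet build Defender           # hypothetical layout; build whatever spans the repos
    workingDirectory: $(Build.SourcesDirectory)

So a "single build" across polyrepos is perfectly workable; it's just more ceremony than pointing one pipeline at one repo.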
Code refactors...
He rightfully points out a fallacy, but also falls prey to the fallacy of scale himself. I still question the efficacy of massive monorepos like Google's. However, in my case we're talking about a small monorepo where it's completely reasonable to check out the entire thing on a single machine.
But I'd also like to address a few points. "a developer can clone and edit the entire codebase ... how often does that actually happen?" In my case, actually quite often, as adding a new feature may involve adding subfeatures in numerous locations. As I'm adding a serialization framework right now, this involves adding files to Defender, Collectathon, and Philosoft, because the overall feature spans multiple scopes. And because serialization is useless without the ability to stream the serialized data, whether to disk or across the network, it also needs to be added to the streams API I've written. Whether the core stream API remains in Stringier is still unknown, but regardless, this feature spans the entire overall project. "They can give up, and work around the API issue...": I was actually doing this far more with things split up polyrepo style. As I merge things into the monorepo, I'm finding numerous cases where I've done this and am syncing everything back up.
Unique monorepo downsides
Tight coupling...
He opens with "a monorepo encourages tight coupling", to which I can say: yes, I believe it does. However, that's why I'm trying to find a very careful boundary for this, because I have code that is very tightly coupled.
But tight coupling is bad!
Hol' up. Let me explain real quick what's going on here, and we'll address this point again. Stringier, Collectathon, Defender, and others are part of an overall project where I'm developing a language runtime and standard library for a DSL I've needed. That language is being developed with cross-language compatibility in mind, which is part of why it's being developed with .NET languages: CLS compliance. A language runtime is inherently tightly coupled. That being said, reducing coupling is always a good thing, and I've been fighting to find various boundaries. Historically, Stringier was broken up into multiple subprojects, but I've been undoing that. However, the aforementioned projects are hard boundaries that I've identified and have not struggled with since. This is why I can do things like write an article about using guard clauses and how Defender offers them, without having to explain how this ties directly to, or depends on, other things. Because of modularity and CLS compliance, these components are available to other consumers like C# and F#.
So is coupling bad? Well, there's a sweet spot. I could potentially go through an insane amount of decoupling where each individual function is its own project with its own artifact that's published as a NuGet package. But that's insane and imposes an inordinate amount of additional work on me. Some level of coupling enables efficient compatibility. Let's look at it this way: if a contractor built your house without any bolts or nails you'd be concerned, and "but coupling is bad and I've ensured everything is modular" would be anything but reassuring. But you can definitely couple way too much as well. If you walked into your new home and your furniture was bolted to the floor, that would be equally concerning for entirely different reasons. There's an ideal amount of coupling, and too much or too little is problematic in its own way.
Klein then says, "If an organization wishes to create or easily consume OSS, using a polyrepo is required." This one in particular bugs me. Scale might be important here, because I found the exact opposite: I shot myself in the foot by splitting things up. Where I originally thought I would be making things easier to navigate and discover, I actually trashed the traffic to my repos. I can safely say trashed because, as I merge things back into a single repo, I'm back up to the traffic numbers I was at before.
VCS scalability
"Scaling a single VCS to hundreds of developers, hundreds of millions lines of code, and a rapid rate of submissions is a monumental task.", completely agreed, although I'd add that scaling anything to that size is a monumental task.
Other arguments
We're done with Klein's article; the rest has to do with other things I've read or heard that shaped my decision, which I'm now backtracking on.
Dependencies
Okay, this is straight-up bullshit, right? Dependabot can keep polyrepo dependencies up to date no problem.
And it can. In the majority of cases, good use of Dependabot completely defuses this purported benefit. But that's missing a key thing in my case, one I've never once heard brought up.
So what's my case? Analyzers. I make extensive use of analyzers for code review. There's a lot of them that are suppressed, sure, either at the local level because of a specific exception, or at the project level because of disagreement with that specific analyzer. In any case, there's a justification that goes along with each suppression. In many of these cases, that justification is that the issue is complex and requires human review to adequately assess. The majority of analyzers still stand, diligently doing their job of helping me manage a codebase that is absurdly large, and still growing, for a single person.
As I learned of more and more useful analyzers, I'd add them to the projects. With the polyrepo setup, these analyzers had to be added to each project within each repo. This is obviously less than ideal and meant the analyzers were out of sync. Not in that the version numbers were out of sync; no, Dependabot kept that in order. But rather, different projects were using different analyzers. Some are quick to jump on the fact that Visual Studio can install many of these analyzers as extensions and apply them to any project. This has several faults. Firstly, it assumes I use Visual Studio, which I do. Secondly, it assumes I only use Visual Studio, which I don't. Thirdly, it assumes I'm going to keep my extensions in sync across all my Visual Studio installs, which is just the exact same problem at a different level.
How does the monorepo help in this instance? Directory.Build.props and Directory.Build.targets! These inject, at various times, properties into all projects within that directory. Put at the top level, they ensure every project is using the same analyzers. This had other unintended benefits of course, such as being able to set properties for the repo URL, project URL, copyright date, and others universally, keeping those in sync as well.
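To make that concrete, here's a minimal sketch of a top-level Directory.Build.props; the analyzer packages, versions, and URLs are placeholders standing in for whatever set you actually use, not my exact file.

<Project>
  <PropertyGroup>
    <!-- Shared package metadata, kept in sync for every project in the repo. -->
    <RepositoryUrl>https://github.com/example/monorepo</RepositoryUrl>
    <PackageProjectUrl>https://example.github.io/monorepo</PackageProjectUrl>
    <Copyright>Copyright © 2020</Copyright>
  </PropertyGroup>
  <ItemGroup>
    <!-- Every project under this directory gets the same analyzers. -->
    <PackageReference Include="StyleCop.Analyzers" Version="1.1.118" PrivateAssets="all" />
    <PackageReference Include="Microsoft.CodeAnalysis.FxCopAnalyzers" Version="3.3.0" PrivateAssets="all" />
  </ItemGroup>
</Project>

Drop that at the root, and every new project picks all of it up the moment it's created; there's nothing to remember to copy over.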
Documentation
I never see this brought up, ever. Documentation is far easier to deal with in the monorepo setting. Just add a single project, remove the template sources you probably had, add a docfx dependency, build it once to set up the scaffolding, and then configure it as appropriate. Now every single build will build your docs, keeping them in sync. This might be a bit tedious for development purposes, so what I do is configure it to not build the docs in Debug mode, and only build them in Release mode. Since I upload NuGet packages in Release mode, this ensures the docs are built and up to date for the releases.
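As a rough sketch of that configuration split (not my exact project file): if memory serves, the docfx.console package respects a BuildDocFx property, so the docs project can gate the docs build on configuration along these lines, with the version and target framework here being placeholders.

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>netstandard2.0</TargetFramework>
    <!-- Skip the (slow) docs build during day-to-day Debug builds. -->
    <BuildDocFx Condition="'$(Configuration)' != 'Release'">false</BuildDocFx>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="docfx.console" Version="2.56.2" PrivateAssets="all" />
  </ItemGroup>
</Project>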
As I mentioned, however, I've been merging everything into a monorepo. So I wasn't using a monorepo before. How'd I deal with docs?
Cries in DevOps
There were three repos dealing with documentation: one held articles, one held the built documentation, and one held the documentation project. There was a build pipeline that would check out the documentation project, clone the articles repo into it, clone the built documentation into it, clone all the actual projects into it, build the documentation, commit the built documentation, and then push it. Note that this implicitly requires everything I mentioned for the monorepo case above.
I suppose this is a bit easier for me to see, since I'm wearing so many hats in this project. The monorepo isn't only about the developer. Documentation was immensely simpler and easier to keep in sync in the monorepo case.
Testing
Specifically, I'm gonna talk about test coverage. Coverage is easier to calculate, and the numbers are more correct, when you have all the sources present. That's all.
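As a sketch: with everything in one solution, a couple of commands cover the whole thing, assuming the test projects reference coverlet.collector and ReportGenerator is installed as a dotnet global tool (the solution name here is made up).

dotnet test Everything.sln --collect:"XPlat Code Coverage"
reportgenerator -reports:**/coverage.cobertura.xml -targetdir:coverage-report

No stitching together numbers from a pile of separate repos, and nothing gets missed because its sources happened to live elsewhere.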
Summary
In summary, this was the right decision for me, although I'm sure there'll be refinements as time goes on, just as there have already been. No project is the same, and it takes experience to determine what exactly works best for any given project.
Is this the right decision for you? Well, I'm nowhere near presumptuous enough to tell you what you should do. If my reasons make sense to you then maybe it is. If my reasons seem like I'm an idiot, then maybe I am, and maybe you shouldn't use a monorepo. These are my reasons, and I'm only explaining my reasons.