Nikita Vetoshkin
Posted on October 23, 2023
Software is eating the world, for better or worse. Among the many things we ask computers to do, automation stands out: taking over the most tedious, repetitive, and thus error-prone tasks humans perform. A single mistake in an everyday routine operation can take an Internet-scale business down for minutes or hours, take more than 24 hours to fully recover from, and incur millions of dollars in losses. Automating away most operations is thus a legitimate goal that saves operating expenses and improves business robustness.
On the other hand, automation is not free. It is an expensive and lengthy process and should be properly evaluated from both a business and an engineering perspective. Overengineering and building low-business-value Rube Goldberg machines can often be as bad as under-engineering and leaving crucial parts manual. Here I would like to describe a reasoning approach to such software engineering tasks. This approach is not a law of nature, but a useful abstraction that I came up with to guide decisions for me, my team, and the business I work for. Let’s start with a task that sounds like a good junior DevOps assignment:
Convert all call record files on a call center server from RAW to MP3 and back them up
Do it somehow
How do we start? Manually: employing the power of our brain, its experience, and modern Internet search. Our brain is good at decomposing tasks and addressing them one by one. We also constantly run a cost function, checking the effort and time spent against current progress and potential profits. After some time and some trial and error, we arrive at one of several possible results.
The first one: the task is impossible; there is no currently known workaround for a fundamental limitation. This is a good (if disappointing) result. If we were planning to bet our business on it, we now have data against doing so and can plan and act accordingly.
Another similar conclusion: the task is possible, but prohibitively slow and/or expensive. Again, armed with this data we can make better business decisions.
Best case: the task is possible after a series of manual steps, and here is the result. That is a great achievement. At this point it is worth asking yourself a question:
Do we ever need to repeat that?
Or, even better: what probability can we assign to a positive answer? If the answer is “no” or “probably not”, then we’re done. No need to spend time and resources on this, no need to move to the next step of automation. Software engineering projects usually have plenty of other things to work on.
If we’d like to persist and make this task easily repeatable, then it is time to move to the next step.
Write it down
According to the “Software Engineering at Google” book, software development is a team effort integrated over time, and one of its many aspects is preserving and sharing knowledge among org members. Right now only one person knows how to solve the task (or even whether it is solvable at all) and can repeat it: you. The bus factor is 1, and that is not a state we want our team to be in.
So let us put a HOWTO.txt in our project’s repo or write the exact steps on a wiki page called “136 easy steps to …”. Yes, as simple as that. What’s the profit?
We serialized our experience (probably double-checking it in the process) and put it on persistent storage, which is more reliable than a human brain in the long run. In a month, a year, or five, we can read the doc and repeat the task. In software we serialize data when we need to pass it around, to share it. The same applies directly here: we have just shared the knowledge, and someone else can carry out the task. Again, software engineering is a team game.
Another useful property worth stressing is the low precision requirement for such a document: it is meant to be interpreted by a human being, who handles inaccuracies and errors much more gracefully than, say, the Python interpreter. Many details can be omitted, meaning the doc can be written quickly and thus cheaply.
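For our task, such a doc might look like the sketch below. Everything in it is illustrative: the paths, the host names, and the assumption that the recorder produces 8 kHz 16-bit mono PCM (a common telephony format) all need to be checked against the real setup.

```
HOWTO.txt: convert and back up call records

1. ssh to the call center server.
2. cd /var/spool/call-records (path is an example, check your setup).
3. For each .raw file, convert it with ffmpeg, e.g.:
   ffmpeg -f s16le -ar 8000 -ac 1 -i call-0001.raw call-0001.mp3
4. Copy the resulting .mp3 files to the backup host:
   scp *.mp3 backup-host:/backups/call-records/
5. Spot-check that a few files play back correctly before
   removing the originals.
```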
Are we done? It is imperative to ask the following questions.
Is the task tedious enough to invest in it further?
If the task is a sequence of a couple of bash commands with a clear path for handling errors and dealing with changes, we’re probably done. If we need to wait hours for something to download, or juggle and copy-paste dozens of variables, certificates, and long unreadable hash strings (remember, humans are bad at that), it may be worth moving to the next step.
How long does it take to execute the task?
If it takes hours to execute (i.e. far longer than a typical human attention span), that is a good indicator that we (i.e. our business) will benefit from further automation.
Are the readers of our doc capable of following it?
What if a manager or a customer needs to follow the steps: can they complete them, confidently deal with imperfections, handle errors, and recover from missed steps?
Depending on the answers we might decide to move on to the next step.
Translate into a script
Usually all it takes is to follow the existing doc and translate it from a natural language into a stricter computer one. It can be as simple as a page-long bash script or an elaborate Python script employing the rich ecosystem of freely available libraries. Today (in 2023) some may even consider using Go for this task.
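For illustration, here is what such a translation might look like in Python. This is a minimal sketch, not the definitive implementation: it assumes ffmpeg is on the PATH and that the recordings are 8 kHz 16-bit mono raw PCM, so adjust the flags to your actual format.

```python
#!/usr/bin/env python3
"""Convert raw call records to MP3 and put them in a backup directory."""
import argparse
import pathlib
import subprocess
import sys


def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("src", type=pathlib.Path, help="directory with .raw files")
    parser.add_argument("dst", type=pathlib.Path, help="backup directory for .mp3 files")
    args = parser.parse_args()

    args.dst.mkdir(parents=True, exist_ok=True)
    failures = 0
    for raw in sorted(args.src.glob("*.raw")):
        mp3 = args.dst / (raw.stem + ".mp3")
        # Assumed input format: 8 kHz, 16-bit signed little-endian, mono PCM.
        result = subprocess.run(
            ["ffmpeg", "-y", "-f", "s16le", "-ar", "8000", "-ac", "1",
             "-i", str(raw), str(mp3)],
            capture_output=True,
        )
        if result.returncode != 0:
            print(f"failed to encode {raw}", file=sys.stderr)
            failures += 1
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```

Running it is now a single command, e.g. `python3 convert_calls.py /var/spool/call-records /backups/call-records` (the paths and script name are hypothetical).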
Let’s look at the profit:
- A script can accept parameters like a version to work with, where to put resulting artifacts, etc.
- We lowered the bar: executing the task is even easier, just check out (or download) and run.
- Thus it can be integrated into other automation pipelines like CI/CD.
- A script can log usage stats to allow data-driven assessment of its usefulness.
This is no small feat. We improved robustness, and now our team/org can execute the same task with only a couple of seconds of human attention: to start it and to check the results.
The script approach often works great for periodic, cron-style jobs. If we want to run the task daily or hourly, we install the script on the server (packaging it along with other server components) and configure the local cron daemon (or a systemd timer) accordingly. Done. Simple and stateless automation is easy to manage and debug.
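As an illustration (the schedule and paths are made up), a crontab entry for a nightly run could look like this:

```
# Run the conversion/backup script every night at 02:00
0 2 * * * /usr/local/bin/convert_calls.py /var/spool/call-records /backups/call-records
```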
On the flip side, the script approach falls short when we need to react to some external event: an RPC call, disk usage reaching a limit, etc. In other words, a script is not well suited for interactive or reactive tasks. The next shortcoming is the lack of state: if we need to keep TCP connections open or load lookup data from remote storage on each execution, a script may be too resource-hungry and slow.
Let it run in the background
If we take the next logical step and make our script stateful and long-running in the background, it becomes a daemon. With a daemon we can proactively react to local changes: timers, disk space, etc. And we can do that efficiently: connection state and an in-memory cache are at our disposal. Though state can be fragile to manage, we can keep the process single-threaded to continue keeping things simple. Do not forget: in our case, simple means “reliable”. The cost of having a daemon is higher: we need to carefully manage state and watch for memory leaks, issues we did not have with a script. That is why we need to carefully estimate, case by case, whether it is worth it.
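A minimal sketch of such a daemon in Python might look like the following. The polling loop, the paths, and the in-memory “already processed” set are all illustrative; a real daemon might use inotify instead of polling and persist its state across restarts.

```python
#!/usr/bin/env python3
"""Single-threaded daemon sketch: watch a spool directory, encode new files."""
import pathlib
import subprocess
import time

SPOOL = pathlib.Path("/var/spool/call-records")  # hypothetical paths
OUT = pathlib.Path("/backups/call-records")


def encode(src: pathlib.Path, dst: pathlib.Path) -> bool:
    # Assumed input format: 8 kHz, 16-bit signed little-endian, mono PCM.
    result = subprocess.run(
        ["ffmpeg", "-y", "-f", "s16le", "-ar", "8000", "-ac", "1",
         "-i", str(src), str(dst)],
        capture_output=True,
    )
    return result.returncode == 0


def main() -> None:
    done: set[str] = set()  # in-memory state: files we have already handled
    while True:
        for raw in SPOOL.glob("*.raw"):
            if raw.name in done:
                continue
            if encode(raw, OUT / (raw.stem + ".mp3")):
                done.add(raw.name)
        time.sleep(10)  # simple polling; inotify would react faster


if __name__ == "__main__":
    main()
```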
The daemon/process approach has its shortcomings too, in our Internet-scale times. First, it cannot handle a workload that does not fit on a single host. Modern servers can be extremely powerful, and oftentimes it is cheaper and faster to buy (or rent in your cloud) a beefier machine while keeping the software simple; we might add multiple threads to handle the load and utilize the additional cores. The upside here is that we can improve things gradually, having something that works and gets things done at each step.
Provide a service
If our task:
- fails to fit into a single machine
- needs to provide an external service (in our case: remote sound file encoding and storing)
- needs to be resistant to single-machine failures
then we can solve it by promoting our daemon into a service: adding an external API and making it discoverable and reachable via a service mesh, DNS, or other means (a minimal sketch follows after the downsides list below). Upsides:
- Flexible: task execution is not tied to a particular machine.
- Scalable: we can potentially add more replicas and scale the service horizontally.
- Reliable: we can be resilient to failures across multiple domains: server, rack, DC, etc.
There are downsides too:
- Increased complexity: we have a distributed system on our hands, with all its inherent complexity and potential issues.
- Increased cognitive load: we need to carefully design and evolve the API.
- Increased operational costs: we need to monitor whether our service is reachable over the network, protect it against DoS attacks, potentially isolate customers, etc. All of this leads to increased costs, which must be assessed against the profits.
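To make the idea concrete, here is a deliberately minimal sketch of such a service using only the Python standard library. The endpoint name, port, and audio format are assumptions; a production service would also need authentication, request limits, streaming for large files, and a threaded or async server rather than this single-threaded one.

```python
#!/usr/bin/env python3
"""Toy encoding service: POST raw PCM to /encode, receive MP3 back."""
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer


class EncodeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/encode":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        with tempfile.NamedTemporaryFile(suffix=".raw") as src, \
             tempfile.NamedTemporaryFile(suffix=".mp3") as dst:
            src.write(raw)
            src.flush()
            # Assumed input format: 8 kHz, 16-bit signed little-endian, mono PCM.
            result = subprocess.run(
                ["ffmpeg", "-y", "-f", "s16le", "-ar", "8000", "-ac", "1",
                 "-i", src.name, dst.name],
                capture_output=True,
            )
            if result.returncode != 0:
                self.send_error(500, "encoding failed")
                return
            mp3 = dst.read()
        self.send_response(200)
        self.send_header("Content-Type", "audio/mpeg")
        self.send_header("Content-Length", str(len(mp3)))
        self.end_headers()
        self.wfile.write(mp3)


if __name__ == "__main__":
    # Single-threaded toy server; port 8080 is an arbitrary choice.
    HTTPServer(("0.0.0.0", 8080), EncodeHandler).serve_forever()
```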
Breaking the rules
These steps are not a definitive recipe for success, but they do provide a framework for reasoning about the engineering tasks at hand. Throughout my career I’ve seen many attempts to skip ahead, ignoring a step or two to get to the final solution faster. Sometimes that works, but oftentimes it doesn’t, and we end up wasting a lot of time and effort. For example, given the same task, I might want to try my hand at API design and skip the boring manual and scripting parts. After days of intensive labour I arrive at a perfect set of API functions my encoding server will provide. Then I decide to build a tiny prototype to do the encoding… While there is only one good ending to this story, there are many not-so-“happily ever after” ones:
- Network costs to send uncompressed files are too large given the refined requirements (which were not there from the start, of course).
- There’s no spare server/capacity on premises to run this service on.
- The script solution fits the needs, and it was needed yesterday.
- The daemon solution we started with is leaking memory, and clients are not happy.
The list can go on and on. The message here is: be cautious, know the rules, and justify breaking them when needed. And that’s all, thank you.