Building CI/CD for Vertex AI pipelines: The real world
Oleksandr Borodavka
Posted on September 20, 2022
Our first solution for a CI/CD implementation is a great start for us. It works well when you do not change many things at the same time, the team is small, and updates are not frequent. However, the real workflow is usually not so simple. Let's take a look at some situations that can happen if we stay with the first implementation.
Concurrency mess
Issue
Let's look at what we get if we open a pull request with several changes. For instance, we have two changed components and one pipeline, and the pipeline contains both of these components.
In this case, GitHub Actions runs three workflows in parallel. Since both components are related to the pipeline, the pipeline is rebuilt after each of them, so the same job runs three times. We also have no guarantee about the order of execution; it can happen in any order. As in the diagram, the pipeline may first be built with component2 and then with component1, so in the end we get a pipeline with only the updated component1, and none of the builds contains both updated components.
Solution
We can avoid such situations by running the processes in a predefined order. In our case, there are no component-to-component or pipeline-to-pipeline relations, only component-to-pipeline ones. So, if we could first run all processes for components and only then all processes for pipelines, the final build would be guaranteed to be correct.
Our solution is as follows: within one GHA workflow we do the following steps:
- Analyze the list of changed files in a pull request
- Find changed components
- Find changed pipelines
- Find pipelines related to changed components
- Run jobs for all components
- Run jobs for all pipelines
It can be done with the matrix strategy and some additional logic in the code. The matrix strategy is a GitHub Actions feature that lets you run many jobs in parallel for a matrix of parameters. In our case, we can extract the list of changed components/pipelines and run all the necessary tasks for them with a single job definition.
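Here is a minimal sketch of how such a workflow could look. The script name, make targets, and paths are illustrative assumptions, not the actual project layout; the important parts are the job outputs feeding the matrices and the needs dependency that forces all component jobs to finish before any pipeline job starts. A real setup would also need to handle the case where one of the lists is empty.

name: ci
on: pull_request

jobs:
  detect_changes:
    runs-on: ubuntu-latest
    outputs:
      components: ${{ steps.diff.outputs.components }}
      pipelines: ${{ steps.diff.outputs.pipelines }}
    steps:
      - uses: actions/checkout@v3
      - id: diff
        # Hypothetical script: analyzes the PR diff and writes
        # components=<JSON list> and pipelines=<JSON list> to $GITHUB_OUTPUT.
        run: python scripts/detect_changes.py >> "$GITHUB_OUTPUT"

  build_components:
    needs: detect_changes
    runs-on: ubuntu-latest
    strategy:
      matrix:
        component: ${{ fromJson(needs.detect_changes.outputs.components) }}
    steps:
      - uses: actions/checkout@v3
      # Placeholder build step for a single component.
      - run: make build-component COMPONENT=${{ matrix.component }}

  build_pipelines:
    # Pipelines are built only after every component job has finished.
    needs: [detect_changes, build_components]
    runs-on: ubuntu-latest
    strategy:
      matrix:
        pipeline: ${{ fromJson(needs.detect_changes.outputs.pipelines) }}
    steps:
      - uses: actions/checkout@v3
      # Placeholder build step for a single pipeline.
      - run: make build-pipeline PIPELINE=${{ matrix.pipeline }}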
Doing all the jobs in one workflow also eliminates the duplicate builds. And we gain one more important outcome: we no longer need to manually add and maintain workflows for every pipeline and component as we did in the first implementation. Everything is built on the fly.
Many Pull Requests
Issue
Great, but what if we have two or more open pull requests? Each PR runs its jobs in the required order; that is already solved in the previous step. However, since we store all the configurations and docker images in the cloud, they are shared resources. It means that one CI/CD process can interfere with another, overwrite common resources, and so on.
Here we have a mess again. In addition, if component1 from PR1 is buggy and happens to be built between component1 and pipeline1 from PR2, then PR2 will fail just because of a bug in PR1. Or even worse, in the reverse situation a buggy component can pass because its image was overwritten by a good one.
Solution
Probably the simplest solution here is to deny concurrent CI/CD jobs and just process one pull request at a time:
Pull Request 1
-> Pull Request 2
-> ...
With GitHub Actions, it can be done with the concurrency feature, which allows us to define concurrency groups to ensure that only a single job or workflow using the same group runs at a time.
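In a workflow file this takes just a few lines; the group name below is only an example:

# Sketch: all CI/CD runs share one concurrency group, so runs triggered by
# different pull requests queue up instead of executing in parallel.
concurrency:
  group: vertex-ai-cicd
  cancel-in-progress: false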
Yes, it can become a bottleneck in the future, especially if the builds take significant time. But it solves the issue and works for us now. It is also quite easy to implement, so let's go further.
Versioning
Issue
Another dangerous area is the versioning of docker images. We use containerized components, which allows us to work with them as with independent applications, increases reusability, and so on. It means that a ready-to-use component consists of two entities: a specification file and a docker image.
Let's look at what can happen if we always use the image:latest version of the docker image.
Situation 1
For instance, we already have a pipeline specification:
{
  "pipelineSpec": {
    ...
    "deploymentSpec": {
      "component1": {
        "image": "gcr.io/components/component1:latest"
        ...
We do not want to change anything, we just want to use it to run the pipeline, perhaps automatically on a schedule. At the same time, there can be an ongoing CI/CD job that has already rebuilt some of the components while the rest are still in progress. The outcome can be unpredictable.
Situation 2
Say two developers work on the same component. Tom updates component1 and then moves on to component2, and at this moment Bet updates component1 as well. When Tom builds pipeline1, the right version of component2 will be used, but the latest image of component1 has been overwritten by Bet. Tom will receive something unexpected as a result. It is fine if the difference between these parallel changes is obvious; otherwise, it can take a lot of time to understand what is wrong.
Solution
By pinning the concrete versions of components with image:version and doing only one CI/CD job at a time, we will be fine in the Staging and Production environments. The situation in the Development environment is still tricky. As an option, we could provide a unique dev environment for each developer. However, that would increase the cost and require additional work to set up and maintain a variety of environments.
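One common way to get such unique versions (an assumption on our side, not necessarily what the project uses) is to tag every component image with the commit SHA instead of :latest, for example in a workflow step like this; the local path is illustrative:

# Sketch: tag the component image with the commit SHA so the pipeline spec
# can reference an exact, immutable version instead of :latest.
- name: Build and push component1
  run: |
    IMAGE="gcr.io/components/component1:${GITHUB_SHA}"
    docker build -t "$IMAGE" components/component1
    docker push "$IMAGE"

The pipeline specification would then point to gcr.io/components/component1:<sha> instead of gcr.io/components/component1:latest.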
As a solution for dev, we can add another CLI command (or extend the existing one) to rebuild the whole pipeline with all related components.
In this way, we avoid the occasional overwriting of components: everything is built from the current code and used locally, without the risk of interference from someone else's build.
Conclusion
Here we have considered some important aspects that are not so obvious at the start. However, ignoring them can make the CI/CD processes unstable and unpredictable.
Next time we will apply this knowledge to get a new version of CI/CD.
To be continued...