EDA for AWS operations
Jakub Wołynko
Posted on March 22, 2024
Welcome
Great that you're here. Today, I will write a bit about Event-Driven Ansible and its possibilities for improving the operational efficiency of our AWS environment. But wait, what does it mean by Event-Driven Ansible? My favorite sentence from the documentation is "EDA provides a way of codifying operational logic," so in short words, we can code logic for responding to different events. For example, we can trigger playbook execution with Lambdas make our self-service process a bit more ops-oriented, or introduce ChatOps.
Before we begin
Native AWS services?
Seems that could be a question here. Why not use System Manager, as it is a dedicated service for operations? Or OpsWorks? Or even Lambdas. Let’s break it down now.
System Manager
As AWS has many complementary solutions with quite similar functionality, we will start with the most well-known, and easy to use. System Manager is agent agent-based solution for managing a whole fleet of instances, no matter whether cloud or on-prem, if an agent is installed System Manager can be used. Additionally, with the usage of Automation, we can use more than 370 pre-defined tasks, written with PowerShell, Bash, or Python, and that is great. If we wanted to schedule servers update, we can do it, and execute this operation on the whole fleet at the same time(which is not the best idea), or on part based on tags and groups.
So what’s the challenge? System Manager isn’t declarative, If we would like to add a custom ssm-document for creating a new folder, we need to write for example:
---
schemaVersion: '2.2'
description: "State Manager mkdir example"
parameters: {}
mainSteps:
- action: aws:runShellScript
name: make /opt/test
inputs:
runCommand:
- sudo mkdir /opt/test && chmod -R 0440 && chown admin. /opt/test
It’s fine, but it's like taking a step back. For service restart, it’s a great solution, but refreshing our system or using complex functions becomes even more complex.
OpsWorks
The solution was designed to serve managed Chef/Puppet to customers, unfortunately, all of them will reach End of Life withe the end of May 2024. During the time of writing this article (1-half of March), you can read about it on the public service page. OpsWorks.
So as a summary, nice solution unfortunately based on Chef/Puppet, not a SaltStack, also the idea of stacks could be a blocker for a multi-cloud environment.
Lambdas
Lambdas and the whole serverless ecosystem were called a “new big thing”, a lot of people advocate for using function as much as possible and everywhere. After a while, we realized, that Lambdas-based architecture is very hard to keep clean and simple, also can easily make our bill much higher. That is why we started using them when it makes sense. For example image processing, data processing, etc, but most importantly, functions can act as a glue, for everything, connecting services that aren’t natively integrated, executing tasks, or even stopping our instance, when some criteria weren’t met. The thing is that it will be quite hard and not very time-efficient to use Lambdas for service restart tasks. The main reason is that functions operate on different logic layers, let’s say hypervisor level. However, server-level operation could be a challenge.
Initial infrastructure
First, we need to set up our base infrastructure and install the needed packages.
Let’s assume, that our infrastructure will be very simple. We will use just for simplification three small VMs, all of which will be placed in a private subnet, one will be acting as a bastion host, for communication with our ansible-runner, just to keep the whole setup a bit more “real world” and secure.
As it’s quite a straightforward task I will use AWS CDK for that, please take a look at the user-data part, which is a place where we’re installing EDA components:
userData.addCommands(
`dnf --assumeyes install git maven gcc java-17-openjdk python3-pip python3-devel`,
`echo 'export JAVA_HOME=/usr/lib/jvm/jre-17-openjdk' >> /etc/profile`,
`pip3 install wheel ansible ansible-rulebook ansible-runner`,
`echo 'export PATH=$PATH:/usr/local/bin' >> /etc/profile`,
`mkdir /opt/ansible`,
`git clone https://github.com/3sky/eda-example-ansible /opt/ansible/eda-example-ansible`,
`cd /opt/ansible/eda-example-ansible && /usr/local/bin/ansible-galaxy collection install -r collections/requirements.yml -p collections`,
`chown -R ec2-user:ec2-user /opt/ansible/*`,
);
When the stack is ready we can connect with a regular .ssh/config
file.
Host jump
PreferredAuthentications publickey
IdentitiesOnly=yes
IdentityFile ~/.ssh/id_ed25519_eda
User ec2-user
# this one could change
# as its public IP
Hostname 3.126.251.7
Host runner
PreferredAuthentications publickey
IdentitiesOnly=yes
ProxyJump jump
IdentityFile ~/.ssh/id_ed25519_eda
User ec2-user
Hostname 10.192.0.99
Host compute
PreferredAuthentications publickey
IdentitiesOnly=yes
ProxyJump jump
IdentityFile ~/.ssh/id_ed25519_eda
User ec2-user
Hostname 10.192.0.86
With that being done, we can focus on popular use cases.
EDA configuration
As today’s article is an EDA introduction, we will focus on basic http listener and basic service configuration, next article will be focused on more complex event sources like MSK or watchdogs.
Server configuration
First let’s use a basic server setup, based on documentation:
---
- name: Listen for events on a webhook
hosts: all
## Define our source for events
sources:
# use Python range function
- ansible.eda.range:
limit: 5
# use webhook with path /endpoint
- ansible.eda.webhook:
host: 0.0.0.0
port: 5000
# use kafka
- ansible.eda.kafka:
topic: eda
host: localhost
port: 9092
group_id: testing
Seems rather straight, right? First, we’re using the keyword sources
, then with typical Ansible syntax, we’re providing the FQDN of the module. Also as you can see there are multiple event sources, all of which are described here. In the case of introduction, I will be using web-hook endpoint.
Rules configuration
Now let’s define our first rules:
---
- name: Listen for events on a webhook
hosts: all
sources:
# use webhook with path /endpoint
- ansible.eda.webhook:
host: 0.0.0.0
port: 5000
# define the conditions we are looking for
rules:
- name: Rule 1
condition: event.payload.message == "Hello"
# define the action after meeting the condition
action:
run_playbook:
name: what.yml
# As it's a list of dicts, we can add another rule
- name: Rule 2
condition: event.payload.message == "Ahoj"
action:
run_playbook:
name: co.yml
Execution phase
Great! Let’s assume, that both of our run_playbook
looks very similar, and are simple:
---
- hosts: localhost
connection: local
tasks:
- debug:
msg: "World"
Now, we’re good to go, and execute our ansible-rulebook
command:
ansible-rulebook --rulebook eda-example.yml -i inventory.yml --verbose
Then let’s open another terminal window(or login into bastion
host) and execute the testing http request:
curl -H 'Content-Type: application/json' -d "{\"message\": \"Test\"}" 10.192.0.92:5000/endpoint
Inside the log we can see:
2024-03-20 20:42:26,859 - aiohttp.access - INFO - 10.192.0.92 [20/Mar/2024:20:42:26 +0000] "POST /endpoint HTTP/1.1" 200 158 "-" "curl/7.76.1"
Then, let’s try met out conditional, by changing our message to “Hello”, and next to “Ahoj”.
2024-03-20 20:44:39,199 - aiohttp.access - INFO - 10.192.0.92 [20/Mar/2024:20:44:39 +0000] "POST /endpoint HTTP/1.1" 200 158 "-" "curl/7.76.1"
PLAY [localhost] ***************************************************************
TASK [Gathering Facts] *********************************************************
ok: [localhost]
TASK [debug] *******************************************************************
ok: [localhost] => {
"msg": "World"
}
PLAY RECAP *********************************************************************
localhost : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
2024-03-20 20:44:41,923 - ansible_rulebook.action.runner - INFO - Ansible runner Queue task cancelled
2024-03-20 20:44:41,924 - ansible_rulebook.action.run_playbook - INFO - Ansible runner rc: 0, status: successful
2024-03-20 20:46:14,240 - aiohttp.access - INFO - 10.192.0.92 [20/Mar/2024:20:46:14 +0000] "POST /endpoint HTTP/1.1" 200 158 "-" "curl/7.76.1"
PLAY [localhost] ***************************************************************
TASK [Gathering Facts] *********************************************************
ok: [localhost]
TASK [debug] *******************************************************************
ok: [localhost] => {
"msg": "Svet"
}
PLAY RECAP *********************************************************************
localhost : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
2024-03-20 20:46:16,620 - ansible_rulebook.action.runner - INFO - Ansible runner Queue task cancelled
2024-03-20 20:46:16,621 - ansible_rulebook.action.run_playbook - INFO - Ansible runner rc: 0, status: successful
Now we have a working example, of working EDA. Let’s focus on something more useful.
Example use cases
As you probably notice already it’s more like REST API to Ansible, and you’re right. It could do that as well, as web-hook is an event as well. So let’s focus on basic administration tasks.
---
- name: Listen for events on a webhook
hosts: all
sources:
- ansible.eda.webhook:
host: 0.0.0.0
port: 5000
- name: Restart Server
condition: event.payload.task == "restart"
action:
run_playbook:
name: reboot.yml
- name: Reload Nginx
condition: event.payload.task == "reload-nginx"
action:
run_playbook:
name: reload-nginx.yml
With the following rulebook, we’re able to restart the server if needed or reload nginx with the usage of REST API. That can be easily connected with ITSM systems like Jira or ServiceNow. However please note that there is no filter or tags implemented - based on documentation, which means, we need a dedicated playbook for a particular group of hosts, or issues.
Summary
EDA seems to be a more and more popular topic among many organizations, which is why in my opinion is a good time to take a look at it. To make it even simpler to start with I prepared a whole ready-to-use project that contains:
- playground based on AWS
- all the playbooks and rules included in this article
- step-by-step README file
That is why please use this article just like an introduction, the whole magic should take place on your AWS account. Please play with rules and use-cases, the thing about possibilities which this pattern can deliver in your organization, or which aspect can improve. If you have any questions or need help don't hesitate and contact me.
Posted on March 22, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.