Building a Scalable Audit Log System with AWS and ClickHouse
Sebastian
Posted on November 26, 2024
Audit logs might seem like a backend feature that only a few people care about, but they play a crucial role in keeping things running smoothly and securely in any SaaS or tech company. Let me take you through our journey of building a robust and scalable audit log system. Along the way, I’ll share why we needed it, what exactly audit logs are, and how we combined tools like AWS, ClickHouse, and OpenAPI to craft a solution that works like a charm.
The Case of the Disappearing Configuration
At epilot, we’ve encountered a frustratingly familiar scenario. A customer reaches out, upset that one of their workflow configurations has mysteriously vanished. Their immediate question? “Who deleted it?”—and the assumption is that someone on our team is responsible.
Now here’s the tricky part: how do we, as engineers, figure out who did what and when?
One obvious approach is to dive into the application logs. But here’s the catch: most production logs aren’t enabled by default. Even when they are, they’re often sampled, capturing only about 10% of the actual traffic, and they frequently lack the information we actually need. We’re left piecing together incomplete data, like trying to solve a puzzle with half the pieces missing.
What Are Audit Logs Anyway?
Audit logs provide clear visibility into system changes, aiding teams in investigations, diagnosing incidents, and tracing unauthorized actions. They empower admins by reducing support reliance and ensuring clarity on actions like role or workflow updates. For enterprise customers, audit logs are a critical, expected feature that supports compliance with standards like ISO 27001. They also lay the groundwork for enhanced threat detection capabilities in the future. In simple terms, audit logs help answer the following questions:
- WHO is doing something? Typically a user or a system (an API call).
- WHAT is that user/system doing?
- WHERE is it occurring from? (e.g. an IP address)
- WHEN did it occur?
- WHY? (optional) “Why did the user log in?” → we don’t know. “Why is this user’s IP blocked?” → they logged in 5 times with the wrong password.
Key Considerations for a Successful Audit Log System
Before diving into the technical details, it’s crucial to define what makes an audit log system effective. While the exact requirements depend on your company’s domain, there are some universal points worth considering:
- Compliance: Ensure the system adheres to regulations like GDPR. For example, customers may request the deletion of personal data, so you’ll need a straightforward way to erase all logs tied to a specific customer (see the sketch after this list).
- Sustainability: Audit logs grow rapidly, especially in high-traffic systems. Storing them indefinitely may not be feasible, so decide on strategies for archiving or purging logs over time.
- Permissions: Define who is allowed to access audit logs to maintain security and privacy.
- Format: Standardize the structure of your logs to ensure they’re easy to interpret and query.
- Data Selection: Carefully determine which actions and events are worth logging, so the system can answer critical questions without unnecessary noise.
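To make the compliance and sustainability points concrete: in a SQL-based store (we ended up on ClickHouse, as described later in this post), both retention and customer erasure can be expressed as plain statements. A rough sketch using the official @clickhouse/client package; the table and column names are illustrative, not our production schema:

import { createClient } from '@clickhouse/client'

const clickhouse = createClient({ url: process.env.CLICKHOUSE_URL })

// Sustainability: expire rows automatically after the retention window
await clickhouse.command({
  query: `ALTER TABLE audit_logs MODIFY TTL toDateTime(timestamp) + INTERVAL 2 YEAR`,
})

// Compliance: erase all logs tied to a specific customer on request (GDPR)
await clickhouse.command({
  query: `ALTER TABLE audit_logs DELETE WHERE org_id = 'customer-123'`,
})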
Making It Happen: How We Built Our Audit Logs
At epilot, our APIs are built around serverless components provided by AWS. From the outset, we recognized that AWS API Gateway events provided a rich source of information for building audit logs. These events capture critical details such as user identities, actions performed (through the request payload), IP addresses, headers, and more.
Given our microservices architecture, where services are organized by domain and accessed through an API Gateway (see our system architecture), we needed a solution that seamlessly integrated with this structure.
High-Level Overview
Our approach to audit logging can be summarized as:
- Capturing events asynchronously.
- Validating and transforming raw events into a standard format.
- Persisting the data in a read-only, scalable, and query-friendly storage system.
This design adheres to several key technical principles:
Asynchronous Event Capture
We use Amazon SQS to decouple event capture from the main HTTP request flow. For example, when a user creates a new workflow configuration, the relevant API Gateway event is pushed to an SQS queue by middleware wrapping the API. This ensures that audit logging does not introduce latency or affect the performance of the core application logic.
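Under the hood, the forwarding step is tiny. A minimal sketch of what the middleware does, using AWS SDK v3; the helper name and queue URL env var are assumptions, not the actual package internals:

import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs'
import type { APIGatewayProxyEvent } from 'aws-lambda'

const sqs = new SQSClient({})

// Hypothetical helper: serialize the raw API Gateway event and queue it,
// keeping audit logging out of the synchronous request/response path
export const forwardToAuditQueue = async (event: APIGatewayProxyEvent) =>
  sqs.send(
    new SendMessageCommand({
      QueueUrl: process.env.AUDIT_QUEUE_URL!, // assumed env var
      MessageBody: JSON.stringify(event),
    })
  )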
From Raw to Standardized Events
Our focus is on capturing system modifications, specifically HTTP methods like POST, PUT, PATCH, and DELETE. These provide meaningful insights into changes occurring within the system. GET requests, on the other hand, generate excessive noise and are generally excluded—though we offer an opt-in mechanism for services where logging GET requests adds value.
A Lambda function processes raw API Gateway events from the SQS queue, transforming them into a structured and validated format. This includes filtering relevant data, enhancing it using metadata like OpenAPI specifications, and ensuring consistency across all logged events.
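As a sketch of what that transformation might look like (the target fields are assumptions; the real schema is richer and validated against OpenAPI metadata):

import type { APIGatewayProxyEvent, SQSHandler } from 'aws-lambda'

// Assumed standardized shape -- illustrative only
interface AuditLogEntry {
  userId?: string
  method: string
  path: string
  ipAddress: string
  timestamp: string
}

// Hypothetical persistence helper (see the ClickHouse section below)
declare function persist(entries: AuditLogEntry[]): Promise<void>

export const handler: SQSHandler = async (event) => {
  const entries: AuditLogEntry[] = []
  for (const record of event.Records) {
    const raw: APIGatewayProxyEvent = JSON.parse(record.body)

    // GET requests are filtered out upstream by default; guard again here
    if (raw.httpMethod === 'GET') continue

    entries.push({
      userId: raw.requestContext.authorizer?.principalId, // WHO
      method: raw.httpMethod,                             // WHAT
      path: raw.path,
      ipAddress: raw.requestContext.identity.sourceIp,    // WHERE
      timestamp: new Date(raw.requestContext.requestTimeEpoch).toISOString(), // WHEN
    })
  }
  if (entries.length > 0) await persist(entries)
}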
Data Persistence
For storing audit logs, we chose ClickHouse, a highly scalable, SQL-based database that aligns with our requirements:
- Read-only access: Supports immutability to preserve data integrity.
- Scalability: Proven in our data lake setup to handle large volumes of data efficiently.
- Querying: SQL capabilities allow for precise filtering and analysis, which is more complex with alternatives like DynamoDB.
By leveraging ClickHouse, we ensure a robust and scalable foundation for our audit logs, simplifying future integrations and analysis.
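For a feel of what this looks like in practice, here is a sketch using the official @clickhouse/client package. The table layout is illustrative, not our production schema: an append-only MergeTree table, sorted for per-customer time-range queries, with a TTL covering the retention requirement from earlier:

import { createClient } from '@clickhouse/client'

const clickhouse = createClient({ url: process.env.CLICKHOUSE_URL })

// Illustrative DDL: append-only table, ordered for per-org time-range scans
await clickhouse.command({
  query: `
    CREATE TABLE IF NOT EXISTS audit_logs (
      org_id     String,
      user_id    String,
      method     LowCardinality(String),
      path       String,
      ip_address String,
      timestamp  DateTime64(3)
    )
    ENGINE = MergeTree
    ORDER BY (org_id, timestamp)
    TTL toDateTime(timestamp) + INTERVAL 2 YEAR
  `,
})

// Batched insert of standardized entries coming off the queue
export const persist = async (entries: Record<string, unknown>[]) =>
  clickhouse.insert({
    table: 'audit_logs',
    values: entries,
    format: 'JSONEachRow',
  })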
Integration
To make audit logging effortless for our microservices, we focused on seamless integration. At epilot, we rely heavily on middy, a middleware engine used across all our services. Building on this, we introduced a new middleware: withAuditLog.
import { withAuditLog } from '@epilot/audit-log'
import middy from '@middy/core'
import type { Handler } from 'aws-lambda'
export const withMiddlewares = (handler: Handler) => {
  return middy(handler)
    .use(enableCorrelationIds())
    // ...other shared middlewares
    .use(
      withAuditLog({
        ignorePaths: ['/v1/webhooks/configs/{configId}/trigger']
      })
    )
}
This middleware integrates directly into existing services and simplifies the audit logging process by:
- Capturing API Gateway Events: It hooks into the request lifecycle to extract the API Gateway event details.
- Omitting GET Requests by Default: To reduce noise, it filters out GET requests, with an option to opt them in for specific services where needed.
- Forwarding to SQS: Its primary role is to forward the event to an SQS queue for asynchronous processing.
With this middleware, adding audit logging to any microservice is as simple as including withAuditLog in the service's middleware stack and granting it the sqs:SendMessage permission. This ensures consistency, reduces implementation effort, and keeps the integration process dead simple.
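And for the rare service where read access is worth auditing, GET logging can be opted in. The option name here is hypothetical, purely to illustrate the shape of the escape hatch:

withAuditLog({
  // Hypothetical opt-in: also log reads on selected, sensitive endpoints
  includeGetPaths: ['/v1/permissions/assignments'],
})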
Technical Considerations
This article focuses on our high-level approach to building audit logs, as there are numerous ways to tackle the problem, each with its trade-offs. During our research, we explored alternatives like EventBridge for emitting events at the end of each request or Kinesis for streaming data. Ultimately, we chose a solution that met our key requirements: decoupling log emission from the main flow while offering flexibility in managing throughput and batching.
Here’s why we chose SQS:
Decoupling from the Main Flow
SQS allows us to process audit logs asynchronously, ensuring that the main HTTP request flow remains unaffected. This means audit log processing won’t slow down user-facing operations.
Flexibility with Throughput and Batching
With SQS, we can fine-tune parameters like long-polling and batch windows to optimize throughput without compromising efficiency. This ensures scalable and reliable processing regardless of traffic spikes.
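As an illustration of those knobs, here is roughly how the queue-to-Lambda wiring could look in AWS CDK; the values and helper are assumptions, not our actual infrastructure code:

import { Duration } from 'aws-cdk-lib'
import type { Queue } from 'aws-cdk-lib/aws-sqs'
import type { Function as LambdaFn } from 'aws-cdk-lib/aws-lambda'
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources'

// Hypothetical wiring: consume up to 100 queued events at once, or whatever
// has accumulated after 30 seconds -- whichever comes first
export const attachAuditConsumer = (fn: LambdaFn, queue: Queue) =>
  fn.addEventSource(
    new SqsEventSource(queue, {
      batchSize: 100,
      maxBatchingWindow: Duration.seconds(30),
    })
  )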
Scalability for POST/PUT/PATCH/DELETE Events
Since we exclude GET requests by default, the system handles fewer, more meaningful events. Capturing GET requests would multiply the event volume, and the resulting Lambda scale-out on the audit queue could eat into the account's shared concurrency pool and interfere with other services that also run on Lambda.
Exposing Audit Logs to Users
To make audit logs accessible and actionable, we introduced a new SST-based microservice that acts as a bridge to query data from ClickHouse. This microservice provides a simple and intuitive interface for users to explore their audit logs.
Key Features:
- Search and Filtering: A user-friendly search bar allows users to combine filters effortlessly, enabling them to pinpoint specific events or patterns within the logs.
- Activity Messages: Each audit log entry includes an activity message, a concise summary of what occurred. This message is dynamically constructed on the API side, tailored to the specific service name, making it customizable and relevant.
By customizing the activity messages for each service, users can quickly understand what happened in their systems without wading through raw data. This tailored approach ensures that the audit logs deliver immediate value and clarity to the end users.
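A sketch of what the query side of that microservice might look like, including a naive activity-message builder; the function names, filter shape, and columns are illustrative:

import { createClient } from '@clickhouse/client'

const clickhouse = createClient({ url: process.env.CLICKHOUSE_URL })

interface AuditLogRow {
  user_id: string
  method: string
  path: string
  timestamp: string
}

// Parameterized query: ClickHouse binds {name: Type} placeholders server-side
export const searchAuditLogs = async (orgId: string, method?: string) => {
  const result = await clickhouse.query({
    query: `
      SELECT user_id, method, path, timestamp
      FROM audit_logs
      WHERE org_id = {orgId: String}
        ${method ? 'AND method = {method: String}' : ''}
      ORDER BY timestamp DESC
      LIMIT 100
    `,
    query_params: method ? { orgId, method } : { orgId },
    format: 'JSONEachRow',
  })
  return result.json<AuditLogRow>()
}

// Naive activity message; the real one is tailored per service name
const toActivityMessage = (row: AuditLogRow) =>
  `${row.user_id} performed ${row.method} on ${row.path} at ${row.timestamp}`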
Summary
In this article, we detailed the design and implementation of our audit log system at epilot, highlighting the key decisions and considerations that shaped its architecture. Our approach leverages AWS serverless components to seamlessly integrate audit logging into our microservices, ensuring scalability, efficiency, and ease of use.
- Capturing Events: Using a custom middleware, withAuditLog, we extract API Gateway events asynchronously and forward them to an SQS queue, ensuring the logging process does not block the main application flow.
- Processing and Storing Logs: A Lambda function transforms raw events into a standardized format, focusing on meaningful system modifications (POST, PUT, PATCH, DELETE), and stores them in a scalable, SQL-based ClickHouse database.
- User Accessibility: A new SST-based microservice provides a simple interface for querying and filtering logs. Tailored activity messages enhance usability, helping users quickly understand what occurred.
- Technical Considerations: SQS was chosen for its ability to decouple the logging process, optimize throughput, and handle scalability challenges. While alternatives like EventBridge or Kinesis were viable, SQS met our specific requirements most effectively.
This high-level overview provides a flexible, scalable, and user-friendly solution for audit logging while ensuring system integrity and maintaining performance.
Do you want to work on features like this? Check out our career page or reach out to me on Twitter.