Amazon Security Lake: Centralized Data Management for Modern DevSecOps Toolchains

dvdmelamed

David Melamed

Posted on February 8, 2024

AWS introduced its Amazon Security Lake service in May 2023 as the heir to AWS CloudTrail Lake: a purpose-built security data lake that extends the capabilities, services, sources, analysis, and transformation the CloudTrail data lake provides for security management. While researching this service, which is steadily gaining adoption, I stumbled upon the roundup below, which offers a good comparison between the two. In this post, I'd like to dive into the Amazon Security Lake capabilities, explain why this is an excellent new service for AWS-based operations looking to power up their security engineering, and wrap up with a useful example of how to get started.

[Image: Amazon Security Lake vs. AWS CloudTrail Lake comparison]
Source: https://isaaczapata.notion.site/Data-Lake-Dilemma-Amazon-Security-Lake-vs-AWS-CloudTrail-Lake-54ce57e4045b4de5adedc3e3696eead7

Why Do We Need Another Data Lake?

If we look at the current AWS service catalog, there are quite a number of data sources we leverage on a day-to-day basis to power our cloud operations: S3, CloudTrail, Route 53, VPC Flow Logs, AWS Lambda, Security Hub, as well as third-party tooling and services. Each of these data sources relies on its own proprietary format and fields. Normalizing this data makes it possible to provide additional capabilities on top, such as dashboarding and automation, which are becoming increasingly important for security management and visibility.

This is something we learned early on when building our own DevSecOps platform, which ingests data from multiple tools and then visualizes the output in a unified dashboard. Every vendor and tool has its own syntax and proprietary data format. When looking to apply product security in a uniform way, one of the first challenges we encountered was how to normalize and align the data from several best-of-breed tools into a single schema, source, and platform.

[Image]

Our cloud operations today are facing the same challenge. The question is - how do we do this at scale?

This is exactly the problem a security data lake is designed to solve.

Amazon Security Lake provides a unification service that knows how to ingest the logs and data from myriad sources, whether native AWS services, integrated SaaS products, internal homegrown custom sources, or even on-prem systems. It takes these sources' output (for example, Security Hub findings in the AWS Security Finding Format, ASFF), normalizes it into the OCSF schema that is the backbone of Amazon Security Lake, and stores it as Apache Parquet files in S3.

AWS is betting heavily on OCSF, the Open Cybersecurity Schema Framework: an open source project launched by Splunk, built upon Symantec's ICD Schema, and one that AWS contributes to significantly. OCSF provides a vendor-agnostic, unified schema for security data management, and the idea is for it to become the common framework for security data that organizations today require.
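
To make this concrete, below is a rough sketch of what a single finding looks like once it is mapped to OCSF's Security Finding class. The field names follow OCSF conventions, but the values and exact nesting here are illustrative assumptions rather than an authoritative mapping:

# Illustrative OCSF-style record (Python dict). Field names follow the OCSF
# Security Finding class; the values and exact nesting are assumptions.
ocsf_finding = {
    "class_uid": 2001,        # Security Finding
    "category_uid": 2,        # Findings
    "severity_id": 4,         # High
    "time": 1707350400000,    # event time, epoch milliseconds
    "cloud": {"provider": "AWS", "region": "us-east-1"},
    "metadata": {
        "product": {"name": "Security Hub", "vendor_name": "AWS"},
        "version": "1.0.0",
    },
    "finding": {
        "uid": "generic-api-key-src/config.py",
        "title": "Hardcoded secret detected by Gitleaks",
        "desc": "generic-api-key",
    },
}

Whatever the original tool called these fields, every source lands in the lake under the same names, which is what makes cross-source queries possible later on.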

Getting Started: Security Data Lake in Action

Once the data is normalized and unified into the OCSF schema (for custom sources, this can be achieved by leveraging an ETL service like Glue), it is partitioned and stored in Parquet format in S3, and any number of AWS services can be leveraged for additional data enrichment. These include Athena for querying the data, OpenSearch for search and visualization capabilities, and even tools like SageMaker for machine learning to detect patterns and anomalies.

You can even bring your own analytics and BI tools for deeper analysis of the data. The security data ingested from the many supported sources is stored in a flexible, column-based format, which keeps it economical, bypasses the need to load entire datasets into memory for a query, and makes it possible to connect analytics and BI tools as subscribers on top of the lake. (A caveat: the service itself is free, but you will pay on a consumption basis for all the rest of the AWS tooling: S3, Glue, Athena, SageMaker, and so on.)
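
As a quick illustration of what bringing your own tooling can look like, a subscriber with data access can read the Parquet objects straight out of the lake's S3 bucket. This is only a sketch: the bucket and prefix below are made-up placeholders (Security Lake generates its own bucket names and partitioned prefixes in your account), and it assumes pandas, pyarrow, and s3fs are installed:

# Minimal sketch: read a Security Lake Parquet partition directly with pandas.
# Requires: pip install pandas pyarrow s3fs
# The bucket and prefix are hypothetical placeholders; use the paths that
# Security Lake actually created in your account.
import pandas as pd

lake_path = "s3://aws-security-data-lake-us-east-1-EXAMPLE/aws/SH_FINDINGS/region=us-east-1/"

df = pd.read_parquet(lake_path, engine="pyarrow")
print(df.columns.tolist())
print(df.head())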

Another important benefit is compliance monitoring and reporting on a global scale. This data lake makes it possible for organizations with many engineering groups and regions to apply the service globally. Engineering organizations with many accounts and regions will not have to configure this 50 separate times in each account, but can do it a single time by designating a rollup region. This means you can roll up all of your global organizational data into a single ingestion feed into your security data lake.
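
For a rough idea of what this looks like programmatically, the boto3 securitylake client exposes a create_data_lake call that accepts a replication configuration pointing at a rollup region. The parameter shapes below reflect my reading of the API at the time of writing and have changed between SDK versions, so treat this as an assumption-laden outline and check the current boto3 documentation; the role ARNs and account ID are placeholders:

# Sketch only: enable Security Lake in a contributing region (eu-west-1) and
# roll its data up into us-east-1. Parameter names and shapes are assumptions
# based on the securitylake API at the time of writing; verify against the
# current boto3 docs before use. ARNs below are placeholders.
import boto3

securitylake = boto3.client("securitylake", region_name="us-east-1")

response = securitylake.create_data_lake(
    metaStoreManagerRoleArn="arn:aws:iam::123456789012:role/SecurityLakeMetaStoreManager",
    configurations=[
        {
            "region": "eu-west-1",
            "replicationConfiguration": {
                "regions": ["us-east-1"],  # the rollup region
                "roleArn": "arn:aws:iam::123456789012:role/SecurityLakeReplication",
            },
        },
        {"region": "us-east-1"},
    ],
)
print(response)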

[Image]

What is unique is that once the data is partitioned and stored in this format, it becomes easily queryable and reusable for many data enrichment purposes. Security Lake essentially makes it possible to centralize security data at scale on both the source level and the infrastructure level: your own cloud workloads and data sources, custom and on-prem resources, SaaS providers, as well as multiple regions and accounts.

As a strategic new service for AWS, Security Lake launched with more than 50 out-of-the-box integrations from security vendors, from Cisco to Palo Alto Networks, CrowdStrike, and others, to help support its adoption and applicability to real engineering stacks.

A DevSecOps Application of the Security Data Lake

In order to understand how you can truly harness the power of the AWS Security Lake, we'd like to walk through a short example that captures just the tip of the iceberg of what this security lake makes possible.

In this example, we'll demonstrate how to use the security data lake with one of the most popular security tools, Gitleaks, for secret detection. We will use GitHub Actions to add Gitleaks to our CI/CD pipeline and detect secrets.

Once our CI/CD runs, it sends the findings to Security Hub, which is also auto-configured to send data to our Security Lake. This data is stored in an S3 bucket, and the Glue ETL service is leveraged to transform the ingested ASFF findings into the OCSF schema. A Glue crawler monitors the S3 bucket, and the data, once transformed, is registered in the Glue Data Catalog, which holds the database schema. This data is now queryable via Athena to extract important information, such as secrets detected in certain workloads.

[Image: the pipeline, from Gitleaks in GitHub Actions through Security Hub and Security Lake to Glue and Athena]

The Repo

The repo consists of a simple Gitleaks example, including planted secrets to detect, that demonstrates how the scan works and how the results are sent to Security Hub.

[Image: the demo repository]
Link: https://github.com/security-lake-demo/gitleaks-to-security-hub/tree/main

Configuring Gitleaks

Next, we configure a GitHub Actions workflow that runs Gitleaks and sends the detected secrets to AWS Security Hub:

name: Gitleaks Scan

on:
  push:
    branches:
      - main
permissions:
  id-token: write   # This is required for requesting the JWT
  contents: read 
jobs:
  gitleaks_scan:
    runs-on: ubuntu-latest
    env:
      AWS_ACCESS_KEY: ${{ secrets.AWS_ACCESS_KEY }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}      
    steps:
    - name: Checkout code
      uses: actions/checkout@v3

#     - name: configure aws credentials
#       uses: aws-actions/configure-aws-credentials@v2.0.0
#       with:
#         role-to-assume: arn:aws:iam::950579715744:role/security-lake-demo-github-action
#         role-session-name: GitHub_to_AWS_via_FederatedOIDC
#         aws-region: "us-east-1"
#       # Hello from AWS: WhoAmI

#     - name: Sts GetCallerIdentity
#       run: |
#         aws sts get-caller-identity

    - name: Install Gitleaks
      run: |
        wget https://github.com/gitleaks/gitleaks/releases/download/v8.17.0/gitleaks_8.17.0_linux_x64.tar.gz
        tar -xzvf gitleaks_8.17.0_linux_x64.tar.gz
        chmod +x gitleaks

    - name: Run Gitleaks
      run: |
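        # --exit-code 0 keeps this step green even when leaks are found, so the upload step below still runs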
        ./gitleaks detect -v --report-format json --redact --no-git --source . --report-path report.json --exit-code 0

    - name: Upload to Security Hub
      run: |
        pip install boto3==1.27.0 pydantic==2.0.1
        python ./upload_data_to_security_hub.py



Link: https://github.com/security-lake-demo/gitleaks-to-security-hub/blob/main/.github/workflows/gitleaks.yml 
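
For reference, each entry in the Gitleaks JSON report looks roughly like the record below; the upload script relies on the RuleID, File, and Fingerprint fields. The values shown here are illustrative:

[
  {
    "Description": "Generic API Key",
    "File": "src/config.py",
    "StartLine": 12,
    "Secret": "REDACTED",
    "RuleID": "generic-api-key",
    "Fingerprint": "src/config.py:generic-api-key:12"
  }
]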



The Security Hub Schema
The Security Hub finding schema (ASFF) is easy to work with from simple Python code: the script below models the finding shape, transforms the Gitleaks report into it, and uploads the results:

import json
from datetime import datetime
import os
import boto3

# AWS Credentials
# Make sure you've set these up in your environment
region_name = 'us-east-1'  # set your AWS region
account_id = '950579715744'

from pydantic import BaseModel
from typing import Optional, List


class AwsSecurityHubFinding(BaseModel):
    SchemaVersion: str
    Id: str
    ProductArn: str
    GeneratorId: str
    AwsAccountId: str
    Types: List[str]
    FirstObservedAt: str
    LastObservedAt: str
    CreatedAt: str
    UpdatedAt: str
    Severity: dict
    Title: str
    Description: str
    Resources: List[dict]
    SourceUrl: Optional[str]
    ProductFields: Optional[dict]
    UserDefinedFields: Optional[dict]
    Malware: Optional[List[dict]]
    Network: Optional[dict]
    Process: Optional[dict]
    ThreatIntelIndicators: Optional[List[dict]]
    RecordState: str
    RelatedFindings: Optional[List[dict]]
    Note: Optional[dict]

def read_report():
    with open("report.json") as f:
        return json.load(f)

def transform_gitleaks_output_to_security_hub(data):
    output = []
    for record in data:
        output.append({
            'SchemaVersion': '2018-10-08',
            'Id': record['RuleID'] + "-" + record['File'],
            'ProductArn': f'arn:aws:securityhub:{region_name}:{account_id}:product/{account_id}/default',
            'Types': [record['RuleID']],
            'GeneratorId': 'gitleaks',
            'AwsAccountId': account_id,
            'CreatedAt': datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3] + 'Z',
            'UpdatedAt': datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3] + 'Z',
            'Severity': {'Label': 'HIGH'},
            'Title': record['Fingerprint'],
            'Description': record['RuleID'],
            'Resources': [{'Type': 'Other', 'Id': record['File']}]
        })
    return output

if __name__ == '__main__':
    securityhub = boto3.client('securityhub',
                               aws_access_key_id=os.environ.get("AWS_ACCESS_KEY"),
                               aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
                               region_name=region_name)

    # Get the report
    data = transform_gitleaks_output_to_security_hub(read_report())

    # Then use the AWS SDK
    response = securityhub.batch_import_findings(
        # Findings=[finding.dict()]
        Findings=data
    )

    print(response)

Link: https://github.com/security-lake-demo/gitleaks-to-security-hub/blob/main/upload_data_to_security_hub.py

Detected secrets in action:

[Image]

You can then navigate to Security Hub and see the findings there:

[Image: the Gitleaks findings in Security Hub]

[Image]
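
The same findings can also be pulled back programmatically. Here is a minimal sketch using boto3's Security Hub get_findings API, filtering on the GeneratorId value set by the upload script (credentials and region are assumed to be configured in the environment):

# Minimal sketch: fetch the Gitleaks findings back out of Security Hub.
import boto3

securityhub = boto3.client("securityhub", region_name="us-east-1")

response = securityhub.get_findings(
    Filters={
        "GeneratorId": [{"Value": "gitleaks", "Comparison": "EQUALS"}],
    },
    MaxResults=25,
)

for finding in response["Findings"]:
    print(finding["Severity"]["Label"], finding["Title"])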

While this is useful for visualization and for confirming that our configuration is working as expected, the queries available in Security Hub are basic, and it's not possible to enrich the data. We want to be able to know whether this secret, in the context of our own systems, is even interesting and needs to be prioritized for remediation.

Let’s navigate to the Security Lake.

In our Security Lake, it’s possible to see all of the configured sources:

[Image: the configured sources in Security Lake]

Next, we can search for the Athena service and find our data source.

[Image]

We locate our data source and can then see all of the tables we are able to query; each data source has its own table.

[Image: the Security Lake tables, one per data source]

We then run our query to find high-severity secrets in a specific region.
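
Programmatically, such a query can be run through the Athena API. In the sketch below, the database and table names follow Security Lake's naming convention for Security Hub findings but are assumptions for this example, as are the column names and the results bucket, so adjust them to what you see in your own Glue Data Catalog:

# Sketch: query the Security Lake table for high-severity findings via Athena.
# Database, table, and column names are assumptions; check your Glue Data Catalog.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT time, severity, finding.title
FROM amazon_security_lake_table_us_east_1_sh_findings_1_0
WHERE severity = 'High'
  AND region = 'us-east-1'
LIMIT 25
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "amazon_security_lake_glue_db_us_east_1"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # placeholder
)

query_id = execution["QueryExecutionId"]
state = "RUNNING"
while state in ("QUEUED", "RUNNING"):
    time.sleep(1)
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]

for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])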

Image description

And we can see the resulting output:

[Image: the query results]

With the data sources now available in a single queryable location, cloud workload data alongside DevSecOps toolchain output, it's now possible to run complex queries on everything from IP reputation to severity. With all of the many findings our tooling outputs and alerts on today, it's now possible to narrow them down to the relevant context and prioritize remediation.

Why Security Data Lake is Exciting

The security data lake is set to help with security data format heterogeneity. By creating a single, unified standard, it becomes easier for developers to leverage, enrich, and build upon this data, and likewise to test and launch services on top of it.

By providing a scalable solution for both the data sources and the global resource coverage, engineering organizations can apply data enrichment capabilities across services, tooling, and regions, providing greater context and correlation of security findings. All of this simplifies compliance monitoring and reporting, programmability, and automation, which together make for more resilient and robust DevSecOps programs for engineering organizations.
