How I used Amazon Q Developer to move faster with data

094459

Ricardo Sueiras

Posted on July 17, 2024

How I used Amazon Q Developer to move faster with data

Spending our limited time and resources on the right things, is something that has obsessed me for over two decades. Using data and looking for signals, have been my go to mechanism to achieve this, and are the perfect corroborating partners for your instincts and judgement (never ignore these!). One example is how I review usage of code repositories, to find out the traction and impact, and help identify future content or demos to work on.

I currently use a simple lambda function (see code here)) that grabs various metrics of interest - how many times a repo has been viewed, cloned, as well as who has referred any traffic. It uses the standard GitHub API, and then stores the data in CloudWatch.

{"eventType": "ReferralPath", "repo_name": "094459-amazon-q-git-demo", "path": "/094459/094459-amazon-q-git-demo", "title": "GitHub - 094459/094459-amazon-q-git-demo: A simple Java app that shows how Am...", "count": 3, "uniques": 2}
{"eventType": "View", "repo_name": "094459-amazon-q-git-demo", "count": 6, "uniques": 3}
{"eventType": "Clone", "repo_name": "094459-amazon-q-git-demo", "count": 5, "uniques": 4}
Enter fullscreen mode Exit fullscreen mode

Why did I do this I can hear some of you ask. GitHub only stores data for fourteen days, so if you wanted to get any insights into how your code repos are performing, then you need to implement something like this. I have built up dashboards using CloudWatch Insights, which has been a super frugal and simple way to generate reports that I have been using for a few years now.

Example CloudWatch Insight dashboard of GitHub activity

The need to move - extracting more insights from the data

My current solution helped me address my primary use case, storing my repo usage beyond the fourteen day window. Over the past eighteen months, it has worked pretty well, and I really love the power and simplicity that CloudWatch Logs together with CloudWatch Log Insights provides.

Over time however, my needs have changed and I find that I want to do more with this data. I wanted to keep/retain the data where it currently is, but also make it available to share this data with other systems, applications, or people. I also want to open up access to that data so that I can use a broader set of SQL and other data tools, that would allow me to explore that data in ways that perhaps I had not been able to to in CloudWatch. What I needed was an approach that would allow me to take that daily data, and then make it available via standard SQL so that I could run queries and build reports against those, perhaps using data visualisation tools like Amazon Quiksight or OpenSearch (or even a simple front end web application that I get Amazon Q Developer to help me build).

Note! I am currently creating an updated Lambda Function that will directly store the data outside of CloudWatch logs, but first wanted to experiment with what that might look like. The value of being able to quickly prototype and experiment with new ideas is something that Amazon Q Developer has really helped me with. Across different AWS data services, to different approaches to storing the data, and to suggested SQL to create data structures, Amazon Q Developer will help me iterate faster and then get to a place that helps me solve my problem in the best way. I find this totally awesome!

This is actually something that I have been wanting to do for some time, but it always felt that this would take up a lot of time. Working with and manipulating data is something that I probably find the hardest thing to do, and I think this contributed to me always putting this off! I think what changed things for me was that I was able to use Amazon Q Developer to help me close that knowledge gap and provide me with more confidence in working with this data. In fact, I was able to put together a solution in less than four hours!

So what was the magic this time? There is no magic, although sometimes, if you like Arthur C Clark's definition of magic, using Amazon Q Developer can seem like the best kind of magic there is. In this post I am going to share how I started from data structures, and from there was able to use Amazon Q Developer to help me put together code that would allow me to update that data so I could then use in other AWS data services. Whilst this is a very simple example, I hope that it will get you thinking about the possibilities of how you can apply using Amazon Q Developer to do similar.

Lets dive right into it.

The approach

In the past, I learned it is critical to know WHAT questions you want to answer, before you start to think about design and technical decisions. In the context of working with coding assistants like Amazon Q Developer, this is even more so, as those questions helps provide useful context for your prompts. In my case I wanted to be able to generate graphs that showed both view and clone data across each repositories, I wanted to identify the top performing repositories, I was hoping to get insights into trending patterns, as well as see if there were any referring sites that were consistently driving traffic to those repos.

With that in mind I turned to my approach. Whilst I have been using CloudWatch to store my logging data for as long as I can remember, I would not say that I am up to speed with all the possibilities it provides. So my first port of call is going to gather information that will help me both provide options, and then based on what I want to do, narrow those down. In my head at least, I had this idea that I could generate standardised data output (for example, csv or similar) files from the original source data that was in CloudWatch Logs. Once I had that data, I would then ask Amazon Q Developer to see if it could provide some recommendations into what approaches I should consider, following up with suggestions around potential AWS data services that I should consider.

So with some ideas and some questions, I turn to my trusted companion, Amazon Q Developer.

Exporting CloudWatch log data

The first thing I ask Amazon Q Developer is a simple prompt:

Amazon Q Developer prompt "If I want to export my CloudWatch log data, what are my options?"

I am surprised by some options that I had not considered or know about. (Try this for yourself and see what you get). Here is my output (summarised)

  1. Export to Amazon S3
  2. Export via Subscription filters
  3. Export to Third-Party Log Management Services
  4. Export to CloudWatch Logs Insights

As I mentioned before, one of the things that Amazon Q Developer has provided me with is the confidence to experiment more when approaching new capabilities within AWS that I have never used before. I think this is a great example, as three of these are new to me.

I spent around five to ten minutes using Amazon Q Developer to help me with the first one, using a prompt

Amazon Q Developer prompt "Provide me with a step by step guide on how I can take an existing CloudWatch log group and export it to an s3 bucket".

In no time, I have a new gz file, which when I open up provides me with some very familiar looking data:

2024-07-02T15:14:16.347Z Getting views for building-data-pipelines-apache-airflow
2024-07-02T15:14:16.527Z {'eventType': 'View', 'repo_name': 'building-data-pipelines-apache-airflow', 'count': 27, 'uniques': 8}
2024-07-02T15:14:16.527Z b'{"eventType": "View", "repo_name": "building-data-pipelines-apache-airflow", "count": 27, "uniques": 8}'
2024-07-02T15:14:16.527Z Getting clones for building-data-pipelines-apache-airflow
2024-07-02T15:14:16.699Z b'{"eventType": "Clone", "repo_name": "building-data-pipelines-apache-airflow", "count": 2, "uniques": 2}'
24-07-02T15:14:16.699Z Getting referral data for cdk-mwaa-redshift
2024-07-02T15:14:16.883Z {'eventType': 'Referral', 'repo_name': 'cdk-mwaa-redshift', 'referrer': 'Google', 'count': 20, 'uniques': 2}
2024-07-02T15:14:16.883Z b'{"eventType": "Referral", "repo_name": "cdk-mwaa-redshift", "referrer": "Google", "count": 20, "uniques": 2}'
2024-07-02T15:14:16.883Z {'eventType': 'Referral', 'repo_name': 'cdk-mwaa-redshift', 'referrer': 'blog.beachgeek.co.uk', 'count': 3, 'uniques': 2}

Enter fullscreen mode Exit fullscreen mode

Experimentation leads to unlikely insights sometimes, and one of the things I noticed by reviewing the output was that the code that was writing data to CloudWatch logs was duplicating data - you can see it in the above example in the second and third lines. This means that one of the things I was going to have to do is clean up the raw data before I was going to be able to use this (and yeah, try and figure out in the source code why it was doing this too - but one thing at a time!).

One of the things I had not considered was that I could use my existing CloudWatch Logs Insights queries via the command line. Amazon Q Developer provided me with some helpful examples, and a few minutes later I was up and running.

LOG_GROUP_NAME="/aws/lambda/github-traffic-cron"
START_TIME=$(date -d "yesterday 00:00" +%s)000  # Linux
START_TIME=$(date -j -f "%Y-%m-%d %H:%M:%S" "$(date -v-1d +"%Y-%m-%d 00:00:00")" +%s)000 #MacOS
END_TIME=$(date +%s)000  # Current time
QUERY_STRING="fields @timestamp, @message | filter @message like /b'/  | filter eventType = 'Clone'"
aws logs start-query \
    --log-group-name "$LOG_GROUP_NAME" \
    --region=eu-central-1 \
    --start-time "$START_TIME" \
    --end-time "$END_TIME" \
    --query-string "$QUERY_STRING"  \
    --output text \
    --query 'queryId' \
    > query_id.txt

cat query_id.txt

9b6cb741-17ca-4387-8ac1-de65822ac52b
Enter fullscreen mode Exit fullscreen mode

When I then run the following, again provided by Amazon Q Developer

QUERY_ID=9b6cb741-17ca-4387-8ac1-de65822ac52b
aws logs get-query-results \
    --query-id "$QUERY_ID" \
    --region=eu-central-1 \
    --cli-binary-format raw-in-base64-out \
    --output text | sed 's/\x1e/,/g' > logs.csv
Enter fullscreen mode Exit fullscreen mode

Which provided me with something that looked familiar

RESULTS @ptr    Cn0KQAosNzA0NTMzMDY2Mzc0Oi9hd3MvbGFtYmRhL2dpdGh1Yi10cmFmZmljLWNyb24QAiIOCICDztaHMhCI7pWZiDISNRoYAgZiLvg6AAAAAbEoDuIABmhrviAAAAaiIAEorse88ocyMLSNvfKHMjhkQO14SK4pUMYhGAAgARA5GAE=
RESULTS @timestamp  2024-07-04 15:13:58.270
RESULTS @message    b'{"eventType": "Clone", "repo_name": "ada-python-demo-app", "count": 3, "uniques": 3}'

RESULTS @ptr    Cn0KQAosNzA0NTMzMDY2Mzc0Oi9hd3MvbGFtYmRhL2dpdGh1Yi10cmFmZmljLWNyb24QAiIOCICDztaHMhCI7pWZiDISNRoYAgZiLvg6AAAAAbEoDuIABmhrviAAAAaiIAEorse88ocyMLSNvfKHMjhkQO14SK4pUMYhGAAgARAwGAE=
RESULTS @timestamp  2024-07-04 15:13:57.402
RESULTS @message    b'{"eventType": "Clone", "repo_name": "active-directory-on-aws-cdk", "count": 1, "uniques": 1}'

RESULTS @ptr    Cn0KQAosNzA0NTMzMDY2Mzc0Oi9hd3MvbGFtYmRhL2dpdGh1Yi10cmFmZmljLWNyb24QAiIOCICDztaHMhCI7pWZiDISNRoYAgZiLvg6AAAAAbEoDuIABmhrviAAAAaiIAEorse88ocyMLSNvfKHMjhkQO14SK4pUMYhGAAgARAhGAE=
RESULTS @timestamp  2024-07-04 15:13:56.645
RESULTS @message    b'{"eventType": "Clone", "repo_name": "094459-amazon-q-git-demo", "count": 4, "uniques": 3}'

RESULTS @ptr    Cn0KQAosNzA0NTMzMDY2Mzc0Oi9hd3MvbGFtYmRhL2dpdGh1Yi10cmFmZmljLWNyb24QAiIOCICDztaHMhCI7pWZiDISNRoYAgZiLvg6AAAAAbEoDuIABmhrviAAAAaiIAEorse88ocyMLSNvfKHMjhkQO14SK4pUMYhGAAgARAOGAE=
STATISTICS  114774.0    32.0    706.0
Enter fullscreen mode Exit fullscreen mode

I also spent some time looking at subscription filters in CloudWatch, again using Amazon Q Developer to help answer questions on this functionality. I think it might be a useful part of a solution, to help specifically clean up the data. Something to think about - but it got me thinking I perhaps need to tweak my prompt. Whilst these approaches provided some good alternatives, I wanted something that allowed me more programatic control. I ask a slightly different prompt

Amazon Q Developer prompt " If I want to export my CloudWatch log data, what are my options? I want to do this programatically, and run this on a daily schedule."

I then get slightly different responses

  1. Use AWS Lambda with CloudWatch Events
  2. Use the AWS CLI to create an export task
  3. Use AWS Step Functions
  4. Use AWS Batch

I can also ask some additional questions to understand trade offs or ask Amazon Q Developer to help me prioritise on specific requirements (maybe cost for example, or a specific AWS region I might be interested in). This is the prompt I end up with:

Amazon Q Developer prompt "If I wanted to export my CloudWatch log data, what are my options? I want to do this programatically, and run this on a daily schedule. Can you prioritise this list for 1/simplicity, 2/cost, 3/availability in the eu-central-1 region. I am a Python developer, with some bash scripting knowledge."

The steer I get is that the two recommended approaches are "Based on your preference for simplicity, cost-effectiveness, and availability in the eu-central-1 region, as well as your Python and bash scripting knowledge, using the AWS CLI or AWS Lambda with CloudWatch Events would be the most suitable options for exporting your CloudWatch log data programmatically on a daily schedule."

So at the end of this stage, I have used Amazon Q Developer to help me get some broader insights into potential approaches, used Amazon Q Developer to help me quickly experiment and validate some of those ideas, and then use some follow up questions to help me refine and end up with an approach that seems to make sense - lets build a new Lambda function!

Building the CloudWatch data export function with Amazon Q Developer

I had been thinking that I might end up writing a lambda function to do much of the work, so was happy with Amazon Q Developer confirming this was a good option was my first. The next stage was to get Amazon Q Developer to help me write the code.

I start off with the following prompt:

Amazon Q Developer prompt "I need to create a Python script that will process every CloudWatch log event. I want to display all events, but please skip/ignore any events that do have "eventType" in the Message."

Which provides me with my skeleton code, which initially does not do much other than dump all the GitHub data I have been recording in CloudWatch.

Output from Amazon Q Developer prompt

Reviewing this data was important as it reminded me that I needed to clean up the data. There was duplicated data, something that I had been manually fixing in my CloudWatch Insight queries, but that I would need to address now. Reviewing the data, it was clear that I only needed to capture CloudWatch log data that met a specific criteria, so I ask Amazon Q Developer to help me adjust the code.

Amazon Q Developer prompt How do I select only those entries that have the b' prefix from all event messages'

And it provides me a completely rewritten code,

# Check if the log event message contains the string "eventType" and doesn't start with b'
        if "eventType" in event['message'] and event['message'].startswith("b'"):
Enter fullscreen mode Exit fullscreen mode

When I re-run the script, I now have de-duped GitHub repo data.

1720019679366 {'eventType': 'View', 'repo_name': 'ragna', 'count': 1, 'uniques': 1}
1720019680304 {'eventType': 'Clone', 'repo_name': 'ragna-bedrock', 'count': 5, 'uniques': 4}
1720019683064 {'eventType': 'Clone', 'repo_name': 'sigstore-demo', 'count': 1, 'uniques': 1}
Enter fullscreen mode Exit fullscreen mode

The next thing I want to do is make the time stamp more human readable, as not everyone knows epoch time! We can ask Amazon Q Developer to help us work with this by asking:

Amazon Q Developer prompt Can you update the script so that the Timestamp value is converted to something more readable

Amazon Q Developer prompt that helps convert time stamp data

And when I re-run the script, we now get date and time in a format that is much easier to understand.

2024-07-03 16:14:39 {'eventType': 'View', 'repo_name': 'ragna', 'count': '1', 'uniques': '1'}
2024-07-03 16:14:40 {'eventType': 'Clone', 'repo_name': 'ragna-bedrock', 'count': '5', 'uniques': '4'}
2024-07-03 16:14:43 {'eventType': 'Clone', 'repo_name': 'sigstore-demo', 'count': '1', 'uniques': '1'}
Enter fullscreen mode Exit fullscreen mode

The next thing I want is for this data to be flattened, as I eventually want to export this data as a csv file and make that data more useful in other tools.

I ask Amazon Q Developer with the following prompt:

Amazon Q Developer prompt How can I update the code so that the ouput Timestamp: 2024-07-01 16:14:35, Message: {'eventType': 'View', 'repo_name': 'ragna', 'count': 1, 'uniques': 1} is flattened?

Amazon Q Developer provided additional code this time. It created a new function (called flatten_json) and then amended the code so that it would use this function.

    for event in response['events']:
        # Check if the log event message contains the string "eventType" and doesn't start with b'
        if "eventType" in event['message'] and event['message'].startswith("b'"):
            timestamp = datetime.fromtimestamp(event['timestamp'] / 1000.0)
            readable_timestamp = timestamp.strftime('%Y-%m-%d %H:%M:%S')
            clean_message = event['message'].replace("b'", '').replace("'", '')
            message_data = json.loads(clean_message)
            flattened_message = flatten_json(message_data)
Enter fullscreen mode Exit fullscreen mode

The final part of this script was to generate the csv file, so I ask Amazon Q Developer:

Amazon Q Developer prompt "update the code so that it outputs to a csv file, and the output is comma separated. Make sure that Clones and Views are exported to one csv file, and Referral is exported to another csv file. Use the date as a prefix for the files."

Amazon Q Developer provides more code that completes this more complex task.

# Open two CSV files for writing
timestamp = datetime.now().strftime('%Y%m%d')
view_clone_file = open(f"{timestamp}_repo_activity.csv", "w", newline='')
other_events_file = open(f"{timestamp}_repo_referal.csv", "w", newline='')

view_clone_writer = csv.writer(view_clone_file)
other_events_writer = csv.writer(other_events_file)

# Process the log events
for event in response['events']:
    # Check if the log event message contains the string "eventType" and doesn't start with b'
    if "eventType" in event['message'] and event['message'].startswith("b'"):
        timestamp = datetime.fromtimestamp(event['timestamp'] / 1000.0)
        readable_timestamp = timestamp.strftime('%Y-%m-%d %H:%M:%S')

        # Parse the JSON message and flatten it
        try:
            clean_message = event['message'].replace("b'", '').replace("'", '')
            message_data = json.loads(clean_message)
            flattened_message = flatten_json(message_data)

            # Write the event to the appropriate file based on the eventType
            if message_data['eventType'] in ['View', 'Clone']:
                csv_row = [readable_timestamp] + [f"{key}={value}" for key, value in flattened_message.items()]
                view_clone_writer.writerow(csv_row)
            else:
                csv_row = [readable_timestamp] + [f"{key}={value}" for key, value in flattened_message.items()]
                other_events_writer.writerow(csv_row)
        except json.JSONDecodeError:
            # Print the original message if it's not a valid JSON
            print(f"Timestamp: {readable_timestamp}, Message: {event['message']} (Not a valid JSON)")

# Close the CSV files
view_clone_file.close()
other_events_file.close()

Enter fullscreen mode Exit fullscreen mode

When I run the updated script, I am delighted that I now have two files, {date}_repo_activity.csv and {date}_repo_referal.csv. When I look at the file contents, it looks pretty awesome!

2024-07-10 16:14:35,eventType=View,repo_name=ragna,count=1,uniques=1
2024-07-10 16:14:38,eventType=Clone,repo_name=robotics-resources,count=1,uniques=1
2024-07-10 16:14:39,eventType=Clone,repo_name=til,count=1,uniques=1
Enter fullscreen mode Exit fullscreen mode

Thinking ahead, I realise that this data structure might be harder to work with, so I want to simplify by removing the key in the following code:

csv_row = [readable_timestamp] + [f"{key}={value}" for key, value in flattened_message.items()]
Enter fullscreen mode Exit fullscreen mode

to the followiing

csv_row = [readable_timestamp] + [f"{value}" for key, value in flattened_message.items()]
Enter fullscreen mode Exit fullscreen mode

When I rerun the script now, the output I get is cleaner.

2024-07-10 16:14:35,View,ragna,1,1
2024-07-10 16:14:38,Clone,robotics-resources,1,1
2024-07-10 16:14:39,Clone,til,1,1
Enter fullscreen mode Exit fullscreen mode

I realise that I want to also add one last feature for this script, to be able to upload it to an S3 bucket. I ask Amazon Q Developer:

Amazon Q Developer prompt "Update the script so that it looks for an environment variable called S3_TARGET and if found, it copies the csv files to this bucket"

Amazon Q Developer does not disappoint. After a short while, I get some additional code

s3_target_bucket = os.environ.get('S3_TARGET')
if s3_target_bucket:
    timestamp = datetime.now().strftime('%Y%m%d')
    view_clone_file_key = f"logs/activity/{timestamp}_repo_activity.csv"
    other_events_file_key = f"logs/referal/{timestamp}_repo_referal.csv"

    with open(f"{timestamp}_repo_activity.csv", "rb") as f:
        s3_client.upload_fileobj(f, s3_target_bucket, view_clone_file_key)

    with open(f"{timestamp}_repo_referal.csv", "rb") as f:
        s3_client.upload_fileobj(f, s3_target_bucket, other_events_file_key)
else:
    print("S3_TARGET environment variable is not set. Skipping file upload.")
Enter fullscreen mode Exit fullscreen mode

I try this with and without setting the S3_TARGET environment variable, and confirm that I now have my csv files in my S3 bucket.

Now that this works, I ask Amazon Q Developer how to alter this script so that I can deploy this as a lambda function, and it provides me with code that I am able to deploy.

Amazon Q Developer prompt "Convert this script so that it can run as a Lambda function"

Here is the completed code.

import boto3
from datetime import datetime
import json
import csv
import os
import io

def lambda_handler(event, context):
    def flatten_json(data, parent_key='', separator='.'):
        """
        Flatten a nested JSON data structure.
        """
        items = []
        for key, value in data.items():
            new_key = parent_key + separator + key if parent_key else key
            if isinstance(value, dict):
                items.extend(flatten_json(value, new_key, separator).items())
            elif isinstance(value, list):
                for idx, item in enumerate(value):
                    if isinstance(item, dict):
                        items.extend(flatten_json(item, new_key + separator + str(idx), separator).items())
                    else:
                        items.append((new_key + separator + str(idx), str(item)))
            else:
                items.append((new_key, str(value)))
        return dict(items)

    session = boto3.Session()
    logs_client = session.client('logs')
    s3_client = session.client('s3')

    # Specify the log group
    log_group_name = '/aws/lambda/github-traffic-cron'

    # Find the latest log stream in the log group
    response = logs_client.describe_log_streams(
        logGroupName=log_group_name,
        orderBy='LastEventTime',
        descending=True,
        limit=1
    )
    latest_log_stream_name = response['logStreams'][0]['logStreamName']

    response = logs_client.filter_log_events(
        logGroupName=log_group_name,
        logStreamNames=[latest_log_stream_name]
    )

    # Create in-memory CSV files
    view_clone_file = io.StringIO()
    other_events_file = io.StringIO()

    view_clone_writer = csv.writer(view_clone_file)
    other_events_writer = csv.writer(other_events_file)

    # Process the log events

    for event in response['events']:
        # Check if the log event message contains the string "eventType" and doesn't start with b'
        if "eventType" in event['message'] and event['message'].startswith("b'"):
            timestamp = datetime.fromtimestamp(event['timestamp'] / 1000.0)
            readable_timestamp = timestamp.strftime('%Y-%m-%d %H:%M:%S')

            # Parse the JSON message and flatten it
            try:
                clean_message = event['message'].replace("b'", '').replace("'", '')
                message_data = json.loads(clean_message)
                flattened_message = flatten_json(message_data)

                # Write the event to the appropriate file based on the eventType
                if message_data['eventType'] in ['View', 'Clone']:
                    csv_row = [readable_timestamp] + [f"{value}" for key, value in flattened_message.items()]
                    view_clone_writer.writerow(csv_row)
                else:
                    csv_row = [readable_timestamp] + [f"{value}" for key, value in flattened_message.items()]
                    other_events_writer.writerow(csv_row)
            except json.JSONDecodeError:
                # Print the original message if it's not a valid JSON
                print(f"Timestamp: {readable_timestamp}, Message: {event['message']} (Not a valid JSON)")


    # Get the CSV file contents as strings
    view_clone_file_contents = view_clone_file.getvalue()
    other_events_file_contents = other_events_file.getvalue()

    # Upload the CSV files to the S3 bucket
    s3_target_bucket = os.environ.get('S3_TARGET')
    if s3_target_bucket:
        timestamp = datetime.now().strftime('%Y%m%d')
        view_clone_file_key = f"logs/activity/{timestamp}_repo_activity.csv"
        other_events_file_key = f"logs/referal/{timestamp}_repo_referals.csv"

        s3_client.put_object(Body=view_clone_file_contents.encode('utf-8'), Bucket=s3_target_bucket, Key=view_clone_file_key)
        s3_client.put_object(Body=other_events_file_contents.encode('utf-8'), Bucket=s3_target_bucket, Key=other_events_file_key)
    else:
        print("S3_TARGET environment variable is not set. Skipping file upload.")

    return {
        'statusCode': 200,
        'body': 'CSV files processed and uploaded to S3 bucket.'
    }

Enter fullscreen mode Exit fullscreen mode

The only thing I need to do when I configure the lambda function is to 1/ Ensure that the timeout is set to around 30 seconds, 2/ set an Environment Variable for S3_TARGET, and 3/ Make sure that the lambda function execution role has permissions to both read CloudWatch logs and write data to the S3 bucket.

I schedule this script once day, to run at 11am, and I can see when I review my S3 bucket later, that the script has been scheduled and executed as expected.

stree  094459-oss-projects

094459-oss-projects
└── logs
    ├── activity
    │   └── 20240715_repo_activity.csv
    └── referal
        └── 20240715_repo_referals.csv
Enter fullscreen mode Exit fullscreen mode

Note! If you are not familiar with stree, it is a really cool open source project that I featured in #189 of my open source newsletter, and a tool I use almost every day - its awesome.

Extracting insights from my GitHub data

Now that I have sanitised and controlled data that is being uploaded to a defined S3 bucket on a daily basis, I can now begin the next step which is using AWS analytics and data services to help me extract insights from that data.

First of all I want to see what Amazon Q Developer might suggest, with the following prompt:

Amazon Q Developer prompt "I have an S3 bucket that contains daily uploads of this csv data. Which AWS services should I consider to be able to query and get insights from this data."

I kind of knew what to expect, as in my head I had already thought that Amazon Athena was the way to go, and sure enough Amazon Q Developer confirms my choice. The guidance you get will be determined by the context and additional information you provide. With the above prompt, I got some additional suggestions which did not make that much sense. I tweaked my prompt as follows:

Amazon Q Developer prompt "I have an S3 bucket that contains daily uploads of this csv data. Which AWS services should I consider to be able to query and get insights from this data. Provide recommendations that are simple to implement, cost effective, and serverless. The amount of processed data is very low, with a few thousand records being ingested daily. I need to ability to run SQL queries against the data, in an ad-hoc as well as planned way. I also need to make this data available to other AWS services"

This provided a much shorter list, with Amazon Athena as the top choice.

Output from Amazon Q Developer that shows Amazon Athena as a good option for my requirements

I have used Amazon Athena many times as part of various demos and presentations. I do not use it every day though, so I am a little rusty and need some time to get back in the groove. Luckily, Amazon Q Developer is just the perfect guide to help me with this, making it easy to remind me of what I need to do and get me up and running in no time.

I am going to ask Amazon Q Developer for a quick refresher on how to set up Amazon Athena against my data. The first thing I do is open up in VSCode one of the sample data files (in my case, the repo_activity.csv), and then ask the following prompt.

Amazon Q Developer prompt "Using the data structure in the repo_activity.csv, provide step by step instructions on how I can use Amazon Athena to run queries on this data that will allow me to sort by repo_name, identify the repos with the highest count, and provide summarised views of total number of clones by a given date. The data files are stored in an S3 bucket called 094459-oss-projects"

The output was good, I just needed to adjust the S3 bucket details.

Amazon Q output from prompt

When we now run that in Amazon Athena, we can see that the query runs ok.

Running the code from Amazon Q Developer in Athena

When we preview the data, we can see that our GitHub data is now available for us to run queries against.

Preview data from Amazon Athena

The original output from the prompt also provided me with some sample queries to help me look at the data. The first one was to provide a list of repos by name, which looked very much like the preview data, so not that interesting. It then provided a query to identify repos with the highest count.

SELECT repo_name, SUM(count) AS total_count
FROM repo_activity
GROUP BY repo_name
ORDER BY total_count DESC;
Enter fullscreen mode Exit fullscreen mode

Which generates the expected output

Amazon Athena query output to show output by repo

The next query was to find out the total clones by date, and the suggested query

SELECT date(event_time) AS event_date, SUM(count) AS total_clones
FROM repo_activity
WHERE event_type = 'Clone'
GROUP BY event_date
ORDER BY event_date;
Enter fullscreen mode Exit fullscreen mode

generated an error "COLUMN_NOT_FOUND: line 4:10: Column 'event_date' cannot be resolved or requester is not authorized to access requested resources"

No problems, I can use Amazon Q Developer to help me figure out how to fix this. This is the prompt I use:

Amazon Q Developer prompt "When I run the summarise total clones by date query, it generates this error "COLUMN_NOT_FOUND: line 4:10: Column 'event_date' cannot be resolved or requester is not authorized to access requested resources This query ran against the "default" database, unless qualified by the query. " How do I fix this"

It suggests a new query

SELECT event_date, SUM(total_clones) AS total_clones
FROM (
  SELECT DATE(event_time) AS event_date, count AS total_clones
  FROM repo_activity
  WHERE event_type = 'Clone'
)
GROUP BY event_date
ORDER BY event_date;
Enter fullscreen mode Exit fullscreen mode

This also fails, but with a different error. Again I turn to Amazon Q Developer to help with. This is the prompt I use

Amazon Q Developer prompt "This generates a different error "INVALID_CAST_ARGUMENT: Value cannot be cast to date: 2024-07-15 15:13:56"

This time Amazon Q provides a more detailed response, together with an updated suggestion. In fact, it provides a couple of suggestions as it has quite rightly determined that the date format is in fact date and time. The new query:

SELECT CAST(PARSE_DATETIME(event_time, 'yyyy-MM-dd HH:mm:ss') AS DATE) AS event_date, SUM(count) AS total_clones
FROM repo_activity
WHERE event_type = 'Clone'
GROUP BY CAST(PARSE_DATETIME(event_time, 'yyyy-MM-dd HH:mm:ss') AS DATE)
ORDER BY CAST(PARSE_DATETIME(event_time, 'yyyy-MM-dd HH:mm:ss') AS DATE);
Enter fullscreen mode Exit fullscreen mode

runs perfectly, and provides me with exactly what I wanted to know.

Output of query in Amazon Athena for total number of clones by date

As you can see, the output from Amazon Q Developer is not always perfect, but by combining the initial output, with oversight of what I am trying to do, as well as using the interactive nature of the chat interface, you can quickly get resolution of the errors you come across.

We are nearly finished, but there is one more thing I need to do. I need to ask Amazon Q Developer is what happens or what do I need to do as I add more data to the S3 bucket (or rather, when the scheduled Lambda function does that).

Amazon Q Developer prompt "When more data gets added to my S3 bucket, how do I make sure that Athena will use all the new data that has been added"

Amazon Q Developer provides me with a nice summary an options. The one I need however, is

MSCK REPAIR TABLE repo_activity;
Enter fullscreen mode Exit fullscreen mode

I wait for the next day so that there will be more data in the S3 bucket before running this query. It takes a few seconds to run,

Running msck repair

but when I then re-run the queries from before, I can now see that I have two days worth of data.

updated query output
updated query output from a different query

You might be thinking, how do you automate that (or was it just me!). To find out my options, lets ask Amazon Q Developer

Amazon Q Developer prompt "What is the best way of running MSCK REPAIR on a daily basis so that the indexes are always up to date"

And it provides a couple of options, the one that I like most is running a daily lambda function and it provides some code you can use.

I think that pretty much is where I want to leave this. I have only spent a couple of hours doing this across a couple of days. Time very well spent I think.

What next

This has been a fun experiment to see how Amazon Q Developer could help me with an area that I had a good understand and grip of, but needed some hand holding in the details and implementation. I certainly feel that using Amazon Q Developer in my data related tasks is going to be a major productivity boost, but also I think, going to help me explore and experiment more.

Stay tuned for further adventures of data with Amazon Q Developer. Who knew working with data could be so much fun!

If you want to learn more, check out the Amazon Q Developer Dev Centre which contains lots of resources to help get you started. You can also get more Amazon Q Developer related content by checking out the dedicated Amazon Q space, and if you are looking for more hands on content, take a look at my interactive tutorial that helps you get started with Amazon Q Developer.

💖 💪 🙅 🚩
094459
Ricardo Sueiras

Posted on July 17, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related