Rimpal Johal
Posted on March 28, 2023
Amazon Simple Storage Service (S3) is a popular cloud storage service that provides scalable and secure object storage for various types of data. Managing S3 objects can be a daunting task, especially when dealing with large datasets. In this blog post, we'll explore how to use Python and the boto3 library to manage S3 objects and list the objects created after a specific date and time.
To get started, we'll need to install the boto3 library, which is the Amazon Web Services (AWS) SDK for Python. Once installed, we can create an S3 client and set the region name and bucket name that we want to work with. We'll also set the timezone to Melbourne, Australia, using the pytz library.
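If you don't have the libraries yet, both are available on PyPI and can be installed with pip:

pip install boto3 pytz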
Next, we'll set the start date for listing the S3 objects. In this example, we'll set the start date to March 5, 2023, at 12:00 AM in Melbourne time. We'll use the datetime module together with the pytz library to build a timezone-aware start date.
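One pytz subtlety is worth calling out here: passing a pytz timezone straight into the datetime constructor via tzinfo attaches Melbourne's historical local-mean-time offset rather than AEDT/AEST, so localize() is the safer way to build the start date. A minimal sketch:

import datetime
import pytz

tz = pytz.timezone('Australia/Melbourne')
# localize() applies the correct AEDT offset (+11:00) for this date
start_date = tz.localize(datetime.datetime(2023, 3, 5))
print(start_date)  # 2023-03-05 00:00:00+11:00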
We'll then use a paginator to iterate over all objects in the S3 bucket and keep only the objects that were created on or after the specified start date. Because the list_objects_v2 API can't filter listings by date on the server side, the filtering happens client-side on each object's LastModified timestamp; the paginator transparently handles S3's limit of 1,000 keys per response, so the script works on buckets of any size.
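In isolation, the pagination pattern looks like this (a sketch, assuming the s3 client, bucket_name, and prefix defined in the full script below; page.get('Contents', []) is an equivalent way to handle pages with no objects):

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
    # Each page holds up to 1,000 keys under the 'Contents' key
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['LastModified'])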
We'll convert the UTC LastModified time of each object to Melbourne timezone using the astimezone() method and format it as a string that includes the timezone name and offset. We'll also convert the size of each object from bytes to megabytes and store the filtered objects in a list.
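To see the conversion and formatting on their own, here's a small sketch using a hypothetical LastModified value (boto3 returns these as UTC-aware datetimes):

import datetime
import pytz

tz = pytz.timezone('Australia/Melbourne')
# A hypothetical LastModified value, UTC-aware as boto3 returns it
last_modified = datetime.datetime(2023, 3, 10, 2, 30, tzinfo=pytz.utc)
melbourne_time = last_modified.astimezone(tz)
print(melbourne_time.strftime('%Y-%m-%d %H:%M:%S %Z%z'))
# 2023-03-10 13:30:00 AEDT+1100

The size conversion is simple division: round(obj['Size'] / (1024 * 1024), 2) turns, say, 4580000 bytes into 4.37 MB.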
Finally, we'll print the list of objects with the pprint module and write it to a file named 's3_objects.txt' using the built-in open() function. The output file will include each object's key, LastModified time, and size in megabytes.
Here's the complete Python script:
import boto3
import datetime
import pprint
import pytz
# Create an S3 client
s3 = boto3.client('s3', region_name='ap-southeast-2')
# Set the S3 bucket name
bucket_name = 'myuats3bucket'
tz = pytz.timezone('Australia/Melbourne')
# Set the start date for listing objects
# tz.localize() applies the correct AEDT/AEST offset; passing tzinfo=tz
# directly would pick up Melbourne's historical LMT offset instead
start_date = tz.localize(datetime.datetime(2023, 3, 5))
print(start_date)
# Set the prefix for filtering objects
prefix = ''
# Initialize the list of objects
object_list = []
# Use a paginator to iterate over all objects in the bucket
paginator = s3.get_paginator('list_objects_v2')
try:
    # Note: StartAfter takes an object key, not a timestamp, so S3 can't
    # filter by date server-side; we filter on LastModified client-side below
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # If the page contains no objects, stop iterating
        if 'Contents' not in page:
            break
        # Iterate over each object in the current page of results
        for obj in page['Contents']:
            # Keep the object only if it was created on or after the start date
            if obj['LastModified'] >= start_date:
                # Convert the UTC LastModified time to Melbourne time
                melbourne_time = obj['LastModified'].astimezone(tz)
                obj['LastModified'] = melbourne_time.strftime('%Y-%m-%d %H:%M:%S %Z%z')
                # Convert the size from bytes to MB
                obj['Size'] = round(obj['Size'] / (1024 * 1024), 2)
                object_list.append(obj)
except Exception as e:
    print("Error:", e)
# Print the list of objects
pprint.pprint(object_list)
# Open the file for writing
with open('s3_objects.txt', 'w') as f:
    # Write one formatted line per object
    for obj in object_list:
        f.write(f"Key: {obj['Key']}, Last Modified: {obj['LastModified']}, Size: {obj['Size']} MB\n")
In summary, using Python and the boto3 library to manage Amazon S3 objects is a powerful way to automate and streamline your data management workflows. With the ability to filter and format S3 objects based on specific criteria, you can save time and reduce errors in your data processing pipeline. Whether you're dealing with small or large datasets, this approach can help you manage your S3 objects more efficiently.
Thank you for reading this blog post, and we hope that you found it helpful. If you have any questions or feedback, please feel free to leave a comment below.