Recovering Accidentally Transitioned S3 Files from the Glacier Tier

TomXiaoYZ · Posted on October 8, 2023

Are you stuck with S3 files that were accidentally moved to the Glacier storage tier? Do you need an efficient yet simple way to restore them? If yes, then this guide is for you.

Amazon S3 is a scalable object storage service, while Amazon S3 Glacier is a secure, durable, and low-cost storage class for data archiving and long-term backup. Sometimes files end up in Glacier storage intentionally, for cost-saving purposes, and sometimes unintentionally, for example through a lifecycle rule that matched more objects than intended.

Regardless, the restoration process can be daunting. This guide will help you simplify this process with Python and AWS’s boto3 library.

Below is the Python script that will get the job done:

import os
import re
import time
import traceback

import boto3
from botocore.exceptions import ClientError
from loguru import logger

ACK = "Your Access Key"
ACS = "Your Secret Key"

bucket_name = "Your Bucket Name"
s3_remote_dir = "Path to Your S3 Directory"

s3 = boto3.client('s3', aws_access_key_id=ACK, aws_secret_access_key=ACS)

# local path used to stage downloads before re-uploading
local_save_path = './temp/'
os.makedirs(local_save_path, exist_ok=True)  # make sure the staging directory exists


def _get_all_s3_objects(**base_kwargs):
    """
    Yield every object under s3_remote_dir, following list_objects_v2 pagination.
    """
    try:
        continuation_token = None
        while True:
            list_kwargs = dict(MaxKeys=1000, **base_kwargs)
            if continuation_token:
                list_kwargs['ContinuationToken'] = continuation_token
            response = s3.list_objects_v2(**list_kwargs)
            yield from response.get('Contents', [])
            if not response.get('IsTruncated'):  # at the end of the list?
                break
            continuation_token = response.get('NextContinuationToken')
    except Exception:
        logger.error(traceback.format_exc())


def head_object(bucket_name, object_name):
    """Return an object's metadata (including its restore status) without downloading it."""
    try:
        response = s3.head_object(Bucket=bucket_name, Key=object_name)
    except ClientError as e:
        logger.error(e)
        logger.error(
            f"NoSuchBucket, NoSuchKey, or InvalidObjectState: the object may not exist "
            f"or may not be in the GLACIER storage class. {bucket_name} {object_name}")
        return None
    return response


# restore objects from the Glacier tier
def restore_object(bucket_name, object_name, days, retrieval_type='Expedited'):
    """Kick off a restore job that keeps a temporary copy of the object for `days` days."""
    request = {'Days': days,
               'GlacierJobParameters': {'Tier': retrieval_type}}
    try:
        s3.restore_object(Bucket=bucket_name, Key=object_name, RestoreRequest=request)
    except ClientError as e:
        logger.error(e)
        logger.error(
            f"NoSuchBucket, NoSuchKey, or InvalidObjectState: the object may not exist "
            f"or may not be in the GLACIER storage class. {bucket_name} {object_name}")
        return False
    return True


while True:
    doing_count, done_count, need_count = 0, 0, 0

    s3_objects = _get_all_s3_objects(Bucket=bucket_name, Prefix=s3_remote_dir)
    for obj in s3_objects:
        key = obj.get('Key', None)
        file_name = key.split("/")[-1]
        # In Hive, files restored from the Glacier tier land in sub-directories named
        # 'HIVE_UNION_SUBDIR_*'; strip that segment so the copy goes back to the right place.
        to_key = re.sub(r"/HIVE_UNION_SUBDIR_[\d]+/", "/", key)
        print(key, to_key, file_name)
        head_resp = head_object(bucket_name, key)
        need_count += 1

        if head_resp:
            if head_resp.get('Restore'):
                print('Restore {}'.format(head_resp['Restore']))
                index = head_resp['Restore'].find('ongoing-request="false"')
                if -1 == index:
                    print(f"{need_count} Restore in progress... {key}")
                    doing_count += 1
                else:
                    print(head_resp['Restore'][head_resp['Restore'].find('expiry-date='):])
                    print(f"{need_count} Restore succeeded... {key}")
                    if -1 != file_name.find('HIVE_UNION_SUBDIR'):
                        print(f"no need to download... {key}")
                    else:
                        s3.download_file(bucket_name, key, local_save_path + file_name)
                        s3.upload_file(local_save_path + file_name, bucket_name, to_key)
                    done_count += 1
            else:
                print(f'{need_count} needs to be restored... {key}')
                restore_object(bucket_name, key, 10)
        print(doing_count, done_count, need_count)

    if done_count == need_count:
        break
    time.sleep(15)

Understanding the Code
The solution uses the boto3 library to talk to Amazon S3; objects in the Glacier storage class are restored through S3's own restore_object API, so no separate Glacier client is needed.

The _get_all_s3_objects function lists every object under the specified bucket and prefix, transparently following list_objects_v2 pagination (S3 returns at most 1,000 keys per call).
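If you prefer, boto3's built-in paginator does the same job with less code. A minimal sketch, assuming the same bucket_name and s3_remote_dir placeholders used in the script:

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name, Prefix=s3_remote_dir):
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['StorageClass'])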

The head_object function retrieves an object's metadata without downloading the object itself. This is useful for checking that the file exists and, via the Restore field in the response, whether a restore is still in progress or has already completed.
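For example, once a restore has finished, the head_object response typically carries a Restore value along the lines of ongoing-request="false", expiry-date="...". The key below is hypothetical, just to illustrate the check:

resp = head_object(bucket_name, "your/prefix/part-00000")  # hypothetical key
if resp and 'ongoing-request="false"' in resp.get('Restore', ''):
    print("temporary copy is ready to download")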

The restore_object function is the heart of the script. It initiates a job to restore a file from Glacier to S3. The Days parameter specifies the lifetime of the temporary copy of the object in the S3 bucket, while GlacierJobParameters sets the speed (tier) of the restoration process.
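Expedited is the fastest tier but also the most expensive, and it is not available for objects archived in Glacier Deep Archive; Standard and Bulk are slower but cheaper. If speed is not critical, you could request a Bulk restore instead (again, the key below is hypothetical):

restore_object(bucket_name, "your/prefix/part-00000", days=10, retrieval_type='Bulk')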

The script then loops through the objects under the specified prefix. For each object, it uses head_object to check whether a restore operation is already in progress or has completed. If a restore is in progress, it leaves the object alone. If the restore is complete, it downloads the file locally and re-uploads it to the corrected key (with the HIVE_UNION_SUBDIR_* path segment stripped out). If the file hasn't been restored at all, it submits a restore request.

The outer loop sleeps for 15 seconds between passes and repeats until every file has been restored.

In Conclusion
The script provides a way to automate the process of restoring files from the Glacier storage tier back to S3. It can be a lifesaver if you’ve got hundreds or even thousands of files to restore.

Remember to replace the placeholders in the script with your actual AWS credentials, bucket names, and file paths. Happy restoring!
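One small improvement worth considering: instead of hardcoding ACK and ACS, you can let boto3 pick up credentials from environment variables, ~/.aws/credentials, or an instance role, in which case the client creation simplifies to:

s3 = boto3.client('s3')  # relies on boto3's default credential chain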
