More storage for media: organize files using ChatGPT: part 1, data preparation
hexfloor
Posted on October 29, 2024
Age of AI
Given the rise of ChatGPT and other generative AI tools, I have to reinvent my posts to keep them relevant, so a short introduction is in order. From now on my posts will be less technical and more functional. In our conversation, ChatGPT (which I will refer to with the pronoun "it") granted me permission to quote our exchange in this blog and to reference ChatGPT as "it".
Introduction
Let's consider a common problem: we wish to optimize the storage of media files.
Now, let's imagine we have made a dump of photos and videos grouped by year, in different formats, and that we have a magic tool that can convert them to the format of our choice.
However, we first need to do some preparation work: group the files by type instead of by year. How can we achieve this using ChatGPT?
First steps
Here is our dump of media:
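For illustration, assuming a Google Photos Takeout-style export, the layout might look like this (the years are hypothetical):
$ ls
'Photos from 2019'  'Photos from 2020'  'Photos from 2021'  'Photos from 2022'  'Photos from 2023'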
Let's get some input from ChatGPT:
windows using git bash and linux commands, I wish to rename the folders in the current dir having name 'Photos from ' simply to year
for dir in Photos\ from\ *; do
    if [[ -d "$dir" ]]; then
        year="${dir##* }"   # Extract the year from the folder name
        mv "$dir" "$year"   # Rename the folder
    fi
done
Result:
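With the hypothetical layout above, the folders are now named by year alone:
$ ls
2019  2020  2021  2022  2023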
Great, let's move on!
Metadata: data preparation
The very first thing you should consider is fixing the metadata, which in the narrow sense means that each file should carry the Date taken attribute for image files and Media created for video files. Ideally, the file name should contain:
- a formatted timestamp in the form date_time, like 20240101_000000 or even longer
- an origin discriminator, anything that helps you classify the file; when compressing, this could be the original extension
- a counter with as many digits as necessary to avoid collisions during any compression job
Some modern phones do create a speaking file id, some do not. In any case, for old files it may happen that the date is present only in an adjacent JSON file, whose name usually follows the pattern <filename_with_extension>{<.suffix_of_type_metadata>}.json. Hence, before any processing it is wise to go through all the images and videos and update the time attributes, for example with exiftool for images and ffmpeg for videos. It is also a good idea to copy each processed file to a new directory under a name prefixed with the formatted date, and to move the files that could not be processed to another folder to fix up manually.
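For example, a Google Photos Takeout sidecar typically carries the timestamp and description used by the scripts below; a hypothetical sample (the field values are made up):
{
  "title": "img_20240101_000000.jpg",
  "description": "New Year fireworks",
  "photoTakenTime": {
    "timestamp": "1704067200",
    "formatted": "Jan 1, 2024, 12:00:00 AM UTC"
  }
}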
Overall the idea could be the following:
- list file extensions in all the folders to evaluate the necessary work
ls -R ./ | awk -F. '/\./ {print $NF}' | sort -u
- prepare the next step, which will be to convert all the file names to lower case to simplify further operations; prior to this, we should check for case-insensitive duplicates
#!/bin/bash
# Find all subdirectories and process files in each subdirectory separately
find . -type d | while read -r dir; do
    # Find files in the current directory
    find "$dir" -maxdepth 1 -type f | \
        # Remove the path, leaving only the filename
        sed 's/.*\///' | \
        # Convert filenames to lowercase for case-insensitive comparison
        tr '[:upper:]' '[:lower:]' | \
        # Sort the filenames
        sort | \
        # Find duplicates in the sorted list
        uniq -d | \
        # Print the duplicates with their directory path
        while read -r filename; do
            echo "Duplicates in '$dir':"
            find "$dir" -maxdepth 1 -type f -iname "$filename"
        done
done
- convert all the file names to lower case:
#!/bin/bash
# Find all files recursively
find ./ -type f | while read -r file; do
    dir=$(dirname "$file")
    base=$(basename "$file")
    # Convert the filename to lowercase
    lower_base=$(echo "$base" | tr '[:upper:]' '[:lower:]')
    # If the filename is different (case insensitive), do the two-step rename
    if [[ "$base" != "$lower_base" ]]; then
        # Step 1: Rename to an intermediate name (tmp_<lowercase filename>)
        mv "$file" "$dir/tmp_$lower_base"
        # Step 2: Rename to the final lowercase name (removing tmp_ prefix)
        mv "$dir/tmp_$lower_base" "$dir/$lower_base"
    fi
done
At this point you are ready to start fixing the metadata.
Metadata: updating images using data from JSONs
The bare minimum is to ensure that the Date taken attribute is correct, and quite often this date is not set in the image itself but stored separately in the JSON. Here is a script you may use to sort jpg images into two folders: one containing images with accurate metadata and another containing images that need further review and are therefore prefixed with their origin folder. I will use exiftool for this purpose and add a formatted date prefix to each file that has relevant metadata stored in the JSON.
#!/bin/bash
# Hardcoded input directory and output directories
INPUT_DIR="./input"                  # Input directory set to ./input
OUTPUT_DIR_JPG="./jpg"               # Directory for modified JPGs
OUTPUT_DIR_JPG_TO_FIX="./jpg_to_fix" # Directory for JPGs needing fixing

# Create output directories if they don't exist
mkdir -p "$OUTPUT_DIR_JPG"
mkdir -p "$OUTPUT_DIR_JPG_TO_FIX"

# Iterate through all JPG files in the input directory and subdirectories
find "$INPUT_DIR" -type f -iname "*.jpg" | while read -r jpg_file; do
    # Get the base name of the JPG file without the extension
    base_name=$(basename "$jpg_file" .jpg)
    # Look for the corresponding JSON file that starts with base_name and ends with .json
    json_file=$(find "$(dirname "$jpg_file")" -type f -iname "${base_name}.jpg*.json" | head -n 1)
    # Check if the corresponding JSON file exists
    if [[ -f "$json_file" ]]; then
        # Extract the creation time and description from the JSON file
        creation_time=$(jq -r '.photoTakenTime.timestamp' "$json_file")
        description=$(jq -r '.description' "$json_file")
        # Format the creation time for ExifTool in the standard EXIF form (assuming it's a Unix timestamp)
        if [[ "$creation_time" =~ ^-?[0-9]+$ ]]; then
            formatted_date=$(date -d @"$creation_time" +"%Y:%m:%d %H:%M:%S" 2>/dev/null)
            if [[ $? -ne 0 ]]; then
                formatted_date=""
            fi
        else
            formatted_date=""
        fi
        # **Always update** EXIF data based on JSON content (even if it already exists in the JPG)
        if [[ -n "$formatted_date" ]]; then
            exiftool -overwrite_original -DateTimeOriginal="$formatted_date" "$jpg_file" >/dev/null 2>&1
        fi
        if [[ -n "$description" ]]; then
            exiftool -overwrite_original -Description="$description" "$jpg_file" >/dev/null 2>&1
        fi
    else
        # If the JSON file is not found, notify but continue to check Date taken
        echo "JSON file not found for: $jpg_file"
    fi

    # Now check if Date taken is set in the JPG file
    date_taken=$(exiftool -DateTimeOriginal -s -s -s "$jpg_file")
    # Construct the new filename based on the Date taken
    if [[ -n "$date_taken" ]]; then
        # Make the date_taken string (YYYY:MM:DD HH:MM:SS) filename-safe
        formatted_date=$(echo "$date_taken" | sed -e 's/://g' -e 's/ /_/g') # Drop colons, replace the space with an underscore
        safe_date_taken="${formatted_date:0:8}_${formatted_date:9:6}"       # Separate date and time
        new_filename="${OUTPUT_DIR_JPG}/${safe_date_taken}_$(basename "$jpg_file")" # New filename based on formatted date
        cp "$jpg_file" "$new_filename"
    else
        # Get the last directory name if Date taken is not set
        last_dir=$(basename "$(dirname "$jpg_file")")
        new_filename="${OUTPUT_DIR_JPG_TO_FIX}/${last_dir}_$(basename "$jpg_file")"
        cp "$jpg_file" "$new_filename"
    fi
done

echo "Processing complete."
Feel free to adjust the logic for different file types using ChatGPT
or another Generative AI of your choice.
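For example, the same sidecar lookup can be reused for other image extensions. A minimal sketch follows; the extension list and the ./input path are assumptions, and the loop body is reduced to a dry run that only reports which sidecar would be used:
#!/bin/bash
# Hypothetical sketch: dry run of the JSON sidecar lookup for several image extensions.
# Replace the echo with the exiftool/copy logic from the script above.
for ext in jpg jpeg png; do
    find "./input" -type f -iname "*.${ext}" | while read -r img_file; do
        # Look for the matching JSON sidecar next to the image
        json_file=$(find "$(dirname "$img_file")" -type f -iname "$(basename "$img_file")*.json" | head -n 1)
        echo "$img_file -> ${json_file:-no sidecar found}"
    done
done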
Metadata: updating videos using data from JSONs
The same trick can be used for video files with ffmpeg; however, this time it is a bit more involved, as ffmpeg must write a new copy of the file:
#!/bin/bash
# Hardcoded input directory and output directories
INPUT_DIR="./input"                  # Input directory set to ./input
OUTPUT_DIR_MP4="./mp4"               # Directory for modified MP4s
OUTPUT_DIR_MP4_TO_FIX="./mp4_to_fix" # Directory for MP4s needing fixing

# Create output directories if they don't exist
mkdir -p "$OUTPUT_DIR_MP4"
mkdir -p "$OUTPUT_DIR_MP4_TO_FIX"

# Iterate through all MP4 files in the input directory and subdirectories
find "$INPUT_DIR" -type f -iname "*.mp4" | while read -r mp4_file; do
    # Get the base name of the MP4 file without the extension
    base_name=$(basename "$mp4_file" .mp4)
    # Look for the corresponding JSON file that starts with base_name and ends with .json
    json_file=$(find "$(dirname "$mp4_file")" -type f -iname "${base_name}.mp4*.json" | head -n 1)
    # Initialize the new filename variable
    new_filename=""
    # Check if the corresponding JSON file exists
    if [[ -f "$json_file" ]]; then
        echo "JSON file IS found for: $mp4_file"
        # Extract the creation time from the JSON file
        creation_time=$(jq -r '.photoTakenTime.timestamp' "$json_file")
        # Format the creation time for ffmpeg (assuming it's a Unix timestamp)
        if [[ "$creation_time" =~ ^-?[0-9]+$ ]]; then
            # Convert the Unix timestamp to "YYYY-MM-DD HH:MM:SS" format
            formatted_date=$(date -d @"$creation_time" +"%Y-%m-%d %H:%M:%S" 2>/dev/null)
            if [[ $? -ne 0 ]]; then
                formatted_date=""
            fi
        else
            formatted_date=""
        fi
        # **Always update** MP4 metadata based on JSON content (creation_time)
        if [[ -n "$formatted_date" ]]; then
            # Build a filename-safe date prefix in the form YYYYMMDD_HHMMSS
            filename_date=$(echo "$formatted_date" | tr ' ' '_' | tr -d '-') # Replace the space with an underscore, drop the dashes
            filename_date="${filename_date//:/}"                             # Drop the colons for filename safety
            # Construct the new filename based on the formatted date
            new_filename="${OUTPUT_DIR_MP4}/${filename_date}_$(basename "$mp4_file")"
            # Debugging: log the new filename
            echo "New filename (with date): $new_filename"
            # Update the creation time in the MP4 metadata and copy it to the new location in one step
            # (-nostdin keeps ffmpeg from consuming the file list fed to the while loop)
            ffmpeg -nostdin -i "$mp4_file" -c copy -metadata creation_time="$formatted_date" "$new_filename"
        fi
    fi
    # If we did not successfully update the file, copy it to the 'mp4_to_fix' directory
    if [[ -z "$new_filename" ]]; then
        echo "JSON file not found or timestamp invalid. Moving to mp4_to_fix: $mp4_file"
        # Get the last directory name (in case the file is being moved to mp4_to_fix)
        last_dir=$(basename "$(dirname "$mp4_file")")
        new_filename="${OUTPUT_DIR_MP4_TO_FIX}/${last_dir}_$(basename "$mp4_file")"
        # Debugging: log the action taken (moving to mp4_to_fix)
        echo "Moving to $OUTPUT_DIR_MP4_TO_FIX: $new_filename"
        # Copy the file to the mp4_to_fix directory
        cp "$mp4_file" "$new_filename"
    fi
done

echo "Processing complete."
At this point we are done with the metadata and can move on.
ID: using a proper id
It may happen that your files come from different sources and have completely different ids. Now that your files are prefixed with a date in the format 20000101_000000, you may drop the previous filename and use an identifier of your choice plus an ad-hoc counter. Here is an example that transforms all ids to the format 20000101_000000_jpg_0001.jpg:
#!/bin/bash
# Set the input and output directories
input_dir="./input"
output_dir="./output"

# Make sure the output directory exists, create it if it doesn't
mkdir -p "$output_dir"

# Counter variable, starting at 1
counter=1

# Loop through sorted .jpg files, handling files with spaces correctly
find "$input_dir" -type f -name "*.jpg" | sort | while IFS= read -r file; do
    # Extract the date prefix of the filename (the first two underscore-separated fields)
    base_name=$(basename "$file")
    prefix=$(echo "$base_name" | cut -d'_' -f1-2) # Extract the "20000101_000000" part
    # Build the new filename with a 4-digit counter and a "_jpg_" discriminator
    new_filename=$(printf "%s_jpg_%04d.jpg" "$prefix" "$counter")
    # Copy the file to the output directory with the new filename
    cp "$file" "$output_dir/$new_filename"
    # Increment the counter
    ((counter++))
done

echo "Files renamed and copied to $output_dir"
Compress images
Using ImageMagick from part 2, you may convert all the images to the format of your choice:
#!/bin/bash
# Input and Output directories
input_dir="./input"
output_dir="./output"

# Ensure the output directory exists
mkdir -p "$output_dir"

# Loop through all jpg files in the input directory
for img in "$input_dir"/*.jpg; do
    # Get the image dimensions (width x height)
    dimensions=$(identify -format "%wx%h" "$img")
    width=$(echo "$dimensions" | cut -d'x' -f1)
    height=$(echo "$dimensions" | cut -d'x' -f2)

    # Check for vertical images (height >= width)
    if [ "$height" -ge "$width" ]; then
        if [ "$height" -gt 1280 ]; then
            # Vertical image with height > 1280, resize to height 1280
            output_file="$output_dir/$(basename "$img" .jpg).heic"
            magick "$img" -resize x1280 -quality 80 "$output_file"
            echo "Resized and converted $img to $output_file"
        else
            # Vertical image with height <= 1280, just convert to HEIC
            output_file="$output_dir/$(basename "$img" .jpg).heic"
            magick "$img" "$output_file"
            echo "Converted $img to $output_file"
        fi
    elif [ "$width" -gt "$height" ]; then
        if [ "$width" -gt 1280 ]; then
            # Landscape image with width > 1280, resize to width 1280
            output_file="$output_dir/$(basename "$img" .jpg).heic"
            magick "$img" -resize 1280x -quality 80 "$output_file"
            echo "Resized and converted $img to $output_file"
        else
            # Landscape image with width <= 1280, just convert to HEIC
            output_file="$output_dir/$(basename "$img" .jpg).heic"
            magick "$img" "$output_file"
            echo "Converted $img to $output_file"
        fi
    else
        echo "Skipping $img: does not fit any criteria"
    fi
done
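To get a rough idea of the space saved, a quick size comparison of the two directories is enough (assuming the ./input and ./output paths from the script above):
du -sh ./input ./output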
Compress videos
Similar logic can be applied to the videos using ffmpeg from part 3:
#!/bin/bash
# Hardcoded input directory and output directory
INPUT_DIR="./input"    # Input directory set to ./input
OUTPUT_DIR="./output"  # Output directory for converted MP4s

# Create the output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Iterate through all MP4 files in the input directory and subdirectories
find "$INPUT_DIR" -type f -iname "*.mp4" | while read -r mp4_file; do
    # Get the base name of the MP4 file without the extension
    base_name=$(basename "$mp4_file" .mp4)

    # Get video dimensions (width and height) using ffprobe
    dimensions=$(ffprobe -v error -select_streams v:0 -show_entries stream=width,height -of csv=s=x:p=0 "$mp4_file")
    width=$(echo "$dimensions" | cut -d 'x' -f 1)
    height=$(echo "$dimensions" | cut -d 'x' -f 2)

    # Debugging: log the dimensions
    echo "Dimensions of $mp4_file: Width=$width, Height=$height"

    # Initialize the new filename variable
    new_filename="${OUTPUT_DIR}/${base_name}_converted.mp4"

    # Check if rescaling is needed and pick the appropriate scale
    if [[ "$height" -ge "$width" && "$height" -gt 1280 ]]; then
        # If height >= width and height > 1280, rescale to -2:1280
        scale="-2:1280"
        echo "Rescaling $mp4_file to $scale"
    elif [[ "$width" -gt "$height" && "$width" -gt 1280 ]]; then
        # If width > height and width > 1280, rescale to 1280:-2
        scale="1280:-2"
        echo "Rescaling $mp4_file to $scale"
    else
        # No scaling needed
        scale=""
        echo "No rescaling needed for $mp4_file"
    fi

    # Run ffmpeg with or without scaling, based on the conditions
    # (-nostdin keeps ffmpeg from consuming the file list fed to the while loop,
    #  -map_metadata 0 keeps the global metadata, including the creation_time fixed earlier)
    ffmpeg -nostdin -i "$mp4_file" \
        -r 30 -c:v libx265 -crf 28 -preset medium \
        -c:a aac -b:a 192k \
        -map_metadata 0 \
        ${scale:+-vf "scale=$scale"} \
        "$new_filename"
done

echo "Processing complete."
Summary
Overall, if a format has limited support for metadata, png for example, it can be a good idea to encode the date into the filename so that it remains available to an eventual conversion tool.
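As a minimal sketch, the date can be baked into png filenames like this (assuming the file's modification time is an acceptable proxy for the capture date, which may not hold for every dump):
#!/bin/bash
# Hypothetical sketch: prefix png filenames with a formatted date so the
# timestamp survives conversions that drop metadata.
# Assumption: the file modification time approximates the capture date.
input_dir="./input"
output_dir="./output"
mkdir -p "$output_dir"
find "$input_dir" -type f -iname "*.png" | while read -r png_file; do
    # Read the modification time as YYYYMMDD_HHMMSS
    formatted_date=$(date -d @"$(stat -c %Y "$png_file")" +"%Y%m%d_%H%M%S")
    cp "$png_file" "$output_dir/${formatted_date}_$(basename "$png_file")"
done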