🦊 ChatGPT, If You Please, Make Me a GitLab Jobs YAML Attributes Sorter

bcouetil

Benoit COUETIL 💫

Posted on March 30, 2023

Initial thoughts

If you came here looking for a script that sorts job attributes in a given order in GitLab YAML files, "we" managed to make a very satisfying one, even though I had nearly no previous experience in Python. You will find it at the end of this article.

I searched extensively for a tool to sort job attributes in GitLab YAML files and could not find any. I kept postponing writing one myself until recently.

My first attempt, a shell script, was not satisfying, because yq, the best shell tool to work on YAML, handles emojis poorly and strips blank lines.

I used a workaround for the emojis, but overall the script was not flexible enough to handle edge cases.

Here is the script I wrote, for the record:

#!/bin/zsh
set -e

IRed='\033[0;91m'    # Red
IGreen='\033[0;92m'  # Green
IYellow='\033[0;93m' # Yellow
COff='\033[0m'       # Text Reset

fix_emojis() {
    echo "${IGreen}Fixing bad yq emojis handling on $1...$COff"
    INPUT_FILE=$1
    OUTPUT_FILE=$2
    # using echo to translate emoji chars from unicode to UTF-8
    # + replacing unnecessary double quotes around emoji fields
    echo "${IGreen}$(cat $INPUT_FILE | sed 's/"\(.*\\U0.*\)"/\1/g')$COff" >$OUTPUT_FILE
}

GITLAB_YML_FILE_TO_SORT=${1:-.gitlab-ci.yml}
echo "${IGreen}Processing $GITLAB_YML_FILE_TO_SORT...$COff"

rm -rf .glsort 2>/dev/null
mkdir -p .glsort

rm -rf $GITLAB_YML_FILE_TO_SORT.orig

echo "${IGreen}Reorganizing attributes...$COff"
yq '(.* | select(has("extends") or has("script"))) |= pick(["extends", "stage", "image", "dependencies", "cache", "services", "variables", "before_script", "script", "after_script", "environment", "artifacts", "interruptible", "allow_failure", "when", "rules"] + keys | unique )' $GITLAB_YML_FILE_TO_SORT >.glsort/$GITLAB_YML_FILE_TO_SORT.organized

fix_emojis .glsort/$GITLAB_YML_FILE_TO_SORT.organized .glsort/$GITLAB_YML_FILE_TO_SORT.organized-emojis

echo "${IGreen}Making a blankline-free version of $GITLAB_YML_FILE_TO_SORT for later diff...$COff"
yq '.' $GITLAB_YML_FILE_TO_SORT >.glsort/$GITLAB_YML_FILE_TO_SORT.blankline-free
fix_emojis .glsort/$GITLAB_YML_FILE_TO_SORT.blankline-free .glsort/$GITLAB_YML_FILE_TO_SORT.blankline-free-emojis

echo "${IGreen}Preparing final diff file...$COff"
diff --ignore-blank-lines .glsort/$GITLAB_YML_FILE_TO_SORT.blankline-free-emojis .glsort/$GITLAB_YML_FILE_TO_SORT.organized-emojis >.glsort/$GITLAB_YML_FILE_TO_SORT.diff || echo "${IYellow}diff command has (silently ?) failed, please check your pipeline file, you might have to move a line by yourself$COff"

echo "${IGreen}Patching $GITLAB_YML_FILE_TO_SORT...$COff"
patch $GITLAB_YML_FILE_TO_SORT .glsort/$GITLAB_YML_FILE_TO_SORT.diff || true

if [ -f $GITLAB_YML_FILE_TO_SORT.rej ]; then
    echo "${IYellow} Some diff unapplied, please ignore if special characters problem or line already exists: $COff"
    cat $GITLAB_YML_FILE_TO_SORT.rej
    rm -f $GITLAB_YML_FILE_TO_SORT.rej
fi

echo "${IGreen}Deleting intermediate .glsort files...$COff"
rm -rf .glsort

[ -f $GITLAB_YML_FILE_TO_SORT.orig ] && echo "Original file saved as $GITLAB_YML_FILE_TO_SORT.orig"

I decided that these GitLab YAML files had to be handled as text files, not as YAML files, to avoid messing with the developer's presentation choices.

With the rise of ChatGPT, it was finally time to try the famous tool everyone has been talking about on the internet in early 2023. At the time of writing, the model version used is GPT-3.5.

Meeting the cool kid

I had not read much about ChatGPT. My first tests, a few weeks ago, on non-code subjects, were a mix of "wow, it seems human" and "man, how could you forget what I said a few sentences earlier?".

But now was the time to request some code!

First try at asking for a script

With my approximate English, I naively asked first:

I have a file, with lines not starting with spaces, lines starting with only 2 spaces (and more non spaces characters, and lines starting with at least 4 spaces.

Lines starting with 2 spaces are, individually, just before a block of lines starting with at least 4 spaces.

These blocks of line should move at once and be reordered alphabetically.

Could you make me a script that does that

ChatGPT answered:

Certainly! Here is a Python script that should do what you're asking for:

with open('input_file.txt', 'r') as f:
    lines = f.readlines()

output_lines = []
current_block = []

for line in lines:
    if line.startswith('  '):
        current_block.append(line)
    else:
        if current_block:
            current_block.sort()
            output_lines += current_block
            current_block = []
        output_lines.append(line)

if current_block:
    current_block.sort()
    output_lines += current_block

with open('output_file.txt', 'w') as f:
    f.writelines(output_lines)

This script reads in the lines from an input file, then iterates through them, keeping track of any block of lines starting with two spaces. When a non-indented line is encountered, the script checks if there is a current block of lines and, if so, sorts the block alphabetically and appends it to the output. The non-indented line is then appended to the output, and the script moves on to the next line.

If there is still a current block of lines at the end of the file, it is sorted and appended to the output as well. Finally, the sorted lines are written to an output file.

Note that this assumes that the input file exists and is named input_file.txt, and that the output should be written to a file named output_file.txt. You will need to replace these filenames with the appropriate names for your own files.

I was blown away at first. But after trying the script, it was doing nothing good at all, except sorting some lines.
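
To illustrate what goes wrong, here is a minimal sketch that replays the same logic on a tiny sample job held in memory. The sample lines and the function name `sort_indented_lines` are assumptions made for this illustration, not part of the conversation. Because every line starting with two spaces is sorted individually, nested lines get detached from the attribute they belong to.

```python
def sort_indented_lines(lines):
    """Replay the first script's logic on a list of lines instead of a file."""
    output_lines = []
    current_block = []
    for line in lines:
        if line.startswith('  '):
            # Lines with 4+ spaces also start with 2 spaces, so they land here too
            current_block.append(line)
        else:
            if current_block:
                current_block.sort()
                output_lines += current_block
                current_block = []
            output_lines.append(line)
    if current_block:
        current_block.sort()
        output_lines += current_block
    return output_lines


# Hypothetical sample job, for illustration only
sample = [
    "job:\n",
    "  script:\n",
    "    - make\n",
    "  image: alpine\n",
]

result = sort_indented_lines(sample)
print("".join(result), end="")
```

In the result, `- make` ends up right under `job:` instead of under `script:`, which no longer describes the same job.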

Two steps forward, one step back: not very productive iterations

I spent quite some time trying to explain to it, step by step, what was wrong:

This does not work as expected, so I will clarify:

Lines not indented do not move.

Lines indented with 2 spaces only are the start of a block, containing 0, 1 or multiple indented lines with 4 spaces or more.

Move these blocks alphabetically by the first line of the blocks

And, as always, ChatGPT apologizes and tries something new.

I was gradually realizing that I was the one in the process of learning my own requirements (and speaking better English 😅). And that ChatGPT forgets very quickly.

Now I know what I want: stepping up with full requirements

Now I really know how to say what I want. It is the moment to start over, and from now on I don't hesitate to do it multiple times:

Let's aggregate what I told the make a better script:

Let's parse a file. Not standard input, but a file that you have to open.

If the current line has no indentation, print it.

If a line starts with 2 spaces only, put it in a new block of lines. These lines always start with a word followed by ":".

If a line starts with 4 spaces or more, put it in the current block.

A block is never sorted.

when a line is empty, sort the blocks between them, by comparing the first word, and print all of them, then delete them.

The sorting is not be alphabetically, but by this descending order of starting words: "extends", "stage", "image", "dependencies", "cache", "services", "variables", "before_script", "script", "after_script", "environment", "artifacts", "interruptible", "allow_failure", "when", "rules"

don't forget that lines starting with 4 spaces also start with 2 spaces, so they should be checked before lines with 2 spaces only.

you should print blocks the moment an empty line is found, not at the end of the loop.

each time you find an empty line, you have to sort blocks between them and empty the blocks structure, then print the empty line, and continue with the rest of the file

separate cases between empty lines and non-indented lines, for now you put that in a single else, and that way you print twice the same lines

There is always an empty line at the end of the file, so no need for printing residual blocks.
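
As a side note outside of the prompt, the indentation rules it describes can be sketched as a small classifier. This is only an illustration under the assumptions of the prompt (2-space indentation, no tabs). The function name `classify` and the returned labels are hypothetical. It shows why the 4-space test has to come before the 2-space test.

```python
def classify(line):
    """Return the role of a line, based only on its leading spaces."""
    if line.strip() == "":
        return "empty"       # triggers sorting and printing of pending blocks
    if line.startswith(" " * 4):
        return "nested"      # belongs to the current block
    if line.startswith(" " * 2):
        return "attribute"   # starts a new block
    return "top-level"       # printed as-is


for text in ["build:\n", "  script:\n", "    - make\n", "\n"]:
    print(repr(text), "->", classify(text))
```

Swapping the two `startswith` tests would classify every nested line as a new attribute, since a line starting with four spaces also starts with two.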

I forgot I could help on writing the script

I realized I had been expecting ChatGPT to write the script alone, whereas I could work alongside it instead of explaining over and over what to change.

This was a major leap in "our" productivity.

I improved your script, could you explain it ?

# python3 glsort.py input_file.txt > output_file.txt

import sys

# Define the order in which job_attributes should be sorted
ATTRIBUTES_ORDER = ["extends", "stage", "image", "dependencies", "cache", "services", "variables",
                    "before_script", "script", "after_script", "environment", "artifacts",
                    "interruptible", "allow_failure", "when", "rules"]

# Read the filename from the command line arguments
filename = sys.argv[1]

# Initialize variables for tracking job_attributes and lines
job_attributes = []
current_attribute = []


def print_sorted_attributes():
    if job_attributes:
        sorted_job_attributes = sorted(job_attributes, key=lambda b: ATTRIBUTES_ORDER.index(b[0].lstrip().split(':')[0])
                                       if b[0].lstrip().split(':')[0] in ATTRIBUTES_ORDER
                                       else len(ATTRIBUTES_ORDER))
        for block in sorted_job_attributes:
            for line in block:
                print(line, end='')


# Open the file and start reading lines
with open(filename, 'r') as f:
    for line in f:

        # Check if the line starts with 4 or more spaces (i.e., is indented)
        if line.startswith(' ' * 4):
            # Add the line to the current block
            current_attribute.append(line)
        # 'stages' special case
        elif line.lstrip().startswith('-'):
            print(line, end='')
        # Check if the line starts with 2 spaces (i.e., is the start of a new block)
        elif line.startswith(' ' * 2):
            # Add the current block to the list of job_attributes (if it's not empty)
            current_attribute = [line]
            job_attributes.append(current_attribute)
        # Otherwise, the line is not indented and should be printed
        else:
            print_sorted_attributes()

            # Reset the job_attributes variable and continue to the next line
            job_attributes = []
            current_attribute = []
            print(line, end='')

# If there are any remaining job_attributes at the end of the file, sort and print them
print_sorted_attributes()

ChatGPT responded:

Sure, here's an explanation of the updated script:

  • We start by defining a list ATTRIBUTES_ORDER which specifies the order in which the job attributes should be sorted. This list includes all the possible attributes that may appear in the input file, in the desired order.

[...]

ChatGPT can explain itself on specific code

Some obscure code was preventing me from going further in refactoring by myself. Once again, I was holding myself back for no good reason: ChatGPT can explain its own code.

when you gave this, can you explain ?

sorted_blocks = sorted(blocks, key=lambda b:
    SORT_ORDER.index(b[0].split(':')[0])
    if b[0].split(':')[0] in SORT_ORDER else len(SORT_ORDER))

Certainly!

This line of code defines sorted_blocks as a sorted version of the blocks list, using a custom key function for sorting.

The key function is defined using a lambda function, which takes a block b as input and returns the position of its first word in the SORT_ORDER list, or the length of SORT_ORDER if the first word is not in SORT_ORDER.

Here is a breakdown of the key function:

b[0] selects the first line of the block, which contains the keyword and colon character.
b[0].split(':')[0] splits this line on the colon character and selects the first element, which is the keyword.
if b[0].split(':')[0] in SORT_ORDER checks if the keyword is in the SORT_ORDER list.
SORT_ORDER.index(b[0].split(':')[0]) returns the position of the keyword in SORT_ORDER if it is in the list.
else len(SORT_ORDER) returns the length of SORT_ORDER if the keyword is not in the list.

Putting it all together, the key function returns the position of the first word in the SORT_ORDER list if it is in the list, or the length of the list if it is not. This ensures that blocks with recognized keywords are sorted before blocks with unrecognized keywords, and within each group, blocks are sorted according to the position of their first word in the SORT_ORDER list.

The sorted() function then applies this key function to each block in blocks to determine the sorting order, and returns the sorted list as sorted_blocks.
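
For illustration, outside of the conversation, the same key function can be run on a small made-up list of blocks. `SORT_ORDER` is shortened here and the blocks are hypothetical. The lambda is written as a named function only for readability.

```python
SORT_ORDER = ["image", "script", "rules"]  # shortened order, for illustration only

# Each block is a list of lines; the first line carries the keyword
blocks = [
    ["rules:\n", "  - when: manual\n"],
    ["custom_key: 1\n"],
    ["script:\n", "  - make\n"],
    ["image: alpine\n"],
]


def sort_key(block):
    keyword = block[0].split(':')[0]
    if keyword in SORT_ORDER:
        return SORT_ORDER.index(keyword)
    # Unknown keywords all share the same rank, after every known keyword
    return len(SORT_ORDER)


sorted_blocks = sorted(blocks, key=sort_key)
print([block[0].split(':')[0] for block in sorted_blocks])
# → ['image', 'script', 'rules', 'custom_key']
```

Since Python's `sorted()` is stable, blocks with unknown keywords keep their original relative order at the end.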

ChatGPT can add logs to help understand algorithms

My limited understanding of Python could be helped with a few logs in the obscure parts. No problem for ChatGPT:

could we add prints in the lambda process to better understand what is happening at runtime ?

Response:

Certainly! Here's the same lambda function with added print statements:

sorted_blocks = sorted(blocks, key=lambda b: (b[0], print(b[0]), SORT_ORDER.index(b[0].split(':')[0]) if b[0].split(':')[0] in SORT_ORDER else len(SORT_ORDER), print(SORT_ORDER.index(b[0].split(':')[0]) if b[0].split(':')[0] in SORT_ORDER else len(SORT_ORDER))))

This will print out the value of b[0] and the index it is being sorted by using the SORT_ORDER list. It should give you a better idea of what's happening at runtime.
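
A caveat on this one-liner: the key it returns is now a tuple starting with `b[0]`, and Python compares tuples element by element, so the blocks end up ordered by their first line rather than by their position in `SORT_ORDER`. A sketch of an alternative that logs without changing the ordering, using a named function instead of a lambda, could look like this. `logging_key`, the shortened `SORT_ORDER` and the sample blocks are assumptions for illustration.

```python
SORT_ORDER = ["image", "script", "rules"]  # shortened order, for illustration only


def logging_key(block):
    keyword = block[0].split(':')[0]
    rank = SORT_ORDER.index(keyword) if keyword in SORT_ORDER else len(SORT_ORDER)
    # Log what is being compared, but return only the rank as the sort key
    print(f"{keyword!r} -> rank {rank}")
    return rank


blocks = [["script:\n"], ["unknown:\n"], ["image: alpine\n"]]
sorted_blocks = sorted(blocks, key=logging_key)
```

The key function is called once per block, so this prints one line per block before the sorted list is returned.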

ChatGPT can generate script heading... beyond expectation

I simply asked:

Could you generate a heading explanation I could paste in the script ?

I got in response:

GLSORT

Script for sorting GitLab CI job attributes by a predefined order.

Usage:
  python3 glsort.py input_file.txt > output_file.txt

The script reads a GitLab CI YAML file and sorts the job attributes by a predefined order,
which can be customized by modifying the ATTRIBUTES_ORDER list. The sorted YAML file is
output to stdout and can be redirected to a file.

Algorithm:
- Initialize variables for tracking job attributes and lines
- Define the order in which job attributes should be sorted
- Read the filename from the command line arguments
- Open the file and start reading lines
- If the line starts with 4 or more spaces, add it to the current job attribute
- If the line starts with 2 spaces, start a new job attribute
- If the line is not indented, print the sorted job attributes and the line
- When an empty line is found, sort the job attributes and print them, then print the empty line
- If there are any remaining job attributes at the end of the file, sort and print them
- Job attributes are sorted by the order defined in ATTRIBUTES_ORDER
- 'stages' section is a special case and will be printed as is

While I was reading this satisfying text (something ChatGPT is very good at), I freaked out. Let me emphasize this part:
The script reads a GitLab CI YAML file and sorts the job attributes by a predefined order

GitLab!? Not a single time did I tell ChatGPT that we were working on GitLab YAML files...

I had to ask:

how did you understand that it was a gitlab ci job attributes sorter ?

ChatGPT responded:

From the structure of the input file and the specific details mentioned in the task description, it was clear that the script was meant to sort GitLab CI job attributes. GitLab CI uses YAML files to define jobs and their attributes, and these files follow a specific structure. The task description mentioned key details such as blocks of code beginning with a word followed by a colon (e.g., "script:"), the use of indentation to denote block hierarchy, and the presence of specific job attributes such as "image", "script", and "variables". These details align with the structure and syntax used in GitLab CI YAML files.

Pretty neat, huh? 😎

For small improvements, better ask ChatGPT for a subscript/subtask

To work around ChatGPT's tendency to forget and rework too much, you can help it focus on one part of the code or one part of the requirements. Then integrate the change into the code yourself.

It also helps avoid reaching the maximum text size of the free version of ChatGPT.

One example:

Let's concentrate on a small part of the script.

Please make me a python script which:

  • print the first argument if it is a file
  • print "folder" if it is a folder
  • print ".gitlab-ci.yml" if there is no argument
  • print usage in all other cases

Answer:

Here's a Python script that checks the first command line argument and prints the appropriate output:

import os
import sys

if len(sys.argv) == 1:
    print(".gitlab-ci.yml")
    sys.exit(0)

arg = sys.argv[1]

if os.path.isfile(arg):
    print(arg)
elif os.path.isdir(arg):
    print("folder")
else:
    print("Usage: python script.py [filename or foldername]")
    sys.exit(1)
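
To integrate such a snippet into a larger script, one option is to wrap the decision in a function that returns a value instead of printing and exiting. The sketch below is an assumption about how that could look, not what was used in the conversation. `resolve_target` is a hypothetical name, and unlike the answer above it treats extra arguments as a usage error.

```python
import os
import tempfile


def resolve_target(args):
    """Decide what to process; args is the argument list without the program name."""
    if len(args) == 0:
        return ".gitlab-ci.yml"
    if len(args) == 1 and os.path.isfile(args[0]):
        return args[0]
    if len(args) == 1 and os.path.isdir(args[0]):
        return "folder"
    return "usage"


# Exercise the four cases against a temporary folder and file
with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "ci.yml")
    with open(file_path, "w") as f:
        f.write("stages: []\n")

    no_argument = resolve_target([])
    with_file = resolve_target([file_path])
    with_folder = resolve_target([folder])
    too_many = resolve_target(["a", "b"])

print(no_argument, with_folder, too_many)
```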

Another example:

Would it be possible to define a list of ignored folders for this function ?

def process_files_recursively(directory):
    for root, dirs, files in os.walk(directory):
        for file in files:
            if GITLAB_FILENAME_REGEX.match(file):
                file_path = os.path.join(root, file)
                process_file(file_path)

Response:

Yes, you can define a list of ignored directories and use it to modify the dirs list in the os.walk function. Here's an updated version of the function that does this:

def process_files_recursively(directory):
    ignored_dirs = [".git", ".svn", "node_modules"] # List of directories to ignore
    for root, dirs, files in os.walk(directory):
        # Remove ignored directories from the list
        dirs[:] = [d for d in dirs if d not in ignored_dirs]
        for file in files:
            if GITLAB_FILENAME_REGEX.match(file):
                file_path = os.path.join(root, file)
                process_file(file_path)
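
The key detail is the slice assignment `dirs[:] = ...`: it mutates the very list that `os.walk` uses to decide where to descend next (in its default top-down mode), whereas a plain `dirs = ...` would only rebind a local name and prune nothing. A small self-contained check on a temporary folder tree, where `list_yaml_files` and the folder names are made up for the illustration:

```python
import os
import tempfile

IGNORED = ["node_modules"]


def list_yaml_files(directory):
    found = []
    for root, dirs, files in os.walk(directory):
        # In-place update: ignored folders are never entered by os.walk
        dirs[:] = [d for d in dirs if d not in IGNORED]
        for name in files:
            if name.endswith(".yml"):
                full_path = os.path.join(root, name)
                found.append(os.path.relpath(full_path, directory))
    return sorted(found)


with tempfile.TemporaryDirectory() as top:
    # Same file name in a regular folder and in an ignored one
    for sub in ["ci", "node_modules"]:
        os.makedirs(os.path.join(top, sub))
        with open(os.path.join(top, sub, "jobs.yml"), "w") as f:
            f.write("job: {}\n")
    found = list_yaml_files(top)

print(found)  # only the file under "ci" is listed
```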

Wrapping up

Overall, ChatGPT 3.5 seems to me like a genius 9-year-old child who struggles to understand anything beyond my literal words. It wants to please me more than it wants to find the truth. But it is a great enabler, improving my productivity with satisfying bootstrap code.

In about 2 hours, "we" produced a new script in a language I was not proficient in.

Here it is for anyone to use:

#!/usr/local/bin/python

# GitLab Yaml sorter

# Script for sorting GitLab CI job attributes by a predefined order.

# Usage: gitlab-yaml-sort.py [file|folder]

# Parameters:
# file: the file to be processed in-place
# folder: a folder where .gitlab-ci.yml (and similarly named files) will be processed recursively
# no parameter: .gitlab-ci.yml in current folder will be processed

# The script edits in-place a GitLab CI YAML file and sorts the job attributes by a predefined order.
# Unknown attributes are pushed to the end.
# If any error occurs while processing, the original file is kept as a .bak file.

import sys
import os
import re
import shutil

# Define the order in which job_attributes should be sorted
ATTRIBUTES_ORDER = ["extends", "stage", "image", "needs", "dependencies", "cache", "services", "variables",
                    "before_script", "script", "after_script", "coverage", "environment", "artifacts",
                    "interruptible", "allow_failure", "when", "rules", "tags"]

# Define special keywords to skip sorting of the following block. 'default' is a special keyword but should be sorted
IGNORED_TOP_LEVEL_KEYWORDS = ["stages", "include", "variables"]

# List of directories to ignore
IGNORED_DIRECTORIES = [".git", ".history",
                       "node_modules", "tmp", ".gitlab-ci-local", "build-docs"]

# Define regex for matching filenames
GITLAB_FILENAME_REGEX = re.compile(r'.*\.gitlab-ci.*\.ya?ml$')


def sort_job_attributes(job_attributes):
    sorted_job_attributes = sorted(job_attributes, key=lambda b: ATTRIBUTES_ORDER.index(b[0].lstrip().split(':')[0])
                                   if b[0].lstrip().split(':')[0] in ATTRIBUTES_ORDER
                                   else len(ATTRIBUTES_ORDER))
    return [line for block in sorted_job_attributes for line in block]


def process_file(filename):

    # make a backup of the original file in case of error processing it
    shutil.copyfile(filename, filename + ".bak")

    # Initialize variables for tracking job_attributes and lines
    job_attributes = []
    current_attribute = []
    is_current_block_sortable = True
    last_line_was_empty = False

    with open(filename, 'r') as f:
        lines = f.readlines()

    with open(filename, 'w') as f:
        for line in lines:
            # Check if the line starts with special keywords
            if any(line.startswith(keyword) for keyword in IGNORED_TOP_LEVEL_KEYWORDS):
                # flush current attribute content
                f.write(''.join(sort_job_attributes(job_attributes)))
                job_attributes = []
                current_attribute = []
                if last_line_was_empty:
                    f.write('\n')
                # don't reorganize and write
                f.write(line)
                is_current_block_sortable = False
            # Check if the line is indented in a non-sortable block
            elif line.startswith(' ' * 2) and not is_current_block_sortable:
                # Just write
                f.write(line)
            # Check if the line is indented to sub-sublevel
            elif line.startswith(' ' * 4):
                # Add the line to the current block
                current_attribute.append(line)
            # Check if the line is indented to sublevel: this is the beginning of a new block to be sorted
            elif line.startswith(' ' * 2):
                # Start a new block with this line and register it in job_attributes
                current_attribute = [line]
                job_attributes.append(current_attribute)
                is_current_block_sortable = True
            # Handle special case when there are empty lines in attributes (such as 'script')
            elif line.strip() == '':
                last_line_was_empty = True
            # Otherwise, the line is not indented and should be written
            else:
                f.write(''.join(sort_job_attributes(job_attributes)))
                if last_line_was_empty:
                    f.write('\n')
                f.write(line)
                # Reset variables and continue to the next line
                job_attributes = []
                current_attribute = []
                is_current_block_sortable = True
                last_line_was_empty = False

        if current_attribute:
            f.write(''.join(sort_job_attributes(job_attributes)))

    print("successfully sorted job attributes in " + filename)
    os.remove(filename + ".bak")


def process_files_recursively(directory):
    for root, dirs, files in os.walk(directory):
        # Remove ignored directories from the list
        dirs[:] = [d for d in dirs if d not in IGNORED_DIRECTORIES]
        for file in files:
            if GITLAB_FILENAME_REGEX.match(file):
                file_path = os.path.join(root, file)
                process_file(file_path)


if __name__ == '__main__':
    if len(sys.argv) == 1:
        process_file(".gitlab-ci.yml")
    elif len(sys.argv) == 2:
        path = sys.argv[1]
        if os.path.isfile(path):
            process_file(path)
        elif os.path.isdir(path):
            process_files_recursively(path)
    else:
        print("usage: python {} [file|folder]".format(sys.argv[0]))
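
As an illustration of which files the folder mode picks up, here is a quick check of the filename regex from the script against a few names. The candidate filenames are invented for the example.

```python
import re

# Same pattern as in the script above
GITLAB_FILENAME_REGEX = re.compile(r'.*\.gitlab-ci.*\.ya?ml$')

# Hypothetical filenames, for illustration only
candidates = [
    ".gitlab-ci.yml",
    "deploy.gitlab-ci.yaml",
    ".gitlab-ci-templates.yml",
    "docker-compose.yml",
    ".gitlab-ci.yml.bak",
]

matched = [name for name in candidates if GITLAB_FILENAME_REGEX.match(name)]
print(matched)
# → ['.gitlab-ci.yml', 'deploy.gitlab-ci.yaml', '.gitlab-ci-templates.yml']
```

The `.bak` backup created during processing does not match, so a run interrupted by an error will not cause backups to be sorted on the next run.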
