A Small Code Renaming Example in Python

techocodger

techocodger

Posted on July 8, 2023

A Small Code Renaming Example in Python

I've recently gone to make some edits in a program, and immediately realised that my variable and function naming was poor and should be rectified before I make the next edits.

The general situation is in a part of the application Foldatry, in particular the part that does the hash-based matching of files by their content. In the process of that, it visits every file and calculates a hash by reading the entire file. This is a fairly standard way of finding files that have the exact same content, regardless of the filenames and timestamps.

What I would like to do, is to amend the program to let me have an option to do a faster process in which it first matches file by their filenames, sizes and timestamps. But first, some housekeeping seemed worthwhile.

Prefixes

As I'll be using them in this example, while not vital to make sense of the following, it may be useful to know my personal prefix codes are:

  • i = internal - where that seems useful to distinguish from one of similar name with a different prefix
  • v = var parameter - aka passed by reference (the "var" here comes from Pascal)
  • p = parameter - by passed by value
  • r = result - a value that will be quoted in a Python "return" statement
  • g = global - which in the context of Python means global to a module - I try to avoid needing this.

The Outer Call

Here is the outer layer of the call that does this.

def filedupetry_command( p_tree_paths, p_multi_log):
    # set up structures for Phase 1
    d_dct_size_lst_pfn = {} # dictionary of file sizes each with a list of pathfiles
    mtc_dct_hash_lst_tplfilinf = {} # dictionary of hashes 
    mtc_dct_file_selves = {} # dictionary of files with info about them
    # ---------------------
    # Phase 1 - tree traversal
    # Traverse the tree calculating and collecting hashes
    tpi = 0
    tpn = len(p_tree_paths)
    for i_tree_path in p_tree_paths:
        tpi = tpi + 1
        traverse_tree_add_hash( d_dct_size_lst_pfn, mtc_dct_file_selves, i_tree_path, p_multi_log)
    # ---------------------
    # Phase 2 - Hashes for files the same size
    build_hash_dct_from_Size_dct( mtc_dct_hash_lst_tplfilinf, d_dct_size_lst_pfn, p_multi_log)
    # ---------------------
    # Phase 3 - Analysis - make a size based dictionary of the hashes
    dct_size_lst_hash_multi_files = make_dct_size_lst_hashed_with_multi_files( mtc_dct_hash_lst_tplfilinf )
    # ---------------------
    return mtc_dct_hash_lst_tplfilinf, mtc_dct_file_selves, dct_size_lst_hash_multi_files
Enter fullscreen mode Exit fullscreen mode

It was passed in a list of folders to process - i.e. a list of strings giving the paths. You can ignore all the p_multi_log references as that is for my logging system.

So actually the first thing that's poor about this code already: is that the name of a function it calls is quite misleading.
i.e. traverse_tree_add_hash because while that name was valid the first time I wrote it, I later changed the approach.
Actually now, that function builds just a dictionary of files keyed by their sizes - for the idea that only those of the same size should then be hashed for contents. However, this improvement is reperesented in the name of the local parameter being passed to it d_dct_size_lst_pfn being a dictionary keyed by size and having a list of path-filenames.

So, a better name would now be something like traverse_tree_make_sizes_dct

While we're here, let's note that the running of that function is just a first phase, with a second phase being to use that size-keyed dictionary to then build yet another resource keyed by the content hashes - by calling build_hash_dct_from_Size_dct

Even while we're looking at this outer part of code, there are some other names that I might now want to change.
Let's look at these three :

    d_dct_size_lst_pfn = {} # dictionary of file sizes each with a list of pathfiles
    mtc_dct_hash_lst_tplfilinf = {} # dictionary of hashes 
    mtc_dct_file_selves = {} # dictionary of files with info about them
Enter fullscreen mode Exit fullscreen mode

At the time I probably named d_dct_size_lst_pfn with a d prefix to indicate it was "data" being assembled. Now, I would rename this to have a prefix of i to indcate that it is internal to this function, hence to become i_dct_size_lst_pfn

At the time I probably named mtc_dct_hash_lst_tplfilinf with a prefix of mtc to distinguish which layer of function it was in, with the "mtc" being a reference to that function context. That's not relevant here, and is only a hangover from when this function was a copy-and-adapt construction from earlier code. So this can be renamed to r_dct_hash_lst_tplfilinf

Similarly, mtc_dct_file_selves can be renamed to r_dct_file_selves to clarify that this will be built and then returned.

Ditto for

  • dct_size_lst_hash_multi_files which can become r_dct_size_lst_hash_multi_files

Hence, after applying those renamings the code looks like:

def filedupetry_command( p_tree_paths, p_multi_log):
    # set up structures for Phase 1
    i_dct_size_lst_pfn = {} # dictionary of file sizes each with a list of pathfiles
    r_dct_hash_lst_tplfilinf = {} # dictionary of hashes 
    r_dct_file_selves = {} # dictionary of files with info about them
    # ---------------------
    # Phase 1 - tree traversal
    # Traverse the tree calculating and collecting hashes
    tpi = 0
    tpn = len(p_tree_paths)
    for i_tree_path in p_tree_paths:
        tpi = tpi + 1
        traverse_tree_make_sizes_dct( i_dct_size_lst_pfn, r_dct_file_selves, i_tree_path, p_multi_log)
    # ---------------------
    # Phase 2 - Hashes for files the same size
    build_hash_dct_from_Size_dct( r_dct_hash_lst_tplfilinf, i_dct_size_lst_pfn, p_multi_log)
    # ---------------------
    # Phase 3 - Analysis - make a size based dictionary of the hashes
    r_dct_size_lst_hash_multi_files = make_dct_size_lst_hashed_with_multi_files( r_dct_hash_lst_tplfilinf )
    # ---------------------
    return r_dct_hash_lst_tplfilinf, r_dct_file_selves, r_dct_size_lst_hash_multi_files
Enter fullscreen mode Exit fullscreen mode

Note that here we have glibly renamed a function call to be traverse_tree_make_sizes_dct but we will of course need to rename that where it is defined.

The Inner Call

Now let's look at the inner function, and what renaming could or should be done in there.

Actually, it is really two layers because there is a recursive function, with a non-recursive outer call.

Here is the outer call:

# recursively traverse a folder structure, deriving and storing hashes
# this is the outer non-recursive part
# currently the only difference is the use of a depth counter in the recursive call
# and that's only there to help with debugging
def traverse_tree_add_hash( d_dct_size_lst_pfn, folderselfs_d, p_tree_path, p_multi_log):
    i_lst_ncalls = [0]
    traverse_tree_recurse( d_dct_size_lst_pfn, folderselfs_d, p_tree_path, 0, i_lst_ncalls, p_multi_log)
Enter fullscreen mode Exit fullscreen mode

in which I've left comments in, to show that these also need changing - as the function does not in fact do any "deriving and storing hashes"

So, as we've already covered, we want to rename this to traverse_tree_make_sizes_dct

# recursively traverse a folder structure, compiling a dictionary of sizes
# this is the outer non-recursive part
# currently the only difference is the use of a depth counter in the recursive call
# and that's only there to help with debugging
def traverse_tree_make_sizes_dct( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, p_multi_log):
    i_lst_ncalls = [0]
    traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, 0, i_lst_ncalls, p_multi_log)
Enter fullscreen mode Exit fullscreen mode

So what I've done there is to also do some internal variable renaming:

  • d_dct_size_lst_pfn becoming v_dct_size_lst_pfn
  • folderselfs_d becoming v_dct_folderselfs

Now let's look at the inner, recursive, function.

# recursively traverse a folder structure, deriving and storing hashes
def traverse_tree_recurse( d_dct_size_lst_pfn, d_pfn_stock, p_tree_path, p_tree_depth, p_lst_ncalls, p_multi_log):
    # d_hash_lst_tplfilinf is a dictionary being built, with hashes for keys, and lists of tuples for values
    # d_pfn_stock is a dictionary being built, with folderpaths for keys, and tuples for values
    # p_tree_path is the folder location now being accounted 
    p_lst_ncalls[0] += 1
    fldr_n = p_lst_ncalls[0]
    entries = explrtry.get_lst_sorted_os_scandir( p_tree_path )
    filcount = 0
    dircount = 0
    for entry in entries:
        if entry.is_file(follow_symlinks=False):
            # assemble specifics and derive hash
            pfn = im_osp.join(p_tree_path, entry.name) 
            size = entry.stat().st_size
            # make as a tuple
            n_t = tpl_FileInfo( pfn, entry.name, size, "")
            # now store it
            store_dct_size_lst_add_PathFile( d_dct_size_lst_pfn, n_t )
            store_dct_pfn_tpl_FileInfo( d_pfn_stock, n_t )
            filcount = filcount + 1
        elif entry.is_dir(follow_symlinks=False):
            # recursive call returning the hash of the subfolder
            new_subpath = im_os.path.join(p_tree_path, entry.name)
            traverse_tree_recurse( d_dct_size_lst_pfn, d_pfn_stock, new_subpath, p_tree_depth + 1, p_lst_ncalls, p_multi_log)
            # count subfolders and prepare for recursion further down
            dircount = dircount + 1
    return 
Enter fullscreen mode Exit fullscreen mode

Here is the plan of which things I want to rename:

  • d_dct_size_lst_pfn becomes v_dct_size_lst_pfn
  • d_pfn_stock becomes v_dct_pfn_stock
  • p_lst_ncalls becomes v_lst_ncalls - we'll say more about this later

Here is the code after the renaming has been done,

# recursively traverse a folder structure, deriving and storing hashes
def traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_pfn_stock, p_tree_path, p_tree_depth, v_lst_ncalls, p_multi_log):
    # d_hash_lst_tplfilinf is a dictionary being built, with hashes for keys, and lists of tuples for values
    # v_dct_pfn_stock is a dictionary being built, with folderpaths for keys, and tuples for values
    # p_tree_path is the folder location now being accounted 
    v_lst_ncalls[0] += 1
    fldr_n = v_lst_ncalls[0]
    entries = explrtry.get_lst_sorted_os_scandir( p_tree_path )
    filcount = 0
    dircount = 0
    for entry in entries:
        if entry.is_file(follow_symlinks=False):
            # assemble specifics and derive hash
            pfn = im_osp.join(p_tree_path, entry.name) 
            size = entry.stat().st_size
            # make as a tuple
            n_t = tpl_FileInfo( pfn, entry.name, size, "")
            # now store it
            store_dct_size_lst_add_PathFile( v_dct_size_lst_pfn, n_t )
            store_dct_pfn_tpl_FileInfo( v_dct_pfn_stock, n_t )
            filcount = filcount + 1
        elif entry.is_dir(follow_symlinks=False):
            # recursive call returning the hash of the subfolder
            new_subpath = im_os.path.join(p_tree_path, entry.name)
            traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_pfn_stock, new_subpath, p_tree_depth + 1, v_lst_ncalls, p_multi_log)
            # count subfolders and prepare for recursion further down
            dircount = dircount + 1
    return 
Enter fullscreen mode Exit fullscreen mode

Issues that Renaming Brings To Light

Now one of the benefits of doing this kind of renaming - other than making the code a little clearer for the next time it needs to be read and comprehended - is that the process brings to light various other things that should be re-done.

In this case, we bump into some things I did very early in the process of adapting from my generic programming experience over to how to do things in Python.

  • that the recursive calls should be sub-scoped to the outer call - i.e. the functions should be nested
  • that this allows a replacement of an ugly use of a list as a single value store

First let's look at the nesting. Here is a pseudo-brief of the functions as they are.

def traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_pfn_stock, p_tree_path, p_tree_depth, v_lst_ncalls, p_multi_log):
    return 

def traverse_tree_make_sizes_dct( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, p_multi_log):
    traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, 0, i_lst_ncalls, p_multi_log)
    return 
Enter fullscreen mode Exit fullscreen mode

That happens to be how things get done when coding in a language that supports recursion but not nesting. An example of this is VBA.

As Python does support nesting (as indeed does Pascal which is my mental internal coding language) that means we can restructure like this:

def traverse_tree_make_sizes_dct( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, p_multi_log):
    def traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_pfn_stock, p_tree_path, p_tree_depth, v_lst_ncalls, p_multi_log):
        return 
    traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, 0, i_lst_ncalls, p_multi_log)
    return 
Enter fullscreen mode Exit fullscreen mode

As presaged above, this allows a change to how one of the tracking variables is handled.

Now to be clear: as the code here was from my very early tackling of Python, when my inclination was to have a passed-by-reference scalar variable (i.e. just a number value) that would travel through the recursion to act as a progress counter. As Python doesn't support that idea - a scalar parameter when written to, creates a new local mutable instance - then I "got around" that by passing a list of just one value.

This makes for an ugly (in several senses) arrangement as see in this abstraction:

def traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_pfn_stock, p_tree_path, p_tree_depth, v_lst_ncalls, p_multi_log):
    v_lst_ncalls[0] += 1
    fldr_n = v_lst_ncalls[0]
    print( "Folders traversed: " + str( fldr_n) )
    return 

def traverse_tree_make_sizes_dct( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, p_multi_log):
    i_lst_ncalls = [0]
    traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, 0, i_lst_ncalls, p_multi_log)
    return 
Enter fullscreen mode Exit fullscreen mode

Where the list-of-one-value has to be created in the outer layer, passed to the inner layer and carefully handled there - incremented and then passed to itself in recursion.

But with the nested version, we can dispense with the list-of-one-value and also not need to pass it as a parameter.

def traverse_tree_make_sizes_dct( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, p_multi_log):
    i_ncalls = 0
    def traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_pfn_stock, p_tree_path, p_tree_depth, p_multi_log):
        nonlocal i_ncalls
        i_ncalls += 1
        print( "Folders traversed: " + str( i_ncalls) )
        return 
    traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, 0, p_multi_log)
    return 
Enter fullscreen mode Exit fullscreen mode

Do note that this requires one of Python's rare syntactic-only elements nonlocal to be deployed, to ensure that the required scoping of the variable i_ncalls is implemented across the function call depths. I simply didn't know about this when I originally wrote the code.

  • It's also a curious thing in Python syntax because it is only a guide instruction to the Python interpreter - no program action occurs at that point in the code. Also - in case it's not obvious - it requires the referred variable to already be known to the parser. Hence the following, which perhaps should be identical, will cause an error.
def traverse_tree_make_sizes_dct( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, p_multi_log):
    def traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_pfn_stock, p_tree_path, p_tree_depth, p_multi_log):
        nonlocal i_ncalls
        i_ncalls += 1
        print( "Folders traversed: " + str( i_ncalls) )
        return 
    i_ncalls = 0
    traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, 0, p_multi_log)
    return 
Enter fullscreen mode Exit fullscreen mode

End Note

This was written as notes for making a video. If that happens then a link will be edited into here.

πŸ’– πŸ’ͺ πŸ™… 🚩
techocodger
techocodger

Posted on July 8, 2023

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

How to Use KitOps with MLflow
beginners How to Use KitOps with MLflow

November 29, 2024

Configure python file in vscode
undefined Configure python file in vscode

November 30, 2024

Configure python file in vscode
undefined Configure python file in vscode

November 30, 2024