Learn Golang basics by creating a file counter

Intro

Beware: I'm writing this post while learning the basics of Golang, so these are just some kind of "learning notes".

I did not expect to learn much from a simple script like this, but there are always hidden pitfalls where we don't expect them.

The code can be found here

So I'm going to start a new job soon and I'll be using Golang. Until now I've always worked on the frontend, mostly in JavaScript (TypeScript just for some toy projects).
So I went through the Go tour to get the basics and also followed this amazing tutorial, which I recommend to everyone.

Making the script

To try out something I've decided to create a script that counts the number of files in a filesystem in 2 ways:

synchronously
asynchronously

This should let me get my hands dirty with some of the core concepts of the language.

Synchronous solution

Following the 'learn go with tests' approach, via TDD, I'm first going to write a test for my function:

package filecounter

import (
    "io/fs"
    "testing"
    "testing/fstest"
)

var (
    fakeFS = fstest.MapFS{
        "root-folder":                                  {Mode: fs.ModeDir},
        "root-folder/file-1.md":                        {Data: []byte("I'm a file in the root")},
        "root-folder/sub-folder-1":                     {Mode: fs.ModeDir},
        "root-folder/sub-folder-2":                     {Mode: fs.ModeDir},
        "root-folder/sub-folder-2/file-1.md":           {Data: []byte("I'm a file in folder2")},
        "root-folder/sub-folder-2/file-2.md":           {Data: []byte("I'm another file in folder 2 ")},
        "root-folder/sub-folder-3":                     {Mode: fs.ModeDir},
        "root-folder/sub-folder-3/sub-sub-1":           {Mode: fs.ModeDir},
        "root-folder/sub-folder-3/sub-sub-1/file-1.md": {Data: []byte("file")},
    }
)

func TestFileCounter(t *testing.T) {
    t.Run("should read the number of files in a fileSystem", func(t *testing.T) {
        got, err := FileCounterSync(fakeFS)
        want := 4

        if err != nil {
            t.Errorf("Didn't expect an error, but got one: %s", err)
        }

        if got != want {
            t.Errorf("got %d wanted %d", got, want)
        }
    })
}

Let's see what we've got here...
Basically we are creating a fake filesystem (using the fstest package) with some subfolders and a total of 4 files. This way we can run our test as many times as we want without worrying that the filesystem may change. That's it.
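
A side note on the fake filesystem: as far as I know, fstest.MapFS synthesizes the parent directories of the files it contains, so the explicit fs.ModeDir entries should be optional. A minimal sketch to check this, where demoFS and listNames are names made up for this example and not part of the script:

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// demoFS builds a small in-memory filesystem. Only files are declared:
// MapFS synthesizes the parent directories when they are needed.
// (Hypothetical helper, made up for this example.)
func demoFS() fs.FS {
	return fstest.MapFS{
		"root-folder/file-1.md":              {Data: []byte("I'm a file in the root")},
		"root-folder/sub-folder-2/file-1.md": {Data: []byte("I'm a file in folder2")},
	}
}

// listNames returns the names of the entries directly under dir,
// marking folders with a trailing slash.
func listNames(fsys fs.FS, dir string) ([]string, error) {
	entries, err := fs.ReadDir(fsys, dir)
	if err != nil {
		return nil, err
	}

	names := make([]string, 0, len(entries))
	for _, e := range entries {
		name := e.Name()
		if e.IsDir() {
			name += "/"
		}
		names = append(names, name)
	}
	return names, nil
}

func main() {
	names, err := listNames(demoFS(), "root-folder")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(names) // prints [file-1.md sub-folder-2/]
}
```

The sub-folder shows up as a directory even though it was never declared, so the fs.ModeDir lines in the test are mostly there for readability.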

If we try to run the test

go test

we are going to receive

undefined: FileCounterSync

So let's define this function in our file

package filecounter

import (
    "io/fs"
)

func FileCounterSync(fileSystem fs.FS) (int, error) {
    return 0, nil
}

Running now should give us a different message, something like

FAIL: got 0 wanted 4

This is expected, we didn't write the body of the function yet, so let's do it in the easiest way possible:

func FileCounterSync(fileSystem fs.FS) (int, error) {
    var numOfFiles int
    err := fs.WalkDir(fileSystem, ".", func(path string, d fs.DirEntry, err error) error {
        if err != nil {
            // stop walking and hand the error back to the caller
            return err
        }

        if !d.IsDir() {
            numOfFiles++
        }

        return nil
    })

    if err != nil {
        return 0, err
    }

    return numOfFiles, nil
}

This is the easiest way that came to my mind: we are just using the WalkDir API of the 'fs' package, which lets us define a function that is run for each file/folder of the given filesystem. There we can just check whether the current entry is a directory and, if it's not, increment the result 🎉.

This did not teach me much, I'm writing this script to learn something about the language, not just to count files. So let's rewrite this but this time without taking advantage of the built-in WalkDir API.

func FileCounterSync(fileSystem fs.FS) (int, error) {
    dir, err := fs.ReadDir(fileSystem, ".")
    if err != nil {
        return 0, err
    }

    var numOfFiles int
    for _, f := range dir {
        if !f.IsDir() {
            numOfFiles++
        } else {
            dirs, err := fs.ReadDir(fileSystem, f.Name())
            if err != nil {
                return 0, err
            }

            for _, file := range dirs {
                if !file.IsDir() {
                    numOfFiles++
                } else {
                    n, err := countFilesRecursively(fileSystem, f.Name(), file)
                    if err != nil {
                        return 0, err
                    }
                    numOfFiles += n
                }
            }
        }
    }

    return numOfFiles, nil
}

// Helper function
func countFilesRecursively(fileSystem fs.FS, prevPath string, dir fs.DirEntry) (int, error) {
    var n int
    newPath := prevPath + "/" + dir.Name()
    dirs, err := fs.ReadDir(fileSystem, newPath)

    if err != nil {
        return 0, err
    }

    for _, file := range dirs {
        if !file.IsDir() {
            n++
        } else {
            num, err := countFilesRecursively(fileSystem, newPath, file)
            if err != nil {
                return 0, err
            }

            n += num
        }
    }

    return n, nil
}

This is a synchronous (and the dumbest) solution: basically we cycle through every file/folder, increment a counter every time we encounter a file, and recurse otherwise.
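
Looking back at it, the nested loop in FileCounterSync repeats what the helper already does. A possible simplification, sketched under the assumption that a single recursive function taking the directory path is enough. The names countFilesIn and sampleFS are made up for this example, and path.Join is used to build the slash-separated paths that io/fs expects:

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// sampleFS is a reduced copy of the fake filesystem used in the test
// (hypothetical helper, made up for this example).
func sampleFS() fs.FS {
	return fstest.MapFS{
		"root-folder/file-1.md":                        {Data: []byte("root")},
		"root-folder/sub-folder-2/file-1.md":           {Data: []byte("folder 2")},
		"root-folder/sub-folder-2/file-2.md":           {Data: []byte("folder 2 again")},
		"root-folder/sub-folder-3/sub-sub-1/file-1.md": {Data: []byte("file")},
	}
}

// countFilesIn counts the files below dir, recursing into every subfolder.
func countFilesIn(fsys fs.FS, dir string) (int, error) {
	entries, err := fs.ReadDir(fsys, dir)
	if err != nil {
		return 0, err
	}

	n := 0
	for _, e := range entries {
		if !e.IsDir() {
			n++
			continue
		}
		// path.Join(".", "x") yields "x", which is a valid io/fs path.
		sub, err := countFilesIn(fsys, path.Join(dir, e.Name()))
		if err != nil {
			return 0, err
		}
		n += sub
	}
	return n, nil
}

func main() {
	n, err := countFilesIn(sampleFS(), ".")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(n) // prints 4
}
```

With this shape the public function would only need to call the helper on ".", and every error is returned to the caller instead of being handled in two places.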

This was very simple and did not teach me much (apart from some syntax and the fact that I should probably handle errors better).

Try to run the tests and everything should work just fine.

Asynchronous solution

Golang is known especially for concurrency; in fact, the language provides some constructs to help developers.

I decided to try to build the script following these "rules":

read the root folder
count each file in it
for each subfolder, spawn a goroutine that:
- spawns the same goroutine for each subfolder
- increments the number of files for each file
wait for all the spawned goroutines to end
return the counter

The test case is still the same, so we can just rewrite the function. My first attempt was:

func FileCounterAsync(fileSystem fs.FS) (int, error) {
    dir, err := fs.ReadDir(fileSystem, ".")
    if err != nil {
        return 0, err
    }

    var numOfFiles int

    for _, f := range dir {
        if !f.IsDir() {
            // count files in the root folder
            numOfFiles++
        } else {
            go countFiles(fileSystem, f.Name())
            // somehow retrieve info from the spawned goroutine
        }
    }

    return numOfFiles, nil
}

First we define a helper function that, given the filesystem and the path to "scan", finds the files in it AND calls itself (in a new goroutine) for each subfolder:

func countFiles(fileSystem fs.FS, path string) {
    dirs, err := fs.ReadDir(fileSystem, path)
    if err != nil {
        fmt.Printf("Error while reading %s\n", path)
    } else {
        for _, f := range dirs {
            if !f.IsDir() {
                fmt.Println("Found file " + f.Name())
                // do something to count
            } else {
                fmt.Println("Found a folder: " + f.Name())
                // spawn a dedicated goroutine
                go countFiles(fileSystem, path+"/"+f.Name())
            }
        }
    }
}

This function differentiates between files and subfolders and spawns a goroutine for each subfolder, but we still need a way to "return" the number of files to the caller.

The right construct (and I hope I'm right) to address this issue is a channel, which we can use to "return" data from the goroutines.
We also need a way to tell the script when to end; for this I'm going to use a sync.WaitGroup, so the main function knows when every goroutine has finished.

func FileCounterAsync(fileSystem fs.FS) (int, error) {
    // previous stuff

    var numOfFiles int
    var wg sync.WaitGroup
    c := make(chan int)

    for _, f := range dir {
        if !f.IsDir() {
            numOfFiles++
        } else {
            wg.Add(1)
            go countFiles(fileSystem, &wg, f.Name(), c)
        }
    }

    go func() {
        wg.Wait()
        close(c)
    }()
    for v := range c {
        numOfFiles += v
    }

    return numOfFiles, nil
}

func countFiles(fileSystem fs.FS, wg *sync.WaitGroup, path string, c chan int) {
    defer func() {
        fmt.Println("closing group for " + path)
        wg.Done()
    }()

    dirs, err := fs.ReadDir(fileSystem, path)
    if err != nil {
        fmt.Printf("Error while reading %s\n", path)
    } else {
        for _, f := range dirs {
            if !f.IsDir() {
                c <- 1
            } else {
                wg.Add(1)
                go countFiles(fileSystem, wg, path+"/"+f.Name(), c)
            }
        }
    }
}

Now we are "returning" the number of files counted by the goroutines with the channel and we are telling the main function "how long to wait".
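
Since the snippets above are split across several steps, here is a self-contained sketch that puts the pieces together. It is a variation, not the exact code of the script: each goroutine sends one subtotal per folder instead of a 1 per file, the root folder is read in a goroutine too, and unreadable folders are only logged. The names countAsync, countDir and sampleFS are made up for this example.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"sync"
	"testing/fstest"
)

// sampleFS is a reduced copy of the fake filesystem used in the test
// (hypothetical helper, made up for this example).
func sampleFS() fs.FS {
	return fstest.MapFS{
		"root-folder/file-1.md":                        {Data: []byte("root")},
		"root-folder/sub-folder-2/file-1.md":           {Data: []byte("folder 2")},
		"root-folder/sub-folder-2/file-2.md":           {Data: []byte("folder 2 again")},
		"root-folder/sub-folder-3/sub-sub-1/file-1.md": {Data: []byte("file")},
	}
}

// countDir counts the files directly inside dir, sends that subtotal on c
// and spawns one goroutine per subfolder.
func countDir(fsys fs.FS, wg *sync.WaitGroup, dir string, c chan<- int) {
	defer wg.Done()

	entries, err := fs.ReadDir(fsys, dir)
	if err != nil {
		fmt.Printf("Error while reading %s: %v\n", dir, err)
		return
	}

	subtotal := 0
	for _, e := range entries {
		if !e.IsDir() {
			subtotal++
			continue
		}
		// Add runs before this goroutine calls Done, so the WaitGroup
		// counter cannot drop to zero while work is still pending.
		wg.Add(1)
		go countDir(fsys, wg, path.Join(dir, e.Name()), c)
	}
	c <- subtotal
}

// countAsync starts the traversal at the root and sums the subtotals.
func countAsync(fsys fs.FS) int {
	var wg sync.WaitGroup
	c := make(chan int)

	wg.Add(1)
	go countDir(fsys, &wg, ".", c)

	// Close the channel once every goroutine is done, so the range below ends.
	go func() {
		wg.Wait()
		close(c)
	}()

	total := 0
	for v := range c {
		total += v
	}
	return total
}

func main() {
	fmt.Println(countAsync(sampleFS())) // prints 4
}
```

Note that the WaitGroup is passed as a pointer: passing it by value would give every goroutine its own copy, and the Wait in the main function would never see their Done calls.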

Benchmarks

I wanted to see how much faster the async version is compared to the sync one, so I've run some basic benchmarks.

Using the benchmarking feature of go test, these are the results:

cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
BenchmarkFileCounterSync-8        159486              6456 ns/op
BenchmarkFileCounterAsync-8        69724             16638 ns/op

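As we can see, on this tiny in-memory filesystem the async version is actually slower (about 16.6 µs per operation against 6.5 µs for the sync one), most likely because the cost of spawning goroutines and passing values through a channel outweighs the work of reading a handful of fake files.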
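
The benchmark functions themselves are not shown in this post. A minimal sketch of what one could look like, assuming a WalkDir-based counter; countWalk, benchFS, benchmarkCountWalk and runBenchmark are names made up for this example. In a _test.go file the function would be named BenchmarkXxx and run with go test -bench=. while here testing.Benchmark runs it from a plain main, so the snippet works on its own:

```go
package main

import (
	"fmt"
	"io/fs"
	"testing"
	"testing/fstest"
)

// benchFS is a tiny in-memory filesystem used as benchmark input
// (hypothetical data, made up for this example).
var benchFS = fstest.MapFS{
	"root-folder/file-1.md":              {Data: []byte("root")},
	"root-folder/sub-folder-2/file-1.md": {Data: []byte("folder 2")},
}

// countWalk counts the files of fsys using fs.WalkDir.
func countWalk(fsys fs.FS) (int, error) {
	n := 0
	err := fs.WalkDir(fsys, ".", func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() {
			n++
		}
		return nil
	})
	if err != nil {
		return 0, err
	}
	return n, nil
}

// benchmarkCountWalk has the same shape as a BenchmarkXxx function:
// the code under test runs b.N times.
func benchmarkCountWalk(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if _, err := countWalk(benchFS); err != nil {
			b.Fatal(err)
		}
	}
}

// runBenchmark runs the benchmark outside of "go test".
func runBenchmark() testing.BenchmarkResult {
	return testing.Benchmark(benchmarkCountWalk)
}

func main() {
	res := runBenchmark()
	fmt.Println(res.N, "iterations,", res.NsPerOp(), "ns/op")
}
```

The numbers depend on the machine, so they are only meaningful when compared with another benchmark run on the same hardware.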

I've also tested the implementations against a folder on my PC, where the async version comes out ahead; this time I also recorded the sync version that uses the built-in WalkDir API:

Using built-in WalkDir
Count: 34381
Elapsed time: 347.5736ms

Sync version
Count: 34381
Elapsed time: 327.1953ms

Async version
Count: 34381
Elapsed time: 117.6043ms

This is the main function that accepts the path as a CLI parameter and runs some dumb benchmarks:

package main

import (
    filecounter "filecounter/filecounter"
    "fmt"
    "os"
    "time"
)

func main() {
    if len(os.Args) < 2 {
        panic("Path must be specified as command line argument")
    }
    path := os.Args[1]
    fs := os.DirFS(path)

    start1 := time.Now()
    res1, _ := filecounter.FileCounterEasy(fs)
    elapsed1 := time.Since(start1)

    fmt.Printf("\nEasy\nCount: %d \n", res1)
    fmt.Printf("Elapsed time: %s \n", elapsed1)

    start2 := time.Now()
    res2, _ := filecounter.FileCounterSync(fs)
    elapsed2 := time.Since(start2)

    fmt.Printf("\nSync\nCount: %d \n", res2)
    fmt.Printf("Elapsed time: %s \n", elapsed2)

    start3 := time.Now()
    res3, _ := filecounter.FileCounterAsync(fs)
    elapsed3 := time.Since(start3)

    fmt.Printf("\nAsync\nCount: %d \n", res3)
    fmt.Printf("Elapsed time: %s \n", elapsed3)
}

Conclusions

The solution is probably very dumb, but it actually taught me something, especially the asynchronous version, because I made a lot of small mistakes while writing it.
If you have any opinions or tips, let me know!
