Taking out the Trash: Garbage Collection in High-Level Languages

rizz0s

Summer Rizzo

Posted on June 11, 2020

Taking out the Trash: Garbage Collection in High-Level Languages

Let's talk about trash. And no, I don't mean the code you just wrote. Only some of it.

Jokes aside, even if you haven't worked with low-level languages like C or Rust before, you may have heard of their need for manual memory management. This leaves the responsibility of memory allocation, deallocation, and reallocation in the hands of the programmer. Different types have a fixed amount of memory - for instance, an int is generally 4 bytes on a 32-bit system. In times when dynamic memory allocation is needed (like when you may not be sure of the memory needed at compile time, or it will be exceeded/change at run time), an appropriate amount of memory must be allocated beforehand. This might look something like this (in C):

// n being some variable that is determined at run time
// malloc is a function to allocate a given amount of memory
int *array = malloc(n * sizeof(int));

...

// after the array is no longer used, deallocate
free(array);

A naive example, but you get the idea. There's more to it than this - as you might imagine, memory allocation may get more complex as a program grows in complexity. There are different approaches to memory allocation as well, but I won't get into that now. I'll leave some resources for further reading at the end if you're interested.

There are a few possible problems that can occur in a program with manual memory management in regards to memory deallocation. It is more susceptible to memory leaks - a situation that occurs when memory isn't freed appropriately, thus causing memory to be used unnecessarily. On the other end, memory might be freed too quickly, preventing access to it later on resulting in a dangling pointer. One might attempt to free already freed memory again.

If you're used to programming in high-level languages, like Java, JavaScript or Python, you know that this is not something you have to consider as directly because they use automatic memory management (as a side note, memory allocation issues can still happen despite this). One of the features that contributes to this is garbage collection - the automatic deallocation & reallocation of memory.

For a brief introduction, garbage collection is a technique developed by John McCarthy in 1959 to more easily work with memory management in Lisp. Some low-level languages with manual memory management can make use of it through libraries, though it is not included in the spec. Different languages have different implementations of GC, but they all essentially perform the task of deallocating memory of objects that are no longer used and reallocating it elsewhere.

There are several different strategies to garbage collection. The most common of which is called tracing. Tracing revolves around deallocating memory by tracing which objects are "reachable" by at least one variable in the program. Within tracing, there are a few implementation and algorithm variances, but they are generally based on the "Mark & Sweep" idea. Mark & Sweep operates on two phases - "Mark" and "Sweep." It utilizes a linked-list as its data structure.

Upon initialization, an object is set to "unmarked" - false. In the mark phase, using a depth-first traversal approach, each reachable object's "marked" value is set to true. In the sweep phase, all unreachable objects - those whose "marked" value is false - are cleared from the memory heap. So, memory of an object that is no longer used is not freed immediately, rather it and all other unused objects are cleared at once in a given sweep phase. The GC may be triggered at a time when memory is beginning to run low. One disadvantage to this approach is that processes are generally halted while the GC is running. There have been innovations to this idea - such as tri-color marking - that help solve some of this and other disadvantages.

Another common strategy is reference counting. The idea behind reference counting is evocative of its name - a count of references to an object is initialized, incremented and decremented appropriately, and once the count reaches 0 the object is freed. Unlike tracing, objects are kept track of continuously and thus deallocated as soon as they are no longer used. There are several disadvantages - one of the most notable being its inability to reclaim memory when two objects reference each other, which can lead to a memory leak. Another is the overhead it introduces - for each object, memory must be allocate to keep track of its references.

There are a few other garbage collection methods, and even more nuanced implementations to each one mentioned above. As I mentioned earlier, different languages and compilers have different approaches to garbage collection, even though they may use the same basic technique. In any case, thanks to garbage collection, those of us using high-level languages or GC libraries don't have to worry as much about missing trash day (and our code smells less because of it!).

As promised, here are some links for further reading:

With <3, happy coding!

💖 💪 🙅 🚩
rizz0s
Summer Rizzo

Posted on June 11, 2020

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related