Programmer Musings: More Thoughts on Resource Recovery

This site will look much better in a browser that supports web standards, but is accessible to any browser or Internet device.

June 12, 2004

More Thoughts on Resource Recovery

In The Semantics of Garbage Collection, I explained why I don't like the term garbage collection. I also introduced the term resource recovery, and suggested that this change in naming could generate a useful change in viewpoint.

Many programmers have been indoctrinated with the belief that garbage collection is the solution to all memory problems, and that any language that does not use garbage collection is destined to be full of memory leaks. I disagree. But to really understand why I think otherwise will require a little history.

In the Beginning

Well, at least in the beginning of my programming career was using the Fortran programming language. In the first dialect of Fortran I ever used, all memory was global memory. Even local variables in a subroutine were allocated as global memory; they just weren't visible outside the subroutine. There was no allocation of memory. Therefore, there were no memory leaks. There was also no recursion, because calling a subroutine from itself would wipe out the current values of its "local" variables.

These conditions no longer apply to Fortran, but they were the state of Fortran when I started. (Or at least the state of Fortran as I understood it then.)

The Stack and the Heap

Later in my career, I began to work in more modern languages. The first of these was the C programming language. In C, there were two types of memory besides the static memory I knew from Fortran. There was the stack and the heap. (I had actually seen these before, but I'll ignore that detail for now.)

The stack had the wonderful property that it would grow as you needed it and automatically be reclaimed at the end of the function (or block). It also allowed multiple copies of a function to have its own local variables, giving the possibility of recursion. Most of the time, stack memory is easy to use and is recovered automatically. As long as you didn't try to put a huge amount of stuff on the stack or recurse too deeply, everything worked out fine.

The heap had another wonderful property. I could allocate as much of it as I wanted (within reason) and give it back when I was finished. This allowed me to work with different amount of data without guessing the biggest size I could deal with ahead of time. However, this feature came with a dark side: the memory leak.

Objects

Then I discovered Object Oriented Programming. One of the most wonderful features of objects for me was the whole concept of the object lifetime. I could have an object that allocated memory as part of its construction and that would free that memory at destruction time. No more memory leaks! I could make allocated memory act like the stack memory that I had always liked. Well, of course things did not really work like that. People can allocate objects as well and automate the whole process of leaking memory.

Perl and Memory Management

At about the same time I found C++ and OOP, I also discovered the Perl programming language. Now Perl did not have objects at the time, but it did do automatic handling of memory in many cases. Over time, Perl developed real local variables (called my variables) and objects. Perl supported a simple garbage collection system called reference counting. Unlike the more popular mark and sweep garbage collection, reference counting has two really important virtues: it's simple to understand and it preserves object lifetime.

In a reference counting scheme, every object keeps count of the number of references pointing to it. As a reference goes away, the count of the references is decremented. When the count reaches 0, the object is really destroyed. This happens at a well-defined time, when the last reference is removed. This method is simple to understand and works very well in general. In fact, to my knowledge, reference counting has only one flaw: circular references. If object A references object B and object B references object A, and no other references point to either of them, we have a memory leak. These two objects are not referenced anywhere, but they cannot be reclaimed.

Reference counting works particularly well when objects are only referenced once or twice. This covers the simple object creation case and many uses of temporary objects created by the compiler. These objects are created and destroyed exactly the way you would expect.

Mark and Sweep

The great flaw of reference counting is one of the reasons that the mark and sweep approach was developed. Since mark and sweep starts from known references and works through all memory to find any that is referenced, circular references are not a problem. Unfortunately, mark and sweep suffers from a few problems of its own.

The garbage collector can not be allowed to delete or move memory that is part of the stack.
The time taken to garbage collect is a function of the amount of currently allocated memory, not the amount of memory to reclaim.
The garbage collector does not usually respect object lifetimes.

The first point is why the Java programming language does not allow objects to be created locally on the stack. Otherwise, objects on the stack would need to be cleaned up differently than objects on the heap. With mark and sweep, this would require recognizing what parts of the objects are in which locations and treating them accordingly. The easiest solution was to define the problem away and only allow heap-allocated objects.

In order to perform the mark portion of the algorithm, we will need to touch each reference to every object to verify what is still in use. In order to perform the sweep portion of the algorithm, all referenced objects need to be moved so that the old memory can be reclaimed. More advanced versions of the system partition objects in different ways (new and old objects, etc.) to reduce this overhead to some extent.

For me, the worst effect of mark and sweep is the loss of object lifetime. An object's lifetime should run from the point where it is created to the last place it can be referenced or when it is explicitly killed. With the mark and sweep system, there is a period of time when the object is left in a kind of limbo. It cannot be referenced, but it hasn't been cleaned up. It's like a ghost of an object.

The Wrong Turn

Although I understand the purpose of garbage collection, it seems to me that we've taken a wrong turn. There was a time when we had stack-based data that was cleaned up automatically and was easy to deal with, and we had heap-based data that was more powerful, but could be misused. Now we've added garbage collection to reduce the chance of misusing the heap-based data, but to make the garbage collection simpler, we've thrown out stack-based objects and defined object lifetimes.

In the quest to stop memory leaks, we've made every object creation a memory leak and provided an automatic system to clean up the leaks we've made. Moreover, the loss of a defined time for the object to die has resulted in the loss of the destructor concept. We cannot be guaranteed that the object's cleanup/destructor/finalizer method will ever be called (even in normal operation).

This is why I have been thinking that maybe a new approach is needed. Next time, I plan to explore a possible new approach.

Posted by GWade at June 12, 2004 11:52 PM. Email comments