Software Refactoring: September 2014

State clearly the invariants associated with a given type.
Make sure that these invariants are preserved.
If necessary, split a type with conflicting uses, so the invariants of each can be described precisely.

Invariants are the properties you ascribe to an object that should be true throughout the lifetime of that object. Often, these are structural assumptions that can be used to traverse a data structure. For example, in a tree structure, every child might have a parent pointer that is expected to point to its parent. This is easily checked.

class TreeNode{ TreeNode* parent;
TreeNode* first_child; TreeNode* next; public: bool isOK() { // Check the structure at this node.
for (TreeNode* i = first_child; i; i = i->next) { if (i->parent != this) return false; if (! i->isOK()) return false; } return true; }};

Implicit in the above example is that the list of children at a given level are represented as a linked list -- linked through the next pointer. The last child in the list will have a next pointer that is NULL. Also implied is that if a node has no children, its first_child pointer will be NULL.

Other invariants are possible. Rather than letting the parent field of the root node contain random junk, one might impose the invariant that it be NULL. That way a traversal from a leaf up to the root will know when to stop.

bool TreeNode::isRoot() { return parent == NULL; }

The designer of a class should state the class invariants clearly and ensure that they are maintained. They must be established by (each of) the constructors, and left undisturbed or re-established by any operation upon an instance of the class.

In this example, the TreeNode type does double-duty, representing not only a node in the tree, but also the subtree rooted at that point (along with any siblings to the right). This makes it obvious that isRoot() can return true for all the siblings at the top level, so the way the tree is defined, there is not a unique definition for "the root". In the interior of the tree, it is possible to enumerate all of ones siblings by getting one's parent, then first_child and then traversing the list following next points. At the top level, this is not possible.

To fix the problem, one can establish a new invariant -- that the root node has no siblings. Then we would have an additional check in isOK():

bool TreeNode::isOK(){ if (isRoot()) if (next) return false; // Check the structure at this node, as before. ...}

It seems a bit wasteful to perform this extra check at every node in the tree, just to make sure that the root node is OK. Another approach would be to create a special type to represent the root node, inheriting from the more general TreeNode type. That way, the RootNode will have all of the properties of a TreeNode, and also the special property that its parent and next pointers are null.

class RootNode : public TreeNode{ public: bool isOK() { if (parent || next) return false; return TreeNode::isOK(); // Explicitly call parent class version. }}

So. What does all this have to do with refactoring?

It is quite often the case that the invariants associated with a type defined in legacy code are not clearly stated. This may be because the original author did not understand the issues involved or he simply didn't take the time to document them. In any case, as capabilities were added and the purpose of the type migrated over time, the original invariants were not preserved. That makes the class a fertile area for bug-creation.

Suppose, in our toy example, that one of the capabilities added later was to insert a new node after a given sibling. The code looks like this:

class TreeNode::insertAfter(TreeNode*& new){ assert(new->next == NULL); new->next = this->next; assert(new->parent == NULL); new->parent = this->parent; this->next = new; new = NULL; // Steal this pointer.
}

That all works fine, but if this happens to be the root node, we have unwittingly violated the invariant that the root node has no siblings. If some other part of the code depends on the root node being the solitary sibling at the top level, that code may break in mysterious ways. Had the invariant been clearly stated and religiously adhered to, the bug could never have crept in.

Often, the lack of clarity surrounding class invariants will lead to a type being used in an overloaded way. It expresses one set of invariants in one usage and a different set in another. Client code compensates for this by inserting predicates to sort out which set of invariants is being expressed in a given context. Not only does this make the client code very ugly, it is an open invitation for bugs.

A ready example can be found in a bug I recently isolated in the Chapel code. The problem was that (with optimizations turned on), the C compiler was complaining that there was an attempt to deallocate a statically-allocated object. Here is the failing procedure:

inline proc _cast(type t, x) where t == string && x.type != c_string { const cs = _cast(c_string, x); const ret = toString(cs); if !isBoolType(x.type) && !isEnumType(x.type) then chpl_free_c_string(cs); return ret;}

The fault occurred in the call to chpl_free_c_string(). The variable cs -- initialized with the return value from the nested call to _cast() -- is inferred to have type c_string. A case in which the c_string was statically allocated was reaching the call to chpl_free_c_strring().

The root cause goes back to the early implementation of the run-time support for character strings in Chapel. In it, generally, low-level character strings (c_strings) were sometimes dynamically allocated and sometimes not. Memory management was not a concern, so the dynamically-allocated strings were leaked, the others being automatically reclaimed (if stack allocated) or never reclaimed (if statically-allocated). One might have chosen

Much later, calls (to chpl_free_c_string()) were added to reclaim some of the leaked memory. At that point in time, cases where the return value from _cast were known to be statically allocated were filtered out. Unfortunately, there were other overloads of _cast that returned a statically-allocated c_string and that were not enumerated in the cases where the call to chpl_free_c_string() was to be avoided.

This code has exactly the symptoms I mentioned earlier. The lack of an explicit invariant (regarding whether a c_string is heap-allocated or not) forces the client code to use predicates (here ! isBoolType() and ! isEnumType) to compensate. Not only is this ugly, but it's incomplete. That was the cause of the bug. A weak specification of the interface led to weak code.

How can we remedy the situation? One way would be to store in the class whether an instance contains a heap-allocated c_string. This would make the interface expectations explicit. A statically-allocated c_string that had its heap-allocated bit set to "true" would clearly be lying. Then it would be a matter of tracing back to see where the bit was set, and ensuring that that piece of code satisfied the new, stricter interface. All downstream code could assume that the bit was correctly set.

Code that was concerned about how a c_string was allocated would still have to test that bit, but that detail need not necessarily be exposed in client code. As long as c_string objects are manipulated only through the c_string interface, manipulations involving memory allocation can examine the bit and "do the right thing". Let us assume, for example, that toString() will make a copy of its operand if it is statically-allocated (or stack-allocated), and will otherwise "steal" its operand and set the input pointer to zero. We further assume that chpl_free_c_string() will accept a null pointer as its input, and in that case do nothing. Then the client code can be simplified to:

inline proc _cast(type t, x) where t == string && x.type != c_string { const cs = _cast(c_string, x); const ret = toString(cs); chpl_free_c_string(cs); return ret;}

The error-prone predicates are eliminated from the client code.

Another alternative is to use a distinct types to represent heap-allocated and non-heap-allocated c_string objects. In languages that support function overloading, this approach can be as elegant as hiding the allocation state within the object itself. In addition, where static function binding is possible, it can provide a slight performance advantage. The decision of whether to use the code path for heap-allocated vs. non-heap-allocated c_strings is made at compile time; the extra time required to extract and test the allocation flag in a c_string object is avoided. The downside of this approach is that it expands the size of the executable.

Suppose that we use the type c_string_copy to represent a heap-allocated c_string. For this case, the corrected client code would be

inline proc _cast(type t, x) where t == string && x.type != c_string_copy { const cs = _cast(c_string_copy, x); const ret = toString(cs); return ret;}

In this case, we can only have the nested call to _cast return one type (since we want calls to be statically bound). We choose to promote the input variable to the c_string_copy type. The overloaded version of toString() always steals its operand so the call to chpl_free_c_string() is no longer necessary. That makes the cast routine very simple indeed.

You will encounter much legacy code in which the interfaces are not clearly specified. In the cases where the invariants associated with a type are not clear, it may suffice to simply state those assumptions clearly and then make all of the methods affecting that type obey the new, stricter interface.

If it turns out (as in this case), that there are two or more conflicting uses of the same time, the options are to:

Hide those choices within the type itself
Split the type into two or more static types
Split the type into two or more types using inheritance

External considerations will determine the best choice.

Software Refactoring

Monday, September 29, 2014

Establishing and Preserving Invariants