Don’t waste your work

More than once I’ve seen code that wastes previously calculated work. This can manifest itself in several ways.

Store your work

The most obvious of these manifestations is simply not storing the result of an operation for later use. This applies to storing the result in a database, in a temporary data store, a caching layer, a local variable, etc. The decision to cache something is based on the space required to store the result versus the time saved in recalculating the result. In today’s world of storage capacities growing faster than CPU and IO speeds, the trade off is moving more and more towards storing versus recalculating.

At the large scale, we have operations that can’t be performed cheaply in near real time. This would include things like video transcoding and data mining. In some cases the operation can be performed in real time, like video transcoding, but the cost of maintaining the necessary CPU power exceeds the cost of storage for any video that is accessed with any frequency.

At the medium scale are things like OS filesystem caches and  database caches. Filesystem and database caches are used to avoid accessing disks, which are much slower than memory. Two common mistakes I’ve seen countless times are: 1) database servers configured with too small of caches; and 2) web servers configured without enough free memory for effective file system caching. The problem is often compounded when both are running on the same machine.

Also at the medium scale are application layer caches. Application layer caches should be used to avoid fetching data repeatedly and to avoid performing the same operations repeatedly. A good example is a user’s data in a forum thread. Chances are high the user has made more than one post in a thread, so the user’s data should be parsed and rendered into an HTML string only once, and the string inserted where needed. Another example is caching whole pages or parts of pages that don’t need to be rerendered for every request, such as what is displayed to a guest user.

At the small level are local variables. Caching the results of a function call is a good idea. Here is another place I frequently see missed optimization. Here is a somewhat contrived example in PHP:


$total = 0;
for ($i = 0; $i < count($my_array); $i++)
{
    $total += $my_array[$i];
}

This seems pretty benign until you realize count() is being called for every item in the array. A better solution would be:


$count = count($my_array);
$total = 0;
for ($i = 0; $i < $count; $i++)
{
    $total += $my_array[$i];
}

Which avoids the repeated function call. (In this case, even better would be using a for-each loop that avoids the function call, the conditional evaluations, and the increment operations. Best would be to use the array_sum function, but that comes down to knowing your language.)

Use your work

Beyond simply recalling values from a cache, using your work includes choosing smart algorithms.

For instance, if I want to reverse the order of a list I sorted, it makes far more sense to simply reverse the list (a linear operation) than to sort the list again (with quicksort, the worst case and exponentially slower!)

Or say I were merging some sorted lists. The naïve approach is to simply append the arrays and sort everything again. The far better approach is to use a merge sort. Depending on the size of the data involved, even a poor performance merge sort can be vastly faster than a high performance quicksort.

Lastly, be aware of the underlying data access patterns of your chosen algorithms. If your algorithm accesses data randomly it will constantly be flushing caches. If you can find a way to access data in clusters or linearly, do so.

Hello world!

It’s about time I started blogging. I plan to write mainly about technological things, that is, more specifically, topics from system administration to building highly scalable web applications. I have no plans to write a personal blog; I’ve had this domain for nearly a decade and I’ve never had the urge.

If I get some detail wrong, please do point out my error. If you have an improvement on any ideas I post, please do suggest it.