Monday, May 31, 2010

Grokking Functional Data Structures

Purely Functional Data StructuresWith so many languages espousing functional programming paradigms, it's no wonder that functional implementations of common data structures are also becoming more and more popular. As a programmer it's important that we learn to adopt them into our repertoire of tools and techniques.

Imperative data structures operate through in-place mutation of data. In our everyday usage of data structures in Java we mutate lists, arrays and any other type to realize the algorithm that we implement. Most of the books on data structures and algorithms explain all the complexity analyses of them using imperative data structures as case studies. When we create an object in Java (an instance of an abstract data type) and invoke some methods on it (some of which can potentially mutate its state) we get some result as an observable outcome of the sequence of operations. This sequence of operations on the value of the abstract data type constitutes a logical future of that type.

With functional data structures, immutability is the buzz of the day, though it's not mandatory that all of them have to be immutable. Many functional languages allow mutable data structures, which, more often than not have also proved to be more pragmatic than the pure school of thought. I wrote a blog post on this same topic some time back.

The other point of view that purists say is that the main point is not that whether you need a mutable array or a hash table. The more important part is that you need to access or update a data structure within a certain bound of time and space complexity. Okasaki's book Purely Functional Data Structures contains a rich collection of functional implementations of many of the popular imperative data structures. And quite contrary to what we tend to believe, given proper language support, these implementations are often as efficient as their imperative counterparts.

When we talk about functional programming, we talk about immutability of data structures and purity of functions. In the absence of mutable state, these are the two cornerstones that make reasoning of concurrent programs much easier in the functional paradigm. Since data structures are immutable, there's no in-place update of values, which means that every operation that changes the value of a data structure will create a new instance of it. So at the end of n operations on an instance of an ADT we will have all the previous versions of the data structure available for access. This is called persistence of data structures, unlike the imperative ones that we discussed earlier, where one update destroys its previous state (we call them ephemeral data structures). Unlike ephemeral data structures which has a single logical future (as we saw earlier), persistent data structures can potentially have multiple logical futures.

Have a look at the following figure, where the list l can have 2 logical futures based on the two possible mutating operations that we make on it. None of them mutate the original list and at the end we still have l as '(1 2 3). Only the two logical futures point to two different instances of the ADT.

Since Clojure sequences are persistent, we have all instances of the ADT accessible even after the entire sequence of operations. Had persistence meant brutal copying of instances, obviously we would not have the same performance guarantees as the imperative ones. Persistent data structures are implemented through careful sharing of structures across instances that make them affordable in real life applications. Have a look at the following tree structure that employs copying of paths from the root down to the newly inserted leaf while sharing the rest with the original ADT.

As with the earlier scenario, both the versions of the tree are available even after adding the new node 8 to the tree - xs points to the original tree, while ys points to the new one.

Despite the fact that functional data structures are persistent, it's possible that their implementations make use of mutable storage. It's mostly done for efficiency of implementation, as we have seen with transients in Clojure. However, for all practical purposes, the data structure is published to its clients as an immutable one.

It's also possible that we can implement a persistent data structure using an imperative language like Java. In fact, FunctionalJava does exactly the same and offers a rich suite of persistent data structures developed in Java. But of course there's a trade-off. First, the usage looks intimidating with Java being quite verbose and without much of type-inferencing capabilities. And, as Okasaki points out in his book, you need to have an implementation of call-by-need in order to have improved amortized complexity of functional data structures. That's the subject of another post, some other day.

What would be a life like without mutable arrays and hash-tables, the bread and butter data structures on which imperative programming thrives upon?

Well, in functional languages lists, trees and tuples take the place of arrays and hash-tables. The basic unit of a functional data structure is to find a general representation of a sequence. And then apply recursive transformations on it to implement specific operations. In this post I give you a very brief look at a couple of such building blocks that are used extensively to implement persistent data structures in many languages.

Consider the use case for a data structure where you would like to reach a specific item (often called a focus point) within the structure and do manipulations around it. With mutable structures you can reach a specific node within a list, an array or a tree and make changes to it. Or the very sole reason for which doubly linked lists exist is to allow you to reach a node within the list and move in either direction. Or you may want to have a new node and splice it within an existing doubly linked list using a minimum number of pointer manipulations.

In the functional world we have zippers which can be adapted to this very use case. Zipper, invented by Gerard Huet, is a generic data structure which allows you to identify a focal point within an ADT and make all manipulations around it as cheap as in-place updates. And it does all this in a completely immutable and persistent way. Every focal point within the tree is identified by a pair of Tree and Context, each of which is again a recursive data structure. The root of the Tree is the focal point around which we would like to do mutable operations. The functional pearl paper by Huet has all the details of this functional data structure. Clojure has a nice implementation of zipper as part of its core distribution. Zipper is a general purpose data structure and has lots of exciting use cases. I hope to cover some of them in my upcoming posts.

Another interesting data structure used functionally is the finger tree, invented by Hinze and Paterson. A finger tree is a 2-3 tree representation of a sequence that gives you cheap access to 'fingers', where you would like to do your manipulations on. And then you can define transformation functions using reductions and monoids that implement specific operations. For example, you can annotate every node of a finger tree using a characteristic function (modeled as a monoid), which you can use to look up specific elements having that property. If you want to implement fast positional operations (like accessing the nth element of the sequence), annotate the finger tree with the 'size' function, so that every node has the size of the subtree stored as its measure. Given size annotations we can now find the nth element in O(log n) time. Similarly annotating the tree with a 'priority' function turns a finger tree into a priority queue. Similarly you can implement deques, interval trees and a host of other data structure variants using the single underlying representation.

This post is just an introduction to functional data structures and I hope to cover some of the specific ones in future. The key difference in implementation with mutable data structures is that the functional ones are based on recursive transformations, easier to model using the power of pure functions. They are persistent and as we discussed above can have multiple logical futures. This makes their runtime analyses a big challenge. In order to obtain amortized complexity that match their mutable counterparts you need to have support for laziness and call-by-need from the underlying implementation language. Computations once forced need to be memoized so that the next logical future can use it without any additional expense on its realized cost of execution.

Monday, May 10, 2010

Laziness in Clojure - Some Thoughts

Some readers commented on my earlier post on Thrush implementation in Clojure that the functional way of doing stuff may seem to create additional overhead through unnecessary iterations and intermediate structures that could have been avoided using traditional imperative loops. Consider the following thrush for the collection of accounts from the earlier post ..

(defn savings-balance
  [& accounts]
  (->> accounts
       (filter #(= (:type %) ::savings))
       (map :balance)
       (apply +)))

Is it one iteration of the collection for the filter and another for the map ?

It's not. Clojure offers lazy sequences and the functions that operate on them all return lazy sequences only. So in the above snippet Clojure actually produces a composition of the filter and the map that act on the collection accounts in a lazy fashion. Of course with apply, everything is computed since we need to realize the full list to compute the sum. Let's look at the following example without the sum to see how Clojure sequences differ from a language with eager evaluation semantics ..

user> (def lazy-balance
      (->> accounts
           (filter #(= (:type %) ::savings))
           (map #(do (println "getting balance") (:balance %)))))

lazy-balance has not been evaluated - we don't yet have the printlns. Only when we force the evaluation we have it computed ..

user> lazy-balance
(getting balance
getting balance
200 300)

Had Clojure been a strict language it would have been stupid to follow the above strategy for a large list of accounts. We would have been doing multiple iterations over the list generating lots of intermediate structures to arrive at the final result. An imperative loop would have rested the case much more cheaply.

Laziness improves compositionality. With laziness, Clojure sequences and the higher order functions on them essentially reify loops so that you can transform them all at once. As Cale Gibbard defends laziness in Haskell with his comments on this LtU thread .. "It's laziness that allows you to think of data structures like control structures."

Clojure is not as lazy as Haskell. And hence the benefits are also not as pervasive. Haskell being lazy by default, the compiler can afford to make aggressive optimizations like reordering of operations and transformations that Clojure can't. With Haskell's purity that guarantees absence of side-effects, deforestation optimizations like stream fusion generates tight loops and minimizes heap allocations. But I hear that Clojure 1.2 will have some more compiler level optimizations centered around laziness of its sequences.

Laziness makes you think differently. I had written an earlier post on this context with Haskell as the reference language. I have been doing some Clojure programming of late. Many of my observations with Haskell apply to Clojure as well. You need to keep in mind the idioms and best practices that laziness demands. And at many times they may not seem obvious to you. In fact with Clojure you need to know the implementation of the abstraction in order to ensure that you get the benefits of lazy sequences.

You need to know that destructuring's & uses nthnext function which uses next that needs to know the future to determine the present. In short, next doesn't fit in the lazy paradigm.

The other day I was working on a generic walker that traverses some recursive data structures for some crap processing. I used walk from clojure.walk, but later realized that for seqs it does a doall that realizes the sequence - another lazy gotcha that caught me unawares. But I actually needed to peek into the implementation to get to know it.

Being interoperable with Java is one of the benefits of Clojure. However you need to be aware of the pitfalls of using Java's data structures with the lazy paradigm of Clojure. Consider the following example where I put all accounts in a java.util.concurrent.LinkedBlockingQueue.

(import '(java.util.concurrent LinkedBlockingQueue))
(def myq (new LinkedBlockingQueue))
(doto myq (.put acc-1) (.put acc-2) (.put acc-3))

Now consider the following snippet that does some stuff on the queue ..

(let [savings-accounts (filter #(= (:type %) ::savings) myq)]
     (.clear myq)
     (.addAll myq savings-accounts))

Should work .. right ? Doesn't ! filter is lazy and hence savings-accounts is empty within the let-block. Then we clear myq and when we do an addAll, it fails since savings-accounts is still empty. The solution is to use a doall, that blows away the laziness and realizes the filtered sequence ..

(let [savings-accounts (doall (filter #(= (:type %) ::savings) myq))]
     (.clear myq)
     (.addAll myq savings-accounts))

Of course laziness in Clojure sequences is something that adds power to your abstractions. However you need to be careful on two grounds :
  • Clojure as a language is not lazy by default in totality (unlike Haskell) and hence laziness may get mixed up with strict evaluation leading to surprising and unoptimized consequences.
  • Clojure interoperates with Java, which has mutable data structures and strict evaluation. Like the situation I described above with LinkedBlockingQueue, sometimes it's always safe to bite the bullet and do things the Java way.