Ruminations of a Programmer: October 2008

Monday, October 27, 2008

Data 2.0 - Is your Application ready ?

Brian Aker talking about assumptions on Drizzle and future of database technologies ..

"Map/Reduce will kill every traditional data warehousing vendor in the market. Those who adapt to it as a design/deployment pattern will survive, the rest won't. Database systems that have no concept of being multiple node are pretty much dead. If there is no scale out story, then there is not future going forward."

Later on in the same post he goes on to mention the forces that he hopes would drive the performance of tomorrow's database technologies - map/reduce and asynchronous queues. Drizzle is a fork out of MySql, it is revolutionary in many respects compared to all other forks of the same product or similar products of the same genre. Drizzle does away with the erstwhile dodos of database technology viz. stored procedures, triggers etc. and is being planned exclusively for scaling out in the cloud. The development model is also ultra open source, aka organic open source and is being driven by the community as a whole.

Drizzle is clearly a step towards N > 1 and quite a forceful step too. Erlang was the silent roadmaker towards true N > 1 with the entire ecosystem of OTP, mnesia and the supervisor hierarchies .. Erlang offers the platform and the correct set of primitives for delivering fault tolerance in applications along with replicated mnesia storage. Erlang was not built for N = 1 .. it had N > 1 right into it and right from day 0 ..

Couchdb is another instance that may hit the sweet spot of getting the combination right - asynchronous map/reduce along with replicated, distributed, loosely coupled document storage. Implements REST, BASE, MVCC .. what else .. that leads to eventual consistency of the system.

All the buzz about future database technologies have been somehow related to the cloud, or at least being planned keeping the scale out factor in mind. Erlang started it all, and is now being actively driven in multiple dimensions by all similar complementary/competing technologies and platforms. No one talks about normalization or ACID or multi-phase commit these days. Call it Web 2.0 or Enterprise 2.0 or social networking, everything boils down to how enterprise IT can make data access easier, better, and more resilient to datacenter or network failures without compromising on the quality of service.

Middleware matters, data storage in the cloud matters, data processing in the cloud matters, and we are looking at lots of vendors fighting for a space there in. Applications of Enterprise 2.0 need to be smart and malleable enough to participate in this ecosystem. Make sure they are implemented as loosely coupled components that talk to middleware services asynchronously, can consume the atom feeds that other enterprise applications generate and do not depend on ACID properties or rely on synchronous 2-phase commit protocols from the data layer. Are we there yet ?

Monday, October 20, 2008

How Scala's type system works for your Domain Model

When you have your type system working for you and model many of the domain constraints without a single line of runtime checking code, you can save much on the continuous tax payable down the line. I have been working on a domain model in Scala, and enjoying the expressiveness that the type system brings in to the implementation. There is so much that you can do with minimal ceremony, just using the powers of abstract types, abstract vals and self type annotations. In this post, I will share some of my experiences in implementing explicit domain constraints and invariants using Scala's type system and review some of the benefits that we can derive out of it, with respect to code readability, maintenability and succinct expression of the essence of the model.

I am talking about modeling security trades from a brokerage solution stack .. (horribly simplified) ..

object TradeComponent {
  import ReferenceDataComponent._
  /**
   * A trade needs to have a Trading Account
   */
  trait Trade {
    type T <: Trading

    val account: T
    def valueOf: Unit
  }

  /**
   * An equity trade needs to have a Stock as the instrument
   */
  trait EquityTrade extends Trade {
    type S <: Stock

    val equity: S
    def valueOf {
      //.. calculate value
    }
  }

  /**
   * A fixed income trade needs to have a FixedIncome type of instrument
   */
  trait FixedIncomeTrade extends Trade {
    type FI <: FixedIncome

    val fi: FI
    def valueOf {
      //.. calculate value
    }
  }
  //..
  //..
}

All of the above traits can be mixed in with actual service components and provide explicit abstraction over the constrained types that they declare as members. Note we have not yet committed to any concrete type in any of the above abstractions so far, but provided declarative constraints, just enough to model their domain invariants, e.g. a fixed income trade can only be instantiated with an instrument whose type is bounded on the upper side by FixedIncome. Once declared, the compiler will ensure this forever ..

Now we define a helper component that is only applicable for FixedIncome trades .. Note the appropriate bound on the type, explicitly declared to enforce the business rule ..

object TradeComponent {
  //.. as above
  //..

  /**
   * Accrued Interest is computed only for fixed income trades
   */
  trait AccruedInterestCalculatorComponent {
    type T <: FixedIncomeTrade

    val acc: AccruedInterestCalculator
    trait AccruedInterestCalculator {
      def calculate(trade: T)
    }
  }

  /**
   * Implementation of AccruedInterestCalculatorComponent. Does not
   * yet commit on the actual type of the instrument.
   */
  trait AccruedInterestCalculatorComponentImpl
    extends AccruedInterestCalculatorComponent {
    class AccruedInterestCalculatorImpl extends AccruedInterestCalculator {
      override def calculate(trade: T) = {
        //.. logic
      }
    }
  }
  //..
  //..
}

Let us now define a generic service component that provides trading service for all types of trades and instruments. Ignore the details of what the service offers, details have been elided to focus on how we can build expressive domain models using the abstraction capabilities offered by Scala's type system. The generic service still does not commit to any implementation. It uses the services of another component TaxRuleComponent using self type annotations.

object TradeComponent {

  //.. as above
  //..

  /**
   * Generic trading service
   */
  trait TradingServiceComponent {
    type T <: Trade

    val trd: TradingService
    trait TradingService {
      def valueTrade(t: T)
    }
  }

  /**
   * Implementation of generic trading service
   */
  trait TradingServiceComponentImpl extends TradingServiceComponent {
    this: TaxRuleComponent =>
    class TradingServiceImpl extends TradingService {
      override def valueTrade(t: T) = {
        val l = tax.getTaxRules(t.account)
        //.. logic
        t.valueOf
      }
    }
  }
  //..
  //..
}

When you define your type contraints correctly, Scala provides great support for wiring your components together through explicit type and value definitions in the final assembly. The following component assembly uses the Cake pattern that Jonas Boner recommended for implementing dependency injection. We are going to implement a concrete component assembly for fixed income trades. We need to define the concrete type once in the object and all type dependencies will be resolved through the Scala compiler magic. And we need to provide concrete implementations for all of the abstract vals that we have declared above in defining the generic components ..

object TradeComponent {

  //.. as above
  //..

  /**
   * The actual component that will be published and commits on the concrete types and vals
   */
  object FixedIncomeTradeComponentRegistry extends TradingServiceComponentImpl
    with AccruedInterestCalculatorComponentImpl
    with TaxRuleComponentImpl {

      type T = FixedIncomeTrade
      val tax = new TaxRuleServiceImpl
      val trd = new TradingServiceImpl
      val acc = new AccruedInterestCalculatorImpl
  }
  //..
  //..
}

Now that's a lot of domain constraints implemented only through the power of the type system. And these business rules are explicit within the code base and not buried within spaghetti of procedural routines, making your code maintenable and readable. Imagine how many lines of runtime checks we would have to write to implement the same in a dynamically typed language. And, btw, when you express your constraints through the type system, you don't need to write a single line of unit test for their verification - the compiler does that for you. Next time when you wonder how concise or terse your favorite dynamically typed language is, don't forget to figure that count in.

Monday, October 13, 2008

Map Composition in Scala (or The Virtues of Laziness)

I have been dabbling around with some lazy optimization techniques in designing functional data structures. I am using Scala to implement some of the functional data structures of Okasaki .. hence thought it would be appropriate to think aloud of some of the lazy optimization that Scala offers.

Scala is a big language with myriads of features and some unfortunate syntactic warts that will definitely rub you the wrong way till you get comfortable with them. Some time back, a popular joke on twitter was related to the number of meanings and interpretations that the character "_" has in Scala. Indeed there are many, and not all of them are intuitive enough .. :-( ..

Now, on to real some Scala snippets ..

Do you know all the differences in interpretation of the following variants of function definition and application ?

// call by value
def foo(x:Bar) = {
  val a = x
  val b = x
}

// call by name
def foo(x: => Bar) = {
  val a = x
  val b = x
}

// no argument function implemented as thunks
def foo(x:() => Bar) = {
  val a = x
  val b = x
  val c = x()
}

// call by name to call by need
def foo(x: => Bar) = {
  lazy val y = x
  //...
}

This post is not about discussing the details of the above 4 cases .. Scala wiki has an excellent FAQ entry that describes each of them with sufficient rigor and detail .. go get it ..

Lazy evaluation is one of the techniques of performance optimization when dealing with large functional data structures. In Scala, Seq[A].Projection is an abstraction that makes operations lazy.

val l = List(...) // a big list
l.map(_ * 4).map(_ + 2).filter(_ % 6 == 0)

In the above snippet, every operation on the list creates a separate list, which gets chained on to the next operation in line. Hence if you are dealing with large collections, it always sucks in performance. Go lazy and you have savings both in memory requirements and in elapsed time ..

l.projection.map(_ * 4).map(_ + 2).filter(_ % 6 == 0)

This will return a Stream, which will do a lazy evaluation on demand. I found the term Stream first in SICP, where Abelson and Sussman introduce it as a data structure for delayed evaluation, which enables us to represent very large (even infinite) sequences.

In Scala, a Stream is a lazy list, and follows the semantics of SICP streams, where elments are evaluated only when they are needed.

Making your custom collection lazy ..

I was recently working with a custom recursive tree-like data structure in Scala, which, for simplicity, let us assume is a binary tree. And since I was fetching records from a database and then loading up data structures in memory, I was working with a really big sized collection. Let us see how we can implement Projection on my custom data structure and make things lazy on my own. Scala, unlike Haskell, is not an inherently lazy language, and abstractions like Projection, help implement laziness in evaluation. Eric Kidd wrote a great post on Haskell rewrite rules to implement declarative fusion of maps using compiler directives. This post has some inspiration from it, through the looking glass of Scala, an admittedly more verbose language than Haskell.

trait Tree[+A] {
  def map[B](f: A => B): Tree[B]
}
/**
 * Non empty tree node, with a left subtree and a right subtree
 */
case class Node[+A](data: A, left: Tree[A], right: Tree[A]) extends Tree[A] {
  override def map[B](f: A => B): Tree[B] = Node(f(data), left.map(f), right.map(f))
}
/**
 * Leaf node
 */
case class Leaf[+A](data: A) extends Tree[A] {
  override def map[B](f: A => B): Tree[B] = Leaf(f(data))
}
/**
 * Empty tree object
 */
case object E extends Tree[Nothing] {
  override def map[B](f: Nothing => B): Tree[B] = throw new Exception
}

We have a map operation defined for the tree, that uses a recursive implementation to map over all tree nodes. The map operation is a strict/eager one, much like List.map ..

val t = Node(7, Node(8, Leaf(9), Leaf(10)), Node(11, Leaf(12), Leaf(13)))
t.map(_ * 2).map(_ + 1)

will result in a new tree that will have both the map operations done in succession. And in the process will generate intermediate tree structures, one for each of the map operations in chain. Needless to say, for a large collection, both space and time will hit you.

Getting rid of the intermediate trees ..

Implement laziness .. make evaluations lazy, so that effectively we have one final tree that evaluates it's nodes only when asked for. In other words, lift the operation from the collection to an iterator, which gets evaluated only when asked for.

Here is a sample bare bone unoptimized iterator implemented via scala.collection.mutable.Stack ..

class TreeIterator[A](it: Tree[A]) extends Iterator[A] {
  import scala.collection.mutable.Stack
  val st = new Stack[Tree[A]]
  st push it

  override def hasNext = st.isEmpty == false
  override def next: A = st.pop match {
    case Node(d, l, r) =>
      st push r
      st push l
      d
    case Leaf(d) =>
      d
  }
}

Using this iterator, we define the Projection for Tree with the lazy map implementation and integrate it with the main data structure through a projection method ..

Here is the modified Tree[+A] ..

trait Tree[+A] {
  def elements: Iterator[A] = new TreeIterator(this)
  def map[B](f: A => B): Tree[B]
  def projection : Tree.Projection[A] = new Tree.Projection[A] {
    override def elements = Tree.this.elements
  }
}

and the Projection trait in an accompanying singleton for Tree ..

object Tree {
  trait Projection[+A] extends Tree[A] {
    override def map[B](f: A => B): Projection[B] = new Projection[B] {
      override def elements = Projection.this.elements.map(f)
    }
  }
}

Now I can use my data structure to implement lazy evaluations and fusion of operations ..

val t = Node(7, Node(8, Leaf(9), Leaf(10)), Node(11, Leaf(12), Leaf(13)))
t.projection.map(_ * 2).map(_ + 1)

Eric Kidd reported on making Haskell maps 225% faster through fusion and rewrite rules. In Scala, implementing laziness through delayed evaluation (or Projection) can also lead to substantial reduction in memory usage and elapsed time.

Thursday, October 09, 2008

To Tail Recurse or Not

Today I am going to talk about maps, not the java.util.Map, but, map as in map-reduce or map as in scala.List.map. Of course all of us know what map is and map does, and how this powerful concept has been used in all functional languages that we use on a regular basis. I will talk maps in the context of its implementation, as we find in all the languages, which brings out some of the important principles of using tail recursion.

A couple of months back, there was a thread in the Erlang discussion list, where someone wondered why the implementation of map in Erlang stdlib Lists.erl is NOT tail recursive. Here it is, faithfully copied from Erlang stdlib ..

map(F, [H|T]) ->
    [F(H)|map(F, T)];
map(F, []) when is_function(F, 1) -> [].

Clearly not a tail recursive one ..

The thread of discussion explains the rationale behind such implementation. And it has a lot to do with the compiler optimizations that have been done in R12B. Here is a quote from the Erlang efficiency guide, which explains the myth that tail-recursive functions are ALWAYS much faster than body-recursive ones ..

"In R12B, there is a new optimization that will in many cases reduces the number of words used on the stack in body-recursive calls, so that a body-recursive list function and tail-recursive function that calls lists:reverse/1 at the end will use exactly the same amount of memory. lists:map/2, lists:filter/2, list comprehensions, and many other recursive functions now use the same amount of space as their tail-recursive equivalents."

Since a tail recursive map needs to do a reverse ..

the incremental space that it needs to keep both the lists makes it equally space consuming with the body-recursive version

it puts pressure on the garbage collector, since the space used by the temporary list cannot be reclaimed immediately

The general advice is that you need to measure the timings of your use case and then decide whether to tail recurse or not.

I was curious enough to check what the Scala library does for map implementation. Here is the snippet from scala.List ..

final override def map[B](f: A => B): List[B] = {
  val b = new ListBuffer[B]
  var these = this
  while (!these.isEmpty) {
    b += f(these.head)
    these = these.tail
  }
  b.toList
}

It is not a functional implementation at all. In fact it adopts a clever use of localized mutation to achieve performance. Using mutation locally is a perfectly valid technique and a definite area where hybrid non-pure languages score over the purer ones. The contract for map is purely functional, does not have any side-effect, yet it uses localized side-effects for performance. This would not have been possible in Erlang. Neat!

Just for fun, I cooked up a tail recursive version of map in Scala, as a standalone function ..

def tr_map[B](f: A => B, l: List[A]): List[B] = {
  def iter_map(curr: List[B], l: List[A]): List[B] = l match {
    case Nil => curr.reverse
    case _ => iter_map(f(l.head) :: curr, l.tail)
  }
  iter_map(Nil, l)
}

It performed very slow compared to the native librray version.

Scala also offers a reverseMap function which does not need the additional reverse, which the tail-recursive map would require. And not surprisingly, the implementation is based on tail recursion and pattern matching ..

def reverseMap[B](f: A => B): List[B] = {
  def loop(l: List[A], res: List[B]): List[B] = l match {
    case Nil => res
    case head :: tail => loop(tail, f(head) :: res)
  }
  loop(this, Nil)
}

So, how does the story end ? Well, as usual, benchmark hard, and then decide ..

Friday, October 03, 2008

Erlang VM : now hosting multiple languages

In an earlier post, I had wondered why the Erlang virtual machine does not host a more diversified set of languages ..

"BEAM provides an ecosystem that offers phenomenal scalability with concurrent processes and distributed programming, which is really difficult (if not impossible) to replicate in any other virtual machine being worked upon today. After all, it is much easier to dress up Erlang with a nicer syntax than implementing the guarantees of reliability and extreme concurrency in any of your favorite languages' runtime."

Then I had blogged about Reia, the Python/Ruby like scripting language on BEAM.

A few days back, Robert Virding released a stable version of LFE - Lisp Flavored Erlang, a concurrent Lisp based on the features and limitations of the Erlang VM. Unlike Lisp it doesn't have global data or mutating operations. Instead it has the goodness of Lisp macros, sexprs, code-as-data together with the Erlang power of pattern matching and binary comprehensions. And the best part is that LFE hosts seamlessly with vanilla Erlang/OTP.

Along with Erlang being used to develop middleware applications, we are seeing increased use of Erlang VM, hosting more and more language variants. This is a clear indication that the Erlang ecosystem is growing. As Ted Leung has rightly observed in his post on VMs for everybody, we are going to see not only flourishing new virtual machines, but also lots of languages atop existing virtual machines.

Real good time to be a hacker .. a pity though only a few lucky ones get paid for hacking ..

Functional List with fast Random Access

I needed a List with fast random access capabilities. Standard implementations of a List takes O(i) to access the ith element. I am using Scala and arrays do not really cut as a well-behaved functional data structure. Besides I needed dynamic resizing, persistence and good worst case complexity. Anyway I was trying to justify implementing Okasaki's Purely Functional Random Access Lists ..

The data structure provides O(log n) random access along with O(1) list primitives (head, tail and cons). The data structure uses a collection of complete binary trees as the representation of the list. Access is through preorder traversal, allowing O(1) head. And the number of trees is determined by a skew binary decomposition of integer n, for n being the size of the list. A skew binary decomposition is a collection of skew binary terms, where a skew binary term is of the form (2^k - 1). e.g. the skew binary decomposition of 43 is (1 + 1 + 3 + 7 + 31). If the list has 43 elements, we will represent it using a collection of 5 binary trees of sizes 1, 1, 3, 7 and 31.

For a random lookup, we first need to locate the tree, a log n operation and then search within the tree, another log n operation. Okasaki's paper has all the details of analysis. It is straightforward, yet elegant ..

Two implementations are given ..

the first one uses a list of tuples (List[(Int, Tree[T])]) to implement RAList[T]

object RAList {
  def empty[T]: RAList[T] = RAList.Nil

  private [immutable] case object Nil extends RAList[Nothing] {
    override def equals(that : Any) = this eq that.asInstanceOf[AnyRef]
  }

  private [immutable] case class Root[+T](trees: List[(Int, Tree[T])])
    extends RAList[T] {
  }
}

and

the second one uses a recursive data structure to implement the same. This is a better implementation than the first one and proved to be faster as well ..

object RandomAccessList {
  def empty[T]: RandomAccessList[T] = RandomAccessList.Nil

  private [immutable] case object Nil extends RandomAccessList[Nothing] {
    override def equals(that : Any) = this eq that.asInstanceOf[AnyRef]
  }

  private [immutable] case class Root[+T](size: Int, tree: Tree[T], rest: RandomAccessList[T])
    extends RandomAccessList[T] {
  }
}

I have setup a Google project, which I named pfds (Purely Functional Data Structures), containing both the implementations. The implementations come with some unit tests using specs. Everything is quite rudimentary at this stage, but I plan to enrich the project with more functional data structures .. Meanwhile you can browse through the source code, critique the implementation with suggestions to make them more Scalaesque ..