Gaurav's Blog

Static to Dynamic Transformation - I

Instead of going into fractal trees directly, I am going to be posting a lot of assorted related material that I am going through. Most of it is joint work with Dhruv and Bhuwan.

This post is about the Static-to-Dynamic Transformation lecture notes by Jeff Erickson. The motivation behind this exercise is to learn how we can take a static data-structure (one which uses preprocessing to construct itself and then answers queries) and build a dynamic data-structure that can keep accepting inserts.

I won’t be formal here, so there will be a lot of chinks in the explanation. Either excuse me for that, or read the original notes.

A Decomposable Search Problem

A search problem $Q$ with input $\mathcal{X}$ over a data-set $\mathcal{D}$ is said to be decomposable if, for any pair of disjoint data sets $D$ and $D'$, the answer over $D \cup D'$ can be computed from the answers over $D$ and $D'$ in constant time. That is:

$Q(x, D \cup D') = Q(x, D) \diamond Q(x, D')$

Here $\diamond$ is an associative and commutative function which has the same range as $Q$, and which we should be able to compute in $O(1)$ time. Examples of such a function are $+$, $\times$, $\min$, $\max$, $\vee$, $\wedge$, etc. (but not $-$, $\div$, etc.).

An example of such a decomposable search problem is a simple existence query, where $Q(x, D)$ returns true if $x$ exists in $D$; the $\diamond$ function is then the binary OR. Another example: the dataset is a collection of coordinates, and the query asks for the number of points which lie in a given rectangle; the $\diamond$ function here is $+$.
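
To make this concrete, here is a tiny Go sketch of those two examples (the function and type names are mine, not from the notes); membership combines answers with $\diamond = \vee$, and rectangle counting with $\diamond = +$:

package main

import "fmt"

// Point is a 2-D coordinate for the rectangle-counting example.
type Point struct{ X, Y int }

// exists answers the membership query Q(x, D) for a single data set D.
func exists(x int, D []int) bool {
	for _, v := range D {
		if v == x {
			return true
		}
	}
	return false
}

// countInRect answers how many points of D lie in the rectangle [x1,x2] x [y1,y2].
func countInRect(D []Point, x1, y1, x2, y2 int) int {
	count := 0
	for _, p := range D {
		if p.X >= x1 && p.X <= x2 && p.Y >= y1 && p.Y <= y2 {
			count++
		}
	}
	return count
}

func main() {
	D1, D2 := []int{1, 3}, []int{5, 7}
	// Membership over D1 ∪ D2: combine the per-set answers with ⋄ = OR.
	fmt.Println(exists(5, D1) || exists(5, D2)) // true

	P1, P2 := []Point{{1, 1}, {4, 4}}, []Point{{2, 3}}
	// Rectangle counting over P1 ∪ P2: combine with ⋄ = +.
	fmt.Println(countInRect(P1, 0, 0, 3, 3) + countInRect(P2, 0, 0, 3, 3)) // 2
}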

Making it Dynamic (Insertions Only)

Suppose we have a static structure that can store $n$ elements using $S(n)$ space after $P(n)$ preprocessing, and can answer a query in $Q(n)$ time. What we mean by a static data-structure is that we make inserts into it exactly once, when it is built. But we can iterate through the elements of that data-structure (this is something the notes have missed, but it is a requirement to get the bounds below).

Then we can construct a dynamic data-structure whose space requirement is $O(S(n))$, with a query time of $O(\log n) \cdot Q(n)$, and an insert time of $O(\log n) \cdot \frac{P(n)}{n}$ amortized.

How do we do this?

Query: Our data-structure has $l = \lfloor \lg n \rfloor$ levels. Each level $i$ is either empty, or holds a static data-structure with $2^i$ elements. Since the search query is decomposable, the answer is simply $Q(x, D_{0}) \diamond Q(x, D_{1}) \diamond \ldots \diamond Q(x, D_{l})$. It is easy to see why the total time taken for the query would be $O(\log n) \cdot Q(n)$.

An interesting point is that if $Q(n) \geq n^\epsilon$ for some $\epsilon > 0$ (which essentially means that $Q(n)$ grows at least polynomially in $n$), then the total query time is just $O(Q(n))$. So, for example, if $Q(n) = n^2$ for the static data-structure, the query time of the dynamic data-structure would still be $O(Q(n))$. Here is the proof sketch for $Q(n) = n^\epsilon$ (the general case is similar); the total query time is: $\sum_{i}{Q(\frac{n}{2^i})} = \sum_{i}{(\frac{n}{2^i})^\epsilon} = n^\epsilon \sum_{i}{(\frac{1}{2^i})^\epsilon} \leq n^\epsilon \cdot c = O(n^\epsilon) = O(Q(n))$, since the geometric series $\sum_{i}{(\frac{1}{2^\epsilon})^i}$ converges to a constant $c$.

Insert: For insertion, we find the smallest empty level $k$, build $L_k$ out of all the preceding levels ($L_0, L_1, \ldots, L_{k-1}$) plus the new element, and discard the preceding levels. Since it costs $P(n)$ to build a level, and each element participates in a level-building process (i.e., jumps levels) $O(\log n)$ times, we pay the $P(n)$ cost $O(\log n)$ times. Spread over $n$ elements, that is $O(\log n) \cdot \frac{P(n)}{n}$ per element, amortized. Again, if $P(n) > n^{1+\epsilon}$ for some $\epsilon > 0$, the amortized insertion time per element is just $O(P(n)/n)$. The proof is similar to the one described above for the query time.
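
Here is a minimal Go sketch of this insert-only transformation, assuming a plain sorted slice as the static structure and membership as the decomposable query (all names here are my own; this is not the notes' pseudocode):

package main

import (
	"fmt"
	"sort"
)

// dynamic keeps one static structure (here, a sorted slice) per level;
// levels[i] is either nil or holds exactly 2^i elements.
type dynamic struct {
	levels [][]int
}

// query combines the answers from every non-empty level with ⋄ = OR.
func (d *dynamic) query(x int) bool {
	for _, lvl := range d.levels {
		if lvl == nil {
			continue
		}
		i := sort.SearchInts(lvl, x) // the static structure's own query
		if i < len(lvl) && lvl[i] == x {
			return true
		}
	}
	return false
}

// insert finds the smallest empty level k, rebuilds it from the new element
// plus levels 0..k-1, and empties those levels.
func (d *dynamic) insert(x int) {
	merged := []int{x}
	k := 0
	for ; k < len(d.levels) && d.levels[k] != nil; k++ {
		merged = append(merged, d.levels[k]...)
		d.levels[k] = nil
	}
	if k == len(d.levels) {
		d.levels = append(d.levels, nil)
	}
	sort.Ints(merged) // the "preprocessing" P(n) of the new static structure
	d.levels[k] = merged
}

func main() {
	d := &dynamic{}
	for _, v := range []int{42, 7, 19, 3, 25} {
		d.insert(v)
	}
	fmt.Println(d.query(19), d.query(8)) // true false
}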

Interesting Tidbit: Recollect what we mean when we say that a data-structure is static. A Bloom Filter is a static data-structure in a different way: you can keep inserting elements into it dynamically up to a certain threshold, but you can’t iterate over those elements.

The strategy to make it dynamic is very similar: we start with a reasonably sized bloom-filter and keep inserting into it as long as we can. Once it is too full, we allocate another bloom-filter of twice the size, and insert elements into that one from then on, and so on. Queries are done on all the bloom-filters, and the answer is the union of their individual results. An implementation is here.
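
A rough Go sketch of that doubling strategy might look like the following; the tiny two-hash bloom filter and the fullness threshold are purely illustrative placeholders, not the linked implementation:

package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a deliberately tiny Bloom filter: m bits, two FNV-derived hashes.
type bloom struct {
	bits []bool
	n    int // number of elements inserted so far
}

func newBloom(m int) *bloom { return &bloom{bits: make([]bool, m)} }

func (b *bloom) idx(s string, seed byte) int {
	h := fnv.New32a()
	h.Write([]byte{seed})
	h.Write([]byte(s))
	return int(h.Sum32() % uint32(len(b.bits)))
}

func (b *bloom) insert(s string) {
	b.bits[b.idx(s, 1)] = true
	b.bits[b.idx(s, 2)] = true
	b.n++
}

func (b *bloom) query(s string) bool {
	return b.bits[b.idx(s, 1)] && b.bits[b.idx(s, 2)]
}

// stack allocates a filter of twice the size once the current one is
// "too full"; queries OR the answers of all filters.
type stack struct{ filters []*bloom }

func (st *stack) insert(s string) {
	cur := st.filters[len(st.filters)-1]
	if cur.n >= len(cur.bits)/10 { // crude fullness threshold
		cur = newBloom(2 * len(cur.bits))
		st.filters = append(st.filters, cur)
	}
	cur.insert(s)
}

func (st *stack) query(s string) bool {
	for _, f := range st.filters {
		if f.query(s) {
			return true
		}
	}
	return false
}

func main() {
	st := &stack{filters: []*bloom{newBloom(64)}}
	st.insert("hello")
	// true, then false (with high probability; Bloom filters can give false positives)
	fmt.Println(st.query("hello"), st.query("world"))
}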

What Next: Deamortization of this data-structure with insertions as well as deletions. Then we will move on to Cache-Oblivious Lookahead Arrays, Fractional Cascading, and eventually Fractal Trees. This is like a rabbit hole!


Not-Just-Sorted-Arrays

It is fascinating how simple data structures can be used to build web-scale systems (Related: A funny video on MongoDB). If this doesn’t make sense to you yet, allow me to slowly build up to the story. One of the simplest, yet most powerful, algorithms in a programmer’s toolbox is Binary Search. There are far too many applications of it. Consider reading this Quora answer for simple examples. I personally use it in git bisect to hunt down bad commits in a repository with tens of thousands of commits.

The humble sorted array is a beautiful thing. You can search over it in $O(\log n)$ time. There is one trouble though: you cannot modify it. I mean, you can, but then you will spoil the nice property of it being sorted, unless you pay an $O(n)$ cost to copy the array to a new location and insert the new element. If you have reserved a large enough array beforehand, you don’t need to copy to a new array, but you still have to shift elements, and that is still an $O(n)$ cost.
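
In Go terms, the asymmetry looks roughly like this: sort.SearchInts gives the $O(\log n)$ lookup, while keeping the slice sorted on insert forces the $O(n)$ shift.

package main

import (
	"fmt"
	"sort"
)

func main() {
	a := []int{3, 7, 19, 42}

	// O(log n): binary search for 19.
	i := sort.SearchInts(a, 19)
	fmt.Println(i < len(a) && a[i] == 19) // true

	// O(n): inserting 10 while keeping the slice sorted means
	// shifting everything after the insertion point.
	j := sort.SearchInts(a, 10)
	a = append(a, 0)
	copy(a[j+1:], a[j:])
	a[j] = 10
	fmt.Println(a) // [3 7 10 19 42]
}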

If we were to plot complexities on a graph, with insert complexity on the X-axis and search complexity on the Y-axis, then all the suitable data-structures would hopefully be bounded by the square with corners at <$O(1)$, $O(1)$> and <$O(n)$, $O(n)$>. The sorted array, at <$O(n)$, $O(\log n)$>, would lie somewhere in the bottom-right corner, whereas a simple unsorted array would sit at the top-left with <$O(1)$, $O(n)$>. You can’t do insertions better than $O(1)$, and you can’t do comparison-based searches better than $O(\log n)$ (although the bases and constants matter a lot in practice).

Now, how do we use a static structure so that we retain the goodness of a sorted array, but allow ourselves the ability to add elements in an online fashion? What we have here is a ‘static’ data-structure, and we are trying to use it for a ‘dynamic’ use case. Jeff Erickson’s notes on Static to Dynamic Transformation are of good use here. The notes present results on how to use static data-structures to build dynamic ones. In the process, you compromise a bit on the search complexity to get a much better insert complexity.

The notes present inserts-only and inserts-with-deletions static to dynamic transformations. I haven’t read the deletions part of it, but the inserts-only transformation is easy to follow. The first important result is:

If the static structure has a space complexity of $S(n)$, a query complexity of $Q(n)$, and a preprocessing (build) complexity of $P(n)$, then the space complexity of the dynamic structure would be $O(S(n))$, with a query complexity of $O(\log n) \cdot Q(n)$ and an insert complexity of $O(\log n) \cdot \frac{P(n)}{n}$ amortized.

Then the notes present the lazy-rebuilding method by Overmars and van Leeuwen, which improves the first result’s insertion complexity by achieving the same bound in the worst case instead of amortized. (Fun fact: Overmars is the same great fellow who wrote Game Maker, a simple game-creation tool, which I used when I was 12! Man, the nostalgia :) I digress..)

The inserts-only dynamic structure is pretty much how LSM trees work. The difference is that the $L_0$ array starts big (hundreds of MBs, or a GB in some cases) and resides in memory, so that inserts are fast. This $L_0$ structure is later flushed to disk, but does not need to be immediately merged with a bigger file; that is done by background compaction threads, which run in a staggered fashion so as to minimize disruption to the read workload. Read the BigTable paper to understand how simple sorted arrays sit at the core of the biggest databases in the world.
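
A very rough Go sketch of that shape (an in-memory $L_0$ "memtable" plus immutable flushed runs; no real disk I/O or background compaction here, and the names are mine):

package main

import (
	"fmt"
	"sort"
)

// lsm keeps a mutable in-memory "memtable" (L0) and a list of immutable
// sorted runs that stand in for files flushed to disk.
type lsm struct {
	memtable  []int
	threshold int
	runs      [][]int // newest first; a real store would merge these via background compaction
}

func (t *lsm) insert(x int) {
	t.memtable = append(t.memtable, x) // fast, unsorted append in memory
	if len(t.memtable) >= t.threshold {
		run := append([]int(nil), t.memtable...)
		sort.Ints(run) // "flush": freeze L0 as a sorted run
		t.runs = append([][]int{run}, t.runs...)
		t.memtable = t.memtable[:0]
	}
}

func (t *lsm) query(x int) bool {
	for _, v := range t.memtable {
		if v == x {
			return true
		}
	}
	for _, run := range t.runs {
		i := sort.SearchInts(run, x)
		if i < len(run) && run[i] == x {
			return true
		}
	}
	return false
}

func main() {
	t := &lsm{threshold: 4}
	for _, v := range []int{9, 2, 11, 5, 7} {
		t.insert(v)
	}
	fmt.Println(t.query(11), t.query(4)) // true false
}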

Next Up: Fractal Trees and others.


A fortune cookie server in Go

I really like the concept of Fortune Cookies in *nix systems, and I absolutely love the Hindi movie Andaz Apna Apna. So, I thought I would write up a simple fortune-cookie server which serves random quotes from the movie. A lot of my friends liked it.

So I thought it would be even nicer if I could generalize it and add a bunch of other popular movies and TV shows. So I wrote Elixir, a generic fortune-cookie server written in Go. This new fortune-cookie server was hosted at rand(quotes), and had quotes from Quentin Tarantino’s movies, Lord of the Rings, and the popular TV shows Breaking Bad and Game of Thrones.

Using Elixir, it is extremely simple to write a fortune-cookie server which serves quotes from multiple quote databases. All you need to do is create a file that contains the quotes you want to serve, one per line, and give it a name like foo.quotes. Place it in the directory where the server was started from, and those quotes will be served from the /foo endpoint.

To make it more fun, /foo?f=cowsay returns the quote in the cowsay format! Something like this:

 _________________________________________
/ Walter White: If you don’t know who I   \
| am, maybe your best course would be to  |
\ tread lightly.                          /
 -----------------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

You can create many such quotes databases, and also add/delete/modify them while the server is running, and the server will pick up the changes.

A full-featured fortune-cookie server would look something like this:

package main

import (
	"flag"
	"github.com/reddragon/elixir"
)

func main() {
	listenPort := flag.Int("port", 80,
		"The HTTP port to listen on (default: 80)")
	flag.Parse()
	elixir.Start(*listenPort)
}

Implementation Note: To keep tabs on the quote databases without having to restart the server, one option was to use the Inotify subsystem in Linux from Go. But that didn’t work on OS X. So I wrote a quick and dirty implementation which does ioutil.ReadDir(".") periodically and filters for all the files which have a .quotes extension.
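
A minimal sketch of that periodic scan; the polling interval and the discover callback are my own placeholders, not Elixir's actual API:

package main

import (
	"fmt"
	"io/ioutil"
	"strings"
	"time"
)

// pollQuoteFiles periodically lists the current directory and reports
// every file with a .quotes extension to the callback.
func pollQuoteFiles(interval time.Duration, discover func(name string)) {
	for {
		files, err := ioutil.ReadDir(".")
		if err == nil {
			for _, f := range files {
				if !f.IsDir() && strings.HasSuffix(f.Name(), ".quotes") {
					discover(f.Name())
				}
			}
		}
		time.Sleep(interval)
	}
}

func main() {
	pollQuoteFiles(time.Minute, func(name string) {
		// A real server would (re)load the file and serve it from the /<name> endpoint here.
		fmt.Println("found quotes database:", name)
	})
}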
