Hadoop: Getting My Hands Dirty

29 Jan 2012 - 2 minute read

Those who are about to read this post, beware. Hadoop is a big beast. There are a lot of things churning inside. I am going to work on the Hadoop Scheduler this semester, so I will be dealing with it day-in and day-out. Having never used Hadoop, it took some time for me to get it up-and-running, and running some test code.

I am making some casual notes of how I went about it. It might help some of you in the future, hopefully, and will let me keep a ready-reference too.

The first step is to understand, what Map Reduce, and Hadoop are. The Wiki pages for both are good-enough sources for the same.
Then, to actually set up Hadoop on Ubuntu, I used this excellent tutorial. (Courtesy Deepak). The procedure should be slightly different for other distributions.
You might want to read this to understand how Hadoop can be used to implement Map Reduce jobs.
The Hadoop Wordcount example is a good place to understand how a basic Map Reduce job is written in Hadoop. This is the standard Hadoop documentation, which explains the example.However there are small changes in the API, and the documentation has not been updated to reflect that :-(. For example, OutputCollector has been replaced by the Context object in the Mapper and Reducer function prototypes. Here is a small summary. But mostly, you should be fine with the above-mentioned tutorial. Using the Wordcount example, I made a trivial modification to find the number of palindromes in a set of files. Here is the code.

These are the sequence of steps I followed to get my job running:

$HADOOP_HOME/bin/hadoop/start-all.sh
javac -classpath $HADOOP_HOME/hadoop-*-core.jar Palindrome.java
jar -cvfe /path/for/jar/file/Palindrome.jar Palindrome Palindrome.class Palindrome\$PalindromeMapper.class Palindrome\$PalindromeReducer.class
$HADOOP_HOME/bin/hadoop dfs -copyFromLocal /path/to/input /user/hadoop/input/path
$HADOOP_HOME/bin/hadoop jar /path/for/jar/file/Palindrome.jar /user/hadoop/input/path /user/hadoop/output/path
$HADOOP_HOME/bin/hadoop dfs -getmerge /user/hadoop/output/path /path/to/final/output

Here are some of the common Hadoop shell commands that you might need. I used dfs, ls, cat, getmerge for running the Palindrome code.
I also read an example which deals with estimating the value of Pi. While there are series, which help in finding the value iteratively, and can be parallelized by assigning the range of terms each node has to calculate, it can prove to be expensive. The reason being, each term needs to be calculated with arbitrary precision.The innovative method is to generate a large number of points within a unit square centered at the origin. If the number of points are large and randomly distributed, the ratio of the number of points inside the circle to the total points gives us the a good approximation of PI/4. We can parallize this by dividing the points based upon the quadrants / sectors they lie in, and summing up the respective counts. Note, here we need to do only one final arbitrary precision calculation.

Reversing Bits

29 Dec 2011 - less than 1 minute read

The problem is, Given a number of n bits, reverse the bits. Normally, I would have used 2n bitwise operations to do this.

In CSE548, for a problem we reversed a string recursively, by reversing the two halves of the string, and then exchanging the positions of the two reversed halves. A string of bits, is a string after all.

So, we start with reversing the odd-even bits, then reversing pairs of bits, then reversing sets of 4 bits, and so on. So, this requires 3*log(n) bitwise operations, which is better. Here is an efficient implementation for reversing a 32-bit unsigned int.

unsigned int reverse(unsigned int x)
{
    x = ((x & 0x55555555)<<1)  | ((x & 0xAAAAAAAA)>>1);
    x = ((x & 0x33333333)<<2)  | ((x & 0xCCCCCCCC)>>2);
    x = ((x & 0x0F0F0F0F)<<4)  | ((x & 0xF0F0F0F0)>>4);
    x = ((x & 0x00FF00FF)<<8)  | ((x & 0xFF00FF00)>>8);
    x = ((x & 0x0000FFFF)<<16) | ((x & 0xFFFF0000)>>16);
    return x;
}

void wrapper()
{
    cout << reverse(0xF00BAAF0);
}

Heaps Demystified

25 Dec 2011 - 4 minute read

I think the first time I learnt heaps was in an undergraduate course on Data Structures. Though I understood the basic idea, it was never crystal clear for me to code during a competition.

I took up the task of learning it afresh today. Here is a short gist.

A heap is a data structure that can be used to make fast (constant time) queries of the type ‘what is the maximum element in the array’, with fast updates (logarithmic time).

A heap is usually implemented as a balanced binary tree, but there are other variants like fibonacci heap, which I won’t be discussing here. In the below discussion and code, I explain how I would implement a max-heap. For min-heaps, use the exact same techniques with small changes like flipping the comparison operators.

We need the following operations when implementing a heap:

topElem()
insertElem(t)
deleteTopElem(t)

All the operations should be done in $O(\log{n})$ time. Internally in a binary-heap, the root element is always atleast as big as both of its children, and this property of the heap exists recursively. (In min-heaps, root is as small, or smaller than both of its children)

It is easy to see that topElem() can be executed in constant time, since the maximum element would be at the root. (In min-heaps the minimum element would be at the root)

Now, when we need to remove the maximum (or the minimum element in min-heaps), we need to remove the root and preserve the heap property. Removing the root is simple, but what it leaves behind is not so trivial. Either the left, or the right child would be the biggest (or the smallest, in min-heaps) element left now. So, we compare the two. The bigger (or smaller. You get the idea now, I guess) of the two is placed in the root, leaving a void at the place where it existed. So, now we need to correct either the left, or the right sub-tree of the root, until the sub-tree is empty. This operation will be executed $O(\log{n})$ times, (costing $O(1)$ time, each time). So the complexity is $O(\log{n})$. Thus, we are done with deleteTopElem().

insertElem(t) is even simpler, we place the new element in the lower-most level of the binary tree, where we have a place, or create a new level if we don’t. Now, it is possible that the parent of the newly inserted element has lost the heap property. Thus, if the parent of the newly inserted element is greater than it, we swap the parent with the newly inserted element, and check if the new-parent has also lost the heap property. We continue this till we reach the root. This operation will be executed $O(\log{n})$ times, (costing $O(1)$ time, each time). So the complexity is $O(\log{n})$.

Here is a messy little implementation:

template < class T >
class Heap
{
	std::vector<T> store;
	int capacity, used;

	void bubbleUp();
	void pushDown();

	public:
	Heap();

	void insertElem(T elem);
	T topElem();
	T deleteTopElem();
};

template< class T >
Heap< T >::Heap()
{
	capacity = 2;
	used = 1;
	store.resize(2);
}

template< class T >
void Heap< T >::insertElem(T elem)
{
	if(used == capacity)
	{
		capacity = (((capacity-1)<<1) + 1) + 1;
		store.resize(capacity);
	}
	store[used++] = elem;
	bubbleUp();
}

template<class T>
T Heap<T>::topElem()
{
	// There is no element in the heap.
	if(used - 1 == 0)
		return 0;

	return store[1];
}

template<class T>
void Heap<T>::bubbleUp()
{
	int ptr = used - 1;
	while(ptr != 1)
	{
		if(store[ptr] > store[ptr>>1])
		{
			swap(store[ptr], store[ptr>>1]);
			ptr >>= 1;
		}
		else break;
	}
}

template<class T>
T Heap<T>::deleteTopElem()
{
	// Empty heap
	if(used - 1 == 0)
		return 0;

	T ret = store[1];
	swap(store[1], store[used - 1]);
	used--;

	if(used - 1 > 0)
		pushDown();

	return ret;
}

template<class T>
void Heap<T>::pushDown()
{
	int ptr = 1;
	while(ptr < used)
	{
		if(ptr<<1 >= used) break;
		T left = store[ptr<<1];

		T right;
		bool right_branch = true;
		if ((ptr<<1) + 1 < used) right = store[(ptr<<1)+ 1];
		else right_branch = false;

		if(store[ptr] >= left && (!right_branch || store[ptr] >= right))
			break;

		if(left >= store[ptr])
		{
			if(!right_branch || left >= right)
			{
				swap(store[ptr], store[(ptr<<1)]);
				ptr = (ptr<<1);
				continue;
			}
			else
			{
				swap(store[ptr], store[(ptr<<1)+1]);
				ptr = (ptr<<1) + 1;
				continue;
			}
		}
		else if(!right_branch)
		{
			swap(store[ptr], store[(ptr<<1)+1]);
			ptr = (ptr<<1) + 1;
			continue;
		}
	}
}

void wrapper()
{
	Heap<int> h;
	int arr[] = { 4, 3, 7, 1, 10};

	for(int i = 0; i <  5; i++)
	{
		h.insertElem(arr[i]);
		cout << h.topElem() << endl;
	}
	cout << endl;
	for(int i = 4; i >=  0; i--)
	{
		cout << h.deleteTopElem() << endl;
	}

	Heap<string> s;
	string sarr[] = { "abc", "def", "ghi", "xyz", "zzz"};
	for(int i = 0; i <  5; i++)
	{
		s.insertElem(sarr[i]);
		cout << s.topElem() << endl;
	}
	cout << endl;
	for(int i = 4; i >=  0; i--)
	{
		cout << s.deleteTopElem() << endl;
	}
}

If you notice the code, I am inserting elements one by one, costing $O(\log{n})$ on each insert. And hence, $O(n\log{n})$ for $n$ elements. However, if we know the entire set of elements before hand, heap construction can be done in linear time, in a manner similar to insertElem(t). Hint: Recursion.

You might also want to overload the comparison operators for custom data-structures.

I used this presentation (ppt) to refresh my memory.

Newer Older