Gaurav's Blog

Miscellaneous Project / Research Worthy Ideas [WIP]

Here I will scribble down some project- / research-worthy ideas. They may or may not yield much benefit, but I would probably like to travel down those paths some time.

  • Concurrent Linked Lists: Linked lists that can be operated on concurrently. There has been some work done on lock-free linked lists etc., which is probably worth checking out and implementing. (A small lock-free sketch appears after this list.)

  • Bloom Filters / Quotient Filters: Definitely worth implementing an efficient Bloom Filter / Quotient Filter. An exciting research area is to design a Bloom-Filter-type data structure that efficiently answers the query “Is there an element in the range [p,q]?” with either a NO (which is guaranteed) or a MAYBE (with an adjustable false-positive rate). (A minimal Bloom Filter sketch appears after this list.)

  • Persistent / Temporal Data Structures: See Prof. Erik Demaine’s lecture.

  • Sorting / Searching: Practical In-Place Merge-Sort [PS]

  • Dynamic RMQ / LCA in O(1) time: Simplify what Cole and Hariharan did for LCA in this paper, and/or devise a dynamic RMQ mechanism which can be done in O(log n) / O(1).

  • Fractional Cascading: This is a really cool idea; read about it here. It helps speed up bounding-box queries by a log factor.
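
As a starting point for the concurrent linked-list idea above, here is a minimal sketch of my own (not from any of the referenced work): a list that only supports lock-free insertion at the head via compare-and-set, in the spirit of Treiber's stack. Full lock-free lists with arbitrary insertion and deletion (e.g. Harris's algorithm) are considerably more involved.

    import java.util.concurrent.atomic.AtomicReference;

    // Prepend-only lock-free linked list: threads race to swing the head
    // pointer with CAS, retrying if another thread won the race.
    public class LockFreeList<T> {
        private static final class Node<T> {
            final T value;
            final Node<T> next;
            Node(T value, Node<T> next) { this.value = value; this.next = next; }
        }

        private final AtomicReference<Node<T>> head = new AtomicReference<Node<T>>(null);

        public void pushFront(T value) {
            while (true) {
                Node<T> oldHead = head.get();
                Node<T> newHead = new Node<T>(value, oldHead);
                // Retry if another thread changed the head in the meantime.
                if (head.compareAndSet(oldHead, newHead)) return;
            }
        }

        public boolean contains(T value) {
            for (Node<T> n = head.get(); n != null; n = n.next) {
                if (n.value.equals(value)) return true;
            }
            return false;
        }
    }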
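
And for the Bloom Filter idea, a bare-bones sketch of a standard Bloom Filter (again my own illustration, with the k hash functions simulated by salting a single hash). The false-positive rate is tuned by the number of bits per inserted element and by k. The range-query variant discussed above would need a different structure altogether.

    import java.util.BitSet;

    public class BloomFilter {
        private final BitSet bits;
        private final int m;  // number of bits
        private final int k;  // number of hash functions

        public BloomFilter(int m, int k) {
            this.bits = new BitSet(m);
            this.m = m;
            this.k = k;
        }

        // Simulate k independent hash functions by mixing in a per-function seed.
        private int hash(String s, int seed) {
            int h = s.hashCode() ^ (seed * 0x9E3779B9);
            return Math.floorMod(h, m);
        }

        public void add(String s) {
            for (int i = 0; i < k; i++) bits.set(hash(s, i));
        }

        // false => definitely not present; true => maybe present (false positives possible)
        public boolean mightContain(String s) {
            for (int i = 0; i < k; i++) {
                if (!bits.get(hash(s, i))) return false;
            }
            return true;
        }
    }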

Please feel free to suggest ideas.

Read more

Competitive Analysis

Professor Bender mentioned this thing called ‘Competitive Analysis’ in class.

It basically compares the performance of your algorithm in the harder setting against the optimal performance achievable in the easier setting, on the same input. For example, off-line scheduling is easier than on-line scheduling: one could measure how badly one's algorithm performs in the on-line version by looking at its cost relative to the optimal / good schedule in the off-line case. The worst-case ratio between the two, over all inputs, is called the competitive ratio.
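
A toy example that illustrates the flavor of this (my own illustration, not from the lecture) is the classic ski-rental problem: renting skis costs 1 per day, buying costs B, and the number of skiing days is not known in advance. The on-line strategy "rent until the rent paid would reach the purchase price, then buy" never costs more than (2 - 1/B) times the optimal off-line cost min(days, B).

    // Ski rental: compare the on-line strategy against the optimal off-line cost.
    public class SkiRental {
        static int onlineCost(int days, int B) {
            // Rent for the first B-1 days; if we are still skiing, buy on day B.
            if (days < B) return days;   // never needed to buy
            return (B - 1) + B;          // rented B-1 days, then bought
        }

        static int offlineCost(int days, int B) {
            return Math.min(days, B);    // off-line knows 'days' in advance
        }

        public static void main(String[] args) {
            int B = 10;
            double worstRatio = 0;
            for (int days = 1; days <= 100; days++) {
                worstRatio = Math.max(worstRatio,
                        (double) onlineCost(days, B) / offlineCost(days, B));
            }
            // Prints 1.9, i.e. 2 - 1/B for B = 10.
            System.out.println("Worst-case competitive ratio: " + worstRatio);
        }
    }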

Read more

Hadoop: Getting My Hands Dirty

Those who are about to read this post, beware: Hadoop is a big beast, and there are a lot of things churning inside. I am going to work on the Hadoop Scheduler this semester, so I will be dealing with it day in and day out. Since I had never used Hadoop before, it took me some time to get it up and running and to run some test code.

I am making some casual notes on how I went about it. Hopefully they will help some of you in the future, and they will serve as a ready reference for me too.

  • The first step is to understand what Map Reduce and Hadoop are. The Wiki pages for both are good-enough sources for this.

  • Then, to actually set up Hadoop on Ubuntu, I used this excellent tutorial. (Courtesy Deepak). The procedure should be slightly different for other distributions.

  • You might want to read this to understand how Hadoop can be used to implement Map Reduce jobs.

  • The Hadoop Wordcount example is a good place to understand how a basic Map Reduce job is written in Hadoop. This is the standard Hadoop documentation, which explains the example. However, there are small changes in the API, and the documentation has not been updated to reflect them :-(. For example, OutputCollector has been replaced by the Context object in the Mapper and Reducer function prototypes. Here is a small summary. But mostly, you should be fine with the above-mentioned tutorial. Using the Wordcount example, I made a trivial modification to find the number of palindromes in a set of files. Here is the code; a rough sketch of its structure follows below.
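
For reference, here is a rough sketch of how such a palindrome counter looks with the newer Context-based API. The linked code is the actual version; the class and method names below simply mirror the standard Wordcount structure and may differ from mine in details (e.g. how a "palindrome" is defined).

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Palindrome {

        public static class PalindromeMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private static final Text PALINDROME = new Text("palindrome");

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    // Emit ("palindrome", 1) for each palindromic token,
                    // instead of (word, 1) as in Wordcount.
                    if (isPalindrome(token)) context.write(PALINDROME, ONE);
                }
            }

            private boolean isPalindrome(String s) {
                int i = 0, j = s.length() - 1;
                while (i < j) {
                    if (s.charAt(i++) != s.charAt(j--)) return false;
                }
                return s.length() > 0;
            }
        }

        public static class PalindromeReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "palindrome count");
            job.setJarByClass(Palindrome.class);
            job.setMapperClass(PalindromeMapper.class);
            job.setCombinerClass(PalindromeReducer.class);
            job.setReducerClass(PalindromeReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }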

This is the sequence of steps I followed to get my job running:

  1. $HADOOP_HOME/bin/start-all.sh

  2. javac -classpath $HADOOP_HOME/hadoop-*-core.jar Palindrome.java

  3. jar -cvfe /path/for/jar/file/Palindrome.jar Palindrome Palindrome.class Palindrome\$PalindromeMapper.class Palindrome\$PalindromeReducer.class

  4. $HADOOP_HOME/bin/hadoop dfs -copyFromLocal /path/to/input /user/hadoop/input/path

  5. $HADOOP_HOME/bin/hadoop jar /path/for/jar/file/Palindrome.jar /user/hadoop/input/path /user/hadoop/output/path

  6. $HADOOP_HOME/bin/hadoop dfs -getmerge /user/hadoop/output/path /path/to/final/output

  • Here are some of the common Hadoop shell commands that you might need. I used dfs, ls, cat, getmerge for running the Palindrome code.

  • I also read an example which deals with estimating the value of Pi. While there are series that compute the value iteratively, and that can be parallelized by assigning each node a range of terms to calculate, this can prove expensive, because each term needs to be calculated with arbitrary precision. The more innovative method is to generate a large number of points within a unit square centered at the origin. If the points are numerous and randomly distributed, the ratio of the number of points inside the inscribed circle to the total number of points gives a good approximation of Pi/4. We can parallelize this by dividing the points based upon the quadrants / sectors they lie in and summing up the respective counts. Note that here we need to do only one final arbitrary-precision calculation. (A toy single-machine sketch of this idea follows.)
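
To make the idea concrete, here is a toy, single-machine sketch of the Monte-Carlo estimate; the real Hadoop example distributes the point generation across map tasks and is more careful about how the sample points are chosen.

    import java.util.Random;

    public class MonteCarloPi {
        public static void main(String[] args) {
            long total = 10000000, inside = 0;
            Random rnd = new Random(42);
            for (long i = 0; i < total; i++) {
                // Random point in the unit square centered at the origin.
                double x = rnd.nextDouble() - 0.5;
                double y = rnd.nextDouble() - 0.5;
                // The inscribed circle has radius 0.5, so the area ratio is pi/4.
                if (x * x + y * y <= 0.25) inside++;
            }
            System.out.println("pi ~= " + 4.0 * inside / total);
        }
    }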

Read more