13 Jan 2014 - 2 minute read
It has almost been a year since I wrote something. For the most part, I have been too busy to share something. Since the past one year, I have been working on HBase at Facebook, which is the backbone of Facebook Messages, and many other important services at Facebook. Also, I’ve moved to Mountain View, California, and I absolutely love the surroundings.
I have been trying my hand at different things, like learning some ML, Go, trying to learn how to play an Electric Guitar, and other things. One thing I want to follow this year would be to continue doing new things, and keep sharing my experiences.
Finally, I have also ditched WordPress in favor of Octopress, a Markdown-style blogging framework built on top of Jekyll. What this means is, I just need to worry about the content, which I can write in simple Markdown format. Octopress generates a completely static website for me to use. I don’t have to setup the Apache-MySQL monstrosity to serve a simple blog.
However, the transition isn’t very smooth.
yourblog.com/?p=XYZ
, where XYZ
is a number, while Octopress has a more sensible, :year/:month/:date/:title
scheme. Google had indexed my blog according to the older scheme and now, and anybody who has linked to a specific blog post, will now be redirected to the main page. In short, its not pleasant.However, the big win is that it is super-easy for me to write posts and host a blog. Earlier, this blog was put up on a free shared hosting site, and it was very clumsy to manage my blog. And as of the time of writing this post, I am hosting multiple blogs on a very lightweight VPS, and as long as I don’t get DDoS-ed, this machine is more than capable of hosting several such static-only blogs. Because, after all, how hard is it to serve static HTML :)
I have a lot to share about what I learnt in the last year, so keep looking :)
Read more25 Jan 2013 - less than 1 minute read
I had done a visualisation using R, where I plotted the locations of my friends and followers on Twitter on a map. I re-did this visualisation using D3.js. It is a fantastic visualisation tool, which a lot of features. I could only explore a fraction of the API for my purpose, but you can see a lot of cool examples on their page.
The final output was something like this (a better look here):
I wanted to resize the bubbles when there are a lot of points in the vicinity, but I’ll leave that for later.
You can have a look at the D3.js code, and the python script which gets the lat-long pairs using the Twitter API, and the Google GeoCoding API (Yahoo! decided to commercialize their GeoCoding API sadly).
(The map data and some inspiration for the code, is from here)
Read more21 Dec 2012 - 2 minute read
I have been trying to learn some Go in my free time. While, I was trying to code up a simple Bloom Filter, I realized that once a Bloom Filter gets too full (i.e., the false positive rate becomes high), we cannot add more elements to it. We cannot simply resize this original Bloom Filter, since we would need to rehash all the elements which were inserted, and we obviously don’t maintain that list of elements.
A solution to this is to create a new Bloom Filter of twice the size (or for that matter, any multiple >=2), and add new elements to the new filter. When we need to check if an element exists, we need to check in the old filter, as well as the new filter. If the new filter gets too full, we create another filter which is a constant factor (>=2) greater in size than the second filter. bit.ly uses a solution similar to this (via Aditya).
We can see that if we have N elements to insert in our ‘Scalable’ Bloom Filter, we would need log N filters (base r, where r is the multiple mentioned above). It is also easy to derive the cumulative false positive rate for this new Bloom Filter.
If the false positive rate of each of the individual constituent bloom filters is f, the probability that we do not get a false positive in one of the filters is (1-f).
Therefore the probability that we do not get a false positive in any of the q filters is, (1-f)^q.
Hence, the probability that we get a false positive in any of these q filters is:
1 - (1-f)^q.
Some rough estimates show that this cumulative false positive rate is around: q * f (only if f is small). Where, q is about log N, as we noted above. Therefore, if you have four filters, each with a false positive rate of 0.01, the cumulative false positive rate is about 4 * 0.01 = 0.04. This is exactly what we want.
What is beautiful in this construction is, the false positive rate is independent of how fast the filter sizes grow. If you maintain a good (small) false positive rate in each of the constituent filter, you can simply add up their false positive rates to get an estimate of the cumulative false positive rate (only if f is small).
You can simply grow your filters fast (like around 5x), each time one filter becomes too full, so as to keep the number of filters small.
I have implemented this ‘Scalable Bloom Filter’ (along with the vanilla Bloom Filter and the Counting Bloom Filter) in Go. Please have a look, and share any feedback.
Read more