01 Apr 2018 - 3 minute read
I’ve been reading a few papers related to scaling Stochastic Gradient Descent for large datasets, and wanted to summarize them here.
Model-Level Parallelism: Works with large models by splitting the model graph itself into several parts. Each part of the model is assigned to a different machine. If there is an edge between two nodes in different parts, the two machines hosting those parts would need to communicate. This is to get around the problem of fitting a large model on a single GPU.
Downpour SGD: To be able to scale to large datasets, DistBelief also runs several replicas of the model itself. The training data is split into several subsets, and each replica works on a single subset. Each of the replica sends the updates of its params to a Parameter Server. The parameter server itself is sharded, and is responsible for getting updates for a subset of params.
Whenever a new replica starts a new minibatch, it gets the relevant params from the parameter server shards, and then sends its updates when its done with its minibatch.
The authors found Adagrad to be useful in the asynchrous SGD setting, since it uses an adaptive learning rate for each parameter, which makes it easy to implement locally per parameter shard.
16 Aug 2017 - 5 minute read
That was a click-baity title. :)
But seriously, people make a big deal out of ‘Dynamic Programming’, in the context of software engineering interviews. Also, the name sounds fancy but for most problems in such interviews, you can go from a naive recursive solution to an efficient solution, pretty easily.
Any problem that has the following properties can be solved with Dynamic Programming:
You just have to do two things here:
That’s it.
Usually the second part is harder. After that, it is like clockwork, and the steps remain the same almost all the time.
Example
Assume, your recursive solution to say, compute the n-th fibonacci number, is:
\[F(n) = F(n - 1) + F(n - 2)\] \[F(0) = F(1) = 1\]Step 1: Write this as a recursive solution first
int fib(int n) {
if (n == 0 || n == 1) {
return 1;
} else {
return fib(n-1) + fib(n-2);
}
}
Now, this is an exponential time solution. Most of the inefficiency comes in because we recompute the solutions again and again. Draw the recursion tree as an exercise, and convince yourself that this is true.
Also, when you do this, you at least get a naive solution out of the way. The interviewer at least knows that you can solve the problem (perhaps, not efficiently, yet).
Step 2: Let’s just simply cache everything
Store every value ever computed.
int cache[20];
int fib(int n) {
// Pre-fill all the values of cache as -1.
memset(cache, -1, sizeof(cache));
return fibDP(n);
}
int fibDP(int n) {
// Check if we have already computed this value before.
if (cache[n] != -1) {
// Yes, we have. Return that.
return cache[n];
}
// This part is just identical to the solution before.
// Just make sure that we store the value in the cache after computing
if (n == 0 || n == 1) {
cache[n] = 1;
} else {
cache[n] = fibDP(n-1) + fibDP(n-2);
}
return cache[n];
}
Let us compute how many unique calls can we make to fibDP
?
n
.n
unique values of n
can be passed to fibDP.n
unique calls.Now, realize two things:
That’s it. We just optimized the recursive code from a $O(2^n)$ time complexity, $O(n)$ space complexity (recursive call stack space) to an $O(n)$ time, $O(n)$ space (recursive + extra space).
Example with a higher number of parameters
int foo(int n, int m) {
if (n <= 0 || m <= 0) {
return 1;
}
return foo(n-1, m) + foo(n, m-1) + foo(n-1, m-1);
}
Time complexity: $O(3^{n+m})$ [Work it out on paper, why this would be the complexity, if you are not sure.]
DP Code
int cache[100][100];
int foo(int n, int m) {
memset(cache, -1, sizeof(cache));
return fooDP(n, m);
}
int fooDP(int n, int m) {
if (n <= 0 || m <= 0) {
return 1;
}
if (cache[n][m] == -1) {
cache[n][m] = fooDP(n-1, m) + fooDP(n, m-1) + fooDP(n-1, m-1);
}
return cache[n][m];
}
Assume I tweak foo and add an $O(n \log m)$ work inside each call, that would just be multiplied for the time complexity, i.e.,
Time complexity = O(unique calls) * O(work-per-call)
\[\implies O(nm) \times O(n \log m)\] \[\implies O(n^2 m \log m)\]$Space Complexity = O(unique calls) * O(space per call)
\[\implies O(nm) \times O(1)\] \[\implies O(nm)\]Now just reinforce these ideas with this question
Extra Credit
What we saw is called top-down DP, because we are taking a bigger problem, breaking it down into sub-problems and solving them first. This is basically recursion with memoization (we ‘memoize’ (fancy word for caching) the solutions of the sub-problems).
When you absolutely, totally nail the recursive solution, some interviewers might want a solution without recursion. Or, probably want to optimize the space complexity even further (which is not often possible in the recursive case). In this case, we want a bottom-up DP, which is slightly complicated. It starts by solving the smallest problems iteratively, and builds the solution to bigger problems from that.
Only if you have time, go in this area. Otherwise, even if you mention to the interviewer that you know there is something called bottom-up DP which can be used to do this iteratively, they should be at least somewhat okay. I did a short blog-post on converting a top-down DP to a bottom-up DP if it sounds interesting.
Read more30 Jul 2017 - 3 minute read
If you have a dataset that is either too large to fit in memory, or new data is being added to it on an ongoing basis (effetively we don’t know its size), there will be the need to have algorithms, which have a:
Examples of such datasets could be clicks on Google.com, friend requests on Facebook.com, etc.
One simple problem that could be posed with such datasets is:
Pick a random element from the given dataset, ensuring that the likelihood of picking any element is the same.
Of course, it is trivial to solve if we know the size of your dataset. We can simply pick a random number in $(0, n-1)$ ($n$ being the size of your dataset). And index to that element in your dataset. That is of course not possible with a stream, where we don’t know $n$.
Reservoir Sampling does this pretty elegantly for a stream $S$:
sample, len := null, 0
while !(S.Done()):
x := S.Next()
len = len + 1
if (Rand(len) == 0):
sample = x
return sample
The idea is simple. Every new element you encounter, replace your current choice with this new element with a probability of $\large\frac{1}{l}$. Where $l$ is the total length of the stream (including this element), encountered so far. When the stream ends, you return the element that you had picked at the end.
However, we need to make sure that the probability of each element being picked is the same. Let’s do a rough sketch:
There could be a weighted variant of this problem. Where, each element has an associated weight with it. At the end, the probability of an item $i$ being selected should be $\large\frac{w_i}{W}$. Where, $w_i$ is the weight of the $i$-th element, and $W$ is the sum of the weights of all the elements.
It is straight-forward to extend our algorithm to respect the weights of the elements:
len
, keep W
.W
by $w_i$ instead of just incrementing it by 1.In a manner similar to the proof above, we can show that this algorithm will also respect the condition that we imposed on it. Infact the previous algorithm is a special case of this one, with all $w_i = 1$.
Credits:
[1] Jeff Erickson’s notes on streaming algorithms for the general idea about streaming algorithms.
Read more