30 Jul 2017 - 3 minute read
If you have a dataset that is either too large to fit in memory, or one to which new data is being added on an ongoing basis (effectively, we don't know its size), you need streaming algorithms: algorithms that make a single pass over the data and use memory much smaller than the dataset itself.
Examples of such datasets could be clicks on Google.com, friend requests on Facebook.com, etc.
One simple problem that could be posed with such datasets is:
Pick a random element from the given dataset, ensuring that the likelihood of picking any element is the same.
Of course, this is trivial if we know the size of the dataset: pick a uniformly random integer in $[0, n-1]$ ($n$ being the size of the dataset) and index to that element. That is not possible with a stream, where we don't know $n$.
Reservoir Sampling does this pretty elegantly for a stream $S$:
sample, len := null, 0
while !(S.Done()):
    x := S.Next()
    len = len + 1
    if (Rand(len) == 0):
        sample = x
return sample
The idea is simple. For every new element you encounter, replace your current choice with this new element with probability $\large\frac{1}{l}$, where $l$ is the number of elements of the stream encountered so far (including this one). When the stream ends, you return whichever element you are currently holding.
However, we need to make sure that the probability of each element being picked is the same. Let’s do a rough sketch:
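Fix an element $x_i$ (the $i$-th element of a stream that turns out to have length $n$). It ends up as the returned sample exactly when it is picked on arrival and is never replaced by any later element:

$$\Pr[x_i \text{ is returned}] = \frac{1}{i} \prod_{j=i+1}^{n} \left(1 - \frac{1}{j}\right) = \frac{1}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{n-1}{n} = \frac{1}{n},$$

which is the same for every $i$, as required.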
There is a weighted variant of this problem, where each element has an associated weight. At the end, the probability of item $i$ being selected should be $\large\frac{w_i}{W}$, where $w_i$ is the weight of the $i$-th element and $W$ is the sum of the weights of all the elements.
It is straightforward to extend our algorithm to respect the weights of the elements: along with len, keep W, the running sum of the weights seen so far. When element $x_i$ arrives, add $w_i$ to W (i.e., increment W by $w_i$ instead of just incrementing it by 1), and replace the current sample with $x_i$ with probability $\large\frac{w_i}{W}$. In a manner similar to the proof above, we can show that this algorithm also respects the condition we imposed on it. In fact, the previous algorithm is a special case of this one, with all $w_i = 1$.
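Here is a minimal runnable sketch of the weighted variant in Python (my own code, not from the original post; stream is assumed to be any iterable of (element, weight) pairs):

import random

def weighted_reservoir_sample(stream):
    """Return one element, where element i is picked with probability w_i / W."""
    sample = None
    total_weight = 0.0
    for element, weight in stream:
        total_weight += weight  # W now includes w_i
        if random.random() < weight / total_weight:
            sample = element    # replace with probability w_i / W
    return sample

# 'b' should come back roughly twice as often as 'a' or 'c'
print(weighted_reservoir_sample([('a', 1.0), ('b', 2.0), ('c', 1.0)]))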
Credits:
[1] Jeff Erickson’s notes on streaming algorithms for the general idea about streaming algorithms.
08 Jun 2017 - 2 minute read
Semi-Supervised Learning is useful in scenarios where you have some labeled data, and lots of unlabeled data. If there is a way to cluster together training examples by a measure of similarity, and we can assume that examples close to each other in these clusterings are likely to have the same labels, then Semi-Supervised Learning can be pretty useful.
Essentially the premise is that labeling all the data is expensive, and we should learn as much as we can from as small a labeled dataset as possible. For their data-labeling needs, the industry relies either on services like Amazon's Mechanical Turk or on full-time labelers on contract (Google Search is one example, with its large team of human raters). Overall, it is costly to build a large labeled dataset. Therefore, if we can minimize our dependence on labeled data and learn from known / inferred similarity within the dataset, that would be great. That's where Semi-Supervised Learning helps.
Assume there is a graph structure to our data, where each node is a datum / row that needs to be labeled, and a weighted edge exists between two nodes if they are similar. In this setting, Label Propagation is a classic technique that has been used commonly.
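For context (my own summary, not from the post), one common formulation of Label Propagation just iterates an averaging step over the weighted adjacency matrix $W$ of that graph:

$$F^{(t+1)} = D^{-1} W F^{(t)}, \qquad D_{ii} = \sum_j w_{ij},$$

where each row of $F$ holds a node's current label distribution, and after every step the rows of labeled nodes are clamped back to their known labels. Unlabeled nodes gradually absorb the labels of the nodes they are most strongly connected to.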
I read this paper from Google Research, which does a good job of generalizing and summarizing similar work (Weston et al., 2012) around training Neural Nets augmented by such a graph-based structure.
To summarize the work very quickly, the network tries to do two things:
a. For labeled data, try to predict the correct label (of course),
b. For the entire data set, try to learn a representation of each datum (embedding) in the hidden layers of the neural net.
For nodes that are adjacent to each other, the distance between their respective embeddings should be small, and the importance of keeping this distance small is proportional to the edge weight. That is, if two adjacent nodes share a high edge weight and the network fails to place their embeddings close together, it incurs a larger penalty than it would for a smaller distance or a lower edge weight.
Check the figure above. The blue layer is the hidden layer used for generating the embedding, whose output is represented by $h_{\theta}(X_i)$. $y$ is the final output. If $X_i$ and $X_j$ are close-by, the distance between them is represented by $d(h_{\theta}(X_i), h_{\theta}(X_j))$.
The cost function is below. Don't let it scare you. The first term is just the total loss from the network's predictions on the labeled data. The next three terms tweak the importance of the distances between labeled-labeled, labeled-unlabeled, and unlabeled-unlabeled pairs, each weighted by its edge weight $w_{uv}$.
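As a rough sketch of that shape (notation mine, reconstructed from the description above rather than copied from the paper), the objective looks something like:

$$\mathcal{L}(\theta) = \sum_{i \in L} c\big(g_{\theta}(X_i), y_i\big) + \alpha_1 \sum_{(u,v) \in E_{LL}} w_{uv}\, d\big(h_{\theta}(X_u), h_{\theta}(X_v)\big) + \alpha_2 \sum_{(u,v) \in E_{LU}} w_{uv}\, d\big(h_{\theta}(X_u), h_{\theta}(X_v)\big) + \alpha_3 \sum_{(u,v) \in E_{UU}} w_{uv}\, d\big(h_{\theta}(X_u), h_{\theta}(X_v)\big)$$

where $L$ is the set of labeled examples, $E_{LL}$, $E_{LU}$ and $E_{UU}$ are the labeled-labeled, labeled-unlabeled and unlabeled-unlabeled edge sets, $c$ is the supervised loss, and the $\alpha$'s control how much each kind of pair matters.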
I skimmed through the paper to get the gist, but I wonder whether the core contribution of such a network is simply to go from a graph-based structure to an embedding. If we were to construct an embedding directly from the graph structure and train a NN separately, using that embedding as the input, my guess is it would fetch similar results with a less complex objective function.
24 Apr 2017 - 3 minute read
PyTorch is a fairly new deep-learning framework released by Facebook; the steady stream of new frameworks reminds me of the JS framework frenzy. But having played around with PyTorch a little bit, it already feels fun.
To keep things short, I liked it because it feels very close to NumPy; for instance, a simple .numpy() suffix converts a Tensor to a NumPy array. PyTorch's website has a 60 min. blitz tutorial, which is laid out pretty well.
Here is the summary to get you started on PyTorch:
- torch.Tensor is your np.array (the NumPy array). torch.Tensor(3, 4) will create a Tensor of shape (3, 4).
- torch.rand can be used to generate random Tensors.
- .numpy() allows converting a Tensor to a NumPy array.
- Variables are similar to placeholders in TF.
This is all it takes to compute the gradient, where x is a Variable:
import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 2), requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()
out.backward()
print(x.grad)
Doing backprop with a simple backward method call on the scalar out computes gradients all the way to x. This is amazing!
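As a quick sanity check of what that prints (my own arithmetic, not from the tutorial): $out = \frac{1}{4}\sum_i 3(x_i + 2)^2$, so $\frac{\partial\, out}{\partial x_i} = \frac{3}{2}(x_i + 2) = 4.5$ at $x_i = 1$, and x.grad should be a $2 \times 2$ tensor filled with 4.5.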
nn.Module wraps around the boring boiler-plate:
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 1, 5)
        self.conv2 = nn.Conv2d(1, 1, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(1 * 10 * 10, 100)
        self.fc2 = nn.Linear(100, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        x = x.view(-1, 1 * 10 * 10)
        x = F.relu(self.fc1(x))
        # Return raw scores; nn.CrossEntropyLoss (used below) applies log-softmax internally
        return self.fc2(x)
As seen, in the __init__ method we just need to define the various NN layers we are going to be using; the forward method then runs through them. The view method is analogous to NumPy's reshape.
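As a quick sanity check (my own snippet, assuming 28x28 single-channel inputs such as MNIST, which is what the layer sizes above imply), a forward pass yields one score per class:

# A fake batch containing one 28x28 grayscale image
x = Variable(torch.randn(1, 1, 28, 28))
net = Net()
print(net(x).size())  # torch.Size([1, 10])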
In the training loop below, the backward pass computes the gradients automatically and the optimizer then applies them. The code is self-explanatory and fairly easy to follow.
import torch.optim as optim

# Create the net and define the optimization criterion
net = Net()
# The loss function
criterion = nn.CrossEntropyLoss()
# Which optimization technique will you apply?
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.5)

# trainloader is assumed to be a torch.utils.data.DataLoader over the training set
for epoch in range(20):  # loop over the dataset multiple times
    # Run per epoch
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data
        # wrap them in Variable
        inputs, labels = Variable(inputs), Variable(labels)
        # zero the parameter gradients
        optimizer.zero_grad()
        # Forward Pass
        outputs = net(inputs)
        # Compute loss function
        loss = criterion(outputs, labels)
        # Backward pass and gradient update
        loss.backward()
        optimizer.step()
The criterion object computes your loss function, and optim provides a bunch of gradient-based optimizers such as vanilla SGD, Adam, etc. As promised, simply calling the backward method on the loss object computes the gradients.
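For example (my own tweak, not from the post), swapping in a different optimizer is a one-line change:

# Any optimizer from torch.optim works here, e.g. Adam
optimizer = optim.Adam(net.parameters(), lr=0.001)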
Overall, I could get to 96% accuracy with the current setup. The complete gist is here.