17 Nov 2016 - 5 minute read
One of the problems with serving databases is horizontal scalability, i.e., being able to add machines to the cluster and balance the read/write load across them.
A naive way to balance traffic is to assign a part of the key-space to each machine. For an object $o$ with key $k$, the machine serving the object can be found as $h(k) \% n$, where $h$ is a hash function such as SHA or MD5, and $n$ is the number of machines in the cluster.
Advantages
Problems
Consistent Hashing is an optimization on the naive solution, in that it avoids the need to copy the entire dataset. Each key is projected as a point on the edge of a unit circle. Every node in the cluster is assigned a point on the edge of the same circle. Then each key is served by the node closest to it on the circle’s edge.
When a machine is added to or removed from the cluster, each machine gives up or takes over a small number of keys. This rebalance happens in such a fashion that only $O(\frac{K}{n})$ transfers happen, where $K$ is the total number of keys and $n$ is the number of machines in the cluster.
Advantages
Problems
Note that in both the above methods, adding or removing machines involves some amount of shutdown. In the first case, we need to completely turn off reads and writes because the cluster goes through a complete rebalance. In the second case, we only need to turn off reads and writes for a fraction of the keyspace, which is $\frac{1}{n}$ of what the first solution touches.
Pinterest, in its blogpost about MySQL sharding, describes a setup where the key itself encodes which shard the key belongs to. When doing the first write for an object $o$, we generate a key for it in which the upper $x$ bits are reserved for the shard the object belongs to. When a read request comes in later, we use those $x$ bits to find which shard should be queried.
Each shard is located on a certain machine, and the shard->machine map is kept in ZooKeeper (a good place to read & write small configs in a consistent fashion). Upon finding the shard, we look up this map to locate the machine to which the request needs to be made.
When new machines are added, we simply create more shards and start using those new shards when populating the shard bits of freshly generated keys. This way, new writes and the reads corresponding to those writes don't hit the existing machines.
I'm going to refer to this as the "Pinterest trick", because I read it on their blog. Pretty sure this is not the first time it's been done, though.
Advantages
Disadvantages
Another trick that some setups apply is to have the key-space sufficiently pre-sharded to begin with. Then, as traffic increases, these shards are simply moved to other machines if their current hosts can't serve them. For MySQL, each shard is a separate database instance. We used a similar approach when operating HBase at FB, where we expected the need to add more machines in the future.
A discussion with Dhruv brought up an interesting point: why are we sharding a database? Sure, we want to scale horizontally. But which resource are we running out of: CPU, disk, or network?
The tricks discussed above scale for disk. Note that, in the case of the Pinterest trick, new shards don't proportionately help with serving existing read queries. For most social networks, the amount of data being created outpaces consumption, so they are bound on disk rather than CPU.
If you were bound on CPU instead, there are several ways to move your shards to not-so-hot machines, depending on which tradeoff you would like to make:
I wrote a lot of this from high-level knowledge and from discussions with people who have worked on these kinds of things. I might have omitted something, or written something that is plainly incorrect. Moreover, this is an open topic, with no "right answers" that apply to all. If you have any comments about this topic, please feel free to share them in the comments section below.
26 Jul 2016 - 13 minute read
This note contains highlights of SIGIR 2016, held at Pisa, Italy. SIGIR (Special Interest Group for Information Retrieval) is an annual conference on new advances in Information Retrieval. I haven't gone through all the papers yet, but here are some highlights of the parts I could attend or go through offline via the slides / proceedings.
The first day was dedicated to tutorials. Most tutorials were 'survey'-like in content, in that they did not present anything new, but they were good if you want to get an idea of what's happening in a particular area of IR.
Deep Learning Tutorial
This tutorial was conducted by members of the Huawei Labs, China. It was about the current state of Deep Learning in the industry.
Succinct Data Structures
This tutorial was about data-structures related to inverted indices. It covered both theory (minimal and easy to understand) and a hands-on session. I really enjoyed this tutorial.
The second day started with a keynote from Christopher Manning. Some highlights from the talk:
Notes from a few talks which were interesting:
Learning to Rank with Selection Bias in Personal Search (Google)
Fast and Compact Hamming Distance Index
Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising
Day 4 had talks from industry about IR systems at scale (big and small). I found these to be very interesting. It's sad that they were not recorded.
Search Is Not a Box - Hadar Shemtov (Google)
There was a related paper presented by Ido Guy from Yahoo! Research.
Searching by Talking: Analysis of Voice Queries on Mobile Web Search
When Watson Went to Work - Aya Soffer (IBM Research)
Ask Your TV: Real-Time Question Answering with Recurrent Neural Networks - (Ferhan Ture, Comcast)
Amazon Search: The Joy of Ranking Products (Daria Sorokina)
Learning to Rank Personalized Search Results in Professional Networks (Viet Ha-Thuc)
The last day had workshops. I attended the first few talks of the NeuIR (Neural Network IR) workshop, before I had to leave to catch my flight. The keynote was given by Tomas Mikolov from FAIR. His slides are here.
Key points:
Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015
Query to Knowledge: Unsupervised Entity Extraction from Shopping Queries using Adaptor Grammars
Explicit In Situ User Feedback for Search Results
This was my first IR conference, and an academic conference in a long time. These are the key takeaways:
16 Jan 2016 - 15 minute read
In the summer of 2015, I wanted to work on a side-project that I could quickly put to use, instead of spending lots of weekends on something going nowhere. Luckily, I found the right project on Peter Norvig's website. Here I will try to describe how to write a simple LISP interpreter, and how to use it in your iOS app.
For those who are new to LISP, it is pretty simple to explain. LISP programs are based on something called 's-expressions'. An s-expression looks like this: $(\mathrm{operator}\ \mathrm{operand_1}\ \mathrm{operand_2}\ \mathrm{operand_3}\ …)$.
For example:
The operands can themselves be recursively computed too.
For example, $(+$ $1$ $(*$ $2$ $3))$ is a valid expression. First we evaluate the inner $(*$ $2$ $3)$ part, then the original expression resolves to $(+$ $1$ $6)$, which then evaluates to $7$. This can go on recursively.
For someone who wants to design an interpreter, LISP is the ideal real-life language to start with. This is for two reasons:
I could stay motivated and bring this project to a close only because you can pick a very small subset of LISP and still do a lot of interesting things.
To keep you motivated about reading the article, let's do a demo first; then we can go into the details of how I built this.
Here is the GitHub repository for the interpreter, and the app on iTunes (Lambda Lisp). Feel free to file issues / contribute.
If you are still reading: Let’s build an interpreter!
Lexing involves finding lexemes, or syntactical tokens, which can then be combined to interpret a grammatical sentence. In the expression $(+$ $1$ $2)$, the tokens are [$($, $+$, $1$, $2$, $)$]. Sophisticated compilers use lex or flex for finding these tokens, handling white-space, attaching a token type to each of them, etc.
I did not want to bloat up my simple interpreter by using lex / flex. I found this nifty one-line bare-bones Lexer in Peter Norvig’s article:
def tokenize(chars):
"Convert a string of characters into a list of tokens."
return chars.replace('(', ' ( ').replace(')', ' ) ').split()
Essentially, this handles white-space (somewhat): it adds spaces around the brackets, and then splits the expression on white-space. It does not attach types to tokens, but otherwise it works well, because LISP is simple enough that attaching types to tokens (and erroring out, if required) can be done at parse time. This is how I did it in Go, just for completeness' sake:
func tokenize(exp string) []string {
return strings.Fields(
strings.Replace(strings.Replace(exp, "(", " ( ", -1), ")", " ) ", -1),
)
}
The recurrent theme in the design of this interpreter, is to be lazy and push the harder problems to the next layer.
Given an expression, we would need to make sure that the expression follows a structure, or a Grammar. This means two things in our case:
At this stage, we are only concerned about well-formedness of the s-expression. We don’t care if the $+$ operator received incompatible operands, for instance. This means that given an expression like $(+$ $1)$, we would mark this expression to be okay at this point, because the expression is well-formed. We will catch the problem of too few operands to $+$, at a later time.
We can start solving the problem of checking well-formedness of the expression by using an Abstract Syntax Tree (or AST). An AST is a way of representing the syntactic structure of code. In this tree, the leaf nodes are atomic values, and all the non-leaf nodes are operators. Recursion can be naturally expressed using an AST.
This is how we can represent a node of this tree in the code:
type ASTNode struct {
value string // Only valid for leaf nodes.
isValue bool // Whether this node is a leaf value (children == nil).
children []*ASTNode // Children of this AST Node.
}
To actually verify the well-formedness of the expression and build the AST, we would go about it this way:
// This method gets a list of tokens, and returns:
// 1. The ASTNode of the tree so formed.
// 2. Unused tokens at the end of the array.
// 3. Any error while constructing the tree.
// Removed some error handling to make it a little brief.
func buildAST(tokens []string) (*ASTNode, []string, error) {
var token = ""
tokensLen := len(tokens)
// A single remaining token must be an atomic value (a one-node AST).
if tokensLen == 1 {
token, tokens = pop(tokens)
if !isValue(token) {
return nil, tokens, errStr("value", token)
}
// Create a one-node AST.
node := new(ASTNode)
node.isValue = true
node.value = token
node.children = nil
return node, tokens, nil
} else {
token, tokens = pop(tokens)
// Expect an opening bracket to start.
if token != openBracket {
return nil, tokens, errStr(openBracket, token)
}
node := new(ASTNode)
node.isValue = false
// Create a slice with 0 length initially.
node.children = make([]*ASTNode, 0)
tokensLen = len(tokens)
for len(tokens) != 0 && tokens[0] != closedBracket {
var childNode *ASTNode = nil
var err error = nil
// A bare value token becomes a leaf; if we see an opening bracket, we
// recursively build the AST of the sub-expression from here on.
if tokens[0] != openBracket {
token, tokens = pop(tokens)
childNode, _, err = buildAST([]string{token})
} else {
childNode, tokens, err = buildAST(tokens)
}
if err != nil {
return nil, tokens, err
}
node.children = append(node.children, childNode)
}
// Expect a closing bracket when ending.
token, tokens = pop(tokens)
if token != closedBracket {
return nil, tokens, errStr(closedBracket, token)
}
return node, tokens, nil
}
}
You can see how the s-expression grammar is hard-coded here. We expect the expression to be either a single value, or something like $(\mathrm{operator}\ \mathrm{o_1}\ \mathrm{o_2}\ …\ )$, where $\mathrm{o_i}$ can be an atomic value or a nested expression.
Note that we construct the AST slightly differently: the operator is also part of the `children`.
We combine the parsing and evaluation of the AST into one stage. The result of evaluating an AST is an `Atom`, which can either hold a `Value` or an error.
type Atom struct {
Err error
Val Value
}
Here is a stripped down AST evaluation code:
func evalAST(env *LangEnv, node *ASTNode) Atom {
var retVal Atom
if node.isValue {
retVal.Val, retVal.Err = getValue(env, node.value)
return retVal
}
// Assuming that the first child is the operator.
symbol := node.children[0].value
operator := env.getOperator(symbol)
// Skipped error handling related to operator & arguments for brevity.
operands := make([]Atom, 0)
for i := 1; i < len(node.children); i++ {
v := evalAST(env, node.children[i])
if v.Err != nil {
return v
}
operands = append(operands, v)
}
v := operator.handler(env, operands)
if v.Err != nil {
return v
}
retVal.Val = v.Val
return retVal
}
Basic evaluation is very simple. We have a struct called `LangEnv`, which is the 'environment' data-structure storing, amongst other things, the defined operators. When evaluating an AST, if it is a single node, the value of the node is the result. Otherwise, we simply look up the operator in the environment using `getOperator`, resolve the operands recursively, and pass the operands to the operator. The operator deals with making sure that the operands are sane.
An operator looks something like this:
type Operator struct {
symbol string
minArgCount int
maxArgCount int
handler (func(*LangEnv, []Atom) Atom)
}
As seen, `symbol` is the name of the operator; for binary addition it would be "+". `handler` is the function which will actually do the heavy lifting we have been avoiding all this while. It takes in a slice of `Atom`s (and a `LangEnv`, more on that later) and returns an `Atom` as a result.
Now, the fun stuff.
Remember `Atom` has a `Value` inside? `Value` is an interface, and any type which wants to be a `Value` needs to implement the following methods:
type Value interface {
Str() string // Returns a string representation of the value.
getValueType() valueType // Return the valueType (enum of all Values).
to(valueType) (Value, error) // Convert to a different Value type.
ofType(string) bool // Check if a given string is of this type.
newValue(string) Value // Create a new Value of this type.
}
This is enough power to figure out which value is of which type. In `LangEnv` we keep a list of builtin `Value` types, such as `intValue`, `floatValue`, `stringValue`, etc.
To deduce the type of a value, we simply do this:
func getValue(env *LangEnv, token string) (Value, error) {
types := builtinTypes()
for _, t := range types {
if t.ofType(token) {
return t.newValue(token), nil
}
}
return nil, fmt.Errorf("Could not get type for token: %s", token)
}
Now imagine an expression like $(+$ $1.2$ $3)$. $1.2$ resolves to a `floatValue`, and $3$ resolves to an `intValue`. How would we implement the `handler` method for the $+$ operator to add these two different types? You might say that this involves casting $3$ to a `floatValue`. But how do we decide which values to cast, and what type we should cast them to?
This is how we do it: in the method `typeCoerce`, we try to find out which is the right value type to cast all our values to. It is declared like:

func typeCoerce(operatorName string, operands *[]Atom, typePrecendenceMap map[valueType]int) (valueType, error)
This is what we do inside `typeCoerce`:
Hence, the $+$ operator could be implemented this way:
op :=
&Operator{
symbol: add,
minArgCount: 2,
maxArgCount: 100, // Just an arbitrarily large value.
handler: func(env *LangEnv, operands []Atom) Atom {
var retVal Atom
var finalType valueType
finalType, retVal.Err = typeCoerce(add, &operands, numValPrecedenceMap)
if retVal.Err != nil {
return retVal
}
switch finalType {
case intType:
var finalVal intValue
finalVal.value = 0
for _, o := range operands {
v, _ := o.Val.(intValue)
finalVal.value = finalVal.value + v.value
}
retVal.Val = finalVal
break
case floatType:
var finalVal floatValue
finalVal.value = 0
for _, o := range operands {
v, _ := o.Val.(floatValue)
finalVal.value = finalVal.value + v.value
}
retVal.Val = finalVal
break
}
return retVal
},
}
// Add the operator to the LangEnv's opMap (operator map)
addOperator(env.opMap, op)
Here we basically just call `typeCoerce` on the operands, and if it's possible to cast them all to one single type, we do that casting and actually perform the addition in the new type.
The $+$ operator can be used to add strings as well. However, we don't want something like $($$+$ $3.14\ \mathrm{“foo”})$. The `typeCoerce` method can be trivially extended to support a list of valid type precedence maps, requiring that all operands belong to the same precedence map. In this case, the list could be `{ {intType: 1, floatType: 2}, {stringType: 1} }`. This ensures that we don't add ints and strings, for example, because they don't belong to the same precedence map.
Note that the entire implementation of the operator is defined in the `Operator` struct's `handler` method. Whether the operator supports this sort of casting, rolls its own, or doesn't use it at all, is the prerogative of the operator.
A typical variable definition could look like this: $(\mathrm{defvar}\ \mathrm{x}\ 3.0)$.
Now, `defvar` is an operator too. It expects the first argument to be of `varType` (i.e., it matches the regex of a variable), while the value can be anything (except a `varType`). We just need to check that the type conditions match, and that the variable is not a defined operator. If both are okay, we define the variable in our `LangEnv`'s `varMap`.
We need to change our `evalAST` method to support variable lookup:
func evalAST(env *LangEnv, node *ASTNode) Atom {
// ...
for i := 1; i < len(node.children); i++ {
v := evalAST(env, node.children[i])
if v.Err != nil {
return v
}
// <!-- Lookup code begins
if v.Val.getValueType() == varType {
v.Val, v.Err = getVarValue(env, v.Val)
if v.Err != nil {
return v
}
}
// Lookup code ends --!>
operands = append(operands, v)
}
// ...
}
Here we can assume we have a helper method called `getVarValue`, which looks up the value of a variable in the `varMap`, or returns an error if required (this is also simple to implement).
Defining methods is even more fun! We use the `defun` operator. Consider this:
(defun circle-area (r) (* 3.14 r r))
The first operand is the name of the method, the second is the list of variables that you would pass to the method, and the last is the actual method body, which assumes that the variables it needs are already defined. `(circle-area 10)` after this definition should return `314`.
Calculating the area of a rectangle is pretty much similar.
(defun rect-area (x y) (* x y))
We need a couple of things for function calls to work fine:

- A new value type, `astValue`, which can be used for keeping ASTs. So far we were keeping ints, floats and so on.
- Changing `evalAST` to not evaluate the AST in the `defun` arguments. This is because in `circle-area`, the `(* 3.14 r r)` itself is the value (an AST value).
- The `defun` operator needs to add an operator to the `opMap`, with the same name as the method, and define its `handler` method.
- Correct scoping of variables. Consider `(defvar x 3.0)` defining the variable `x`, followed by `(defun foo (x) (+ 1 x))` which defines a method using a param labelled `x`. The interpreter may look at the `varMap` and think that the programmer wants to use the global `x`, which is $3.0$. The actual intention of the programmer is to use the parameter `x`.
For this to work correctly, we would need:

- A new `LangEnv` to be created inside the `handler`.
- The new environment's `varMap` chained to that of the parent `LangEnv` passed to the handler.
- Calling `evalAST` to evaluate the AST we were provided in the method definition, with the new `LangEnv`.
- A recursion depth counter kept in the `LangEnv`, incremented every time a recursive call is made. If it exceeds a large value (100000 for now), we can error out, so as to salvage the interpreter at least.

This is the only complicated part of the interpreter. Those interested in the code can check it out here.
Apart from making sure that we have some sort of recursion depth limit enforced, recursion does not need any special handling, except for defining some new operators like `cond` (the if-else equivalent), which are required for writing anything useful.
Here is the implementation of the factorial function:
(defun fact (x)
(cond ; conditional block
((= x 0) 1) ; if x == 0, return 1
(true ; else
(* x (fact (- x 1))) ; return x * fact(x - 1)
)
)
)
`(fact 10)` returns `3628800`, as expected.
Once I had the interpreter working fine, I wanted to run it in an iOS app. Why? Just for fun. It turns out that Go 1.5 shipped with a new tool called gomobile. Hana Kim gave a talk about it at GopherCon 2015.
What it does is compile your Go package into a static library, generate Objective-C bindings for it, and wrap them together in a nice iOS-friendly `.framework` package.
There are a few restrictions, such as not being able to return complex types like structs within structs, but apart from that it was fairly easy to use in my bare-bones app. Here is the code for the app; we have already seen the demo earlier.
(Thanks to Dhruv Matani for reviewing this blog-post.)