26 Jul 2016 - 13 minute read
This note contains highlights of SIGIR 2016 held at Pisa, Italy. SIGIR (Special Interest Group for Information Retrieval) is an annual conference related to new advances in Information Retrieval. I haven’t gone through all the papers yet, but sharing some highlights of the parts that I could attend / go through offline via the slides / proceedings.
The first day was dedicated to tutorials. Most tutorials were ‘survey’ like in the content, in that they did not present anything new, but were good if you want to get an idea about what’s happening in a particular area of IR.
Deep Learning Tutorial
This tutorial was conducted by members of the Huawei Labs, China. It was about the current state of Deep Learning in the industry.
Succinct Data Structures
This tutorial was about data-structures related to inverted indices. It went over both, theory (minimal and easy to understand), as well as a hands-on session. I really enjoyed this tutorial.
The second day started with a keynote from Christopher Manning. Some highlights from the talk:
Listing notes from few talks which were interesting:
Learning to Rank with Selection Bias in Personal Search (Google)
Fast and Compact Hamming Distance Index
Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising
Day 4 had talks from the Industry which talked about IR systems at scale (big / small). I found these to be very interesting. It’s sad that they were not recorded.
Search Is Not a Box - Hadar Shemtov (Google)
There was a related paper presented by Ido Guy from Yahoo! Research.
Searching by Talking: Analysis of Voice Queries on Mobile Web Search
When Watson Went to Work - Aya Soffer (IBM Research)
Ask Your TV: Real-Time Question Answering with Recurrent Neural Networks - (Ferhan Ture, Comcast)
Amazon Search: The Joy of Ranking Products (Daria Sorokina)
Learning to Rank Personalized Search Results in Professional Networks (Viet Ha-Thuc)
The last day had workshops. I attended the first few talks of the NeuIR (Neural Network IR) workshop, before I had to leave to catch my flight. The keynote was given by Tomas Mikolov from FAIR. His slides are here.
Key points:
Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015
Query to Knowledge: Unsupervised Entity Extraction from Shopping Queries using Adaptor Grammars
Explicit In Situ User Feedback for Search Results
This was my first IR conference, and an academic conference in a long time. These are the key takeaways:
16 Jan 2016 - 15 minute read
In the summer of 2015, I wanted to work on a side-project, that I can quickly use, instead of spending lots of weekends, and it getting no where. Luckily, I found the right project on Peter Norvig’s website. Here I will try to describe, how to write a simple LISP interpreter, and how to use that in your iOS app.
To those who are new to LISP, it is pretty simple to explain. LISP programs are based on something called ‘s-expressions’. An s-expression looks like this: $(\mathrm{operator}\ \mathrm{operand_1}\ \mathrm{operand_2}\ \mathrm{operand_3}\ …)$.
For example:
The operands can themselves be recursively computed too.
For example, $(+$ $1$ $(*$ $2$ $3))$ is a valid expression. First we evaluate the inner $(*$ $2$ $3)$ part, then the original expression resolves to $(+$ $1$ $6)$, which then evaluates to $7$. This can go on recursively.
For someone who wants to design an interpreter, LISP is the ideal real-life language to start with. This is for two reasons:
I could only stay motivated, and bring this project to a closure, because you can pick a very small subset of LISP, and still do a lot of interesting things.
To keep you motivated about reading the article, lets do a demo first and then we can go into details about how I built this.
Here is the GitHub repository for the interpreter, and the app on iTunes (Lambda Lisp). Feel free to file issues / contribute.
If you are still reading: Let’s build an interpreter!
Lexing involves finding lexemes, or syntactical tokens, which can then be combined to interpret a grammatical sentence. In the expression $(+$ $1$ $2)$, the tokens are [$($, $+$, $1$, $2$, $)$]. Sophisticated compilers use lex or flex for finding these tokens, handling white-space, attaching a token type to each of them, etc.
I did not want to bloat up my simple interpreter by using lex / flex. I found this nifty one-line bare-bones Lexer in Peter Norvig’s article:
def tokenize(chars):
"Convert a string of characters into a list of tokens."
return chars.replace('(', ' ( ').replace(')', ' ) ').split()
Essentially, what this does is to handle white-space (somewhat). It basically adds spaces around the brackets, and then splits the expression on white-space.
We need to do the replacement for all operators, but otherwise it works well. This is because LISP is simple enough that attaching types to tokens (and error-ing out, if required) can be done at the time of parsing. This is how I did it in Go, just for completeness sake.
func tokenize(exp string) []string {
return strings.Fields(
strings.Replace(strings.Replace(exp, "(", " ( ", -1), ")", " ) ", -1),
)
}
The recurrent theme in the design of this interpreter, is to be lazy and push the harder problems to the next layer.
Given an expression, we would need to make sure that the expression follows a structure, or a Grammar. This means two things in our case:
At this stage, we are only concerned about well-formedness of the s-expression. We don’t care if the $+$ operator received incompatible operands, for instance. This means that given an expression like $(+$ $1)$, we would mark this expression to be okay at this point, because the expression is well-formed. We will catch the problem of too few operands to $+$, at a later time.
We can start solving the problem of checking well-formedness of the expression by using an Abstract Syntax Tree (or AST). An AST is a way of representing the syntactic structure of code. In this tree, the leaf nodes are atomic values, and all the non-leaf nodes are operators. Recursion can be naturally expressed using an AST.
This is how we can represent a node of this tree in the code:
type ASTNode struct {
value string // Only valid for leaf nodes.
isValue bool // Checks if this is a value (also if children == nil).
children []*ASTNode // Children of this AST Node.
}
To actually verify the well-formedness of the expression and build the AST, we would go about it this way:
// This method gets a list of tokens, and returns:
// 1. The ASTNode of the tree so formed.
// 2. Unused tokens in the end of the array.
// 3. Any error while constructing the tree.
// Removed some error handling to make it a little brief.
func buildAST(tokens []string) (*ASTNode, []string, error) {
var token = ""
tokensLen := len(tokens)
// If it is an empty list of tokens, the AST is a nil node
if tokensLen == 1 {
token, tokens = pop(tokens)
if !isValue(token) {
return nil, tokens, errStr("value", token)
}
// Create a one-node AST.
node := new(ASTNode)
node.isValue = true
node.value = token
node.children = nil
return node, tokens, nil
} else {
token, tokens = pop(tokens)
// Expect an opening bracket to start.
if token != openBracket {
return nil, tokens, errStr(openBracket, token)
}
node := new(ASTNode)
node.isValue = false
// Create a slice with 0 length initially.
node.children = make([]*ASTNode, 0)
tokensLen = len(tokens)
for len(tokens) != 0 && tokens[0] != closedBracket {
var childNode *ASTNode = nil
var err error = nil
// If we get an opening brace, it means we need to recursively get the
// AST of the sub-expression from here on.
if tokens[0] != openBracket {
token, tokens = pop(tokens)
childNode, _, err = buildAST([]string{token})
} else {
childNode, tokens, err = buildAST(tokens)
}
if err != nil {
return nil, tokens, err
}
node.children = append(node.children, childNode)
}
// Expect a closing bracket when ending.
token, tokens = pop(tokens)
if token != closedBracket {
return nil, tokens, errStr(token, closedBracket)
}
return node, tokens, nil
}
}
You can see how the grammar for interpreting the s-expression grammar is hard-coded in the code here. We expect the expression to be either a single value, or something like $(\mathrm{operator}\ \mathrm{o_1}\ \mathrm{o_2}\ …\ )$, where $\mathrm{o_i}$ can be an atomic value, or a nested expression.
Note that we construct the AST slightly differently. The operator is also part of the children
.
We combine the parsing and evaluation of the AST into one stage. The result of evaluating an AST is an Atom
, which can either have a Value
or an errror
.
type Atom struct {
Err error
Val Value
}
Here is a stripped down AST evaluation code:
func evalAST(env *LangEnv, node *ASTNode) Atom {
var retVal Atom
if node.isValue {
retVal.Value, retVal.Err = getValue(env, node.value)
return retVal
}
// Assuming that the first child is an operand
symbol := node.children[0].value
operator := env.getOperator(symbol)
// Skipped error handling related to operator & arguments for brevity.
operands := make([]Atom, 0)
for i := 1; i < len(node.children); i++ {
v := evalAST(env, node.children[i])
if v.Err != nil {
return v
}
operands = append(operands, v)
}
v := operator.handler(env, operands)
if v.Err != nil {
return v
}
retVal.Val = v.Val
return retVal
}
Basic evaluation is very simple. We have a struct called LangEnv
, which is the ‘environment’ data-structure storing amongst other things, defined operators. When evaluating an AST, if it is a single node, the value of the node is the result. Otherwise, we simply lookup the operator in the environment using getOperator
, then resolve the operands recursively, and pass the operands to the operator. The operand deals with making sure that the operands are sane.
An operator looks something like this:
type Operator struct {
symbol string
minArgCount int
maxArgCount int
handler (func(*LangEnv, []Atom) Atom)
}
As seen, symbol
is the name of operator, so for the binary addition it can be “+”. handler
is the function which will actually do the heavy lifting we have been avoiding all this while. It takes in a slice of Atom
s (and a LangEnv
, more on that later) and returns an Atom
as a result.
Now, the fun stuff.
Remember Atom
has a Value
inside? Value
is an interface, and any type which wants to be a Value
, needs to implement the following methods.
type Value interface {
Str() string // Returns a string representation of the value.
getValueType() valueType // Return the valueType (enum of all Values).
to(valueType) (Value, error) // Convert to a different Value type.
ofType(string) bool // Check if a given string is of this type.
newValue(string) Value // Create a new Value of this type.
}
This is enough power to figure out which value is of which type. In LangEnv
we keep a list of builtin Value
types, such as intValue
, floatValue
, stringValue
, etc.
To deduce the type of a value, we simply do this:
func getValue(env *LangEnv, token string) (Value, error) {
types := builtinTypes()
for _, t := range types {
if t.ofType(token) {
return t.newValue(token), nil
}
}
return nil, errors.New(fmt.Sprintf("Could not get type for token: %s", token))
}
Now imagine an expression like $(+$ $1.2$ $3)$.
$1.2$ resolves to floatValue
, and $3$ resolves to intValue
. How would we implement the handler
method for the $+$ operator to add these two different types? You might say, that this would involve casting of $3$ to floatValue
. Now how do we decide what values to cast, and what type should we cast them to?
This is how we do it. In the method, typeCoerce
, we try to find out which is the right value type to cast all our values to. It is declared like:
func typeCoerce(operatorName string, operands *[]Atom, typePrecendenceMap map[valueType]int)
(valueType, error)
This is what we do inside typeCoerce
:
Hence, the $+$ operator could be implemented this way:
op :=
&Operator{
symbol: add,
minArgCount: 2,
maxArgCount: 100, // Just an arbit large value.
handler: func(env *LangEnv, operands []Atom) Atom {
var retVal Atom
var finalType valueType
finalType, retVal.Err = typeCoerce(add, &operands, numValPrecedenceMap)
if retVal.Err != nil {
return retVal
}
switch finalType {
case intType:
var finalVal intValue
finalVal.value = 0
for _, o := range operands {
v, _ := o.Val.(intValue)
finalVal.value = finalVal.value + v.value
}
retVal.Val = finalVal
break
case floatType:
var finalVal floatValue
finalVal.value = 0
for _, o := range operands {
v, _ := o.Val.(floatValue)
finalVal.value = finalVal.value + v.value
}
retVal.Val = finalVal
break
}
return retVal
},
}
// Add the operator to the LangEnv's opMap (operator map)
addOperator(env.opMap, op)
Here we basically just call typeCoerce
on the operands, and if its possible to cast them to one single type, we do that casting, and actually perform the addition in the new type.
The $+$ operator can be used to add strings as well. However, we don’t want something like $($$+$ $3.14\ \mathrm{“foo”})$. The typeCoerce
method can be trivially extended to support a list of type valid type precedence maps, and all operands need to belong to the same precedence map. In this case, it could be { {intType: 1, floatType: 2}, {stringType: 1 } }
. This particular list ensures that we don’t add ints and strings for example, because they don’t belong to the same precedence map.
Note that the entire implementation of the operator is defined in the Operator
struct’s handler
method. Whether or not the operator supports this sort of casting, or decides to roll its own, or not use it at all, is the prerogative of the operator.
A typical variable definition could look like this: $(\mathrm{defvar}\ \mathrm{x}\ 3.0)$.
Now, defvar
is an operator too. It expects the first argument to be of varType
(matches the regex of a variable), and the value can be anything (except varType
). We just need to check if the type conditions match, and the variable is not a defined operator. If both are okay, we define the variable in our LangEnv
’s varMap
.
We need to change the part in our evalAST
method which to support variable lookup.
func evalAST(env *LangEnv, node *ASTNode) Atom {
// ...
for i := 1; i < len(node.children); i++ {
v := evalAST(env, node.children[i])
if v.Err != nil {
return v
}
// <!-- Lookup code begins
if v.Val.getValueType() == varType {
v.Val, v.Err = getVarValue(env, v.Val)
if v.Err != nil {
return v
}
}
// Lookup code ends --!>
operands = append(operands, v)
}
// ...
}
Here we can assume we have a helper method called getVarValue
, which looks up the value of a variable in the varMap
, or throws an error if required (this is also simple to implement).
Defining methods is even more fun! We use the defun
operator. Consider this:
(defun circle-area (r) (* 3.14 r r))
The first operand is the name of the method, the second is a list of variables that you would pass to the method, and the last is the actual method, assuming that the variables you need are already defined. (circle-area 10)
after this definition should return 314
.
Calculating the area of a rectangle is pretty much similar.
(defun rect-area (x y) (* x y))
We need a couple of things for function calls to work fine:
astValue
which can be used for keeping ASTs. So far we were keeping ints, floats and so on.evalAST
to not evaluate the AST in the defun
arguments. This is because in circle-area
, the (* 3.14 r r)
itself is the value (AST value).defun
operator needs to add an operator to the opMap
, with the same name as the method, and define its handler
method.(defvar x 3.0)
defining the variable x
, followed by (defun foo (x) (+ 1 x))
which defines a method uses a param labelled x
. The interpreter may look at the varMap
and think that the programmer wants to use the global x
, which is $3.0$. The actual intention of the programmer is to use the parameter x
.For this to work correctly, we would need:
LangEnv
to be created, inside the handler
.varMap
as the parent LangEnv
passed to the handler.evalAST
to evaluate the AST we were provided in the method definition with the new LangEnv
LangEnv
, and it is incremented every time a recursive call is made. If it exceeds a large value (100000 for now), we can error out, so as to salvage the interpreter at least.This is the only complicated part of the interpreter. Those interested in the code can check it out here.
Apart from making sure that we have some sort of recursion depth limit enforced, recursion does not need any special handling. Except, defining some new operators like cond
(the if-else equivalent), which are required for writing something useful.
Here is the implementation of the factorial function:
(defun fact (x)
(cond ; conditional block
((= x 0) 1) ; if x == 0, return 1
(true ; else
(* x (fact (- x 1))) ; return x * fact(x - 1)
)
)
)
fact(10)
returns 3628800
as expected.
Once I had the interpreter working fine, I wanted to run this on an iOS app. Why? Just for fun. It turns out with Go 1.5, a new tool called gomobile. Hana Kim gave a talk about this at GopherCon 2015.
What it does is, it compiles your Go package into a static library, and generates Objective-C bindings for it, and wraps them together in a nice iOS friendly .framework
package.
There are a few restrictions regarding not being able to return complex types such as structs within structs, but apart from that it was fairly easy to use in my bare-bones app. Here is the code for the app, and we have already seen the demo earlier.
(Thanks to Dhruv Matani for reviewing this blog-post.)
Read more15 Dec 2014 - 5 minute read
These past two days in some free time, I decided to explore this nifty C library called libevent.
Following the theme from the previous post, the first question is: ‘Why do we need it?’. If you are familiar with network programming, or any multithreaded programming which involves blocking IO, you already know the problem at hand. Right from the hardware level to the software level, a common problem that happens is: IO is slower than the CPU. If we have several tasks to finish, and the current task being executed is waiting for a blocking IO to finish, we should ideally work on the other tasks and let that blocking IO finish in the background, and check on it later.
When we have several such operations happening in the background, we need a way to figure out when a particular operation (such as read, write, accept a connection), can be performed without blocking, so that we can quickly do that, and return to other things. select(2), poll(2), epoll(4), kqueue(2) (on *BSD systems), are one of the several ways to do it. In essence, you register a set of file descriptors that you are interested in, and then call one of these ‘backends’. They would usually block until either one of the fd-s that you are interested in, is ready for data to be read or written to it. If none of them is ready, it would block for a configured amount of time and then return.
The problem that libevent solves is, it provides an easy to use library for notifying when an event happens on the file descriptors which you consider interesting. It also hides the real backend (select, epoll, kqueue) being used, and this helps you avoid writing platform-dependent code (eg., kqueue works only on *BSD) and if there were a new and improved backend in the future, your code would not change. It is like the JVM for asynchronous event notification system.
I only have experience with select
, so my context is limited. Using select
is very tedious.
// Essentially a bitmask
fd_set readfds;
while (true) {
// Zero out the bitmask
FD_ZERO(&readfds);
// Enable the fd that we are interested in
FD_SET(sockfd, &readfds);
int ret = select(sockfd + 1, &readfds, NULL, NULL, &timeout);
if (ret < 0) {
fprintf(stderr, "select() returned %d\n", ret);
}
// Is the bit corresponding to the fd enabled?
// If yes, that fd is ready to be read from.
if (FD_ISSET(sockfd, &readfds)) {
// Do what we want here
}
}
In essence, what we do here is to create a set of file descriptors (fd_set
). We then run a loop where every time we set all the file descriptors we are interested in, into that set. Then we call select(), and it either times out, or one of the bits in that set would be set. We have to check for each of the file descriptors. This makes it a little ugly. Other backends might be more pleasant to use, but libevent is way ahead of select in terms of usability. Here is some sample code:
event_base* eb = event_base_new();
event* t = event_new(eb, sockfd, EV_READ | EV_PERSIST, acceptConnCob, NULL);
event_add(t, NULL);
event_base_dispatch(eb);
An event_base
represents a single event loop like the one that we saw in the select example. To subscribe to changes in a file descriptor, we will first create an event
. This can be done using event_new
, which takes in the event base we created, the file descriptor, flags signalling when the event is active, the callback method and arguments to the callback method. In this particular example, we ask that the acceptConnCob
method be called when the file descriptor is ready for reading (EV_READ
) and persist this event, i.e, even when the event fires, automatically add it for the next time (EV_PERSIST
is to be used here). Note that we had to add the file descriptors in every iteration of the while loop of the select example, so using the EV_PERSIST
flag is a slight convenience here. Once, I have created the event, I need to add that event to the event_base
it should be processed by, along with a timeout, using the event_add
method. If the fd doesn’t become active by the specified time, the callback will be called anyways. Finally, to get things running, we will ‘dispatch’ the event base, which will spawn a new thread to run the event loop.
Note that nowhere have I specified which backend to use. I don’t need to. However, there are ways to prefer or avoid certain backends in libevent, using the event_config
struct. Refer to the link in the end.
I can add multiple events, and there are a lot of options that you can use. For instance, I can create a timer by passing -1 as a file descriptor with the EV_TIMEOUT
and EV_PERSIST
flags, and the required timeout in event_add
. This would call the timeout callback every ‘timeout’ seconds. Example:
event* e = event_new(eb, -1, EV_TIMEOUT | EV_PERSIST, pickFortuneCookieCob, NULL);
timeval twoSec = {2, 0};
event_add(e, &twoSec);
I created a simple fortune cookie server (one of my favorite demo examples), where I have a set of messages, and if someone asks me for a fortune cookie, I will give them the current fortune cookie. Every few seconds, I will pick a new fortune cookie to be returned. This is implemented by having two events, one for accepting connections and the other for having a timer. The code is here.
One small thing to keep in mind is that if the callbacks themselves to do something heavy, then it defeats the purpose of using libevent. This is because the callbacks are called from the same thread as the actual event loop. The longer the callback runs, the longer the event loop is not able to process other events.
libevent allows you to do a lot of customizations. In the above example, I have added callbacks to override the logging mechanism, so that I can use glog (Google’s logging library). There are several other features such buffered events and a lot of utility methods (such as creating simple http servers), that you can find in the libevent book.
There are other similar async event notification systems such as libev, and libuv, etc. but I haven’t had the time to figure out the differences. I hope to cover the very interesting wrapper around libevent in folly, Facebook’s open source C++ library, in a future post.
Read more