Gaurav's Blog

Systems Design: Facebook TAO

TAO is a very important part of the infrastructure at Facebook. This is my attempt at summarizing the TAO paper, the blog post, and the talk by Nathan Bronson. I am referring purely to publicly available materials for this post.

Motivation

Memcache was being used as a cache for serving the FB graph, which is persisted on MySQL. Using Memcache along with MySQL as a look-aside/write-through cache makes it complicated for Product Engineers to write code modifying the graph while taking care of consistency, retries, etc. There has to be glue code to unify this, which can be buggy.

A new abstraction of Objects & Associations was created, which allowed expressing a lot of actions on FB as objects and their associations. Initially there seems to have been a PHP layer which deprecated direct access to MySQL for operations which fit this abstraction, while continuing to use Memcache and MySQL underneath the covers.

This PHP layer for the above model is not ideal, since:

  1. Incremental Updates: For one-to-many associations, such as the association between a page and its fans on FB, any incremental update to the fan list would invalidate the entire cached list.

  2. Distributed Control Logic: Control logic resides in fat clients, which is always problematic.

  3. Expensive Read After Write Consistency: Unclear to me.

TAO

TAO is a write-through cache backed by MySQL.

TAO objects have a type ($otype$), along with a 64-bit globally unique id. Associations have a type ($atype$), and a creation timestamp. Two objects can have only one association of the same type. As an example, users can be Objects and their friendship can be represented as an association. TAO also provides the option to add inverse-assocs, when adding an assoc.

Objects-Assocs in TAO

API

The TAO API is simple by design, and most calls are intuitive to understand; a rough sketch of the call shapes follows the list below.

  • assoc_add(id1, atype, id2, time, (k→v)*): Add an association of type atype from id1 to id2.
  • assoc_delete(id1, atype, id2): Delete the association of type atype from id1 to id2.
  • assoc_get(id1, atype, id2set, high?, low?): Returns assocs of type atype between id1 and members of id2set, whose creation time lies in $[low, high]$.
  • assoc_count(id1, atype): Number of assocs from id1 of type atype.
  • And a few others, refer to the paper.
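
To make the call shapes concrete, here is a toy, single-process C++ sketch. The ToyTao class, its storage layout, and the field names are all mine, purely for illustration; the real TAO is a distributed write-through cache in front of MySQL, not an in-process map.

#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// A toy, in-memory mock of the association API, only to illustrate the calls.
struct Assoc {
  uint64_t id2;
  int64_t time;
  std::map<std::string, std::string> data;  // the (k -> v)* payload
};

class ToyTao {
 public:
  void assoc_add(uint64_t id1, const std::string& atype, uint64_t id2,
                 int64_t time, std::map<std::string, std::string> data) {
    edges_[{id1, atype}][id2] = Assoc{id2, time, std::move(data)};
  }

  void assoc_delete(uint64_t id1, const std::string& atype, uint64_t id2) {
    edges_[{id1, atype}].erase(id2);
  }

  // Assocs from id1 to members of id2set whose creation time lies in [low, high].
  std::vector<Assoc> assoc_get(uint64_t id1, const std::string& atype,
                               const std::vector<uint64_t>& id2set,
                               int64_t high, int64_t low) {
    std::vector<Assoc> out;
    const auto& m = edges_[{id1, atype}];
    for (uint64_t id2 : id2set) {
      auto it = m.find(id2);
      if (it != m.end() && it->second.time >= low && it->second.time <= high)
        out.push_back(it->second);
    }
    return out;
  }

  uint64_t assoc_count(uint64_t id1, const std::string& atype) {
    return edges_[{id1, atype}].size();
  }

 private:
  // (id1, atype) -> { id2 -> assoc }: at most one assoc per (id1, atype, id2).
  std::map<std::pair<uint64_t, std::string>, std::map<uint64_t, Assoc>> edges_;
};

For instance, a user liking a page could be an assoc_add of a hypothetical likes atype from the user's id to the page's id, with an inverse-assoc added in the other direction.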

As per the paper:

TAO enforces a per-atype upper bound (typically 6,000) on the actual limit used for an association query.

This is also probably why the maximum number of friends you can have on FB is 5000.

Architecture

There are two important factors in the TAO architecture design:

  1. On FB the aggregate consumption of content (reads), is far more than the aggregate content creation (writes).
  2. The TAO API is such that, to generate a newsfeed story (for example), the web-server will need to do the dependency resolution on its own, and hence will require multiple round-trips to the TAO backend. This further amplifies reads as compared to writes, bringing the read-write ratio to 500:1, as mentioned in Nathan’s talk.

The choice of being okay with multiple round-trips to build a page, while wanting to ensure a snappy product experience, imposes the requirement that:

  1. Each of these read requests should have a low read latency (cannot cross data-center boundaries for every request).
  2. The read availability is required to be pretty high.

Choice of Backing Store

The underlying DB is MySQL, and the TAO API is mapped to simple SQL queries. MySQL had been operated at FB for a long time, and backups, bulk imports, async replication, etc. were well understood internally. MySQL also provides atomic write transactions, and has few latency outliers.

Sharding / Data Distribution

Objects and Associations are in different tables. Data is divided into logical shards. Each shard is served by a database.

Quoting from the paper:

In practice, the number of shards far exceeds the number of servers; we tune the shard to server mapping to balance load across different hosts.

And it seems like the sharding trick we credited to Pinterest might have been used by FB first :-)

Each object id contains an embedded shard id that identifies its hosting shard.

The above setup means that an object's shard id is decided when the id is assigned and does not change. An assoc is stored in the shard belonging to its id1.
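
Since the shard id is embedded in the object id, routing is just a matter of pulling those bits out and consulting the shard-to-server mapping. The paper does not describe the exact bit layout, so the field width and names below are assumptions, only meant to illustrate the idea.

#include <cstdint>

// Illustrative only: assume the low 16 bits of the 64-bit object id carry the
// shard id (the real bit layout is not described in the paper).
constexpr uint64_t kShardBits = 16;
constexpr uint64_t kShardMask = (1ULL << kShardBits) - 1;

uint64_t ShardOf(uint64_t object_id) { return object_id & kShardMask; }

// An assoc (id1, atype, id2) lives on the shard of its id1.
uint64_t ShardOfAssoc(uint64_t id1) { return ShardOf(id1); }

// The shard-to-server mapping is a separate, tunable table, so load can be
// rebalanced across hosts without touching any ids:
//   int server = shard_to_server[ShardOf(object_id)];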

Consistency Semantics

TAO needs to provide “read-what-you-wrote” consistency semantics for writers, and eventual consistency otherwise.

Leader-Follower Architecture

TAO is set up with multiple regions, and user requests hit the region closest to them. The diagram below illustrates the caching architecture. TAO Leader-Follower Setup

There is one ‘master’ region and several ‘slave’ regions. Each region has a complete copy of the databases, with ongoing async replication from the master to the slave region(s). In each region, there are groups of machines called ‘followers’, where each individual follower group caches and serves read requests for the entire domain of the data. Clients are sticky to a specific group of followers.

In each region, there is a group of leaders, with one leader for each shard. Read requests are served by the followers; cache misses are forwarded to the leaders, which in turn either answer from their own cache or query the DB.

Write requests are forwarded to that shard’s leader in the local region. If the current region is a slave region, the request is further forwarded to the leader of that shard in the master region.

The leader sends cache-refill/invalidation messages to its followers, and to the slave leader, if the leader belongs to the master region. These messages are idempotent.
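
A rough sketch of that write path, as I understand it from the paper; every name below is mine, and the functions are stand-in stubs rather than real TAO components.

#include <cstdint>

// Sketch of the write path. Only the routing logic reflects my reading of the
// paper; all names are illustrative.
struct Write { uint64_t shard; /* plus the object/assoc mutation itself */ };

void ForwardToRegionalLeader(const Write&) {}
void ForwardToMasterLeader(const Write&) {}
void WriteToMasterDb(const Write&) {}
void SendRefillOrInvalidateToFollowers(const Write&) {}
void NotifySlaveLeader(const Write&) {}

// Followers never write the DB themselves; they forward to their region's
// leader for the shard.
void FollowerHandleWrite(const Write& w) { ForwardToRegionalLeader(w); }

void LeaderHandleWrite(const Write& w, bool in_master_region) {
  if (!in_master_region) {
    // A slave-region leader defers to the master-region leader for this shard.
    ForwardToMasterLeader(w);
    return;
  }
  WriteToMasterDb(w);                     // the authoritative write
  SendRefillOrInvalidateToFollowers(w);   // idempotent refill/invalidate messages
  NotifySlaveLeader(w);                   // the slave leader then updates its region
}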

The way this is set up, reads can never be stale at the leaders in the master region. Followers in the master region, the slave leaders, and by extension the slave followers, might serve stale reads. The authors mention an average replication lag of 1s between master and slave DBs, though they don’t mention whether this is same-coast / cross-country / trans-atlantic replication.

When the leader fails, the reads go directly to the DB. The writes to the failed leader go through a random member in the leader tier.

Read Availability

There are multiple places to read from, which increases read availability. If the follower the client is talking to dies, the client can talk to some other follower in the same region. If all followers are down, the client can talk directly to the leader in the region. If the leader also fails, the client contacts the DB in the current region, or followers / leaders in other regions.

Read Availability in TAO

Performance

These are some client-side observed latency and hit-rate numbers in the paper.

Client-observed latency and hit-rates in TAO

The authors report a failure rate of $4.9 \times 10^{-6}$, which is five 9s of availability! One caveat mentioned in the paper: because of the ‘chained’ nature of TAO requests, an initial failed request means that its dependent requests are never attempted to begin with.

TAO Summary

Comments

  • This, again, is a relatively easy paper to read; I could understand most of it in 3 readings. It helped that there is a talk and a blog post about it, which makes the material easier to grasp.

  • I liked that the system is designed around a simple API, and focuses on making those calls as fast as possible. Complex operations have not been built into the API, and eventual consistency is fine for a lot of the use cases.

  • There is no transactional support. If we have assocs and inverse assocs (for example likes_page and page_liked_by edges), we would ideally want to remove both atomically. However, it is possible that the assoc in one direction was removed, but there was a failure to remove the assoc in the other direction. These dangling pointers are removed by an async job, as per the paper, so clients have to be okay with this.

  • From the Q&A after the talk, Nathan Bronson mentions that there exists a flag in the calls which can be set to force a cache miss / stronger consistency guarantees. This could be especially useful in certain use-cases such as blocking / privacy settings.

  • Pinterest’s Zen is inspired by TAO and implemented in Java. Interestingly, apart from the standard feed / graph use-case, it also powers messaging at Pinterest. It is built on top of HBase, with a MySQL backend in development as of 2014. I have not gone through the talk, and have only cursorily seen the slides, but they seem to have been working on Compare-And-Swap style calls as well.


Systems Design: Twitter Search

In the near future, I might be posting about some systems I am reading about. Let’s start with Twitter Search.

EarlyBird Paper

We start by reading the EarlyBird paper.

The paper starts by laying out core design principles. Low latency and high throughput are obvious. The ability to surface real-time tweets was the unique requirement for Twitter at the time the paper was written, i.e., new tweets should be immediately searchable. Regarding this second requirement: in the past, search engines would crawl the web / index their documents periodically, and the indices were built via batch jobs using technologies such as MapReduce.

Since the time the paper was authored (Fall 2011), this has changed. Both Google and Facebook surface real-time results in their search results and feeds. But arguably a large fraction of Twitter’s core user intent is real-time content, so they have to get it right.

The core of the paper starts by going over the standard fan-out architecture for distributed systems, with replication & caching for distributing query evaluation and then aggregating results. Then they focus specifically on what goes on in a single node while evaluating a query.

For IR newbies: An inverted index maintains something called ‘posting lists’. Consider them to be something like map<Term, vector<Document>> in C++, i.e., a map from a Term to a list of documents. If I am querying for the term beyonce, I’ll look up the posting list for this term, which gives me the list of documents containing it.

Of course, there can be millions of documents with such a popular term, so querying is usually done in two phases. In the first phase, we do a cheap evaluation over these documents. This is usually achieved by pre-computing some sort of quality score, such as PageRank (which is independent of the query and searcher), and keeping the documents in each list sorted in descending order of this quality score. Then at query time, we take the top $N$ candidates from this list.

Once we have the candidates, we run the second phase, which involves a more expensive ranking of these candidates, to return a set of ranked results according to all the signals (query, searcher and document specific features) that we care about.
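
Here is a minimal sketch of this two-phase flow over a toy inverted index, assuming posting lists pre-sorted by a query-independent quality score; the types and the (placeholder) expensive scorer are made up for illustration.

#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy two-phase retrieval: phase 1 takes the first N entries of a quality-
// sorted posting list, phase 2 reranks just those candidates.
struct Posting { uint64_t doc_id; double quality; };
using InvertedIndex = std::unordered_map<std::string, std::vector<Posting>>;

// Stand-in for the full query/searcher/document scoring model.
double ExpensiveScore(uint64_t /*doc_id*/, const std::string& /*query*/) {
  return 0.0;
}

std::vector<uint64_t> Search(const InvertedIndex& index, const std::string& term,
                             size_t n_candidates, size_t k_results) {
  auto it = index.find(term);
  if (it == index.end()) return {};
  const std::vector<Posting>& list = it->second;

  // Phase 1: cheap candidate generation -- top-N by precomputed quality.
  size_t n = std::min(n_candidates, list.size());

  // Phase 2: expensive reranking of just those candidates.
  std::vector<std::pair<double, uint64_t>> scored;
  for (size_t i = 0; i < n; ++i)
    scored.push_back({ExpensiveScore(list[i].doc_id, term), list[i].doc_id});
  std::sort(scored.begin(), scored.end(),
            [](const auto& a, const auto& b) { return a.first > b.first; });

  std::vector<uint64_t> results;
  for (size_t i = 0; i < std::min(k_results, scored.size()); ++i)
    results.push_back(scored[i].second);
  return results;
}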

EarlyBird Overview

EarlyBird is based on Lucene (a Java-based open-source search library). Each tweet is assigned a static score on creation, and a resonance score (likes, retweets) which is live-updated. Upon querying, the static score, the resonance score, and a personalization score, which is computed according to the searcher’s social graph, are used to rank the tweets.

At the time the paper was written, they state that the latency between tweet creation and its searchability was ~10s. Their posting lists store documents in chronological order, and at query time these lists are traversed in reverse chrono order (most recent tweet first).

For querying, they re-use Lucene’s and, or, etc. operators. Lucene also supports positional queries (you can ask Lucene to return documents which contain both terms A and B, with the two at most D positions apart in the document).

Index-Organization

EarlyBird seems to handle the problem of concurrent reads and writes to an index shard by splitting the shard into ‘segments’. All segments but one are read-only. The mutable segment continues to receive writes until it ‘fills up’, at which time it becomes immutable. This is analogous to the ‘memtable’ architecture of LSM trees. But I wonder if they do any sort of compaction on the segments; this is not clearly explained here.

Layout for Mutable (Unoptimized) Index: Then they discuss the problem of how to add new tweets to posting lists. Their posting lists at the time were supposed to return reverse-chrono results, so they don’t use any sort of document score to order the lists; the tweet timestamp is what they want for ordering.

Appending at the end of posting lists doesn’t gel well with delta-encoding schemes, since those naturally work with forward traversal, while here traversal would have to go backwards. Prepending at the beginning of the lists using naive methods such as linked lists would be cache-unfriendly, and would require additional memory for the next pointers.

They fall back to using arrays. The posting list is an array of 32-bit integer values, where they reserve 24 bits for the document id and 8 bits for the position of the term in the document. 24 bits is sufficient because, firstly, they map global tweet ids to a local document id in that posting list, and secondly, their upper limit on how many tweets can go in a segment is < $2^{23}$. Though I might want to keep additional meta-data about a document, and not just the position of the term, so this is a little too specific to tweets as of the time the paper was authored.
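
The packing itself is simple bit twiddling; here is a sketch of how such a 24 + 8 split could look (the exact field order and layout inside EarlyBird may differ).

#include <cstdint>

// Pack a segment-local document id (24 bits) and a term position (8 bits)
// into a single 32-bit posting. The field order here is an assumption.
inline uint32_t PackPosting(uint32_t local_doc_id, uint32_t position) {
  // Assumes local_doc_id < 2^24 and position < 2^8.
  return (local_doc_id << 8) | (position & 0xFFu);
}

inline uint32_t DocId(uint32_t posting) { return posting >> 8; }
inline uint32_t Position(uint32_t posting) { return posting & 0xFFu; }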

They also keep pools of pre-allocated arrays, or slices (of sizes $2^1$, $2^4$, $2^7$ and $2^{11}$), similar to how a Buddy allocator works. When a posting list exhausts its allocated slice, they allocate another one that is 8x bigger, until the $2^{11}$ size is reached. There is some cleverness in linking these slices together; if you can get this linking to work, you do not have to do an $O(N)$ copy of your data as you outgrow your current slice.
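
A tiny sketch of that growth policy, assuming the four pool sizes quoted above; the function name and the exact hand-off rules are my guesses.

#include <cstdint>

// Slice pools hold slices of 2^1, 2^4, 2^7 and 2^11 postings. When a posting
// list outgrows its current slice, the next slice comes from the next bigger
// pool; after the largest size it keeps taking 2^11-sized slices, linked
// together. Illustrative only.
inline uint32_t NextSliceSize(uint32_t current) {
  switch (current) {
    case 0:       return 1u << 1;   // a brand-new posting list
    case 1u << 1: return 1u << 4;
    case 1u << 4: return 1u << 7;
    default:      return 1u << 11;  // 2^7 and beyond: largest pool
  }
}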

Layout for Immutable (Optimized) Index: The approach of pools is obviously not always efficient. We can end up wasting ~50% of the space if the number of documents for a term is pathologically chosen. In the optimized index, the authors have a concept of long and short posting lists. Short lists are the same as in the unoptimized index.

The long lists consist of blocks of 256 bytes each. The first four bytes of a block hold the first posting uncompressed. The remaining bytes store the document id delta from the previous entry and the position of the term, in compressed form. I wonder why they don’t apply this compression to the entire list, and instead have compressed blocks? My guess is that compressing the entire list would prohibit random access.

Concurrency: Most of the heavy lifting for consistency between the readers and the writer of a specific posting list is done by keeping a per-posting-list value of the maximum document id encountered so far (maxDoc). Declaring this value as volatile in Java introduces a memory barrier, so that there is consistency without giving up too much performance.

Memory Barrier in EarlyBird
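
The paper’s mechanism is a Java volatile; since the snippets on this blog are C++, here is my sketch of the analogous single-writer / many-reader pattern, using an atomic publication point with release/acquire ordering. This is not EarlyBird’s code, just the shape of the idea.

#include <atomic>
#include <cstdint>

// Single-writer / many-reader slice of a posting list, pre-allocated as in the
// unoptimized index. The writer fills an entry and only then publishes the new
// high-water mark with a release store; readers acquire-load it and never read
// past it. EarlyBird publishes maxDoc through a Java volatile in the same
// spirit; the names here are mine.
class PostingSlice {
 public:
  static constexpr uint32_t kCapacity = 1u << 11;

  bool Append(uint32_t posting) {           // called by the single writer
    uint32_t n = published_.load(std::memory_order_relaxed);
    if (n == kCapacity) return false;       // slice full; caller links a new one
    entries_[n] = posting;                  // write the data first...
    published_.store(n + 1, std::memory_order_release);  // ...then publish it
    return true;
  }

  // Readers may safely traverse entry(0) .. entry(ReadableCount() - 1).
  uint32_t ReadableCount() const {
    return published_.load(std::memory_order_acquire);
  }
  uint32_t entry(uint32_t i) const { return entries_[i]; }

 private:
  uint32_t entries_[kCapacity] = {};
  std::atomic<uint32_t> published_{0};
};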

Overall

The paper was very easy to read. I would have hoped that the authors had described how the switch from mutable to immutable index happens, how the index is persisted to disk, etc., apart from addressing the rigid structure of the meta-data in each posting-list entry (just the term position).

Omnisearch and Improvements on EarlyBird

There are a couple of new posts about improvements on top of EarlyBird.

Introducing Omnisearch

This blogpost introduces Omnisearch. As I mentioned earlier, EarlyBird is strongly tied to the tweet’s content. In mature search systems, there are several “verticals”, which the infra needs to support. This blogpost describes how they are moving to a generic infrastructure which can be used to index media, tweets, users, moments, periscopes, etc.

Omnisearch Index Formats

Here is the blogpost on this topic. It goes over what is mentioned in the paper before describing their contributions. They mostly work on the optimized index, as described earlier.

If a term occurs multiple times in a document, the document occurs that many times in the old posting-list format. In the new format, they keep (document, count) pairs instead of (document, position) pairs, and keep a separate table for positions. To further optimize, since most counts are 1, they store (document, count - 1) pairs. They achieve a 2% space saving and a 3% latency drop. I’m not entirely convinced why this improves both for a tweet-text-only index.

However, for terms which are not present in the document text (such as in user indices, where we want to keep a term for verified users), position does not make any sense. For those, a separate position table helps, because we can skip it entirely.
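
A sketch of the difference between the two layouts as I read the blogpost; the struct names and field widths are made up for illustration.

#include <cstdint>
#include <vector>

// Old layout: one (document, position) posting per occurrence of the term.
struct OldPosting { uint32_t doc_id; uint8_t position; };

// New layout: one (document, count - 1) posting per document, with positions
// moved to a side table that can be skipped entirely for terms where position
// is meaningless (e.g. a "verified user" term in the user index).
struct NewPosting { uint32_t doc_id; uint8_t count_minus_one; };

struct PostingListV2 {
  std::vector<NewPosting> postings;
  std::vector<uint8_t> positions;  // flattened positions; empty when unused
};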

Super-Root

Super-Root is another layer on top of Twitter’s index servers, which exposes a single API to customers, instead of having them query individual indices themselves.

Super-Root allows them to abstract away lower-level changes, add features like quota limitations, allow query optimization, and allow having thin clients. This is pretty essential when you start having a large number of customers.


Next Binary Permutation: Bitwise Hackery

Here is a puzzle that is fairly standard:

Given an array of elements, find the lexicographically next permutation of that array.

As an example, if the array is [1, 2, 2, 3], the lexicographically next permutation would be [1, 2, 3, 2], followed by [1, 3, 2, 2], and so on. There is a really neat article explaining how to solve this. If you don’t know how to do it, I encourage you to try examples with arrays of size 4 and above, and try to see the patterns.
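
For reference, the C++ standard library ships this exact operation as std::next_permutation; a minimal example:

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  std::vector<int> a = {1, 2, 2, 3};
  // Each call advances a to its lexicographically next permutation.
  std::next_permutation(a.begin(), a.end());  // a is now {1, 2, 3, 2}
  std::next_permutation(a.begin(), a.end());  // a is now {1, 3, 2, 2}
  for (int x : a) std::printf("%d ", x);      // prints: 1 3 2 2
  std::printf("\n");
  return 0;
}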

A recent problem I was solving was a variant of this.

Given an unsigned integer, find the lexicographically next unsigned integer, with the same number of bits set.

It’s trivial to observe that we can reduce this problem to the one above, by viewing the integer as an array with just two kinds of elements, 0s and 1s.

Alright. Are we done? Not quite. The solution mentioned in the link is an $O(n)$ solution. Technically, an unsigned integer would be 32 or 64 bits. So $n$ would be one of those numbers for the purpose of the problem. It should be easy to repurpose the algorithm in the article mentioned above for our case. But I wanted to see if we can do it with as few operations as possible, because looping when operating with binary numbers is just not cool :-)

I chanced upon Sean Anderson’s amazing bitwise hacks page, and I distinctly remember having seen this 7-8 years ago, when I was in Mumbai. Regardless, if you understand his solution: Awesome! It took me some time to figure out, so I wrote an arguably easier-to-comprehend solution, which is about 50% slower than his in my micro-benchmark, but still better than the naive looping solution. So here goes.

Example

Let’s pick an example: $10011100$. The next permutation would be $10100011$.

As per the article above, we find the longest non-decreasing run going from right to left (called the longest non-increasing suffix, reading left to right, in the article). In this case, it will be $11100$. Thus, the example is made of two parts: $100.11100$ ($.$ for separating).

The first zero before the suffix is the ‘pivot’, further breaking down the example: $10.0.11100$.

We swap the rightmost 1 in the suffix with this pivot (the ‘rightmost successor’ in the article). So the example becomes $10.1.11000$. However, the suffix part needs to be sorted: so far we were permuting with the prefix $10.0.$, but this is the first permutation with the prefix $10.1.$, and therefore the suffix should be its own smallest permutation (basically, it should be sorted).

Hence, we sort the suffix by pushing its 1s to the least-significant end, resulting in $10.1.00011$. This is the correct next permutation.

Solution

#include <cstdint>

uint64_t next(uint64_t n) {
  // Find the first unset bit above the trailing run of 1s (the pivot).
  // Use the 64-bit __builtin_ctzll, since n is a uint64_t.
  uint64_t pivotBit = __builtin_ctzll((n | (n - 1)) + 1);

  // Set the pivot bit.
  uint64_t step1 = n | (1ULL << pivotBit);

  // Unset the first set bit (successor).
  uint64_t step2 = step1 & (step1 - 1);

  // Extract the part after the pivot.
  uint64_t suffixMask = (1ULL << pivotBit) - 1;
  uint64_t suffix = step2 & suffixMask;

  // Zero out the suffix, and push its 1s (if any) to the least-significant end.
  // __builtin_ctzll is undefined for 0, hence the guard for an empty suffix.
  uint64_t sortedSuffix = suffix ? (suffix >> __builtin_ctzll(suffix)) : 0;
  uint64_t final = (step2 & ~suffixMask) | sortedSuffix;
  return final;
}
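
A quick sanity check of next() against the worked example above, assuming the function is in scope:

#include <cassert>
#include <cstdint>
#include <cstdio>

int main() {
  uint64_t v = 0b10011100;   // 10011100 -> 10100011, as in the worked example
  uint64_t w = next(v);
  assert(w == 0b10100011);
  std::printf("%llx -> %llx\n",
              (unsigned long long)v, (unsigned long long)w);
  return 0;
}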

We’ll break it down. Trust me, it’s easier to grasp than the terse wisdom of Sean Anderson’s snippet.

Code-Walkthrough

// Find the first unset bit above the trailing run of 1s (the pivot).
uint64_t pivotBit = __builtin_ctzll((n | (n - 1)) + 1);

To find the pivot in the example $n = 10011100$, we set all the bits in the suffix to 1 first.

$n-1$ will set the trailing zeroes to 1, i.e., $10011011$, and $n | n-1$ would result in a value with all original bits set, along with the trailing zeroes set to 1, i.e., $10011111$. Thus, all the bits in the suffix are now set.

We now set the pivot bit to 1 (and unset the entire suffix) by adding 1. Using __builtin_ctzll (the 64-bit variant of __builtin_ctz) we can then find the number of trailing zeroes, which is the same as the bit index of the pivot. See the note below about the __builtin functions.

// Set the pivot bit.
uint64_t step1 = n | (1ULL << pivotBit);

We then proceed to set the pivot. Since the value of $n$ was $10011100$, $step1 = 10111100$.

// Unset the first set bit (successor).
uint64_t step2 = step1 & (step1 - 1);

Now we need to unset the successor, which we can do with a trick similar to how we found the pivot. step1 - 1 unsets the least significant set bit (the successor) and sets all of its trailing zeroes (i.e., $10111011$). step1 & (step1 - 1), i.e., $10111100 \& 10111011$, zeroes out the successor bit and the trailing zeroes. Hence, $step2 = 10111000$.

// Extract the part after the pivot.
uint64_t suffixMask = (1ULL << pivotBit) - 1;
uint64_t suffix = step2 & suffixMask;

This is fairly straightforward: we build the suffix mask, i.e., a mask in which all the bits corresponding to the suffix are set, and AND-ing it with the number so far ($10.1.11000$) gives us the modified suffix, i.e., $11000$.

uint64_t sortedSuffix = suffix ? (suffix >> __builtin_ctzll(suffix)) : 0;
uint64_t final = (step2 & ~suffixMask) | sortedSuffix;
return final;

All we need to do now is to pull the 1s in the suffix down to the least-significant end. This is done by right-shifting the suffix by its number of trailing zeroes, giving us $00011$ (the sortedSuffix above; the guard covers the case where the suffix has no set bits at all, since __builtin_ctzll is undefined for 0). This is the ‘sorted’ suffix we mentioned earlier.

Then we OR it with the number so far, with the suffix part zeroed out, so that we replace the unsorted suffix with this sorted one, i.e., $10100000 | 00011 \implies 10100011$.

I hope this helped you break down what’s going on, and perhaps served as a bridge between the naive solution and one-liner bitwise hacks.

Please leave any comments below. I’d be super happy to discuss any alternative solutions!

Appendix: __builtin Functions

Whenever you are working with bitwise operations, gcc provides built-in functions such as __builtin_popcount (number of set bits in the argument) and __builtin_ctz (number of trailing zeroes in the argument; undefined if the argument is 0), along with ll-suffixed variants (__builtin_popcountll, __builtin_ctzll) for 64-bit arguments. These functions map to fast hardware instructions, and are also concise to write in code, so I use them whenever I can.
