Using a bloom filter to optimize a SQL query
February 08, 2010Several weeks ago, we started hitting some bottle necks with our database. We have 3 MySQL machines: 1 master, 1 online slave and 1 offline slave. The online slave started lagging pretty badly, about 0.46 seconds per second. This post is about one of the things I did to alleviate the load on this machine and eliminate slave lag.
The slave database was only used for a couple types of queries: comments and user search. This meant that people would post comments, see them (write-through to memcache) and then see them disappear next time they came back (they didn't exist on the slave yet). In addition, our user search page would be out-dated and individuals' details would be incorrect. In order to find out which of these queries was a bigger contributor to slave lag, I watched the output of 'show full processlist' and looked for queries that took a long time (you can alternatively enable the slow query log, but that has a bit of overhead and MySQL 5.0 requires a restart).
The offending query quickly became clear:
1 SELECT * FROM users WHERE facebook_uid in (<long list of integers>) and installed = 1; 2 # Installed means they have added our application on Facebook
This query is executed to find your list of Facebook friends who are also playing our game. The way MySQL executes this query is by looking up each value in the list of integers in the index on facebook_uid, which for many people is several hundred to thousands of integers, and checking the installed field on those that had a row. The biggest problem with this query is that a lot of time was wasted looking up people who were not in the database or are not installed; we couldn't avoid querying for the people who are playing the game since we needed at least a little bit of information about them to show to the user.
We needed a way to efficiently reduce this huge list of users to remove the ones who would not be part of the result set. There are 10s of millions of installed users, so we can't simply keep a set in memory and have it be faster than MySQL's index lookups. I decided to use a bloom filter, which is one of the most memory efficient ways to check set membership. In this case, the set is defined as Facebook User IDs whose installed field is equal to '1' in our database. Using a bloom filter, we were able to significantly reduce the number of users we checked in the database; and because bloom filters can't have false negatives, we'd never incorrectly exclude an installed user from that query.
There are a couple bloom filter implementations for Ruby, but neither seemed to provide a service interface. So I decided to write one: http://github.com/arya/bloom_filter. This bloom filter has been in production for a couple weeks now. It's currently doing adds in an average of 2.92 milliseconds, and filters an average of 317 integers in average 12.7 milliseconds.
The bloom filter can be used either as a service, or in process. Check out the readme for details.
Comments
Leave a Comment