asemanfar - a blog about programming

 

Using a bloom filter to optimize a SQL query

February 08, 2010

Several weeks ago, we started hitting some bottle necks with our database. We have 3 MySQL machines: 1 master, 1 online slave and 1 offline slave. The online slave started lagging pretty badly, about 0.46 seconds per second. This post is about one of the things I did to alleviate the load on this machine and eliminate slave lag.

The slave database was only used for a couple types of queries: comments and user search. This meant that people would post comments, see them (write-through to memcache) and then see them disappear next time they came back (they didn't exist on the slave yet). In addition, our user search page would be out-dated and individuals' details would be incorrect. In order to find out which of these queries was a bigger contributor to slave lag, I watched the output of 'show full processlist' and looked for queries that took a long time (you can alternatively enable the slow query log, but that has a bit of overhead and MySQL 5.0 requires a restart).

The offending query quickly became clear:

   1  SELECT * FROM users WHERE facebook_uid in (<long list of integers>) and installed = 1;
   2  # Installed means they have added our application on Facebook

This query is executed to find your list of Facebook friends who are also playing our game. The way MySQL executes this query is by looking up each value in the list of integers in the index on facebook_uid, which for many people is several hundred to thousands of integers, and checking the installed field on those that had a row. The biggest problem with this query is that a lot of time was wasted looking up people who were not in the database or are not installed; we couldn't avoid querying for the people who are playing the game since we needed at least a little bit of information about them to show to the user.

We needed a way to efficiently reduce this huge list of users to remove the ones who would not be part of the result set. There are 10s of millions of installed users, so we can't simply keep a set in memory and have it be faster than MySQL's index lookups. I decided to use a bloom filter, which is one of the most memory efficient ways to check set membership. In this case, the set is defined as Facebook User IDs whose installed field is equal to '1' in our database. Using a bloom filter, we were able to significantly reduce the number of users we checked in the database; and because bloom filters can't have false negatives, we'd never incorrectly exclude an installed user from that query.

There are a couple bloom filter implementations for Ruby, but neither seemed to provide a service interface. So I decided to write one: http://github.com/arya/bloom_filter. This bloom filter has been in production for a couple weeks now. It's currently doing adds in an average of 2.92 milliseconds, and filters an average of 317 integers in average 12.7 milliseconds.

The bloom filter can be used either as a service, or in process. Check out the readme for details.

ActiveRecord bug with reload and new_record

January 27, 2010

In ActiveRecord, the new_record? method returns true when the object represents a record that is not persisted. This is implemented by a trivial flag set when you initialize a new record with Model.new. The reload method in ActiveRecord calls find and passes the primary key (usually id) for that model, replaces the attributes hash with the newly retrieved (and clears some internal caches). A bug arises when you try to call reload on a new record after you've set a value for id.

After you initialize a new record, set its id, and then call reload which subsequently loads all the attributes from the database (if it exists), the AR object is left in a weird state; when you save it tries to insert despite the fact that the record knows it came from the database and will raise an error.

   1  record_a = Model.new(:some_attribute => "some value")
   2  record_b = Model.new(:some_attribute => "some other value")
   3  record_a.id = record_b.id = 1
   4  record_a.save
   5  record_b.reload # this reloads the record with the attributes from record_a
   6  record_b.save # this tries to do an insert and throws ActiveRecord::RecordNotUnique

The fix is relatively simple: set @new_record to false in the reload method.

I should explain why I'm doing this in the first place and when this would happen to you. I ran into this in a highly concurrent environment where the primary key of a record is not an auto increment column. When two processes do a select to find a record that doesn't exist yet and try to create it, they will race to create that record in the DB; losing that race means you'll get ActiveRecord::RecordNotUnique. After this patch goes through, you'll be able to call reload and continue as you were. If you are writing an application where the primary key is not auto increment, you should be aware of this race case wherever you create records of that model and rescue ActiveRecord::RecordNotUnique.

RackProctitle

January 26, 2010

The CodeRack Contest recently came to a close and RackProctitle came in 3rd place.

RackProctitle allows you to see what request a given web server process is handling. By running "watch -n 0.1 ‘ps aux | grep “USER\|rack” ’ you can see all your rack processes, what request each is currently handling and for how long (or if they’re idle), what their last request was and how long it took, how many requests have been handled in the life of this process, and the requests in queue (if it's receiving more than 1 request at a time).

It ends up being really useful to diagnose problems. Example output [screenshot]:

   1  rack/14a0b0 [1/682]: handling 0.0s /users/6014086

You can set a prefix, which will replace “rack” in the line above, and you can set an APPLICATION_VERSION constant which will be after the prefix. You can use this to distinguish between different rack processes if you run multiple applications on the same machine. In the example above, APPLICATION_VERSION is set to the first six characters of the git commit hash.

It's a gem:

   1  sudo gem install rack_proctitle

In some file somewhere in your application (optional):

   1  APPLICATION_VERSION = File.read(File.join(RAILS_ROOT, “REVISION”))[0,6] # if you use Capistrano, which creates a REVISION file for you.

In your rackup file (prefix is optional):

   1  use RackProctitle, :prefix => “application_name”

We use watch to get a continuously updating status of our rack processes:

   1  watch -n 0.1 ‘ps aux | grep “USER\|application_name” | grep -v grep ; uptime; hostname’

The watch linux utility is one of those things that is hard to Google, so here's the link if it's not installed or not in your package manager: http://procps.sourceforge.net/

Concept and output formatting is based on Alexander Staubo’s mongrel_proctitle but has more features, is thread safe, and is Rack compatible. For more information on why mongrel_proctitle is not thread safe, read this post.

As always, feature requests and bug reports always welcome.

HashWithIndifferentAccess + Ruby's Assignment Behavior

January 17, 2010

I recently ran into a weird bug related to HashWithIndifferentAccess that led to my discovery of an interesting Ruby behavior.

The relevant code:

   1  require 'rubygems'; require 'activesupport'
   2  class House
   3    def initialize
   4      @rooms = {}.with_indifferent_access
   5    end
   6  
   7    def room_details_for(room_name)
   8      @rooms[room_name] ||= {}
   9    end
  10  end
  11  
  12  house = House.new
  13  kitchen = house.room_details_for(:kitchen)
  14  kitchen[:size] = "5x10"
  15  
  16  puts kitchen.inspect #=> {:size=>"5x10"}
  17  puts house.room_details_for(:kitchen).inspect #=> {}

After a little debugging and digging through HashWithIndifferentAccess's code, it turns out the problem is a consequence of the way Ruby implements the assignment operator. The return value of an assignment expression is the return value of the right hand side, regardless of what the return value of the assignment method that ends up being called:

   1  obj = Object.new
   2  def obj.foo=(value)
   3    print "foo is called with #{value} but is returning 2, puts receives: "
   4    2
   5  end
   6  
   7  puts (obj.foo = 1) #=> foo is called with 1 but is returning 2, puts receives: 1

That, in conjunction with the fact that HashWithIndifferentAccess converts all hashes set as values into HashWithIndifferentAccess, makes the above bug possible.

   1  hash = {}.with_indifferent_access
   2  hash[:key] = {'inner_key' => 2}
   3  puts hash[:key][:inner_key] #=> 2

Note that ActiveSupport creates a new HashWithIndifferentAccess even if it is already one:

   1  hash = {}.with_indifferent_access
   2  inner_hash = (hash[:key] = {}.with_indifferent_access)
   3  hash[:key][:inner_key] = 2
   4  puts inner_hash.inspect #=> {}
   5  puts hash[:key].inspect #=> {"inner_key"=>2}

There are a couple ways to get around this bug:

  1. Always discard the return value of a HWIA assignment statement unless the right hand side is not a Hash or an Array.
  2. Patch ActiveSupport so that calling with_indifferent_access on a HWIA is a no-op and always use HWIA in favor of Hash when it's a value of another HWIA.
  3. Patch ActiveSupport to not convert values assigned to it. This would break params in Rails unless it was explicitly recursively converted to HWIA.

Oh and by the way, using send doesn't have the same behavior, but is ugly:

   1  obj = Object.new
   2  def obj.foo=(value)
   3    print "foo is called with #{value} but is returning 2, puts receives: "
   4    2
   5  end
   6  
   7  puts (obj.send(:foo=, 1)) #=> foo is called with 1 but is returning 2, puts receives: 2

I hope this helps at least one person while Googling for this issue.