How to Process Large Data Sets with Ruby

The need for data migrations in mature systems is real. At times, requests for these migrations can appear at random. One minute, a system is behaving as specified, happily fulfilling requests, and then bam! All the user objects suddenly need an extremely crucial attribute. Well, that seems relatively simple, right? All that is needed is a simple Ruby script that iterates over all users and updates each one with this essential piece of data.

To demonstrate such a problem, we can assume the following:

A Ruby on Rails application exists with a User model, and each user has a phone_number attribute.

class User < ActiveRecord::Base
  validates_presence_of :phone_number
end

The application is relatively popular, which results in 2,000,000 users:

User.count
#=> 2000000

Finally, the requested task:

Add a “+1” to the beginning of all user phone numbers (we are assuming that all phone numbers belong to users in the USA or Canada).

Disclaimer: Ruby is probably not the best tool for this kind of data migration but for argument’s sake we can assume it is the only one available.

Serial scripts at scale are slow

Without giving it too much thought, an approach to solve this problem might look something like:

User.find_each do |user|
  user.phone_number = "+1#{ user.phone_number }"

  user.save!
end

This approach will work. The find_each method keeps the script's memory footprint low (it does not load every user into memory at once), and the phone numbers will be updated.
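For reference, find_each fetches users in batches behind the scenes (1,000 rows per query by default), and the batch size can be tuned. A minimal sketch, assuming smaller batches are desired:

User.find_each(batch_size: 500) do |user|
  # Each query now pulls 500 users at a time instead of the default 1,000.
end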

However, this will be painfully slow. Even if the system is able to update 20 users per second, it will take nearly 28 hours to complete.

2,000,000 users / 20 users per second
= 100,000 seconds to process all users
100,000 seconds / 60 seconds per minute
~= 1,667 minutes
1,667 minutes / 60 minutes per hour
~= 28 hours

Resque to the Rescue

Resque is a very useful Ruby library for creating background jobs. It uses Redis as the storage for these jobs, and individual Resque workers pop jobs off a queue one at a time.

With a small amount of code reorganization, Resque enables processing multiple users at the same time. The basic idea behind this approach is divide and conquer.

0. Install Resque and Redis
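Installation details are out of scope here, but a typical setup (assuming Bundler and a Redis server already running locally) looks roughly like this:

# Gemfile
gem 'resque'

# Rakefile
require 'resque/tasks'

# Load the Rails environment before workers start,
# so jobs can reference application classes like User.
task 'resque:setup' => :environment

Hooking resque:setup to the :environment task is the conventional way to make Rails models visible to Resque workers.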

1. Extract phone number updates into a new class

This new class will accept a single user and update its phone_number.

class PhoneUpdater
  def initialize(user)
    @user = user
  end

  def update_phone!
    @user.phone_number = "+1#{ @user.phone_number }"

    @user.save!
  end
end
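As a sanity check, the class can be exercised by hand in a Rails console before wiring it into Resque (the phone number shown is hypothetical):

user = User.first
PhoneUpdater.new(user).update_phone!

user.reload.phone_number
#=> "+15555550123"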

2. Create a Resque Job to use this logic

A Resque job class needs to conform to a simple API: it must define a queue, and it must have a class method named perform. The arguments given to perform are the same ones passed to Resque when the job is enqueued.

class PhoneUpdaterJob
  @queue = :phone_updates

  def self.perform(user_id)
    user = User.find(user_id)
    updater = PhoneUpdater.new(user)
    updater.update_phone!
  end
end
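Resque serializes job arguments to JSON before storing them in Redis, which is why the job accepts a user id instead of a full User object. A single job can be enqueued from a Rails console to verify the wiring end to end:

Resque.enqueue(PhoneUpdaterJob, User.first.id)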

3. Create a script to enqueue Resque jobs

Iterate over the user set and enqueue one job per user. The second argument to Resque.enqueue is the id of the user to process.

User.find_each do |user|
  Resque.enqueue(PhoneUpdaterJob, user.id)
end
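As a small optimization (a sketch, not a requirement), selecting only the id column keeps this pass from hydrating full User records just to read their ids:

User.select(:id).find_each do |user|
  # Only the id column is loaded, so enqueueing 2,000,000 jobs stays cheap.
  Resque.enqueue(PhoneUpdaterJob, user.id)
end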

4. Run many parallel Resque workers

This is where the fun happens. Start many Resque worker processes; each one will read from the phone_updates queue, so users are processed in parallel. The QUEUE environment variable matches the queue defined in PhoneUpdaterJob.

# Open a terminal in the root of your project...
QUEUE=phone_updates rake resque:work

# Open another terminal in the root of your project...
QUEUE=phone_updates rake resque:work

# Open another terminal in the root of your project...
QUEUE=phone_updates rake resque:work

# etc.
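Opening one terminal per worker gets tedious quickly. Resque also ships a resque:workers task that forks multiple workers from a single command, with COUNT controlling how many; it is generally recommended for development use only, since killing the parent rake process kills every worker:

# Fork 8 workers at once (the COUNT value here is just an example)
QUEUE=phone_updates COUNT=8 rake resque:workers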

5. Try not to overload the system

Now, with 2,000,000 queued Resque jobs, each additional Resque worker drastically decreases the overall execution time of the task. However, system constraints should not be ignored: database connection saturation, CPU usage, and memory usage should all factor into the decision of how many Resque workers to run at once.
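Progress can be monitored from a Rails console while the workers churn; Resque exposes queue sizes and aggregate stats (the numbers below are hypothetical):

Resque.size(:phone_updates)
#=> 1483201

Resque.info[:pending]
#=> 1483201

Resque.info[:processed]
#=> 516799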

Parallelism is awesome

Although the parallel solution is not as simple as its serial counterpart, the benefits are extremely apparent. The parallel solution is not without its weaknesses, but they do not make it invalid. It iterates over the entire user set twice (once to enqueue jobs, once to process them), but parallelizing the slow phone_number update queries makes up for that inefficiency many times over.