How to Process Large Data Sets with Ruby
09 Aug 2015

The need for data migrations in mature systems is real. At times, requests for these migrations can appear at random. One minute, a system is behaving as specified, happily fulfilling requests, and then bam! All the user objects suddenly need an extremely crucial attribute. Well, that seems relatively simple, right? All that is needed is a simple Ruby script that iterates over all users and updates each one with this essential piece of data.
To demonstrate such a problem, we can assume the following:
A Ruby on Rails application exists with a User class, and each user has a phone_number attribute.
class User < ActiveRecord::Base
validates_presence_of :phone_number
end
The application is relatively popular, which results in 2,000,000 users:
User.count
#=> 2000000
Finally, the requested task:
Add a “+1” to the beginning of all user phone numbers (we are assuming that all phone numbers belong to users in the USA or Canada).
Disclaimer: Ruby is probably not the best tool for this kind of data migration, but for argument's sake we can assume it is the only one available.
Serial scripts at scale are slow
Without giving it too much thought, an approach to solve this problem might look something like:
# Update each user's phone number, one record at a time.
User.find_each do |user|
  user.phone_number = "+1#{ user.phone_number }"
  user.save!
end
This approach will work. The find_each method loads users in batches (1,000 by default) rather than all at once, so the memory footprint of the script stays low, and the phone numbers will be updated.
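The batch size is also tunable if memory or database round-trips become a concern. A minimal sketch, assuming the same User model:

# Pull users from the database in batches of 500 instead of the default 1,000.
User.find_each(batch_size: 500) do |user|
  user.phone_number = "+1#{ user.phone_number }"
  user.save!
end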
However, this will be painfully slow. Even if the system is able to update 20 users per second, it will take roughly 28 hours to complete.

2,000,000 users / 20 users per second = 100,000 seconds
100,000 seconds / 60 seconds per minute ~= 1,667 minutes
1,667 minutes / 60 minutes per hour ~= 27.8 hours
Resque to the Rescue
Resque is a very useful Ruby library for creating background jobs. Redis is used as the storage for these jobs, and individual Resque workers pick jobs off a queue one at a time.
With a small amount of code reorganization, Resque enables processing multiple users at the same time. The basic idea behind this approach is divide and conquer.
0. Install Resque and Redis
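A minimal setup sketch, assuming a standard Rails app with Bundler and a Redis server already running locally:

# Gemfile -- Resque brings in the redis client as a dependency.
gem 'resque'

# lib/tasks/resque.rake -- expose Resque's rake tasks and load the Rails
# environment before a worker starts, so models like User are available.
require 'resque/tasks'
task 'resque:setup' => :environment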
1. Extract phone number updates into a new class
This new class will accept a single user and update its phone_number.
class PhoneUpdater
  def initialize(user)
    @user = user
  end

  # Prefix the user's phone number with "+1" and persist the change.
  def update_phone!
    @user.phone_number = "+1#{ @user.phone_number }"
    @user.save!
  end
end
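Isolating the logic this way also makes it easy to exercise by hand from a Rails console. A quick check, assuming a user whose number is "5551234567" (an illustrative value):

user = User.first
PhoneUpdater.new(user).update_phone!
user.phone_number
#=> "+15551234567"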
2. Create a Resque Job to use this logic
A Resque job class needs to conform to a simple API: a queue must be defined, and the class must have a class method named perform. The arguments given to perform are the same ones passed to Resque when a job is enqueued.
class PhoneUpdaterJob
  @queue = :phone_updates

  # Workers call this with the arguments used when the job was enqueued.
  def self.perform(user_id)
    user = User.find(user_id)
    updater = PhoneUpdater.new(user)
    updater.update_phone!
  end
end
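Because perform is an ordinary class method, the job can be smoke-tested synchronously before anything is enqueued, with no Redis or worker involved. Assuming a user with id 1 exists:

# Runs the job inline, bypassing the queue entirely.
PhoneUpdaterJob.perform(1)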
3. Create a script to enqueue Resque jobs
Iterate over the user set and enqueue a job per user. The second argument to Resque.enqueue is the id of the user to process.
User.find_each do |user|
  Resque.enqueue(PhoneUpdaterJob, user.id)
end
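Since only the id is needed here, a small optimization is to avoid hydrating full User objects while enqueuing. A sketch, assuming a Rails version where select narrows the columns loaded:

# Load only the id column; each batched User instance stays lightweight.
User.select(:id).find_each do |user|
  Resque.enqueue(PhoneUpdaterJob, user.id)
end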
4. Run many parallel Resque workers
This is where the fun happens. By initializing many different Resque worker processes, they will all read from the phone_updates queue and process users in parallel. The QUEUE specified matches what was defined in PhoneUpdaterJob.
# Open a terminal in the root of your project...
QUEUE=phone_updates rake resque:work

# Open another terminal in the root of your project...
# (each rake resque:work blocks its terminal, so run one worker per terminal)
QUEUE=phone_updates rake resque:work

# Open another terminal in the root of your project...
QUEUE=phone_updates rake resque:work

# etc.
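Resque also ships a resque:workers task that forks several workers from one command (its own docs note it is mainly intended for development). A sketch, assuming five workers is within the system's limits:

# Fork 5 workers, all reading from the phone_updates queue.
COUNT=5 QUEUE=phone_updates rake resque:workers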
5. Try not to overload the system
Now with 2,000,000 queued Resque jobs, each additional Resque worker drastically decreases the execution time of the overall task. However, system constraints should not be ignored: database connection saturation, CPU usage, and memory usage should all factor into the decision of how many Resque workers to run at once.
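Resque exposes basic queue statistics that help judge whether the workers are keeping up. A quick check from a console (the numbers shown are illustrative):

Resque.info
#=> {:pending=>1853042, :processed=>146958, :workers=>10, ...}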
Parallelism is awesome
Although the parallel solution is not as simple as its serial counterpart, the benefits are extremely apparent. The parallel solution is not without its weaknesses, but they do not make it invalid. It iterates over the entire user set twice (once to enqueue the jobs, and once across the workers to process them), but parallelizing the slow phone_number update queries makes up for that inefficiency ten-fold.