Thursday, March 12, 2015

Cassandra schema migrations made easy with Apache Spark

By far the most common question I get asked when talking about Cassandra is once you've denormalised based on your queries what happens if you were wrong or a new requirement comes in that requires a new type of query.

First I always check that it is a real requirement to be able to have this new functionality on old data. If that's not the case, and often it isn't, then you can just start double/triple writing into the new table.

However if you truly need to have the new functionality on old data then Spark can come to the rescue. The first step is to still double write. We can then backfill using Spark. The awesome thing is that nearly all writes in Cassandra are idempotent, so when we backfill we don't need to worry about inserting data that was already inserted via the new write process.

Let's see an example. Suppose you were storing customer events so you know what they are up to. At first you want to query by customer/time so you end up with following table:

Then the requirement comes in to be able to look for events by staff member. My reaction a couple of years ago would have been something like this:

However if you have Spark workers on each of your Cassandra nodes then this is not an issue.

Assuming you want to a new table keyed by staff_id and have modified your application to double write you do the back fill with Spark. Here's the new table:

Then open up a Spark-shell (or submit a job) with the Spark-Cassandra connector on the classpath and all you'll need is something like this:

How can a few lines do so much! If you're in a shell obviously you don't even need to create a SparkContext. What will happen here is the Spark workers will process the partitions on a Cassandra node that owns the data for the customer table (original table) and insert it back into Cassandra locally. Cassandra will then handle the replication to the correct nodes for the staff table.

This is the least network traffic you could hope to achieve. Any solution that you write your self with Java/Python/Shell will involve pulling the data back to your application and pushing it to a new node, which will then need to replicate it for the new table.

You won't want to do this at a peak time as this will HAMMER you Cassandra cluster as Spark is going to do this quickly. If you have a small DC for just running the Spark jobs and let it asynchronously replicate to your operational DC this is less of a concern.