Saturday, May 25, 2013

Time series based queries in Cassandra 1.2+ and CQL3


Cassandra is fantastic for storing large amounts of data, and as of 1.2/CQL3 working with time series data just got a lot easier. Here is a basic example that stores some kind of posts (e.g blog) that can be queried by username and a time period.

Everything is assuming you have Cassandra 1.2+ installed and are using CQL3 with the cqlsh python client. Here are the exact versions I used for the below example:

[cqlsh 3.0.2 | Cassandra 1.2.5 | CQL spec 3.0.0 | Thrift protocol 19.36.0]

Creating a keyspace:
create keyspace cassandraspike WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 1};

The syntax has changed for creating keyspaces in CQL for cassandra 1.2 so if this fails check you aren't running against a pre-1.2 version of cassandra.
use cassandraspike;

Then create a posts table:
create table posts(username varchar, time timeuuid, post_text varchar, primary key(username, time))

We've used a compound primary key that is the username and the time. The time is of type timeuuid which is a new data type in CQL3 that us a type 1 UUID that contains a timestamp so it both stores the time of the post and makes our rows unique.

If you're familier with column families then it might interest you to know that the first column in your primary key becomes the column family (CF) row key. Every subsequent part of the primary key becomes part of the CF column names in that CF row. This has two implications:
  • There will only be as many CF rows as there are variations of the first element in your primary key. This can be a problem if this element has a very low cardinality as you can end up with very wide CF rows.
  • CF columns are stored in order so the fact that each CF column is prefixed with the time means that your data is stored in the order it happened. Making it very efficient to be queried by time.
Hopefully you are sufficiently convinced that your data is going to be stored in a way that is efficient to query by time. So let's store some data in and query it by user name and time.

To insert we'll make use if the now() function which gives you a timeuuid for the current time:
insert into posts(username, time, post_text) values ('chbatey', now(), 'i am writing something about cassandra');

Selecting this back works but gives you a time which isn't too human read able:
select * from posts;

 username | time                                 | post_text
----------+--------------------------------------+----------------------------------------
  chbatey | 59ad61d0-c540-11e2-881e-b9e6057626c4 | i am writing something about cassandra


That's where the dateOf function comes in handy which converts a timeuuid into a date:
select username, dateOf(time), post_text from posts;

 username | dateOf(time)             | post_text
----------+--------------------------+----------------------------------------
  chbatey | 2013-05-25 14:38:14+0100 | i am writing something about cassandra

I've now inserted another post at a different time:
select username, dateOf(time), post_text from posts;


username | dateOf(time)             | post_text
----------+--------------------------+----------------------------------------
  chbatey | 2013-05-25 14:38:14+0100 | i am writing something about cassandra
  chbatey | 2013-05-25 14:40:35+0100 | i am writing something about cassandra


Now lets say you are only interested in the posts chbatey made at 14:38:
select username, dateOf(time), post_text from posts where time >= minTimeuuid('2013-05-25 14:38') and time < minTimeuuid('2013-05-25 14:39') and username = 'chbatey';

 username | dateOf(time)             | post_text
----------+--------------------------+----------------------------------------
  chbatey | 2013-05-25 14:38:14+0100 | i am writing something about cassandra

This query is more complicated. It makes use of another cql function: minTimeuuid. The minTimeuuid function gives a fake (as it isn't unique) timeuuid that is the smallest possible timeuuid for the given time. This is very hand when you want to do less than/greater than queries on timeuuid fields.

In the above query we are getting everything that is greater than the minimum timeuuid for the time 2013-05-25 14:38 but less than the minimum timeuuid of the next minute 2013-05-25 14:39.

Rather than that we could have used the very similar function maxTimeuuid function with 2013-05-24 14:39:59. However I find original one easier to understand as I read it as greater than or equal to 14:38:00 and less than 14:39:00.




Friday, May 24, 2013

Installing Cassandra on Mac OS X

I've recently posted some more tips on using Cassandra on Mac OSX: Cassandra on Mac

If you don't already have homebrew then install it from here.

Then it as simple as:

brew install cassandra

This doesn't install the python driver for the cqlsh command line tool. To do this install it first install python if you haven't got it already:

brew install python

This should have also installed pip - the python package manager - so you can then install the cql python module:

pip install cql

Now try and start cqlsh

You might get this:


Python CQL driver not installed, or not on PYTHONPATH.
You might try "easy_install cql".

One second didn't I just install the cql module?

This could be because the Python in your path is the Mac OS X version. Not the version you installed with home brew that has cql. I fixed this by adding /usr/local/bin to the start of my PATH as that is where the brew Python executable lives:

export PATH=/usr/local/bin/:$PATH

Unless you've started cassandra the next time you try cqlsh you'll get:


cqlsh
Connection error: Could not connect to localhost:9160

Now if you do a brew info on cassandra:

brew info cassandra


To have launchd start cassandra at login:
    ln -sfv /usr/local/opt/cassandra/*.plist ~/Library/LaunchAgents
Then to load cassandra now:
    launchctl load ~/Library/LaunchAgents/homebrew.mxcl.cassandra.plist

Unless you are going to use cassandra a lot I wouldn't set it to load on startup as it does use a reasonable amount of memory. Instead to just start it off:

launchctl load /usr/local/opt/cassandra/homebrew.mxcl.cassandra.plist 

Finally cqlsh should connect to cassandra:



cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 3.0.2 | Cassandra 1.2.5 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
Use HELP for help.
cqlsh> 

Or if you prefer the older cassandra-cli interface to cassandra:


cassandra-cli
Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.2.5

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown


All done.