Wednesday, October 2, 2013

Installing Cassandra 2.0 on Ubuntu

Update your apt source list with the following:

sudo vim /etc/apt/sources.list

#Add at the bottom
deb 20x main
deb-src 20x main

Run an apt-get update. 

sudo apt-get update

This will give you a warning about not being able to verify the signatures of the apache repos:

GPG error: unstable Release:
The following signatures couldn't be verified because the public key is not available:

Now do the following for that key:

gpg --keyserver --recv-keys 4BD736A82B5C1B00
gpg --export --armor 4BD736A82B5C1B00 | sudo apt-key add -

Also add this one:

gpg --keyserver --recv-keys 2B5C1B00
gpg --export --armor 2B5C1B00 | sudo apt-key add -

Now run apt-get update again.
sudo apt-get update

The error should be gone. Now check that all is working and Ubuntu can see Cassandra 2.0:

apt-cache showpkg cassandra
Package: cassandra

Great! Now install it:

sudo apt-get install cassandra

Now start it:

sudo service cassandra start
xss = -ea -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1001M -Xmx1001M -Xmn100M -XX:+HeapDumpOnOutOfMemoryError -Xss256k

Now check that you can connect using cqlsh:
Connected to Test Cluster at localhost:9160.
[cqlsh 4.0.1 | Cassandra 2.0.1 | CQL spec 3.1.1 | Thrift protocol 19.37.0]
Use HELP for help.

Where is everything?

  • Logs: /var/log/cassandra
  • Config: /etc/cassandra/
  • Data: /var/lib/cassandra


Tuesday, October 1, 2013

Using Cassandra on Mac OSX

I posted some time ago about installing Cassandra on Mac OSX. Admittedly I generally use Linux when dealing with Cassandra, but I have recently been using it on Mac OSX again, so here are some tips for working with Cassandra on Mac OSX.

Install it with homebrew 

It's easy! The only reason for not using homebrew is if you want a specific version. I have an old blog post on installing it with homebrew here: install Cassandra on Mac OSX. If you want 1.2 rather than 2.0 read below first.

The default formula for Cassandra is now 2.0. If you aren't that cutting edge and want to stick to Cassandra 1.2 then you need to do some tinkering. First run brew update and tap the versions repository:

brew update
brew tap homebrew/versions

Now let's see what we get for cassandra:

brew search cassandra
cassandra      cassandra-0.6  cassandra12

Homebrew have kindly created three formulas you can work with: 0.6, 1.2 and the latest (currently 2.0). If you want 1.2 simply do:

brew install cassandra12 

Use this instead of brew install cassandra. By default the brew-installed Cassandra uses the same config/data locations for 1.2 and 2.0, so you can't (without work) use brew to manage multiple versions of Cassandra on your Mac. If you want that, you should probably use VMs instead.

Cassandra is installed: Where is everything?

All of this applies regardless of whether you're on Cassandra 1.2 or Cassandra 2.0. Package managers are great but sometimes they leave you baffled as to where they put everything!

Where's my Cassandra yaml and other property files? /usr/local/etc/cassandra

Where are my logs? /usr/local/var/log/cassandra/
  • This can be updated by modifying /usr/local/etc/cassandra/
Where's the data/commit log etc (you may need to delete this when playing with different versions / partitioners) ? /usr/local/var/lib/cassandra/data

How do I stop and start Cassandra?

If you're used to unix services/init.d etc you'll want to know how to start/stop Cassandra without the kill command. On Mac this is launchd using the launchctl utility. Assuming you installed Cassandra using homebrew use the following commands:

launchctl start homebrew.mxcl.cassandra
launchctl stop homebrew.mxcl.cassandra

That's a lot of typing so I tend to alias these in my profile, e.g.

alias stop_cassandra="launchctl stop homebrew.mxcl.cassandra"
alias start_cassandra="launchctl start homebrew.mxcl.cassandra"

Cassandra: Datastax Java driver retry policy

The Datastax Java Driver for Cassandra exposes its strategy for retrying via the RetryPolicy interface.

There are three scenarios you can control retry policy for:
  1. Read timeout: When a coordinator received the request and sent the read to replica(s) but the replica(s) did not respond in time
  2. Write timeout: As above but for writes
  3. Unavailable: When the coordinator is aware there aren't enough replicas available without sending the read/write request on

What is the default behaviour?

The DefaultRetryPolicy retries with the following behaviour:
  1. Read timeout: When enough replicas are available but the data did not come back within the configured read timeout
  2. Write timeout: Only if the initial phase of a batch write times out - see cassandra batch statement
  3. Unavailable: Never
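
The three default decisions can be sketched in plain Java. This is only a stand-in: DefaultRetrySketch, Decision, WriteType and the method signatures are hypothetical names invented for illustration, not the driver's own types, and the "retry at most once" rule is an assumption on top of the list above.

```java
// Hypothetical stand-in for the default behaviour described above.
// None of these types come from the Datastax driver.
class DefaultRetrySketch {
    enum Decision { RETRY, RETHROW }
    enum WriteType { SIMPLE, BATCH, UNLOGGED_BATCH, BATCH_LOG }

    // Read timeout: retry only when enough replicas replied but the data itself did not arrive.
    static Decision onReadTimeout(int requiredResponses, int receivedResponses,
                                  boolean dataRetrieved, int retriesSoFar) {
        if (retriesSoFar > 0) {
            return Decision.RETHROW; // assumption: never retry more than once
        }
        boolean enoughReplied = receivedResponses >= requiredResponses;
        return (enoughReplied && !dataRetrieved) ? Decision.RETRY : Decision.RETHROW;
    }

    // Write timeout: retry only when the initial (batch log) phase of a batch write timed out.
    static Decision onWriteTimeout(WriteType writeType, int retriesSoFar) {
        if (retriesSoFar > 0) {
            return Decision.RETHROW; // assumption: never retry more than once
        }
        return writeType == WriteType.BATCH_LOG ? Decision.RETRY : Decision.RETHROW;
    }

    // Unavailable: never retry.
    static Decision onUnavailable() {
        return Decision.RETHROW;
    }

    public static void main(String[] args) {
        System.out.println(onReadTimeout(2, 2, false, 0));
        System.out.println(onWriteTimeout(WriteType.SIMPLE, 0));
        System.out.println(onUnavailable());
    }
}
```

Keeping the decision functions free of state mirrors the point made later in this post that the shipped policies are stateless.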

How do I configure the value for the read and write timeout?

This is configured in the cassandra.yaml on the Cassandra server. The default is 10 seconds; you can change the following properties:
# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 10000
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 10000

What are the other policies?


DowngradingConsistencyRetryPolicy

The most complicated retry policy, and it comes with a big warning: your read/write may be retried at a lower consistency. So if you have business requirements not to report success unless you meet a certain level of consistency, use this with caution.

What does it do?
  1. Read: If at least one replica responded then the read is retried at a lower consistency
  2. Write: Retries for unlogged batch queries when at least one replica responded; for all other types of writes the timeout is just ignored if at least one replica acknowledged the write (essentially ignoring the consistency request)
  3. Unavailable: If at least one replica is available then the query is retried with a lower consistency
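
To make "retried at a lower consistency" concrete, here is a small sketch of how a lower level could be picked from the number of replicas known to be alive. DowngradeSketch, Level and lowerLevelFor are hypothetical names, and the exact mapping (three or more gives THREE, two gives TWO, one gives ONE) is my assumption about a reasonable downgrade rule rather than something taken from the driver source.

```java
import java.util.Optional;

// Hypothetical sketch: choose the highest consistency level that the
// replicas known to be alive could still satisfy.
class DowngradeSketch {
    enum Level { ONE, TWO, THREE }

    // Returns the level to retry at, or empty when no replica is known to be
    // alive, in which case the failure should go back to the client.
    static Optional<Level> lowerLevelFor(int replicasKnownAlive) {
        if (replicasKnownAlive >= 3) {
            return Optional.of(Level.THREE);
        }
        if (replicasKnownAlive == 2) {
            return Optional.of(Level.TWO);
        }
        if (replicasKnownAlive == 1) {
            return Optional.of(Level.ONE);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A QUORUM read with RF = 3 where only one replica answered.
        System.out.println(lowerLevelFor(1));
    }
}
```

The empty result is the important case: with no replica known to be alive there is nothing to downgrade to.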


FallthroughRetryPolicy

No retrying! Any failure is rethrown to the client.


LoggingRetryPolicy

This is just a decorator policy that you can wrap around any other policy. It logs every ignored failure (no retry) and every actual retry. The driver uses SLF4J and logs at INFO level.

How do I use a different policy?

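Simply add it when creating your Cluster. The retry policies all have a singleton you can use, for example by passing DowngradingConsistencyRetryPolicy.INSTANCE to withRetryPolicy on the Cluster builder.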


The Datastax driver is very open for extension as it exposes its strategies for retry, load balancing and reconnection. 

The retry policy is very easy to work with as all the current implementations are stateless. I'll follow this post up with how to implement your own retry policy.

Wednesday, September 25, 2013

Talk: Introduction to Cassandra, CQL, and the Datastax Java Driver

Presenter: Johnny Miller
Who is he? Datastax Solutions Architect
Where? Skills matter exchange in London

I went to an "introductory" talk even though I have a lot of experience with Cassandra for a few reasons:
  • Meet other people in London that are using Cassandra
  • To discover what I don't know about Cassandra
Here are my notes, in roughly the same order as the talk.

What's Cassandra? The headlines

  • Been around for ~5 years - originally developed by Facebook for their inbox search
  • Distributed key store - column orientated data model
  • Tuneable consistency - per request decide how consistent you want the response to be
  • Datacenter aware with asynchronous replication
  • Designed for use as a cluster - not much value in a single node Cassandra deployment

Gossip - how nodes in a cluster learn about other nodes

  • P2P protocol for how nodes discover location and state of other nodes
  • New nodes are given seed nodes for bootstrapping - but these aren't single points of failure as they aren't used again

Data distribution and replication

  • Replication factor: How many nodes each piece of data is stored on
  • Each node is given a range of primary keys to look after

Partitioners - How to decide which node gets what data

  • Row keys are hashed to decide node then a replication strategy defines how to pick the other replicas

Replicas - how to select where else the data lives

  • All replicas are equally important. No difference between the node the key hashed to and the other replicas that were selected
  • Two ways to pick the other replicas:
    • Simple: Only single DC. Specify just a replication factor. Hashes the key and then walks the cluster and picks the replicas. Not very clever - all replicas could end up on the same rack
    • Network: Configure with a RF per DC. Walk the ring for each DC until it reaches a node in another rack
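
The simple strategy is easy to picture in code. The sketch below is a toy model only: RingSketch, its small integer tokens and the node names are all made up, there are no virtual nodes, and real partitioners hash row keys into a far larger token space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of the simple strategy: find the node owning the key's token,
// then keep walking the ring clockwise until enough replicas are picked.
class RingSketch {
    // token -> node name, kept sorted by token
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(int token, String node) {
        ring.put(token, node);
    }

    List<String> replicasForToken(int keyToken, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) {
            return replicas;
        }
        int wanted = Math.min(replicationFactor, ring.size());
        // The first node whose token is >= the key's token owns the key.
        Integer token = ring.ceilingKey(keyToken);
        if (token == null) {
            token = ring.firstKey(); // wrap around the ring
        }
        while (replicas.size() < wanted) {
            replicas.add(ring.get(token));
            token = ring.higherKey(token);
            if (token == null) {
                token = ring.firstKey(); // wrap around the ring
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        RingSketch ring = new RingSketch();
        ring.addNode(0, "A");
        ring.addNode(25, "B");
        ring.addNode(50, "C");
        ring.addNode(75, "D");
        System.out.println(ring.replicasForToken(30, 3));
    }
}
```

Nothing in the walk looks at racks, which is exactly the weakness noted above: all replicas could end up on the same rack.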

Snitch - how to define a data centre and a rack

  • Informs Cassandra about node topology, designates DC and Rack for every node
  • Example: Rack inferring snitch designates DC and Rack based on the IP of the node 
  • Example: Property file snitch where every node has the DC and Rack of every other node
  • Example: GossipingPropertyFileSnitch: Every node knows its own DC and Rack and tells other nodes via Gossip
  • Dynamic snitching: monitors performance of reads, this snitch wraps the other snitches to respond to network latency
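
As an illustration of the rack inferring idea, here is a sketch that reads the data centre and rack out of an IPv4 address. It assumes the convention that the second octet names the DC and the third octet names the rack; RackInferringSketch and its methods are hypothetical names, not Cassandra classes.

```java
// Hypothetical sketch of inferring topology from an IPv4 address,
// assuming: second octet = data centre, third octet = rack.
class RackInferringSketch {
    static String datacenter(String ipv4) {
        return ipv4.split("\\.")[1];
    }

    static String rack(String ipv4) {
        return ipv4.split("\\.")[2];
    }

    public static void main(String[] args) {
        String node = "10.20.30.40"; // made-up address
        System.out.println("DC " + datacenter(node) + ", rack " + rack(node));
    }
}
```

This only works when the network addressing scheme was designed with the convention in mind, which is why the property file based snitches exist.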

Client requests

  • Connect to any node in the cluster - it becomes the coordinator. This node knows which nodes to talk to for the request
  • Multi DC - picks a coordinator in the other data centre to replicate data there or to get data for a read


  • Quorum = (Replication Factor/2) + 1 i.e. more than half
  • E.g. RF = 3, Q = 2: can tolerate 1 replica going down and continue reading and writing at Quorum
  • Per request consistency - can decide certain writes are more important and require higher consistency than others
  • Example consistency levels: ANY, ONE, TWO, THREE, QUORUM, EACH_QUORUM, LOCAL_QUORUM
  • SERIAL: New in Cassandra 2.0
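
The quorum arithmetic above is just integer division. A minimal sketch, with QuorumSketch and its method names being my own:

```java
// Quorum = (replication factor / 2) + 1, using integer division.
class QuorumSketch {
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // How many replicas can be down while QUORUM reads and writes still succeed.
    static int toleratedFailures(int replicationFactor) {
        return replicationFactor - quorum(replicationFactor);
    }

    public static void main(String[] args) {
        for (int rf = 1; rf <= 5; rf++) {
            System.out.println("RF=" + rf + " quorum=" + quorum(rf)
                    + " tolerated=" + toleratedFailures(rf));
        }
    }
}
```

Note that an even replication factor buys no extra tolerance over the odd number below it: RF = 4 needs 3 replicas for a quorum and so still tolerates only one failure.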

Write requests - what happens?

  • The coordinator (node the client connects to) forwards the write to all the replicas in the local DC and designates a coordinator in the other DCs to do the same there
  • The coordinator may be a replica but does not need to be
  • On a single node a write first goes to the commit log (disk), then to the memtable (memory)
  • When does the write succeed? Depends on consistency e.g a write consistency of ONE means that the data needs to be in the commit log and memtable of at least one replica

Hinted handoff - how Cassandra deals with nodes being down on write

  • Coordinator node keeps hints if one of the replicas is down
  • When the node comes back up the hints are then sent to the node so it can catch up
  • Hints are kept for a finite amount of time - default is three hours

Read requests  - what happens?

  • Coordinator contacts a number of nodes depending on the consistency - once enough have responded the read can be successful 
  • Will send requests to the nodes responding the fastest
  • If not consistent - compare timestamps + do a read repair
  • Possible other background read repair

What was missing?

Overall it was a great talk; however, here are some possible improvements:
  • A glossary/overview at the start? Perhaps a mapping from relational terminology to Cassandra terminology. For example the term keyspace was used a number of times before describing what it is
  • Overview of consistency when talking about eventual consistency - however this did come later? A few scenarios for when read/writes at different consistency levels would fail/succeed would have been very helpful 
  • Compaction required for an intro talk? I thought talking about compaction was a bit too much for an introductory talk as you need to understand memtables and sstables before it makes sense
  • The downsides of Cassandra: for example some forms of schema migration/change are a nightmare when you are using CQL3 + have data you need to migrate

Sunday, September 22, 2013

Scala, MongoDB and Casbah: Dealing with Arrays

Get hold of a collection object using something like this:

scala> val collection = MongoClient()("test")("messages")
collection: com.mongodb.casbah.MongoCollection = messages

Where test is the database and messages is the name of the collection.

Inserting arrays is nice and easy, just build up your MongoDBObject with Lists inside:

scala> collection.insert(MongoDBObject("message" -> "Hello World") ++ ("countries" -> List("England","France","Spain")))
res18: com.mongodb.casbah.Imports.WriteResult = { "serverUsed" : "/" , "n" : 0 , "connectionId" : 234 , "err" :  null  , "ok" : 1.0}

Use your favourite one liner to print all the objects in the collection:

scala> collection.foreach(println(_))
{ "_id" : { "$oid" : "523f145e30041dae32fd04da"} , "message" : "Hello World" , "countries" : [ "England" , "France" , "Spain"]}

Now let's say you want a list of objects. Simply create a list of MongoDBObjects:

scala> collection.insert(MongoDBObject("message" -> "A list of objects?") ++ ("objects" -> List(MongoDBObject("name" -> "England"),MongoDBObject("name" -> "France"))))
res20: com.mongodb.casbah.Imports.WriteResult = { "serverUsed" : "/" , "n" : 0 , "connectionId" : 234 , "err" :  null  , "ok" : 1.0}

scala> collection.foreach(println(_))
{ "_id" : { "$oid" : "523f145e30041dae32fd04da"} , "message" : "Hello World" , "countries" : [ "England" , "France" , "Spain"]}
{ "_id" : { "$oid" : "523f14b530041dae32fd04db"} , "message" : "A list of objects?" , "objects" : [ { "name" : "England"} , { "name" : "France"}]}

Now reading them back out of Mongo and processing the array items individually. First we can get a hold of an object that contains an array:

scala> val anObjectThatContainsAnArrayOfObjects = collection.findOne().get
anObjectThatContainsAnArrayOfObjects: collection.T = { "_id" : { "$oid" : "523f145e30041dae32fd04da"} , "message" : "Hello World" , "countries" : [ "England" , "France" , "Spain"]}

The extra get is on the end as we used the findOne method this time and it returns an Option. Then we can get just the array field:

scala> val mongoListOfObjects = anObjectThatContainsAnArrayOfObjects.getAs[MongoDBList]("countries").get
mongoListOfObjects: com.mongodb.casbah.Imports.MongoDBList = [ "England" , "France" , "Spain"]

Now we have a handle on a MongoDBList which represents our array in Mongo. The MongoDBList is Iterable so we can loop through and print it out:

scala> mongoListOfObjects.foreach( country => println(country) )

Or map it to a sequence of Strings:

scala> val listOfCountries =
listOfCountries: scala.collection.mutable.Seq[String] = ArrayBuffer(England, France, Spain)

scala> listOfCountries

res24: scala.collection.mutable.Seq[String] = ArrayBuffer(England, France, Spain)

Friday, September 20, 2013

Scala and MongoDB: Getting started with Casbah

The officially supported Scala driver for Mongo is Casbah. Casbah is a thin wrapper around the Java MongoDB driver that gives it a Scala-like feel. As long as you ignore all the MongoDBObjects it feels much more like being in the Mongo shell or working in Python than working with Java/Mongo.

All the examples are copied from a Scala REPL launched from an SBT project with Casbah added as a dependency.

So let's get started by importing the Casbah package:

scala> import com.mongodb.casbah.Imports._ 
import com.mongodb.casbah.Imports._

Now let's create a connection to a locally running Mongo and use the "test" database:

scala> val mongoClient = MongoClient()
mongoClient: com.mongodb.casbah.MongoClient = com.mongodb.casbah.MongoClient@2acf0276
scala> val database = mongoClient("test")
database: com.mongodb.casbah.MongoDB = test

And now let's get a reference to the messages collection:

scala> val collection = database("messages")
collection: com.mongodb.casbah.MongoCollection = messages

As you can see Casbah makes heavy use of the apply method to give relatively nice boilerplate connection code. To print all the rows for a collection you can use the find method, which returns an iterator (there are none at the moment):

scala> collection.find().foreach(row => println(row) )

Now let's insert some data using the insert method and then find and print it:

scala> collection.insert(MongoDBObject("message" -> "Hello world"))
res2: com.mongodb.casbah.Imports.WriteResult = { "serverUsed" : "/" , "n" : 0 , "connectionId" : 225 , "err" :  null  , "ok" : 1.0}

scala> collection.find().foreach(row => println(row) )
{ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello world"}

And adding another document:

scala> collection.insert(MongoDBObject("message" -> "Hello London"))
res4: com.mongodb.casbah.Imports.WriteResult = { "serverUsed" : "/" , "n" : 0 , "connectionId" : 225 , "err" :  null  , "ok" : 1.0}

scala> collection.find().foreach(row => println(row) )
{ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello world"}
{ "_id" : { "$oid" : "523aa6bf30048ee48f49c334"} , "message" : "Hello London"}

The familiar findOne method is there. Rather than the iterator returned from find, findOne returns an Option, so you can use a basic pattern match to handle the document being there or not:

scala> val singleResult = collection.findOne()
singleResult: Option[collection.T] = Some({ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello world"})

scala> singleResult match {
     |   case None => println("No messages found")
     |   case Some(message) => println(message)
     | }
{ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello world"}

Now let's query using the ID of an object we've inserted (querying by any other field is the same). Here helloWorld is assumed to hold a document fetched earlier, for example singleResult.get:

scala> val query = MongoDBObject("_id" -> helloWorld.get("_id"))
query: com.mongodb.casbah.commons.Imports.DBObject = { "_id" : { "$oid" : "523aa69a30048ee48f49c333"}}

scala> collection.findOne(query)
res12: Option[collection.T] = Some({ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello world"})

We can also update the document in the database and then get it again to prove it has changed:

scala> collection.update(query, MongoDBObject("message" -> "Hello Planet"))
res13: com.mongodb.WriteResult = { "serverUsed" : "/" , "updatedExisting" : true , "n" : 1 , "connectionId" : 225 , "err" :  null  , "ok" : 1.0}

scala> collection.findOne(query)
res14: Option[collection.T] = Some({ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello Planet"})

The remove method works in the same way, just pass in a MongoDBObject for the selection criterion.

Not look Scalary enough for you? You can also insert using the += method:

scala> collection += MongoDBObject("message"->"Hello England")
res15: com.mongodb.WriteResult = { "serverUsed" : "/" , "n" : 0 , "connectionId" : 225 , "err" :  null  , "ok" : 1.0}

scala> collection.find().foreach(row => println(row))
{ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello Planet"}
{ "_id" : { "$oid" : "523aa6bf30048ee48f49c334"} , "message" : "Hello London"}
{ "_id" : { "$oid" : "523c911230048ee48f49c335"} , "message" : "Hello England"}

How do you build a more complex document in Scala? Simply use the MongoDBObject ++ method. For example, we can create an object with multiple fields, insert it, then view it by printing all the documents in the collection:

scala> val moreThanOneField = MongoDBObject("message" -> "I'm coming") ++ ("time" -> "today") ++ ("Name" -> "Chris")
moreThanOneField: com.mongodb.casbah.commons.Imports.DBObject = { "message" : "I'm coming" , "time" : "today" , "Name" : "Chris"}

scala> collection.insert(moreThanOneField)
res6: com.mongodb.casbah.Imports.WriteResult = { "serverUsed" : "/" , "n" : 0 , "connectionId" : 234 , "err" :  null  , "ok" : 1.0}

scala> collection.find().foreach(println(_) )
{ "_id" : { "$oid" : "523aa69a30048ee48f49c333"} , "message" : "Hello Planet"}
{ "_id" : { "$oid" : "523aa6bf30048ee48f49c334"} , "message" : "Hello London"}
{ "_id" : { "$oid" : "523c911230048ee48f49c335"} , "message" : "Hello England"}
{ "_id" : { "$oid" : "523c96b530041dae32fd04d6"} , "message" : "I'm coming" , "time" : "today" , "Name" : "Chris"}

Saturday, September 14, 2013

Git + Tig: Viewing Git history from command line

Git is great. The output from git log is not!

One of my favourite tools to work with git is tig. Among other things tig is a great repository/history viewer for git. If you're on Mac OSX and have homebrew setup then all you need to do to install tig is:
brew install tig

Then go to a git repo on your computer and run tig and you get great output like:

You can see individual commits and where each remote branch is up to. In the above screenshot I am on the master branch and I can see that my remote branch heroku is four commits behind as is origin.

You can also see the diff for any commit:

Here I've selected an individual commit and tig shows me all the changes.

Tig also has a very nice status view. So if you like the above tree view try tig status.

Scala: What logging library to use?

Having investigated a few options I have decided on SLF4J + Logback + Grizzled. The project I am currently on uses Scalatra - this matches their solution.

Three libraries?? That sounds like overkill but it is in fact very simple.

SLF4J + Logback are commonplace in Java projects. SLF4J is a logging facade - basically a set of interfaces to program to - whereas Logback is an implementation you put on your classpath at runtime. There are other implementations of SLF4J you can use, such as Log4J. Grizzled is a Scala wrapper around SLF4J to give it Scala-like usage.

Having worked on many Java projects that use SLF4J + Logback I'm used to seeing lines at the top of files that look like this:
private static final Logger LOGGER = LoggerFactory.getLogger(SomeClass.class);
Fortunately Grizzled-SLF4J + traits help here.

Mixing in the Grizzled Logging trait allows you to write logging code like this:

This will produce logs like this:
09:57:11.169 [main] INFO com.batey.examples.SomeClass - Some information
09:57:11.172 [main] ERROR com.batey.examples.SomeClass - Something terrible
The grizzled trait uses your class name as the logger name. Job done with less boilerplate than Java!

Everything you need is on Maven Central so can be added to your pom or sbt dependencies:
Grizzled: Grizzled
Logback: Logback

Sunday, September 1, 2013

Scala: SBT OutOfMemoryError: PermGen space

When starting out with Scala/SBT I very quickly ran into perm gen issues followed by SBT crashing:

java.lang.OutOfMemoryError: PermGen space

Especially when running the console from within interactive mode.

To fix/brush this under the carpet add the following to your profile (.bashrc / .bash_profile) and source it again (run . ~/.bashrc)

export SBT_OPTS=-XX:MaxPermSize=256m

Saturday, August 31, 2013

Book Review: Driving Technical Change: Why People on Your Team Don't Act on Good Ideas, and How to Convince Them They Should

Author: Terrence Ryan

Before reading a technical book I tend to look at what the author is about. Terrence works for Adobe and has an active GitHub account, where it looks like he works primarily with front-end technologies: CSS, JavaScript.


I categorise books as follows:
  1. Reference - useful for a particular technology when actively using it.
  2. Tutorial/teach - useful for learning about a technology, easy to read even if not currently using it. For example beginners guides for technologies. 
  3. General technical books - describing methodologies, practices and general feel good technical books. Books like Clean Code by Uncle Bob fall into this category.
Driving technical change definitely falls into category number three. I read it while commuting and got through it in a couple of hours. It is split into the following sections:
  1. Describing the type of people that typically oppose change. These are: 
    1. The unaware: don't want the change because they are ignorant of the new technology/process. These people are easy to get on your side if you help them.
    2. The burned: don't want change because they have previously been burned by the technology/process in question.
    3. The cynic: prefer to look clever by rejecting change. These people tend to have only superficial knowledge, so if you are well prepared you can get them on your side.
    4. The irrational: people who don't want change for an irrational reason e.g they have a personal problem with you. Best to avoid trying to convince these people.
  2. Techniques for driving technical change. This section gives various techniques and indicates which type of person they are effective against. I found this section of the book eventually became quite repetitive.
  3. Finally strategies: how to use the techniques, and in which order, to actually drive the technical change you want.
Overall I think this book is worth a read as it is very quick to get through and can make you think about whether your colleagues fit into any of the stereotypes. Or perhaps more importantly do you yourself fall into one of them when other people are trying to drive change?

Reading the book made me realise that if I think a team needs to adopt a new technology or process then it is up to me to put in the work to prove why. Too often people in the workplace try to introduce new technologies without being able to answer the simple questions: What problem is it solving? How is it going to improve the software we're producing?

Sunday, August 18, 2013

Groovy MockFor: mocking multiple invocations of the same method with different arguments

Coming at Groovy from a Java background it wasn't long before I wanted to see what mocking was like in Groovy. I'm fond of Mockito's flavour of mocking where I set up my mocks to behave appropriately, poke the class under test, then verify the correct invocations have been made on the class under test's dependencies. That said, I have nothing against the style where you set up expectations and then fail at the end if they haven't been met.

So Groovy MockFor puzzled me for a while. Let's look at an example of testing a very simple Service class that is required to process entries in a map and store each of them:

And the Datastore looks like this:

Now let's say we want to test that when the Service is initialised we "open" the Datastore. This can be achieved with MockFor quite easily:

Let's see what is going on here.
  • On line 9 we create a new MockFor object to mock the Datastore.
  • On line 10 we demand that the open method is called and we return true. 
  • On line 11 we get a proxy instance that we can pass into the Service class. 
  • On line 14 we create an instance of the Service class passing in our mocked Datastore.
  • Finally on line 17 we verify that all our demands have been met. 
It is worth pointing out that the MockFor isn't strongly typed, we could have mocked any method even if it doesn't exist. If we run this test we'll get the following failure:

junit.framework.AssertionFailedError: verify[0]: expected 1..1 call(s) to 'open' but was called 0 time(s)

Brilliant. Now let's move on to a more complicated example. Let's say we want to test that the processEntry method takes each of the entries and stores them in the Datastore. This is when it became apparent to me that what is passed on line 10 is a closure that is executed when the mocked method is called. It just happened to return true as that was the last statement in the closure and Groovy doesn't always require the return statement. My first attempt to test the above scenario led me to:

Okay so the test fails now:

junit.framework.AssertionFailedError: verify[1]: expected 1..1 call(s) to 'storeField' but was called 0 time(s)

However we can make it pass without writing any meaningful code:

But changing line 6 to the following:

Means we'll get the following failure:

Assertion failed: 
assert v == "someValue"
       | |
       | false
       this is a made up value

And then we can actually write some sensible code to make it pass as now we're actually asserting that the correct values have been passed in.

Now finally onto the problem from the title. How do we mock multiple invocations to the same method with different arguments? We may want to do this when we verify all the values in the map passed to processEntry are passed to the Datastore rather than just the first one. This is where if you are coming from a Java/Mockito background you need to think differently. The solution I used in a similar situation looked like this:

Here we've written more complicated code in the closure that will be executed when the storeField method is called. Depending on the key that has been passed in, a different assertion is executed. Of course you would need to add some more code to the closure if you wanted to verify that no other values have been passed in.
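
For readers more at home in Java, the same idea of "one handler that checks its arguments according to the key" can be sketched without any mocking library. This is neither MockFor nor Mockito: the Datastore interface is cut down to a single method, and fake, checkingStore and the name/Chris and city/London pairs are all hypothetical.

```java
import java.util.function.BiConsumer;

// Hand-rolled stand-in for a closure-based mock: the handler runs on every
// call to storeField and asserts on the value according to the key.
class ArgumentDispatchSketch {
    interface Datastore {
        void storeField(String key, String value);
    }

    // Wraps a handler as a fake Datastore, much like passing a closure to a mock.
    static Datastore fake(BiConsumer<String, String> handler) {
        return handler::accept;
    }

    static void expect(String expected, String actual) {
        if (!expected.equals(actual)) {
            throw new AssertionError("expected " + expected + " but was " + actual);
        }
    }

    // A fake that checks a different value for each key and rejects unknown keys.
    static Datastore checkingStore() {
        return fake((key, value) -> {
            switch (key) {
                case "name":
                    expect("Chris", value);
                    break;
                case "city":
                    expect("London", value);
                    break;
                default:
                    throw new AssertionError("unexpected key: " + key);
            }
        });
    }

    public static void main(String[] args) {
        Datastore store = checkingStore();
        store.storeField("name", "Chris");
        store.storeField("city", "London");
        System.out.println("all stored values matched");
    }
}
```

Unlike the closure in the Groovy version, the default branch here rejects unexpected keys outright; whether that strictness is wanted depends on the test.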

This same style can be used to return different values depending on input parameters e.g.

I've also included the equivalent Java Mockito code for comparison.

Friday, August 16, 2013

Gradle, Groovy and Java

I've been moving my projects from Maven to Gradle and it has given me my first exposure to Groovy as it's Gradle's language of choice. Gradle is the perfect build tool for Java projects, Groovy projects and polyglot projects containing any combination of JVM languages. To get going with Groovy with Gradle you need to add the Groovy plugin to a Gradle script - this extends the Java plugin:

apply plugin: 'groovy'

Then the Gradle build will look for both Groovy and Java when you build your project:

Christophers-MacBook-Pro:testing chbatey$ gradle build
:compileJava UP-TO-DATE
:compileGroovy UP-TO-DATE
:processResources UP-TO-DATE
:classes UP-TO-DATE
:assemble UP-TO-DATE
:compileTestJava UP-TO-DATE
:compileTestGroovy UP-TO-DATE
:processTestResources UP-TO-DATE
:testClasses UP-TO-DATE
:test UP-TO-DATE
:check UP-TO-DATE
:build UP-TO-DATE


IntelliJ's support for Gradle has improved vastly of late, but if you experience problems revert to the Gradle idea plugin instead - I tend to use it if the Gradle project is more than a simple Groovy/Java project. If you import a project with just the above in your Gradle script, having created source folders for Java and Groovy, IntelliJ should recognise everything:

Now let's create some Groovy and Java! To allow us to interchange between Java and Groovy and be able to add some unit tests, add the following to the build.gradle:

repositories {
    mavenCentral()
}

dependencies {
    compile 'org.codehaus.groovy:groovy:2.1.6'
    testCompile 'junit:junit:4.+'
}

Now let's see a Groovy class testing a Java class:

On the left we can see a Java class in the main source tree and on the right a Groovy class testing it. This is a nice way to learn some Groovy and write tests with less boilerplate. You can say goodbye to types and semicolons. It is an easy transition for a Java developer as all Java is valid Groovy so you can write Java and Groovyify it as much or as little as you like. There is nothing stopping you from adding production Groovy code and testing it with Java as well.

If you want to stay on the command line then just run gradle test. If you want the source it's on GitHub:

Thursday, August 15, 2013

I don't fix bugs any more

One part of being a software engineer that I used to enjoy was the satisfaction of cracking a hard bug: trawling through logs, using profilers and heap dumps, and theorising about the different ways threads could have interleaved.

This week it suddenly dawned on me that I don't have to do this any more!

I've spent the last 6+ months working with some great people and we follow a number of great XP practices:
  • All our acceptance tests are automated,  many of them before development has started
  • All code is done TDD
  • All production code is paired on
  • We swap pairs every day, so unless a feature is completed in under a day most of our team will see the code
In addition we also have a great definition of done; for a particular feature it means:
  • Development complete (implicitly reviewed by the fact it was paired on)
  • All edge cases functionally tested and signed off
  • Deployed to our UAT environment
We always go the extra mile for functional testing. If a feature states we should store something persistently we don't only check that it has been stored but test all the edge cases around the datastore being down, being slow, traffic being dropped by a firewall etc. If it is a multi node cluster datastore (e.g. Cassandra) we'll test that we still function when nodes go down. And when I say test I don't mean a quick manual test but automated tests running against every commit.

This is the first team I've worked in where I haven't had to fight for software to be developed like this. Regardless of how hard it is to create automated tests for a particular scenario - we'll do it.

So if you like fixing bugs or being woken up in the middle of the night with production problems - don't develop software like this!

I can go on holiday whenever I like: how pairing with regular swaps makes my life easier

I'm extremely keen and enthusiastic. When I'm working, if there is someone who knows more about something than I do then I am compelled to study/play with it until I am up to speed. If there is an area of a system I don't understand I'll study it over lunch - I just like knowing things!

The problem with people who have this characteristic is that they tend to become "the expert" or at least the go-to person for different things and this can cause some problems (or at least I see them as problems):

  • these people get distracted a lot by other people asking them questions
  • they become a bottleneck because their input is needed to make decisions
  • they eventually become "architects" or "tech leads" and don't get to develop any more - this makes them sad :(
In my current project we are fairly strict about pair swapping: every day after our standup we swap people around. One person always stays on a feature so there is continuity and so everyone typically spends two days on a particular task. It's enough time to get stuck in but not bogged down.

After doing this for a few months I've noticed the following:

  1. We have no code ownership at all. If a single pair stays on a feature, or one person is always part of the pair working in a particular area, ownership still tends to creep in. Not so for us, and I think that is fantastic
  2. Due to number one, we refactor and delete code without anyone getting offended
  3. People don't get disheartened with boring tasks or features as their time working on them is reasonably short
  4. You constantly get fresh views on how we are doing things. Even if you've been on the feature before, having four days off means that you still generally have fresh ideas
  5. We have no fake dependencies between features - if there are free developers then the highest priority item will be worked on. No waiting for someone to finish a different feature. 
After a while I noticed that we would not swap if something was "nearly done". Why? I guess we thought it would be disruptive and the task would take longer. I began not to like this. I brought it up in a retro and we went back to strict swapping. Now I know what you are thinking - processes shouldn't be rigid, relax man!


When a software engineer says something is nearly done alarm bells should go off! A lot of the time it can still take hours or days, so a swap is still a good idea. Or let's give software developers some credit - let's say the feature is actually nearly done. I think that is the perfect time to get fresh eyes on the code. Make sure the entire team is happy with how it's been implemented before we call it done! A new set of eyes may spot some new edge cases too.

So what am I saying? Daily pair swaps mean I'm confident the entire team knows our code base and feature set and can answer questions on both - no more "I don't know - let's check with Ted when he gets back from holiday". This means I can go on holiday whenever I like, and I like holidays so this is great!

Thursday, August 8, 2013

Cassandra vs MongoDB: Replication

What is replication?

Most databases offer some way to recover data in the event of hardware failure. In Cassandra and MongoDB this is achieved via their respective replication strategies where the same data is stored on multiple hosts. In addition replication usually goes hand-in-hand with sharding so I've mentioned some details on sharding. To the impatient people, just read the Q&As :)

Cassandra replication

Cassandra was never designed to be run as a single node - the only time this is likely is in a development environment. This becomes clear as soon as you create a keyspace as you must provide a replication factor. For example:

CREATE KEYSPACE People
           WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};

The keyspace 'People' will replicate every row of every column family to three nodes in the cluster. A few key points about Cassandra replication:
  • You can create a keyspace with a replication factor greater than your number of nodes and add them later. For instance I ran the above create keyspace command on a single node Cassandra cluster running on my laptop.
  • You don't need all the replicas available to do an insert. This is based on the consistency of the write (a separate post will come later comparing Cassandra and MongoDB consistency).
  • You get sharding for free if your number of nodes is greater than your replication factor.
Cassandra is designed to be very fault tolerant - when replicating data the aim is to survive things like a node failure, a rack failure and even a datacentre failure. For this reason anything but the simplest Cassandra setup will use a replication strategy that is rack and datacentre aware. Cassandra gets this information from a snitch, which is responsible for identifying a node's rack and datacentre. Examples of the types of snitch are: PropertyFileSnitch, where you define each node's rack and datacentre in a file; Ec2Snitch, where the rack and datacentre of each node are inferred from the AWS region and availability zone; and RackInferringSnitch, where the rack and datacentre of each node are inferred from its IP address. Cassandra uses this information to avoid placing replicas in the same rack and to keep a set number of replicas in each datacentre. Once you have set up your snitch, all of this just happens.
An important feature of Cassandra is that all replicas are equal. There is no master for a particular piece of data and if a node goes down there is no period where a slave replica needs to become the master replica. This makes single node failure (load permitting) nearly transparent to the client (nearly because if there is a query in progress on the node during the failure then the query will fail). Most clients can be configured to retry transparently.
What's an ideal replication factor? That depends on the number of node failures you want to be able to handle and continue working at the consistency you want to be able to write at. If you have 3 replicas and you want to always write to a majority (QUORUM in Cassandra) then you can continue to write with 1 node down. If you want to handle 2 nodes down then you need a replication factor of 5.
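The arithmetic behind those numbers can be sketched in a few lines of Python. It assumes QUORUM means a strict majority of replicas, i.e. floor(RF / 2) + 1; the function names are illustrative and not part of any Cassandra client.

```python
def quorum(replication_factor):
    """Replicas that must acknowledge a QUORUM request: a strict majority."""
    return replication_factor // 2 + 1


def tolerable_failures(replication_factor):
    """Replicas that can be down while QUORUM is still reachable."""
    return replication_factor - quorum(replication_factor)


# replication factor, quorum size, failures survived
for rf in (1, 3, 4, 5):
    print(rf, quorum(rf), tolerable_failures(rf))
```

Note that an even replication factor buys no extra tolerance at QUORUM: with 4 replicas you can still only lose 1.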

MongoDB replication

Fault tolerance and replication are not as apparent in MongoDB from an application developer's point of view. Replication strategy is usually tightly coupled with sharding but this doesn't have to be the case - you can shard without replication and replicate without sharding. 

How does this work? Multiple MongoD processes can be added to a replica set. One of these MongoD processes will be automatically elected as the master. Every read and write must then go to the master and the writes are asynchronously replicated to the rest of the nodes in the replica set. If the master node goes down then a new master is automatically elected by the remaining nodes. This means that replication does not result in horizontal scaling as by default everything goes through the master.
Sharding in Mongo is achieved separately by separating collections across multiple shards. Shards are either individual MongoD instances or replica sets. Clients then send queries to query routers (MongoS processes) that route client requests to the correct shard. The metadata for the cluster (i.e. which shard contains what data) is kept on another process called a Config Server.  
Ensuring that replicas are unlikely to fail together, e.g. by being on the same rack, is down to the cluster setup. The nodes in a replica set must be manually put on different racks etc.

Q&A - Question, Cassandra and Mongo

Q. Are any additional processes required?
C. No - All nodes are equal.
M. No - However if you want sharding a separate MongoS (query router) process and three Config servers are required. Additional nodes that don't store the data may also be required to vote on which node should become the new master in the event of a failure.

Q. Is there a master for each piece of data?
C. No - all replicas are equal. All can process inserts and reads.
M. Yes - a particular MongoD instance is the master in a replica set. The rest are asynchronously replicated to.

Q. How does a client know which replica to send reads and writes to?
C. It doesn't. Writes and reads can be sent to any node and they'll be routed to the correct node. There are however token aware clients that can work this out based on hashing of the row key to aid performance.
M. If sharding is also enabled a separate process runs called MongoS that routes queries to the correct replica set. When there is no sharding then it will discover which replica in the replica set is the master.

Q. Does the application programmer care about how the data is sharded?
C. The row key is hashed so the programmer needs to ensure that the key has reasonably high cardinality. When using CQL3 the row key is the first part of the primary key so put low cardinality fields later in a compound primary key.
M. Yes - a field in the document is the designated shard key. This field is therefore mandatory and split into ranges. Shard keys such as dates should be avoided as eventually all new keys will be in the last range and it will cause the database to require re-balancing.
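A toy illustration of why key choice matters, not of how either database is implemented: range sharding sends ever-increasing keys such as dates to the last range, while hashing spreads them out. The range boundaries, the four-shard count and the use of MD5 as the hash are all assumptions made for the example.

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical upper bounds (exclusive) of the first three date ranges;
# anything at or beyond the last bound lands in the final shard.
RANGE_BOUNDS = ["2013-03-01", "2013-05-01", "2013-07-01"]


def range_shard(key):
    """Pick a shard by which range the key falls into."""
    return bisect.bisect_right(RANGE_BOUNDS, key)


def hash_shard(key, shards=4):
    """Pick a shard from a hash of the key (MD5 is a stand-in here)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shards


# New documents keyed by date: every key is later than the last boundary.
new_keys = ["2013-08-%02d" % day for day in range(1, 31)]

print(Counter(range_shard(k) for k in new_keys))  # every key in shard 3
print(Counter(hash_shard(k) for k in new_keys))   # keys spread by hash
```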

Monday, July 29, 2013

Cassandra vs MongoDB: The basics

Cassandra and MongoDB are two of the more popular NoSQL databases. I've been using Cassandra extensively over the past 6 months and I've recently started using MongoDB. Here is a brief description of the two; I'll follow up this post with a deeper comparison of the more advanced features.


Cassandra was originally created by Facebook and is written in Java; however it is now an Apache project. Traditionally Cassandra can be thought of as a column orientated database or a row orientated database depending on how you use columns. Each row is uniquely identified by a row key, like a primary key in a relational database. Unlike a relational database each row can have a different set of columns and it is common to use both the column name and the column value to store data. Rows are contained by a column family, which can be thought of as a table.

Clients use the Thrift transport protocol and queries look like:

set Person['chbatey']['fname'] = 'Chris Batey';

Where Person is the column family, chbatey is the row key, fname is the column name and "Chris Batey" is the column value. Column names are dynamic so a client can store any key/value pairs. In this sense Cassandra is quite schemaless.
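One way to picture this model is as a map of maps: column family, then row key, then column name. A rough Python sketch, in which the helper function and the second row with its column are invented for illustration:

```python
# column family -> row key -> column name -> column value
person = {}


def set_column(column_family, row_key, name, value):
    """Rough analogue of: set CF[row_key][name] = value;"""
    column_family.setdefault(row_key, {})[name] = value


set_column(person, "chbatey", "fname", "Chris Batey")
# Rows need not share columns (this row and its column are made up).
set_column(person, "another_user", "email", "user@example.com")

print(person["chbatey"]["fname"])
print(sorted(person["another_user"]))
```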

Then came Cassandra 1.* and CQL 3. Cassandra Query Language (CQL) is a SQL-like language for Cassandra. Suddenly Cassandra, from a client's perspective, became much more like a relational database. Queries now look like this:

insert into Person(fname) values ('chbatey')

Using CQL3 there are no more dynamic column names and you create tables rather than column families (however the map type basically gives the same functionality). It's all still column families under the covers, CQL3 is just a very nice abstraction (a simplification). 

Cassandra appears to be moving away from the Thrift protocol towards a proprietary protocol referred to as the native protocol.

Overall Cassandra is quite a "rough around the edges" database to use (less so with CQL3) from a client perspective. Its real power comes from its horizontal scalability and tuneable eventual consistency. More on this in a future post.


MongoDB is a document database written in C++. Document databases are very intuitive as you simply store and retrieve documents! No crazy data model to learn, for MongoDB you simply store and retrieve JSON (BSON) objects.

Storing looks like this:

db.people.insert({_id: 'chbatey', fname: 'Chris Batey'})

Retrieving looks like this:

db.people.find({_id: 'chbatey'})


MongoDB has a very rich JSON based querying language and a fantastic aggregation framework. From a client's perspective MongoDB is a vastly more featured database with support for ad-hoc querying (in Cassandra you must index everything you want to search by).


This post was a very brief description of Cassandra and MongoDB. In future posts I will compare:
  • Fault tolerance - replication
  • Read and write consistency
  • Clients
  • Hadoop support
Particularly for Cassandra, how your data centre and Cassandra cluster are laid out determines which read and write consistency levels you need to get the desired behaviour.

Wednesday, July 24, 2013

21st July 2013: What's happening?

I am always looking to improve as a Software Engineer. To keep track of what I'm working on I've broken it down to the following categories:
  • Languages: My day job is primarily Java so I like to use other languages for everything else.
  • Frameworks: Usually tightly coupled to a language but becoming less so - especially for JVM based languages.
  • Databases: The world is changing. No longer can you get away with only relational database / SQL knowledge.
  • Craftsmanship: How do I go about producing better, more maintainable software as well as helping those around me to do the same.
  • General knowledge: Keeping up with the technological world takes some doing. I try to read a few articles a day and listen to podcasts.
I won't work on each category every week. Here's what I've been doing the last week:


At work I'm a complete Java head. Over the past few years I've primarily developed standalone multi-threaded server applications for financial companies. More recently I've been developing cloud based applications so I've been doing a lot more Java development that is deployed to a container, e.g. Tomcat.

For this reason, when not at work, I am completely avoiding Java. This week I've been learning to test Java applications using Groovy (ok ok so I didn't leave Java completely behind!) and learning to unit test the logic in Gradle scripts using GroovyTest.

In addition to Groovy I've been working on Python this week. If you live in London you might be aware you can get Transport for London to send you your travel statements in CSV format. I've been writing a Python application to parse these and work out how much money I spend commuting to work. Blog posts coming about this but initial version on github:


Having worked with Cassandra a lot over the last six months I'm now exploring MongoDB. Leaving the relational world for the NoSQL world has been a great learning experience this year. I'll put up a comparison of Cassandra vs MongoDB soon. Cassandra is such a low-level, developer-must-understand-everything database that MongoDB is quite refreshing!


I've started doing katas again the last few weeks. I've started with sorting algorithms. I'm doing this quite quickly and in Python to further solidify my knowledge of the language. Here's merge sort: quicksort coming!

General Knowledge

I've started going through the backlog of Programming Throwdown over the last few weeks: not bad listening for the train, though I wish they spoke about games less!

Thursday, July 11, 2013

Mergesort in Python

My train was cancelled today and, as I am trying to cement my knowledge of Python, I decided to do mergesort in Python from memory. I find when adding new languages to my toolkit it is easy to forget how to set up a quick project with unit tests, so I find it useful to do regular little projects from scratch. Here's the code:

And here are the unit tests (of course the unit tests came first!):
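What follows is an illustrative sketch rather than the original code: a top-down merge sort with a few test cases using the standard unittest module. The names merge, merge_sort and MergeSortTest are assumptions.

```python
import unittest


def merge(left, right):
    """Merge two already sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # <= keeps equal elements in their original order (stable sort).
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One side is exhausted; the remainder of the other is already sorted.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Return a new sorted list, leaving the input untouched."""
    if len(items) <= 1:
        return list(items)
    middle = len(items) // 2
    return merge(merge_sort(items[:middle]), merge_sort(items[middle:]))


class MergeSortTest(unittest.TestCase):
    def test_empty_list(self):
        self.assertEqual([], merge_sort([]))

    def test_single_element(self):
        self.assertEqual([1], merge_sort([1]))

    def test_unsorted_with_duplicates(self):
        self.assertEqual([1, 2, 2, 3, 5], merge_sort([5, 2, 3, 1, 2]))

    def test_input_is_not_modified(self):
        original = [3, 1, 2]
        merge_sort(original)
        self.assertEqual([3, 1, 2], original)
```

Saved as a module, the tests can be run with python -m unittest.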

I really like python and its unit testing framework. So simple to get going and for doing TDD.

Tuesday, July 9, 2013

Uncle Bob: Automated acceptance tests

Yesterday I went to see Uncle Bob at the Skills Matter Exchange in London. Having read and enjoyed Clean Code and Clean Coder it was great to see Uncle Bob in the flesh.

The talk was on automated acceptance tests. Such a simple topic - we all automate our acceptance tests don't we?

A few points I took away from the talk:

  • Can we get our stakeholders to write our acceptance tests? If not is it at least business analysts or QAs? If it is developers you're in trouble!
  • Developers are brilliant at rationalising about concepts such as "is it done?". Don't trust a developer to tell you something is done!
  • Acceptance tests should be automated at the latest half way through an iteration if your QAs are going to have time to do exploratory testing
  • The QAs should be the smartest people in your team. They should be specifying the system with the business analyst not verifying it after it has been developed
  • Your functional tests are just another component of the system. If that part is brittle it means your project is badly designed! Go back to the drawing board.

A final point that stuck with me is that acceptance tests don't need to be black box tests. The language they are written in should be high level (it was your stakeholder who wrote it, right?). But the implementation could interact with a version of your system that has the database or HTTP layer mocked out. Think of it this way:
  • How many times do you need to test the login functionality of your application? Once!
  • How many times will you test it if all your tests go through the GUI/web front end? Hundreds!
Hearing Uncle Bob speak reminds me that even when I am working on a project I think is being developed in a fantastic way, with fantastic testing - I can still try and make it better.

Saturday, June 1, 2013

Mocking Iterable objects generically with Mockito

I often find myself having to mock iterable objects. At times I've created a test stub that overrides the iterator method and stubs the iterator; other times I have used Mockito to mock the iterator.

Most recently it was the Cassandra Datastax Java Driver's ResultSet object (which doesn't implement an interface and has a private constructor so you can't extend it and override or create an alternative implementation) which motivated me to create a generic method. 

So basically I want this code to work when aSetOfStrings is a mock. You probably won't be mocking collections but it'll give you the idea.
        Set<String> aSetOfStrings = // passed in somehow
        for (String s : aSetOfStrings) {
            // use s
        }

And the method for creating the mock iterable object needs to be generic. E.g:
       public static <T> void mockIterable(Iterable<T> iterable, T... values)

Where the first parameter is the mock iterable object and the var arg is the objects that it should return. 

The foreach loop internally uses two methods on the iterator returned by the set's iterator() method: hasNext() and next().

So to get it to work three methods need to be mocked:
  1. Mock the iterable's (the set) iterator() method to return a mock iterator
  2. Mock the iterator's hasNext() method to return true N times followed by false where N is the number of values that you want the iterator to return.
  3. Mock the iterator's next() method to return the N elements in order.
Using mockito number one is easy:
        Iterator<T> mockIterator = mock(Iterator.class);
        when(iterable.iterator()).thenReturn(mockIterator);

Number two and three are slightly more complicated. Mockito lets you pass in a vararg for what to return. The slight complication is that the signature is:
        thenReturn(T, T...)

This is to enforce that at least one element is passed in. This means that for hasNext() we need to pass in true N times followed by a false, but the first true needs to be passed in separately rather than in the vararg. The same applies for next() - we can't simply use the vararg passed into our mockIterable(..) method; we need to build a new array with N-1 elements in it. This can be done as follows:

  1. If no values are passed in all we need to do is mock hasNext() to false.
  2. If a single value is passed in we don't need to build an array to pass into thenReturn.
  3. Finally, for more than one value we need to build the correct boolean array and values array to pass into thenReturn. For example:
MockIterator.mockIterable(mockIterable, "one", "two", "three");

We'd need the following mocking calls:
        when(mockIterator.hasNext()).thenReturn(true, [true, true, false])
        when(mockIterator.next()).thenReturn("one", ["two", "three"])
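The argument-splitting logic is independent of Mockito, so it can be sketched in plain Python. The helper names are hypothetical; each function returns the (first, rest) pair that would be handed to thenReturn(T, T...).

```python
def has_next_arguments(values):
    """Booleans for hasNext(): True once per value, then a final False."""
    answers = [True] * len(values) + [False]
    return answers[0], answers[1:]


def next_arguments(values):
    """Values for next(), split into the first value and the remaining N-1."""
    if not values:
        raise ValueError("no next() values to stub for an empty iterable")
    return values[0], list(values[1:])


print(has_next_arguments(["one", "two", "three"]))
print(next_arguments(["one", "two", "three"]))
print(has_next_arguments([]))
```

The empty case falls out naturally: with no values the only hasNext() answer is false and next() is never stubbed.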

And here is the code to do it:
And the full code along with unit tests and a gradle script to build + pull in the dependencies is here.

Saturday, May 25, 2013

Time series based queries in Cassandra 1.2+ and CQL3

Cassandra is fantastic for storing large amounts of data, and as of 1.2/CQL3 working with time series data just got a lot easier. Here is a basic example that stores some kind of posts (e.g. blog posts) that can be queried by username and a time period.

Everything here assumes you have Cassandra 1.2+ installed and are using CQL3 with the cqlsh Python client. Here are the exact versions I used for the example below:

[cqlsh 3.0.2 | Cassandra 1.2.5 | CQL spec 3.0.0 | Thrift protocol 19.36.0]

Creating a keyspace:
create keyspace cassandraspike WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 1};

The syntax for creating keyspaces changed in CQL for Cassandra 1.2, so if this fails check you aren't running against a pre-1.2 version of Cassandra.
use cassandraspike;

Then create a posts table:
create table posts(username varchar, time timeuuid, post_text varchar, primary key(username, time));

We've used a compound primary key made up of the username and the time. The time is of type timeuuid, a new data type in CQL3 that is a type 1 UUID containing a timestamp, so it both stores the time of the post and makes our rows unique.

If you're familiar with column families then it might interest you to know that the first column in your primary key becomes the column family (CF) row key. Every subsequent part of the primary key becomes part of the CF column names in that CF row. This has two implications:
  • There will only be as many CF rows as there are variations of the first element in your primary key. This can be a problem if this element has a very low cardinality as you can end up with very wide CF rows.
  • CF columns are stored in order, so the fact that each CF column is prefixed with the time means that your data is stored in the order it happened, making it very efficient to query by time.
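To make the layout concrete, here is a toy in-memory model (an analogy, not Cassandra's storage engine): one wide row per username, with entries kept sorted by time so that a time-range query becomes two binary searches. The class name, the use of plain integers as times and the sample posts are assumptions for illustration.

```python
import bisect
from collections import defaultdict


class PostsTable:
    """Toy model: row key -> columns kept sorted by their time component."""

    def __init__(self):
        self._rows = defaultdict(list)  # username -> sorted [(time, text)]

    def insert(self, username, time, post_text):
        # insort keeps the row ordered by time on every write.
        bisect.insort(self._rows[username], (time, post_text))

    def between(self, username, start, end):
        """Posts with start <= time < end, found by binary search."""
        row = self._rows[username]
        # A 1-tuple (t,) sorts before any (t, text), so it marks the
        # first possible entry for time t.
        low = bisect.bisect_left(row, (start,))
        high = bisect.bisect_left(row, (end,))
        return row[low:high]


posts = PostsTable()
posts.insert("chbatey", 30, "third")
posts.insert("chbatey", 10, "first")
posts.insert("chbatey", 20, "second")
print(posts.between("chbatey", 10, 30))
```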
Hopefully you are sufficiently convinced that your data is going to be stored in a way that is efficient to query by time. So let's store some data in and query it by user name and time.

To insert we'll make use of the now() function which gives you a timeuuid for the current time:
insert into posts(username, time, post_text) values ('chbatey', now(), 'i am writing something about cassandra');

Selecting this back works but gives you a time which isn't very human-readable:
select * from posts;

 username | time                                 | post_text
  chbatey | 59ad61d0-c540-11e2-881e-b9e6057626c4 | i am writing something about cassandra

That's where the dateOf function comes in handy which converts a timeuuid into a date:
select username, dateOf(time), post_text from posts;

 username | dateOf(time)             | post_text
  chbatey | 2013-05-25 14:38:14+0100 | i am writing something about cassandra
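The timestamp really is embedded in the timeuuid. As a sketch of roughly what dateOf does, Python's standard uuid module can pull it back out of the value shown above. The function name date_of is illustrative, and the constant is the offset between the UUID epoch (15 October 1582) and the Unix epoch in 100-nanosecond ticks.

```python
import uuid
from datetime import datetime, timedelta, timezone

# 100 ns ticks between 1582-10-15 (UUID epoch) and 1970-01-01 (Unix epoch).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def date_of(timeuuid):
    """Return the UTC datetime embedded in a version 1 (time-based) UUID."""
    if timeuuid.version != 1:
        raise ValueError("not a time-based UUID")
    ticks_since_unix_epoch = timeuuid.time - UUID_EPOCH_OFFSET
    micros = ticks_since_unix_epoch // 10  # 10 ticks per microsecond
    unix_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return unix_epoch + timedelta(microseconds=micros)


print(date_of(uuid.UUID("59ad61d0-c540-11e2-881e-b9e6057626c4")))
```

The result is printed in UTC, so it reads 13:38:14, which is the same instant as the 14:38:14+0100 that cqlsh reports.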

I've now inserted another post at a different time:
select username, dateOf(time), post_text from posts;

username | dateOf(time)             | post_text
  chbatey | 2013-05-25 14:38:14+0100 | i am writing something about cassandra
  chbatey | 2013-05-25 14:40:35+0100 | i am writing something about cassandra

Now let's say you are only interested in the posts chbatey made at 14:38:
select username, dateOf(time), post_text from posts where time >= minTimeuuid('2013-05-25 14:38') and time < minTimeuuid('2013-05-25 14:39') and username = 'chbatey';

 username | dateOf(time)             | post_text
  chbatey | 2013-05-25 14:38:14+0100 | i am writing something about cassandra

This query is more complicated. It makes use of another CQL function: minTimeuuid. The minTimeuuid function gives a fake (as it isn't unique) timeuuid that is the smallest possible timeuuid for the given time. This is very handy when you want to do less than/greater than queries on timeuuid fields.

In the above query we are getting everything that is greater than or equal to the minimum timeuuid for the time 2013-05-25 14:38 but less than the minimum timeuuid of the next minute, 2013-05-25 14:39.

Alternatively we could have used the very similar maxTimeuuid function with 2013-05-25 14:38:59. However I find the original easier to understand as I read it as greater than or equal to 14:38:00 and less than 14:39:00.

Friday, May 24, 2013

Installing Cassandra on Mac OS X

I've recently posted some more tips on using Cassandra on Mac OSX: Cassandra on Mac

If you don't already have homebrew then install it from here.

Then it is as simple as:

brew install cassandra

This doesn't install the Python driver for the cqlsh command line tool. To do this, first install Python if you haven't got it already:

brew install python

This should have also installed pip - the python package manager - so you can then install the cql python module:

pip install cql

Now try and start cqlsh

You might get this:

Python CQL driver not installed, or not on PYTHONPATH.
You might try "easy_install cql".

One second, didn't I just install the cql module?

This could be because the Python in your path is the Mac OS X version, not the version you installed with homebrew that has cql. I fixed this by adding /usr/local/bin to the start of my PATH as that is where the brew Python executable lives:

export PATH=/usr/local/bin/:$PATH

Unless you've started cassandra the next time you try cqlsh you'll get:

Connection error: Could not connect to localhost:9160

Now if you do a brew info on cassandra:

brew info cassandra

To have launchd start cassandra at login:
    ln -sfv /usr/local/opt/cassandra/*.plist ~/Library/LaunchAgents
Then to load cassandra now:
    launchctl load ~/Library/LaunchAgents/homebrew.mxcl.cassandra.plist

Unless you are going to use cassandra a lot I wouldn't set it to load on startup as it does use a reasonable amount of memory. Instead to just start it off:

launchctl load /usr/local/opt/cassandra/homebrew.mxcl.cassandra.plist 

Finally cqlsh should connect to cassandra:

Connected to Test Cluster at localhost:9160.
[cqlsh 3.0.2 | Cassandra 1.2.5 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
Use HELP for help.

Or if you prefer the older cassandra-cli interface to cassandra:

Connected to: "Test Cluster" on
Welcome to Cassandra CLI version 1.2.5

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.


All done.