Thursday, August 15, 2013

I can go on holiday whenever I like: how pairing with regular swaps makes my life easier

I'm extremely keen and enthusiastic. When I'm working, if there is someone who knows more about something than I do then I am compelled to study/play with it until I am up to speed. If there is an area of a system I don't understand I'll study it over lunch - I just like knowing things!

The problem with people who have this characteristic is that they tend to become "the expert" or at least the go-to person for different things and this can cause some problems (or at least I see them as problems):

  • these people get distracted a lot by other people asking them questions
  • they become a bottleneck because their input is needed to make decisions
  • they eventually become "architects" or "tech leads" and don't get to develop any more - this makes them sad :(
In my current project we are fairly strict about pair swapping: every day after our standup we swap people around. One person always stays on a feature so there is continuity and so everyone typically spends two days on a particular task. It's enough time to get stuck in but not bogged down.

After doing this for a few months I've noticed the following:

  1. We have no code ownership at all. If a single pair stays on a feature, or one person is always part of the pair working in a particular area, informal ownership still tends to develop. Not so for us, and I think that is fantastic
  2. Due to number one, we refactor and delete code without anyone getting offended
  3. People don't get disheartened with boring tasks or features as their time working on them is reasonably short
  4. You constantly get fresh views on how we are doing things. Even if you've been on the feature before, having four days off means you still generally have fresh ideas
  5. We have no fake dependencies between features - if there are free developers then the highest priority item will be worked on. No waiting for someone to finish a different feature. 
After a while I noticed that we would not swap if something was "nearly done". Why? I guess we thought it would be disruptive and the task would take longer. I began not to like this. I brought it up in a retro and we went back to strict swapping. Now I know what you are thinking - processes shouldn't be rigid, relax man!

However...

When a software engineer says something is nearly done, alarm bells should go off! A lot of the time it can still take hours or days, so a swap is still a good idea. Or let's give software developers some credit - let's say the feature really is nearly done. I think that is the perfect time to get fresh eyes on the code. Make sure the entire team is happy with how it's been implemented before we call it done! A new set of eyes may spot some new edge cases too.

So what am I saying? Daily pair swaps mean I'm confident the entire team knows our code base and feature set and can answer questions on both - no more "I don't know - let's check with Ted when he gets back from holiday". This means I can go on holiday whenever I like, and I like holidays, so this is great! 

Thursday, August 8, 2013

Cassandra vs MongoDB: Replication

What is replication?

Most databases offer some way to recover data in the event of hardware failure. In Cassandra and MongoDB this is achieved via their respective replication strategies where the same data is stored on multiple hosts. In addition replication usually goes hand-in-hand with sharding so I've mentioned some details on sharding. To the impatient people, just read the Q&As :)

Cassandra replication

Cassandra was never designed to be run as a single node - the only time that's likely is in a development environment. This becomes clear as soon as you create a keyspace, since you must provide a replication factor. For example:

CREATE KEYSPACE People
           WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3}


The keyspace 'People' will replicate every row of every column family to three nodes in the cluster. A few key points about Cassandra replication:
  • You can create a keyspace with a replication factor greater than your number of nodes and add more nodes later. For instance I ran the above create keyspace command on a single node Cassandra cluster running on my laptop.
  • You don't need all the replicas available to do an insert. This is based on the consistency of the write (a separate post will come later comparing Cassandra and MongoDB consistency).
  • You get sharding for free if your number of nodes is greater than your replication factor.
Cassandra is designed to be very fault tolerant - when replicating data the aim is to survive a node failure, a rack failure and even a datacentre failure. For this reason anything but the simplest Cassandra setup will use a replication strategy that is rack and datacentre aware. Cassandra gets this information from a snitch. A snitch is responsible for identifying a node's rack and datacentre. Examples include the PropertyFileSnitch, where you define each node's rack and datacentre in a file; the EC2Snitch, where the rack and datacentre of each node is inferred from the AWS region and availability zone; and the RackInferringSnitch, where the rack and datacentre of each node is inferred from its IP address. Cassandra uses this information to avoid placing replicas in the same rack and to keep a set number of replicas in each datacentre. Once you have set up your snitch, all of this just happens.
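As an illustration, a PropertyFileSnitch topology file (cassandra-topology.properties) looks something like this - the addresses, rack and datacentre names here are invented:

```
# node IP = datacentre:rack
192.168.1.10=DC1:RAC1
192.168.1.11=DC1:RAC2
192.168.2.10=DC2:RAC1
# nodes not listed above fall back to the default
default=DC1:RAC1
```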
An important feature of Cassandra is that all replicas are equal. There is no master for a particular piece of data and if a node goes down there is no period where a slave replica needs to become the master replica. This makes single node failure (load permitting) nearly transparent to the client (nearly because if there is a query in progress on the node during the failure then the query will fail). Most clients can be configured to retry transparently.
What's an ideal replication factor? That depends on how many node failures you want to survive while still writing at your chosen consistency level. If you have 3 replicas and always write to a majority (QUORUM in Cassandra) then you can continue writing with 1 node down. If you want to handle 2 nodes down then you need a replication factor of 5.
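The arithmetic works out as follows - a QUORUM is a strict majority of the replicas, so the number of failed nodes you can tolerate is the replication factor minus the quorum size (this is general arithmetic, not a Cassandra API):

```python
def quorum(replication_factor):
    # A quorum is a strict majority of the replicas
    return replication_factor // 2 + 1

def tolerable_failures(replication_factor):
    # Nodes that can be down while QUORUM writes still succeed
    return replication_factor - quorum(replication_factor)

for rf in (3, 5):
    # replication factor 3 -> quorum 2, tolerates 1 down
    # replication factor 5 -> quorum 3, tolerates 2 down
    print(rf, quorum(rf), tolerable_failures(rf))
```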


MongoDB replication

Fault tolerance and replication are not as apparent in MongoDB from an application developer's point of view. Replication strategy is usually tightly coupled with sharding but this doesn't have to be the case - you can shard without replication and replicate without sharding. 

How does this work? Multiple MongoD processes can be added to a replica set. One of these MongoD processes will be automatically elected as the master. Every read and write must then go to the master and the writes are asynchronously replicated to the rest of the nodes in the replica set. If the master node goes down then a new master is automatically elected by the remaining nodes. This means that replication does not result in horizontal scaling as by default everything goes through the master.
Sharding in Mongo is achieved separately, by splitting collections across multiple shards. Shards are either individual MongoD instances or replica sets. Clients then send queries to query routers (MongoS processes) that route requests to the correct shard. The metadata for the cluster (i.e. which shard contains what data) is kept on another process called a config server.
Ensuring that replicas are unlikely to fail together, e.g. by being on the same rack, is down to the cluster setup. The nodes in a replica set must be manually placed on different racks etc.


Q&A (Q = question, C = Cassandra's answer, M = MongoDB's answer)


Q. Are any additional processes required?
C. No - All nodes are equal.
M. No - however, if you want sharding, a separate MongoS (query router) process and three config servers are required. Additional nodes that don't store data may also be required to vote on which node should become the new master in the event of a failure.

Q. Is there a master for each piece of data?
C. No - all replicas are equal. All can process inserts and reads.
M. Yes - a particular MongoD instance is the master in a replica set. The rest are asynchronously replicated to.

Q. How does a client know which replica to send reads and writes to?
C. It doesn't. Writes and reads can be sent to any node and they'll be routed to the correct node. There are however token aware clients that can work this out, based on hashing of the row key, to aid performance.
M. If sharding is enabled, a separate MongoS process routes queries to the correct replica set. Without sharding, the client discovers which replica in the replica set is the master.

Q. Does the application programmer care about how the data is sharded?
C. The row key is hashed so the programmer needs to ensure that the key has reasonably high cardinality. When using CQL3 the row key is the first part of the primary key so put low cardinality fields later in a compound primary key.
M. Yes - a field in the document is designated the shard key. This field is therefore mandatory and is split into ranges. Shard keys such as dates should be avoided, as eventually all new keys will fall in the last range and the database will require re-balancing.
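To make the hash-versus-range distinction concrete, here's a toy sketch - not either database's actual partitioning code, and the shard counts and range boundaries are invented:

```python
import hashlib

def hash_partition(key, num_shards):
    # Cassandra-style: the row key is hashed, so consecutive keys scatter evenly
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

def range_partition(key, upper_bounds):
    # MongoDB-style: keys live in sorted ranges; a monotonically increasing
    # key (e.g. a date) always lands in the last range
    for shard, upper in enumerate(upper_bounds):
        if key < upper:
            return shard
    return len(upper_bounds)

# Consecutive keys spread over the shards when hashed...
spread = {hash_partition("user%d" % i, 4) for i in range(100)}
# ...but date keys all pile into the final range
dates = ["2013-07-%02d" % day for day in range(1, 31)]
final = [range_partition(d, ["2013-01-01", "2013-06-01"]) for d in dates]
```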

Monday, July 29, 2013

Cassandra vs MongoDB: The basics

Cassandra and MongoDB are two of the more popular NoSQL databases. I've been using Cassandra extensively over the past 6 months and I've recently started using MongoDB. Here is a brief description of the two; I'll follow up this post with a deeper comparison of the more advanced features.

Cassandra

Cassandra was originally created by Facebook and is written in Java; it is now an Apache project. Traditionally Cassandra can be thought of as a column orientated database or a row orientated database depending on how you use columns. Each row is uniquely identified by a row key, like a primary key in a relational database. Unlike a relational database, each row can have a different set of columns, and it is common to use both the column name and the column value to store data. Rows are contained by a column family, which can be thought of as a table.

Clients use the Thrift transport protocol and queries look like:

set Person['chbatey']['fname'] = 'Chris Batey';

Where Person is the column family, chbatey is the row key, fname is the column name and "Chris Batey" is the column value. Column names are dynamic so a client can store any key/value pairs. In this sense Cassandra is quite schemaless.

Then came Cassandra 1.* and CQL 3. Cassandra Query Language (CQL) is a SQL like language for Cassandra. Suddenly Cassandra, from a client's perspective, became much more like a relational database. Queries now look like this:

insert into Person(fname) values ('chbatey')

Using CQL3 there are no more dynamic column names and you create tables rather than column families (however the map type basically gives the same functionality). It's all still column families under the covers; CQL3 is just a very nice abstraction (a simplification). 
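For example, a map column can recover the dynamic key/value behaviour - the table and column names here are invented for illustration:

```
CREATE TABLE person (
    id text PRIMARY KEY,
    attributes map<text, text>
);
UPDATE person SET attributes['fname'] = 'Chris Batey' WHERE id = 'chbatey';
```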

Cassandra appears to be moving away from the Thrift protocol towards its own binary protocol, referred to as the native protocol. 

Overall, Cassandra is quite a "rough around the edges" database to use from a client perspective (less so with CQL3). Its real power comes from its horizontal scalability and tuneable eventual consistency. More on this in a future post.

MongoDB

MongoDB is a document database written in C++. Document databases are very intuitive as you simply store and retrieve documents! No crazy data model to learn - with MongoDB you simply store and retrieve JSON (BSON) objects.

Storing looks like this:

db.people.save({_id: 'chbatey', fname:'Chris Batey'})

Retrieving looks like this:

db.people.find({_id: 'chbatey'})

Simple!

MongoDB has a very rich JSON based querying language and a fantastic aggregation framework. From a client's perspective MongoDB is a vastly more featured database, with support for ad-hoc querying (in Cassandra you must index everything you want to search by). 

Conclusion

This post was a very brief description of Cassandra and MongoDB. In future posts I will compare:
  • Fault tolerance - replication
  • Read and write consistency
  • Clients
  • Hadoop support
For Cassandra in particular, how your datacentre and cluster are laid out matters a great deal when choosing the read and write consistency levels that give the behaviour you want. 

Wednesday, July 24, 2013

21st July 2013: What's happening?

I am always looking to improve as a Software Engineer. To keep track of what I'm working on I've broken it down into the following categories:
  • Languages: My day job is primarily Java so I like to use other languages for everything else.
  • Frameworks: Usually tightly coupled to a language but becoming less so - especially for JVM based languages.
  • Databases: The world is changing. No longer can you get away with just relational database / SQL knowledge
  • Craftsmanship: How do I go about producing better, more maintainable software as well as helping those around me to do the same.
  • General knowledge: Keeping up with the technological world takes some doing. I try to read a few articles a day and listen to podcasts.
I won't work on each category every week. Here's what I've been doing the last week:

Languages


At work I'm a complete Java head. Over the past few years I've primarily developed standalone multi-threaded server applications for financial companies. More recently I've been developing cloud based applications, so I've been doing a lot more Java development deployed to a container, e.g. Tomcat.

For this reason, when not at work, I am completely avoiding Java. This week I've been learning to test Java applications using Groovy (ok ok, so I didn't leave Java completely behind!) and learning to unit test the logic in Gradle scripts using GroovyTest.

In addition to Groovy I've been working on Python this week. If you live in London you might be aware you can get Transport for London to send you your travel statements in CSV format. I've been writing a Python application to parse these and work out how much money I spend commuting to work. Blog posts about this are coming, but the initial version is on github: https://github.com/chbatey/oystergrep

Databases


Having worked with Cassandra a lot over the last six months I'm now exploring MongoDB. Leaving the relational world for the NoSQL world has been a great learning experience this year. I'll put up a comparison for Cassandra vs MongoDB soon. Cassandra is such a low-level, the-developer-must-understand-everything database that MongoDB is quite refreshing!

Craftsmanship


I've started doing katas again over the last few weeks, beginning with sorting algorithms. I'm doing them quite quickly, and in Python, to further solidify my knowledge of the language. Here's merge sort: https://github.com/chbatey/kata_mergesort - quicksort coming!

General Knowledge


I've started going through the backlog of Programming Throwdown over the last few weeks: http://programmingthrowdown.blogspot.co.uk/. Not bad listening for the train, though I wish they spoke about games less!

Thursday, July 11, 2013

Mergesort in Python

My train was cancelled today, and as I am trying to cement my knowledge of Python I decided to do mergesort from memory. I find when adding new languages to my toolkit it is easy to forget how to set up a quick project with unit tests, so I find it useful to do regular little projects from scratch. Here's the code:
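It went something like this - reconstructed from memory, so treat it as a sketch rather than the exact gist:

```python
import unittest

def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])
    result.extend(right[j:])
    return result

def merge_sort(values):
    """Split the list in half, sort each half recursively, then merge."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    return merge(merge_sort(values[:mid]), merge_sort(values[mid:]))

class MergeSortTest(unittest.TestCase):
    def test_empty_list(self):
        self.assertEqual([], merge_sort([]))

    def test_single_element(self):
        self.assertEqual([5], merge_sort([5]))

    def test_unsorted_list(self):
        self.assertEqual([1, 2, 3, 4], merge_sort([3, 1, 4, 2]))

    def test_duplicates(self):
        self.assertEqual([1, 1, 2], merge_sort([2, 1, 1]))

# run with: python -m unittest <filename>
```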


And of course, the unit tests came first!

I really like python and its unit testing framework. So simple to get going and for doing TDD.

Tuesday, July 9, 2013

Uncle Bob: Automated acceptance tests

Yesterday I went to see Uncle Bob at the Skills Matter Exchange in London. Having read and enjoyed Clean Code and Clean Coder it was great to see Uncle Bob in the flesh.

The talk was on automated acceptance tests. Such a simple topic - we all automate our acceptance tests don't we?

A few points I took away from the talk:

  • Can we get our stakeholders to write our acceptance tests? If not, is it at least business analysts or QAs? If it is developers, you're in trouble!
  • Developers are brilliant at rationalising about concepts such as "is it done?". Don't trust a developer to tell you something is done!
  • Acceptance tests should be automated at the latest halfway through an iteration if your QAs are going to have time to do exploratory testing
  • The QAs should be the smartest people in your team. They should be specifying the system with the business analyst, not verifying it after it has been developed
  • Your functional tests are just another component of the system. If that part is brittle, your project is badly designed! Go back to the drawing board.

A final point that stuck with me is that acceptance tests don't need to be black box tests. The language they are written in should be high level (it was your stakeholder who wrote it, right??). But the implementation could interact with a version of your system that has the database or HTTP layer mocked out. Think of it this way:
  • How many times do you need to test the login functionality of your application? Once!
  • How many times will you test it if all your tests go through the GUI/web front end? Hundreds!
Hearing Uncle Bob speak reminds me that even when I am working on a project I think is being developed in a fantastic way, with fantastic testing - I can still try and make it better.

Saturday, June 1, 2013

Mocking Iterable objects generically with Mockito


I often find myself having to mock iterable objects. At times I've created a test stub that overrides the iterator method and stubs the iterator; other times I have used Mockito to mock the iterator. 

Most recently it was the Cassandra Datastax Java Driver's ResultSet object (which doesn't implement an interface and has a private constructor so you can't extend it and override or create an alternative implementation) which motivated me to create a generic method. 

So basically I want this code to work where aSetOfStrings is a mock. You probably won't be mocking collections, but it'll give you the idea.
        Set<String> aSetOfStrings = // passed in somehow
        List<String> results = new ArrayList<>();
        for (String s : aSetOfStrings) {
            results.add(s);
        }

And the method for creating the mock iterable object needs to be generic. E.g:
       public static <T> void mockIterable(Iterable<T> iterable, T... values)

Where the first parameter is the mock iterable object and the var arg is the objects that it should return. 

The foreach loop internally uses two methods on the iterator returned by the set's iterator() method: hasNext() and next().

So to get it to work three methods need to be mocked:
  1. Mock the iterable's (the set's) iterator() method to return a mock iterator
  2. Mock the iterator's hasNext() method to return true N times followed by false where N is the number of values that you want the iterator to return.
  3. Mock the iterator's next() method to return the N elements in order.
Using Mockito, number one is easy:
        Iterator<T> mockIterator = mock(Iterator.class);
        when(iterable.iterator()).thenReturn(mockIterator);

Numbers two and three are slightly more complicated. Mockito lets you pass in a vararg for what to return. The slight complication is that the signature is:
        thenReturn(T, T...)

This is to enforce that at least one element is passed in. This means that for hasNext() we need to pass in true N times followed by a false, but the first true needs to be passed in separately rather than in the vararg. The same applies for next() - we can't simply use the vararg passed into our mockIterable(..) method; we need to build a new array with N-1 elements in it. This can be done as follows:

  1. If no values are passed in, all we need to do is mock hasNext() to return false.
  2. If a single value is passed in we don't need to build an array to pass into thenReturn.
  3. Finally, for more than one value we need to build the correct boolean array and values array to pass into thenReturn. For example:
MockIterator.mockIterable(mockIterable, "one", "two", "three");

We'd need the following mocking calls:
        when(mockIterator.hasNext()).thenReturn(true, new Boolean[]{true, true, false})
and
        when(mockIterator.next()).thenReturn("one", new String[]{"two", "three"})

And here is the code to do it:
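The three cases come together like this - a sketch of the helper, assuming Mockito is on the classpath (the linked full version includes the unit tests):

```java
import java.util.Arrays;
import java.util.Iterator;

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

public class MockIterator {

    @SuppressWarnings("unchecked")
    public static <T> void mockIterable(Iterable<T> iterable, T... values) {
        Iterator<T> mockIterator = mock(Iterator.class);
        // 1. The iterable's iterator() method returns our mock iterator
        when(iterable.iterator()).thenReturn(mockIterator);

        if (values.length == 0) {
            // No values: hasNext() is simply false
            when(mockIterator.hasNext()).thenReturn(false);
        } else if (values.length == 1) {
            // One value: no arrays needed
            when(mockIterator.hasNext()).thenReturn(true, false);
            when(mockIterator.next()).thenReturn(values[0]);
        } else {
            // 2. hasNext(): true N times followed by false, with the first
            //    true passed separately from the vararg array
            Boolean[] hasNexts = new Boolean[values.length];
            Arrays.fill(hasNexts, 0, values.length - 1, Boolean.TRUE);
            hasNexts[values.length - 1] = Boolean.FALSE;
            when(mockIterator.hasNext()).thenReturn(true, hasNexts);

            // 3. next(): the first value separately, then the remaining N-1
            T[] rest = Arrays.copyOfRange(values, 1, values.length);
            when(mockIterator.next()).thenReturn(values[0], rest);
        }
    }
}
```

With this in place, MockIterator.mockIterable(aSetOfStrings, "one", "two", "three") makes the foreach loop at the top of the post see exactly those three values.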
And the full code along with unit tests and a gradle script to build + pull in the dependencies is here.