The LJC conference is a yearly event for Java/JVM developers in London to get together and see some (hopefully) great talks :)
IBM kindly provide an awesome venue on the South Bank at no charge.
I had the fortune of being scheduled to talk first, which meant I could get my talk done and then enjoy the rest of the day. I chose to speak about Cassandra for Java developers, which went down really well, and I had people coming up to me all day asking about Cassandra.
Here are the slides:
Overall it was an awesome day and I look forward to next year :)
Monday, December 1, 2014
Friday, November 7, 2014
Talking at @skillsmatter for the LJC about fault tolerant microservices
Had a great time giving a talk on building fault tolerant microservices at Skills Matter this week. It was great to share some of the good work my team and I have been doing at BSkyB.
Here are the slides:
Skills Matter were kind enough to record the event; here is the video: https://skillsmatter.com/skillscasts/5810-building-fault-tolerant-microservices
I referenced some great tools such as WireMock and Saboteur, and coincidentally the author was in the audience. It is a small tech world we live in!
Tuesday, October 21, 2014
Building fault tolerant services: Do you handle timeouts correctly?
One of the simplest sounding requirements that you might get as a software engineer is that your service should either succeed or timeout within N seconds.
However, as we move toward more distributed services, a.k.a. microservices, this is harder than it sounds.
Even if all your service did was call out to another HTTP service over TCP, do some logic, then return a response, you would have to deal with:
- Thread creation duration
- Socket connection timeout to the dependency
- Socket read timeout to the dependency
- Resource acquisition, e.g. how long it takes for your request thread to get hold of a resource from a pool
If you stop there you might think you have all your bases covered:
Looks good: you are taking into account the time taken to get a resource from a resource pool for the third party, any connection timeout in case you need to re-connect, and finally you set a socket read timeout on the request.
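As a rough sketch of those settings, here is how the two network-level timeouts look on the JDK's built-in HTTP client (the pool-acquisition timeout needs a pooling client such as Apache HttpClient, which offers a connection request timeout for exactly that; the URL and values here are illustrative):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative only: the two network-level timeouts available on the JDK's
// built-in client. Pool-acquisition timeouts need a pooling client such as
// Apache HttpClient (its RequestConfig has setConnectionRequestTimeout).
public final class TimeoutConfig {
    private TimeoutConfig() {}

    public static HttpURLConnection configure(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(2_000); // socket connection timeout to the dependency
        conn.setReadTimeout(3_000);    // socket read timeout to the dependency
        return conn;
    }
}
```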
This covers timing out in most cases, but what happens if the dependency is feeding you data very slowly?
Here a socket read timeout won't help you, as the underlying socket library you're using is receiving some data within each read timeout period. For large payloads this scenario can leave your application appearing to hang.
So how do you solve this? To be sure that your application will time out, you can't rely on network-level timeouts. A common pattern is to have a worker queue and thread pool for each dependency; that way you can time out the request in process. A fantastic library for this is Netflix's Hystrix.
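The pattern is easy to sketch with plain JDK primitives. This is a toy version of what Hystrix does far more robustly (Hystrix also adds circuit breaking, metrics and fallbacks):

```java
import java.util.concurrent.*;

// Toy sketch of a per-dependency worker queue + thread pool with a hard timeout.
// The call runs on the dependency's own pool, so a hung socket can't stall the
// request thread past the deadline.
public final class DependencyExecutor {
    // One bounded pool per dependency isolates slow dependencies from each other.
    private final ExecutorService pool;

    public DependencyExecutor(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    // Gives up after timeoutMillis regardless of what the underlying socket
    // is (slowly) doing.
    public <T> T callWithTimeout(Callable<T> call, long timeoutMillis) throws Exception {
        Future<T> future = pool.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the hung call
            throw e;
        }
    }

    public void shutdown() {
        pool.shutdownNow();
    }
}
```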
How do you do automated testing for this? If you're like me and love to test everything, then this is a tough one. However, by running your dependency (or a mock like WireMock) on a separate VM that is provisioned for the test, and then using Linux tools like iptables and tc, you can automate tests for a slow network. Saboteur is a small Python library that does this for you, offering an HTTP API for slowing the network, dropping packets etc.
Isn't this slow? On a Jenkins server, provisioning the VM takes ~1 minute with Vagrant once the base box has been downloaded. For development I always have it running.
The whole stack (an application under test, WireMock, Vagrant and Saboteur) will be the topic of a follow-up post that will contain a full working example.
This article showed how complicated this is for a single dependency; what about when you call out to many dependencies? I tend to use a library like Hystrix to wrap calls to dependencies, but in-house code to wrap the whole request in a timeout. This allows one dependency to be slow while the others are fast, which is more flexible than taking your SLA and dividing it between your dependencies.
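That split can be sketched like this (hypothetical names; each dependency call runs concurrently, while the whole request is capped by a single overall deadline, so one slow dependency may use most of the budget as long as the others are quick):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Toy sketch: fan out the dependency calls, then cap the whole request with
// one overall deadline rather than a fixed per-dependency slice of the SLA.
public final class RequestAggregator {
    private RequestAggregator() {}

    public static List<String> gather(List<Callable<String>> calls,
                                      long overallMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(calls.size());
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Callable<String> c : calls) {
                futures.add(pool.submit(c));
            }
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(overallMillis);
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                long left = deadline - System.nanoTime(); // remaining overall budget
                results.add(f.get(Math.max(left, 0), TimeUnit.NANOSECONDS));
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }
}
```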
Saturday, August 23, 2014
Using Hystrix with Dropwizard
I've previously blogged about Hystrix/Tenacity and Breakerbox. The full code and outline of the example is located here.
Subsequently I've been using Dropwizard and Hystrix without Tenacity/Breakerbox and found it far simpler. I don't see a great deal of value in adding Tenacity and Breakerbox, as Hystrix uses Netflix's configuration library Archaius, which already comes with dynamic configuration via files, databases and ZooKeeper.
So let's see what is involved in integrating Hystrix and Dropwizard.
The example is the same as in this article. Briefly, it is a single service that calls out to three other services:
- A user service
- A pin check service
- A device service
To allow the application to reliably handle dependency failures we are going to call out to each of the three services using Hystrix commands. Here is an example of calling out using the Apache Http Client to a pin check service:
To execute this command you simply instantiate it and call execute(); Hystrix handles creating a work queue and thread pool. Each command executed with the same group will use the same work queue and thread pool. You tell Hystrix the group by passing it to super() when extending a Hystrix command. To configure Hystrix in a Dropwizard-like way we can add a map to our Dropwizard YAML:
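The original snippet is omitted here, but such a map might look like the following (the property names are standard Hystrix property names from its documentation; the top-level `hystrix` key and the values are illustrative):

```yaml
hystrix:
  hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds: 1500
  hystrix.threadpool.default.coreSize: 10
```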
This will translate to a Map in your Dropwizard configuration class:
The advantage of using a simple map, rather than a class with property names matching Hystrix's, is that it keeps you completely decoupled from Hystrix and its property naming conventions. It also lets users copy property names directly from the Hystrix documentation into the YAML.
Enabling Hystrix to pick these properties up requires a single line in your Dropwizard application class. This simplicity is due to the fact that Hystrix uses Archaius for property management.
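As a hedged sketch of that idea (a hypothetical helper, not the code from the post): a real application would push the map's entries into Archaius, e.g. via ConfigurationManager.getConfigInstance().setProperty(key, value), and Archaius also consults system properties by default, which is what this minimal stand-in uses:

```java
import java.util.Map;

// Hypothetical glue: copy the 'hystrix' map from the Dropwizard configuration
// into properties Archaius can see. Archaius reads system properties by
// default, so this minimal sketch uses System.setProperty; a real app might
// call ConfigurationManager.getConfigInstance().setProperty instead.
public final class HystrixPropertyLoader {
    private HystrixPropertyLoader() {}

    public static void load(Map<String, String> hystrixProperties) {
        for (Map.Entry<String, String> entry : hystrixProperties.entrySet()) {
            System.setProperty(entry.getKey(), entry.getValue());
        }
    }
}
```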
Now you can add any of Hystrix's many properties to your YAML, and later extend the Configuration you install to include a dynamic configuration source such as ZooKeeper.
I hope this shows just how simple it is to use Hystrix with Dropwizard without bothering with Tenacity. A full working example is on GitHub.
Thursday, August 21, 2014
Stubbed Cassandra at Skills Matter
Yesterday I gave a talk on how to test Cassandra applications using Stubbed Cassandra at Skills Matter in London for the Cassandra London meetup group.
The talk was well attended, with somewhere between 50 and 100 people.
The slides are on SlideShare:
And the talk is on the Skills Matter website.
Thanks to Cassandra London and Skills Matter for having me!
Thursday, August 7, 2014
RabbitMQ and Highly Available Queues
RabbitMQ is an AMQP broker with an interesting set of HA abilities. Do a little research and your head will start spinning working out the differences between making messages persistent, or queues durable, or was it durable messages and HA queues with transactions? Hopefully the following is all the information you need in one place.
Before evaluating them you need to define your requirements.
- Do you want queues to survive broker failures?
- Do you want unconsumed messages to survive a broker failure?
- What matters more, publisher speed, or the above? Or do you want a nice compromise?
RabbitMQ allows you to:
- Make a cluster of Rabbits where clients can communicate with any node in the cluster
- Make a queue durable, meaning the queue definition itself will survive broker failure
- Make a message persistent, meaning that it will get stored to disk, which you do by setting a message's delivery_mode
- Make a queue HA, meaning its contents will be replicated across brokers, either a specified list, all of them or a number of them
- Even an HA queue has a single master that handles all operations on that queue, even if the client is connected to a different node in the cluster; the master sends information to the replicas, which are called slaves
Okay so you have a durable queue that is HA and you're using persistent messages (you really want it all!). How do you work with the queue correctly?
Producing to an HA queue
You have three options for publishing to an HA queue:
- Accept the defaults, the publish will return with no guarantees in the result of broker failure
- Publisher confirms
- Transactions
The defaults: You went to all that effort of making a durable HA queue and sending a persistent message, and then you just fire and forget? Sounds crazy, but it's not. You might have done the above to make sure you don't lose a lot of messages, but you don't want the performance impact of waiting for any form of acknowledgement. You're essentially accepting a few lost messages when you lose a rabbit that is the master for any of your queues.
Transactions: To use RabbitMQ transactions you do a txSelect on your channel. Then when you publish a message you call txCommit, which won't return until your message has been accepted by the master and all of the queue's slaves. If your message is persistent then that means it is on the disk of them all; you're safe! What's not to like? The speed! Every persistent message published in a transaction results in an fsync to disk. You need a compromise, you say?
Publisher confirms: So you don't want to lose your messages, and you want to speed things up. You can enable publisher confirms on your channel. RabbitMQ will then send you a confirmation when the message has made it to disk on all the rabbits, but it won't do it right away; it flushes things to disk in batches. You can either block periodically or set up a listener to get notified. Then you can put logic in your publisher to do retries etc. You might even write logic to limit the number of published messages that haven't been confirmed. But wait, isn't queueing meant to be easy?
Consuming from an HA queue
Okay, so you have your message on the queue - how do you consume it? This is simpler:
- Auto-ack: As soon as a message is delivered RabbitMQ discards it
- Ack: Your consumer has to manually ack each message
If your consumer crashes and disconnects from Rabbit, then the message will be re-queued. However, if you have a bug and you just don't ack it, then Rabbit will keep hold of it until you disconnect, and then it will be re-queued. I bet that leads to some interesting bugs!
So what could go wrong?
This sounds peachy: you don't care about performance, so you have a durable HA queue with persistent messages and are using transactions for producing and acks when consuming. You've guaranteed exactly-once delivery, right? Well, no. Imagine your consumer crashes having consumed the message, but just before sending the ack. Rabbit will re-send the message to another consumer.
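What you actually get is at-least-once delivery, so the usual answer is to make consumers idempotent. A toy sketch of the idea, nothing Rabbit-specific (the message IDs are an assumption; a real system would persist the seen-set or make the processing itself naturally idempotent):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy idempotent consumer: at-least-once delivery means the same message can
// arrive twice (e.g. redelivered after a crash between processing and ack),
// so remember processed message IDs and skip duplicates.
public final class IdempotentConsumer {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();
    private int sideEffects = 0; // stands in for the real work

    // Returns true if the message was processed, false if it was a duplicate.
    public synchronized boolean handle(String messageId, String payload) {
        if (!processed.add(messageId)) {
            return false; // already seen: ack it again, but do no work
        }
        sideEffects++; // the real work happens exactly once per message ID
        return true;
    }

    public synchronized int sideEffectCount() {
        return sideEffects;
    }
}
```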
HA queueing is hard!
Conclusion
There is no magic bullet, you really need to understand the software you use for HA queueing. It is complicated and I didn't even cover topics like network partitions. Rabbit is a great piece of software and its automatic failover is really great but every notch you add on (transactions etc) will degrade your performance significantly.
Monday, August 4, 2014
Getting started with Hystrix and Tenacity to build fault tolerant applications
Applications are becoming increasingly distributed. Microservice architecture is the new rage. This means that each application you develop has more and more "integration points".
Any time you make a call to another service or database, or use any third party library that is a black box to you, it can be thought of as an integration point.
Netflix's architecture gives a great example of how to deal with integration points. They have a popular open-source library called Hystrix which allows you to isolate integration points by executing all calls in its own worker queue and thread pool.
Yammer have integrated Hystrix with Dropwizard, enabling enhancement of applications to publish metrics and accept configuration updates.
Here is an example application that calls out to three HTTP services and collects the results together into a single response.
Rather than calling into an HTTP library on the thread provided by Jetty, this application uses Yammer's Hystrix wrapper, Tenacity.
Let's look at one of the integration points:
Here we extend the TenacityCommand class and call the dependency in the run() method. Typically all this code would be in another class with the TenacityCommand just being a wrapper, but this is a self-contained example. Let's explain what it is doing:
- Making an HTTP call using the Apache HTTP client
- If it fails, throw a RuntimeException
By instantiating this TenacityCommand and calling execute(), your code is automagically executed on its very own thread pool, and requests are queued on its very own work queue. What benefits do you get?
- You get a guaranteed timeout, so no more relying on a library's read timeout that never seems to work in production
- You get a circuit breaker that opens if a configured % of calls fail, meaning you can fail fast and throttle calls to failing dependencies
- You get endpoints that show the configuration and whether a circuit breaker is open
If the call to run() fails, times out, or the circuit breaker is open, Tenacity will call the optional getFallback() method, so you can provide a fallback strategy rather than failing completely.
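That run()/getFallback() contract is easy to picture with a stripped-down stand-in. This is a toy, not the Hystrix/Tenacity implementation (the real libraries also run the command on its own thread pool and consult the circuit breaker):

```java
// Toy stand-in for the command contract: run() does the risky work, and if it
// throws (in the real libraries: fails, times out, or the breaker is open),
// the fallback result is returned instead of the failure propagating.
public abstract class ToyCommand<T> {
    protected abstract T run() throws Exception;

    protected T getFallback() {
        throw new UnsupportedOperationException("no fallback defined");
    }

    public final T execute() {
        try {
            return run();
        } catch (Exception e) {
            return getFallback();
        }
    }
}
```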
Another hidden benefit is how easy it is to move to a more asynchronous style of programming. Let's look at the resource class that pulls together the three dependencies:
Let's ignore the fact that we are calling out to other HTTP services from a resource layer. The above code shows how to use Tenacity synchronously. Apart from the advantages you gain regarding failures, all the calls still happen one by one: execute() blocks, so we don't call the second dependency until the first one has finished.
However, this doesn't have to be the case. Now that you've snuck Tenacity into your code base, you can change the code to something like this:
And without your colleagues realising, you've made all of your calls to your dependencies execute asynchronously and (possibly) at the same time; then you block to bring them all together at the end.
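The queue()-then-gather flow can be sketched with plain JDK futures (a toy stand-in with made-up result strings; Hystrix's queue() similarly returns a Future backed by the command's own thread pool):

```java
import java.util.concurrent.*;

// Toy version of the queue()-then-gather pattern: kick off all three
// dependency calls at once, then block only when combining the results.
public final class ParallelCalls {
    private ParallelCalls() {}

    public static String gatherAll() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            // queue() equivalents: all three calls start immediately
            Future<String> user   = pool.submit(() -> "user-ok");
            Future<String> pin    = pool.submit(() -> "pin-ok");
            Future<String> device = pool.submit(() -> "device-ok");
            // blocking happens once, at the end, to combine the results
            return user.get() + "," + pin.get() + "," + device.get();
        } finally {
            pool.shutdownNow();
        }
    }
}
```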
We've barely scratched the surface of Hystrix and Tenacity, but hopefully you can already see the benefits. All the code for this example, along with instructions on how to use WireMock to mock the dependencies, is here.