Tuesday, October 21, 2014

Building fault tolerant services: Do you handle timeouts correctly?

One of the simplest-sounding requirements that you might get as a software engineer is that your service should either succeed or time out within N seconds.

However, as we move toward more distributed services (a.k.a. microservices), this is harder than it sounds.

Even if all your service does is call out to another HTTP service over TCP, do some logic, and return a response, you have to deal with:
  • Thread creation duration
  • Socket connection timeout to the dependency
  • Socket read timeout to the dependency
  • Resource acquisition, e.g. how long it takes for your request thread to get hold of a resource from a pool
If you stop there you might think you have all your bases covered:


Looks good: you are taking into account the time taken to get a resource from a resource pool for the third party, any connection timeouts in case you need to reconnect, and finally a socket read timeout on the request.
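As a rough illustration of those three timeouts, here is a minimal Java sketch. It is not taken from any particular HTTP client: the class name, the method name and the use of a Semaphore as a stand-in for a connection pool are all assumptions made for the example.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DependencyTimeouts {
    // Hypothetical helper: `pool` is a Semaphore standing in for a real resource pool.
    // The caller is expected to close the socket and release the permit when done.
    static Socket openWithTimeouts(Semaphore pool, String host, int port,
                                   long acquireMs, int connectMs, int readMs)
            throws IOException, InterruptedException, TimeoutException {
        // 1. Resource acquisition: give up if the pool has nothing free in time.
        if (!pool.tryAcquire(acquireMs, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException("no pooled resource within " + acquireMs + " ms");
        }
        Socket socket = new Socket();
        try {
            // 2. Connection timeout: bounds how long the TCP connect may take.
            socket.connect(new InetSocketAddress(host, port), connectMs);
            // 3. Read timeout: bounds how long a single read() may block.
            socket.setSoTimeout(readMs);
            return socket;
        } catch (IOException e) {
            socket.close();
            pool.release();
            throw e;
        }
    }
}
```

Note that the read timeout applies to each individual read, not to the response as a whole.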

This covers timing out in most cases, but what happens if the dependency is feeding you data very slowly? 

Here a socket read timeout won't help you, because the underlying socket library is receiving some data within each read timeout period, so the timeout never fires. For large payloads this scenario can leave your application appearing to hang.
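To make the slow-trickle case concrete, the following self-contained Java sketch starts a throwaway local server that sends one byte at a time with a pause between bytes, and reads from it with a read timeout that is longer than the pause. The class and method names are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class SlowTrickleDemo {
    // Reads a response of `totalBytes` from a local server that pauses `gapMs`
    // after every byte, using a socket read timeout of `readTimeoutMs`.
    // Returns the elapsed time in milliseconds.
    static long readSlowResponse(int totalBytes, long gapMs, int readTimeoutMs)
            throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread slowServer = new Thread(() -> {
                try (Socket client = server.accept();
                     OutputStream out = client.getOutputStream()) {
                    for (int i = 0; i < totalBytes; i++) {
                        out.write('x');
                        out.flush();
                        Thread.sleep(gapMs); // drip-feed the payload
                    }
                } catch (IOException | InterruptedException e) {
                    // Demo only: the client will simply see the stream end.
                }
            });
            slowServer.setDaemon(true);
            slowServer.start();

            long start = System.nanoTime();
            try (Socket socket = new Socket("127.0.0.1", server.getLocalPort())) {
                socket.setSoTimeout(readTimeoutMs); // applies to each read() separately
                InputStream in = socket.getInputStream();
                while (in.read() != -1) {
                    // Every read returns before the timeout, so it never fires.
                }
            }
            return (System.nanoTime() - start) / 1_000_000;
        }
    }
}
```

Calling `readSlowResponse(10, 100, 300)` should take roughly a second in total, well past the 300 ms read timeout, without a SocketTimeoutException ever being thrown.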

So how do you solve this? To be sure that your application will time out, you can't rely on network-level timeouts alone. A common pattern is to have a worker queue and thread pool for each dependency; that way you can time out the request in process, however slowly the data arrives. A fantastic library for this is Netflix's Hystrix.
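The worker queue and thread pool pattern can be sketched with nothing but java.util.concurrent. This is not Hystrix's API, just an illustration of the idea; the class name, pool size and queue size are arbitrary assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class InProcessTimeout {
    // One bounded queue and pool per dependency (sizes are illustrative),
    // so a slow dependency can only tie up its own workers.
    private static final ExecutorService DEPENDENCY_POOL = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(16),
            runnable -> {
                Thread t = new Thread(runnable, "dependency-worker");
                t.setDaemon(true);
                return t;
            });

    // Runs `call` on the dependency's pool and waits at most `timeoutMs`
    // for the whole call, including any time spent waiting in the queue.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs) throws Exception {
        Future<T> future = DEPENDENCY_POOL.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; a thread blocked in socket I/O may not
            // react to the interrupt, so the worker can stay busy for a while.
            future.cancel(true);
            throw e;
        }
    }
}
```

The request thread is released when the timeout expires, regardless of what the network is doing. The worker thread may remain occupied, which is one reason to keep the pool separate and bounded for each dependency.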

How do you do automated testing for this? If you're like me and love to test everything, then this is a tough one. However, by running your dependency (or a mock like WireMock) on a separate VM that is provisioned for the test, and then using Linux commands like iptables and tc, you can automate tests for a slow network. Saboteur is a small Python library that does this for you, and offers an HTTP API for slowing the network, dropping packets, etc.

Isn't this slow? On a Jenkins server, provisioning the VM takes ~1 minute with Vagrant once the base box is downloaded. For development I always have it running.

The whole stack of an application under test, WireMock, Vagrant and Saboteur will be the topic of a follow-up post that will contain a full working example.

This article showed how complicated this is for a single dependency. What if you call out to many dependencies? I tend to use a library like Hystrix to wrap calls to individual dependencies, but in-house code to wrap the whole request with its own timeout. This allows one dependency to be slow as long as the others are fast, which is more flexible than taking your SLA and dividing it between your dependencies.
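One way such a whole-request timeout might look is a shared deadline that each dependency call draws from, rather than a fixed slice per dependency. The sketch below is an assumption about the shape of that in-house code, not the actual implementation; the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RequestDeadline {
    // Calls each dependency in turn; every call may use whatever is left
    // of the total budget instead of a fixed per-dependency share.
    static <T> List<T> callAllWithin(List<Callable<T>> calls, ExecutorService pool,
                                     long totalBudgetMs) throws Exception {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(totalBudgetMs);
        List<T> results = new ArrayList<>();
        for (Callable<T> call : calls) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new TimeoutException("request budget of " + totalBudgetMs + " ms used up");
            }
            Future<T> future = pool.submit(call);
            try {
                results.add(future.get(remainingNanos, TimeUnit.NANOSECONDS));
            } catch (TimeoutException e) {
                future.cancel(true); // best effort; blocked I/O may ignore the interrupt
                throw e;
            }
        }
        return results;
    }
}
```

With a 1000 ms budget and two dependencies, a first call taking 300 ms still leaves about 700 ms for the second. Splitting the budget evenly would have allowed each call 500 ms, so a 600 ms first call followed by a 100 ms second call would fail even though the request as a whole finishes in time.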