Thursday, May 28, 2015

Cassandra Aggregates - min, max, avg, group by

This blog has moved to batey.info and won't be updated here.

Disclaimer: all this was against 2.2-beta so the syntax may have changed.

Cassandra 2.2 introduced user defined functions and user defined aggregates. We can now do things like min, max and average on the server rather than having to bring all the data back to your application. Max and min are built in, but we'll see how you could have implemented them yourself.

Max/Min


Here's an example table for us to try and implement max/min against.
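A minimal scores table will do; the table and column names here are just placeholders:

CREATE TABLE scores (
    username text PRIMARY KEY,
    score int
);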


User defined aggregates work by calling your user defined function on every row returned from your query. They differ from a plain function in that the first parameter is state that is passed between rows, much like a fold.

Creating an aggregate is a two or three step process:
  1. Create a function that takes in state (any Cassandra type including collections) as the first parameter and any number of additional parameters
  2. (Optionally) Create a final function that is called after the state function has been called on every row
  3. Refer to these in an aggregate
For max we don't need a final function but we will for average later.
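Here's a sketch of both steps for max (state_max and my_max are names I've picked):

CREATE FUNCTION state_max(current int, candidate int)
    CALLED ON NULL INPUT
    RETURNS int
    LANGUAGE java
    AS 'if (current == null) return candidate; return Math.max(current, candidate);';

CREATE AGGREGATE my_max(int)
    SFUNC state_max
    STYPE int
    INITCOND null;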


Here we're using Java for the language (you can also use JavaScript) and just using Math.max. For our aggregate definition we start with an INITCOND of null (so it will return null for an empty table) and then set the state to be the max of the current state and the value passed in. We can use our new aggregate like:
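For example, against the scores table above:

SELECT my_max(score) FROM scores;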


GroupBy


So there's no group by keyword in Cassandra, but you can get similar behaviour with a custom user defined aggregate. Imagine you had a table that kept track of everything your customers did, e.g.:
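Something like this (the exact columns are illustrative; event_type and origin are the ones we'll aggregate on):

CREATE TABLE customer_events (
    customer_id text,
    time timeuuid,
    event_type text,
    origin text,
    PRIMARY KEY (customer_id, time)
);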



We can write a UDA to get a count of a particular column:
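A sketch, with state_group_and_count and group_and_count as placeholder names; the state is a map from the column value to its count:

CREATE FUNCTION state_group_and_count(state map<text, int>, type text)
    CALLED ON NULL INPUT
    RETURNS map<text, int>
    LANGUAGE java
    AS '
        Integer count = (Integer) state.get(type);
        if (count == null) count = 1; else count = count + 1;
        state.put(type, count);
        return state;
    ';

CREATE AGGREGATE group_and_count(text)
    SFUNC state_group_and_count
    STYPE map<text, int>
    INITCOND {};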


And we keep track of the counts in a map. Example use for counting both the event_type and the origin of the event:
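Assuming the customer_events table above:

SELECT group_and_count(event_type) FROM customer_events;
SELECT group_and_count(origin) FROM customer_events;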


More often than not, when you use group by in other databases you are totalling another field. For example, imagine we were keeping track of customer purchases and wanted the total amount each customer has spent:
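Again the exact table is up to you; a minimal version:

CREATE TABLE customer_purchases (
    customer_id text,
    time timeuuid,
    amount int,
    PRIMARY KEY (customer_id, time)
);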



We can create a generic aggregate for that called group_and_total:
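A sketch along the same lines as group_and_count, but adding the amount rather than incrementing a count:

CREATE FUNCTION state_group_and_total(state map<text, int>, key text, amount int)
    CALLED ON NULL INPUT
    RETURNS map<text, int>
    LANGUAGE java
    AS '
        Integer current = (Integer) state.get(key);
        if (current == null) current = amount; else current = current + amount;
        state.put(key, current);
        return state;
    ';

CREATE AGGREGATE group_and_total(text, int)
    SFUNC state_group_and_total
    STYPE map<text, int>
    INITCOND {};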


And an example usage:
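Grouping the spend by customer; the result is a single row containing a map (the numbers here are made up):

SELECT group_and_total(customer_id, amount) FROM customer_purchases;

 group_and_total(customer_id, amount)
--------------------------------------
 {'chbatey': 22, 'haddad': 349}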


As you can see, Haddad spends way too much.

Average


The Cassandra docs have an example of how to use a user defined aggregate to calculate an average: http://cassandra.apache.org/doc/cql3/CQL-2.2.html#udas
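The approach, roughly: keep a (count, sum) tuple as state and divide in a final function. A sketch (the function names are mine):

CREATE FUNCTION state_avg(state tuple<int, bigint>, value int)
    CALLED ON NULL INPUT
    RETURNS tuple<int, bigint>
    LANGUAGE java
    AS '
        if (value != null) {
            state.setInt(0, state.getInt(0) + 1);
            state.setLong(1, state.getLong(1) + value.intValue());
        }
        return state;
    ';

CREATE FUNCTION final_avg(state tuple<int, bigint>)
    CALLED ON NULL INPUT
    RETURNS double
    LANGUAGE java
    AS '
        if (state.getInt(0) == 0) return null;
        return Double.valueOf((double) state.getLong(1) / state.getInt(0));
    ';

CREATE AGGREGATE average(int)
    SFUNC state_avg
    STYPE tuple<int, bigint>
    FINALFUNC final_avg
    INITCOND (0, 0);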

Small print


If you've ever heard me rant about distributed databases you've probably heard me talk about scalable queries: ones that work on a 3 node cluster as well as on a 1000 node cluster. User defined functions and aggregates are executed on the coordinator, so if you don't include a partition key in your query, all the results are brought back to the coordinator for your function to be executed. If you do a full table scan for your UDF/UDA, don't expect it to be fast if your table is huge.

This functionality is in beta for a very good reason: it is user defined code running in your database! Proper sandboxing, e.g. with a SecurityManager, will be added before this goes mainstream.




Sunday, May 10, 2015

Building well tested applications with SpringBoot / Gradle / Cucumber / Gatling

I am a huge test first advocate. Since seeing Martin Thompson speak I am now trying to include performance testing with the same approach. I am going to call this approach Performance, Acceptance and Unit Test Driven Development, or PAUTDD :)

Tools/frameworks/libraries come and go, so I'll start with the theory then show how I set this up using JUnit and Mockito for unit testing, Cucumber for acceptance tests and Gatling for performance tests. Well, I won't show JUnit and Mockito because that is boring!

So here's how I develop a feature:
  1. Write a high level end to end acceptance test. There will be times where I'll want acceptance tests not to be end to end, like if there was an embedded rules engine.
  2. Write a basic performance test.
  3. TDD with unit testing framework until the above two pass.
  4. Consider scenario based performance test.
I hate to work without tests first at the acceptance level; even for a small feature (half a day of dev?) I find them invaluable for keeping me on track and letting me know when I am functionally done. Especially if I end up doing a bit too much unit testing/mocking (bad Chris!), as when head down in code it is easy to forget the big picture: what functionality you are developing for a user.

Next is a newer part of my development process. Here I want a performance test for the new feature; this may not always be applicable, but it usually is. Whichever framework I am using here, I want to be able to run it locally and as part of my continuous integration for trend analysis. However, I want more for my effort than that: I really want to be able to use the same code for running against various environments running different hardware.

I hate performance tests that aren't in source control and versioned. Tools that use a GUI are no use to me. I found while working at my last company that the "performance testers" would constantly change their scripts, and this would change the trend way more than any application changes. I want to be able to track both.

So how to set all this up for a Spring Boot application using Cucumber and Gatling?


This post is on build setup, not the actual writing of the tests. My aim is to enable easy acceptance/performance testing.

Here's the layout of my latest project:
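Roughly, three source trees side by side:

src
├── main
│   └── java
├── test
│   └── java
└── e2e
    ├── java
    ├── resources
    └── scala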


Main is the source, test is the unit tests and e2e is both the Cucumber and the Gatling tests. I could have had separate source sets for the Cucumber and Gatling tests but that would have confused IntelliJ's Gradle support too much (and they are nicely split by Cucumber being Java and Gatling being Scala).

Cucumber JVM


There are various articles on Cucumber-JVM; here are the steps I used to get this running nicely in the IDE and via Gradle.

First the new source set:
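Something like this in the build.gradle (the scala line assumes the scala plugin is applied, which Gatling needs later):

sourceSets {
    e2e {
        java.srcDir file('src/e2e/java')
        scala.srcDir file('src/e2e/scala')
        resources.srcDir file('src/e2e/resources')
        compileClasspath += sourceSets.test.runtimeClasspath
        runtimeClasspath += sourceSets.test.runtimeClasspath
    }
}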

Nothing exciting here; we are using the same classpath as test, though we could have had a separate one.

Next are the dependencies; this actually includes the Gatling and HTTP client (for hitting our application) dependencies as well.
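For example (the versions are just the ones current at the time; check for newer):

dependencies {
    e2eCompile 'info.cukes:cucumber-java:1.2.2'
    e2eCompile 'info.cukes:cucumber-junit:1.2.2'
    e2eCompile 'info.cukes:cucumber-spring:1.2.2'
    e2eCompile 'io.gatling.highcharts:gatling-charts-highcharts:2.1.6'
    e2eCompile 'org.apache.httpcomponents:httpclient:4.4.1'
}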

This gives us the Cucumber JUnit and Spring dependencies.

Next is the source code for the acceptance tests. The Features are in the resources folder and the source is in the java folder. To allow running via Gradle you also create a JUnit test to run all the features. IntelliJ should work fine without this, by just running the feature files.



Here I have Features separated by type, here's an example Feature:
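For example, a made-up registration feature:

Feature: User registration

  Scenario: A new user registers
    Given the application has started
    When a user registers with the username "chris"
    Then that user can log in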


I like to keep the language at a high level so a non-techy can write these. The JUnit test RunEndToEndTests looks like this:
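A minimal sketch; the glue package is hypothetical and should point at wherever your Steps live:

import cucumber.api.CucumberOptions;
import cucumber.api.junit.Cucumber;
import org.junit.runner.RunWith;

@RunWith(Cucumber.class)
@CucumberOptions(
        features = "src/e2e/resources",   // where the .feature files live
        glue = "com.example.e2e")         // hypothetical package containing the Steps
public class RunEndToEndTests {
}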


This is what Gradle will pick up when we run this from the command line. You could separate this out into multiple tests if you wanted.

For running inside IntelliJ you might need to edit the run configuration to include a different Glue, as by default it will be the same as the package your Feature file is in; for this project that wouldn't pick up the GlobalSteps, as they are outside of the security/users folders. This is what my configuration looks like; I set this as the default:


Now our features will run. If you want to see what the implementation of the Steps looks like, check out the whole project from GitHub.

Gatling


This is my first project using Gatling; I wanted my scenarios in code that I could have in version control. Previously I've used JMeter. Whereas you can check in the XML, it really isn't nice to look at in diff tools. I've also been forced *weep* to use more GUI based tools like SOASTA and HP LoadRunner. One thing I haven't looked at is Gatling's support for running many agents. For me to continue using Gatling beyond developer trend analysis this needs to be well supported.

We already have the dependencies, see the dependencies section above, and we're going to use the same source set. The only difference is we're going to be writing these tests in Scala.

My first requirement was not to have to install Gatling manually on developer and CI boxes. Here's how to do this in Gradle:
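A sketch of the JavaExec task; the CLI flags are the Gatling 2.1 ones (-sf for the simulations folder, -s for the simulation, -rf for results), so double-check them against your Gatling version:

task gatling(type: JavaExec) {
    description 'Runs the Gatling load tests'
    classpath = sourceSets.e2e.runtimeClasspath
    main = 'io.gatling.app.Gatling'
    args = ['-sf', 'src/e2e/scala',                 // where our simulations live
            '-s', 'BasicSimulation',                // which simulation to run
            '-rf', "${buildDir}/reports/gatling"]   // where to write the reports
}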

Where BasicSimulation is the fully qualified name of my Gatling load test. All we do here is define a JavaExec task with the Gatling main class, tell Gatling where our source is and which simulation to run.

To tie all this together so it runs every time we run Gradle check, we add the following at the bottom of our Gradle build file:
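That is just a task dependency:

check.dependsOn gatling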


This will produce reports that can be published + consumed by the Jenkins Gatling plugin.

To run the same load test from IntelliJ we do exactly the same in a run configuration:



A basic Gatling test looks like this:
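A sketch of such a simulation; Application here is a placeholder for your Spring Boot main class:

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import org.springframework.boot.SpringApplication
import scala.concurrent.duration._

class BasicSimulation extends Simulation {

  // Start the Spring Boot application and stop it when the simulation JVM exits.
  val context = SpringApplication.run(classOf[Application])
  sys.addShutdownHook(context.close())

  val httpConf = http.baseURL("http://localhost:8080")

  // A single virtual user hitting /api/auction 100 times, pausing 10ms between requests.
  val scn = scenario("Auctions")
    .repeat(100) {
      exec(http("Get all auctions").get("/api/auction"))
        .pause(10.milliseconds)
    }

  setUp(scn.inject(atOnceUsers(1))).protocols(httpConf)
}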


This runs a very simple test where a single virtual user hits the /api/auction URL 100 times with a pause of 10 milliseconds. The top couple of lines start the Spring Boot application and register a shutdown hook to stop it.

We'll then end up with a report that looks like this:


This is a pretty terrible load test, as it runs for 2 seconds and has a single user. But the point of this post is to set up everything so that when adding new functionality it is trivial to add new performance and acceptance tests.

That's it, happy testing! If you want to see it, the whole project is on GitHub. It is under active development, so you'll know how I got on with Gatling based on whether it is still there and whether there are lots of tests that use it!

Monday, May 4, 2015

Strata workshop: Getting started with Cassandra

Downloading and installing Cassandra:

Linux/Mac:
curl -L http://downloads.datastax.com/community/dsc-cassandra-2.1.4-bin.tar.gz | tar xz

(or use Homebrew)

Then run:

./bin/cassandra

To start cqlsh (you may need to install Python):

./bin/cqlsh

Windows:
http://planetcassandra.org/cassandra/ or grab a USB key from me.

Workshop code (we may not get to this):

Cql docs:

Cassandra docs:

Java Driver Docs:

Data modelling exercises:

First create the keyspace:


CREATE KEYSPACE killrauction WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

1) Get into CQLSH and create a table for users
- username
- firstname
- lastname
- emails
- password
- salt for password
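One possible answer, using a set for the emails:

CREATE TABLE users (
    username text PRIMARY KEY,
    firstname text,
    lastname text,
    emails set<text>,
    password text,
    salt text
);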

2) Auction item table (no bids)
- name
- identifier?
- owner
- expiration
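One possible answer, using a uuid for the identifier:

CREATE TABLE auction_items (
    item_id uuid PRIMARY KEY,
    name text,
    owner text,
    expiration timestamp
);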

3) The bids

Data:
 - item identifier
 - bid time
 - bid user
 - bid amount

Considerations
 - Avoid sorting in the application
 - Two bids the same price?
 - Really fast sequential access
 - Current winner?
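One possible answer: partition by item and cluster by bid amount (descending), with the timeuuid bid time as a tie-breaker for two bids at the same price. Bids for an item are then stored sequentially within the partition, already sorted, and the current winner is simply the first row:

CREATE TABLE auction_bids (
    item_id uuid,
    bid_amount bigint,
    bid_time timeuuid,
    bid_user text,
    PRIMARY KEY (item_id, bid_amount, bid_time)
) WITH CLUSTERING ORDER BY (bid_amount DESC, bid_time DESC);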