Thursday, May 28, 2015

Cassandra Aggregates - min, max, avg, group by

This blog has moved to batey.info and won't be updated here.

Disclaimer: all this was against 2.2-beta so the syntax may have changed.

Cassandra 2.2 introduced user defined functions and user defined aggregates. We can now do things like min, max and average on the server rather than having to bring all the data back to your application. Max and min are built in but we'll see how you could have implemented them your self.

Max/Min


Here's an example table for us to try and implement max/min against.


User defined aggregates work by calling your user defined function on every row returned from your query, they differ from a function because the first value to the function is state that is passed between rows, much like a fold.

Creating an aggregate is a two or three step process:
  1. Create a function that takes in state (any Cassandra type including collections) as the first parameter and any number of additional parameters
  2. (Optionally) Create a final function that is called after the state function has been called on every row
  3. Refer to these in an aggregate
For max we don't need a final function but we will for average later.


Here we're using Java for the language (you can also use JavaScript) and just using Math.max. For our aggregate definition we start with (INITCOND) null (so it will return null for an empty table) and then set the state to be the max of the current state and the value passed in. We can our new aggregate like:


GroupBy


So there's no group by keyword in Cassandra but you can get similar behaviour with a custom user defined aggregate. Imagine you had a table that kept track of everything your customers did e.g



We can write a UDA to get a count of a particular column:


And we keep track of the counts in a map. Example use for counting both the event_type and the origin of the event:


More often than not when you use group by in other databases you are totalling another field. For example imagine we were keeping track of customer purchases and wanted a total amount each customer has spent:



We can create a generate aggregate for that called group_and_total:


And an example usage:


As you can see Haddad spends way too much.

Average


The Cassandra docs have an example of how to use a user defined aggregate to calculate aggregates: http://cassandra.apache.org/doc/cql3/CQL-2.2.html#udas

Small print


If you've ever heard me rant about distributed databases you've probably heard me talk about scalable queries, ones that work on a 3 node cluster as well as a 1000 node cluster. User defined functions and aggregates are executed on the coordinator. So if you don't include a partition key in your query all the results are brought back to the coordinator for your function to be executed, if you do a full table scan for your UDF/A don't expect it to be fast if your table is huge.

This functionality is in beta for a very good reason, it is user defined code running in your database! Proper sandboxing e.g with a SecurityManager will be added before this goes mainstream.




10 comments:

vignesh m said...
This comment has been removed by a blog administrator.
karmadip dodiya said...

Great job,
Can you know when approximate Cassandra 2.2 production version will release?

Christopher Batey said...

2.2 is on release candidate 2 so it will be released soon. As to when it is "production ready" that is normally after a few point releases. A general rule of thumb is to wait for DataStax to roll it into DataStax Enterprise :)

keypoint said...

Hi Christopher,

2.2 is now released and I read the documentation, but still no "group by" function. Is it still needed to write a separate function like "state_group_and_count" to achieve group by in SQL? I'm new and not quite sure. Thanks a lot.

Arbi Akhina said...

I'm having a OperationTimedOut: errors={}, last_host=127.0.0.1 when I run a similar aggregate query from CQLSH http://stackoverflow.com/questions/32567571/failed-to-run-custom-aggregation-operationtimedout-errors-last-host-127-0-0

Martin said...

Hi Christoper,

Great examples. Unfortunately the grouping example does not seem to work anymore. In Cassandra 3 beta 2 the incoming state map is a java.util.collections$UnmodifiableMap and it fails with a java.lang.UnsupportedOperationException when using the function.

Anuj said...

Hi Christoper,

I am new to spark/cassandra. I'm using cassandra 2.2 & spark 1.5.1 for internal POC work.
I am successfully able to execute aggregate functions through cqlsh.

I wanted to know is there any way to call the same function using datastax spark cassandra java API ? or I can create/register user defined function in spark itself and perform the same operations on cassandra rows ?

I have already spent couple of days without any success.

Thanks
Anuj Jain

Muhammad Rafif Murazza said...

Hi Christopher,

I just want to ask you one question, can we use this UDA to aggregates values from collections data type?
for examples, I have a field set .
Id | setfield |
1 | {a,b} |
2 | {a,c} |
3 | {a,b} |

is it possible to count the value so that I get the result:
{a : 3, b :2, c:1} ?

Thank you.
Raz

Pratik Shekhar said...

I really appreciate information shared above. It’s of great help. If someone want to learn Online (Virtual) instructor lead live training in TECHNOLOGY , kindly contact us http://www.maxmunus.com/contact
MaxMunus Offer World Class Virtual Instructor-led training on TECHNOLOGY. We have industry expert trainer. We provide Training Material and Software Support. MaxMunus has successfully conducted 100000+ pieces of training in India, USA, UK, Australia, Switzerland, Qatar, Saudi Arabia, Bangladesh, Bahrain and UAE etc.
For Demo Contact us.
Pratik Shekhar
MaxMunus
E-mail: pratik@maxmunus.com
Ph:(0) +91 9066268701
http://www.maxmunus.com/

Dattatray Bhosale said...

I really appreciate information shared above. It’s of great help. If someone want to learn Online (Virtual) instructor lead live training in Apache Cassandra, kindly contact us http://www.maxmunus.com/contact
MaxMunus Offer World Class Virtual Instructor led training on TECHNOLOGY. We have industry expert trainer. We provide Training Material and Software Support. MaxMunus has successfully conducted 100000+ trainings in India, USA, UK, Australlia, Switzerland, Qatar, Saudi Arabia, Bangladesh, Bahrain and UAE etc.
For Demo Contact us.
Sangita Mohanty
MaxMunus
E-mail: sangita@maxmunus.com
Skype id: training_maxmunus
Ph:(0) 9738075708 / 080 - 41103383
http://www.maxmunus.com/