Comments on "Cassandra anti-pattern: Misuse of unlogged batches" (Christopher Batey's Blog, now at www.batey.info)

Anonymous (2016-01-18):
Yep, your comment ends the discussion ;). I somehow missed that section.

Martin Grotzke (2016-01-14):
@Pawel So it seems that I've failed to explain this well enough :-) The "Single Partition Batches" section (https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/#single-partition) says, under "When should you use single partition batches?":

> Single partition batches may also be used to increase the throughput compared to multiple un-batched statements. Of course you must benchmark your workload with your own setup/infrastructure to verify this assumption. If you don't want to do this you shouldn't use single partition batches if you don't need atomicity/isolation.

In my experience single partition batches do most often increase throughput, but it may depend on your setup/infrastructure/workload/data.

Does this answer your question?

Anonymous (2016-01-14):
@Martin, well, my understanding of your blog post is that it is better to execute multiple async inserts when writing many rows under different partition keys, as they will potentially go to different boxes in the cluster.

What I am asking is: is it still better to do multiple async inserts when writing with the same partition key (the first part of the compound primary key) but different clustering keys (the following parts of the primary key)? My gut feeling is that running a single batch should be better, as only one box will coordinate the inserts to the target node and less data will be transferred.

Martin Grotzke (2016-01-13):
@pinkpanther, @pawel.kaminski: I think this blog post explained it already, but perhaps the logged vs unlogged distinction is sometimes a bit confusing. I tried to put it differently and focus more on single vs multi partition batches; hopefully it's helpful: "Cassandra - to BATCH or not to BATCH" (https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/)

Anonymous (2015-12-23):
I was thinking about the same thing pinkpanther mentioned.

When it comes to the same partition key, a batch should be the better option, as there are fewer connections open at once. Let's say I want to issue 10,000 inserts for the same client id (with different values) at once; I guess a batch should work better.

Waiting for any comment on that!

PS. @Stefan G: updates are based on the primary key only. Cassandra can efficiently find the exact node in the cluster to store/update a row.
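As an aside on the single-partition discussion above: if you do batch, the statements can first be grouped by partition key so that each unlogged batch touches exactly one partition. A minimal sketch of that grouping, assuming statements arrive as `(partition_key, cql)` pairs and using an illustrative size cap (neither is a driver API):

```python
from collections import defaultdict

def single_partition_batches(statements, max_batch_size=50):
    """Group (partition_key, cql) pairs by partition key, then split
    each group into chunks of at most max_batch_size statements, so
    every resulting batch targets exactly one partition."""
    by_partition = defaultdict(list)
    for pk, cql in statements:
        by_partition[pk].append(cql)
    batches = []
    for pk, stmts in by_partition.items():
        for i in range(0, len(stmts), max_batch_size):
            batches.append((pk, stmts[i:i + max_batch_size]))
    return batches
```

Each resulting group can then be sent as one UNLOGGED batch; whether that beats individual async inserts still needs benchmarking on your own cluster, as Martin notes above.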
Stefan G (2015-12-18):
Hope this is not too off-topic, but if someone can help a Cassandra newbie, it would be appreciated.

Suppose there is the following table in Cassandra 3.0:

CREATE TABLE mykeyspace.users (
    user_id int PRIMARY KEY,
    fname text,
    lname text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
CREATE INDEX lname_index ON mykeyspace.users (lname);

Now suppose the users table contains the following data:

cqlsh:mykeyspace> select * from users;

 user_id | fname   | lname
---------+---------+----------
       1 | Stefan  | Jones
       2 | Trevor  | Richmond
       3 | Allison | Richmond

There is a secondary index on the lname column.

Is it possible to write a single-shot statement that will find all the rows where lname = 'Richmond' and change that last name to 'Smith'?

I have tried to write such a statement from a Java client using the DataStax driver, but there is always a complaint that a partition key is not being used. I don't understand why this is a problem. Doesn't Cassandra have local secondary indices on each node? If so, why can't such a statement simply be distributed to the nodes by some coordinator, with each node locally performing its portion of the statement using its local index on the users(lname) column? Note that this statement does not refer to the primary key of the users table; instead, execution of the statement, at least as I had hoped it would work, would simply be delegated to the individual nodes of the cluster.

What am I missing? Is it possible to efficiently write a statement that performs an update based on a secondary index without referring to a primary key or partition key? If not, why not? Could some future version of Cassandra execute such a statement as I described, even if such statements simply aren't supported at present? Or am I missing some fundamental idea of the data architecture?
pinkpanther (2015-04-22):
Hi,

So, is it okay to use unlogged batches for inserting into two column families with the same partition key? Like:

INSERT INTO CF1 (uuid, ...) VALUES (UUID_X, ...);
INSERT INTO CF2 (uuid, ...) VALUES (UUID_X, ...);

Does the same argument hold when two CFs are used in an unlogged batch but with the same partition key value (UUID_X = UUID_X)?

And also, are UNLOGGED batches comparable to LOGGED batches (atomicity and all) when every statement uses the same partition key value?

Thanks

chbatey (2015-03-20):
I was suggesting not to use Cassandra batches at all, but to use the executeAsync functionality in the DataStax drivers, which will give you back a Future. Here's a good article on the different ways of pulling together all the Futures: http://www.datastax.com/dev/blog/java-driver-async-queries

You'd still want to check every Future, otherwise you won't know if any failed, unless you have a retry policy in place that logs/alerts sufficiently.
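The "fire everything async, then check every Future" pattern described above can be sketched as follows. This is a stdlib simulation using a plain thread pool; `do_insert` stands in for a real driver call such as the Java driver's `session.executeAsync` and is an assumption for illustration only:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def insert_all(rows, do_insert, max_in_flight=32):
    """Submit every insert up front, then wait for all of them,
    collecting the rows whose insert raised an exception, so that
    no failure is silently dropped."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = {pool.submit(do_insert, row): row for row in rows}
        for fut in as_completed(futures):
            try:
                fut.result()
            except Exception:
                failed.append(futures[fut])
    return failed
```

The returned list of failed rows is what a retry policy or alerting hook would act on.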
Anonymous (2015-03-19):
Hi Chris,
This article is very informative, thanks. Can you explain a bit more your answer to one of the questions above: "so do lots of async inserts before awaiting, then once you have executed all the async inserts then await for them all to complete"?

Are you referring to executing asynchronous requests individually in batches (say, issue 100 writes, wait for all of them to be successful, then issue the next 100, and so on)? Is there a way to know how long we should wait to get a confirmation on a write? Adding a listener for every write would be overkill for a heavy write workload, right?

Thanks
Srivatsan

chbatey (2015-03-03):
Hi Edward, there are two config options that will help you:

spark.cassandra.output.batch.size.rows
spark.cassandra.output.batch.size.bytes
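For reference, these options are set on the Spark configuration; the values below are illustrative only and should be tuned against your own cluster (they cap each batch the connector generates by row count and by payload size, respectively):

```
spark.cassandra.output.batch.size.rows=50
spark.cassandra.output.batch.size.bytes=1024
```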
Anonymous (2015-02-13):
Hey Chris,

If I am using DataStax's spark-cassandra-connector, how do I control the way the connector writes to the Cassandra ring? The sc contains 50,000 columns to be inserted under a single row key, and Spark currently complains that the batch is too large:

Failed to execute: com.datastax.driver.core.BatchStatement@f5b9a66
com.datastax.driver.core.exceptions.InvalidQueryException: Batch too large

What is the best approach for writing a large number of columns to Cassandra from Spark?
chbatey (2015-02-10):
Hi Oleksii - Thanks!

1) By default the DataStax drivers now use a token-aware policy: the driver hashes your partition key and sends the request to a replica for that partition.

2) The advantage is that you can execute the inserts in parallel: do lots of async inserts before awaiting, and once you have executed all of them, await for them all to complete. The total time will then be your slowest query out of, say, 10 inserts, rather than the sum of the latencies.

oleksii.mdr (2015-02-10):
This is excellent. I've made this very mistake; now I need to go back and unfix my 'improvement'.

Christopher, a couple of questions:

1. How does the driver know where to send the requests if I use async inserts with different partition keys and vnodes with automatic key-range distribution?

2. How is sync different from async for multiple inserts? Is there any difference between a sync call to insert rows and an 'await async' call to insert? (Await is a C# sort of blocking statement for async operations.)

Thanks, Oleksii.
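The latency claim in point (2) above, that awaiting a group of async inserts costs roughly the slowest single insert rather than the sum of all latencies, can be demonstrated with a small stdlib simulation. Here `time.sleep` stands in for per-insert latency; no real driver is involved:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def parallel_elapsed(latencies):
    """Run all simulated 'inserts' concurrently and return the
    wall-clock time taken: close to max(latencies), not sum(latencies)."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(latencies)) as pool:
        # pool.map blocks until every simulated insert has completed,
        # which is the "await them all" step.
        list(pool.map(time.sleep, latencies))
    return time.monotonic() - start
```

Ten simulated 200 ms inserts finish in roughly 200 ms of wall-clock time, whereas issuing them synchronously one after another would take about 2 seconds.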