Gangadhara M.B asked ·

How do I enable logging of INSERT failures to Cassandra logs?

[FOLLOW UP QUESTION TO #5145]

We are using a batch job to load a large volume of data from Oracle databases into Cassandra. Application users are reporting batch job failures (data insert failures) like the ones below, but I don't see the failed INSERT statements that lead to the error state in either system.log or debug.log. What type of logging needs to be enabled to get the failed INSERT statements logged in system.log or debug.log?

Cassandra instance type: EC2 R5.2xlarge (8 CPU, 64 GB RAM) with a 1 TB EBS SSD volume. There are 9 Cassandra nodes in total, and it's a new cluster setup. Initial data load testing is failing. Data is loaded into Cassandra from Oracle by a Java program.

I tried increasing the parameter values as shown below, but that didn't help either:

batch_size_warn_threshold_in_kb: 100
batch_size_fail_threshold_in_kb: 1000
write_request_timeout_in_ms: 30000
counter_write_request_timeout_in_ms: 30000
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
concurrent_writes: 64
concurrent_counter_writes: 64

Below is the default logging level:

[cassandra@uatrrcass05 commitlog]$ nodetool getlogginglevels
Logger Name                                        Log Level
ROOT                                                    INFO
com.thinkaurelius.thrift                               ERROR
org.apache.cassandra                                   DEBUG


11-06-2020 13:02:30.0437|ERROR|Error inserting: [/10.164.47.125:9042] Timed out waiting for server response
11-06-2020 13:54:41.0136|ERROR|Error inserting: [/10.164.47.125:9042] Timed out waiting for server response
11-06-2020 14:13:35.0536|ERROR|Error inserting: [/10.164.47.129:9042] Timed out waiting for server response
11-06-2020 14:16:09.0136|ERROR|Error inserting: [/10.164.47.123:9042] Timed out waiting for server response
11-06-2020 14:16:09.0236|ERROR|Error inserting: [/10.164.47.124:9042] Timed out waiting for server response
11-06-2020 14:30:16.0236|ERROR|Error inserting: [/10.164.47.122:9042] Timed out waiting for server response
11-06-2020 14:30:18.0136|ERROR|Error inserting: [/10.164.47.121:9042] Timed out waiting for server response
11-06-2020 14:35:56.0036|ERROR|Error inserting: [/10.164.47.128:9042] Timed out waiting for server response
11-06-2020 14:35:56.0936|ERROR|Error inserting: [/10.164.47.127:9042] Timed out waiting for server response
11-06-2020 14:40:15.0835|ERROR|Error inserting: [/10.164.47.126:9042] Timed out waiting for server response
11-06-2020 14:40:15.0835|ERROR|Too many INSERT errors ... Stopping
11-06-2020 14:40:15.0835|ERROR|There was an error.  Please check the log file for more information
11-06-2020 14:40:17.0735|ERROR|Error inserting: [/10.164.47.126:9042] Timed out waiting for server response
11-06-2020 14:40:17.0736|ERROR|Too many INSERT errors ... Stopping
11-06-2020 14:40:17.0736|ERROR|Error inserting: [/10.164.47.126:9042] Timed out waiting for server response
11-06-2020 14:40:17.0736|ERROR|Too many INSERT errors ... Stopping
11-06-2020 14:40:17.0836|ERROR|Error inserting: [/10.164.47.126:9042] Timed out waiting for server response
11-06-2020 14:40:17.0836|ERROR|Too many INSERT errors ... Stopping

1 Answer

Erick Ramirez answered ·

INSERT failures

In Cassandra, write requests from an application usually fail for one of the following reasons:

  • WriteTimeoutException - the write request timed out at the coordinator while it was waiting for N replicas to respond within write_request_timeout_in_ms (N depends on the write consistency level)
  • WriteFailureException - some or all of the replicas contacted by the coordinator replied with an error (not a timeout error) to indicate that the write failed, e.g. commitlog disk is full, disk IO queue is exhausted, etc.
  • DriverTimeoutException - the driver request timed out because it didn't get a response from the coordinator within the client-side timeout window (configured in your app)
  • AllNodesFailedException - the query failed on all the coordinators tried by the driver (for whatever reason)

All these exceptions are logged client-side, meaning the driver reports them in your application's log files. You then need to correlate the errors with the logs on the corresponding node to determine what caused them. The node's logs will show the cause of the error but not the query itself.
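
Because the failed statement is only known to the client, the loader has to log it itself. The following is a minimal sketch of that idea rather than the driver's API: `LoggedWriter` and its `executor` callback are hypothetical names, and the callback stands in for whatever call your Java program uses to run a statement. It assumes that call reports failures as unchecked exceptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: wraps the call the loader uses to run a statement
// and records the statement text whenever that call throws.
final class LoggedWriter {
    private static final Logger LOG = Logger.getLogger(LoggedWriter.class.getName());

    // Assumption: stands in for the real driver call made by the loader.
    private final Consumer<String> executor;
    private final List<String> failed = new ArrayList<>();

    LoggedWriter(Consumer<String> executor) {
        this.executor = executor;
    }

    // Runs one statement; returns true on success, false if the call threw.
    boolean execute(String cql) {
        try {
            executor.accept(cql);
            return true;
        } catch (RuntimeException e) {
            // The exception type shows which failure mode was hit;
            // the CQL text shows which statement was affected.
            LOG.log(Level.SEVERE,
                    "INSERT failed [" + e.getClass().getSimpleName() + "]: " + cql, e);
            failed.add(cql);
            return false;
        }
    }

    // Copy of the statements that failed so far, e.g. for a retry pass.
    List<String> failedStatements() {
        return new ArrayList<>(failed);
    }
}
```

With something like this in place, the application log carries both the exception and the statement, which is the information that system.log and debug.log on the nodes will not contain.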

For example, if the driver reported a write failure on node 10.1.2.3 at 13:24 in the application log, you need to review the log entries on that node around 13:24 to determine why the writes failed. It will usually be a GC pause and/or a dropped mutation message (which usually indicates that the commitlog disk is overloaded and cannot keep up with the IO load).
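
As a rough illustration of the correlation step, the sketch below groups the application's error lines by node, so you know which node's logs to open and around what time. The class name `InsertErrorIndex` is hypothetical, and the regular expression assumes the line format shown in the log excerpt in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: indexes client-side insert errors by node address.
final class InsertErrorIndex {
    // Assumed line format, taken from the excerpt in the question:
    // 11-06-2020 14:40:15.0835|ERROR|Error inserting: [/10.164.47.126:9042] Timed out ...
    private static final Pattern LINE = Pattern.compile(
        "^(\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2})\\.\\d+\\|ERROR\\|"
        + "Error inserting: \\[/([0-9.]+):\\d+\\]");

    // Returns node address -> timestamps (to the second) of the errors
    // reported against that node. Lines that do not match are ignored.
    static Map<String, List<String>> byNode(List<String> logLines) {
        Map<String, List<String>> result = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                result.computeIfAbsent(m.group(2), k -> new ArrayList<>())
                      .add(m.group(1));
            }
        }
        return result;
    }
}
```

Each key in the result is a node to check, and each timestamp marks a window to search in that node's system.log and debug.log for GC pauses or dropped mutations.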

Other items to consider

We don't recommend increasing the batch size thresholds this high:

batch_size_warn_threshold_in_kb: 100
batch_size_fail_threshold_in_kb: 1000

Batches are not an optimisation -- they don't make your queries run faster. This can be counter-intuitive, but large batch sizes can overload your cluster and make the queries run slower.

We also don't recommend increasing the write timeouts:

write_request_timeout_in_ms: 30000
counter_write_request_timeout_in_ms: 30000

The timeouts are in place to prevent long-running writes from overloading the nodes. If the writes are timing out because the disks cannot keep up with the IO, increasing the timeout threshold only overloads the nodes further since the requests stay queued up for a long time instead of getting rejected by Cassandra.

It looks like you got this statement wrong too:

Cassandra Instance type :- EC2 Instance M5.Xlarge( 8CPU 64 GB RAM)

m5.xlarge instances only come with 4 vCPUs + 16GB of RAM. There isn't an m5 instance that comes with 8 vCPUs + 64GB of RAM. Is it possible your cluster is running with r5.2xlarge EC2 instances?

As a side note, you previously indicated in question #5145 that your cluster is running with Apache Cassandra 3.11.6 but the nodetool getlogginglevels output you posted indicates your cluster is running with a version of DSE:

[cassandra@uatrrcass05 commitlog]$ nodetool getlogginglevels
Logger Name                                        Log Level
ROOT                                                    INFO
com.thinkaurelius.thrift                               ERROR
org.apache.cassandra                                   DEBUG

I only wanted to point this out because the version of Cassandra or DSE your cluster is running is relevant to answering your questions. The recommendation we give depends on the specific version you're running. Cheers!

4 comments

Sorry, there was a typo. The actual EC2 instance type is R5.2xlarge.

Below is the Apache Cassandra tarball that we downloaded from https://cassandra.apache.org/download/ and extracted into a user-defined directory, so I believe it's Apache Cassandra and not DSE.

We don't have a space problem; we have 1 TB allocated in total.

The data files, commitlog, and saved caches are all under the same mount point.

Below are the CPU and disk I/O statistics for a one-hour period on one of the Cassandra nodes, covering the time when application users complained that write operations failed with timeout errors.

We saw MUTATION messages dropped at 2:40 PM on this node, uatrrcass08.

1591935554393.png (41.2 KiB)
1591935671828.png (67.4 KiB)
1591935757478.png (42.9 KiB)
1591935891038.png (143.6 KiB)
1591936176113.png (78.8 KiB)

I'm not sure if there was a question there. But your post just confirms that the nodes can't keep up with the writes -- the local node dropped 2K mutations and replicas dropped 12K. You need to throttle back your writes so they don't overload your cluster. Cheers!
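
One common way to throttle from the client side is to cap the number of writes in flight. The sketch below does this with a semaphore. `ThrottledWriter` is a hypothetical helper, `asyncWrite` stands in for the loader's asynchronous write call, and the right value for `maxInFlight` is an assumption you would have to find by testing against your own cluster.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Hypothetical helper: limits how many writes are outstanding at once.
final class ThrottledWriter<T> {
    private final Semaphore permits;
    // Assumption: stands in for the loader's asynchronous write call.
    private final Function<T, CompletableFuture<Void>> asyncWrite;

    ThrottledWriter(int maxInFlight, Function<T, CompletableFuture<Void>> asyncWrite) {
        this.permits = new Semaphore(maxInFlight);
        this.asyncWrite = asyncWrite;
    }

    // Blocks the caller until a slot is free, then submits the write.
    CompletableFuture<Void> write(T row) {
        permits.acquireUninterruptibly();
        CompletableFuture<Void> pending;
        try {
            pending = asyncWrite.apply(row);
        } catch (RuntimeException e) {
            permits.release(); // submission itself failed, so free the slot
            throw e;
        }
        // Free the slot when the write completes, whether it succeeded or not.
        return pending.whenComplete((ok, err) -> permits.release());
    }

    // Number of writes that could be submitted right now without blocking.
    int availableSlots() {
        return permits.availablePermits();
    }
}
```

Because the producer blocks once `maxInFlight` writes are outstanding, the load rate follows what the cluster can absorb instead of queueing requests on the nodes until they time out.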

P.S. A friendly note that you shouldn't post long comments as "answers". There's a reason why comments have a 1,000-character limit. :)


Yes, the earlier post was a question.



But what was your question? It isn't clear to me based on what you posted. Cheers!
