mahmoud.hamdy.ali asked

How do I resolve repeated ReadTimeoutException?


I have a 3-node cluster with very low traffic, but the ReadTimeoutException keeps repeating, even after hours:

ERROR [ReadRepairStage:759] 2021-12-02 11:18:11,057 - Exception in thread Thread[ReadRepairStage:759,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses.
 at org.apache.cassandra.service.DataResolver$RepairMergeListener.close( ~[apache-cassandra-3.11.11.jar:3.11.11]
 at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close( ~[apache-cassandra-3.11.11.jar:3.11.11]
 at org.apache.cassandra.db.transform.BaseIterator.close( ~[apache-cassandra-3.11.11.jar:3.11.11]
 at org.apache.cassandra.service.DataResolver.compareResponses( ~[apache-cassandra-3.11.11.jar:3.11.11]
 at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow( ~[apache-cassandra-3.11.11.jar:3.11.11]
 at ~[apache-cassandra-3.11.11.jar:3.11.11]
 at java.util.concurrent.ThreadPoolExecutor.runWorker( ~[na:1.8.0_262]
 at java.util.concurrent.ThreadPoolExecutor$ ~[na:1.8.0_262]
 at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0( ~[apache-cassandra-3.11.11.jar:3.11.11]
 at ~[na:1.8.0_262]

I increased read_request_timeout_in_ms and write_request_timeout_in_ms, but the issue is still happening.

When these exceptions increase in the cluster, my application stops working properly, and when I restart Cassandra the old data is gone.

How can I investigate further to solve the issue?



1 Answer

Erick Ramirez answered

The error indicates that the timeouts are occurring when a coordinator node has triggered a read-repair. Specifically for the error you posted, the coordinator is expecting a response from multiple replicas but only received a response from one.

The underlying issue here is that your nodes are likely overloaded at some point, leading to dropped mutations. When the application performs a read request, the replicas are so out of sync that the coordinator needs to trigger a read-repair. You cannot resolve this issue by increasing the timeouts -- you're just delaying the inevitable.
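One way to check for dropped mutations is to look at the "Message type / Dropped" section that `nodetool tpstats` prints on each node. The snippet below is only a rough sketch of how that section could be parsed. The function name `parse_dropped` and the `SAMPLE` text are made up for illustration (the numbers are not from this cluster), and the exact column layout can vary between Cassandra versions.

```python
def parse_dropped(tpstats_output: str) -> dict:
    """Return {message_type: dropped_count} from `nodetool tpstats` text.

    Assumes the dropped-message section starts with a header line that
    begins with "Message type" and that the second column is the count.
    """
    dropped = {}
    in_dropped_section = False
    for line in tpstats_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Message type"):
            # Everything after this header is one message type per line.
            in_dropped_section = True
            continue
        fields = stripped.split()
        if in_dropped_section and len(fields) >= 2 and fields[1].isdigit():
            dropped[fields[0]] = int(fields[1])
    return dropped


# Illustrative text only; not real output from the cluster in question.
SAMPLE = """
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0           1200         0                 0
MutationStage                     0         0           3400         0                 0

Message type           Dropped
READ                         0
MUTATION                    42
READ_REPAIR                  3
"""

# Keep only the message types that were actually dropped.
nonzero = {name: count for name, count in parse_dropped(SAMPLE).items() if count > 0}
print(nonzero)
```

On a real node you would feed it the text captured from running `nodetool tpstats`. Non-zero and growing MUTATION counts would be consistent with the overload described above.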

You need to review the capacity of your cluster and ensure that you don't overload the nodes. Otherwise, the only solution is to add more nodes. Cheers!


mahmoud.hamdy.ali commented:

Hello Erick,

Thanks so much for your reply

Do you think reducing the replication factor could reduce the load on the cluster?

Or could it be due to a lack of resources on the machine itself?

Erick Ramirez commented (in reply to mahmoud.hamdy.ali):
No, it won't. The solution is in the last paragraph of my answer. Cheers!