Bringing together the Apache Cassandra experts from the community and DataStax.


liuhl6 asked:

Getting ReadTimeoutException and AssertionError: attempted to serialize a 'remote' batch

I am new to DSE. I have a DSE cluster with 4 nodes (single-token architecture). One of the nodes runs into a problem every few dozen days, and I don't understand the errors it produces. They look like this:


ERROR [CoreThread-25] 2020-09-23 00:54:35,534  EpollTPCEventLoopGroup.java:935 - Task exception encountered: 
java.lang.AssertionError: attempted to serialize a 'remote' batch
at org.apache.cassandra.batchlog.Batch$BatchSerializer.serializedSize(Batch.java:120)
at org.apache.cassandra.batchlog.Batch$BatchSerializer.serializedSize(Batch.java:108)
at org.apache.cassandra.net.Request.payloadSerializedSize(Request.java:293)
at org.apache.cassandra.metrics.DroppedMessageMetrics.onMessageDropped(DroppedMessageMetrics.java:123)
at org.apache.cassandra.net.DroppedMessages.onDroppedMessage(DroppedMessages.java:106)
at org.apache.cassandra.net.MessagingService.incrementDroppedMessages(MessagingService.java:985)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:54)
at org.apache.cassandra.concurrent.TPCRunnable.run(TPCRunnable.java:68)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.process(EpollTPCEventLoopGroup.java:920)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processTasks(EpollTPCEventLoopGroup.java:895)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processTasks(EpollTPCEventLoopGroup.java:876)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processEvents(EpollTPCEventLoopGroup.java:777)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.run(EpollTPCEventLoopGroup.java:441)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)

and

ERROR [PO-thread-2] 2020-09-23 01:14:11,906  InsightsRuntimeConfigManager.java:164 - Error encountered while checking for insights configuration changes
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.ReadCallback.generateFlowOnTimeout(ReadCallback.java:226)
at org.apache.cassandra.service.ReadCallback.onTimeout(ReadCallback.java:267)
at org.apache.cassandra.net.MessagingService.lambda$null$5(MessagingService.java:247)
at org.apache.cassandra.concurrent.TPCRunnable.run(TPCRunnable.java:68)
at org.apache.cassandra.concurrent.StagedScheduler.execute(StagedScheduler.java:88)
at org.apache.cassandra.concurrent.StagedScheduler$1.execute(StagedScheduler.java:163)
at org.apache.cassandra.net.MessagingService.lambda$deliverLocally$6(MessagingService.java:244)
at org.apache.cassandra.net.MessagingService.lambda$deliverLocally$7(MessagingService.java:254)
at org.apache.cassandra.net.MessagingService.lambda$null$11(MessagingService.java:406)
at org.apache.cassandra.concurrent.TPCRunnable.run(TPCRunnable.java:68)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.process(EpollTPCEventLoopGroup.java:920)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processTasks(EpollTPCEventLoopGroup.java:895)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processTasks(EpollTPCEventLoopGroup.java:876)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processEvents(EpollTPCEventLoopGroup.java:777)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.run(EpollTPCEventLoopGroup.java:441)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)


What are these errors, and how can I fix them?

performance

Which version of DSE are you running?

Erick Ramirez answered:

The assertion in Batch$BatchSerializer.serializedSize() was inadvertently generating this error:

java.lang.AssertionError: attempted to serialize a 'remote' batch

Remote batch mutations received from remote replicas are stored in the batch log table. They only contain encoded mutations and can't be serialised by design.

When recording the dropped mutation (i.e. the failed batch log mutation), the DroppedMessageMetrics class also needs to record the payload size, but to do that it needs to serialise the batch mutation, which it can't (again, by design). This will be fixed in an upcoming release so that the error is more meaningful.

To reiterate Bettina's answer, the real issue here is that your nodes are dropping mutations. It is a sign that your cluster is overloaded -- the commitlog disk cannot keep up with the writes. The ReadTimeoutException is just another symptom that indicates the node can't keep up with requests.

This isn't really related to the rack configuration, so I've posted your follow-up question in #8548 and will respond to it separately. Cheers!
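To confirm a node is dropping mutations, you can look at the "Message type / Dropped" table that `nodetool tpstats` prints. Below is a minimal sketch, assuming the table layout used by Cassandra 3.x / DSE 6.x; the `sample` text is a hypothetical trimmed excerpt, not output from this cluster.

```python
def parse_dropped(tpstats_output):
    """Parse the 'Message type / Dropped' table at the end of
    `nodetool tpstats` output into a dict of counts."""
    dropped = {}
    in_table = False
    for line in tpstats_output.splitlines():
        if line.startswith("Message type"):
            in_table = True
            continue
        parts = line.split()
        if in_table and len(parts) == 2 and parts[1].isdigit():
            dropped[parts[0]] = int(parts[1])
    return dropped

# Hypothetical sample; run `nodetool tpstats` on the affected
# node to get the live numbers.
sample = """\
Message type           Dropped
READ                         0
MUTATION                   123
"""
overloaded = {k: v for k, v in parse_dropped(sample).items() if v > 0}
print(overloaded)  # non-zero MUTATION drops point at write overload
```

A steadily growing MUTATION count between runs is the signal to watch for; isolated drops during a burst are less concerning.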


Thank you, @Erick Ramirez.

In fact, the problem I encountered is that whenever this node is overloaded, my database cluster becomes inaccessible. I access DSE through a third-party system called Thingworx. It reported that the database was not accessible, but I did not know the details of the error. There are four nodes in my cluster, and the other three nodes are fine. In this case, can the whole cluster become inaccessible? The Thingworx settings look like this:

Although the overload itself is not caused by the rack setting, could the fact that it is always the same node that gets overloaded be caused by the rack setting?

If the overload is caused by a huge volume of writes, will the node automatically return to normal when the write volume drops? Will the cluster automatically return to normal?

1601195083132.png (8.2 KiB)
1601195106413.png (20.6 KiB)

Continuing from my previous reply. ^_^

Because the cluster becomes inaccessible every time an overload problem occurs, I restart the overloaded node immediately, so I don't know whether the node would return to normal on its own when the load drops. However, whenever I restart this node, the overload problem no longer occurs, even though the load does not drop. Is it because the overload has to accumulate to a certain level before the problem happens?

I also found that if I simply shut down the overloaded node instead of restarting it, the cluster immediately became accessible again, too.


The reason your cluster is inaccessible when the node goes down is that there are only 2 racks and your queries require a consistency level of TWO. You need to fix your rack configuration as I stated in #8548 to get around this problem. Cheers!
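The availability arithmetic behind this can be sketched in a few lines. This is a simplified model I'm using for illustration, not Cassandra's actual replica selection logic (which also depends on the replication strategy and, for LOCAL_* levels, the datacenter):

```python
def consistency_met(responses_required, replicas_alive):
    """Simplified availability check: a request at a given consistency
    level succeeds only if at least that many replica nodes are up."""
    return replicas_alive >= responses_required

# CL=TWO needs 2 replica responses. If the rack layout effectively
# places one replica per rack across 2 racks, losing a single node
# can make some queries unavailable even though 3 of 4 nodes are up.
assert consistency_met(2, replicas_alive=2)      # both replicas up
assert not consistency_met(2, replicas_alive=1)  # one replica down
```

This is why shutting down or restarting the overloaded node "fixes" access only once the node rejoins: while a replica is down or unresponsive, any query needing both replicas times out.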

liuhl6 replied to Erick Ramirez:

Thank you @Erick Ramirez. It's very helpful !

bettina.swynnerton answered:

Hello @liuhl6,

You didn't say which DSE version you are working with; it looks like DSE 6.7 or DSE 6.0.

The first error type is a known defect on how this node is reporting dropped messages. (For your reference, the Jira for this known defect is "DB-3905 Dropped Mutation Resulting in 'java.lang.AssertionError: attempted to serialize a 'remote' batch'").

Regardless of this reporting error, the key issue here is that your node is reporting dropped messages, which could be a sign of overloading.

The second type of error that you mention also points at overloading. Here an internal request resulted in a timeout:

org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

If this only happens every few days, check what load is running in your cluster when these errors occur.
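One quick way to correlate the errors with load is to scan the node's system.log for the periodic dropped-message summaries around those timestamps. A minimal sketch follows; the `sample` lines are hypothetical, and the exact wording of these INFO lines varies between Cassandra/DSE versions:

```python
import re

def dropped_message_lines(log_lines):
    """Return log lines that report dropped messages. Matches the
    typical 'MUTATION messages were dropped' summary lines; adjust
    the pattern if your DSE version words them differently."""
    pattern = re.compile(r"messages were dropped", re.IGNORECASE)
    return [line for line in log_lines if pattern.search(line)]

# Hypothetical sample lines; in practice, read them from the node's
# system.log (commonly /var/log/cassandra/system.log).
sample = [
    "INFO  [ScheduledTasks:1] 2020-09-23 00:54:40,001 MessagingService.java - "
    "MUTATION messages were dropped in last 5000 ms: 12 internal and 3 cross node",
    "INFO  [CompactionExecutor:2] 2020-09-23 00:55:00,100 CompactionTask.java - "
    "Compacted 4 sstables",
]
print(len(dropped_message_lines(sample)))
```

If the dropped-message lines cluster around the same time of day every incident, that points at a scheduled job or batch load as the trigger.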

To fully understand the state of your cluster at the time when these errors occur, a lot more information and context would be necessary, and it will be difficult to investigate this in this Q&A format. If you have a support subscription for DSE, it would be better to open a support ticket, where we can better help you understand what contributes to the overloaded nodes in your cluster.


Hello @bettina.swynnerton,

Thank you, Bettina! My DSE version is 6.7.5.

[Follow up question posted in #8548]

1601002395395.png (31.3 KiB)