
liuhl6 asked · liuhl6 commented

Getting ReadTimeoutException and AssertionError: attempted to serialize a 'remote' batch

I am new to DSE. I have a DSE cluster with 4 nodes (single-token architecture). One of the nodes has a problem every few dozen days, and I have no idea what these errors mean. The errors look like this:


ERROR [CoreThread-25] 2020-09-23 00:54:35,534  EpollTPCEventLoopGroup.java:935 - Task exception encountered: 
java.lang.AssertionError: attempted to serialize a 'remote' batch
at org.apache.cassandra.batchlog.Batch$BatchSerializer.serializedSize(Batch.java:120)
at org.apache.cassandra.batchlog.Batch$BatchSerializer.serializedSize(Batch.java:108)
at org.apache.cassandra.net.Request.payloadSerializedSize(Request.java:293)
at org.apache.cassandra.metrics.DroppedMessageMetrics.onMessageDropped(DroppedMessageMetrics.java:123)
at org.apache.cassandra.net.DroppedMessages.onDroppedMessage(DroppedMessages.java:106)
at org.apache.cassandra.net.MessagingService.incrementDroppedMessages(MessagingService.java:985)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:54)
at org.apache.cassandra.concurrent.TPCRunnable.run(TPCRunnable.java:68)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.process(EpollTPCEventLoopGroup.java:920)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processTasks(EpollTPCEventLoopGroup.java:895)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processTasks(EpollTPCEventLoopGroup.java:876)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processEvents(EpollTPCEventLoopGroup.java:777)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.run(EpollTPCEventLoopGroup.java:441)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)

and

ERROR [PO-thread-2] 2020-09-23 01:14:11,906  InsightsRuntimeConfigManager.java:164 - Error encountered while checking for insights configuration changes
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.ReadCallback.generateFlowOnTimeout(ReadCallback.java:226)
at org.apache.cassandra.service.ReadCallback.onTimeout(ReadCallback.java:267)
at org.apache.cassandra.net.MessagingService.lambda$null$5(MessagingService.java:247)
at org.apache.cassandra.concurrent.TPCRunnable.run(TPCRunnable.java:68)
at org.apache.cassandra.concurrent.StagedScheduler.execute(StagedScheduler.java:88)
at org.apache.cassandra.concurrent.StagedScheduler$1.execute(StagedScheduler.java:163)
at org.apache.cassandra.net.MessagingService.lambda$deliverLocally$6(MessagingService.java:244)
at org.apache.cassandra.net.MessagingService.lambda$deliverLocally$7(MessagingService.java:254)
at org.apache.cassandra.net.MessagingService.lambda$null$11(MessagingService.java:406)
at org.apache.cassandra.concurrent.TPCRunnable.run(TPCRunnable.java:68)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.process(EpollTPCEventLoopGroup.java:920)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processTasks(EpollTPCEventLoopGroup.java:895)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processTasks(EpollTPCEventLoopGroup.java:876)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processEvents(EpollTPCEventLoopGroup.java:777)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.run(EpollTPCEventLoopGroup.java:441)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)


What are these errors and how can I solve them?

performance
1 comment


Erick Ramirez ♦♦ commented ·

Which version of DSE are you running?

bettina.swynnerton answered · Erick Ramirez edited

Hello @liuhl6,

You didn't say which DSE version you are working with; it looks like DSE 6.7 or DSE 6.0.

The first error is caused by a known defect in how this node reports dropped messages. (For your reference, the Jira for this known defect is "DB-3905 Dropped Mutation Resulting in 'java.lang.AssertionError: attempted to serialize a 'remote' batch'".)

Setting that reporting defect aside, the key issue is that your node is dropping messages, which could be a sign of overloading.

The second error you mention also points to overloading: here, an internal request timed out.

org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

If this only happens every few days, check what load is running in your cluster when these errors occur.

To fully understand the state of your cluster at the time when these errors occur, a lot more information and context would be necessary, and it will be difficult to investigate this in this Q&A format. If you have a support subscription for DSE, it would be better to open a support ticket, where we can better help you understand what contributes to the overloaded nodes in your cluster.

1 comment


liuhl6 commented ·

Hello @bettina.swynnerton,

Thank you, Bettina! My DSE version is 6.7.5.

[Follow up question posted in #8548]

1601002395395.png (31.3 KiB)
Erick Ramirez answered · liuhl6 commented

The assertion in Batch$BatchSerializer.serializedSize() was inadvertently generating this error:

java.lang.AssertionError: attempted to serialize a 'remote' batch

Remote batch mutations received from remote replicas are stored in the batch log table. They only contain encoded mutations and can't be serialised by design.

When recording the dropped mutation (i.e. the failed batch log mutation), the DroppedMessageMetrics class also needs to record the payload size, but to do that it has to serialise the batch mutation, which it can't (again by design). This will be fixed in an upcoming release so the error is more meaningful.
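For context, the batch log table mentioned above can be inspected from cqlsh. This is only a minimal sketch, assuming the table name and columns from open-source Cassandra 3.x (system.batches) also apply to your DSE build:

-- the mutations column holds the encoded blobs that cannot be re-serialised,
-- which is why the metrics code trips the assertion when measuring payload size
SELECT id, version, mutations
FROM system.batches
LIMIT 10;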

To reiterate Bettina's answer, the real issue here is that your nodes are dropping mutations. It is a sign that your cluster is overloaded -- the commitlog disk cannot keep up with the writes. The ReadTimeoutException is just another symptom that indicates the node can't keep up with requests.

This isn't really related to the rack configuration so I've posted your follow up question in #8548 and will respond to it separately. Cheers!

6 comments


liuhl6 commented ·

Thank you, @Erick Ramirez.

In fact, the problem I encounter is that whenever this node is overloaded, my database cluster becomes inaccessible. I access DSE through a third-party system called Thingworx. It reported to me that the database was not accessible, but I did not know the details of the error. There are four nodes in my cluster and the other three nodes are fine. In this case, can the cluster become inaccessible? The Thingworx settings look like this:

Although the overload itself is not caused by the rack setting, could the rack setting be the reason it is always the same node that gets overloaded?

If the overload is caused by a huge volume of writes, will the node automatically return to normal when the write volume drops? Will the cluster automatically return to normal?

1601195083132.png (8.2 KiB)
1601195106413.png (20.6 KiB)
liuhl6 commented ·

Continuing from my previous reply. ^_^

Because the cluster becomes inaccessible every time the overload problem occurs, I restart the overloaded node immediately, so I don't know whether the node would return to normal on its own once the load drops. However, whenever I restart this node, the overload problem no longer occurs even though the load does not drop. Is it because the overload has to accumulate to a certain level before the problem happens?

I also found that if I simply shut down the overloaded node instead of restarting it, the cluster immediately becomes accessible again.

Erick Ramirez ♦♦ commented ·

The reason your cluster is inaccessible when the node goes down is that there are only 2 racks and your queries require a consistency of TWO. You need to fix your rack configuration as I stated in #8548 to get around this problem. Cheers!
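If you want to see the effect from cqlsh, here is a minimal sketch; my_ks and my_table are placeholders for an actual keyspace and table in your cluster, and the consistency level mirrors what the application requires:

DESCRIBE KEYSPACE my_ks;                 -- shows the NetworkTopologyStrategy replication settings
CONSISTENCY TWO;                         -- set the same consistency level the application uses
SELECT * FROM my_ks.my_table LIMIT 1;    -- if fewer than 2 replicas can respond, this fails with an unavailable/timeout error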

liuhl6 commented ·

Thank you @Erick Ramirez. It's very helpful!

liuhl6 commented ·

Hi @Erick Ramirez

Would you please check the logs for me to see whether the fault is really caused by the overload problem? I reinstalled the node that always failed and put all the nodes under rack1, but the failure reappeared more than a month later. My current cluster architecture is as follows:

1606724278254.png

10.46.137.1 is the node that often fails. In fact, I don't need a search node, but the platform I use (Thingworx) requires one, otherwise the configuration cannot be completed. Therefore, I configured Multi-Instance on one of the physical machines (3 physical machines in total), and the search node is one of the instances. This setup is admittedly strange, and I don't know whether it is the cause of the failure. However, my keyspace does not use the search node; its replication is {'class':'NetworkTopologyStrategy', 'NingYan1':3}.
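For reference, written out as a full CQL statement that replication setting would look roughly like this; my_ks is a placeholder for the actual keyspace name (run DESCRIBE KEYSPACE to see the real definition):

ALTER KEYSPACE my_ks
WITH replication = {'class': 'NetworkTopologyStrategy', 'NingYan1': 3};   -- 3 replicas in the Cassandra DC, none in the search DC

After changing replication settings like this, a repair on each node is normally required for existing data.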

After reinstalling the node, I started it at 2020-10-17 01:09:21, and it ran normally until it failed at 2020-11-29 07:19:02.

1606724278254.png (35.4 KiB)
logs.zip (586.5 KiB)
debug.zip (998.0 KiB)