Bringing together the Apache Cassandra experts from the community and DataStax.

mpatodia asked

Error while rebuilding a node, stream failed

I'm trying to add a new datacenter in Azure. All the configurations are done following this document.

While running the nodetool rebuild command, I get the exception below after around 30-40 minutes. I noticed that it happens while replicating very large files (approx. 200-300 GB).

ERROR [STREAM-OUT-/<source_ip>:7000] 2021-05-07 15:39:24,969 StreamSession.java:609 - [Stream #e66ad4d0-af44-11eb-b016-0fdd3f794cc3] Streaming error occurred on session with peer <source_ip>
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_292]
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_292]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_292]
        at sun.nio.ch.IOUtil.write(IOUtil.java:51) ~[na:1.8.0_292]
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470) ~[na:1.8.0_292]
        at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.doFlush(BufferedDataOutputStreamPlus.java:323) ~[apache-cassandra-3.11.10.jar:3.11.10]
        at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.flush(BufferedDataOutputStreamPlus.java:331) ~[apache-cassandra-3.11.10.jar:3.11.10]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:409) [apache-cassandra-3.11.10.jar:3.11.10]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:380) [apache-cassandra-3.11.10.jar:3.11.10]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_292]

ERROR [STREAM-IN-/<source_ip>:7000] 2021-05-07 15:39:30,929 StreamSession.java:609 - [Stream #e66ad4d0-af44-11eb-b016-0fdd3f794cc3] Streaming error occurred on session with peer <source_ip>
java.lang.RuntimeException: Stream receive task e66ad4d0-af44-11eb-b016-0fdd3f794cc3 of cf 57722510-658b-11eb-9058-07778e41ebc3 already finished.
        at org.apache.cassandra.streaming.StreamReceiveTask.createLifecycleNewTracker(StreamReceiveTask.java:145) ~[apache-cassandra-3.11.10.jar:3.11.10]
        at org.apache.cassandra.streaming.StreamReader.createWriter(StreamReader.java:155) ~[apache-cassandra-3.11.10.jar:3.11.10]
        at org.apache.cassandra.streaming.compress.CompressedStreamReader.read(CompressedStreamReader.java:92) ~[apache-cassandra-3.11.10.jar:3.11.10]
        at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:54) ~[apache-cassandra-3.11.10.jar:3.11.10]
        at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:43) ~[apache-cassandra-3.11.10.jar:3.11.10]
        at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:61) ~[apache-cassandra-3.11.10.jar:3.11.10]
        at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:311) ~[apache-cassandra-3.11.10.jar:3.11.10]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_292]
ERROR [STREAM-OUT-/<source_ip>:7000] 2021-05-07 15:39:30,930 StreamSession.java:609 - [Stream #e66ad4d0-af44-11eb-b016-0fdd3f794cc3] Streaming error occurred on session with peer <source_ip>
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_292]
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_292]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_292]
        at sun.nio.ch.IOUtil.write(IOUtil.java:51) ~[na:1.8.0_292]
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470) ~[na:1.8.0_292]
        at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.doFlush(BufferedDataOutputStreamPlus.java:323) ~[apache-cassandra-3.11.10.jar:3.11.10]
        at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.flush(BufferedDataOutputStreamPlus.java:331) ~[apache-cassandra-3.11.10.jar:3.11.10]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:409) [apache-cassandra-3.11.10.jar:3.11.10]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:388) [apache-cassandra-3.11.10.jar:3.11.10]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_292]

I also tried modifying the following Linux kernel values:

net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9

And in cassandra.yaml, I updated the following values:

streaming_keep_alive_period_in_secs: 10000
streaming_socket_timeout_in_ms: 86400000

But I still get the exception and am not able to stream all the data.

Does anybody have any idea how this can be fixed? Thanks in advance.

rebuild

1 Answer

Erick Ramirez answered

In my experience, the most common cause of streaming session timeouts is a large number of secondary indexes and/or materialised views being built, which takes too long to complete.

You were almost right in picking the configuration to tune, but note that streaming_socket_timeout_in_ms was removed in Cassandra 3.10 (CASSANDRA-11839) and replaced by streaming_keep_alive_period_in_secs (CASSANDRA-11841). You will need to set this on ALL existing nodes and ALL the nodes in the new DC, then perform a rolling restart of all nodes for the change to take effect.

The Linux keepalive values you've set are too high. With them, a dead TCP connection is only detected after 2 hours (7200s) plus 675 seconds (9 probes, 75s apart).

Instead, you need to configure TCP keepalive with a 60-second idle time and 3 probes, 10 seconds apart:

$ sudo sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_probes=3 net.ipv4.tcp_keepalive_intvl=10

This command sets keepalive to detect dead TCP connections after 90 seconds (60 + 10 + 10 + 10). Additional traffic is negligible so it is safe to use these settings on all nodes. Cheers!
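The arithmetic behind both sets of values can be checked with a small shell function. This is only a minimal sketch: keepalive_detect_secs is a hypothetical helper name, not part of sysctl or Cassandra, and it simply reproduces the "idle time plus probes times interval" sum used in this answer.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): time in seconds before the
# kernel declares an idle TCP connection dead.
# Arguments: tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes
keepalive_detect_secs() {
    time_s=$1
    intvl_s=$2
    probes=$3
    echo $(( $time_s + $intvl_s * $probes ))
}

keepalive_detect_secs 7200 75 9   # values from the question: prints 7875
keepalive_detect_secs 60 10 3     # values suggested above: prints 90
```

On a Linux node, the current values can be fed in with sysctl -n, for example "$(sysctl -n net.ipv4.tcp_keepalive_time)" as the first argument.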

2 comments

Thank you @Erick Ramirez. The issue got resolved.

The Linux timeout configuration changes that I made were not taking effect because some configuration in Azure was overriding them.

Actually, streaming_keep_alive_period_in_secs in cassandra.yaml defaults to 300 seconds (5 minutes), and the TCP idle timeout on the Azure machine was configured as 4 minutes. So for large SSTable files, when the transfer took longer, the Azure network interface was breaking the connection.

Perhaps I should have reduced the keep-alive setting in Cassandra instead of increasing it. Reducing streaming_keep_alive_period_in_secs to 2 minutes (120 seconds) resolved the issue. I reverted all the other changes that I had made earlier. Now the keep-alive signal is sent every 2 minutes to keep the session active.
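The rule described here is that the keep-alive period has to be shorter than the idle timeout of whatever sits between the nodes. A rough sketch of that comparison in shell follows. check_keep_alive is a hypothetical helper, and the 240-second idle timeout is the Azure value reported in this thread, not a universal constant.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) only if the keep-alive period is
# shorter than the idle timeout enforced by the network in between.
# Arguments: keep-alive period (seconds), idle timeout (seconds)
check_keep_alive() {
    if [ "$1" -lt "$2" ]; then
        echo "ok: keep-alive every ${1}s is below the ${2}s idle timeout"
    else
        echo "risk: idle connections may be dropped after ${2}s (keep-alive every ${1}s)"
        return 1
    fi
}

check_keep_alive 300 240 || true   # Cassandra default vs. the 4-minute idle timeout
check_keep_alive 120 240           # the reduced value from this thread
```

The first call reports a risk because 300 is not below 240, and the second reports ok.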


Good to hear you got it resolved. Cheers!
