kajarvine_115939 asked Erick Ramirez commented

Issues with joining node, stream failed

We have a problem joining a new node (a previously dropped one) into an Apache Cassandra cluster.

The version is 3.11.2, with 128 vnodes.

During the bootstrap the streams seem to start flowing fine.

At the end, or after some time, we get "Stream failed".

One ERROR we have seen is:

ERROR [Native-Transport-Requests-1] 2021-01-25 14:54:21,621 Message.java:629 - Unexpected exception during request; channel = [id: 0xf5c8dc6c, L:/10.51.105.67:9142 - R:/172.17.0.10:55246]
java.lang.NullPointerException: null

...67 is the node in question; 172.17.0.10 is the Docker address of a sending node.

We suspect network failures, but how do we prove that the network is at fault?

Or what kind of traces should we run?

bootstrap

1 Answer

Erick Ramirez answered Erick Ramirez commented

A common cause of stream failures is the connection between the source and receiving node getting interrupted, either because of (a) a network connectivity issue, or (b) a failure reading the SSTable being streamed on the source node.
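
To rule out (a), one quick sanity check is to confirm the nodes can actually reach each other on the inter-node storage port while the bootstrap is running, and to watch the stream progress from both sides. A minimal sketch, assuming the default storage port 7000 (7001 if internode encryption is enabled); the source IP is a placeholder:

# From the joining node, check basic reachability of a source node on the storage port
nc -vz <source_node_ip> 7000

# On either node, watch active streams and their progress while the bootstrap runs
nodetool netstats

# Dropped internode messages can also hint at network problems
nodetool tpstats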

The error message you posted has no relation to the streaming failure at all. It relates to a failed client connection between your application (running on remote server 172.17.0.10) and the coordinator node (local server 10.51.105.67).

You will need to review the logs on both the source and destination nodes to determine why the stream failed. For the record, I'm not asking you to send me the logs. I'm just giving you pointers on how to identify the cause of the issue. Cheers!
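
As a concrete starting point: each stream session is logged on both ends with a session ID (the lines typically look like "[Stream #<uuid>] ..."), so one way to narrow it down is to pull everything related to that session out of the logs. A rough sketch, assuming the default log location /var/log/cassandra/system.log on each node:

# Find stream-related messages and any errors around the failure time
grep -i 'stream' /var/log/cassandra/system.log

# Once you have the session UUID from a "[Stream #<uuid>]" line, follow just that session on both nodes
grep '<stream-session-uuid>' /var/log/cassandra/system.log

# Look for the exception that actually aborted the session
grep -B 2 -A 20 -i 'streaming error' /var/log/cassandra/system.log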

5 comments

Hello Erick, again :)
We did a lot after your exact answer: tried to investigate the TCP layers etc., and did a lot of logging.
So I will attach the log between the two instances here as well.

cass_stream_fails_log.txt

One is of course the bootstrapping node, and the other is a node streaming to it.

We have done tests with AWS instances using a similar Docker image and schema descriptions, and they bootstrap just fine.
That makes this even more disturbing: why does everything stop after all streaming has completed and it is time to update the schema version?




[part 1/2]

And one more, about the schema when joining/streaming:

We have inspected the logs in the environment where the joining works. Here are a few lines that show the node getting the correct schema version:

DEBUG [main] 2021-01-27 09:35:45,648 MigrationManager.java:614 - Gossiping my 3.11 schema version (empty)
DEBUG [RequestResponseStage-1] 2021-01-27 09:35:47,311 MigrationManager.java:116 - Immediately submitting migration task for /10.150.255.167, schema versions: local/real=(empty), local/compatible=(empty), remote=0ba43fce-642d-3ec5-a8e4-b837de32458e
DEBUG [InternalResponseStage:1] 2021-01-27 09:35:56,475 MigrationManager.java:614 - Gossiping my 3.11 schema version 0ba43fce-642d-3ec5-a8e4-b837de32458e
DEBUG [main] 2021-01-27 09:35:56,717 StorageService.java:847 - got schema: 0ba43fce-642d-3ec5-a8e4-b837de32458e

It seems like you are sidetracked. The schema version is not a problem and is not related at all to the stream failure.

You need to trace the stream in the logs based on the streaming ID and figure out why the streaming failed. Cheers!


A follow-up on what kajarvine_115939 posted. The streaming works now, but the schema version is different on the joining node, as seen in the log snippet attached in kajarvine_115939's post. The other nodes can be changed to that schema version with the resetlocalschema command. Is there any risk involved in doing so? And why would the joining node decide to create a new schema version?


[part 2/2]

INFO [main] 2021-01-27 09:35:56,717 StorageService.java:1449 - JOINING: waiting for schema information to complete
INFO [main] 2021-01-27 09:35:57,718 StorageService.java:1449 - JOINING: schema complete, ready to bootstrap
DEBUG [pool-1-thread-1] 2021-01-27 09:36:10,769 StorageProxy.java:2368 - Schemas are in agreement.

Any ideas where to look? Why does the joining node get the wrong schema version in the other environment? The logs don't really say anything about that.

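For reference, schema agreement can be checked from any node, and a node that has drifted to a different schema version can be told to re-fetch the schema. A minimal sketch with standard nodetool commands (not a recommendation for this specific cluster, just the commands involved; run them against one node at a time):

# Show the schema version(s) seen across the cluster; a healthy cluster lists a single
# version under "Schema versions"
nodetool describecluster

# On a node that disagrees, drop its local schema and rebuild it from the other nodes
nodetool resetlocalschema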