yashwanth.kondeti@verizon.com asked ·

Cassandra crashing intermittently without logging any error messages

Hi,

We have a 3-node cluster (10.179.127.67/68/69) running on AWS EC2 (RHEL). For the past few days, the 2nd and 3rd nodes have been crashing shortly after startup without writing any error messages to the system.log file. Can you please help us understand what might be the cause? Also, what can we do to make Cassandra log all error messages clearly so that the root cause is easy to identify?

In the snippet below, I started the node at around 2021-05-24 20:24. A few seconds later the node crashed again, so I had to restart it at 2021-05-24 20:34:50. No errors were logged explaining why the node crashed. I checked the logs on the other 2 nodes, but they did not log any additional messages either. How do I force Cassandra to log more details?


INFO  [GossipStage:1] 2021-05-24 20:24:24,367 Gossiper.java:1125 - Node /10.179.127.67 has restarted, now UP
INFO  [GossipStage:1] 2021-05-24 20:24:24,372 StorageService.java:2386 - Node /10.179.127.67 state jump to NORMAL
INFO  [GossipStage:1] 2021-05-24 20:24:24,378 TokenMetadata.java:497 - Updating topology for /10.179.127.67
INFO  [GossipStage:1] 2021-05-24 20:24:24,378 TokenMetadata.java:497 - Updating topology for /10.179.127.67
INFO  [GossipStage:1] 2021-05-24 20:24:24,379 Gossiper.java:1125 - Node /10.179.127.69 has restarted, now UP
INFO  [GossipStage:1] 2021-05-24 20:24:24,386 StorageService.java:2386 - Node /10.179.127.69 state jump to NORMAL
INFO  [GossipStage:1] 2021-05-24 20:24:24,391 TokenMetadata.java:497 - Updating topology for /10.179.127.69
INFO  [GossipStage:1] 2021-05-24 20:24:24,391 TokenMetadata.java:497 - Updating topology for /10.179.127.69
INFO  [HANDSHAKE-/10.179.127.69] 2021-05-24 20:24:24,450 OutboundTcpConnection.java:561 - Handshaking version with /10.179.127.69
INFO  [GossipStage:1] 2021-05-24 20:24:24,505 Gossiper.java:1089 - InetAddress /10.179.127.67 is now UP
INFO  [GossipStage:1] 2021-05-24 20:24:24,570 Gossiper.java:1089 - InetAddress /10.179.127.69 is now UP
INFO  [HANDSHAKE-/10.179.127.67] 2021-05-24 20:24:25,036 OutboundTcpConnection.java:561 - Handshaking version with /10.179.127.67
WARN  [GossipTasks:1] 2021-05-24 20:24:25,192 FailureDetector.java:278 - Not marking nodes down due to local pause of 28740715793 > 5000000000
INFO  [main] 2021-05-24 20:24:32,354 Gossiper.java:1811 - No gossip backlog; proceeding
INFO  [main] 2021-05-24 20:24:33,020 NativeTransportService.java:68 - Netty using native Epoll event loop
INFO  [main] 2021-05-24 20:24:33,115 Server.java:148 - Enabling encrypted CQL connections between client and server
INFO  [main] 2021-05-24 20:24:33,154 Server.java:158 - Using Netty Version: [netty-buffer=netty-buffer-4.0.44.Final.452812a, netty-codec=netty-codec-4.0.44.Final.452812a, netty-codec-haproxy=netty-codec-haproxy-4.0.44.Final.452812a, netty-codec-http=netty-codec-http-4.0.44.Final.452812a, netty-codec-socks=netty-codec-socks-4.0.44.Final.452812a, netty-common=netty-common-4.0.44.Final.452812a, netty-handler=netty-handler-4.0.44.Final.452812a, netty-tcnative=netty-tcnative-1.1.33.Fork26.142ecbb, netty-transport=netty-transport-4.0.44.Final.452812a, netty-transport-native-epoll=netty-transport-native-epoll-4.0.44.Final.452812a, netty-transport-rxtx=netty-transport-rxtx-4.0.44.Final.452812a, netty-transport-sctp=netty-transport-sctp-4.0.44.Final.452812a, netty-transport-udt=netty-transport-udt-4.0.44.Final.452812a]
INFO  [main] 2021-05-24 20:24:33,155 Server.java:159 - Starting listening for CQL clients on /10.179.127.68:9142 (encrypted)...
INFO  [main] 2021-05-24 20:24:33,213 CassandraDaemon.java:556 - Not starting RPC server as requested. Use JMX (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
INFO  [Native-Transport-Requests-1] 2021-05-24 20:24:33,710 AuthCache.java:177 - (Re)initializing PermissionsCache (validity period/update interval/max entries) (2000/2000/1000)
INFO  [main] 2021-05-24 20:34:50,372 YamlConfigurationLoader.java:89 - Configuration location: file:/data/cassandra/conf/cassandra.yaml

Thanks,

Yashwanth.

cassandra

1 Answer

Erick Ramirez answered ·

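With so little information, it is impossible for us to do any meaningful analysis. My suspicion is that the Linux oom-killer is terminating the Cassandra process because the server is running out of memory. A process killed this way gets no chance to write anything to system.log, which would explain the missing errors; the kernel records the kill in its own log instead.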
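One way to test the oom-killer theory is to search the kernel log on the affected nodes around the crash time. The sketch below is only an illustration: the function names are made up for this example, the two patterns cover the message wordings I am aware of (they vary between kernel versions), and the file path in the usage note is an assumption about a typical RHEL setup.

```python
import re
from typing import Iterable, List

# Message fragments the Linux kernel writes when the OOM killer fires.
# Wording differs between kernel versions, so treat this list as an
# assumption rather than an exhaustive catalogue.
OOM_PATTERNS = [
    re.compile(r"invoked oom-killer", re.IGNORECASE),
    re.compile(r"Out of memory: Kill(ed)? process \d+", re.IGNORECASE),
]


def find_oom_events(lines: Iterable[str]) -> List[str]:
    """Return the log lines that look like OOM-killer activity."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in OOM_PATTERNS)
    ]


def scan_file(path: str) -> List[str]:
    """Hypothetical helper: scan a kernel or syslog file on disk."""
    with open(path, errors="replace") as handle:
        return find_oom_events(handle)
```

Calling `scan_file("/var/log/messages")` (the usual syslog location on RHEL, but check your own setup) and comparing the timestamps of any hits with the gap between 20:24:33 and 20:34:50 in system.log would show whether the kernel killed the Java process.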

A common cause is disk_access_mode not being set correctly, so that all SSTables are mapped into memory until the server runs out of resources. I've explained why it happens in this post.
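To see what a node is currently configured with, you can check whether cassandra.yaml sets the property at all. This is a minimal sketch, assuming the setting appears as an uncommented top-level `disk_access_mode: <value>` line; the function name is hypothetical and the parsing is a plain regex rather than a full YAML parser.

```python
import re
from typing import Optional

# Matches an uncommented, top-level "disk_access_mode: value" line.
# Commented-out lines ("# disk_access_mode: ...") are deliberately ignored.
_SETTING = re.compile(r"^disk_access_mode\s*:\s*([A-Za-z_]+)", re.MULTILINE)


def get_disk_access_mode(yaml_text: str) -> Optional[str]:
    """Return the configured disk_access_mode, or None if it is not set."""
    match = _SETTING.search(yaml_text)
    return match.group(1) if match else None
```

Reading the file named in your log (`/data/cassandra/conf/cassandra.yaml`) and passing its contents to this function tells you whether the property is set explicitly. If it returns None, Cassandra falls back to its built-in default, which as far as I know memory-maps both data and index files on a 64-bit JVM; a value such as mmap_index_only restricts mapping to the index files.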

It is difficult to assist you in this Q&A forum, so my recommendation is that you log a ticket with DataStax Support, if you have a valid subscription, so one of our engineers can work with you directly. Cheers!
