question

Sergio asked:

Cassandra node does not work even after restart, throwing TTransportException: java.net.SocketException

Problem:

WARN [Thread-83069] 2019-10-11 16:13:23,713 CustomTThreadPoolServer.java:125 - Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Socket closed
at org.apache.cassandra.thrift.TCustomServerSocket.acceptImpl(TCustomServerSocket.java:109) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.thrift.TCustomServerSocket.acceptImpl(TCustomServerSocket.java:36) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:60) ~[libthrift-0.9.2.jar:0.9.2]
at org.apache.cassandra.thrift.CustomTThreadPoolServer.serve(CustomTThreadPoolServer.java:113) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.run(ThriftServer.java:134) [apache-cassandra-3.11.4.jar:3.11.4]

The CPU Load goes to 50 and it becomes unresponsive.

Node configuration:
OS: Linux 4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

This is a working node that does not have the recommended settings, but it is working and is one of the first nodes in the cluster:
cat /proc/23935/limits
Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            8388608      unlimited    bytes
Max core file size        0            unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             122422       122422       processes
Max open files            65536        65536        files
Max locked memory         65536        65536        bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       122422       122422       signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us
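For reference, a minimal sketch of how to pull just the relevant limits out of a running Cassandra JVM (the `pgrep` pattern "CassandraDaemon" is an assumption; adjust it to whatever your process list shows):

```shell
# Inspect the effective limits of the running Cassandra JVM.
# The pgrep pattern "CassandraDaemon" is an assumption; adjust it to
# match your process list. Falls back to the current shell's PID so
# the sketch runs anywhere.
PID=$(pgrep -f CassandraDaemon | head -n1)
[ -z "$PID" ] && PID=$$

# These two limits are the usual suspects for "Too many open files"
# and "unable to create new native thread" failures.
grep -E 'Max (open files|processes)' "/proc/$PID/limits"
```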


I tried to bootstrap a new node to join the existing cluster.
The disk space used is around 400 GB out of 885 GB available on SSD.

At my first attempt, the node failed and was restarted over and over by systemd, which does not honor the limits configuration specified, and it threw:

Caused by: java.nio.file.FileSystemException: /mnt/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/md-52-big-Index.db: Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[na:1.8.0_161]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[na:1.8.0_161]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[na:1.8.0_161]
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) ~[na:1.8.0_161]
at java.nio.channels.FileChannel.open(FileChannel.java:287) ~[na:1.8.0_161]
at java.nio.channels.FileChannel.open(FileChannel.java:335) ~[na:1.8.0_161]
at org.apache.cassandra.io.util.SequentialWriter.openChannel(SequentialWriter.java:104) ~[apache-cassandra-3.11.4.jar:3.11.4]
... 20 common frames omitted

I fixed the above by stopping Cassandra; cleaning the commitlog, saved_caches, hints, and data directories; restarting it; and then getting the PID and running the two commands below, because at the beginning the node didn't even join the cluster and was reported as UJ (Up/Joining):

sudo prlimit -n1048576 -p <JAVA_PID>
sudo prlimit -u32768 -p <CASSANDRA_PID>
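Note that prlimit only changes the limits of the already-running process; they revert on the next restart. Since systemd services ignore /etc/security/limits.conf, a persistent fix is a drop-in override on the unit itself. A hedged sketch (the unit name, path, and values are assumptions, not settings taken from this cluster):

```ini
# /etc/systemd/system/cassandra.service.d/limits.conf  (assumed unit name/path)
# systemd services do not read /etc/security/limits.conf, so raise the
# limits on the unit itself. Values below mirror the prlimit calls above;
# verify them against the recommended settings for your version.
[Service]
LimitNOFILE=1048576
LimitNPROC=32768
```

After adding the file, run `systemctl daemon-reload`, restart Cassandra, and confirm with `cat /proc/<PID>/limits`.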

After fixing the max open files problem, the node's status went from UJ (Up/Joining) to UN (Up/Normal).
The node joined the cluster, but after a while it started to throw:

WARN [Thread-83069] 2019-10-11 16:13:23,713 CustomTThreadPoolServer.java:125 - Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Socket closed
at org.apache.cassandra.thrift.TCustomServerSocket.acceptImpl(TCustomServerSocket.java:109) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.thrift.TCustomServerSocket.acceptImpl(TCustomServerSocket.java:36) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:60) ~[libthrift-0.9.2.jar:0.9.2]
at org.apache.cassandra.thrift.CustomTThreadPoolServer.serve(CustomTThreadPoolServer.java:113) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.run(ThriftServer.java:134) [apache-cassandra-3.11.4.jar:3.11.4]


I compared cassandra.yaml and limits.conf, but it doesn't seem to help. I don't know how the current nodes are working, since they don't have the recommended Cassandra limits.
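One way to rule out config drift between a healthy node and the problem node is to diff the non-comment lines of cassandra.yaml on both. A minimal sketch (the hostnames and config path are placeholders, not values from this cluster):

```shell
# Diff the effective (non-comment, non-blank) cassandra.yaml lines
# between a healthy node and the problem node. Hostnames and path
# are placeholders; requires bash for process substitution.
GOOD=healthy-node.example.com
BAD=problem-node.example.com
CFG=/etc/cassandra/conf/cassandra.yaml

diff <(ssh "$GOOD" "grep -vE '^[[:space:]]*(#|$)' $CFG") \
     <(ssh "$BAD"  "grep -vE '^[[:space:]]*(#|$)' $CFG")
```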

Any suggestions on the possible culprit?

Please let me know.

Thanks


1 Answer

Erick Ramirez answered:

@Sergio the symptoms you described and this kind of warning message indicate that the node is likely overloaded:

WARN [Thread-83069] 2019-10-11 16:13:23,713 CustomTThreadPoolServer.java:125 - Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Socket closed

You mentioned that CPU load went up to 50. Check whether the node is IO-bound; if so, that would explain the high CPU load. I would also check that the hardware on the problematic node is identical to the other nodes and that there are no issues with the disk (check the syslogs for clues). Cheers!
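To put a rough number on "IO-bound", here is a sketch that samples iowait from /proc/stat (Linux-only; an illustration added for this writeup, not part of the original answer):

```shell
# Rough iowait check: sample the aggregate CPU line of /proc/stat twice
# and report the share of CPU time spent waiting on IO over one second.
# Fields: cpu user nice system idle iowait ...
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat

total=$(( (u2 + n2 + s2 + i2 + w2) - (u1 + n1 + s1 + i1 + w1) ))
echo "iowait: $(( 100 * (w2 - w1) / total ))%"
```

A consistently high percentage (or high `%iowait`/`await` in `iostat -x 1` from the sysstat package) points at the disk rather than the CPU itself.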
