Bringing together the Apache Cassandra experts from the community and DataStax.

Want to learn? Have a question? Want to share your expertise? You are in the right place!

Not sure where to begin? Getting Started

 

question

Sergio avatar image
Sergio asked Erick Ramirez edited

3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine. It is due to compaction. What can I do to help?

Hello guys!

I performed a thread dump https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMTAvMTcvLS1kdW1wLnR4dC0tMC0zMC00MA==& while try to join the node with

-Dcassandra.join_ring=false


because the node spiked in load and latency was affecting the clients.

With or without that flag the node is high in latency and I see the load sky rocketing when the number of TCP established connections increases

Analyzing the /var/log/messages I am able to read

Oct 17 00:23:39 prod-personalization-live-data-cassandra-08 cassandra: INFO [Service Thread] 2019-10-17 00:23:39,030 GCInspector.java:284 - G1 Young Generation GC in 255ms. G1 Eden Space: 361758720 -> 0; G1 Old Gen: 1855455944 -> 1781007048; G1 Survivor Space: 39845888 -> 32505856;

Oct 17 00:23:40 prod-personalization-live-data-cassandra-08 cassandra: INFO [ScheduledTasks:1] 2019-10-17 00:23:40,352 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)

Oct 17 00:23:03 prod-personalization-live-data-cassandra-08 kernel: TCP: request_sock_TCP: Possible SYN flooding on port 9042. Sending cookies. Check SNMP counters.

I don't see anything on debug.log that looks to be relevant

The machine is on aws with 4 cpu with 32GB Ram and 1 TB SSD i3.xlarge


TOP

top - 19:58:32 up 2 days, 19:59, 3 users, load average: 42.77, 39.50, 38.07

Tasks: 137 total, 1 running, 71 sleeping, 0 stopped, 0 zombie

%Cpu(s): 87.3 us, 6.3 sy, 4.8 ni, 0.0 id, 0.0 wa, 0.0 hi, 1.6 si, 0.0 st

KiB Mem : 31391772 total, 256604 free, 10187460 used, 20947708 buff/cache

KiB Swap: 0 total, 0 free, 0 used. 20017184 avail Mem


FREE

total used free shared buff/cache available

Mem: 31391772 10186776 239876 549400 20965120 20018132

Swap: 0 0 0


DEBUG.LOG

INFO [ReadStage-30] 2019-10-17 19:34:20,370 NoSpamLogger.java:91 - Maximum memory usage reached (536870912), cannot allocate chunk of 1048576

INFO [ScheduledTasks:1] 2019-10-17 19:35:19,004 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)

INFO [ScheduledTasks:1] 2019-10-17 19:40:19,223 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)

INFO [ScheduledTasks:1] 2019-10-17 19:45:19,614 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)

INFO [ReadStage-277] 2019-10-17 19:49:20,438 NoSpamLogger.java:91 - Maximum memory usage reached (536870912), cannot allocate chunk of 1048576

INFO [ScheduledTasks:1] 2019-10-17 19:50:19,834 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)


bash-4.2$ java -jar sjk.jar ttop -p 26309

Monitoring threads ...


2019-10-17T20:01:14.870+0000 Process summary

process cpu=325.34%

application cpu=321.13% (user=291.77% sys=29.36%)

other: cpu=4.21%

thread count: 476

heap allocation rate 772mb/s

[000101] user=34.76% sys= 0.45% alloc= 12mb/s - CompactionExecutor:1

[000125] user=14.90% sys= 0.73% alloc= 49mb/s - Native-Transport-Requests-1

[000269] user=12.12% sys= 0.30% alloc= 38mb/s - Native-Transport-Requests-3

[000267] user=10.92% sys= 0.34% alloc= 33mb/s - ReadStage-4

[000584] user= 9.14% sys= 0.33% alloc= 30mb/s - Native-Transport-Requests-14

[000126] user= 8.54% sys= 0.16% alloc= 25mb/s - Native-Transport-Requests-2

[000271] user= 8.34% sys= 0.26% alloc= 26mb/s - Native-Transport-Requests-6

[000525] user= 7.15% sys= 0.51% alloc= 22mb/s - Native-Transport-Requests-12

[000582] user= 7.35% sys= 0.25% alloc= 22mb/s - Native-Transport-Requests-16

[000304] user= 7.15% sys= 0.41% alloc= 22mb/s - Native-Transport-Requests-7

[000524] user= 6.55% sys= 0.30% alloc= 20mb/s - Native-Transport-Requests-13

[000268] user= 5.36% sys= 0.35% alloc= 13mb/s - Native-Transport-Requests-5

[002461] user= 3.97% sys= 1.49% alloc= 2655kb/s - RMI TCP Connection(309)-10.1.19.198

[000545] user= 4.97% sys= 0.20% alloc= 14mb/s - ReadStage-10

[000581] user= 4.97% sys= 0.19% alloc= 15mb/s - ReadStage-18

[000531] user= 4.97% sys= 0.14% alloc= 16mb/s - Native-Transport-Requests-26

[000572] user= 4.97% sys= 0.13% alloc= 16mb/s - Native-Transport-Requests-37

[000321] user= 4.57% sys= 0.37% alloc= 12mb/s - Native-Transport-Requests-8

[000573] user= 4.57% sys= 0.19% alloc= 14mb/s - Native-Transport-Requests-35

[000583] user= 4.57% sys= 0.15% alloc= 13mb/s - ReadStage-15


DSTAT


^Cbash-4.2$ dstat

You did not select any stats, using -cdngy by default.

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--

usr sys idl wai hiq siq| read writ| recv send| in out | int csw

28 1 70 0 0 0| 380k 5912k| 0 0 | 0 0 |3245 2629

90 7 1 0 0 2|1128k 917k| 803k 1010k| 0 0 | 21k 25k

92 6 0 0 0 1| 120k 0 | 836k 1048k| 0 0 | 20k 24k

90 5 4 0 0 1| 952k 0 | 488k 539k| 0 0 | 16k 15k

96 3 0 0 0 1| 128k 0 | 252k 248k| 0 0 | 12k 12k

89 8 1 0 0 2| 256k 2064k| 807k 1088k| 0 0 | 22k 30k

92 6 0 0 0 1|1164k 29k| 707k 905k| 0 0 | 17k 21k



Do you advise to increase the Heap?

cassand+ 26309 365 50.6 69878708 15914212 ? Ssl 17:24 574:49 java -Xloggc:/var/log/cassandra/gc.log -Dcassandra.max_queued_native_transport_requests=4096 -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+AlwaysPreTouch -XX:-UseBiasedLocking -XX:+UseTLAB -XX:+ResizeTLAB -XX:+UseNUMA -XX:+PerfDisableSharedMem -Djava.net.preferIPv4Stack=true -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:+UseG1GC -XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=45 -XX:G1HeapRegionSize=0 -XX:-ParallelRefProcEnabled -Xms7664M -Xmx7664M -XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler -Dcom.sun.management.jmxremote.port=7199 -Dcom.sun.management.jmxremote.rmi.port=7199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/conf/jmxremote.password -Dcom.sun.management.jmxremote.access.file=/etc/cassandra/conf/jmxremote.access -Djava.library.path=/usr/share/cassandra/lib/sigar-bin -Djava.rmi.server.hostname=10.1.19.198 -XX:+CMSClassUnloadingEnabled -javaagent:/usr/share/cassandra/lib/jmx_prometheus_javaagent-0.3.1.jar=10100:/etc/cassandra/default.conf/jmx-export.yml -Dcassandra.join_ring=false -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid -Dcassandra-foreground=yes -cp /etc/cassandra/conf:/usr/share/cassandra/lib/airline-0.6.jar:/usr/share/cassandra/lib/antlr-runtime-3.5.2.jar:/usr/share/cassandra/lib/asm-5.0.4.jar:/usr/share/cassandra/lib/caffeine-2.2.6.jar:/usr/share/cassandra/lib/cassandra-driver-core-3.0.1-shaded.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.9.jar:/usr/share/cassandra/lib/commons-lang3-3.1.jar:/usr/share/cassandra/lib/commons-math3-3.2.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.4.jar:/usr/share/cassandra/lib/concurrent-trees-2.4.0.jar:/usr/share/cassandra/lib/disruptor-3.0.1.jar:/usr/share/cassandra/lib/ecj-4.4.2.jar:/usr/share/cassandra/lib/guava-18.0.jar:/usr/share/cassandra/lib/HdrHistogram-2.1.9.jar:/usr/share/cassandra/lib/high-scale-lib-1.0.6.jar:/usr/share/cassandra/lib/hppc-0.5.4.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.13.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/cassandra/lib/jamm-0.3.0.jar:/usr/share/cassandra/lib/javax.inject.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jcl-over-slf4j-1.7.7.jar:/usr/share/cassandra/lib/jctools-core-1.2.1.jar:/usr/share/cassandra/lib/jflex-1.6.0.jar:/usr/share/cassandra/lib/jmx_prometheus_javaagent-0.3.1.jar:/usr/share/cassandra/lib/jna-4.2.2.jar:/usr/share/cassandra/lib/joda-time-2.4.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/jstackjunit-0.0.1.jar:/usr/share/cassandra/lib/libthrift-0.9.2.jar:/usr/share/cassandra/lib/log4j-over-slf4j-1.7.7.jar:/usr/share/cassandra/lib/logback-classic-1.1.3.jar:/usr/share/cassandra/lib/logback-core-1.1.3.jar:/usr/share/cassandra/lib/lz4-1.3.0.jar:/usr/share/cassandra/lib/metrics-core-3.1.5.jar:/usr/share/cassandra/lib/metrics-jvm-3.1.5.jar:/usr/share/cassandra/lib/metrics-logback-3.1.5.jar:/usr/share/cassandra/lib/netty-all-4.0.44.Final.jar:/usr/share/cassandra/lib/ohc-core-0.4.4.jar:/usr/share/cassandra/lib/ohc-core-j8-0.4.4.jar:/usr/share/cassandra/lib/reporter-config3-3.0.3.jar:/usr/share/cassandra/lib/reporter-config-base-3.0.3.jar:/usr/share/cassandra/lib/sigar-1.6.4.jar:/usr/share/cassandra/lib/slf4j-api-1.7.7.jar:/usr/share/cassandra/lib/snakeyaml-1.11.jar:/usr/share/cassandra/lib/snappy-java-1.1.1.7.jar:/usr/share/cassandra/lib/snowball-stemmer-1.3.0.581.1.jar:/usr/share/cassandra/lib/ST4-4.0.8.jar:/usr/share/cassandra/lib/stream-2.5.2.jar:/usr/share/cassandra/lib/thrift-server-0.3.7.jar:/usr/share/cassandra/apache-cassandra-3.11.4.jar:/usr/share/cassandra/apache-cassandra-thrift-3.11.4.jar:/usr/share/cassandra/stress.jar: org.apache.cassandra.service.CassandraDaemon


OS Linux 4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux


nodetool compactionstats

pending tasks: 584


nodetool netstats

Mode: STARTING

Not sending any streams.

performance
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Sergio avatar image
Sergio answered

For all the readers I solved that problem by changing the policy at the Cassandra driver level for all the clients. https://docs.datastax.com/en/drivers/java/2.2/com/datastax/driver/core/policies/LatencyAwarePolicy.html

Share
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Erick Ramirez avatar image
Erick Ramirez answered Sergio commented

@Sergio this sounds like a follow up to the same symptom you reported in this post. My response to that was the symptoms appear to be due to node not being able to cope with the traffic, particularly the disk being overloaded with IO.

You need to check the disk based on the suggestions I made. But for all intents and purposes, the ttop output shows the node is serving a lot of requests. You should also check that the clients/driver/application is configured correctly in case all requests are getting routed to this node instead of being load balanced to the rest of the cluster. Cheers!

Share
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.