Question

denis.koblov asked:

Node hanging, huge heap usage

Hey guys, I need your help.

Sometimes one of the Cassandra nodes (version 3.11.2) starts behaving strangely. In particular:

System CPU usage:

снимок-экрана-2021-03-21-в-013303.png

IO wait:

снимок-экрана-2021-03-21-в-013431.png

During this period, the node's outgoing traffic dropped:

снимок-экрана-2021-03-21-в-013719.png

There were also a lot of messages like 'Timeout while read-repairing after receiving all 6 data and digest' in the logs.

If we look at the usage of mapped memory, we see the following:

снимок-экрана-2021-03-21-в-013558.png

The spikes before the anomaly correlate with compaction activity:

снимок-экрана-2021-03-21-в-013831.png

In addition, heap usage reached almost 100% (32 GB).

снимок-экрана-2021-03-21-в-115204.png

As a result, the GC was doing a lot of work:

снимок-экрана-2021-03-21-в-134006.png

In my opinion, the GC activity is the cause of the node's unavailability, but why were the heap and mapped memory used so heavily? (Unfortunately, I wasn't able to get a heap dump.)

From 15:44:40 to 15:44:50 (when the CPU usage dropped), there are only these records in the logs:

15:44:42.499Z host DEBUG [COMMIT-LOG-ALLOCATOR] o.a.c.d.c.AbstractCommitLogSegmentManager - [] No segments in reserve; creating a fresh one
15:44:43.357Z host DEBUG [cluster2-nio-worker-0] com.datastax.driver.core.Connection - [] Connection[/<local_ip>:9042-1, inFlight=0, closed=false] was inactive for 30 seconds, sending heartbeat
15:44:43.358Z host DEBUG [cluster2-nio-worker-0] com.datastax.driver.core.Connection - [] Connection[/<local_ip>:9042-1, inFlight=0, closed=false] heartbeat query succeeded
[...] [COMMIT-LOG-ALLOCATOR] o.a.c.d.c.AbstractCommitLogSegmentManager - [] No segments in reserve; creating a fresh one
15:44:45.242Z host DEBUG [CompactionExecutor:9] o.a.c.db.compaction.CompactionTask - [] Compacted (11239242-88c9-11eb-9287-ed8c5d87e5c3) 2 sstables to [/data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3647-big,] to level=0.  6.954GiB to 5.983GiB (~86% of original) in 413,012ms.  Read Throughput = 17.241MiB/s, Write Throughput = 14.833MiB/s, Row Throughput = ~121,513/s.  14,009,430 total partitions merged to 11,084,640.  Partition merge counts were {1:8205824, 2:2901803, }
15:44:45.250Z host DEBUG [CompactionExecutor:9] o.a.c.db.compaction.CompactionTask - [] Compacting (07519311-88ca-11eb-9287-ed8c5d87e5c3) [/data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3650-big-Data.db:level=0, /data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3651-big-Data.db:level=0, /data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3649-big-Data.db:level=0, /data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3652-big-Data.db:level=0, ]
15:44:46.183Z host INFO [Service Thread] o.a.cassandra.service.GCInspector - [] G1 Young Generation GC in 222ms.  G1 Eden Space: 8606711808 -> 0; G1 Old Gen: 15180859000 -> 15167382432; G1 Survivor Space: 218103808 -> 536870912; 
15:44:47.841Z host DEBUG [COMMIT-LOG-ALLOCATOR] o.a.c.d.c.AbstractCommitLogSegmentManager - [] No segments in reserve; creating a fresh one

The compaction task 07519311-88ca-11eb-9287-ed8c5d87e5c3 completed at 15:46:36. After that, the next compaction task started, and there is no record of it finishing (it had not completed by the time of the manual node restart at 16:10).

...
15:46:35.957Z host DEBUG [Native-Transport-Requests-288] o.a.c.service.ResponseResolver - [] Timeout while read-repairing after receiving all 6 data and digest responses
15:46:35.974Z host DEBUG [Native-Transport-Requests-432] o.a.c.service.ResponseResolver - [] Timeout while read-repairing after receiving all 6 data and digest responses
15:46:36.249Z host DEBUG [CompactionExecutor:9] o.a.c.db.compaction.CompactionTask - [] Compacted (07519311-88ca-11eb-9287-ed8c5d87e5c3) 4 sstables to [/data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3653-big,] to level=0.  1.204GiB to 1.135GiB (~94% of original) in 76,400ms.  Read Throughput = 16.132MiB/s, Write Throughput = 15.208MiB/s, Row Throughput = ~108,350/s.  3,995,772 total partitions merged to 3,379,549.  Partition merge counts were {1:2948311, 2:300597, 3:76297, 4:54344, }
15:46:36.298Z host DEBUG [ReadRepairStage:741] o.a.c.service.ResponseResolver - [] Timeout while read-repairing after receiving all 1 data and digest responses
15:46:36.300Z host DEBUG [CompactionExecutor:9] o.a.c.db.compaction.CompactionTask - [] Compacting (497de1d0-88ca-11eb-9287-ed8c5d87e5c3) [/data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3653-big-Data.db:level=0, /data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3647-big-Data.db:level=0, ]
...

My compaction strategy is LeveledCompactionStrategy.

I am not sure the root cause of the issue is compaction, because in general it runs fast enough without affecting the node's operation. But then what could be causing this behavior?

Thank you in advance for your help.

performance

1 Answer

Erick Ramirez answered:

The symptoms you described are typically caused by nodes being overloaded with app traffic.

In my experience, high CPU usage is almost always caused by high GC churn, high IO waits or both.

In relation to the high mapped memory, data files are cached with mmap() to make reads faster. Typically, you only want to map the index files so as not to run out of memory on the nodes (see Why nodes run out of memory after upgrading Cassandra). But the important thing to note is that the files are cached when they are read -- the fact that the mapped memory usage is high just indicates that a lot of data files are being read by Cassandra to satisfy read requests.
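
For illustration only, and assuming the disk_access_mode setting applies to your version (it is not present in the default cassandra.yaml), mapping only the index files would look something like this:

# cassandra.yaml -- hypothetical excerpt; add the setting explicitly.
# 'auto' or 'mmap' maps both data and index files; 'mmap_index_only' maps only the index files.
disk_access_mode: mmap_index_only

The setting is only read at startup, so the node needs a restart for a change to take effect.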

If the issue you reported is only occurring on one node, then there is insufficient information in what you have provided so far to indicate the root cause. Start by looking into your application monitoring to see whether there has been a spike (or significant increase) in traffic that would account for the high load. Cheers!

10 comments

denis.koblov commented:
"whether there has been a spike (or significant increase) in traffic that would account for the high load"

No, there were no spikes in the traffic:

снимок-экрана-2021-03-22-в-115023.png

In addition, such cases happen in a milder form during low-traffic periods.

What I mean by "in a milder form":

  • the request latency to the Cassandra node increases;
  • the other nodes lose connections with the suspected node;
  • the number of hints increases.

This goes on for about 5 minutes, and after that the node returns to its normal state.


For example, mapped memory usage in such case:

снимок-экрана-2021-03-22-в-164217.png

Erick Ramirez commented:

If it's just happening on one node, I don't have enough information to know what is causing it. Sorry.

Erick Ramirez commented:

I've been thinking about this. I think the IO wait is an important clue here. If it's just happening on one node, is it possible that there's a problem with the disk?

denis.koblov commented:

No, sorry for the confusion. This happens on one node at a time; at other times, it is also observed on other nodes.

We have an idea: we noticed that the JVM mapped memory growth correlates with the internal MemtableReclaimMemory pool. At the time of the incident, there are no completed tasks in this pool, the number of active tasks stays at 1, and the number of pending tasks keeps growing.

There is the following code in cassandra-all.jar:

private void reclaim(final Memtable memtable)
{
    final OpOrder.Barrier readBarrier = readOrdering.newBarrier();
    readBarrier.issue();
    postFlushTask.addListener(new WrappedRunnable()
    {
        public void runMayThrow()
        {
            readBarrier.await();
            memtable.setDiscarded();
        }
    }, reclaimExecutor);
}

We think that readBarrier.await() does not return and is waiting for something. As a result, memtables are not flushed, and we observed an increase in JVM mapped memory (not system mapped memory) and then an increase in heap usage.
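
To sanity-check that theory, here is a minimal standalone sketch (our own test code, not Cassandra's; it assumes the OpOrder API from 3.11: start()/close() on the read side and newBarrier()/issue()/await() on the reclaim side). The barrier only passes once every read group started before it has been closed, so a read that never finishes would keep the MemtableReclaimMemory task stuck in readBarrier.await():

import org.apache.cassandra.utils.concurrent.OpOrder;

public class OpOrderBarrierSketch
{
    public static void main(String[] args) throws Exception
    {
        OpOrder readOrdering = new OpOrder();

        // Simulates a read that started before the barrier and has not finished yet.
        OpOrder.Group inFlightRead = readOrdering.start();

        // What reclaim() does: issue a barrier and wait for all earlier reads to finish.
        OpOrder.Barrier readBarrier = readOrdering.newBarrier();
        readBarrier.issue();

        Thread reclaimer = new Thread(() -> {
            readBarrier.await(); // blocks until every group started before the barrier is closed
            System.out.println("barrier passed -- memtable could be discarded now");
        }, "fake-MemtableReclaimMemory");
        reclaimer.start();

        Thread.sleep(1000);
        System.out.println("reclaimer still waiting while the read is open: " + reclaimer.isAlive());

        inFlightRead.close(); // the "read" finishes, so await() can return
        reclaimer.join();
    }
}

If that is what is happening on our node, a thread dump taken during the incident should show the MemtableReclaimMemory thread waiting inside readBarrier.await().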
