michael.guissine_30999 asked Erick Ramirez edited

Nodes going down with AssertionError after upgrading to DSE 6.7.3

After upgrading DSE from 5.1.14 to 6.7.3, we are seeing nodes go down with many errors like the one below in the logs. Any thoughts?

ERROR [CoreThread-0] 2019-06-25 10:15:30,519  VerbHandlers.java:77 - Unexpected error during execution of request READS.SINGLE_READ (99097935): /10.16.6.6 -> /10.16.6.4
java.lang.AssertionError: Expected valid buffer or boundary crossed
    at org.apache.cassandra.utils.flow.Flow$ReduceSubscriber.onError(Flow.java:1221)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlatMap.onError(FlatMap.java:133)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlatMap$FlatMapChild.onError(FlatMap.java:185)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlatMap$FlatMapChild.onError(FlatMap.java:185)
    at org.apache.cassandra.io.sstable.format.AsyncPartitionReader$PartitionReader.onError(AsyncPartitionReader.java:365)

[UPDATE] Thank you @Erick Ramirez. That was (I believe) the full stack trace; however, we are also seeing the errors you mentioned (`o.a.c.utils.memory.buffers.TemporaryBufferPool`):

ERROR [CompactionExecutor:1557] 2019-06-27 10:02:45,022  CassandraDaemon.java:126 - Exception in thread Thread[CompactionExecutor:1557,5,main]
java.lang.AssertionError: Slab should have been unreferenced and all buffers returned before recycling
    at org.apache.cassandra.utils.memory.buffers.MemorySlabWithBumpPtr.recycle(MemorySlabWithBumpPtr.java:174)
    at org.apache.cassandra.utils.memory.buffers.TemporaryBufferPool.newSlab(TemporaryBufferPool.java:321)
    at org.apache.cassandra.utils.memory.buffers.TemporaryBufferPool.switchSharedSlab(TemporaryBufferPool.java:232)
    at org.apache.cassandra.utils.memory.buffers.TemporaryBufferPool.allocateFromShared(TemporaryBufferPool.java:197)
    at org.apache.cassandra.utils.memory.buffers.TemporaryBufferPool.allocate(TemporaryBufferPool.java:128)
    at org.apache.cassandra.io.util.ChunkReader.readScattered(ChunkReader.java:103)
    at org.apache.cassandra.cache.ChunkCacheImpl$MultiBufferChunk.asyncLoad(ChunkCacheImpl.java:215)
    at org.apache.cassandra.cache.ChunkCacheImpl.asyncLoad(ChunkCacheImpl.java:357)
    at org.apache.cassandra.cache.ChunkCacheImpl.asyncLoad(ChunkCacheImpl.java:56)
    at com.github.benmanes.caffeine.cache.LocalAsyncLoadingCache.lambda$get$2(LocalAsyncLoadingCache.java:129)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2039)
    at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2037)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2020)
    at com.github.benmanes.caffeine.cache.LocalAsyncLoadingCache.get(LocalAsyncLoadingCache.java:128)

as well as `OutOfMemoryError` errors (on a different cluster):

ERROR [CoreThread-0] 2019-06-27 14:06:15,140  VerbHandlers.java:77 - Unexpected error during execution of request READS.SINGLE_READ (330386123): /10.17.6.5 -> /10.17.6.5
java.lang.OutOfMemoryError: Direct buffer memory
    at org.apache.cassandra.utils.flow.Flow$ReduceSubscriber.onError(Flow.java:1221)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlatMap.onError(FlatMap.java:133)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlowTransformBase.onError(FlowTransformBase.java:38)
    at org.apache.cassandra.utils.flow.FlatMap.onError(FlatMap.java:133)
    at org.apache.cassandra.utils.flow.FlatMap$FlatMapChild.onError(FlatMap.java:185)
    at org.apache.cassandra.utils.flow.FlatMap$FlatMapChild.onError(FlatMap.java:185)
    at org.apache.cassandra.io.sstable.format.AsyncPartitionReader$PartitionReader.onError(AsyncPartitionReader.java:365)

dse

1 Answer

Erick Ramirez answered Erick Ramirez edited

@michael.guissine_30999 the stack trace looks incomplete, but if it were complete, I'm pretty sure it would include `o.a.c.utils.memory.buffers.TemporaryBufferPool`. If it does, that confirms you are hitting a known issue in DSE 6.7.3 (ticket ID DB-3172). We aim to include the fix in the next release of DSE (no ETA yet).
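
To check a log for the signature described above, something like the following could be used. It is only a rough sketch: the function name `scan_log` is hypothetical, the message strings are copied from the traces posted in the question, and matching on them is a heuristic rather than an official DataStax diagnostic. Per the answer, it is the `TemporaryBufferPool` frames that point to DB-3172.

```python
# Assertion messages quoted in the question; treating them as indicators
# is an assumption made for this sketch.
SIGNATURES = (
    "Slab should have been unreferenced and all buffers returned before recycling",
    "Expected valid buffer or boundary crossed",
)

# Fully qualified class name that the answer expects to see in the stack trace.
POOL_FRAME = "org.apache.cassandra.utils.memory.buffers.TemporaryBufferPool"


def scan_log(lines):
    """Count lines containing each assertion message or a TemporaryBufferPool frame."""
    counts = {sig: 0 for sig in SIGNATURES}
    counts[POOL_FRAME] = 0
    for line in lines:
        for key in counts:
            if key in line:
                counts[key] += 1
    return counts
```

An open file object can be passed directly, for example `scan_log(open(path, errors="replace"))`, where `path` is wherever your node writes its `system.log`. A non-zero count for the `TemporaryBufferPool` key suggests the node is worth checking against the KB article.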

You can work around the issue by temporarily downgrading the binaries to DSE 6.7.2; this won't have any impact on the data. I've written about the issue in detail in this KB article: "Compaction fails with CorruptSSTableException, AssertionError recycling a memory buffer". Cheers!

[UPDATE] The solution is to upgrade to DSE 6.7.4 (or newer) where DB-3172/DB-3174 were fixed.

8 comments

Beck commented ·

I hit this issue too.

It may be fixed in 6.7.4.


Reference: the DataStax Enterprise 6.7 release notes:

Resolved issues:
AssertionError in temporary buffer pool causes CorruptSSTableException. (DB-3172, DB-3174)
Erick Ramirez ♦♦ Beck commented ·

You're correct @Beck. DB-3172 which I wrote about in the article above was fixed in DSE 6.7.4. When I wrote the article, 6.7.4 was not released yet. Cheers!

michael.guissine_30999 commented ·

[Update reposted in original question]

Erick Ramirez ♦♦ commented ·

@michael.guissine_30999 DSE 6.7.4 is out now so you should be able to do a simple binary upgrade. Cheers!

michael.guissine_30999 Erick Ramirez ♦♦ commented ·

Thank you @Erick Ramirez, we managed to resolve the issue by downgrading to 6.7.2, disabling asynchronous I/O (`-Ddse.io.aio.enabled=false`), and limiting the file cache size (`file_cache_size_in_mb: 1024`). Next, we will try upgrading to 6.7.4.

Erick Ramirez ♦♦ michael.guissine_30999 commented ·

For future reference, disabling AIO is not recommended since it will significantly affect the performance of your cluster. It is suggested only in a very limited set of edge cases, by a DataStax expert after exhaustive investigation. Cheers!
