Mohamed asked

DSBulk unloading 1TB of data from Kubernetes DSE cluster fails


I am using DSBulk to unload data to CSV from a DSE cluster running on Kubernetes. My cluster consists of 9 Kubernetes pods, each with 120 GB of RAM.

I monitored resources while unloading and observed that the more data is written to CSV, the more RAM is used, until the pods restart for lack of memory.

If one pod is down at a time, the DSBulk unload does not fail, but if 2 pods are down, the unload fails with the exception:

Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded).

Is there a way to avoid exceeding the memory limit like this, or a way to increase the timeout duration?

The command I am using is:

dsbulk unload -maxErrors -1 -h '["< My Host >"]' -port 9042 -u < My user name > -p < Password > -k < Key Space > -t < My Table > -url < My Table > --dsbulk.executor.continuousPaging.enabled false 1000 --dsbulk.engine.maxConcurrentQueries 128 --driver.advanced.retry-policy.max-retries 100000

Mohamed commented ·

After a lot of trial and error, we found that the problem was the Cassandra pods in Kubernetes sizing Max Direct Memory Size from the main server's memory rather than from the pod's maximum assigned RAM.

The pods were assigned 120 GB of RAM, but Cassandra on each pod was assigning 185 GB of RAM to file_cache_size. This made the unload fail, because Kubernetes restarts any pod that uses more than 120 GB of RAM.

The reason is that Max Direct Memory Size is calculated as:

Max direct memory = (system memory - JVM heap size) / 2

Each pod was using 325 GB as Max Direct Memory Size, and each pod's file_cache_size is automatically set to half of the Max Direct Memory Size value. So whenever a pod requests more than 120 GB of memory, Kubernetes restarts it.
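The sizing arithmetic described above can be sketched as follows. This is only an illustration: the 120 GB pod limit comes from the post, while `HEAP_GB` and `HOST_GB` are hypothetical values chosen for the example, and the function names are made up here. It uses integer GB math and is not how Cassandra itself computes the values internally.

```shell
#!/bin/sh
# Sketch of the sizing formula from the post, in whole GB.
# HEAP_GB and HOST_GB are hypothetical; POD_LIMIT_GB is the limit from the post.
HEAP_GB=32
HOST_GB=512
POD_LIMIT_GB=120

# Max direct memory = (system memory - JVM heap size) / 2
max_direct_gb() {
    echo $(( ($1 - HEAP_GB) / 2 ))
}

# Per the post, file_cache_size defaults to half of max direct memory.
file_cache_gb() {
    echo $(( $(max_direct_gb "$1") / 2 ))
}

# Compare sizing from the host's memory with sizing from the pod's limit.
echo "from host: direct=$(max_direct_gb "$HOST_GB") GB, file cache=$(file_cache_gb "$HOST_GB") GB"
echo "from pod:  direct=$(max_direct_gb "$POD_LIMIT_GB") GB, file cache=$(file_cache_gb "$POD_LIMIT_GB") GB"
```

With these assumed numbers, sizing from the host gives a file cache as large as the whole pod limit before the heap is even counted, while sizing from the pod limit stays well under it.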

The solution was to set Max Direct Memory Size as an env variable in the Kubernetes cluster's YAML file with a default value, or to override it by setting the file_cache_size value in each pod's Cassandra YAML file.


1 Answer

Erick Ramirez answered

The high memory usage is expected because DSBulk is issuing lots of concurrent read requests and it is overloading your cluster.

Add this option to your DSBulk command to reduce the maximum number of concurrent queries that DSBulk can execute in parallel:

    --engine.maxConcurrentQueries 5

These options further limit the request rate and the number of in-flight requests:

    --executor.maxPerSecond 5
    --executor.maxInFlight 5

For more information, see the DSBulk Engine and DSBulk Executor options pages. Cheers!
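Put together, the three throttling options might be combined as in the dry-run sketch below. It only prints the command rather than running it. The keyspace, table, and output directory are hypothetical placeholders, connection and authentication options are left out, and the limits of 5 are just the values suggested above, to be raised gradually while watching pod memory.

```shell
#!/bin/sh
# Dry-run sketch: prints a throttled unload command instead of executing it.
# KEYSPACE, TABLE and OUT_DIR are hypothetical placeholders.
KEYSPACE="my_keyspace"
TABLE="my_table"
OUT_DIR="/tmp/unload"

# Starting limits taken from the answer; tune upward as the cluster allows.
MAX_QUERIES=5
MAX_PER_SECOND=5
MAX_IN_FLIGHT=5

build_unload_cmd() {
    echo "dsbulk unload -k $KEYSPACE -t $TABLE -url $OUT_DIR" \
         "--engine.maxConcurrentQueries $MAX_QUERIES" \
         "--executor.maxPerSecond $MAX_PER_SECOND" \
         "--executor.maxInFlight $MAX_IN_FLIGHT"
}

build_unload_cmd
```

Keeping the limits in variables makes it easy to rerun with higher values once memory usage is stable.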


Mohamed commented ·

Yes Erick, I have tried using fewer queries, but for some reason memory usage keeps growing and the pods restart.

I have also tried a single concurrent query, with the same result. What I think is happening is that DSBulk loads data into memory, writes it to disk, and then loads more data into memory without first freeing the previous data.

Is there a way to free old data that has already been written to disk? Please correct me if I am wrong about the process.

Erick Ramirez ♦♦ commented ·
Your analysis isn't quite right. I'd suggest you first try all the options I pointed out. Cheers!
Mohamed commented ·
Hi Erick, I tried the options you suggested. It is reading 5 rows per second, which will take a very long time to unload the approximately 3 billion rows that I have.