Bringing together the Apache Cassandra experts from the community and DataStax.


anson asked:

How can I speed up sstableloader?

Hi,

I have a 5-node cluster that currently holds 100k records. I took a snapshot backup of these nodes and want to restore the data to a single-node cluster using sstableloader, but the restore is slow. I am using Python's subprocess module to call the sstableloader command.

I need a way to make the restore faster, for example by using Python multiprocessing. The idea is to restore from the 5 nodes to the target cluster in parallel using sstableloader.

Or is there some other way to speed it up?

Tags: restore, sstableloader

nikhilsk99_141387 answered:

If you have just 100k records, it's not a huge volume of data, so why not use dsbulk to export and pipe the output to another dsbulk instance for import?

dsbulk unload -k ks -t fills -u cassandra -p cassandra -h <Cluster_Source_IP> | dsbulk load -k ks -t fills_by_medication -u cassandra -p cassandra -h <Cluster_Dest_IP>


I can only use sstableloader. There is no provision for me to use anything other than sstableloader at the moment.

Is there any chance to restore from these nodes in parallel?



What I was thinking is whether we can assign each node's data to its own CPU core, making it a multi-process restore. Let's say we are restoring from a 3-node cluster: is it possible to hand the restore of each node's data to a separate process so the overall restore is faster?



Thanks for being part of the community. We really appreciate your contributions so please keep it up!

On the subject of DSBulk, it doesn't make sense to export the data to CSV when the SSTables already exist; that would add unnecessary steps and overhead to the process.

DSBulk is really intended for importing data from other sources such as relational databases. Cheers!

Erick Ramirez answered:

If there's only one node in the destination cluster, you can simplify the procedure with the nodetool import command.

For each source node, copy the SSTables to a directory on the destination node then run:

$ nodetool import ks_name table_name -d /path/to/sstables

You will need to do this for each of the tables from each of the source nodes. This is a much quicker way of cloning the application data to a single-node cluster.
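If you want to script those per-table imports, here is a minimal Python sketch. The directory layout under /var/restore and the node/table names are assumptions for illustration, not part of the original answer:

```python
import shutil
import subprocess


def build_import_cmd(keyspace: str, table: str, sstable_dir: str) -> list[str]:
    """Build the nodetool import command for one table's SSTables."""
    return ["nodetool", "import", keyspace, table, sstable_dir]


# Hypothetical layout: SSTables from each source node copied under
# /var/restore/<source_node>/<table> on the destination node.
KEYSPACE = "ks"
SOURCE_NODES = ["node1", "node2", "node3"]
TABLES = ["fills", "fills_by_medication"]

if shutil.which("nodetool"):  # only run where the Cassandra tools exist
    for node in SOURCE_NODES:
        for table in TABLES:
            cmd = build_import_cmd(KEYSPACE, table, f"/var/restore/{node}/{table}")
            subprocess.run(cmd, check=True)
```

Note that the loop is sequential: for a single destination node that is usually fine, since the imports are local file operations rather than streaming.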

WARNING: Make sure you only restore application keyspaces. Do NOT restore system tables to another cluster. Cheers!

[UPDATE] Given the updated requirements, you can maximise throughput by running multiple instances of sstableloader, preferably on different servers so the instances are not competing for the same resources, particularly disk I/O bandwidth.
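One way to drive this from Python, as the original question suggested, is to start one sstableloader process per source-node snapshot directory. A minimal sketch, where the snapshot paths and destination address are placeholders and extra flags (credentials, throttling) are omitted:

```python
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def build_loader_cmd(snapshot_dir: str, dest_host: str) -> list[str]:
    """Build the sstableloader command for one source node's snapshot."""
    return ["sstableloader", "-d", dest_host, snapshot_dir]


def restore_in_parallel(snapshot_dirs, dest_host, max_workers=3):
    """Run one sstableloader per snapshot directory concurrently.

    Threads are sufficient here: each worker only waits on a subprocess,
    and the real work happens in the external sstableloader processes.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(subprocess.run, build_loader_cmd(d, dest_host), check=True): d
            for d in snapshot_dirs
        }
        for fut in as_completed(futures):
            fut.result()  # re-raise if any loader failed
            print(f"finished {futures[fut]}")


# Hypothetical snapshot paths, one per source node:
DIRS = [f"/backups/node{i}/ks/fills" for i in range(1, 6)]

if shutil.which("sstableloader"):  # only run where the tool is installed
    restore_in_parallel(DIRS, dest_host="10.0.0.10")
```

Keep max_workers modest: all instances stream into the same destination node, so too many concurrent loaders will just contend for its disk and network bandwidth.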


Hi,

The restore can be from any larger cluster to any smaller one, e.g. from 4 nodes to 2, or from 5 nodes to 3 (a generalised approach).

And I can only use sstableloader. There is no provision for me to use anything other than sstableloader at the moment.

Is there any chance to restore from these nodes in parallel?




This is a completely different requirement to what you originally posted, so it is technically a different question altogether.

I've updated my answer to reflect the different scenario you just posted. Cheers!
