question

anson asked · nikhilsk99_141387 commented

How can I speed up sstableloader?

Hi,

I have a 5-node cluster which currently holds 100k records. I took a backup snapshot of these nodes and want to restore the data to a single-node cluster using sstableloader, but the restore is taking some time. I am using Python's subprocess module to call the sstableloader command.
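This is roughly what I am calling at the moment; the target address, keyspace and snapshot paths below are just placeholders for illustration:

import subprocess

# placeholder snapshot directories -- one keyspace/table directory per source node
snapshot_dirs = [
    "/backups/node1/my_ks/my_table",
    "/backups/node2/my_ks/my_table",
    "/backups/node3/my_ks/my_table",
    "/backups/node4/my_ks/my_table",
    "/backups/node5/my_ks/my_table",
]

for sstable_dir in snapshot_dirs:
    # sstableloader streams the SSTables in the directory to the target cluster;
    # -d takes a contact point of the destination cluster (placeholder IP)
    subprocess.run(["sstableloader", "-d", "10.0.0.10", sstable_dir], check=True)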

I need a way to make the restore faster, for example by using Python multiprocessing. The idea is to restore from the 5 nodes to the target cluster in parallel using sstableloader.

Or is there some other way to speed it up?

restore · sstableloader
4 comments

nikhilsk99_141387 commented ·

If you have just 100k records, it's not a huge volume of data. Why not use dsbulk to export and pipe the output to another dsbulk for import?

dsbulk unload -k ks -t fills -u cassandra -p cassandra -h <Cluster_Source_IP> | dsbulk load -k ks -t fills_by_medication -u cassandra -p cassandra -h <Cluster_Dest_IP>

anson nikhilsk99_141387 commented ·

I can only use sstableloader. There is no provision for me to use anything other than sstableloader at the moment.

Is there any chance to restore from these nodes in parallel?


anson anson commented ·

What I was thinking is whether we can give each node's data to its own core, making it a multiprocess restoration. Let's say we are trying to restore from a 3-node cluster: is it possible to give the restoration of each node's data to a separate process/core so as to make the overall restoration faster?


Erick Ramirez ♦♦ nikhilsk99_141387 commented ·

Thanks for being part of the community. We really appreciate your contributions so please keep it up!

On the subject of DSBulk, it doesn't make sense to export the data to CSV when the SSTables already exist. This will add unnecessary steps/overhead to the process.

DSBulk is really intended for importing data from other sources such as relational databases. Cheers!


1 Answer

Erick Ramirez answered · Erick Ramirez commented

If there's only one node in the destination cluster, you can simplify the procedure with the nodetool import command.

For each source node, copy the SSTables to a directory on the destination node then run:

$ nodetool import ks_name table_name /path/to/sstables

You will need to do this for each of the tables from each of the source nodes. This is a much quicker way of cloning the application data to a single-node cluster.

WARNING: Make sure you only restore application keyspaces. Do NOT restore system tables to another cluster. Cheers!

[UPDATE] Given the updated requirements, you can maximise throughput by running multiple instances of sstableloader, preferably on different servers so the instances are not competing for the same resources, particularly disk IO bandwidth.
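As a rough sketch only (the contact point, keyspace and snapshot directories below are placeholders you would need to substitute), this is one way to launch an sstableloader instance per source node's snapshot from Python:

import subprocess
from concurrent.futures import ThreadPoolExecutor

TARGET_HOST = "10.0.0.10"  # placeholder contact point in the destination cluster
SNAPSHOT_DIRS = [          # placeholder keyspace/table snapshot directory per source node
    "/backups/node1/my_ks/my_table",
    "/backups/node2/my_ks/my_table",
    "/backups/node3/my_ks/my_table",
]

def load(sstable_dir):
    # each call streams one node's SSTables to the destination cluster
    return subprocess.run(["sstableloader", "-d", TARGET_HOST, sstable_dir], check=True)

# threads are sufficient here because the real work happens in the child sstableloader processes;
# note that running all instances from one machine means they share disk and network bandwidth
with ThreadPoolExecutor(max_workers=len(SNAPSHOT_DIRS)) as pool:
    list(pool.map(load, SNAPSHOT_DIRS))

Whether this actually helps depends on the destination node keeping up with the incoming streams, so it's worth testing with two instances before scaling out. Cheers!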

2 comments

anson commented ·

Hi,

The restore can be from any larger cluster to any smaller one, e.g. from 4 nodes to 2, or from 5 nodes to 3, etc. (a generalised approach).

And I can only use sstableloader. There is no provision for me to use anything other than sstableloader at the moment.

Is there any chance to restore from these nodes in parallel?



Erick Ramirez ♦♦ anson commented ·

This is a completely different requirement to what you originally posted so it is technically a different question altogether.

I've updated my answer to reflect the different scenario you just posted. Cheers!
