question

Tri asked azim_91_184236 commented

On which node(s) should sstableloader be executed?

The documentation for sstableloader is unclear about WHERE this utility should be executed:

  • On the source node (which has the data) OR the target node (which receives the data)?

  • In both cases, does this imply ALL nodes in a cluster?

Q1. For example, the source cluster has 10 nodes (5 keyspaces, each keyspace has 20 tables) and the target cluster has 15 nodes. On which nodes should we run sstableloader?

Q2. Is it possible to use nodetool rebuild to achieve the same result as sstableloader?

sstableloader cloning
1 comment

Tri commented ·

In this post: https://community.datastax.com/questions/4477/how-to-migrate-a-subset-of-tables30-tables-out-of.html

There are two answers which contradict each other. One suggests running sstableloader on the source nodes; the other suggests running it on the target nodes.


1 Answer

Erick Ramirez answered azim_91_184236 commented

You can run the sstableloader utility on any node you want, but there are several factors you need to take into consideration:

  1. If you are cloning production data, we do not recommend running sstableloader on production nodes, as this will affect the performance of the cluster. It uses CPU, network and IO bandwidth and competes with production traffic for resources.
  2. If you decide to run sstableloader on nodes in the target cluster, be aware that it will compete with those nodes for the same CPU and IO resources. This isn't necessarily a problem if throughput is not a concern.
  3. Where possible, copy the snapshots to servers which have sufficient CPU, network and IO bandwidth. The servers where sstableloader runs do not need to have Cassandra running on them, but they do need to (a) have C* installed, and (b) be configured the same as the target cluster.
  4. Run sstableloader on more than one server to increase throughput: more loader instances mean a faster load. Make sure to divide the files among the loading servers so you are not loading the same files multiple times.
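
As a rough illustration of point 4, here is a minimal Python sketch of how the snapshot directories could be divided so that no two loading servers stream the same files. The helper names (split_among_loaders, loader_command), the host names and the paths are all hypothetical and made up for this example. The -d and --conf-path options are sstableloader's own, and the directory passed to it is assumed to end in <keyspace>/<table>, but check everything against your Cassandra version before relying on it.

```python
from itertools import cycle
from typing import Dict, List, Optional


def split_among_loaders(table_dirs: List[str], loader_hosts: List[str]) -> Dict[str, List[str]]:
    """Assign every snapshot directory to exactly one loader host (round-robin)."""
    if not loader_hosts:
        raise ValueError("at least one loader host is required")
    plan: Dict[str, List[str]] = {host: [] for host in loader_hosts}
    # De-duplicate and sort first so the plan is stable and nothing is loaded twice.
    for table_dir, host in zip(sorted(set(table_dirs)), cycle(loader_hosts)):
        plan[host].append(table_dir)
    return plan


def loader_command(table_dir: str, target_nodes: List[str],
                   conf_path: Optional[str] = None) -> List[str]:
    """Build (but do not run) an sstableloader command line for one directory."""
    cmd = ["sstableloader", "-d", ",".join(target_nodes)]
    if conf_path:
        # Assumption: this is a cassandra.yaml matching the target cluster.
        cmd += ["--conf-path", conf_path]
    cmd.append(table_dir)
    return cmd


if __name__ == "__main__":
    # Hypothetical inputs: three copied snapshot directories and two loader servers.
    dirs = ["/staging/ks1/table_a", "/staging/ks1/table_b", "/staging/ks2/table_a"]
    plan = split_among_loaders(dirs, ["loader-1", "loader-2"])
    for host, assigned in plan.items():
        for table_dir in assigned:
            print(host, " ".join(loader_command(table_dir, ["10.0.0.1", "10.0.0.2"])))
```

Round-robin is only the simplest way to split the work. If the tables differ a lot in size, balancing by bytes on disk would probably spread the load more evenly.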

Finally, the rebuild command isn't relevant here since it is designed for adding nodes in a new DC. Cheers!

5 comments

Tri commented ·

1) we do not recommend running sstableloader on production nodes

Can you please suggest better alternatives?

2) If you decide to run sstableloader on nodes in the target cluster, be aware that it will compete for the same CPU and IO resources.

Does this imply that running sstableloader on the nodes in the source cluster would be "less bad"?

3) Where possible, copy the snapshots to servers which have sufficient CPU, network and IO bandwidth. The servers where sstableloader gets run does not need to have Cassandra running on them

Where does this machine come from? Is it a powerful machine placed on the same network as the target cluster, just for the purpose of running sstableloader, which we can shut down when the task is done?

Erick Ramirez ♦♦ Tri commented ·
  1. The alternatives are in the rest of my answer. :)
  2. I don't think you read my answer in its entirety. If you read it again, you'll see that it's a guide to help you make a decision based on the pros and cons.
  3. Yes.
Tri Erick Ramirez ♦♦ commented ·

Oh cool, I get it now. By "alternative" I thought there might be a non-sstableloader way of achieving the same goal.

azim_91_184236 commented ·

@Erick Ramirez, I need some clarification on #3, that is, running sstableloader on servers that belong to neither the source nor the target cluster. Could you please clarify what you mean by "configured the same as the target cluster"? Are you referring to identical configurations in the cassandra.yaml file?

When we use the --conf-path option to point to a cassandra.yaml, is that config file supposed to be from the target cluster (receiving the data)?
