How do I migrate a subset of tables (30 tables out of 110) from a single keyspace holding terabytes of data from one cluster to another?
The source and target clusters have different configurations (replication factor) and numbers of nodes.
I've modified your original question and broken it up into 2 parts:
On the source cluster, take a snapshot of the relevant keyspaces using the nodetool snapshot
command. For example:
$ nodetool snapshot <keyspace_name>
Here is an example where I take a snapshot of the community keyspace.
STEP B1 - Create a snapshot:
$ nodetool snapshot community
Requested creating snapshot(s) for [community] with snapshot name [1591083719993] and options {skipFlush=false}
Snapshot directory: 1591083719993
The directory name 1591083719993 is a Unix timestamp for when the snapshot was created and is equivalent to June 2, 2020 7:41am GMT. There is one table called users in my example keyspace, and the snapshot is located in the following directory structure:
data/
  community/
    users-6140f420a4a411ea9212efde68e7dd4b/
      snapshots/
        1591083719993/
          manifest.json
          mc-1-big-CompressionInfo.db
          mc-1-big-Data.db
          mc-1-big-Digest.crc32
          mc-1-big-Filter.db
          mc-1-big-Index.db
          mc-1-big-Statistics.db
          mc-1-big-Summary.db
          mc-1-big-TOC.txt
          schema.cql
For more info, see Taking a snapshot.
Taking a snapshot needs to be carried out on all nodes in the cluster. It is preferable to create them in parallel, which makes it simpler to identify the matching snapshot folders on each node.
To achieve this, I recommend using tools you already have in your environment. If you are already using orchestration tools like Ansible, create the snapshots in parallel by running the command on all nodes simultaneously. Similarly, you can also script the restore operation so you can execute it in parallel using Ansible.
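If you don't have orchestration tooling handy, a plain SSH loop can do the same job. Here is a minimal sketch; the host names (node1..node6) and keyspace are assumptions you should replace with your own, and the `echo` prefix makes it a dry run that only prints the commands (remove `echo` to actually execute them). Passing an explicit tag with `-t` gives every node the same snapshot folder name:

```shell
#!/bin/sh
# Take a snapshot of one keyspace on every node, in parallel over SSH.
# Host names and keyspace are placeholders -- substitute your own.
KEYSPACE=community
TAG="migration_$(date +%s)"   # one shared tag keeps the folders easy to find

for host in node1 node2 node3 node4 node5 node6; do
  # dry run: remove "echo" to really run nodetool on each host
  echo ssh "$host" nodetool snapshot -t "$TAG" "$KEYSPACE" &
done
wait
echo "snapshot tag: $TAG"
```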
If you are not using orchestration tools, consider using Cluster SSH (cssh) or Parallel SSH (pssh) so you can run commands simultaneously on all nodes in your cluster.
PREPARATION - Create the keyspace and table schema on the destination cluster. If necessary, use the schema.cql file in the snapshots folder as a guide.
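One way to recreate the schema is to dump it from the source cluster with cqlsh and replay it on the destination. The host names below are placeholders, and the `echo` prefix makes this a dry run (remove it to execute). Since your clusters have different replication factors, edit the dumped CQL before applying it:

```shell
#!/bin/sh
# Dump the keyspace schema from the source and apply it on the destination.
# "source_node" / "dest_node" are placeholders for real host names.
SCHEMA_FILE=community_schema.cql
echo "cqlsh source_node -e \"DESCRIBE KEYSPACE community\" > $SCHEMA_FILE"
# ... edit $SCHEMA_FILE here to adjust the replication factor ...
echo "cqlsh dest_node -f $SCHEMA_FILE"
```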
Once the keyspace and table schema has been created, follow the procedure below to restore the tables.
STEP 1 - Copy the snapshot to a temporary location so that the SSTable files are located in a directory named keyspace_name/table_name. For example:
$ cp -p data/community/users-6140f420a4a411ea9212efde68e7dd4b/snapshots/1591083719993/* /path/to/community/users/.
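The same restaging step can be scripted. This is a sketch using the example paths from above (data directory, table directory hash, snapshot tag, and staging location are all illustrative values you should replace):

```shell
#!/bin/sh
# Restage snapshot SSTables into a keyspace_name/table_name layout
# so sstableloader can pick them up. All paths are example values.
DATA_DIR=${DATA_DIR:-data}
KEYSPACE=community
TABLE=users
TABLE_DIR=users-6140f420a4a411ea9212efde68e7dd4b
TAG=1591083719993
STAGING=${STAGING:-/tmp/sstable_staging}

SRC="$DATA_DIR/$KEYSPACE/$TABLE_DIR/snapshots/$TAG"
mkdir -p "$STAGING/$KEYSPACE/$TABLE"
# copy only if the snapshot directory actually exists on this node
if [ -d "$SRC" ]; then
  cp -p "$SRC"/* "$STAGING/$KEYSPACE/$TABLE/"
fi
```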
STEP 2 - Load the data files to the destination cluster with the sstableloader utility as follows:
$ sstableloader -d dest_node_ip1,dest_node_ip2 /path/to/community/users/
STEP 3 - Repeat steps 1 & 2 on the next node in the source cluster until the snapshots on ALL nodes have been loaded to the destination cluster.
Repeat the steps above on each table that you want to clone to the destination cluster.
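Since you have 30 tables, the per-table load is worth scripting too. A minimal sketch, assuming each table was restaged under a common directory as in STEP 1 (the destination IPs and staging path are placeholders, and the `echo` prefix makes this a dry run):

```shell
#!/bin/sh
# Stream every staged table directory to the destination cluster.
# DEST_NODES and STAGING are placeholders for your environment.
STAGING=${STAGING:-/tmp/sstable_staging}
KEYSPACE=community
DEST_NODES="dest_node_ip1,dest_node_ip2"   # note: no space after the comma

for table_dir in "$STAGING/$KEYSPACE"/*/; do
  [ -d "$table_dir" ] || continue
  # dry run: remove "echo" to actually stream the SSTables
  echo sstableloader -d "$DEST_NODES" "$table_dir"
done
```

Loading one table at a time like this also limits the streaming pressure on the destination cluster.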
For more info, see Cassandra bulk loader. Cheers!
Hi @Erick Ramirez,
I am a novice in Cassandra and I have a question based on the scenario below:
For case 1 or case 2, you can use sstableloader.
How it works: it takes the SSTable representation of each table and uses it to load the data into another cluster.
Prerequisites: the cassandra.yaml file used by sstableloader should contain basically the same information as the cluster you are loading the data into.
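As a rough illustration of that prerequisite, here is a cassandra.yaml fragment with example values (the exact values are assumptions about your environment, not a recommendation). The important point is that the partitioner of the cluster you load into must match the one the SSTables were written with, and the client ports must be reachable:

```yaml
# Illustrative cassandra.yaml fragment -- values are examples only.
cluster_name: 'Target Cluster'
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
native_transport_port: 9042
```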
You can find information about this tool here:
https://cassandra.apache.org/doc/latest/tools/sstable/sstableloader.html
How to do your migration:
1) Take a snapshot of each table on the source cluster.
2) Move all snapshots of each table from the source to the target cluster.
3) Don't forget to create the same table structure on the target cluster.
4) Use sstableloader to restore them on the target cluster.
Thanks @dmngaya for your swift response. Further, I have a few queries -
Just assume I have a 6-node cluster in the source with RF=3, and all my table data (33 tables) is spread across all 6 nodes (ranges like: 0-10, 11-20, 21-30, 31-40, 41-50, 51-60). I will take a snapshot using nodetool, and it will create a snapshot in each table's directory (33*3=99) for all 33 tables. There will be a structure like this -
/node1/data/keyspace/table1/snp*, /node1/data/keyspace/table2/snp*, ...
/node2/data/keyspace/table1/snp*, /node2/data/keyspace/table2/snp*, ...
...
/node6/data/keyspace/table2/snp*, /node5/data/keyspace/table1/snp*, ...
How will I know which target node, and which table directory on it, I have to copy each snapshot to? Do I need to copy all (99) snapshots to each table directory on the target cluster?
Kindly explain!