
Why is the bulk-loaded data size much larger compared to the source cluster?

Hi all,

I have a requirement to migrate a keyspace (the employees keyspace, which contains a single table) in production from one existing source cluster to another existing target cluster. Here are the details:

Source Cluster -- Number of nodes 6 (1 DC, Replication Factor - 3) -- has a total of 7 keyspaces (including employees keyspace)

Target Cluster -- Number of nodes 6 (1 DC, Replication Factor - 3) -- has a total of 4 keyspaces (not including employees keyspace)

Since the source and target clusters are already hosting other keyspaces, we are not allowed to make any changes to the cassandra.yaml configuration file. We are only allowed to copy/migrate the entire employees keyspace (which has only 1 table) from source to target without modifying the existing cluster configuration. Also, the clusters are not identical and have different token range assignments, so I followed an article describing these steps to migrate the table's data:

1. Took a snapshot on node1 of the source cluster and copied the snapshot files into the /path/to/&lt;keyspace&gt;/&lt;table&gt; directory structure on node1 of the target cluster.

2. Ran sstableloader on node1 of the target cluster.

3. Copied the snapshot from node2 of the source cluster to node1 of the target and ran sstableloader again.

4. Repeated the same process for the data from every remaining source node, always running sstableloader from node1 of the target.
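The per-node loop above can be sketched as follows. This is only an illustration: the staging paths, the table name (`emp_table`), and the target host are placeholders, not your actual layout. One detail worth knowing is that sstableloader infers the keyspace and table from the last two components of the directory path, so each source node's SSTables must be staged under a `.../employees/<table>/` directory.

```python
import shlex

def build_loader_cmd(target_host, sstable_dir):
    """Build the sstableloader command for one source node's snapshot.

    sstable_dir must end in <keyspace>/<table> -- sstableloader infers
    the keyspace and table names from the last two path components.
    """
    return [
        "sstableloader",
        "-d", target_host,  # any reachable node in the target cluster
        sstable_dir,
    ]

# Hypothetical layout: one staging directory per source node's snapshot.
source_nodes = ["node1", "node2", "node3", "node4", "node5", "node6"]
cmds = [
    build_loader_cmd(
        "target-node1.example.com",
        f"/staging/{src}/employees/emp_table",
    )
    for src in source_nodes
]

for cmd in cmds:
    print(shlex.join(cmd))
```

Because RF=3 on the source, each partition appears in three source nodes' snapshots, so this loop streams every row roughly three times; the duplicates sit in separate SSTables on the target until compaction merges them.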

The problem is that on the source, the table size on each node is approximately 33–35 GB, but on the target the table has grown to 60–85 GB on each node.

1. Why has the table size increased considerably on all the nodes of the target?

2. Do I need to run nodetool repair or a compaction to bring it back down to the 33–35 GB range seen on the source?

3. How do I validate that all the data is copied properly and there are no discrepancies between source and target tables?

I appreciate your inputs.



1 Answer

Erick Ramirez answered

It isn't readily obvious to me what could be the reason for the size discrepancy based on the limited information you provided.

The obvious things I would check first are the compaction strategies and compression settings in use, but that would just be a guess. Without comparing and analysing the diagnostic data from both the source and destination clusters, it is impossible to know whether what you perceive as a size discrepancy is even relevant.
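As a quick first pass on that check, you could pull the table options from each cluster (for example via `DESCRIBE TABLE` in cqlsh, or the `compaction` and `compression` columns of `system_schema.tables`) and diff them. A minimal sketch, where the option values shown are purely illustrative examples rather than your real settings:

```python
def diff_table_options(source_opts, target_opts):
    """Return {option: (source_value, target_value)} for options that differ."""
    keys = set(source_opts) | set(target_opts)
    return {
        k: (source_opts.get(k), target_opts.get(k))
        for k in keys
        if source_opts.get(k) != target_opts.get(k)
    }

# Example values only -- read the real ones from system_schema.tables
# (or DESCRIBE TABLE output) on each cluster.
source = {
    "compaction": "SizeTieredCompactionStrategy",
    "compression": "LZ4Compressor",
    "chunk_length_in_kb": 64,
}
target = {
    "compaction": "SizeTieredCompactionStrategy",
    "compression": "none",  # disabled compression alone could inflate on-disk size
    "chunk_length_in_kb": 64,
}

print(diff_table_options(source, target))
```

Any non-empty diff (particularly in compression) is a candidate explanation for an on-disk size difference worth investigating further.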

My suggestion is that you log a ticket with DataStax Support so one of our engineers can request the relevant diagnostic data to be able to assist you. Cheers!


Hi @Erick Ramirez,

I will check on the cluster settings and the differences in compression/compaction strategies.

Can you also help me with the question below:

Once I am done with sstableloader utility, how do I validate that all the source data is successfully loaded to target cluster? What is the sure way of validation after migrating a keyspace?

I tried to do a count of all records on the table in the source and target clusters, but it timed out because of the huge table size. Are there any other options for doing the validation?
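One common workaround for the timeout is to split the full token range into subranges and run a `COUNT(*)` per subrange on each cluster, then compare the per-range totals. The sketch below assumes the default Murmur3Partitioner token range and uses hypothetical table/column names (`emp_table`, `emp_id`):

```python
MIN_TOKEN = -2**63       # Murmur3Partitioner token range
MAX_TOKEN = 2**63 - 1

def token_subranges(n):
    """Split the full Murmur3 token range into n contiguous subranges."""
    span = (MAX_TOKEN - MIN_TOKEN) + 1
    starts = [MIN_TOKEN + span * i // n for i in range(n)]
    ends = [s - 1 for s in starts[1:]] + [MAX_TOKEN]
    return list(zip(starts, ends))

def count_queries(keyspace, table, pk, n):
    """Generate per-subrange COUNT queries; each scans only a slice of the ring."""
    return [
        f"SELECT COUNT(*) FROM {keyspace}.{table} "
        f"WHERE token({pk}) >= {lo} AND token({pk}) <= {hi};"
        for lo, hi in token_subranges(n)
    ]

# emp_table / emp_id are placeholders for your real table and partition key.
queries = count_queries("employees", "emp_table", "emp_id", 4)
for q in queries:
    print(q)
```

In practice you would use far more than 4 subranges so each query finishes well within the timeout; the DataStax Bulk Loader's `dsbulk count` command automates essentially this approach, if you are able to use it. Matching totals give reasonable confidence, though they do not prove the row contents are identical.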
