
vkrot asked · Erick Ramirez answered

How can I convert DSE 6.8 SSTables to CSV?

Hi all,

Is there a supported way to convert snapshot SSTables to CSV? The SSTables are taken from a DSE 6.8 snapshot.

We have a huge table (40 TB of data on a 14-node cluster) and need to export it to CSV atomically, i.e. as a point-in-time export.

Using dsbulk is not an option:

- it takes ages to export 40 TB of data

- the export is not consistent; we need a point-in-time export, with the data exactly as it was when the export started.

We'd like to use something like https://github.com/jberragan/spark-cassandra-bulkreader to read the snapshot files, but it doesn't work with DSE 6.8 'bti' SSTables.

Any suggestions on how to accomplish this?

dse

steve.lacerda answered · Erick Ramirez edited

One option would be to use Spark to read the data from Cassandra and write it out as CSV:

dataframeObj.write.csv("path")

Spark doesn't provide atomicity, but once you read the rows from Cassandra into the dataframe, you can count that dataframe and then match that count against the rows in the CSV.
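For concreteness, here is a minimal sketch of that approach, assuming the spark-cassandra-connector is on the classpath; the keyspace/table names (my_ks, my_table), the contact point, and the output path are all placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-to-csv")
  .config("spark.cassandra.connection.host", "127.0.0.1") // placeholder contact point
  .getOrCreate()

// Read the table through the connector's data source
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .load()

val rowCount = df.count()       // row count to reconcile against the CSV
df.write.csv("/path/to/output") // writes one CSV part file per Spark partition

Keep in mind that count() and the CSV write are separate Spark actions, so each triggers its own scan of the table unless the dataframe is persisted first.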

Alternatively, there are third-party solutions like Datalake that provide that atomicity.


@steve.lacerda, that doesn't work. A dataframe is lazy: any computation on it reads chunks of data from Cassandra iteratively, so separate actions see the table at different times.
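To illustrate the point: persisting the dataframe collapses the two reads into one, but the initial scan is still not atomic. A sketch, assuming the df from the answer above:

import org.apache.spark.storage.StorageLevel

// Materialize the scan once so that count() and the write see the same rows.
// Caveat: this spills the whole table to executor local disk, which is
// likely impractical at 40 TB, and the scan itself is still not a snapshot.
val pinned = df.persist(StorageLevel.DISK_ONLY)
val n = pinned.count()                // forces the single read from Cassandra
pinned.write.csv("/path/to/output")   // served from the persisted copy
pinned.unpersist()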

I did try exporting the table to CSV and the output data had gaps: rows were added or removed while the export was running. It took 5 days using 4 executor nodes to export the 40 TB table.

Datalake can be used only after you put the data into it, and you need a point-in-time export to be done first.

I'd prefer to use Spark to read from the SSTables directly, but DataStax writes SSTables that cannot be read by existing Spark libraries. I'd be happy to modify the standard Cassandra table exporter, but it looks like the DataStax SSTable format is not compatible with the open-source Cassandra codebase. Or am I wrong?

Which DSE version are you using? If it's the 6.x line, then there is no comparable SSTable format in OSS. However, if you're using 5.x, then the OSS SSTable formats should be the same.
Erick Ramirez answered

There isn't an out-of-the-box solution for this use case. You wouldn't be able to use any open-source tools because the SSTable format for DSE clusters is proprietary.

You can work around it by using the sstabledowngrade tool in DSE 6.8 to convert the SSTables to a format that is compatible with open-source Cassandra.

If you have any follow-up questions, please log a ticket with DataStax Support so one of our engineers can assist you directly. Cheers!
