bharadhwajt asked

Spark SQL to export table data

Hi,

Currently we have the following table:

CREATE TABLE test.progress_pp (
    user_id text,
    program_id text,
    asset_id text,
    completed_at timestamp,
    furthest_position float,
    furthest_position_asset_id text,
    furthest_position_updated_at timestamp,
    is_ota boolean,
    migrated_at timestamp,
    percentage_watched float,
    solr_query text,
    updated_at timestamp,
    PRIMARY KEY (user_id, program_id)
) WITH CLUSTERING ORDER BY (program_id ASC)


The table has a TTL of 6 years.



We are looking to clean up the old data based on the updated_at column by exporting only the last 6 months of data into another table and asking the application team to point to the new table.
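For illustration, the "last 6 months" boundary can be computed rather than typed by hand. This is only a minimal sketch: the helper names `six_month_cutoff` and `cutoff_literal` are hypothetical, and snapping the cutoff to the first day of the month in UTC is an assumption, not something stated in the post.

```python
from datetime import datetime, timezone


def six_month_cutoff(now: datetime, months: int = 6) -> datetime:
    """Return midnight UTC on the first day of the month `months` before `now`.

    Snapping to the start of the month is an assumption made for simplicity.
    """
    now = now.astimezone(timezone.utc)
    # Work in "months since year 0" so crossing a year boundary is handled.
    total = now.year * 12 + (now.month - 1) - months
    year, month0 = divmod(total, 12)
    return datetime(year, month0 + 1, 1, tzinfo=timezone.utc)


def cutoff_literal(cutoff: datetime) -> str:
    """Format the cutoff in the same style as the timestamp literal in the query."""
    return cutoff.strftime("%Y-%m-%d %H:%M:%S%z")
```

Called with a date in July 2021, this yields the same `2021-01-01 00:00:00+0000` literal that appears in the query.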


I am trying to run Spark SQL to fetch data based on a non-primary-key column, as below:

spark-sql> select * from test.progress_pp where updated_at > '2021-01-01 00:00:00+0000';


Is there a way to export the result of the above Spark SQL query to CSV and import the data into another table?
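One possible shape for this, sketched with the Spark DataFrame API instead of the spark-sql shell: read the source table through the Spark Cassandra Connector, apply the same `updated_at` predicate, optionally write a CSV copy, and append the rows to a second table. The function names, the target table name `progress_pp_recent`, and the `csv_path` parameter are hypothetical. The sketch assumes the target table already exists with a matching schema and that `spark` is a session with the connector available, as it would be on a Spark-enabled node.

```python
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def recent_rows_filter(cutoff_literal):
    """Build the same predicate as the spark-sql query in the question."""
    return f"updated_at > '{cutoff_literal}'"


def copy_recent_rows(spark, cutoff_literal, keyspace="test",
                     source_table="progress_pp",
                     target_table="progress_pp_recent",  # hypothetical name
                     csv_path=None):
    """Copy rows newer than the cutoff into target_table, optionally via CSV too."""
    # Read the source table through the connector and keep only recent rows.
    # updated_at is not part of the primary key, so this scans the whole table.
    recent = (
        spark.read.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=source_table)
        .load()
        .filter(recent_rows_filter(cutoff_literal))
    )

    # Optional CSV export; csv_path is a directory that Spark fills with part files.
    if csv_path is not None:
        recent.write.option("header", "true").mode("overwrite").csv(csv_path)

    # Append the filtered rows to the (pre-created) target table.
    (
        recent.write.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=target_table)
        .mode("append")
        .save()
    )
    return recent
```

Writing directly from one table to the other avoids the intermediate CSV step entirely. A plain copy like this writes the rows with fresh write times, so the remaining TTL of each source row is not carried over.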


I started one of the 3 nodes as a Spark node and am running the above Spark SQL query there, since the data set in dev is small.


Are there any considerations to take into account when working with production? I believe we have around a billion rows in prod. Would running spark-sql from any of the nodes impact cluster performance? Please advise.
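As a rough illustration of one lever that exists for this concern, the Spark Cassandra Connector exposes settings that limit how hard a job reads from and writes to the cluster. The sketch below only assembles such settings. The helper names and all values are hypothetical placeholders, not tuned recommendations, and the property names follow recent connector versions (older releases spell some of them differently, for example `spark.cassandra.output.throughput_mb_per_sec`).

```python
def throttled_connector_conf(read_consistency="LOCAL_ONE",
                             split_size_mb=64,
                             write_mb_per_sec_per_core=5,
                             concurrent_writes=2):
    """Return connector settings that limit the load a copy job puts on the cluster.

    All defaults here are illustrative assumptions, not measured values.
    """
    return {
        # Consistency level used for the table scan.
        "spark.cassandra.input.consistency.level": read_consistency,
        # Smaller splits give more, shorter read tasks.
        "spark.cassandra.input.split.sizeInMB": str(split_size_mb),
        # Cap on write throughput per core.
        "spark.cassandra.output.throughputMBPerSec": str(write_mb_per_sec_per_core),
        # Number of batches a single task writes in parallel.
        "spark.cassandra.output.concurrent.writes": str(concurrent_writes),
    }


def as_submit_args(conf):
    """Turn a settings dict into a flat list of --conf key=value arguments."""
    return [arg
            for key, value in sorted(conf.items())
            for arg in ("--conf", f"{key}={value}")]
```

The resulting list can be passed on the command line when submitting the job. Suitable values depend on the cluster and would need to be found by testing against a production-sized data set.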

spark
0 Answers