Currently we have a table as below:

CREATE TABLE test.progress_pp (
    user_id text,
    program_id text,
    asset_id text,
    completed_at timestamp,
    furthest_position float,
    furthest_position_asset_id text,
    furthest_position_updated_at timestamp,
    is_ota boolean,
    migrated_at timestamp,
    percentage_watched float,
    solr_query text,
    updated_at timestamp,
    PRIMARY KEY (user_id, program_id)
) WITH CLUSTERING ORDER BY (program_id ASC)

with a TTL of 6 years.
We are looking to clean up the old data based on the updated_at column by exporting only the last 6 months of data into another table, and then asking the app folks to point to the new table.
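For context, the target table I have in mind would keep the same schema and primary key as the original, just without the old rows (the name progress_pp_recent below is only a placeholder):

```sql
-- Sketch of the target table: same schema and primary key as test.progress_pp.
-- The table name progress_pp_recent is a placeholder.
CREATE TABLE test.progress_pp_recent (
    user_id text,
    program_id text,
    asset_id text,
    completed_at timestamp,
    furthest_position float,
    furthest_position_asset_id text,
    furthest_position_updated_at timestamp,
    is_ota boolean,
    migrated_at timestamp,
    percentage_watched float,
    solr_query text,
    updated_at timestamp,
    PRIMARY KEY (user_id, program_id)
) WITH CLUSTERING ORDER BY (program_id ASC);
```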
I am trying to run Spark SQL to fetch data based on a non-primary-key column, as below:
spark-sql> select * from test.progress_pp where updated_at > '2021-01-01 00:00:00+0000';
May I know if there is a way to export the result of the above Spark SQL to CSV, and then import that data into another table?
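For example, would something along these lines work? This is only a sketch using the INSERT OVERWRITE DIRECTORY syntax I found in the Spark SQL docs; the output path is a placeholder:

```sql
-- Sketch: export the filtered rows to CSV from the spark-sql shell.
-- '/tmp/progress_pp_export' is a placeholder path; adjust for your environment.
INSERT OVERWRITE DIRECTORY '/tmp/progress_pp_export'
USING csv
OPTIONS (header 'true')
SELECT * FROM test.progress_pp
WHERE updated_at > '2021-01-01 00:00:00+0000';
```

Or, if the CSV step is not strictly required, perhaps a direct table-to-table copy would be simpler, assuming a hypothetical target table test.progress_pp_recent already exists and is visible to Spark:

```sql
-- Sketch: copy the last 6 months of rows directly into the new table.
INSERT INTO test.progress_pp_recent
SELECT * FROM test.progress_pp
WHERE updated_at > '2021-01-01 00:00:00+0000';
```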
In dev I started one of the 3 nodes as a Spark node and am running the above Spark SQL there, since the data set is small. Are there any considerations to keep in mind when working with production? I believe we have around a billion rows in prod, and I suspect running spark-sql from any of the nodes would impact cluster performance. Please advise.