bharadhwajt asked

How do I export data from one table and import to another with Spark SQL?

Hi,

Currently we have a table as below:

CREATE TABLE test.progress_pp (
    user_id text,
    program_id text,
    asset_id text,
    completed_at timestamp,
    furthest_position float,
    furthest_position_asset_id text,
    furthest_position_updated_at timestamp,
    is_ota boolean,
    migrated_at timestamp,
    percentage_watched float,
    solr_query text,
    updated_at timestamp,
    PRIMARY KEY (user_id, program_id)
) WITH CLUSTERING ORDER BY (program_id ASC)

with a TTL of 6 years.

We are looking to clean up the old data based on the updated_at column by exporting only the last 6 months of data into another table, then asking the app folks to point to that table.

I am trying to run Spark SQL to fetch data based on a non-primary-key column, as below:

spark-sql> select * from test.progress_pp where updated_at > '2021-01-01 00:00:00+0000';

May I know if there is a way to export to CSV using the above Spark SQL and then import the data into another table?

I started one of our 3 nodes as a Spark node and am running the above Spark SQL there, since the data set in dev is small. Are there any considerations to take into account when working with production? I believe we have around a billion rows in prod; would running spark-sql from any of the nodes impact cluster performance? Please advise.

1 Answer

Erick Ramirez answered

You should be able to extract from one Cassandra table and save it to another table without needing to export to CSV. Let me show you how with a simple example.

Table schema

Here is my source table:

CREATE TABLE community.vaccinated (
    name text PRIMARY KEY,
    vaccinated timestamp
)

The destination table is identical except for its name:

CREATE TABLE community.justdosed (
    name text PRIMARY KEY,
    vaccinated timestamp
)

Sample data

cqlsh:community> SELECT * FROM vaccinated;

 name   | vaccinated
--------+---------------------------------
    bob | 2021-05-16 00:00:00.000000+0000
 charli | 2021-07-28 00:00:00.000000+0000
  alice | 2021-03-04 00:00:00.000000+0000

Spark SQL export

STEP 1 - Query the source table:

scala> val results = spark.sql("SELECT * from community.vaccinated WHERE vaccinated > '2021-04-01 +0000'")
results: org.apache.spark.sql.DataFrame = [name: string, vaccinated: timestamp]
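
A quick note on how spark.sql() resolves the community.vaccinated name here: DSE's spark-shell exposes Cassandra tables to Spark SQL automatically, but open-source Spark does not. There you would wire up the connector's catalog yourself. The sketch below assumes Spark 3 with connector 3.x; the version, host and the "cass" catalog name are placeholders:

$ spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 \
      --conf spark.cassandra.connection.host=127.0.0.1 \
      --conf spark.sql.catalog.cass=com.datastax.spark.connector.datasource.CassandraCatalog

scala> // with an explicit catalog, tables are addressed as catalog.keyspace.table
scala> val results = spark.sql("SELECT * FROM cass.community.vaccinated WHERE vaccinated > '2021-04-01'")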

STEP 2 - Show the results:

scala> results.show()
+------+-------------------+                                                    
|  name|         vaccinated|
+------+-------------------+
|   bob|2021-05-16 10:00:00|
|charli|2021-07-28 10:00:00|
+------+-------------------+
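
As an aside, since you asked about CSV: the same DataFrame can also be dumped to CSV with Spark's standard writer, although the intermediate file isn't needed here. A minimal sketch, with a hypothetical output path:

scala> // writes one part-file per Spark partition under the given directory
scala> results.write.option("header", "true").csv("/tmp/vaccinated_export")
scala> // and to load it back later:
scala> val fromCsv = spark.read.option("header", "true").option("inferSchema", "true").csv("/tmp/vaccinated_export")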

STEP 3 - Save the results to another table:

scala> results.write.cassandraFormat("justdosed","community").save()
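
Note that cassandraFormat() is an implicit the connector adds to DataFrameWriter. DSE's spark-shell should have it available out of the box; with open-source Spark you would import it explicitly (assuming the connector is already on the classpath):

scala> import org.apache.spark.sql.cassandra._   // enables .cassandraFormat(...)
scala> results.write.cassandraFormat("justdosed", "community").save()

Also be aware that save() expects the destination table to already exist; it won't create it for you.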

If I go and check the data in the destination table, I can see that it's been populated as expected:

cqlsh:community> SELECT * FROM justdosed ;

 name   | vaccinated
--------+---------------------------------
    bob | 2021-05-16 00:00:00.000000+0000
 charli | 2021-07-28 00:00:00.000000+0000
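
One last consideration given your 6-year TTL: the copy won't carry over each source row's remaining TTL. Rows get whatever TTL is in effect at write time (or the destination table's default). If the copied data should still expire, the connector has a spark.cassandra.output.ttl write setting; a sketch, assuming your connector version honours it as a writer option (the 6-month value is just an example):

scala> // 15552000 seconds = 180 days; applies one fixed TTL to every copied row
scala> results.write.cassandraFormat("justdosed", "community").option("spark.cassandra.output.ttl", "15552000").save()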