JoshPerryman avatar image
JoshPerryman asked Erick Ramirez edited

Best way to delete 1000s of partitions with Spark / DSE Analytics

I have a table with a partition key: run_date, entity_type, rank_offset. Each run_date, entity_type combination will have 30,000 or more record, and I use rank_offset to group them in 10s.

Occasionally, though not often, I will need to purge a given run_date, entity_type combination (e.g. delete all of the people records from 2018-01-23), which involves ~3,000 partitions. What's the best way to do this in my Spark Scala program?

I was thinking something like:

spark.sparkContext.cassandraTable(KS, T)
  .where(CQL_WHERE, runDate, entityType)
  .deleteFromCassandra(KS, T)

but I run afoul the problem that I've not included all 3 parts of my partition key.

I've looked at foreachParition but it isn't obvious how I'd use that with deleteFromCassandra.

Any guidance is appreciated

10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

Russell Spitzer avatar image
Russell Spitzer answered JoshPerryman commented

Yeah one of the issues with Partition Keys is that without the full PK you can't do a hash lookup. So you can't do any kind of Partial PK delete using the "deleteFrom".

Unless i'm misunderstanding this, the way that you usually would do this is something like

spark.sparkContext.cassandraTable(KS, T) //Full table scan
 .select("run_date", "entity_type", "rank_offset") // Prune only our PK columns
 .filter( ) // Do a Spark Side filtering here (No C* Pushdown)
 .deleteFromCassandra(KS, T)

Leave a comment and I can try to give you other solutions if I'm understanding it wrong

1 comment Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

JoshPerryman avatar image JoshPerryman commented ·

Thanks Russ!

Does it help at all that our rank_offset values will range from 0 to (count / 10)?

That's what I was struggling with, some way to do loop logic like: for(int i, i < partitionCount, i++) and combine it with the other 2 known parts of the partition key. But that may not be much different than your example.

0 Likes 0 ·