I have a table whose partition key is (run_date, entity_type, rank_offset). Each run_date, entity_type combination will have 30,000 or more records, and I use rank_offset to group them in 10s.
Occasionally, though not often, I will need to purge a given run_date, entity_type combination (e.g. delete all of the people records from 2018-01-23), which involves ~3,000 partitions. What's the best way to do this in my Spark Scala program?
I was thinking something like:
```scala
spark.sparkContext.cassandraTable(KS, T)
  .where(CQL_WHERE, runDate, entityType)
  .deleteFromCassandra(KS, T)
```
but I run afoul of the fact that I haven't included all three parts of the partition key.
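One thing I can do is enumerate the missing rank_offset values myself, since they follow a known pattern. This is a minimal sketch, assuming rank_offset starts at 0 and steps by 10 (30,000 records grouped in 10s gives the ~3,000 partitions mentioned above); the `PartitionKey` case class and `keysFor` helper are hypothetical names, and the real offset scheme may differ:

```scala
// Hypothetical sketch: enumerate every full partition key for one
// (run_date, entity_type) pair, so that each delete can name all
// three key columns. Assumes rank_offset = 0, 10, 20, ... which
// yields 3,000 keys for 30,000 records grouped in 10s.
object KeyEnumeration {
  case class PartitionKey(runDate: String, entityType: String, rankOffset: Int)

  def keysFor(runDate: String,
              entityType: String,
              recordCount: Int,
              groupSize: Int = 10): Seq[PartitionKey] =
    (0 until recordCount by groupSize)
      .map(off => PartitionKey(runDate, entityType, off))

  def main(args: Array[String]): Unit = {
    val keys = keysFor("2018-01-23", "people", 30000)
    println(keys.size)            // 3000 partition keys
    println(keys.last.rankOffset) // 29990
  }
}
```

The resulting sequence could then be parallelized into an RDD of full keys and handed to a connector delete, rather than trying to delete through a partial-key scan.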
I've looked at foreachPartition, but it isn't obvious how I'd use that with deleteFromCassandra.
Any guidance is appreciated.