DataStax Academy FAQ

DataStax Academy migrated to a new learning management system (LMS) in July 2020. We are also moving to a new Cassandra Certification process so there are changes to exam bookings, voucher system and issuing of certificates.

Check out the Academy FAQ pages for answers to your questions:


question

blagomira.blagoeva_187694 avatar image
blagomira.blagoeva_187694 asked ·

Is deleting multiple partitions with IN() operator better than deleting partitions one by one?

Performance of delete of multiple rows by multiple PK that are Partition Keys with "where in" clause vs deleting records one by one?

Which one will have a better performance delete with "where in" clause for multiple Primary Keys that are Partition Keys or deleting them one by one?

performance
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

Erick Ramirez avatar image
Erick Ramirez answered ·

In almost all cases, using the IN() operator on partition keys is not recommended. Allow me to explain.

Single-partition queries

Consider these queries:

DELETE FROM table_name WHERE pk = 'natalie'
DELETE FROM table_name WHERE pk = 'dylan'
DELETE FROM table_name WHERE pk = 'alex'

For each of these, a coordinator will be picked by the driver to coordinate the request, i.e. send the mutations to replicas.

For replication factor of 3 in a single-DC cluster, each coordinator will contact the 3 replicas. Put simply, ONE request PER coordinator to contact THREE replicas.

Multi-partition queries

Consider this query:

DELETE FROM table_name WHERE pk IN ('natalie', 'dylan', 'alex')

Since there is only one query, only one coordinator will get picked. However given there are 3 partitions involved, the coordinator needs to fire three separate requests under-the-hood.

This means that there is ONLY ONE coordinator to coordinate THREE requests involving NINE replicas.

In a small cluster of 3 nodes, this is probably OK. But in larger clusters with multiple DCs, coordinating the multi-partition request to the replicas in all DCs can get out of hand pretty quickly.

The bottom line is that multi-partition queries with the IN() operator does not scale beyond 2-3 partitions. It is discouraged and not recommended.

Usage

So what is the IN() operator good for? Only use IN() to filter on clustering columns, that is to filter the rows in ONE partition.

To specifically respond to your question, delete one partition at a time so that the requests are distributed across the nodes in the cluster.

For more info, see Using IN to filter queries. Cheers!

Share
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.