zhouzhiyong123 asked

Get All Partition Keys On A Physical Machine

Hi Friends,

I am new to Cassandra. This is my first post.

Here is the scenario:

Suppose we have 100 Cassandra machines storing raw data that needs to be processed.

There are 10 worker machines that are used to process the data. Each worker machine is responsible for handling around 10% of the load.

The task table is:

Task ID --- GUID, primary key and partition key

TaskComplete --- true/false

Task Content

Is there any way for a worker to query the uncompleted tasks it is supposed to handle?

data modeling, anti-pattern

1 Answer

Erick Ramirez answered

This sounds like you want to use Cassandra as a queue of items to process. This is a classic anti-pattern in Cassandra for several reasons. See Cassandra anti-patterns: Queues and queue-like datasets.

Specifically for the use case you described, it implies that you want to read all the partitions to get a list of items to process. This is a full table scan and has a significant performance impact because it won't scale as your data set and cluster grow. Imagine a single coordinator requesting all the records from tens or hundreds of nodes: the query is likely to hit the request timeout and never return a result.

There are limited ways to solve your problem. Check out Ryan Svihla's blog post Understanding Deletes for suggestions on how to model your data. Cheers!


zhouzhiyong123 commented

Hi Erick,

Thank you very much for your reply.

Yes, it is a kind of full table scan. But I only want each worker to scan part of the table, so that together the workers scan the whole table. It is a kind of map-reduce pattern. In terms of scale, if the Cassandra cluster grows to 1000 nodes, there will be 100 workers, and each worker's load will stay the same.

The key question here is: what is the best way for a worker to scan a range of partitions?



Erick Ramirez commented

If you're using the Spark connector, it's smart enough to handle it. Cheers!
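
For workers that do not use Spark, one possible way to let each worker scan only its share of the partitions is to divide the partitioner's token ring among the workers and query with the CQL `token()` function. The following is only a rough sketch, assuming the cluster uses the default Murmur3Partitioner (tokens from -2^63 to 2^63 - 1). The table name `tasks` and the column names `task_id`, `task_complete` and `task_content` are hypothetical stand-ins for the task table in the question, and the function names are made up for illustration. It only builds the query text; running it with a driver, paging through results and skipping completed tasks on the client side are left out.

```python
# Token bounds of Cassandra's default Murmur3Partitioner.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ring(num_splits):
    """Divide the token ring into num_splits contiguous (start, end) ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN
    # num_splits + 1 evenly spaced boundaries from MIN_TOKEN to MAX_TOKEN.
    bounds = [MIN_TOKEN + (total * i) // num_splits
              for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def scan_query_for_worker(worker_index, num_workers,
                          table="tasks", key="task_id"):
    """Build the CQL text for the token range owned by one worker.

    Table and column names are hypothetical. Ranges are (start, end],
    except the first one, which also includes its lower bound.
    """
    start, end = split_token_ring(num_workers)[worker_index]
    lower_op = ">=" if worker_index == 0 else ">"
    return (
        f"SELECT {key}, task_complete, task_content FROM {table} "
        f"WHERE token({key}) {lower_op} {start} AND token({key}) <= {end}"
    )


if __name__ == "__main__":
    # Print the query each of 10 workers would run.
    for worker in range(10):
        print(scan_query_for_worker(worker, 10))
```

In practice each worker would probably break its range into many smaller sub-ranges so that no single query has to return a tenth of the table. This is still a full scan spread across workers, so the concerns about queue-like workloads in the answer above still apply.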
