HDC asked Erick Ramirez answered

Inconsistent LOCAL_QUORUM query results in Cassandra

I wrote the wrong version in my last post; the correct Cassandra version is 2.1.15.

Node counts: dc1: 80, dc2: 80



WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 3};

We have run into a problem with Cassandra: queries at LOCAL_QUORUM return inconsistent results. We only read and write in dc1.

We write at LOCAL_QUORUM and also query at LOCAL_QUORUM.


SELECT COUNT(*) FROM table WHERE partitionKey = ?;



During this period we did a large-scale expansion of the cluster, and we made sure cleanup was run on every machine.

We also found that running getEndpoint <keyspace> <table> <key> on different machines returned inconsistent results.

Eventually, we found that getEndpoint returned 4 machines in dc1.

We then ran getSstable on those 4 machines, and only 3 of them returned a result.

At the same time, we hit a similar problem with another partitionKey, but that partitionKey had only ever been queried once. Because we record the total row count for this partitionKey elsewhere, we could confirm that its count was wrong. After we restarted every machine in dc1 one by one, the problem went away: the count for that partitionKey matched our recorded value, and repeating the same query no longer changed the result.



1 Answer

Erick Ramirez answered

Read repairs definitely play a large part in this.


But the underlying issue is that the data is inconsistent between replicas. Nodes drop mutations when the cluster gets overloaded, so replicas fall out of sync. If you run repairs regularly and make sure your cluster doesn't get overloaded, you won't run into this issue.

On the second part of your question, nodetool getendpoints simply hashes the partition key to get its token value, then returns the list of nodes which own that token, regardless of whether the partition key actually exists in the cluster.
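That lookup can be sketched in Python. This is a deliberately simplified toy, assuming a made-up hash, four single-token nodes, and plain clockwise placement; real Cassandra uses the Murmur3 partitioner, vnodes, and NetworkTopologyStrategy, so the numbers here are illustrative only:

```python
import bisect
import hashlib

# Toy token ring illustrating what `nodetool getendpoints` does.
# Node names and tokens are made up for this sketch.
RING = sorted([
    (100, "node-a"),
    (200, "node-b"),
    (300, "node-c"),
    (400, "node-d"),
])
TOKENS = [t for t, _ in RING]

def token_for(partition_key: str, ring_size: int = 500) -> int:
    """Hash the partition key to a token (toy stand-in for Murmur3)."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % ring_size

def endpoints(partition_key: str, rf: int = 3) -> list[str]:
    """Walk the ring clockwise from the key's token, taking rf nodes.

    The key's existence is never checked: getendpoints is pure token
    math, which is why it answers even for keys that were never written.
    """
    token = token_for(partition_key)
    start = bisect.bisect_left(TOKENS, token) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]

print(endpoints("some-partition-key"))          # a list of 3 replica owners
print(endpoints("key-that-was-never-written"))  # still gets an answer
```

The key point the sketch makes: the function returns replicas for any input, so getendpoints output alone tells you nothing about where data actually resides.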

When you add nodes to a cluster, each new node takes ownership of one or more token ranges. While it is bootstrapping, it does not accept read requests, but the number of nodes which hold the data (replicas) temporarily increases.

For example, if node A owns tokens 0-100 and node E is taking ownership of token 50, both nodes A and E are replicas of token 50 while node E is bootstrapping. When node E successfully joins the cluster (bootstrapping has completed), node A will no longer own token 50 and the number of replicas will go back to normal.
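The node A / node E example above can be sketched as follows. The tokens and names come from the example; the ownership helper is an illustration of the idea, not Cassandra's placement code:

```python
# Sketch of replica ownership during bootstrap, following the
# node A / node E example: A owns tokens 0-100, E joins at token 50.

def owner(ring: dict[int, str], token: int) -> str:
    """First node whose ring token is >= the lookup token (with wrap-around)."""
    for t in sorted(ring):
        if token <= t:
            return ring[t]
    return ring[min(ring)]  # wrap around the ring

ring = {100: "A"}               # node A owns tokens 0-100
assert owner(ring, 50) == "A"   # before the join, A alone owns token 50

# While node E bootstraps at token 50, it streams the data for 0-50,
# but A still serves reads, so both hold token 50's data.
after_join = dict(ring)
after_join[50] = "E"
replicas_during_bootstrap = {owner(ring, 50), owner(after_join, 50)}
print(replicas_during_bootstrap)  # both A and E are replicas of token 50

# Once E has joined, 0-50 belongs to E; A's leftover copy of that
# range is what `nodetool cleanup` removes.
assert owner(after_join, 50) == "E"
assert owner(after_join, 75) == "A"
```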

On the third part of your question, counting with COUNT() is a bad idea because you will get inconsistent results while data is mutating. It is an unreliable way of tracking data, as I've explained in Why COUNT() is bad in Cassandra. Cheers!
