HDC asked Erick Ramirez answered

Cassandra local_quorum queries return inconsistent results

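I got the version wrong in my previous post. The correct version is Cassandra 2.1.15.

Number of nodes: dc1: 80, dc2: 80

Problem:

Our replication strategy is as follows: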

WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 3};

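We have run into a problem with Cassandra: queries at local_quorum return inconsistent results. We only read and write in dc1.

We write with local_quorum and we also read with local_quorum.

However, we see the following behavior with this statement: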

select count(*) from table where partitionKey=?

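The results are inconsistent at first and eventually become consistent.

For example, the first query returns 10000, the second 9998, the third 9997, and the count may eventually settle at 10001. (Perhaps read repair was triggered, which eventually stabilized the result.)

During this period we did a large-scale expansion of the cluster, and we made sure cleanup was run on every machine.

We also found that running getEndpoint <keyspace> <table> <key> on different machines returns different results.

In the end, we found that the getEndpoint result listed 4 machines in dc1.

We then ran getSstable on those 4 machines, and only 3 of them returned a result.

Meanwhile, another partitionKey showed a similar problem, but this partitionKey had only been queried once. Because we record the total row count for this partitionKey elsewhere, we could confirm that the count returned was wrong. After we restarted every machine in dc1 one by one, the problem went away: the total count for the partitionKey matched our recorded value, and repeating the same query no longer changed the result.

So I suspect that gossip propagates node information too slowly, which may affect which nodes are chosen for a query and lead to the inconsistent results.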

consistency

1 Answer

Erick Ramirez answered

Read repairs definitely play a large part in this:

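For example, the first query returns 10000, the second 9998, the third 9997, and the count may eventually settle at 10001. (Perhaps read repair was triggered, which eventually stabilized the result.)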

But the underlying issue is that the data is inconsistent between replicas. Nodes drop mutations when the cluster gets overloaded, so replicas become inconsistent. If you run repairs regularly and make sure your cluster doesn't get overloaded, you won't run into this issue.

On the second part of your question, nodetool getendpoints simply hashes the partition key to get its token value, then returns the list of nodes which own that token -- regardless of whether the partition key exists in the cluster.
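To illustrate the idea, here is a rough Python sketch of a token lookup. It is not how Cassandra is implemented: the ring, the node names, the 0-99 token space and the `toy_token` hash are all made up for the example (a real cluster uses its configured partitioner and replication strategy). The point it shows is that the lookup never touches any data, so it returns a list of nodes even for a key that was never written.

```python
import bisect
import hashlib

# Hypothetical ring: one token per node, tokens in a toy 0..99 space.
RING = [(25, "A"), (50, "B"), (75, "C"), (99, "D")]


def toy_token(partition_key: str) -> int:
    """Toy hash (an assumption for this sketch): map a key onto the 0..99 ring."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def endpoints_for_token(token: int, ring=RING, rf: int = 3) -> list:
    """Start at the first ring token >= token and walk clockwise for rf nodes."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    count = min(rf, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


def endpoints_for_key(partition_key: str, ring=RING, rf: int = 3) -> list:
    """Hash the key, then look up owners. No data is read at any point."""
    return endpoints_for_token(toy_token(partition_key), ring, rf)


if __name__ == "__main__":
    # The key does not need to exist anywhere for this to return nodes.
    print(endpoints_for_key("a-key-that-was-never-written"))
```

Because the result depends only on the hash and on the ring as the node currently sees it, two nodes with different views of the ring can return different lists for the same key.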

When you are adding nodes to a cluster, each new node takes ownership of one or more token ranges. While it is bootstrapping, it does not accept read requests, but the number of nodes which own the data (replicas) increases temporarily.

For example, if node A owns tokens 0-100 and node E is taking ownership of token 50, both nodes A and E are replicas of token 50 while node E is bootstrapping. When node E successfully joins the cluster (bootstrapping has completed), node A will no longer own token 50 and the number of replicas will go back to normal.
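The example above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Cassandra's actual code: it uses a replication factor of 1, invents a second node B at token 200 so the ring has more than one member, and the `owner` and `replicas` helpers are hypothetical names for this illustration.

```python
def owner(token, ring):
    """Return the node whose token is the first one >= token (wrapping around)."""
    ordered = sorted(ring)
    for node_token, node in ordered:
        if token <= node_token:
            return node
    return ordered[0][1]


def replicas(token, current_ring, pending_ring=None):
    """Nodes holding a token: current owner plus the pending owner, if a node is joining."""
    nodes = {owner(token, current_ring)}
    if pending_ring is not None:
        # While a node bootstraps, the future owner is counted as well.
        nodes.add(owner(token, pending_ring))
    return sorted(nodes)


# Assumed layout: A owns (0, 100], B owns the rest; E is joining at token 50.
ring_before = [(100, "A"), (200, "B")]
ring_after = ring_before + [(50, "E")]

if __name__ == "__main__":
    print(replicas(50, ring_before))              # before E joins
    print(replicas(50, ring_before, ring_after))  # while E is bootstrapping
    print(replicas(50, ring_after))               # after E has joined
```

In this model, token 50 has one replica before the expansion, two while E is bootstrapping, and one again once E has joined, which matches the temporary extra endpoint described above.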

On the third part of your question, counting using COUNT() is a bad idea because you will get inconsistent results while data is mutating. It is an unreliable way of tracking data as I've explained in Why COUNT() is bad in Cassandra. Cheers!
