Tri asked

How does a node know which nodes hold the data?

Can you please help to clarify the Replication mechanism described in

A user connects to node1 to write data belonging to partition P. But partition P is stored in nodeA and replicated to nodeB and nodeC.

Q1. How does the contact node (node1) know that nodeA holds partition P? It knows where nodeA is located thanks to the snitch info, but token range info is not part of the gossip info. So how does node1 know about the token ranges of nodeA?

Q2. Let's assume nodeA is offline at that moment. Would node1 know about this offline status after it had forwarded the write request to nodeA? Node1 would then make a new attempt to forward the write request to either nodeB or nodeC. In this case, how does node1 know that nodeB or nodeC holds the replicas of nodeA?

Q3. Let's assume nodeA is working OK. It receives the write request from node1 and handles the write successfully. Then nodeA must fulfill the replication factor of 3 by asking nodeB and nodeC to store a copy. Based on the explanation in the video course, this is because B and C are neighbors of A. But how does A know that B and C are its neighbors?

I am wondering if there are some deterministic rules that allow a node to compute (node, neighbors) = f(partition) based on some cluster topology params.

Thanks for any clarification.


1 Answer

Erick Ramirez answered

C* Partitions

When a node joins the cluster for the first time, it announces the token (tokens in the case of virtual nodes) it owns so the rest of the nodes know where it belongs in the ring. Nodes gossip this token metadata to each other and it gets persisted in each node's system.peers table.

The nodes determine where the partition lives using a combination of the partitioner and snitch -- the snitch provides information about the cluster topology (e.g. which DC nodes belong to) and the partitioner calculates the token value using a hash function on the partition key.
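As a rough illustration of the partitioner step, the sketch below hashes a partition key to a token. It is a toy, not Cassandra's actual partitioner: the `toy_token` function, the MD5-based hash and the 0 to 99 token space are all assumptions chosen to keep the example small. The point is only that the mapping is deterministic, so every node computes the same token for the same key without asking anyone.

```python
import hashlib

RING_SIZE = 100  # assumed toy token space (0..99), not a real Cassandra range


def toy_token(partition_key: str) -> int:
    """Map a partition key to a token on a toy ring (hypothetical helper)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    # Interpret the digest as a big integer and fold it into the toy ring.
    return int.from_bytes(digest, "big") % RING_SIZE


# Any node that runs this on the same key gets the same token.
print(toy_token("user:42") == toy_token("user:42"))
```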


Since you're already familiar with the Replication unit in the DS201 course at DataStax Academy, I'll use the ring example in it to expand on the ideas above.

For the purposes of your questions:

  • node 1 is the coordinator node which is blue-green in the diagram
  • node A is purple in the diagram (token 63)
  • node B is red in the diagram (token 75)
  • node C is blue in the diagram (token 88)

The coordinator determines the primary owner of the partition through the partitioner's hashing function. In the example above, the partition key has a token value of 59. Since the purple node owns tokens in the range 51 to 63 (indicated by the purple "inner-ring"), the coordinator knows that it is the primary owner for token 59 (we'll call it "primary replica").

The replicas are the immediate neighbours of the "primary replica". A replication factor of 3 in the ring (data centre) means that Cassandra stores 3 copies of the partition so the primary owner (purple) plus its 2 neighbours (red and blue) hold the 3 copies.

The coordinator knows the neighbouring nodes since it has knowledge of the cluster topology from the gossip information it has as I mentioned in the section above.
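The lookup described above can be sketched in a few lines. This is a minimal illustration, assuming the simple clockwise placement described here, where each node owns the range ending at its token; rack-aware or multi-DC replication strategies may skip some neighbours. The tokens for A, B and C come from the example, while the other tokens, the node names and the `replicas_for_token` helper are hypothetical.

```python
from bisect import bisect_left

# Token -> node. A (63), B (75) and C (88) are from the example above;
# the remaining tokens and node names are assumed for illustration.
RING = {0: "n0", 13: "n13", 25: "n25", 38: "n38",
        50: "n50", 63: "A", 75: "B", 88: "C"}


def replicas_for_token(ring, token, rf):
    """Return the primary replica plus the next rf - 1 nodes clockwise."""
    tokens = sorted(ring)
    # Primary owner: first node whose token is >= the partition token,
    # wrapping around to the start of the ring if there is none.
    start = bisect_left(tokens, token) % len(tokens)
    count = min(rf, len(tokens))
    return [ring[tokens[(start + i) % len(tokens)]] for i in range(count)]


print(replicas_for_token(RING, 59, 3))  # ['A', 'B', 'C']
```

Because every node holds the same token metadata, any coordinator running this kind of lookup arrives at the same replica list.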

Your questions

Let me address your specific questions here.

  1. As explained above, nodes gossip about each other including token metadata so they know about the cluster topology. This is how node 1 (coordinator) knows that node A is the primary owner of partition P.
  2. Nodes know about each other's status based on the information they've shared with each other via gossip. To clarify your understanding, the coordinator sends mutations (writes) to all replicas at the same time, not just to node A. The coordinator knows that nodes B & C are replicas based on what I've described in the top section above.
  3. You've misunderstood how it works. As in question 2, the coordinator sends the mutation to all replicas.
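To make points 2 and 3 concrete, here is a small sketch of a coordinator sending a write to all replicas at the same time instead of trying node A first and falling back. Everything in it is hypothetical: `send_mutation` stands in for the network call, `alive` stands in for what the coordinator has learned via gossip, and `required_acks` is an assumed parameter for how many acknowledgements the write needs.

```python
from concurrent.futures import ThreadPoolExecutor


def send_mutation(node, alive):
    """Hypothetical stand-in for a network write; True means acknowledged."""
    return node in alive


def coordinate_write(replicas, alive, required_acks):
    """Fan the write out to every replica at once and count acknowledgements."""
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        # pool.map keeps results in the same order as the replicas list.
        results = list(pool.map(lambda n: send_mutation(n, alive), replicas))
    acked = [n for n, ok in zip(replicas, results) if ok]
    return len(acked) >= required_acks, acked


# Node A is offline, but B and C still receive and acknowledge the write.
ok, acked = coordinate_write(["A", "B", "C"], alive={"B", "C"}, required_acks=2)
print(ok, acked)  # True ['B', 'C']
```

In this sketch there is no retry step: node A being down simply means one fewer acknowledgement, since B and C were sent the write at the same time.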

Further information

You can get more information about how gossip and failure detection work in the document Internode communications. If you're interested in the source code, you can check out the implementation of (link to C* 3.11.6 code) and (link to C* 3.11.6 code).

You can get additional information on hashing and replication in Data distribution and replication. There's also additional info on partitions and partitioning in Partitioners. Cheers!

replicas.png (447.1 KiB)

Tri commented:

Thanks very much @Erick Ramirez for the detailed answer. The `system.peers` table clarifies a lot.

Is it correct to say that the replica neighbor nodes are always the next N nodes (N = RF - 1) after the primary replica node?

Erick Ramirez commented:

That's correct. Cheers!
