pranali.khanna101994_189965 asked ·

How does Cassandra handle tombstones in hinted handoff?

I was reading about how data is deleted in Cassandra at the link below:


https://docs.datastax.com/en/ddac/doc/datastax_enterprise/dbInternals/dbIntAboutDeletes.html


What happens when I send a delete request with a consistency level of ONE and RF=3?

The row is marked for deletion on one of the replicas, the response is sent to the client, and the delete is then forwarded to the other replica nodes. What if 2 replicas are down? In that case the column marked as a tombstone will be stored as a hint. According to the paragraph below:


"When an unresponsive node recovers, Cassandra uses hinted handoffs to replay the database mutations that the node missed while it was down. Cassandra does not replay a mutation for a tombstone during its grace period. If the node does not recover until after the grace period ends, the deletion might be missed."


So according to this, tombstones are not replayed via hinted handoff? Why is that so?


As per my understanding, if the node comes up before the tombstone's grace period ends, the tombstone should be propagated to both nodes.


And if the node does not come up in time, which means the tombstone has already been removed from the cluster, will the tombstone still be stored as a hint or not?



cassandra, tombstones, hinted handoff

saravanan.chinnachamy_185977 answered ·

@pranali.khanna101994_189965 As Erick mentioned above, the document is a bit confusing.

Tombstones are markers written by delete mutations (there are other cases too, such as TTL expiry). If a node goes down, the coordinator node collects hints for it for a period defined by the parameter max_hint_window_in_ms (3 hours by default). If the downed node later comes back online and rejoins the cluster, the hints are replayed from the coordinator node to the node that was down.
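The behaviour described above can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions, not Cassandra's actual implementation: the names Mutation, Coordinator, write and replay are hypothetical, replicas are plain dictionaries, and conflicts are resolved by keeping the mutation with the newest timestamp. The point it shows is that a delete is stored and replayed as a hint just like any other mutation.

```python
from dataclasses import dataclass, field
from typing import Optional

# 3 hours, the default max_hint_window_in_ms mentioned above
MAX_HINT_WINDOW_MS = 3 * 60 * 60 * 1000


@dataclass
class Mutation:
    key: int
    timestamp_ms: int
    value: Optional[str] = None
    is_tombstone: bool = False  # True for a delete


@dataclass
class Coordinator:
    # replica name -> list of mutations waiting to be replayed (hypothetical structure)
    hints: dict = field(default_factory=dict)

    def write(self, mutation, replicas, down_since_ms):
        """Apply to live replicas; keep a hint for down replicas inside the hint window."""
        for name, replica in replicas.items():
            went_down = down_since_ms.get(name)
            if went_down is None:
                replica[mutation.key] = mutation
            elif mutation.timestamp_ms - went_down <= MAX_HINT_WINDOW_MS:
                self.hints.setdefault(name, []).append(mutation)
            # Otherwise the window has passed and no hint is stored for this replica.

    def replay(self, name, replica):
        """Replay stored hints to a recovered replica; the newest timestamp wins."""
        for mutation in self.hints.pop(name, []):
            current = replica.get(mutation.key)
            if current is None or mutation.timestamp_ms > current.timestamp_ms:
                replica[mutation.key] = mutation


replicas = {"node1": {}, "node2": {}, "node3": {}}
coordinator = Coordinator()

# All nodes are up: the insert reaches every replica.
coordinator.write(Mutation(1001, 1_000, value="kevin"), replicas, {})

# node3 goes down at t=2000 ms; the delete is stored as a hint for it.
coordinator.write(Mutation(1001, 3_000, is_tombstone=True), replicas, {"node3": 2_000})

# node3 comes back: the hinted tombstone is replayed.
coordinator.replay("node3", replicas["node3"])
print(replicas["node3"][1001].is_tombstone)
```

In this model, a mutation written more than MAX_HINT_WINDOW_MS after the node went down is never hinted, so that replica can only catch up through some other mechanism such as repair.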


Please see an example below to better understand the concepts.

I have created a table in a 3 node cluster with RF=3 and added 2 records.

cqlsh:killervideo> create table emp_by_id(emp_id int PRIMARY KEY,emp_name text,emp_city text);
cqlsh:killervideo> insert into emp_by_id(emp_id,emp_name,emp_city) values (1001,'kevin','london');
cqlsh:killervideo> insert into emp_by_id(emp_id,emp_name,emp_city) values (1002,'miller','new york');


Since it is a 3-node cluster with RF=3, we can expect to see all the records on every node. We can inspect the data on node3 using the sstabledump command.

$ sstabledump ac-4-bti-Data.db
  {
    "partition" : {
      "key" : [ "1001" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 37,
        "liveness_info" : { "tstamp" : "2020-06-15T19:32:43.300870Z" },
        "cells" : [
          { "name" : "emp_city", "value" : "london" },
          { "name" : "emp_name", "value" : "kevin" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "1002" ],
      "position" : 38
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 81,
        "liveness_info" : { "tstamp" : "2020-06-15T19:32:48.425054Z" },
        "cells" : [
          { "name" : "emp_city", "value" : "new york" },
          { "name" : "emp_name", "value" : "miller" }
        ]
      }
    ]
  }


Bring node3 down.

Status=Up/Down State=Normal/Leaving/Joining/Moving/Stopped
--  Address     Load       Tokens       Owns (effective)  Host ID                               Rack
DS  10.XXX.X.3  233.42 KiB  8            100.0%            4bd6a53c-92da-4297-ae84-c82076da2c69  r1
UN  10.XXX.X.2  210.17 KiB  8            100.0%            4daa4c2f-e0bf-4b01-a43a-f5e0b469b0cf  r1
UN  10.XXX.X.1  164.02 KiB  8            100.0%            079fe7b1-bcb9-4f51-837f-42fd477402e1  r1


Insert one new record and delete an existing record. Inspecting the hints directory on the coordinator node shows that hints have been generated.

-bash-4.2$ ls -ltr
-rw-r--r--. 1 cassandra cassandra 13334 Jun 15 19:47 4bd6a53c-92da-4297-ae84-c82076da2c69-1592250443860-1.hints
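The hint file name itself carries useful information. The sketch below assumes the name follows the pattern host ID, creation timestamp in milliseconds, then hints format version; the helper parse_hint_filename is hypothetical and not part of any Cassandra tooling.

```python
import uuid
from datetime import datetime, timedelta, timezone


def parse_hint_filename(name):
    """Split '<host-id>-<timestamp-ms>-<version>.hints' into its parts (assumed layout)."""
    stem, extension = name.rsplit(".", 1)
    if extension != "hints":
        raise ValueError(f"not a hints file: {name}")
    # The host ID contains hyphens itself, so split from the right.
    host_id, timestamp_ms, version = stem.rsplit("-", 2)
    millis = int(timestamp_ms)
    created = datetime.fromtimestamp(millis // 1000, tz=timezone.utc) + timedelta(
        milliseconds=millis % 1000
    )
    return uuid.UUID(host_id), created, int(version)


host_id, created, version = parse_hint_filename(
    "4bd6a53c-92da-4297-ae84-c82076da2c69-1592250443860-1.hints"
)
print(host_id, created.isoformat(), version)
```

For the file listed above, the host ID is the one shown for 10.XXX.X.3 in the nodetool status output, and the timestamp resolves to 19:47:23 UTC on 2020-06-15, which lines up with the file's modification time.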


Bring node3 back up again.

Datacenter: davitapoc
=====================
Status=Up/Down State=Normal/Leaving/Joining/Moving/Stopped
--  Address     Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.XXX.X.3  275.4 KiB  8            100.0%            4bd6a53c-92da-4297-ae84-c82076da2c69  r1
UN  10.XXX.X.2  216.13 KiB  8            100.0%            4daa4c2f-e0bf-4b01-a43a-f5e0b469b0cf  r1
UN  10.XXX.X.1  170.18 KiB  8            100.0%            079fe7b1-bcb9-4f51-837f-42fd477402e1  r1


Now inspect the data on node3.

$ sstabledump ac-5-bti-Data.db
[
  {
    "partition" : {
      "key" : [ "1001" ],
      "position" : 0,
      "deletion_info" : { "marked_deleted" : "2020-06-15T19:47:29.001499Z", "local_delete_time" : "2020-06-15T19:47:29Z" }
    },
    "rows" : [ ]
  },
  {
    "partition" : {
      "key" : [ "1003" ],
      "position" : 19
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 58,
        "liveness_info" : { "tstamp" : "2020-06-15T19:47:20.552343Z" },
        "cells" : [
          { "name" : "emp_city", "value" : "chicago" },
          { "name" : "emp_name", "value" : "robert" }
        ]
      }
    ]
  }
]

You can see that the mutations, including the delete, are replayed from hints to the downed node. The hints are deleted once the data has been completely replayed.


-bash-4.2$ pwd
/dse_data/hints
-bash-4.2$ ls
-bash-4.2$  
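To check a larger sstabledump output for replayed deletes, the JSON can be scanned programmatically. The sketch below is a minimal example that assumes the output is a JSON array shaped like the dump of ac-5-bti-Data.db above; the sample is trimmed to the fields it needs, and the function name deleted_partitions is made up for illustration.

```python
import json

# Trimmed copy of the sstabledump output above, with the closing bracket added so it parses.
DUMP = """
[
  {
    "partition" : {
      "key" : [ "1001" ],
      "position" : 0,
      "deletion_info" : { "marked_deleted" : "2020-06-15T19:47:29.001499Z", "local_delete_time" : "2020-06-15T19:47:29Z" }
    },
    "rows" : [ ]
  },
  {
    "partition" : {
      "key" : [ "1003" ],
      "position" : 19
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 58,
        "liveness_info" : { "tstamp" : "2020-06-15T19:47:20.552343Z" },
        "cells" : [
          { "name" : "emp_city", "value" : "chicago" },
          { "name" : "emp_name", "value" : "robert" }
        ]
      }
    ]
  }
]
"""


def deleted_partitions(dump_text):
    """Map each partition key that carries a partition-level tombstone to its deletion time."""
    result = {}
    for entry in json.loads(dump_text):
        partition = entry["partition"]
        info = partition.get("deletion_info")
        if info is not None:
            result[tuple(partition["key"])] = info["marked_deleted"]
    return result


print(deleted_partitions(DUMP))
```

This only looks at partition-level deletion_info, which is what the example produces; row or cell tombstones would need additional checks.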


You can also see the tombstone in the output of the nodetool tablestats command (see the "tombstones per slice" lines).

$ nodetool tablestats -H killervideo.emp_by_id;
Total number of tables: 50
----------------
Keyspace : killervideo
Read Count: 4
Read Latency: 1.31225 ms
Write Count: 2
Write Latency: 0.154 ms
Pending Flushes: 0
Table: emp_by_id
SSTable count: 2
Space used (live): 9.74 KiB
Space used (total): 9.74 KiB
Space used by snapshots (total): 14.67 KiB
Off heap memory used (total): 32 bytes
SSTable Compression Ratio: 1.0141843971631206
Number of partitions (estimate): 3
Memtable cell count: 0
Memtable data size: 0 bytes
Memtable off heap memory used: 0 bytes
Memtable switch count: 3
Local read count: 4
Local read latency: NaN ms
Local write count: 2
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 0.0
Bytes repaired: 0.000KiB
Bytes unrepaired: 0.138KiB
Bytes pending repair: 0.000KiB
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 32 bytes
Bloom filter off heap memory used: 16 bytes
Index summary off heap memory used: 0 bytes
Compression metadata off heap memory used: 16 bytes
Compacted partition minimum bytes: 18
Compacted partition maximum bytes: 50
Compacted partition mean bytes: 38
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 0 bytes
Failed Replication Count: null



Erick Ramirez answered ·

I believe all hinted mutations get replayed to the respective replica. I am looking through the code to try to understand why the document says otherwise, and I think that paragraph misquoted something else about hints and hinted handoffs.

If a node has been down for longer than gc_grace_seconds (the default GC grace is 10 days), you should not bring the node back online. Instead, you should wipe its data, commitlog and saved_caches directories, then bootstrap it back into the cluster using the "replace" method -- replace the node with itself using its own IP address with the JVM flag -Dcassandra.replace_address_first_boot=<node_ip_address>. Cheers!
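As a rough way to reason about the thresholds mentioned in this thread, here is a small Python sketch. It assumes the default values quoted above (a 3-hour hint window and a 10-day GC grace); the function recovery_action and its return strings are hypothetical, and the middle case (restart, then repair) is the commonly recommended practice rather than something stated in this answer.

```python
GC_GRACE_SECONDS = 10 * 24 * 60 * 60   # default gc_grace_seconds (10 days)
MAX_HINT_WINDOW_SECONDS = 3 * 60 * 60  # default max_hint_window_in_ms (3 hours)


def recovery_action(downtime_seconds,
                    gc_grace_seconds=GC_GRACE_SECONDS,
                    hint_window_seconds=MAX_HINT_WINDOW_SECONDS):
    """Suggest how to bring a node back, based on how long it was down."""
    if downtime_seconds > gc_grace_seconds:
        # Tombstones may already have been purged elsewhere; do not rejoin with old data.
        return "wipe and replace the node"
    if downtime_seconds > hint_window_seconds:
        # Hints stopped being collected part-way through the outage.
        return "restart, then run repair"
    return "restart; hints cover the outage"


for hours in (1, 48, 11 * 24):
    print(hours, "hours down ->", recovery_action(hours * 3600))
```

Adjust the two constants if the table or cluster uses non-default settings, since gc_grace_seconds is configured per table.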
