Tri asked ·

How does Cassandra guarantee data integrity?

This might be paranoid, but how can we be sure that Cassandra writes exactly what it is given? Let's take a simplistic example: a 1-node cluster and a table with only 2 columns.

CREATE TABLE users (userId UUID, username TEXT, PRIMARY KEY (userId));

INSERT INTO users (userId, username) VALUES (UUID(), 'Issac Newton');

What is the underlying mechanism that guarantees that Cassandra did write the given value 'Issac Newton' and not something else, like 'Issac N'?

In the general case, Cassandra may receive data that differs from the original, for example a row that was partially truncated/mutated in flight. Or even when Cassandra did receive the correct row data, maybe something random still happens and Cassandra writes B instead of A.


1 Answer

Erick Ramirez answered ·

There are no guarantees in life :) but the scenario you described is pretty unlikely. Let me explain.

The CQL native protocol is frame-based: each frame is a fixed-size header followed by a variable-length body.

Header validation

The 9-byte header is composed of:

  • protocol version, e.g. v4 (1 byte)
  • flags: compression, tracing, custom payload, warning (1 byte)
  • stream ID (2 bytes)
  • operation code, e.g. STARTUP, AUTHENTICATE, QUERY, PREPARE, AUTH_CHALLENGE (1 byte)
  • length of the body of the frame up to a maximum of 256MB (4 bytes)

The body of the message is variable in size but its length is defined in the header, so if the body's length is off for whatever reason, it gets picked up immediately.
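To make the header check concrete, here is a minimal Python sketch of splitting a frame into the 9-byte header and the body, and rejecting it when the declared length doesn't match. The names `HEADER`, `MAX_BODY` and `split_frame` are made up for illustration, and this is not how Cassandra or the drivers actually implement it:

```python
import struct

# 9-byte header, big-endian: version (1), flags (1), stream ID (2),
# operation code (1), body length (4).
HEADER = struct.Struct(">BBhBi")
MAX_BODY = 256 * 1024 * 1024  # 256MB limit on the body

def split_frame(frame: bytes):
    """Return (header_fields, body); raise ValueError if the frame is inconsistent."""
    if len(frame) < HEADER.size:
        raise ValueError("frame is shorter than the 9-byte header")
    version, flags, stream, opcode, length = HEADER.unpack_from(frame)
    if not 0 <= length <= MAX_BODY:
        raise ValueError(f"invalid body length in header: {length}")
    body = frame[HEADER.size:]
    if len(body) != length:
        raise ValueError(f"header declares {length} body bytes, got {len(body)}")
    return (version, flags, stream, opcode, length), body

# Example: a v4 frame with opcode 0x07 (QUERY) and a dummy 3-byte body.
body = b"abc"
frame = HEADER.pack(0x04, 0x00, 1, 0x07, len(body)) + body
fields, parsed_body = split_frame(frame)
```

Chopping even a single byte off the end of `frame` makes `split_frame` raise, which is the "picked up immediately" part.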

Message validation

Within the body itself for messages of operation code QUERY, the contents look like:

<query_string><query_parameters>

Going deeper, where the message is a request, the query parameters section carries the values for the bound variables (simplified):

  <consistency><flag><value_1><value_2>...<value_n>[<timestamp>]

Since the query string enumerates the bound variables, a value missing would invalidate the message.
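As a rough illustration of that idea, the Python sketch below compares the number of bind markers in the query string with the number of values supplied. The function name `check_bound_values` is hypothetical and the marker counting is deliberately naive (it assumes every `?` is a bind marker). It is not how Cassandra parses CQL:

```python
def check_bound_values(query_string: str, values: list) -> None:
    """Raise ValueError if the value count differs from the bind marker count."""
    # Naive assumption: every "?" is a bind marker (none inside string literals).
    expected = query_string.count("?")
    if len(values) != expected:
        raise ValueError(
            f"query has {expected} bind markers but {len(values)} values were supplied"
        )

query = "INSERT INTO users (userId, username) VALUES (?, ?)"
# Two markers, two values (a dummy 16-byte UUID and the username): passes silently.
check_bound_values(query, [b"\x00" * 16, b"Issac Newton"])
```

Dropping the username from the list would raise an error instead of silently writing a partial row.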

Serialisation format validation

Furthermore, the serialisation formats of the CQL data types define how the drivers encode the values. Values are represented as bytes and the format includes an integer prefix that denotes the length of the value. It's another layer of validation built into the protocol.
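Here is a small Python sketch of that length-prefix idea, assuming a 4-byte big-endian integer prefix followed by the value bytes (a negative length standing for null). The helper names `encode_value` and `decode_value` are invented for this example and are not driver APIs:

```python
import struct

def encode_value(raw: bytes) -> bytes:
    """Prefix the value with its length as a 4-byte big-endian integer."""
    return struct.pack(">i", len(raw)) + raw

def decode_value(buf: bytes, offset: int = 0):
    """Return (value, next_offset); raise ValueError if the value is truncated."""
    if len(buf) - offset < 4:
        raise ValueError("buffer too short for the length prefix")
    (n,) = struct.unpack_from(">i", buf, offset)
    start = offset + 4
    if n < 0:
        return None, start  # negative length: null value, no bytes follow
    end = start + n
    if end > len(buf):
        raise ValueError(f"prefix declares {n} bytes, only {len(buf) - start} present")
    return buf[start:end], end

# The username from the question, encoded as UTF-8 text.
encoded = encode_value("Issac Newton".encode("utf-8"))
value, _ = decode_value(encoded)
```

If the bytes were cut down to 'Issac N' in flight, the prefix would still say 12 bytes while only 7 are present, so `decode_value` raises rather than returning the shorter string.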

Replicas

On the write path, the mutation will get sent to multiple replicas (based on keyspace replication).

For a replication factor of 3, it is nearly impossible for the value to end up corrupted/truncated on all 3 nodes unless all 3 replicas had a hardware failure, because Cassandra has a shared-nothing architecture. Data on disk is immutable so the only way it would get corrupted is through a bad disk.

For more details, see the CQL Binary Protocol v4 Specification document. Cheers!


2 comments

Thanks very much @Erick Ramirez. This answers the question perfectly. Thanks for the effort in making the answer concise and readable while explaining complex under-the-hood details.


Nice pickup! I spent over an hour trying to find the right words, labouring over which details to include and which details to leave out.

I didn't want readers to be overwhelmed with the technical implementation of the protocol but I also didn't want to over-simplify it.

I found it really difficult finding the sweet spot so thanks for acknowledging it. Cheers!
