Bringing together the Apache Cassandra experts from the community and DataStax.


ari.erev_185363 asked:

Multi Datacenter Cassandra Setup for Active/Active application

I would like to create a Cassandra setup for a multi data-center system.

Each data center should operate "locally", with the following requirements:

1/ In case of a complete site failure, traffic will be directed to the other (healthy) site and the application nodes will access Cassandra nodes on that site (read/write).

2/ In case of a network partition (network disconnection), each site should continue to work using its own Cassandra nodes. Later - after network connectivity is restored, the expectation is for data in both sites to be sync-ed automatically.

The application is operated without any data-center affinity, i.e., requests for a specific key can arrive at any of the sites at virtually the same time (more precisely, a data update at site B can happen before an update made at site A has synced to site B). (Sometimes this is referred to as "Active/Active".)

The application state data is stored in a blob (per each key/record). There is no way to "merge" two "generations" of this data.

Our understanding:

a/ If we use consistency level EACH_QUORUM – it seems that neither requirement (1) nor (2) is fulfilled: if a site fails, or is disconnected, every query will fail because EACH_QUORUM cannot be satisfied.
Using EACH_QUORUM also carries a performance penalty, as each write operation (at least) has to cross a long geographical distance. This penalty cannot be mitigated for this application.

b/ If we use LOCAL_QUORUM – requirement (1) will be satisfied.
However, in case of a network disconnection between sites, and assuming the same keys are updated on both sites during the disconnection period, we believe there will be DATA LOSS when the network connection is restored and Cassandra attempts to synchronize the data.
Per our understanding, the 'state data' with the most recent timestamp will "win", and updates made to the blob data with the less recent timestamp will be lost (the "Last Write Wins" paradigm).

c/ Similar to (b), even when the network connection between sites is functioning, the same data-loss phenomenon may happen when two updates to the same key occur at the "same time" (i.e., before updates from the other site have been exchanged – using LOCAL_QUORUM).

Questions:

Is this understanding correct?

If so – is there a proposed way to set the cluster such that both requirements ( (1) and (2) ) are fulfilled? How?

Thanks very much,
Ari

Tags: consistency, last write wins, multi data-center, multi site

Cedrick Lunven answered:

Hi,

Thank you for your message.


Evaluating the requirements

Req 1 (complete site failure): Correct – this is what you can expect from a multi-DC Cassandra cluster. Failover is handled at the driver level; switching from one datacenter to another is transparent to the application.

Req 2 (network partition / network disconnection): This is likewise the standard behaviour of Cassandra clusters: with no master, if you lose the connection between the two DCs, the service remains available in each of them.

The application is operated without any data-center affinity: this is not 100% accurate. In the driver configuration you have to specify a `localDatacenterName` property, which should be set to the closest datacenter to reduce latency. If that datacenter is not available, failover happens as discussed above.
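The failover behaviour described above can be sketched in a few lines of plain Python. This is a simplified, stdlib-only illustration of the idea, not the real driver's load-balancing policy; the function and parameter names are invented for the sketch:

```python
def pick_datacenter(local_dc, dc_is_up):
    """Prefer the configured local DC; fail over to any healthy remote DC.

    local_dc  -- name of the preferred ("closest") datacenter
    dc_is_up  -- dict mapping datacenter name -> health flag
    """
    if dc_is_up.get(local_dc):
        return local_dc
    # Local DC is down or unknown: transparently switch to a healthy one.
    for dc, up in dc_is_up.items():
        if up:
            return dc
    raise RuntimeError("no datacenter available")
```

In a real deployment the driver does this routing for you; the point of the sketch is only that the application code never has to know which DC served the request.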

This rings a bell: Req 1 refers to Availability (A), and Req 2 refers to Partition tolerance (P). So, as stated by the CAP theorem, you won't be able to also ensure Consistency (C) while both Req 1 and Req 2 hold. But let's have a look at the possibilities.


Discussing your statements:

(a) This is absolutely correct. EACH_QUORUM favours consistency; as such, you lose availability – requests fail if a DC is not available.

(b and c) LOCAL_QUORUM indeed seems the way to go. There is indeed a risk of inconsistency if you read with LOCAL_QUORUM from a DC that does not have the latest data because replication between DCs has not occurred yet. I would not call that DATA LOSS, though – rather, reading a replica that does not yet have the latest values.


So what can we do?

- First, make sure the clocks are synchronized across nodes: as of Cassandra 3.x, it is the coordinator node that assigns the `writetime` used for "last write wins".

- What about using different consistency levels for read and write? Write with LOCAL_QUORUM and read with EACH_QUORUM. If the read fails (one DC is down), fall back at the application level with a second query using only LOCAL_QUORUM.

- What about deduplicating the records? You said you cannot merge two values, and you also spoke of data loss when a value is overridden. Since that is not what you want, you may want to keep history instead. To do that, add a timeuuid to the primary key of your table. When querying, you may then get multiple records: pick the latest at the application level while keeping track of both.
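The read-fallback idea above can be sketched at the application level like this. It is a hedged illustration: `session.execute` stands in for a real driver call, and `DCUnavailable` is a placeholder for whatever exception your driver raises when a consistency level cannot be met:

```python
# Illustrative names only -- not a real driver API.
EACH_QUORUM = "EACH_QUORUM"
LOCAL_QUORUM = "LOCAL_QUORUM"

class DCUnavailable(Exception):
    """Simulated 'consistency level cannot be met' failure."""

def read_with_fallback(session, query):
    """Try a fully consistent cross-DC read first; if a DC is down,
    retry locally and accept possibly stale data."""
    try:
        return session.execute(query, consistency=EACH_QUORUM)
    except DCUnavailable:
        # One DC is unreachable: degrade gracefully to a local read.
        return session.execute(query, consistency=LOCAL_QUORUM)
```

The trade-off is explicit in the code: when both DCs are reachable you get the strongest read, and during a partition you keep availability at the cost of possibly missing the other DC's latest writes.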


Cedrick Lunven answered:

More details from a follow-up discussion outside the tool:


Ari: The application is aware of the datacenter. What I meant is that updates/transactions arrive at the application with no "affinity". So, an update to key X may arrive at DC-1, and "in parallel" other updates to key X may arrive at DC-2. (So the probability of parallel updates to the same key is substantial, even when the DCs are connected.)

Understood. As a consequence, the very same row is inserted in the two DCs in the same table (potentially with different values).

If we want to avoid EACH_QUORUM (which hurts performance), or the DCs are not connected, then yes, we may have two different values across the DCs.


Ari: Also, as I understand it, this does not solve my problem... after a long disconnection, "Last Write Wins" will keep the last data from one DC, possibly discarding updates with older timestamps from the other DC (this is what I referred to as "data loss"...).
For example:

  • Last update is at t1
  • Network is disconnected
  • Site B gets an update (for key x) at t2 (t2>t1)
  • Site B gets another update (for key x) at t3 (t3>t2)
  • Site A gets an update (for key x) at t4 (t4>t3) (*totally unaware, and not based on the updates done at site B at t2 and t3)
  • Network is re-connected.
  • My understanding is that "eventually", after the DCs are synced, the data in effect on both sites will be the update done at t4 – which means the updates done at t2 and t3 are "lost".
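The reconciliation in this timeline can be simulated in plain Python (no Cassandra involved) to make the outcome concrete. Each update is modelled as a (write_timestamp, value) pair, and the merge rule is simply "latest timestamp wins":

```python
def lww_merge(*updates):
    """Last-write-wins: each update is a (write_timestamp, value) pair;
    the value with the latest timestamp survives."""
    return max(updates, key=lambda u: u[0])[1]

# State seen by each site when the partition heals:
site_a = [(1, "update@t1"), (4, "update@t4")]                   # t1, then t4
site_b = [(1, "update@t1"), (2, "update@t2"), (3, "update@t3")]  # t1, t2, t3

# Replication reconciles the cell: t4 wins on both sites, and the
# t2/t3 updates are silently discarded -- exactly the scenario above.
surviving = lww_merge(*(site_a + site_b))
```

Running this yields `"update@t4"` as the surviving value, confirming the understanding in the question.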


This is correct. You may lose the in-between updates at t2/t3 once the DCs are in sync again.


Ari: Regarding your suggestion to deduplicate the records and keep a "history" of the data: I don't see how this helps. Of course, it is possible to keep the history. However, as I indicated in my first query, the application needs the latest data, and it can't construct this latest data by "merging" two "generations" of its state data (at least not in the case of our applications, where the updates are not "additive").
Am I missing something here? Do you still think your suggestion is valid, even though we can't construct the "latest version" by looking at two (or more) "history" records?


Let's try:

DROP TABLE IF EXISTS sample_table;
CREATE TABLE IF NOT EXISTS sample_table (
  key  text,
  time timeuuid,
  val  text,
  PRIMARY KEY((key), time)
) WITH CLUSTERING ORDER BY (time DESC);

Sites are disconnected and you insert values:
INSERT INTO sample_table (key, time, val) VALUES ('X', now(), 'val1');  -- site A
INSERT INTO sample_table (key, time, val) VALUES ('X', now(), 'val2');  -- site B
INSERT INTO sample_table (key, time, val) VALUES ('X', now(), 'val3');  -- site B
INSERT INTO sample_table (key, time, val) VALUES ('X', now(), 'val4');  -- site A


Then, on site A:

key  time                                  val
X    29d4f461-548f-11ea-97e9-47a0572459ac  val4
X    29d40a01-548f-11ea-97e9-47a0572459ac  val1


Then, on site B:

key  time                                  val
X    29d4cd51-548f-11ea-97e9-47a0572459ac  val3
X    29d45821-548f-11ea-97e9-47a0572459ac  val2


And when the DCs sync, we get:

key  time                                  val
X    29d4f461-548f-11ea-97e9-47a0572459ac  val4
X    29d4cd51-548f-11ea-97e9-47a0572459ac  val3
X    29d45821-548f-11ea-97e9-47a0572459ac  val2
X    29d40a01-548f-11ea-97e9-47a0572459ac  val1


But now, to always get the (best-effort) latest value, simply do:

SELECT * FROM sample_table WHERE key='X' LIMIT 1;


And you do not lose the data.
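The reason `LIMIT 1` returns the newest row is that a timeuuid embeds its creation timestamp, and the clustering order sorts on that embedded time. A small stdlib-only sketch of the same "pick the latest at the application level" step, using Python's `uuid.uuid1()` as a stand-in for Cassandra's server-side `now()`:

```python
import time
import uuid

# Two inserts, generated a moment apart (as now() would do server-side).
u1 = uuid.uuid1()
time.sleep(0.001)  # ensure the second timeuuid is strictly later
u2 = uuid.uuid1()

# uuid1().time exposes the embedded 60-bit timestamp, which is what
# timeuuid clustering order is based on: the later insert sorts last.
latest = max([u1, u2], key=lambda u: u.time)
```

Here `latest` is always the second UUID, mirroring how the `CLUSTERING ORDER BY (time DESC)` table puts the most recent history row first.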
