DhavalBhatt asked Erick Ramirez edited

Is it a good idea to point 2 Spark contexts to the same Cassandra instance?

We have a DSE environment in production that is divided into two datacenters: one handles Solr search and the other handles Spark analytics. Both Solr and Spark run fine, but we occasionally hit issues caused by the amount of data we store.

To reduce the storage load, we plan to move some analytics data to an entirely separate DSE environment. To do that, I plan to run a Spark service in the new DSE environment that fetches existing data from current production, performs some operations on it, and stores the result in the new DSE environment.

So the question is: is it a good idea to point two Spark contexts at one Cassandra table? I am aware that it would be a completely separate Spark context, so it should be fine. Still, this affects our production environment, so I would like an opinion, considering that we have more than 5 TB of data and I don't want to take a risk.

EDIT: One thing I forgot to mention:

I will install standalone Spark from the repository and configure it separately. I will not use the Spark service that is packaged with DSE.


smadhavan answered DhavalBhatt commented

@ddbRocks_150730, this is quite a broad question, but I'll give it my best shot. Please also feel free to add the DSE version and cluster topology to the original question for additional clarity.

The workload isolation on this cluster (DC1 for the Search workload and DC2 for the Analytics workload) is great.

I am assuming that your applications connect to DC1 for their OLTP traffic and to DC2 for the Analytics workload. I am also assuming that 5 TB is the total data size of the cluster, not the per-node size.

If the sole problem is reducing the storage load (i.e. disk usage), there are a couple of ways to address it (unless I have misunderstood the root cause or problem statement):

  • Increase the disk size per node, up to the storage size supported by your version of DSE (or)
  • Scale the cluster out horizontally by adding nodes, which distributes the data and reduces the per-node storage size (or)
  • Separate the types of data stored in this cluster and move the analytics data into a separate cluster (as you mentioned above)

Having said that, pointing a Spark context at the same table from your BYOS configuration shouldn't be a problem as long as the current cluster can handle that load on top of the existing regular traffic. Based on the assumption above, if your transactional workload runs in DC1, you could point the new Spark context at DC2 to avoid any impact on your existing production transactional load.
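
As a rough illustration of that setup, here is a minimal PySpark sketch that reads from the Analytics DC of the existing cluster and appends into the new cluster. It assumes the open-source Spark Cassandra Connector is on the classpath and that connection settings can be passed as per-DataFrame options. The host names, keyspaces, tables, and the DC name `DC2` are hypothetical placeholders, and the property names follow recent connector releases (older releases spell some of them differently, e.g. `local_dc`), so check them against the connector version you actually deploy.

```python
from typing import Dict, Optional

CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def cassandra_options(host: str, keyspace: str, table: str,
                      local_dc: Optional[str] = None) -> Dict[str, str]:
    """Build per-DataFrame connector options for one Cassandra cluster."""
    opts = {
        "spark.cassandra.connection.host": host,
        "keyspace": keyspace,
        "table": table,
    }
    if local_dc is not None:
        # Pin the driver to one datacenter so reads stay off the other DC.
        opts["spark.cassandra.connection.localDC"] = local_dc
    return opts


# Hypothetical names throughout; replace with your own hosts and schema.
SOURCE = cassandra_options("prod-dc2-node1", "activity_ks", "client_activity",
                           local_dc="DC2")
# A DC-local consistency level keeps reads inside the pinned datacenter.
SOURCE["spark.cassandra.input.consistency.level"] = "LOCAL_ONE"
TARGET = cassandra_options("new-cluster-node1", "analytics_ks",
                           "client_activity_summary")


def copy_table(spark, source=SOURCE, target=TARGET):
    """Read from the production Analytics DC and append into the new cluster."""
    df = spark.read.format(CASSANDRA_FORMAT).options(**source).load()
    # Any transformations on df would go here.
    df.write.format(CASSANDRA_FORMAT).options(**target).mode("append").save()
```

With a `SparkSession` named `spark`, the job would be started with `copy_table(spark)`. Keeping the source and target options in separate dictionaries makes it explicit which cluster each read or write touches.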

Thorough testing in a production-like environment is recommended before doing this on the actual production cluster.


DhavalBhatt commented ·

Thank you, @smadhavan.

Yes, the 5 TB of data is across all nodes. Coming to your suggestions:

  • Increasing disk size per node: we have sufficient space on each node, so we are fine in terms of disk space. The main issue right now is one table whose data model can be optimized, which we plan to do in the near future. That table uses DateTieredCompactionStrategy with a gc_grace_seconds of 10 days, and an analytics job deletes rows from it periodically. That single table occupies 200 GB out of 690 GB on one node.
  • Adding a new node: we have already raised this with our IT team, and they will update us soon.

DhavalBhatt commented ·

The table causing the issue captures clients' activity data, so it naturally has a heavy write load. We then run analytics on those rows. Once data is stored in the database, we never update those rows; they are marked as deleted once processed. So the current structure creates tombstones that remain in the system for 10 days, which affects our read operations. What I suggested to my management is a separate datacenter that holds this type of data and runs the analytics tasks.
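
To make the 10-day figure concrete, here is a small sketch of the arithmetic, assuming the table's grace period is exactly 10 days (864000 seconds). A tombstone only becomes eligible for removal once the grace period has elapsed, and it is actually dropped by a later compaction, so the computed time is a lower bound. The delete timestamp is a made-up example.

```python
from datetime import datetime, timedelta, timezone

GC_GRACE = timedelta(days=10)  # assumed grace period for the table


def earliest_purge_time(deleted_at: datetime,
                        gc_grace: timedelta = GC_GRACE) -> datetime:
    """Earliest moment a tombstone written at deleted_at may be compacted away."""
    return deleted_at + gc_grace


deleted = datetime(2020, 6, 1, tzinfo=timezone.utc)  # hypothetical delete time
print(int(GC_GRACE.total_seconds()))             # 864000
print(earliest_purge_time(deleted).isoformat())  # 2020-06-11T00:00:00+00:00
```

Every row deleted by the periodic job therefore stays on disk, and is scanned by reads, for at least that long.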

Erick Ramirez answered

Another Spark connection is just another client to a Cassandra cluster, so that in itself isn't a problem: you can have multiple clients connected to your cluster at any point in time.

The real concern you should have is whether your cluster can support the load. Analytics queries are heavier than ordinary OLTP transactions. They can have a significant impact on the performance of your normal application [OLTP] traffic, so we recommend that you isolate the Analytics traffic in a separate DC so your application users are not affected.
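
Besides DC isolation, one way to keep a new Spark job from overwhelming the cluster is to cap how fast it reads. The sketch below assumes the open-source Spark Cassandra Connector and uses property names from its recent reference documentation; older releases name these settings differently or may not have them, and the numbers are illustrative rather than recommendations.

```python
from typing import Dict


def throttled_read_conf(read_mb_per_sec: float,
                        fetch_rows: int = 500) -> Dict[str, str]:
    """Connector read settings that trade job speed for less pressure on Cassandra."""
    if read_mb_per_sec <= 0:
        raise ValueError("read_mb_per_sec must be positive")
    return {
        # Cap on read throughput per core, in MB/s.
        "spark.cassandra.input.throughputMBPerSec": str(read_mb_per_sec),
        # Rows fetched per round trip; smaller pages mean smaller requests.
        "spark.cassandra.input.fetch.sizeInRows": str(fetch_rows),
    }


def apply_conf(builder, conf: Dict[str, str]):
    """Apply settings to a SparkSession builder (anything with .config(key, value))."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


conf = throttled_read_conf(read_mb_per_sec=5)
# Usage with PySpark installed (the app name is a placeholder):
#   from pyspark.sql import SparkSession
#   spark = apply_conf(SparkSession.builder.appName("analytics-copy"), conf).getOrCreate()
```

Starting with a low cap and raising it while watching latency on the production application is a safer way to find out how much extra load the cluster can take.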

As a side note, you mentioned that you delete rows after they've been processed and you run into tombstone issues. It sounds like you're using that table as a queue and there are better ways of modelling your table to avoid the tombstone issue. If you post a new question with the problem you're facing, we'd be happy to provide some ideas. Cheers!
