Bringing together the Apache Cassandra experts from the community and DataStax.

wassim_yaich asked Erick Ramirez answered

Stream data from Cassandra and read it in pyspark

Hi Folks,

I am working on a data science project and need to read data from Cassandra in Spark in real time. Does anyone have a solution for this issue?

Tags: change data capture, pyspark

steve.lacerda answered steve.lacerda commented

I'm not sure if you're using OSS Cassandra or DSE, but in DSE you can use Spark Streaming:

https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkStreamingIntro.html

All streaming relies upon some underlying messaging system, like Kafka or Pulsar. So, you'll need to implement that first in order to receive messages which then get processed and sent to Cassandra.

Here's a doc providing an example of a sink between Kafka and Cassandra:

https://hevodata.com/learn/kafka-and-cassandra/
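As a rough illustration of the Kafka-to-Cassandra pattern described above, here is a minimal Structured Streaming sketch. The topic name, keyspace, table, hosts, and schema are all placeholders, and it assumes the spark-cassandra-connector and the Spark Kafka integration are on the classpath:

```python
# Sketch: Kafka as the streaming source, Cassandra as the sink.
# All names (topic, keyspace, table, hosts) are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = (
    SparkSession.builder
    .appName("kafka-to-cassandra")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# Assumed schema of the JSON messages on the Kafka topic
schema = StructType([
    StructField("id", StringType()),
    StructField("value", IntegerType()),
])

# Read the stream from Kafka and parse the message payload
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")  # placeholder topic name
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Write each micro-batch to Cassandra via the spark-cassandra-connector
def write_to_cassandra(batch_df, batch_id):
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .mode("append")
     .options(keyspace="my_keyspace", table="events")  # placeholders
     .save())

query = events.writeStream.foreachBatch(write_to_cassandra).start()
query.awaitTermination()
```

The `foreachBatch` approach lets the connector's batch writer handle each micro-batch, which is the usual way to sink a structured stream into Cassandra.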



Hi Steve,

I am using open-source Cassandra. I can read data from Cassandra in static (batch) mode, but my objective is to read data from Cassandra in real time (readStream). For the connection I use com.datastax.spark:spark-cassandra-connector_2.12:3.1.0.
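For reference, the static (batch) read that works with that connector version looks roughly like this (keyspace and table names are placeholders):

```python
from pyspark.sql import SparkSession

# Launched with, e.g.:
#   pyspark --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0
spark = (
    SparkSession.builder
    .appName("cassandra-batch-read")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# Batch (static) read from Cassandra -- this works
df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")  # placeholders
    .load()
)
df.show()
```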

These screenshots may help illustrate my issue:

[Attached: Screenshot from 2022-04-25 11-55-59.png, Screenshot from 2022-04-25 11-52-16.png]



OSS Cassandra does not support streaming reads as a source. You can use it as a sink as I suggested previously: with Kafka or another messaging system as the source, you can stream data into Cassandra:

https://stackoverflow.com/questions/64302327/error-data-source-org-apache-spark-sql-cassandra-does-not-support-streamed-rea
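For completeness, this is the kind of call that triggers the error discussed in that Stack Overflow question (a sketch; keyspace and table names are placeholders):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-stream-attempt")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# Attempting a streaming read from the Cassandra data source fails with:
#   UnsupportedOperationException: Data source
#   org.apache.spark.sql.cassandra does not support streamed reading
stream_df = (
    spark.readStream
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")  # placeholders
    .load()
)
```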

Erick Ramirez answered

The Spark Streaming support in the spark-cassandra-connector provides a mechanism for consuming data from sources like Akka and Kafka and storing it in Cassandra. It only works where Cassandra is the destination (sink), not the source.

DataStax has a change data capture product called CDC for Cassandra which works with open-source Apache Cassandra, DataStax Enterprise and Astra DB.

With DataStax CDC, agents installed on the same nodes as Cassandra capture changes (mutations) from the commitlog, deduplicate them, then stream the data to Apache Pulsar. Your Spark app can then subscribe to the relevant Pulsar topic and process the stream.
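Subscribing to that Pulsar topic from Spark might look roughly like this. This is a sketch only: it assumes the StreamNative pulsar-spark connector is on the classpath, the service URL and topic name are placeholders, and option names can differ between connector versions:

```python
from pyspark.sql import SparkSession

# Assumes DataStax CDC is already publishing mutations to the topic
# below; the service URL and topic name are placeholders.
spark = SparkSession.builder.appName("cdc-consumer").getOrCreate()

cdc_stream = (
    spark.readStream
    .format("pulsar")
    .option("service.url", "pulsar://localhost:6650")
    .option("topics", "persistent://public/default/data-my_keyspace.my_table")
    .load()
)

# Print the change events to the console for demonstration
query = (
    cdc_stream.writeStream
    .format("console")
    .start()
)
query.awaitTermination()
```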

For more info, see the blog post Shatter Data Silos with DataStax Change Data Capture for Apache Cassandra. Cheers!
