Hi Folks,
I am working on a data science project and need to read data from Cassandra into Spark in real time. Does anyone have a solution for this?
I'm not sure if you're using OSS or DSE, but in DSE you can use Spark Streaming:
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkStreamingIntro.html
All streaming relies upon some underlying messaging system, like Kafka or Pulsar. So, you'll need to implement that first in order to receive messages which then get processed and sent to Cassandra.
Here's a doc providing an example of a sink between Kafka and Cassandra:
https://hevodata.com/learn/kafka-and-cassandra/
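To make the Kafka-to-Cassandra sink concrete, here is a minimal PySpark sketch. It assumes the spark-cassandra-connector is on the classpath (e.g. via `--packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0`); the topic name, keyspace, table, and hosts are hypothetical placeholders, not from the original thread. It uses `foreachBatch`, which writes each micro-batch with the connector's ordinary batch writer.

```python
def make_cassandra_sink(keyspace: str, table: str):
    """Return a foreachBatch function that appends each micro-batch
    to the given Cassandra table via the spark-cassandra-connector."""
    def write_batch(batch_df, batch_id):
        (batch_df.write
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace=keyspace, table=table)
            .mode("append")
            .save())
    return write_batch

def main():
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = (SparkSession.builder
             .appName("kafka-to-cassandra")
             # hypothetical host; set to your Cassandra contact point
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    # Read a stream of events from Kafka (topic name is a placeholder).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "events")
              .load()
              .select(col("key").cast("string"),
                      col("value").cast("string")))

    # Sink every micro-batch into Cassandra.
    query = (events.writeStream
             .foreachBatch(make_cassandra_sink("my_ks", "my_table"))
             .option("checkpointLocation", "/tmp/chk/kafka-to-cassandra")
             .start())
    query.awaitTermination()

# Run via spark-submit with the connector package, e.g.:
# spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 this_script.py
```

Note that Cassandra is the *destination* here; the streaming source is Kafka, which matches the sink-only role described above.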
Hi Steve,
I am using open-source Cassandra. I can read data from Cassandra in static (batch) mode, but my objective is to read data from Cassandra in real time (readStream). For the connection I use com.datastax.spark:spark-cassandra-connector_2.12:3.1.0.
These screenshots may help you understand my issue: [two screenshots of the Spark code, dated 2022-04-25]
OSS Cassandra does not support streaming as a source. You can use it as a sink, as I suggested previously: with Kafka or another messaging system in front, you can sink the data into Cassandra.
The Spark Streaming support in the spark-cassandra-connector provides a mechanism for consuming data from sources like Akka and Kafka and storing it in Cassandra. It only works where Cassandra is the destination (sink), not the source.
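Since Cassandra cannot act as a `readStream` source, one common workaround is to poll it with repeated *batch* reads, filtering on an application-maintained timestamp column. This is a hedged sketch under assumptions: the keyspace, table, and `event_ts` column are hypothetical, and the filter is only efficient if `event_ts` is a clustering column (otherwise each poll is a full scan).

```python
import time

def window_predicate(last_ts: str, upper_ts: str) -> str:
    """Build the half-open filter for one polling window:
    rows newer than last_ts, up to and including upper_ts."""
    return f"event_ts > '{last_ts}' AND event_ts <= '{upper_ts}'"

def poll_loop(spark, interval_s: int = 10):
    """Emulate a stream by re-running a batch read every interval_s
    seconds, advancing the lower bound each iteration."""
    last_ts = "1970-01-01 00:00:00"
    while True:
        upper_ts = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
        df = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(keyspace="my_ks", table="events")  # hypothetical
              .load()
              .filter(window_predicate(last_ts, upper_ts)))
        df.show()  # replace with your real processing
        last_ts = upper_ts
        time.sleep(interval_s)
```

This gives at-best micro-batch latency, not true streaming; for genuine change capture, a CDC pipeline (discussed below in this thread) is the more robust route.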
DataStax has a change data capture product called CDC for Cassandra, which works with open-source Apache Cassandra, DataStax Enterprise and Astra DB.
With DataStax CDC, agents installed on the same nodes as Cassandra capture changes (mutations) from the commitlog, deduplicate them, then stream the data to Apache Pulsar. Your Spark app can then subscribe to the relevant Pulsar topic and process the stream.
For more info, see the blog post Shatter Data Silos with DataStax Change Data Capture for Apache Cassandra. Cheers!
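The "subscribe from Spark" step above might look roughly like the following sketch. This assumes the StreamNative pulsar-spark connector is on the classpath; the option names follow that connector's documentation, and the service URLs and topic name are hypothetical placeholders (the actual CDC topic name depends on your DataStax CDC configuration).

```python
def main():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cdc-consumer").getOrCreate()

    # Read the CDC event stream from Pulsar (placeholder URLs/topic).
    changes = (spark.readStream
               .format("pulsar")
               .option("service.url", "pulsar://localhost:6650")
               .option("admin.url", "http://localhost:8080")
               .option("topic", "persistent://public/default/events-cdc")
               .load())

    # For illustration, just print incoming change events to the console.
    query = (changes.writeStream
             .format("console")
             .option("checkpointLocation", "/tmp/chk/cdc")
             .start())
    query.awaitTermination()

# Run via spark-submit with the pulsar-spark connector package.
```

CDC events arrive with a Pulsar schema describing the mutation, so in practice you would deserialize and transform `changes` before writing anywhere.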