Hi,
I found two ways to fetch data from Cassandra:
1. spark.read... - with this one, even if I apply select(column, column), the column selection apparently happens only on the client side.
2. sc.cassandraTable(...) - this one pushes the selection down to the Cassandra side, but does not provide a good way to convert the resulting RDD to a DataFrame.
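For reference, the two read paths I mean look roughly like this (keyspace and table names are placeholders, and the client-side-only pruning in the first variant is my observation, not something I found documented):

```scala
import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cassandra-read").getOrCreate()

// Variant 1: DataFrame API. select() here seems to prune columns
// only after the rows have reached the client.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()
  .select("resolution", "ip_country")

// Variant 2: RDD API. select() is applied on the Cassandra side,
// but the result is an RDD[CassandraRow], not a DataFrame.
val rdd = spark.sparkContext
  .cassandraTable("my_keyspace", "my_table")
  .select("resolution", "ip_country")
```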
So I end up doing something fairly convoluted to get a DataFrame:
val data = spark.sparkContext.cassandraTable(keyspace, table)
  .select("event_log_multiplier", "resolution", "ip_country", "user_hash", "man_vs_machine_collection")
  .filter(row => row.getInt("event_log_multiplier") <= toPk && row.getInt("event_log_multiplier") >= fromPk)

val sqlContext = spark.sqlContext
import sqlContext.implicits._

val selectedData = data
  .keyBy(row => (
    row.getStringOption("resolution"),
    row.getStringOption("ip_country"),
    row.getStringOption("user_hash"),
    row.getStringOption("man_vs_machine_collection")))
  .map(x => x._1)
  .toDF("resolution", "ip_country", "user_hash", "man_vs_machine_collection")
  .na.fill("-1000", colNames)
That means I need to list every column I want in the DataFrame explicitly, like row.getStringOption("resolution"), and with 100+ columns the code becomes a nightmare.
Is there a simple way to convert an RDD[CassandraRow] to a DataFrame?
Thanks!