t.ariunbat_189192 asked Erick Ramirez answered

How do I load a table in PySpark?

I have access to Cassandra through a Zeppelin user. There are some tables in the "default" keyspace. How can I load them into a DataFrame in PySpark? I have no problem querying them from spark.sql using keyspace.tableName, but when I load them through PySpark, it cannot find a keyspace named default.

Here is the code I use:

df = spark.read.format("org.apache.spark.sql.cassandra").options(keyspace="default", table = "sometable").load()

Here is the error:

Py4JJavaError: An error occurred while calling o439.load.
: java.io.IOException: Couldn't find table sometable or keyspace default - Found similar keyspaces and table
dse_perf.key_cache
at com.datastax.spark.connector.cql.Schema$.tableFromCassandra(Schema.scala:358)

Erick Ramirez answered

@t.ariunbat_189192 When you run SHOW TABLES in Spark SQL, the default you see is not one of the keyspaces in the Cassandra database. It is the default Hive database called default.

The SHOW TABLES command lists the tables in the default Hive database:

spark-sql> SHOW TABLES;
default false 

If you don't specify the Cassandra keyspace, it will "default" to listing the default Hive database. It is the equivalent of running:

spark-sql> SHOW TABLES FROM default;
default         false

In my test cluster, I have a keyspace community which has a table called users. Here's how it looks from Spark SQL:

spark-sql> SHOW TABLES FROM community;
community       users false

Your PySpark syntax for loading a keyspace and table is correct, but you are getting the exception because neither the keyspace default nor the table sometable exists in Cassandra.

Please check the Cassandra schema for details of the keyspace and table(s) so you can access them correctly in PySpark.
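If cqlsh is not available, one way to do this check from PySpark itself is to read Cassandra's own schema tables through the connector. The following is a minimal sketch, not a definitive recipe: it assumes Cassandra 3.0 or later (where the system_schema.tables table exists) and that your user has permission to read it. The helper names load_cassandra_table and list_cassandra_tables are hypothetical and only for illustration.

```python
# Data source name of the Spark Cassandra Connector (same as in the question).
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def load_cassandra_table(spark, keyspace, table):
    """Load one Cassandra table as a DataFrame through the connector."""
    return (
        spark.read.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .load()
    )


def list_cassandra_tables(spark):
    """Return a DataFrame of (keyspace_name, table_name) pairs.

    Assumption: Cassandra 3.0+ and read access to system_schema.tables.
    """
    schema_df = load_cassandra_table(spark, "system_schema", "tables")
    return schema_df.select("keyspace_name", "table_name")


# Usage in a Zeppelin %pyspark paragraph, where `spark` is already defined:
# list_cassandra_tables(spark).show(truncate=False)
# df = load_cassandra_table(spark, "community", "users")
```

Any keyspace and table that appear in that listing can then be passed to the same read call you are already using.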

For details on loading a DataFrame in PySpark, see PySpark with Data Frames. Cheers!


smadhavan answered t.ariunbat_189192 commented

@t.ariunbat_189192, the error indicates that there is no keyspace called default. Also, default appears to be a reserved keyword, so a keyspace cannot be created with that name. Could you show the DESCRIBE output for that keyspace from cqlsh?


t.ariunbat_189192 commented

I have no access to cqlsh. I was only given access to a Zeppelin user. I think it is a keyspace that is specified in the connector for each session. The "default" keyspace is shown when I run show databases in spark.sql.
