mishra.anurag643_153409 asked · Erick Ramirez edited

Why is Spark ignoring column with null value when reading data from Cassandra?

I am reading Cassandra table data from a Spark job, and I am observing that when a column's value is null, that column is omitted in Spark.

Given below is an example:

Cassandra data:

{last_a_date: '2020-11-01 23:26:24.372000+0000', id: null, username: 'anurag'}

In Spark:

{last_a_date: '2020-11-01 23:26:24.372000+0000', username: 'anurag'}

Here the id column is missing.

I am assuming this is happening because the cell value was deleted and Cassandra has marked it with a tombstone. I have the following questions:

1. If a table is created with 4 columns and only 3 of them are populated, with the fourth left null, would Spark read that column and populate the DataFrame with it, rather than ignoring it because its value is null?

2. What can I do to ensure that when a column's value is deleted, Spark still reads the column instead of ignoring it?


1 Answer

Erick Ramirez answered · Erick Ramirez edited

Rows in Cassandra are sparse. What this means is that only columns which have a value are stored in rows -- Cassandra does not store columns on disk if they are not set.

In relational databases, by contrast, storage is allocated for every column in a row regardless of whether there is a value in the field. This makes all rows (records) in the table a fixed size, even when some fields are empty.

To illustrate what I mean, consider a table which stores users' contact details: name, mobile number, address. Alice's mobile number is 01-789-456-123 and her address is 101 Main Street. Bob's address is 98 7th Avenue but he doesn't have a mobile. On disk, the records are stored as:

Name  | Mobile Number  | Address
------|----------------|----------------
Alice | 01-789-456-123 | 101 Main Street
Bob   | <empty>        | 98 7th Avenue

Notice how the widths of the fields are fixed. This wastes disk space, since storage is occupied even when a field is empty.

In Cassandra, this is how the rows are laid out on disk:

| name: Alice | mobile_number: 01-789-456-123 | address: 101 Main Street |
| name: Bob | address: 98 7th Avenue |

There is no mobile_number column for Bob so it doesn't use up disk space. Notice too that the widths of the columns are variable -- the column size is not a fixed data type size.
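As a rough illustration (plain Python, not connector code), sparse rows can be modeled as dicts that simply have no entry for unset columns, as opposed to an entry whose value is None:

```python
# Sparse rows modeled as dicts: a column that was never set (or was
# deleted) has no entry at all, rather than an entry with value None.
alice = {
    "name": "Alice",
    "mobile_number": "01-789-456-123",
    "address": "101 Main Street",
}
bob = {"name": "Bob", "address": "98 7th Avenue"}  # no mobile_number cell stored

# A column that exists can be read; a column that was never written
# simply isn't there -- there is no cell to read.
print("mobile_number" in alice)  # True
print("mobile_number" in bob)    # False
```

This mirrors the on-disk layout above: Bob's row contains no mobile_number cell, so nothing can be returned for it.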

When you query with the Spark connector or one of the drivers, they will only return the columns which have a value because that's all they can read off the disk. This is the reason you don't see an empty column in the results -- it doesn't exist.
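If you need every column to appear in the result regardless, one option is to normalize each row against the table schema on the application side. The sketch below (plain Python; the normalize helper and SCHEMA list are hypothetical, not connector API) pads sparse rows out with None:

```python
# Hypothetical helper: pad each sparse row out to the full table schema
# so downstream code always sees every column, with None for unset cells.
SCHEMA = ["last_a_date", "id", "username"]

def normalize(row, schema=SCHEMA):
    # dict.get returns None for columns the row doesn't contain
    return {col: row.get(col) for col in schema}

sparse = {"last_a_date": "2020-11-01 23:26:24.372000+0000", "username": "anurag"}
full = normalize(sparse)
print(full["id"])  # None -- the missing column now appears explicitly
```

Note also that when rows are serialized to JSON, null fields are typically dropped, which produces output like the example in the question even when the DataFrame itself carries the full schema.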

It's important to note that if you're querying with cqlsh, it formats the output so it looks like the column has a null value when in fact it doesn't exist. That's just how cqlsh represents the rows -- the columns don't actually exist on disk. Cheers!
