question

lavaraja.padala_150810 avatar image
lavaraja.padala_150810 asked Erick Ramirez answered

How do we unset the values in columns in a table with huge volume of records?

We have a table as below.

CREATE TABLE keyspace.table (
    empid text,
    year int,
    inserted_date date,
    joining_date timestamp,
    grade text,
    PRIMARY KEY ((empid, year), inserted_date, joining_date, grade)
) WITH CLUSTERING ORDER BY (inserted_date DESC, joining_date DESC, flag grade) ;

There has been some wrong data entry for few columns and now we want to update those columns values.

Instead of <<null>> column value they have inserted as 'null' string value. This table has billions of rows and We want to unset/insert null value for these columns for entire table.

What is the best way to do this ?.It is difficult to find the primary key for all rows.

Current data:

col1 col2

null null

we want it to be corrected as below or unset these columns as they are null.

col1 col2

<<null> <<null>>

cql
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

steve.lacerda avatar image
steve.lacerda answered

Hello, are you trying to unset the "null" string value? My first question is why? The <<null>> object actually creates tombstones, so that's not something that I would recommend. The "null" string does not create tombstones, so I would actually be going the other way and setting everything to "null".

Anyhow, to find those null values without the primary key, I would probably use spark to do a full table scan and then remove those null values. You could also unload the table using dsbulk, modify the data, then reload it but that will take significantly longer.

Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Erick Ramirez avatar image
Erick Ramirez answered

Setting a column's value to null is not recommended because it generates tombstones (as Steve explained).

Cassandra uses sparse storage meaning if a column doesn't have a value, that column is not stored on disk unlike traditional RDBMS which allocate a fixed size for each field/cell in the table regardless of whether the cell has data in it.

The best option is to write a Spark app that would iterate over the partitions and delete the columns you don't require. Cheers!

Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.