vahan asked

Why does the DSBulk count return different results?

I run "dsbulk count -k <keyspace name> -t <table name> -h <host name> --log.verbosity 0".

Each time it returns a different result, which can be higher or lower than the previous one.

Do you know what the cause is?

nodetool repair has been run before.

$ nodetool repair <keyspace name> <table name>

cqlsh> DESCRIBE TABLE <keyspace name>.<table name>

CREATE TABLE <keyspace name>.<table name> (
id uuid PRIMARY KEY,
....
....

....
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';

~$ dsbulk --version

DataStax Bulk Loader v1.8.0


alexandre.dutra commented:

Have you tried with consistency ALL? E.g. dsbulk count -k ks -t t1 -h h1 -verbosity 0 -cl ALL


1 Answer

Erick Ramirez answered

There are two causes which immediately come to mind: (1) replicas are inconsistent, or (2) data is being updated.

Repair

You already mentioned that you ran a repair on the table in question. But running the repair on just one node only repairs the token range(s) which exist on that node -- it does not repair all ranges in the entire cluster.

For the repair to be effective and efficient, you need to run a primary-range repair on every node in the cluster:

$ nodetool repair -pr ks_name table_name
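If you want to script that, here is a minimal sketch rather than a definitive procedure. It uses nodetool's -pr (primary range) option so that each node repairs only the ranges it owns as primary, which is why it has to visit every node. The function name repair_all_nodes, the REMOTE_CMD variable and the host names are hypothetical, and it assumes passwordless SSH with nodetool on each node's PATH.

```shell
#!/bin/sh
# Sketch: run a primary-range repair on each node, one node at a time.
# Assumptions (not from the original post): host names are passed as
# arguments, passwordless SSH works, and nodetool is on each node's PATH.
# REMOTE_CMD defaults to ssh; set REMOTE_CMD=echo for a dry run.

repair_all_nodes() {
  local keyspace="$1" table="$2"
  shift 2
  local remote="${REMOTE_CMD:-ssh}"
  local host
  for host in "$@"; do
    echo "Repairing ${keyspace}.${table} on ${host}" >&2
    # Stop at the first failure so a failed node is not silently skipped.
    "$remote" "$host" nodetool repair -pr "$keyspace" "$table" || {
      echo "Repair failed on ${host}" >&2
      return 1
    }
  done
}

# Example (hypothetical host names):
#   repair_all_nodes ks_name table_name node1 node2 node3
```

Running the nodes one after another keeps the repair load on the cluster predictable, at the cost of a longer total run time.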

Mutating data

The results of the count will be different across multiple DSBulk runs unless the data is static.
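A quick way to check whether the count is stable is to repeat it a few times and compare the results. The sketch below is only an illustration: the function name counts_are_stable is hypothetical, and it assumes the command you pass prints nothing but the count on standard output (which is what I would expect from dsbulk count with --log.verbosity 0, but verify that on your version).

```shell
#!/bin/sh
# Sketch: run a count command several times and report whether the results agree.
# The command is passed as arguments, so nothing here is specific to DSBulk.
# Exit status: 0 = all runs agreed, 1 = at least one run differed,
#              2 = the count command itself failed.

counts_are_stable() {
  local runs="$1"
  shift
  local first="" current="" i=1
  while [ "$i" -le "$runs" ]; do
    current="$("$@")" || return 2
    echo "run ${i}: ${current}" >&2
    if [ "$i" -eq 1 ]; then
      first="$current"
    elif [ "$current" != "$first" ]; then
      return 1
    fi
    i=$((i + 1))
  done
  return 0
}

# Example (ks, t1 and h1 are placeholders):
#   counts_are_stable 3 dsbulk count -k ks -t t1 -h h1 -cl ALL --log.verbosity 0 \
#     && echo "stable" || echo "not stable"
```

If the counts still differ at consistency ALL after a full repair, the most likely explanation is that the table is being written to while you count.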

If you count while the data is mutating, two count runs will not necessarily return the same result. I've explained this in another post, "Why count is bad in Cassandra". Cheers!
