Bringing together the Apache Cassandra experts from the community and DataStax.


Naraen asked:

Why do the DSBulk logs show less time taken for the operation to complete than the total time it took for the command to run?

We used DSBulk to run an unload operation, and the command took noticeably longer to run than the duration reported in its logs.

Here are the details:

[aaa@bbb bin]$ date; ./dsbulk unload -k *** -t *** -h *** -u *** -p *** > /aaa/bbb/xxx.csv;date;
Wed Mar 10 06:08:40 CST 2021
Username and password provided but auth provider not specified, inferring PlainTextAuthProvider
Operation directory: /aaa/bbb/dsbulk-1.7.0/bin/logs/UNLOAD_20210310-120841-278359
total | failed | rows/s | p50ms | p99ms | p999ms
38,260 | 0 | 14,304 | 8.20 | 78.64 | 78.64
Operation UNLOAD_20210310-120841-278359 completed successfully in 2 seconds.
Wed Mar 10 06:08:47 CST 2021

The command actually took 7 seconds to run (06:08:40 to 06:08:47), but the log reports only 2 seconds. Is the difference the time taken to establish the session, or something else?

If session setup is the reason and we have 70 tables, each taking 7 seconds, will the total come to 490 seconds (70 × 7)?

Also, can we use multithreading or parallel processing here to improve performance? We need to export multiple tables within a given time using a single Python script.

Please let us know your suggestions and recommendations. Thanks.


1 Answer

alexandre.dutra answered:

The elapsed time printed at the end of the operation covers only the actual work: the clock starts after the session and the thread pools are initialized, and stops before they are shut down. So it is normal to see a difference between that figure and the total time the command takes to run.
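To make the gap concrete, here is a minimal Python sketch that compares the wall-clock time from the two `date` readings in the question with the duration DSBulk printed. The helper names (`reported_seconds`, `wall_clock_seconds`) are made up for illustration, and the regular expression only handles the "N seconds" wording seen in the log above; longer runs may be worded differently.

```python
import re
from datetime import datetime

# Matches only the "N seconds" wording seen in the question's log.
# Other duration formats are not handled by this sketch.
_REPORTED = re.compile(r"completed successfully in (\d+) seconds?")


def reported_seconds(log_text):
    """Return the duration DSBulk printed, or None if the line is absent."""
    match = _REPORTED.search(log_text)
    return int(match.group(1)) if match else None


def wall_clock_seconds(start, end, fmt="%H:%M:%S"):
    """Seconds between two same-day clock readings, e.g. from `date`."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


log = "Operation UNLOAD_20210310-120841-278359 completed successfully in 2 seconds."
total = wall_clock_seconds("06:08:40", "06:08:47")
work = reported_seconds(log)

# Everything outside the timed window: startup, session and thread pool
# initialization, and shutdown.
print(f"wall clock: {total}s, reported: {work}s, outside timed window: {total - work}s")
```

With the numbers from the question, roughly 5 of the 7 seconds fall outside the window that DSBulk reports.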

Also can we use multi threading/parallel processing here? to improve performance?

DSBulk is already heavily multi-threaded. That said, feel free to contribute improvements; the code is open source :-)

If you have several tables to export, you can try running multiple instances of DSBulk in parallel, but be careful: you could ultimately bring your cluster down if you overwhelm it with too many requests.
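Since the question mentions a single Python script, here is a rough sketch of running several DSBulk instances at once while capping how many run at the same time. It is not an official recipe: the function names (`unload_command`, `run_to_file`, `export_all`) and the `max_parallel` parameter are invented for this example, and the command line simply reuses the flags from the question, with stdout redirected to one file per table.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def unload_command(keyspace, table, host, user, password, dsbulk="./dsbulk"):
    """Build the same command line as in the question, for one table."""
    return [dsbulk, "unload", "-k", keyspace, "-t", table,
            "-h", host, "-u", user, "-p", password]


def run_to_file(command, out_path):
    """Run one command with stdout redirected to out_path; return its exit code."""
    with open(out_path, "wb") as out:
        return subprocess.run(command, stdout=out).returncode


def export_all(commands, out_dir, max_parallel=2):
    """Run the commands with at most max_parallel processes at a time.

    commands maps an output name (e.g. a table name) to an argument list.
    Returns a dict of {name: exit_code}.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Threads only wait on child processes here, so a thread pool is enough
    # to bound how many DSBulk instances hit the cluster at once.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {
            name: pool.submit(run_to_file, cmd, out_dir / f"{name}.csv")
            for name, cmd in commands.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Hypothetical usage; keyspace, table names, host and credentials are placeholders.
# commands = {t: unload_command("my_ks", t, "10.0.0.1", "user", "secret")
#             for t in ["table_a", "table_b"]}
# results = export_all(commands, "/tmp/export", max_parallel=2)
```

Start with a small `max_parallel` and raise it only while the cluster stays healthy, since each instance is already multi-threaded and pays its own startup and shutdown cost.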
