Bringing together the Apache Cassandra experts from the community and DataStax.

jean.andre_185656 asked ·

How do we run DSBulk on Google Cloud?

What does it take to run the DSBulk tool on Google Cloud? Or do we need to package the DSBulk jar files with the application and deploy the app? What is the best way to load a table into an existing Cassandra keyspace in the cloud?

dsbulk

1 Answer

Erick Ramirez answered ·

There isn't much to running the DataStax Bulk Loader tool on any machine. The only prerequisite is that you have Java installed.

Installation

Installing DSBulk can be as simple as two steps:

  1. Download the tarball from the DataStax Downloads site.
  2. Unpack the tarball.
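In practice, the two steps above look something like this (the version number and URL below are illustrative; grab the current tarball from the Downloads site):

```shell
# Step 1: download the DSBulk tarball (version 1.8.0 used here as an example)
curl -OL https://downloads.datastax.com/dsbulk/dsbulk-1.8.0.tar.gz

# Step 2: unpack it; this creates a dsbulk-1.8.0/ directory
tar -xzf dsbulk-1.8.0.tar.gz
```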

For details, see Installing DataStax Bulk Loader for Apache Cassandra.

Execution

To run DSBulk, simply run the dsbulk executable from the bin/ directory.
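For example, assuming the tarball was unpacked into a dsbulk-1.8.0/ directory (directory name illustrative), a quick smoke test is:

```shell
cd dsbulk-1.8.0

# Confirm the tool runs; this prints the installed DSBulk version
bin/dsbulk --version
```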

Loading data

To load data from a CSV input file, the basic form of the command is:

$ dsbulk load \
    -h 'node_ip_1, node_ip_2, ... node_ip_n' \
    -url source.csv \
    -k ks_name -t table_name \
    -u db_user -p db_password \
    -header true

For details, see the Getting Started guide. You will also be interested in the Loading data examples page.

Let us know if you have any questions or need assistance. Cheers!


Hello Erick,


I understand that, and it is what I did on my machine for testing. But we are building an application to extract millions of records, read them, and load them into Cassandra. This application will run in the cloud, since the extracted data are in production.

I found a jar for DSBulk in the Maven repo. So I suppose DSBulk can be incorporated into our application, and it would be good to be able to pass DSBulk a data stream rather than writing an external file to disk (risky) and launching DSBulk in a separate process...

We calculated the loading time of the data via DSBulk and it will take 17 hours at a minimum. But we also need to extract the data and transform it. By passing the transformed data directly to DSBulk, without writing a file to disk, we save the time spent writing the data to disk, and we do not need to wait for the file to be entirely written before starting the load. Consequently, the overall process time will also be reduced.
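Something along these lines is what I have in mind, assuming DSBulk can read its input from stdin (its connector url reportedly defaults to `-`, meaning stdin for load); `extract_and_transform` is a hypothetical name for our extraction/transformation step:

```shell
# Hypothetical ETL step that writes transformed CSV rows to stdout,
# piped straight into dsbulk -- no intermediate file on disk.
extract_and_transform \
  | dsbulk load \
      -h 'node_ip_1' \
      -k ks_name -t table_name \
      -u db_user -p db_password \
      -header true
```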
