What does it take to run the DSBulk tool on Google Cloud? Or do we need to package the DSBulk jar files with the application and deploy the app? What is the best way to load a table in an existing Cassandra keyspace on the cloud?
There isn't much to running the DataStax Bulk Loader tool on any machine. The only prerequisite is that you have Java installed.
Installing DSBulk can be as simple as two steps:
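Roughly speaking, those two steps are downloading the binary tarball and unpacking it. A minimal sketch (the exact download URL and version should be taken from the installation page linked below):

$ curl -OL https://downloads.datastax.com/dsbulk/dsbulk.tar.gz
$ tar -xzvf dsbulk.tar.gz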
For details, see Installing DataStax Bulk Loader for Apache Cassandra.
To run DSBulk, simply run the dsbulk executable from the bin/ directory.
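For example, once unpacked you can confirm the tool is on your path and runs (the version flag shown here is an assumption; check the help output for your release):

$ bin/dsbulk --version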
To load data from a CSV input file, the basic form of the command is:
$ dsbulk load \
    -h 'node_ip_1, node_ip_2, ... node_ip_n' \
    -url source.csv \
    -k ks_name -t table_name \
    -u db_user -p db_password \
    -header true
For details, see the Getting Started guide. Let me point out that you will also be interested in the Loading data examples page.
Let us know if you have any questions or need assistance. Cheers!
Hello Erick,
I understand that, and it is what I did on my machine for testing. But we are building an application that extracts millions of records, reads them, and loads them into Cassandra. This application will run in the cloud since the extracted data is in production.
I found a jar for DSBulk in the Maven repository, so I suppose DSBulk can be incorporated into our application. It would be good to be able to pass a data stream to DSBulk rather than write an external file to disk (risky) and launch DSBulk in a separate process...
We calculated the loading time of the data via DSBulk and it will take at least 17 hours. But we also need to extract and transform the data. By passing the transformed data directly to DSBulk, without having to write a file to disk, we save the time spent writing the data to disk, and we do not need to wait for the file to be fully written before starting the load. Consequently, the overall process time is also reduced.
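Ideally we would do something along these lines, assuming DSBulk's CSV connector can read from standard input via -url - (to be confirmed against the connector documentation), where my_extract_and_transform is a placeholder for our extraction and transformation step:

$ my_extract_and_transform | dsbulk load \
    -h 'node_ip_1, node_ip_2, ... node_ip_n' \
    -url - \
    -k ks_name -t table_name \
    -u db_user -p db_password \
    -header true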