igor.rmarinho_185445 asked:

Exporting with COPY TO command returns "Exported 22 ranges out of 25 total ranges, some records might be missing"

Hello again! :)

I got the error below after using COPY TO. What would be the best practice to avoid it? It usually happens on large tables that are distributed across 3 nodes.

Thanks

<stdin>:1:Exported 22 ranges out of 25 total ranges, some records might be missing
Processed: 594027 rows; Rate:     136 rows/s; Avg. rate:     419 rows/s
Tags: cassandra, cqlsh, copy

Richard Hamlin answered:

Without knowing the version of C* you're running, the exact command you ran, or the size of the table you're copying from, here is some basic info:

C* versions 2.1.13 / 2.2.5 / 3.0.3 / 3.2 introduced several new COPY command options that may be helpful in your specific scenario. These include:


MAXATTEMPTS - an integer indicating the maximum number of attempts to import a batch or to export a range in case of a server error. Note that for some permanent errors, such as parse errors, no multiple attempts will be performed. It defaults to 5.


NUMPROCESSES - an integer indicating the number of worker processes. It defaults to the number of cores minus one, capped at 16.


PAGESIZE - an integer indicating the page size for fetching results. It defaults to 1,000. The bigger the page size, the longer the page timeout should be. You may want to change this parameter if you have very large or very small partitions.


PAGETIMEOUT - an integer indicating the timeout in seconds for fetching each page. It defaults to 10 seconds. Increase it for large page sizes or large partitions. If you notice timeouts, you should consider increasing this value. In case of server timeouts, there is an exponential backoff policy that kicks in automatically, so you may notice delays but this is to prevent overloading the server even further. The driver can also generate timeouts, in this case there is a small chance data may be missing or duplicated since the driver doesn't know if the server will later on drop the request or return a result. Increasing this value is very helpful to prevent driver generated timeouts.


MAXREQUESTS - an integer indicating the maximum number of in-flight requests each worker process can work on. The total level of parallelism when exporting data is given by the number of worker processes multiplied by this value. It defaults to 6. Each request will export the data of an entire token range.
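To make the retry behavior described above concrete, here is a minimal Python sketch of MAXATTEMPTS combined with exponential backoff on server timeouts. The delay values (1s base, doubling, 60s cap) are hypothetical choices for illustration, not cqlsh's actual internals:

```python
def backoff_delays(max_attempts=5, base_delay=1.0, cap=60.0):
    """Delay (in seconds) before each retry attempt: doubles every
    attempt and is capped.

    max_attempts mirrors the MAXATTEMPTS option's default of 5;
    base_delay and cap are hypothetical values for illustration.
    """
    return [min(cap, base_delay * (2 ** attempt))
            for attempt in range(max_attempts)]

# With the default 5 attempts, the waits grow 1s, 2s, 4s, 8s, 16s.
print(backoff_delays())
```

The point of the cap is the one Richard makes above: retries slow down progressively so a struggling server is not overloaded even further.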


You can also specify a token range to limit the amount of data that is being exported:


BEGINTOKEN - a string representing the minimum token to consider when exporting data. Records with smaller tokens will not be exported. It defaults to empty, which indicates no minimum token.


ENDTOKEN - a string representing the maximum token to consider when exporting data. Records with bigger tokens will not be exported. It defaults to empty, which indicates no maximum token.
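Putting these options together, a tuned COPY TO invocation for a large table might look like the following sketch. The keyspace, table, file path, and specific values are illustrative assumptions, not a prescription, and the token bounds assume the Murmur3 partitioner:

```sql
-- Hypothetical keyspace/table; tune the values to your data.
COPY my_keyspace.my_table TO '/tmp/my_table.csv'
WITH HEADER = true
 AND PAGESIZE = 1000        -- rows fetched per page (the default)
 AND PAGETIMEOUT = 60       -- seconds per page; raise for large partitions
 AND MAXATTEMPTS = 10       -- retries per token range (default 5)
 AND BEGINTOKEN = '-9223372036854775808'
 AND ENDTOKEN = '0';        -- export only part of the token ring
```

Splitting the export into token ranges like this also lets you re-run only the ranges that failed instead of the whole table.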



For more information, please see this document:


https://www.datastax.com/blog/2016/01/new-options-and-better-performance-cqlsh-copy


1 comment:

Thanks Richard! I thought it was necessary to change a parameter in a Cassandra config file!

PAGETIMEOUT will work for my case; I'm only using HEADER=true.

The tables range in size from 1 to 41,000,000 rows.
Erick Ramirez answered:

@igor.rmarinho_185445 The COPY command in cqlsh is a developer utility meant only as a troubleshooting tool for very small tables. By "small" I mean tables whose size is on the order of hundreds or thousands of records.

Richard's suggestions are fantastic, and some improvements were made several years ago, but COPY still doesn't perform well on large amounts of data. I expect you'll still run into issues even with the new tuning parameters, so our general recommendation is to use the DataStax Bulk Loader tool (DSBulk).

DSBulk was designed to solve the limitations of the COPY commands. Compared to cqlsh, it can run four times as fast and can handle large data sets. The tool is available for Linux, Mac OS & Windows. Back in December, we made it freely available for everyone to use on DSE and open-source Cassandra.
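For comparison, a minimal DSBulk export of the same (hypothetical) table could look like the sketch below. The host, keyspace, table, and output path are assumptions for a local single-node setup:

```shell
# Unload (export) a table to CSV files in the given directory.
# -h names the contact point, -k/-t the keyspace and table.
dsbulk unload -h 127.0.0.1 -k my_keyspace -t my_table \
  -url /tmp/my_table_export
```

Unlike cqlsh COPY, DSBulk handles retries, parallelism, and checkpointing for you, which is why it holds up on tables in the tens of millions of rows.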

For more information, check out the links listed below. Cheers!
