Yeikel.ValdesSantana_186477 asked

How can I copy local directories to DSEFS with DSE 6.0?

FOLLOW UP TO QUESTION #3539

I have many folders with many files (Parquet files to process using Spark), and following that approach, my only option is to:

1. Create each folder from my local fs inside DSEFS

2. Copy each folder/file from the local fs to DSEFS using put

That works, but it is very slow. If the number of files is large, it can take a long time.

1 Answer

Erick Ramirez answered

@Yeikel.ValdesSantana_186477 In DSE 6.0, you can run Hadoop filesystem commands with dse hadoop fs, which allow you to copy directories and their contents recursively.

Consider the following nested directory structure on a Linux filesystem:

dir1/
  subdir110/
    file11726.txt
    file12717.csv
  subdir16/
    file1059.json
    file13246.pdf
dir2/
  subdir210/
    file21767.csv 
    file23799.json
  subdir26/
    file2221.pdf
    file25137.txt

Recursive directory copy

To copy both dir1 and dir2 (including the subdirectories and their contents) to the root (/) directory of DSEFS, run:

$ dse hadoop fs -copyFromLocal dir* /

Check the contents of DSEFS:

$ dse hadoop fs -ls -R
drwxrwxrwx   - none none          0 2020-04-22 17:50 dir1
drwxrwxrwx   - none none          0 2020-04-22 17:50 dir1/subdir110
-rw-r--r--   3 none none          0 2020-04-22 17:50 dir1/subdir110/file11726.txt
-rw-r--r--   3 none none          0 2020-04-22 17:50 dir1/subdir110/file12717.csv
drwxrwxrwx   - none none          0 2020-04-22 17:50 dir1/subdir16
-rw-r--r--   3 none none          0 2020-04-22 17:50 dir1/subdir16/file1059.json
-rw-r--r--   3 none none          0 2020-04-22 17:50 dir1/subdir16/file13246.pdf
drwxrwxrwx   - none none          0 2020-04-22 17:51 dir2
drwxrwxrwx   - none none          0 2020-04-22 17:51 dir2/subdir210
-rw-r--r--   3 none none          0 2020-04-22 17:51 dir2/subdir210/file21767.csv
-rw-r--r--   3 none none          0 2020-04-22 17:51 dir2/subdir210/file23799.json
-rw-r--r--   3 none none          0 2020-04-22 17:51 dir2/subdir26/file2221.pdf
-rw-r--r--   3 none none          0 2020-04-22 17:51 dir2/subdir26/file25137.txt
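
For larger trees, reading the listing by eye does not scale. As a rough sketch (not tested against a real cluster), you could compare the local file list with the DSEFS listing and print anything that is missing. The helper name missing_in_dsefs is made up for illustration, and the sketch assumes the directories were copied to the DSEFS root under the same names, that the path is the last field of each -ls -R line, and that paths contain no spaces:

```shell
# Hypothetical helper: print local files that do not appear in DSEFS.
# Assumes the given directories were copied to the DSEFS root (/).
missing_in_dsefs() {
    local_list=$(mktemp)
    remote_list=$(mktemp)

    # Local side: every regular file under the given directories.
    find "$@" -type f | sort > "$local_list"

    # DSEFS side: keep file entries only (mode starts with "-"),
    # take the last field as the path and drop any leading slash.
    dse hadoop fs -ls -R / \
        | awk '$1 ~ /^-/ { sub(/^\//, "", $NF); print $NF }' \
        | sort > "$remote_list"

    # Lines present only in the local list are missing from DSEFS.
    comm -23 "$local_list" "$remote_list"

    rm -f "$local_list" "$remote_list"
}
```

Run from the directory that contains dir1 and dir2, missing_in_dsefs dir1 dir2 prints nothing when every local file is present in the listing. It compares names only, not sizes or contents.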

Parallel copies

Take advantage of the distributed architecture of DSEFS and run multiple clients to increase throughput to the cluster if you have a large number of files/directories to copy.

A single copy runs sequentially, iterating over each directory, subdirectory and file one by one, so it can take some time when you have hundreds or thousands of files and directories to work on. In that respect it behaves just like any other filesystem, whether it is on your laptop, an NFS server, or an S3 bucket.

Using the example above, you could spawn two separate DSEFS clients, one for each directory, to run in parallel:

$ dse hadoop fs -copyFromLocal dir1 /
$ dse hadoop fs -copyFromLocal dir2 /
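
If you would rather launch both from one shell session than from two terminals, a minimal sketch is to start each copy as a background job and wait for all of them. The function name copy_dirs_in_parallel is hypothetical, not something that ships with DSE:

```shell
# Hypothetical helper: copy each local directory to DSEFS in its own
# background job, then wait for all of them.
# Usage: copy_dirs_in_parallel <dsefs-destination> <local-dir>...
copy_dirs_in_parallel() {
    dest=$1
    shift
    pids=""

    for dir in "$@"; do
        # One dse client per directory, started in the background.
        dse hadoop fs -copyFromLocal "$dir" "$dest" &
        pids="$pids $!"
    done

    # Collect exit statuses; report failure if any copy failed.
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=1
    done
    return "$failed"
}
```

For the example above this would be called as copy_dirs_in_parallel / dir1 dir2. It starts one job per directory with no upper limit, so it only suits a small number of directories.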

The copy would likely run faster if you could divide the files across multiple servers so that a single local disk doesn't become the bottleneck.

If you had, say, 100+ directories, each with dozens of nested subdirectories and files, you could extend the approach above by running parallel copies on 5-10 top-level directories at a time until all the files are copied to DSEFS. Just monitor the performance of your cluster to make sure you don't overload it by running too many sessions in parallel.
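
One way to keep the number of sessions capped is to let xargs schedule the copies. This is only a sketch: copy_top_dirs is a hypothetical name, and it relies on the -0 and -P options, which GNU and BSD xargs provide but POSIX does not require:

```shell
# Hypothetical helper: copy many top-level directories to DSEFS,
# running at most <max-jobs> dse clients at any one time.
# Usage: copy_top_dirs <max-jobs> <dsefs-destination> <local-dir>...
copy_top_dirs() {
    max_jobs=$1
    dest=$2
    shift 2

    # Emit NUL-separated directory names so unusual characters survive,
    # then let xargs run one copy per name, max_jobs at a time.
    printf '%s\0' "$@" \
        | xargs -0 -P "$max_jobs" -I{} dse hadoop fs -copyFromLocal {} "$dest"
}
```

For example, copy_top_dirs 5 / dir* would work through every matching directory with five copies in flight. Start with a low number and raise it only while the cluster stays healthy.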

Note that the Hadoop FS interface was deprecated in DSE 6.0 and was completely removed in DSE 6.7 (DSP-16063, DSP-16594). For more info, see the newer cp -R command in DSE 6.7. Cheers!

1 comment

I tried that and it seems to be working better. Thanks for the insight.

One question: when this error happens, do you retry internally, or what does it mean?

I checked a few of the files and they appear to have been copied successfully, but I can't verify all of them manually.

Thanks!

1587567911977.png (19.4 KiB)