lziegler asked:

Spark output to NAS returns mkdirs failure on _temporary directories

My development team is hitting mkdirs authorization failures when writing CSV or Parquet output from Spark to a NAS drive.

Caused by: Mkdirs failed to create file:/mycompany/testcase/tmp/quick/pushdatazzz/_temporary/0/_temporary/attempt_20191118173538_0003_m_000000_9 (exists=false, cwd=file:/apps/cassandra/data/data2/spark/rdd/app-20191118173446-0234/0)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(
    at org.apache.hadoop.fs.ChecksumFileSystem.create(
    at org.apache.hadoop.fs.FileSystem.create(
    at org.apache.hadoop.fs.FileSystem.create(
    at org.apache.hadoop.fs.FileSystem.create(
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
    at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVFileFormat.scala:135)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:77)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:303)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:312)

[fakeuser@fakenode quick]$ ls -al
total 36
drwxrwxrwt 8 fakeuser fakegroup 4096 Nov 18 17:35 .
drwxrwxrwx 4 500 500 8192 Nov 18 15:56 ..
drwxr-xr-x 2 fakeuser fakegroup 4096 Nov 18 17:35 pushdatazzz

No subdirectories are created whatsoever.

We would expect Spark to create temporary directories from the nodes where the query executed and coalesce them into one file. If we did not coalesce, we would expect multiple files in this directory.

While searching for answers, I came across various posts that suggested setting

--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

but this was of no help.
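For reference, the same setting can be applied when building the session instead of on the command line. This is a configuration sketch with an illustrative app name; note that the committer algorithm only changes how output is committed and would not fix a permissions problem:

```python
# Sketch: applying the committer setting via the session builder.
# The app name is illustrative; requires a PySpark installation.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nas-write-test")  # illustrative name
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)
```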

Has anyone else faced this dilemma? Are we expected to write to DSEFS first and then copy the file to the local filesystem?


1 Answer

Russell Spitzer answered:

By default, these writers create a target/_temporary directory; this is how the Hadoop file writers that Spark uses are implemented. Once the files are finished being written to the temporary directory, they are moved into the actual target.

Although I don't have the full stack trace for your error, I'm guessing this is an executor exception. If so, the executor process may be running as a different user than the job submitter, and that user may not have permission to write to the target directory. By default in a DSE Analytics cluster, the executors are launched by the DSE user, so check those permissions.
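The commit mechanics can be mimicked in plain Python to show which step fails: creating target/_temporary is the first write the committer attempts, so a permissions mismatch surfaces exactly there. This is an illustrative sketch, not Hadoop's actual FileOutputCommitter:

```python
import os
import shutil
import tempfile

def commit_write(target_dir: str, filename: str, data: str) -> str:
    """Mimic the write-then-rename pattern: write under target/_temporary,
    then move the finished file into the actual target directory."""
    tmp_dir = os.path.join(target_dir, "_temporary")
    # This mkdirs call is the one that fails when the writing process
    # lacks write permission on target_dir.
    os.makedirs(tmp_dir, exist_ok=True)
    tmp_path = os.path.join(tmp_dir, filename)
    with open(tmp_path, "w") as f:
        f.write(data)
    final_path = os.path.join(target_dir, filename)
    os.replace(tmp_path, final_path)  # move into the real target
    shutil.rmtree(tmp_dir)            # clean up the temporary directory
    return final_path

out = commit_write(tempfile.mkdtemp(), "part-00000.csv", "a,b\n1,2\n")
print(os.path.exists(out))
```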
