
lziegler asked:

Spark output to NAS returns mkdirs failure on _temporary directories

My development team gets mkdirs permission failures when writing CSV or Parquet output from Spark to a NAS drive.

catCols.coalesce(1).write.csv("file:///mycompany/testcase/tmp/quick/pushdatazzz")
Caused by: java.io.IOException: Mkdirs failed to create file:/mycompany/testcase/tmp/quick/pushdatazzz/_temporary/0/_temporary/attempt_20191118173538_0003_m_000000_9 (exists=false, cwd=file:/apps/cassandra/data/data2/spark/rdd/app-20191118173446-0234/0)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:450)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:435)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:890)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:787)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
    at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVFileFormat.scala:135)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:77)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:303)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:312)


[fakeuser@fakenode quick]$ ls -al
total 36
drwxrwxrwt 8 fakeuser fakegroup 4096 Nov 18 17:35 .
drwxrwxrwx 4 500 500 8192 Nov 18 15:56 ..
drwxr-xr-x 2 fakeuser fakegroup 4096 Nov 18 17:35 pushdatazzz

No subdirectories are created at all.

We would expect Spark to create temporary directories from each node where the query executed and then coalesce the output into a single file. Without the coalesce, we would expect multiple part files in this directory.

While searching for answers, I came across several posts that suggested setting

--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

but this was of no help.
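For anyone landing here, a minimal sketch of how that setting can be applied in code rather than on the command line (the app name is a placeholder, and this is illustrative only, since it did not fix our case):

import org.apache.spark.sql.SparkSession

// Same effect as passing --conf on spark-submit; algorithm version 2
// commits task output directly to the destination, skipping one rename pass.
val spark = SparkSession.builder()
  .appName("nas-write-test")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()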

Has anyone else faced this dilemma? Are we expected to write to DSEFS first and then copy the file to the local filesystem?
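For reference, a rough sketch of that DSEFS-first approach, assuming DSEFS is enabled on the cluster; the dsefs:// path and the Hadoop FS copy step are illustrative, not a confirmed fix:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Write to DSEFS first; it is reachable from every executor,
// unlike a NAS mount that may differ per node. Paths are placeholders.
catCols.coalesce(1).write.csv("dsefs:///testcase/tmp/quick/pushdatazzz")

// Then copy the committed output down to the local filesystem.
val fs = FileSystem.get(new URI("dsefs:///"), spark.sparkContext.hadoopConfiguration)
fs.copyToLocalFile(
  new Path("dsefs:///testcase/tmp/quick/pushdatazzz"),
  new Path("file:///mycompany/testcase/tmp/quick/pushdatazzz"))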

spark

1 Answer

Russell Spitzer answered:

By default, the Hadoop file writers that Spark uses create a target/_temporary directory. Each task writes its output there first; once the writes are finished, the files are moved into the actual target directory.
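As an illustration (file names abbreviated), the layout during and after a write looks roughly like this:

pushdatazzz/_temporary/0/_temporary/attempt_.../part-00000-....csv   <- tasks write here
pushdatazzz/part-00000-....csv                                       <- moved here on task commit
pushdatazzz/_SUCCESS                                                 <- marker once the job commits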

Although I don't have the full stack trace for your error, I'm guessing this is an executor exception. If so, the executor process may be running as a different user than the one who submitted the job, and that user may not have permission to write to the target directory. By default in a DSE Analytics cluster, the executors are launched as the DSE service user, so check that user's permissions on the target.
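One way to check (a diagnostic sketch, not part of the original answer) is to ask the executors themselves which OS user they run as:

// Report the OS user of each executor JVM. If this prints something
// other than the user who owns the NAS target (e.g. "dse"), that user
// needs write permission on the output directory.
val users = spark.sparkContext
  .parallelize(1 to 1000, numSlices = 8)
  .map(_ => System.getProperty("user.name"))
  .distinct()
  .collect()
println(users.mkString(", "))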
