question

lziegler asked · Russell Spitzer answered

How to Go About Detecting Actual Failures vs Recoverable Errors in Spark on a DSE Analytics Cluster

Our developers experienced two diametrically opposed issues with Spark application submissions on our DSE Analytics clusters (DSE 6.0.8 and Spark 2.2.2.8).

In one case, a Spark application experienced errors storing blocks into DSEFS on a specific node. The developer interpreted these errors as a major issue with the Spark job and interrupted the process flow.

In actuality, the Spark job ran successfully because DSE/Spark cleverly redirected the blocks to another node ... amazing resilience. When examining the application/system/debug logs for a specific Spark app, how would a developer know whether the errors are actual functional failures, i.e. ones Spark could not self-heal?

In the second case, a developer executed a Spark job whose Spark SQL referenced an incorrect column/field name. The job did not crash; it simply ran indefinitely, accumulating failed tasks in its stage(s). Why did it not fail immediately on the invalid column/field name? It's almost as if it was TOO resilient.

What coding techniques / conf settings / best practices are we missing?

spark

1 Answer

Russell Spitzer answered

In the first case, any time the Spark job completes, that means it completed successfully. Many types of errors in a Spark job are only temporary. Executors, for example, can shut down completely for a variety of reasons, and this shows up in the logs as dire error messages, but because of retry policies these are OK. Unless the Driver itself shuts down prematurely, the app has completed according to specification.
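As an illustration, here is a minimal driver-side sketch of that distinction (the keyspace/table `ks.events`, the DSEFS output path, and the app name are hypothetical). Errors that Spark absorbs through task or block retries never reach the catch block; only a failure the driver could not recover from does, and rethrowing it makes spark-submit exit non-zero so the calling workflow can treat the run as failed.

```scala
import org.apache.spark.sql.SparkSession

object JobStatusCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("job-status-check").getOrCreate()
    try {
      // Hypothetical read from a Cassandra table ks.events via the connector
      val df = spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "ks", "table" -> "events"))
        .load()

      // Any failure Spark could not recover from (task retries exhausted,
      // lost driver, etc.) surfaces here as an exception on the action.
      df.write.mode("overwrite").csv("dsefs:///tmp/events_export")

      // Reaching this point means the application succeeded, even if the
      // executor logs contain transient block-storage or executor-loss errors.
      println("Job completed successfully")
    } catch {
      case e: Exception =>
        // A genuine, non-recoverable failure: log it and rethrow so
        // spark-submit returns a non-zero exit code.
        System.err.println(s"Job failed with a non-recoverable error: ${e.getMessage}")
        throw e
    } finally {
      spark.stop()
    }
  }
}
```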

The second case sounds like a bug; wrong column names in Spark SQL should cause an analysis-level exception. If we could have more details on that, I'd be glad to take a look. It is possible that it wasn't actually related to column naming, but rather that the executor startup command was broken. That can lead to an infinite number of executors (but not tasks). Normally Spark has a hard limit on task retries, so it is unlikely that a task could be retried indefinitely.
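For what it's worth, a bad column name normally fails fast at analysis time, before any tasks run, and can be caught in the driver. A minimal sketch (the view name `events` and the existing DataFrame `df` are hypothetical):

```scala
import org.apache.spark.sql.AnalysisException

// Hypothetical temp view over any existing DataFrame
df.createOrReplaceTempView("events")

try {
  // A nonexistent column should fail here, at analysis time,
  // before any tasks are scheduled on the executors.
  spark.sql("SELECT no_such_column FROM events").show()
} catch {
  case e: AnalysisException =>
    // Fail fast in the driver rather than letting the workflow hang.
    System.err.println(s"Invalid Spark SQL: ${e.getMessage}")
    sys.exit(1)
}
```

The task-retry ceiling mentioned above is governed by `spark.task.maxFailures`, which defaults to 4; once a task fails that many times the stage, and with it the job, is aborted.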
