Our developers have run into two diametrically opposed issues with Spark application submissions on our DSE Analytics clusters (DSE 6.0.8 with Spark 2.2.2.8).
In the first case, a Spark application logged errors while storing blocks into DSEFS on a specific node. The developer interpreted these errors as a major failure of the Spark job and interrupted the process flow.
In reality, the job completed successfully: DSE/Spark re-drove the blocks to another node, which is impressive resilience. When examining the application, system, or debug log for a specific Spark app, how can a developer tell whether the logged errors are genuine functional failures, i.e. cases where Spark could not self-heal?
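For context, this is the kind of check we are experimenting with: a minimal SparkListener sketch (the class and field names are ours, not a DSE or Spark API) that counts task-level errors separately from the job-level outcome, on the theory that only a failed job result means Spark could not self-heal. Is that a reasonable way to read it, or is there a better signal in the DSE logs?

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.{Success => TaskSuccess, TaskFailedReason}
import org.apache.spark.scheduler.{JobSucceeded, SparkListener, SparkListenerJobEnd, SparkListenerTaskEnd}

// Sketch only: separate retryable task-level errors from a job-level failure.
class FailureAwareListener extends SparkListener {
  val taskFailures = new AtomicInteger(0)
  @volatile var jobFailed = false

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = taskEnd.reason match {
    case TaskSuccess         => // task succeeded (possibly after being re-driven to another node)
    case _: TaskFailedReason => taskFailures.incrementAndGet() // noisy, but not fatal by itself
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = jobEnd.jobResult match {
    case JobSucceeded => // transient task errors were absorbed by retries: not a functional failure
    case _            => jobFailed = true // Spark could not self-heal: treat as a real failure
  }
}

// Registered before the job runs, e.g.:
// spark.sparkContext.addSparkListener(new FailureAwareListener)
```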
In the second case, a developer ran a Spark job whose Spark SQL referenced a column/field name that does not exist. The job did not crash; it simply ran indefinitely, accumulating failed tasks in its stage(s). Why didn't it fail immediately on the invalid column/field name? It almost seems TOO resilient.
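Below is a rough sketch of the defensive pattern we are considering (the app name, dataset path, and column names are made-up placeholders): validate requested column names against the DataFrame schema before running the query, and cap task retries with spark.task.maxFailures so a runtime error fails fast instead of looping. Is this the right direction, or should the invalid column have been rejected at analysis time anyway?

```scala
import org.apache.spark.sql.{AnalysisException, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ColumnCheckExample {
  // Guard: fail immediately if any requested column is missing from the schema.
  def selectChecked(df: DataFrame, cols: Seq[String]): DataFrame = {
    val missing = cols.filterNot(df.columns.contains)
    require(missing.isEmpty, s"Unknown column(s): ${missing.mkString(", ")}")
    df.select(cols.map(col): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("column-check-example")       // placeholder name
      .config("spark.task.maxFailures", "4") // cap task retries (4 is the documented default)
      .getOrCreate()

    try {
      val df = spark.read.parquet("dsefs:///data/example") // placeholder path
      selectChecked(df, Seq("customer_id", "order_total")).show()
    } catch {
      case e: AnalysisException =>
        // A column referenced directly in Spark SQL that does not exist normally
        // surfaces here, at analysis time, before any tasks are launched.
        println(s"Query rejected at analysis time: ${e.getMessage}")
        throw e
    } finally {
      spark.stop()
    }
  }
}
```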
What coding techniques / conf settings / best practices are we missing?