Under DSE 6.0.5, we run operational-intelligence Spark jobs in batch mode that write Parquet files to DSEFS. During testing, the team observed intermittent data loss in the output. The remedy was to reduce the number of cores per executor to 5, which they said is a common prescription from the Hadoop community.
Is this a recognized issue with the DSE flavor of Hadoop, and is it documented and/or fixed in a release after 6.0.5? We move to 6.0.8 in January, followed shortly thereafter by 6.7.4. Capping executors at 5 cores severely limits the level of parallelism the Spark job submission can achieve.
Looking at the event timeline for large stages, it is clear that the job would benefit from additional cores, but not at the risk of data loss due to a Hadoop limitation.
--total-executor-cores 36 --executor-cores 5 --num-executors 11
The team hardcodes the conf parameters above to keep cores per executor at 5, based on the Hadoop 'limitation'.
As for the 11 executors and 36 total cores: with dynamic allocation disabled, Spark never actually uses 11 executors, because 36 total cores at 5 per executor only allows 7 executors, consuming 35 cores in total. Is it just a matter of increasing --num-executors and setting --total-executor-cores to 5 * --num-executors? How is the rest of the community dealing with the Hadoop issue when writing to DSEFS?
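For example, keeping the 5-core cap but making the totals consistent, something like the following should actually produce 11 executors of 5 cores each (a sketch of the arithmetic, not a tested submission):

--total-executor-cores 55 --executor-cores 5 --num-executors 11

i.e. --total-executor-cores = 5 * --num-executors, so the total is an exact multiple of the per-executor cores. My understanding is that in this mode the effective executor count is total-executor-cores divided by executor-cores, which matches the 7 executors we see today with 36/5.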
I found this blog post from 2013: https://www.datastax.com/blog/2013/10/tuning-dse-hadoop-mapreduce
Is there a more up-to-date resource that provides Hadoop tuning information for a DSE Analytics cluster?
Digging deeper, I found write-ups directing users to increase the Hadoop dfs.block.size parameter from the default 64 MB. On our Spark UI Environment page I see spark.hadoop.fs.s3n.multipart.uploads.block.size set to 64 MB ... is this the parameter we need to bump up to 256 MB to address the Hadoop limitation? We want to increase the parallelism and cores, but do not want to risk data loss.
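For reference, if dfs.block.size (or its newer name, dfs.blocksize) turns out to be the right knob, my understanding is that Hadoop properties can be passed through spark-submit with the spark.hadoop. prefix, e.g.:

--conf spark.hadoop.dfs.block.size=268435456

(268435456 bytes = 256 MB.) That is only a sketch; whether DSEFS actually honors this property, or whether the block size needs to be set on the DSEFS side instead, is exactly what I am hoping someone can confirm. I suspect the fs.s3n.multipart.uploads.block.size value in the UI is specific to the S3 connector rather than DSEFS, but would appreciate confirmation.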