SPARK CONFIGURATION OPTIMIZATION

CLUSTER SIZE

Additional parameters

The following parameters help fine-tune the overall optimized configuration. We recommend leaving them at their defaults.
Parallelism per core: we recommend setting this value to 2. It can be higher for a large cluster.

Calculated cluster resources

160   Total memory (GB)
16    Cores per node
75    Usable cores. Leave 1 core per node for the Hadoop/YARN daemons.
144   Usable memory (GB)
3     Executors per node = (cores per node - 1) / spark.executor.cores
10    Memory per executor (GB). Leave 1 GB per node for the Hadoop daemons.
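The figures above can be reproduced with a short calculation. A minimal sketch, assuming (for illustration only, since the inputs are not stated here) a cluster of 5 nodes, each with 16 cores and 32 GB of memory:

```python
# Hypothetical cluster inputs, chosen to reproduce the figures above.
nodes = 5                 # assumption
cores_per_node = 16       # assumption
memory_per_node_gb = 32   # assumption
executor_cores = 5        # recommended cores per executor
parallelism_per_core = 2  # recommended value

# Leave 1 core per node for the Hadoop/YARN daemons.
usable_cores = (cores_per_node - 1) * nodes  # 75

# Executors that fit on one node.
executors_per_node = (cores_per_node - 1) // executor_cores  # 3

# Leave 1 GB per node for the Hadoop daemons, then split the rest.
memory_per_executor_gb = (memory_per_node_gb - 1) // executors_per_node  # 10

# One executor slot is given up to the YARN ApplicationMaster.
executor_instances = executors_per_node * nodes - 1  # 14

# Total executor cores times parallelism per core, or 2, whichever is larger.
default_parallelism = max(
    executor_instances * executor_cores * parallelism_per_core, 2)  # 140

print(usable_cores, executors_per_node, memory_per_executor_gb,
      executor_instances, default_parallelism)
```

Changing the assumed node count or node specs recomputes the whole table, which is useful when resizing the cluster.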

Unused resources

5     Unused cores
1     Unused memory (GB)

spark-defaults.conf

spark.default.parallelism    140   Total number of cores on all executor nodes times parallelism per core, or 2, whichever is larger.
spark.executor.memory        9g    Memory per executor minus the overhead memory.
spark.executor.instances     14    Leaving 1 executor for the YARN ApplicationMaster.
spark.executor.cores         5     Assigning executors a large number of virtual cores leads to a low number of executors and reduced parallelism, while assigning a low number of virtual cores leads to a high number of executors and a larger amount of I/O operations. We suggest 5 cores per executor to achieve optimal results in any sized cluster.
spark.driver.memory            9g    We recommend setting this to the same value as spark.executor.memory.
spark.driver.maxResultSize     9g    Should be at least 1m, or 0 for unlimited. Jobs are aborted if the total size of results exceeds this limit. A high limit can cause out-of-memory errors in the driver (depending on spark.driver.memory and the memory overhead of objects in the JVM); a proper limit protects the driver from them.
spark.driver.memoryOverhead    921m  spark.driver.memory * 0.10, with a minimum of 384 MiB.
spark.executor.memoryOverhead  921m  Additional memory allocated per executor process in cluster mode, in MiB unless otherwise specified. This accounts for things like VM overheads, interned strings, and other native overheads, and tends to grow with the executor size (typically 6-10%). This option is currently supported on YARN and Kubernetes.
spark.dynamicAllocation.enabled  false  Set this to true only if spark.dynamicAllocation.initialExecutors, minExecutors, and maxExecutors are properly determined; otherwise, we recommend calculating the resources manually for the important jobs.
spark.sql.adaptive.enabled       true   Adaptive Query Execution is disabled by default in Spark 3.0. It applies if the query is not a streaming query and contains at least one exchange (usually when there is a join, aggregate, or window operator) or one subquery. We recommend setting it to true.
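Two of the recommendations above can be checked numerically: the five-core guideline (by comparing how many executors different per-executor core counts yield) and the 921 MiB overhead (from the 10% rule). A minimal sketch, assuming the 75 usable cores and 9 GiB executor memory from the figures above:

```python
# Trade-off behind spark.executor.cores: executors obtainable from
# 75 usable cores (assumed from the example figures) at different
# per-executor core counts.
usable_cores = 75
executors = {c: usable_cores // c for c in (2, 5, 15)}
for c, n in executors.items():
    print(f"{c} cores/executor -> {n} executors")
# Few large executors reduce parallelism; many small ones multiply I/O.

# Overhead check: max(10% of executor memory, 384 MiB).
memory_mib = 9 * 1024                             # 9g heap
overhead_mib = max(int(memory_mib * 0.10), 384)   # 921
# What YARN actually reserves per container: heap + overhead.
container_mib = memory_mib + overhead_mib         # 10137, about 9.9 GiB
print(overhead_mib, container_mib)
```

The container total stays just under the 10 GB per-executor budget from the calculated resources, which is why the 9g/921m split fits.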
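The same values can also be supplied directly to spark-submit instead of spark-defaults.conf. A sketch, where the application file name is a placeholder:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 14 \
  --executor-cores 5 \
  --executor-memory 9g \
  --driver-memory 9g \
  --conf spark.driver.memoryOverhead=921m \
  --conf spark.executor.memoryOverhead=921m \
  --conf spark.default.parallelism=140 \
  --conf spark.sql.adaptive.enabled=true \
  my_app.py  # placeholder application
```

Flags passed on the command line take precedence over spark-defaults.conf, which is convenient for overriding the cluster-wide values on a per-job basis.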

Recommended configuration

Though the following parameters are not required, they can help applications run smoothly and avoid timeout and memory-related errors. We advise setting these in the spark-defaults configuration file.
spark.memory.fraction  0.8  The lower this is, the more frequently spills and cached-data eviction occur.
5
spark.rdd.compress            true  When set to true, this property saves substantial space, at the cost of some extra CPU time, by compressing the RDDs.
spark.shuffle.compress        true  When set to true, this property compresses the map output to save space.
spark.shuffle.spill.compress  true  When set to true, this property compresses the data spilled during shuffles.
spark.serializer              org.apache.spark.serializer.KryoSerializer  The default Java serialization works with any Serializable Java object but is quite slow, so we recommend using org.apache.spark.serializer.KryoSerializer and configuring Kryo serialization when speed is necessary.
spark.executor.extraJavaOptions  -XX:+UseG1GC -XX:+G1SummarizeConcMark
spark.driver.extraJavaOptions    -XX:+UseG1GC -XX:+G1SummarizeConcMark
You can choose among multiple garbage collectors for evicting old objects and placing new ones into memory. However, the newer Garbage-First Garbage Collector (G1GC) overcomes the latency and throughput limitations of the older collectors.
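One recommended value above deserves a worked example: if the 0.8 setting is spark.memory.fraction, Spark defines the unified execution-and-storage region as that fraction of the JVM heap minus a fixed 300 MiB reserve. Assuming a 9 GiB executor heap:

```python
RESERVED_MIB = 300     # fixed reservation Spark subtracts from the heap

heap_mib = 9 * 1024    # spark.executor.memory = 9g (assumption)
memory_fraction = 0.8  # the recommended fraction

# Unified region shared by execution and storage.
unified_mib = int((heap_mib - RESERVED_MIB) * memory_fraction)  # 7132
print(unified_mib)
```

So roughly 7 GiB of each 9 GiB executor is available for joins, aggregations, and cached data; the remaining 20% is left for user data structures and internal metadata, which is why lowering the fraction increases spills and cache eviction.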