SPARK CONFIGURATION OPTIMIZATION

CLUSTER SIZE

Additional parameters

The following parameters help fine-tune the overall optimized configuration. We recommend leaving them at their defaults.
Parallelism per core: we recommend setting this value to 2. It can be higher for a large cluster.

Calculated cluster resources

160   Total memory (GB)
16    Cores per node
75    Usable cores. Leave 1 core per node for the Hadoop/YARN daemons.
144   Usable memory (GB)
3     Executors per node = (cores per node - 1) / spark.executor.cores
10    Memory per executor (GB). Leave 1 GB per node for the Hadoop daemons.
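The figures above can be reproduced with a short calculation. A minimal sketch, assuming (for illustration only, since the inputs are not stated here) a cluster of 5 nodes, each with 16 cores and 32 GB of memory:

```python
# Hypothetical cluster inputs, chosen to reproduce the figures above.
nodes = 5                 # assumption
cores_per_node = 16       # assumption
memory_per_node_gb = 32   # assumption
executor_cores = 5        # recommended cores per executor
parallelism_per_core = 2  # recommended value

# Leave 1 core per node for the Hadoop/YARN daemons.
usable_cores = (cores_per_node - 1) * nodes  # 75

# Executors that fit on one node.
executors_per_node = (cores_per_node - 1) // executor_cores  # 3

# Leave 1 GB per node for the Hadoop daemons, then split the rest.
memory_per_executor_gb = (memory_per_node_gb - 1) // executors_per_node  # 10

# One executor slot is given up to the YARN ApplicationMaster.
executor_instances = executors_per_node * nodes - 1  # 14

# Total executor cores times parallelism per core, or 2, whichever is larger.
default_parallelism = max(
    executor_instances * executor_cores * parallelism_per_core, 2)  # 140

print(usable_cores, executors_per_node, memory_per_executor_gb,
      executor_instances, default_parallelism)
```

Changing the assumed node count or node specs recomputes the whole table, which is useful when resizing the cluster.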

Unused resources

5     Unused cores
1     Unused memory (GB)

spark-defaults.conf

spark.default.parallelism    140   Total number of cores on all executor nodes times parallelism per core, or 2, whichever is larger.
spark.executor.memory        9g    Memory per executor minus the overhead memory.
spark.executor.instances     14    Leaving 1 executor for the YARN ApplicationMaster.
spark.executor.cores         5     Assigning executors a large number of virtual cores leads to a low number of executors and reduced parallelism, while assigning a low number of virtual cores leads to a high number of executors and a larger amount of I/O operations. We suggest 5 cores per executor to achieve optimal results in any sized cluster.
spark.driver.memory            9g    We recommend setting this to the same value as spark.executor.memory.
spark.driver.maxResultSize     9g    Should be at least 1m, or 0 for unlimited. Jobs are aborted if the total size of results exceeds this limit. A high limit can cause out-of-memory errors in the driver (depending on spark.driver.memory and the memory overhead of objects in the JVM); a proper limit protects the driver from them.
spark.driver.memoryOverhead    921m  spark.driver.memory * 0.10, with a minimum of 384 MiB.
spark.executor.memoryOverhead  921m  Additional memory allocated per executor process in cluster mode, in MiB unless otherwise specified. This accounts for things like VM overheads, interned strings, and other native overheads, and tends to grow with the executor size (typically 6-10%). This option is currently supported on YARN and Kubernetes.
spark.dynamicAllocation.enabled  false  Set this to true only if spark.dynamicAllocation.initialExecutors, minExecutors, and maxExecutors are properly determined; otherwise, we recommend calculating the resources manually for the important jobs.
spark.sql.adaptive.enabled       true   Adaptive Query Execution is disabled by default in Spark 3.0. It applies if the query is not a streaming query and contains at least one exchange (usually when there is a join, aggregate, or window operator) or one subquery. We recommend setting it to true.
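Two of the recommendations above can be checked numerically: the five-core guideline (by comparing how many executors different per-executor core counts yield) and the 921 MiB overhead (from the 10% rule). A minimal sketch, assuming the 75 usable cores and 9 GiB executor memory from the figures above:

```python
# Trade-off behind spark.executor.cores: executors obtainable from
# 75 usable cores (assumed from the example figures) at different
# per-executor core counts.
usable_cores = 75
executors = {c: usable_cores // c for c in (2, 5, 15)}
for c, n in executors.items():
    print(f"{c} cores/executor -> {n} executors")
# Few large executors reduce parallelism; many small ones multiply I/O.

# Overhead check: max(10% of executor memory, 384 MiB).
memory_mib = 9 * 1024                             # 9g heap
overhead_mib = max(int(memory_mib * 0.10), 384)   # 921
# What YARN actually reserves per container: heap + overhead.
container_mib = memory_mib + overhead_mib         # 10137, about 9.9 GiB
print(overhead_mib, container_mib)
```

The container total stays just under the 10 GB per-executor budget from the calculated resources, which is why the 9g/921m split fits.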
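The same values can also be supplied directly to spark-submit instead of spark-defaults.conf. A sketch, where the application file name is a placeholder:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 14 \
  --executor-cores 5 \
  --executor-memory 9g \
  --driver-memory 9g \
  --conf spark.driver.memoryOverhead=921m \
  --conf spark.executor.memoryOverhead=921m \
  --conf spark.default.parallelism=140 \
  --conf spark.sql.adaptive.enabled=true \
  my_app.py  # placeholder application
```

Flags passed on the command line take precedence over spark-defaults.conf, which is convenient for overriding the cluster-wide values on a per-job basis.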

Recommended configuration

Though the following parameters are not required, they can help applications run smoothly and avoid timeout and memory-related errors. We advise setting these in the spark-defaults configuration file.
spark.memory.fraction  0.8  The lower this is, the more frequently spills and cached-data eviction occur.
5
spark.rdd.compress            true  When set to true, this property saves substantial space, at the cost of some extra CPU time, by compressing the RDDs.
spark.shuffle.compress        true  When set to true, this property compresses the map output to save space.
spark.shuffle.spill.compress  true  When set to true, this property compresses the data spilled during shuffles.
spark.serializer              org.apache.spark.serializer.KryoSerializer  The default Java serialization works with any Serializable Java object but is quite slow, so we recommend using org.apache.spark.serializer.KryoSerializer and configuring Kryo serialization when speed is necessary.
spark.executor.extraJavaOptions  -XX:+UseG1GC -XX:+G1SummarizeConcMark
spark.driver.extraJavaOptions    -XX:+UseG1GC -XX:+G1SummarizeConcMark
You can choose among multiple garbage collectors for evicting old objects and placing new ones into memory. However, the newer Garbage-First Garbage Collector (G1GC) overcomes the latency and throughput limitations of the older collectors.
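One recommended value above deserves a worked example: if the 0.8 setting is spark.memory.fraction, Spark defines the unified execution-and-storage region as that fraction of the JVM heap minus a fixed 300 MiB reserve. Assuming a 9 GiB executor heap:

```python
RESERVED_MIB = 300     # fixed reservation Spark subtracts from the heap

heap_mib = 9 * 1024    # spark.executor.memory = 9g (assumption)
memory_fraction = 0.8  # the recommended fraction

# Unified region shared by execution and storage.
unified_mib = int((heap_mib - RESERVED_MIB) * memory_fraction)  # 7132
print(unified_mib)
```

So roughly 7 GiB of each 9 GiB executor is available for joins, aggregations, and cached data; the remaining 20% is left for user data structures and internal metadata, which is why lowering the fraction increases spills and cache eviction.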