Performance Tuning
Goal: Improve Spark’s performance where feasible.
- measure performance bottlenecks using new metrics, including block-time analysis
- a live demo of a new performance analysis tool
- CPU – not I/O (network) – is often a critical bottleneck
- community dogma = network and disk I/O are major bottlenecks
- a TPC-DS workload of two sizes: a 20-machine cluster with 850 GB of data, and a 60-machine cluster with 2.5 TB of data
- network is almost irrelevant for the performance of these workloads
- network optimization could only reduce job completion time by, at most, 2%
- 10 Gbps networking hardware is likely not necessary
- serialized compressed data (see the caching sketch right after this list)
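The "serialized compressed data" bullet is the CPU vs. I/O trade-off in concrete form: keeping data serialized and compressed costs CPU cycles but shrinks what has to be stored and shipped. Below is a minimal sketch of opting into that behaviour for a cached RDD, assuming a local `SparkSession`; the object name, app name, and dataset are made up for illustration, while `StorageLevel.MEMORY_ONLY_SER` and `spark.rdd.compress` are standard Spark settings.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SerializedCachingSketch extends App {
  // Local session for illustration; spark.rdd.compress additionally
  // compresses serialized RDD partitions, trading CPU for space and I/O.
  val spark = SparkSession.builder()
    .appName("serialized-compressed-caching")
    .master("local[*]")
    .config("spark.rdd.compress", "true")
    .getOrCreate()

  val numbers = spark.sparkContext.parallelize(1L to 10000000L)

  // MEMORY_ONLY_SER keeps each cached partition as one serialized
  // (and, with the setting above, compressed) buffer instead of
  // deserialized Java objects.
  numbers.persist(StorageLevel.MEMORY_ONLY_SER)

  println(numbers.count())
  spark.stop()
}
```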
From Making Sense of Spark Performance – Kay Ousterhout (UC Berkeley) at Spark Summit 2015:
- `reduceByKey` is better (see the first sketch after this list)
- mind serialization time
  - it impacts both CPU (time to serialize the data) and network (time to send the data over the wire); see the Kryo sketch after this list
- Tungsten – a recent initiative from Databricks – aims at reducing CPU time (see the DataFrame sketch after this list)
  - with CPU time reduced, jobs become more bottlenecked by I/O
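The `reduceByKey` bullet is usually read as "prefer `reduceByKey` over `groupByKey` for aggregations", because `reduceByKey` combines values on the map side before anything is shuffled. A small sketch of the contrast, assuming an existing `SparkContext` named `sc`; the variable names and sample data are made up:

```scala
// (word, 1) pairs for a word-count style aggregation.
val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(word => (word, 1))

// groupByKey ships every single (word, 1) pair across the network
// and only sums the values per key on the reduce side.
val countsViaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey first sums values per key inside each map task (a combiner),
// so far less data has to be serialized and sent over the wire.
val countsViaReduce = pairs.reduceByKey(_ + _)

countsViaReduce.collect().foreach(println)
```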
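For the serialization-time bullet, a common first step is switching from Java serialization to Kryo and registering the classes that get shuffled or cached. A sketch, assuming a hypothetical `Purchase` record type; `spark.serializer` and `registerKryoClasses` are standard Spark configuration APIs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type that ends up in shuffles and caches.
case class Purchase(userId: Long, itemId: Long, price: Double)

val conf = new SparkConf()
  .setAppName("kryo-serialization-sketch")
  .setMaster("local[*]")
  // Kryo is typically faster and more compact than Java serialization,
  // cutting both the CPU time to serialize and the bytes sent over the wire.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes up front avoids writing full class names into the stream.
  .registerKryoClasses(Array(classOf[Purchase], classOf[Array[Purchase]]))

val sc = new SparkContext(conf)
```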
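Tungsten's CPU savings (compact binary rows and, in later releases, whole-stage code generation) are mostly applied automatically when work goes through the DataFrame/Dataset API rather than raw RDD code. A brief sketch, assuming an existing `SparkSession` named `spark`; the column names and data are made up:

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._

// A tiny DataFrame standing in for a real table.
val sales = Seq(("books", 12.0), ("games", 30.0), ("books", 7.5))
  .toDF("category", "price")

// DataFrame aggregations operate on Tungsten's binary row format; in
// recent Spark versions the plan also shows whole-stage generated code.
val totals = sales.groupBy("category").agg(sum("price").as("total"))

totals.explain() // physical plan, including codegen stages where enabled
totals.show()
```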