
Performance Tuning

Goal: Improve Spark’s performance where feasible.

  • measure performance bottlenecks using new metrics, including block-time analysis

  • a live demo of a new performance analysis tool

  • CPU (not network I/O) is often the critical bottleneck

  • this contradicts the community dogma that network and disk I/O are the major bottlenecks

  • measured on a TPC-DS workload at two sizes: a 20-machine cluster with 850 GB of data, and a 60-machine cluster with 2.5 TB of data

    • network is almost irrelevant for performance of these workloads

    • network optimization could only reduce job completion time by, at most, 2%

    • 10Gbps networking hardware is likely not necessary

  • serialized compressed data

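  • reduceByKey is better than groupByKey for aggregations, because it combines values within each partition before the shuffle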

  • mind serialization time

    • affects both CPU (time to serialize) and network (time to send the data over the wire)

  • Tungsten – a recent initiative from Databricks – aims at reducing CPU time

    • as CPU time shrinks, jobs become more bottlenecked by I/O
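
The idea behind block-time analysis in the notes above can be sketched in a few lines: record how long each task was blocked on the network, subtract that time, replay the schedule, and compare. The sketch below is plain Python, not a Spark API. The function names, the greedy slot scheduler, and the task timings are all assumptions made up for illustration, and a real analysis would have to handle multiple stages and dependencies.

```python
import heapq

def job_time(task_durations, slots):
    """Greedy schedule: each task runs on whichever slot frees up first."""
    free_at = [0.0] * slots
    heapq.heapify(free_at)
    for duration in task_durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)

def max_improvement(tasks, slots):
    """tasks: list of (total_seconds, seconds_blocked_on_network).

    Returns the fraction of job time saved if tasks never blocked on the
    network, i.e. an upper bound on what network tuning could gain.
    """
    original = job_time([total for total, _ in tasks], slots)
    ideal = job_time([total - blocked for total, blocked in tasks], slots)
    return (original - ideal) / original

# Made-up task timings, for illustration only (not measured data).
tasks = [(10.0, 0.2), (12.0, 0.3), (9.0, 0.1), (11.0, 0.2)]
bound = max_improvement(tasks, slots=2)
print(f"network tuning could save at most {bound:.1%} of job time")
```

When tasks spend only a small share of their runtime blocked, the bound stays small no matter how fast the network becomes, which is the shape of the argument in the notes.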
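
The reduceByKey advice is usually explained by map-side combining: values are pre-aggregated inside each partition, so fewer records cross the shuffle. The following is a small plain-Python simulation of that effect, assuming the comparison is against a groupByKey-style aggregation. It does not call Spark, and the two function names and the sample partitions are hypothetical.

```python
from collections import defaultdict

def shuffle_group_by_key(partitions):
    """Ship every (key, value) record, then sum on the reduce side."""
    shuffled = [record for part in partitions for record in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

def shuffle_reduce_by_key(partitions):
    """Pre-combine inside each partition, then ship one record per key per partition."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

# Two hypothetical input partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("c", 1)],
]
grouped, n_grouped = shuffle_group_by_key(partitions)
reduced, n_reduced = shuffle_reduce_by_key(partitions)
print(grouped == reduced)    # True: both strategies give the same totals
print(n_grouped, n_reduced)  # 8 5: fewer records cross the shuffle
```

Both strategies produce the same result; the difference is only in how many records have to be serialized and sent over the wire, and it grows with the number of repeated keys per partition.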
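
To make the serialization bullets concrete, here is a rough way to see both costs at once: the CPU time spent serializing and compressing, and the number of bytes that would go over the wire. This is a sketch that uses Python's pickle and zlib as stand-ins. Spark itself uses JVM serializers and its own compression codecs, so the absolute numbers do not transfer, and the sample records are invented.

```python
import pickle
import time
import zlib

# Invented sample records: (id, user name, score).
records = [(i, f"user-{i % 100}", i * 0.5) for i in range(100_000)]

# CPU cost of serialization.
start = time.perf_counter()
raw = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)
serialize_s = time.perf_counter() - start

# Extra CPU cost of compression, traded for fewer bytes on the wire.
start = time.perf_counter()
packed = zlib.compress(raw, 6)
compress_s = time.perf_counter() - start

print(f"serialize: {serialize_s * 1000:.1f} ms, {len(raw)} bytes")
print(f"compress:  {compress_s * 1000:.1f} ms, {len(packed)} bytes")
```

Compression spends more CPU to save network and storage, so whether it pays off depends on which resource is the bottleneck for the job at hand.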
