Spark Memory and Optimizer

Optimizer

  • At the core of Spark SQL is the Catalyst optimizer, which leverages two Scala features:

    • Pattern matching
    • Quasiquotes (which make it easy to generate code at runtime from composable expressions)
  • Catalyst supports both rule-based and cost-based optimization.

  • It was designed for two main purposes:

    • Easily add new optimization techniques and features to Spark SQL
    • Enable external developers to extend the optimizer (e.g. by adding data-source-specific rules, support for new data types, etc.)
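The rule-based side of this can be sketched with a small constant-folding rule. This is a minimal illustration in plain Python, not Catalyst's actual code: the `Literal`, `Attribute`, and `Add` classes and the `fold_constants` function are hypothetical names chosen for this example. In Scala, the same rule would typically be written as a pattern match on the tree nodes; `isinstance` checks stand in for that here.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Literal:
    """A constant value in the expression tree."""
    value: int


@dataclass(frozen=True)
class Attribute:
    """A reference to a column, unknown until runtime."""
    name: str


@dataclass(frozen=True)
class Add:
    """The sum of two sub-expressions."""
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]


def fold_constants(expr: Expr) -> Expr:
    """Rewrite the tree bottom-up, replacing Add(Literal, Literal) with one Literal."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    # Literals and attributes are leaves, so there is nothing to rewrite.
    return expr


# x + (1 + 2) is rewritten to x + 3
plan = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = fold_constants(plan)
print(optimized)
```

Each rule only describes the sub-tree shape it cares about and leaves everything else untouched, which is what makes it straightforward to add new rules independently of the existing ones.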

More details are here

Memory

For better memory management, Spark includes Project Tungsten.

It has three main features:

  • Memory management and binary processing: manages memory explicitly and operates directly on binary data, reducing the overhead of the JVM object model and garbage collection
  • Cache-aware computation: algorithms and data structures that exploit the memory hierarchy
  • Code generation: generates code at runtime to exploit modern compilers and CPUs
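The binary-processing idea can be illustrated with a rough sketch: rows are stored back to back in one contiguous buffer with fixed field offsets, rather than as one heap object per row. The layout below (a 64-bit integer id followed by a 64-bit float score) and the names `ROW`, `pack_rows`, and `sum_scores` are assumptions made up for this example; this is not Tungsten's actual row format.

```python
import struct

# Hypothetical fixed-width row layout: 64-bit int id, 64-bit float score
# (little-endian, no padding). Not Tungsten's real format.
ROW = struct.Struct("<qd")


def pack_rows(rows):
    """Store all rows in one contiguous bytearray instead of per-row objects."""
    buf = bytearray(ROW.size * len(rows))
    for i, (row_id, score) in enumerate(rows):
        ROW.pack_into(buf, i * ROW.size, row_id, score)
    return buf


def sum_scores(buf):
    """Scan the buffer sequentially, reading fields at fixed offsets."""
    total = 0.0
    for _row_id, score in ROW.iter_unpack(buf):
        total += score
    return total


rows = [(1, 0.5), (2, 1.5), (3, 2.0)]
buf = pack_rows(rows)
print(len(buf), sum_scores(buf))
```

Because the data lives in a single buffer, there are no per-row objects for a garbage collector to track, and a sequential scan touches memory in order, which is the kind of access pattern that cache-aware algorithms rely on.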

More details are here

Kryo Serializer

Kryo is significantly faster and more compact than the standard Java serializer. It helps most in shuffles (wide transformations), where serialization is used most heavily.

It can be set up as follows:

conf.set( "spark.serializer", "org.apache.spark.serializer.KryoSerializer" )