Spark Memory and Optimizer

Optimizer

Core of Spark SQL has Catalyst optimizer which leverages 2 important Scala features
- Pattern matching
- Quasi Notes (Easy to generate code at runtime from composable expressions)
Catalyst supports both rule-based and cost-based optimization.
Designed for 2 main pruposes :
- Easily add new optimization techniques and features to Spark SQL
- Enable external developers to extend the optimizer (e.g. adding data source specific rules, support for new data types, etc)

More details are here

Memory

For better memory management - Spark included Tungsten

It has 3 basic features

Memory management and binary processing : It removes the overhead of JVM and Garbage collection
Cache aware computaion : algorithms and data structures to exploit memory hierarchy
Code generation : using code generation to exploit modern compilers and CPUs

More details are here

Kyro-Serializer

Kryo is a significantly optimized serializer, and performs better than the standard java serializer. It helps in shuffles (wide transformatiions) where mostly serialization is utilized.

It can be setup as

conf.set( "spark.serializer", "org.apache.spark.serializer.KryoSerializer" )