For Windows: please follow this site
Run Spark application -> Driver program starts -> Main function starts ->
SparkContext gets initialized -> Driver program runs the operations inside the executors on worker nodes.
SparkContext uses Py4J to launch a JVM and creates a JavaSparkContext. By default, PySpark has SparkContext available as ‘sc’, so creating a new SparkContext won’t work unless the existing one is stopped first.
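For illustration, a minimal sketch of what this looks like in the PySpark shell (assuming a local master; the exact error message varies by Spark version):

```python
from pyspark import SparkContext

# In the PySpark shell, `sc` already exists, so this raises an error
# (a ValueError in recent versions) instead of creating a second context.
try:
    SparkContext("local", "duplicate")
except ValueError as err:
    print(err)

# Only after stopping the existing context can a new one be created.
sc.stop()
sc = SparkContext("local", "fresh")
```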
Serialization is a mechanism of converting the state of an object into a byte stream. Deserialization is the reverse process where the byte stream is used to recreate the actual Java object in memory.
We need serialization because disks and network links are hardware components that understand only bytes, so Java objects must be converted to a byte stream before they can be stored or transmitted :)
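A minimal Python analogue of the round trip (using the standard pickle module rather than Java serialization):

```python
import pickle

record = {"id": 42, "name": "spark"}       # ordinary in-memory object

blob = pickle.dumps(record)                # serialization: object -> bytes
print(type(blob))                          # <class 'bytes'>

restored = pickle.loads(blob)              # deserialization: bytes -> object
print(restored == record)                  # True
```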
Excellent training from Data Flair
2 types of Spark Stages
ShuffleMapStage: An intermediate stage in the physical execution of the DAG. In Adaptive Query Planning it can also act as a final stage, saving its shuffle map output files.
ResultStage: The final stage in a Spark job; it computes the result of an action (see the sketch after this list).
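A small PySpark sketch of a job that produces both stage types (the data and app name are made up): the map/pairing work runs in a ShuffleMapStage that writes shuffle output files, and the stage that computes collect() over the shuffled data is the ResultStage.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "stages-demo")

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])

# reduceByKey forces a shuffle: everything before it runs in a ShuffleMapStage,
# and the stage computing collect() over the shuffled data is the ResultStage.
counts = (words.map(lambda w: (w, 1))
               .reduceByKey(lambda x, y: x + y)
               .collect())

print(sorted(counts))   # [('a', 3), ('b', 2), ('c', 1)]
sc.stop()
```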
Lazy evaluation in Spark means execution of a task won't start until an action is triggered. Spark has 2 types of operations: transformations and actions.
Transformations are lazy, which means the operation won't be performed until an action is triggered (as shown in the sketch below).
Major advantage of lazy evaluation
*Type safe
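A quick sketch of laziness in PySpark (app name is arbitrary): the two transformations below only build the lineage, and nothing executes until collect() is called.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "lazy-demo")

nums = sc.parallelize(range(1, 11))

squared = nums.map(lambda x: x * x)            # transformation: nothing runs yet
evens = squared.filter(lambda x: x % 2 == 0)   # still nothing runs

print(evens.collect())                         # action: triggers the whole chain
# [4, 16, 36, 64, 100]
sc.stop()
```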
Spark provides two kinds of shared variables: broadcast variables and accumulators.
Broadcast variables: They allow users to keep a read-only copy of a variable (which can be a large dataset) cached on each machine, to be used during task execution. This saves communication cost and thus speeds up the application.
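A hedged sketch of a broadcast lookup table (the data and app name are made up):

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "broadcast-demo")

# The dictionary is shipped once per executor instead of once per task.
country_names = sc.broadcast({"IN": "India", "US": "United States"})

users = sc.parallelize([("alice", "IN"), ("bob", "US")])
resolved = users.map(lambda u: (u[0], country_names.value[u[1]])).collect()

print(resolved)   # [('alice', 'India'), ('bob', 'United States')]
sc.stop()
```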
Accumulators: They can be used to implement counters (as in MapReduce) or sums. An accumulator is only “added” to through an associative and commutative operation.
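And a sketch of an accumulator used as an error counter (the input data is made up):

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "accumulator-demo")

bad_records = sc.accumulator(0)        # numeric accumulator starting at 0

def parse(line):
    try:
        return int(line)
    except ValueError:
        bad_records.add(1)             # tasks can only add; the driver reads the value
        return 0

total = sc.parallelize(["1", "2", "oops", "4"]).map(parse).sum()

print(total, bad_records.value)        # 7 1
sc.stop()
```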
Caching is an important factor in a Spark application. Cache a DataFrame whenever the data is going to be reused several times; it improves application performance and also provides recomputation points within the application.
Types of storage levels in Spark: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, plus replicated variants such as MEMORY_ONLY_2.
Note - cache() in Spark is lazily evaluated; data will be cached when the first action is called.
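A minimal sketch of caching with the DataFrame API (names and sizes are arbitrary): cache() uses the default storage level, persist() lets you pick one, and nothing is stored until the first action runs.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.range(1_000_000)

df.cache()        # lazy: nothing is cached yet
df.count()        # first action materialises the cache

evens = df.filter("id % 2 = 0").persist(StorageLevel.MEMORY_AND_DISK)
evens.count()     # caches the filtered DataFrame with the chosen storage level

df.unpersist()
evens.unpersist()
spark.stop()
```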