List: Databricks Optimization Articles | Curated by Oindrila Chakraborty

Nov 18, 2024

3 stories

Databricks Optimization Articles

Table of Contents:
1. What is Shuffle?
2. Why Data Needs to be Shuffled at All?
3. Significance of Shuffle
4. What is Shuffling Parameter?
5. How to Decide the Value of the Shuffling Parameter?
6. What are the Factors to Consider While Choosing the Value of the Shuffling Parameter?
7. How to Identify If Shuffle Has Occurred?
8. What is Shuffle Read and Shuffle Write?
9. How Shuffle Can be Reduced?

Oindrila Chakraborty

Working with Shuffle

What is Shuffle?

May 6, 2024

Table of Contents:
1. How the Size of Each Partition, of the Data to Process, is Estimated Initially in a Spark Application?
2. What is Skewness?
3. Does Data Skewness Occur Only When Partitioning a Dataset?
4. How to Identify Data Skewness in a Spark Application?
5. Data Skew Code
6. How to Solve the Data Skewness in a Spark Application?
   i) Using Skew Hint
   ii) Using Adaptive Query Execution (AQE)

Oindrila Chakraborty

Introduction to Skewness in Databricks

How the Size of Each Partition, of the Data to Process, is Estimated Initially in a Spark Application?

Nov 18, 2024

Table of Contents:
1. Recap of Parallel Processing Concept in Apache Spark
2. Recap of Spark Architecture
3. Recap of Internal Optimization Steps by Apache Spark
4. Levels of Optimization or Performance Tuning in Databricks
5. Five Most Common Performance Issues of an Application in Databricks
   i) Skew
   ii) Spill
   iii) Storage
   iv) Shuffle
   v) Serialization
6. How One Performance Issue Can Cause Another Performance Issue?
   i) Skew Can Induce Spill
   ii) Storage Issues Can Induce Excess Shuffle
   iii) Incorrectly Addressed Shuffle Can Lead to Skew

Oindrila Chakraborty

Introduction to Databricks Optimization and Performance Tuning

Recap of Parallel Processing Concept in Apache Spark

Nov 15, 2024
