Execution Plan in Databricks

What is an Execution Plan?

  • The execution plan in Databricks allows users to understand how their code will actually get executed across a cluster, and is useful for optimizing queries.
  • The execution plan translates the operations into an optimized logical plan and physical plans, and shows which operations are going to be executed and sent to the Spark executors.
  • The execution plan is made up of logical plans and physical plans.

Different Sections of the Logical Plan

The logical plan is broken down into three sections -

  1. Parsed Logical Plan (Unresolved Logical Plan): The parsed logical plan, or unresolved logical plan, is created with the flow of execution once the code has been validated syntactically. Where Apache Spark is unable to validate a table or column object, it flags that object as unresolved.
  2. Analyzed Logical Plan (Resolved Logical Plan): Using the metadata catalog, Apache Spark validates and resolves the unresolved table or column objects that were identified in the unresolved logical plan before continuing.
  3. Optimized Logical Plan: Once everything is resolved, the resolved logical plan is sent to the Catalyst Optimizer, which applies its rules and creates various optimization strategies to further optimize the plan.
    Optimization rules can consist of predicate or projection pushdown, reordering of operations, conversions, and simplification of expressions.

What is a Physical Plan?

  • The physical plan describes how the optimized logical plan is going to be executed on the cluster.
  • The Catalyst Optimizer generates multiple physical plans based on the various optimization strategies.
  • Each of the strategies is assessed through a cost model, which establishes estimates for execution time and resource utilization.
  • Using the cost-model estimates, the Catalyst Optimizer determines the optimal plan and selects it as the selected physical plan, which is executed on the cluster.

What is the Catalyst Optimizer?

  • The Catalyst Optimizer is an extensible query optimizer.
  • The Catalyst Optimizer is at the core of Spark SQL's power and speed.
  • The Catalyst Optimizer automatically finds the most efficient plan for applying the transformations and actions.

How Does the Catalyst Optimizer Optimize Queries?

The following stages occur as the input query travels through the optimization process -

  1. User Input — Using the declarative APIs of Spark SQL, DataFrames, or Datasets, users specify the queries to be processed.
  2. Unresolved Logical Plan — In this phase, Apache Spark accepts the query a user writes and creates an unresolved logical plan with the flow of execution. The plan is validated against the metadata catalog in the next step, to be sure there won't be any naming or data-type errors in tables, columns, UDFs, etc.
    The metadata catalog is a metadata repository of all tables and DataFrames.
  3. Analysis — In the analysis phase, Apache Spark validates the unresolved logical plan against the metadata catalog and then turns it into a resolved logical plan.
  4. Logical Optimization — In the logical optimization phase, the first set of optimizations takes place: the Catalyst Optimizer adjusts the resolved logical plan by applying its rules and creating various optimization strategies to make the plan as efficient as possible.
  5. Physical Planning — In the physical planning phase, the Catalyst Optimizer generates one or more physical plans for executing the query. For example -
    A. Is the Catalyst Optimizer going to pull all of the data, i.e., 100% of the data, across the network, or
    B. Is it going to use a predicate pushdown and filter the data at its source, such as a Parquet file or a JDBC connection, bringing over perhaps only 30% of the data?
    Each physical plan represents what the query engine will actually do after all of the optimizations have been applied.
  6. Cost Model — Each optimization in each physical plan provides a measurably different benefit; this is the optimization's cost model. In this phase, each physical plan is evaluated according to its own cost model, and the best-performing plan is selected. This yields the selected physical plan.
  7. Whole-Stage Code Generation — In this phase, the Catalyst Optimizer converts the selected physical plan to RDDs and then generates the bytecode, so that the selected physical plan can be executed on the cluster.

Limitations of the Catalyst Optimizer

  • The Catalyst Optimizer in Spark 2.x applies optimizations throughout the logical optimization and physical planning stages. It generates a selection of physical plans and selects the most efficient one.
  • These are rule-based optimizations, and while they generally improve query performance, they are all based on estimates and statistics that are generated before runtime. Hence, unanticipated problems or tuning opportunities may appear as the query runs.



Oindrila Chakraborty
