How to “Optimize” the “Delta Tables” in Databricks
6 min read · Sep 15, 2023
How Do “Delta Tables” Become “Slow” to “Query”?
- “Delta Tables” can “Grow” “Very Large Over Time”, leading to “Slow Query Performance” and “Increased Storage Costs”.
How to “Speed Up” the “Query Performance” of the “Delta Tables”?
- Keeping the “Delta Tables” “Fast” and “Clean” is important for “Maintaining” the “Efficiency” of “Data Pipelines”.
- There are “Several Operations” and “Trade-Offs” that can “Positively Influence” the “Speed” of the “Delta Tables”.
A. “Keep” the “Delta Tables” “Clean” Using the “VACUUM” Command
- The “VACUUM” Command “Deletes” the “Data Files” that are “No Longer Required” by a “Delta Table”.
- The “VACUUM” Command “Deletes” the “Data Files” “Based On” a “Table Property”, i.e., “delta.deletedFileRetentionDuration”, which “Defaults” to “7 Days”.
- Every time a “VACUUM” Command is “Requested”, “Two New Commits” are “Generated” in the “Transaction Log”, containing the “VACUUM START”, and, “VACUUM END” Operations, respectively.
- Example — If there are “16 Data Files” for a “Delta Table”, “All” of the “Data Files” are “Older Than” the “Default Retention Interval”, but “Only 8” of the “Data Files” are “Referenced” in the “Delta Log”, then, in theory, if the “VACUUM” Command is “Run”, the “Other 8 Data Files”, which are “Not Referenced” in the “Delta Log”, would be “Cleaned”, as sketched below.
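For illustration, below is a minimal sketch of running the “VACUUM” Command (reusing the sample table name that appears later in this article; “DRY RUN” only previews which files would be deleted) -
%sql
-- Preview which "Data Files" would be "Removed", without actually "Deleting" anything
VACUUM retailer_db.brnz_StudentWithCdcEnabled DRY RUN;

-- "Delete" the "Unreferenced Data Files" that are "Older Than" the "Retention Interval" (default "7 Days")
VACUUM retailer_db.brnz_StudentWithCdcEnabled;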
Important Aspect of the “VACUUM” Command
- The “Timestamps” that are “Internally Used” by the “VACUUM” Command are “Not” the “Ones” that are “Referenced” in the “add” Action of the “Transaction Log Files”, but rather the “modificationTime” of the “Files”.
- The “Reason” for that is to “Avoid Reading” a “Huge Number” of “Data Files” to “Find Which Data Files” should be “Selected” for “Deletion”.
Things to Remember When “Copying” the “Data Files” of a “Delta Table”
- It needs to be made sure to “Keep” the “modificationTime” “Intact” when “Copying”, or, “Migrating” the “Delta Tables”.
- If the “Developers” just “Copy” “All” the “Data Files” of a “Delta Table” “Without Keeping” the “modificationTime” “Intact”, the “modificationTime” would be the “Current Timestamp”, and “None” of the “Data Files” would be “Selected” for “Removal”, at least “Until 7 Days Have Passed”.
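One way to side-step this problem entirely (an alternative approach, not prescribed above; the target table name is just an example) is to let Databricks “Clone” the “Delta Table” instead of copying the “Data Files” by hand -
%sql
-- "Clone" the "Delta Table" (data plus metadata) into a new table,
-- so the "Delta Log" and the file metadata stay consistent
CREATE OR REPLACE TABLE retailer_db.brnz_StudentWithCdcEnabled_copy
DEEP CLONE retailer_db.brnz_StudentWithCdcEnabled;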
How to “Disable” the “Default Retention Interval” Check?
- It is possible to “Disable” the “Safety Check” on the “Default Retention Interval” by “Setting” the “Spark Configuration”, i.e., “spark.databricks.delta.retentionDurationCheck.enabled” to “false”.
- “Disabling” this “Check” should be “Used With Caution”, as it “Risks Corrupting” the “Delta Table” if any “Other Active Reader”, or, “Writer” Depends on the “Data” that is being “Deleted”, as sketched below.
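Below is a minimal sketch of what this looks like (the “24 Hours” retention is just an illustrative value; only do this on a table with no other active readers or writers) -
%sql
-- "Disable" the "Safety Check" on the retention interval (use with caution)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- "Vacuum" with a "Retention Interval" shorter than the default "7 Days"
VACUUM retailer_db.brnz_StudentWithCdcEnabled RETAIN 24 HOURS;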
When Should the “VACUUM” Command Be Used?
- The “VACUUM” Jobs are a “Must” for “Storage Cost Reduction”. However, the “Developers” need to “Make Sure” to “Schedule” the “VACUUM” Jobs “Regularly”, as the “VACUUM” Jobs are “Not Scheduled” by “Default”.
- The “VACUUM” Jobs do “Not Affect” any “Running Jobs”.
- In addition to this, the “Developers” need to “Make Sure” to “Tweak” the “Data File Retention Value” for “As Long As” they would want to “Time Travel”, and, have the “modificationTime” “In Mind” when “Migrating” the “Delta Tables”.
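For example (a sketch; the “30 Days” value is just an assumption about how far back “Time Travel” should work on the sample table), the retention can be raised through the “Table Properties” -
%sql
-- Keep "Removed Data Files" and "Log Entries" for "30 Days" so "Time Travel" works that far back
ALTER TABLE retailer_db.brnz_StudentWithCdcEnabled
SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 30 days',
  'delta.logRetentionDuration' = 'interval 30 days'
);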
B. “Keep” the “Delta Tables” “Clean” Using the “OPTIMIZE” Command
- The “OPTIMIZE” Command “Combines” the “Small Data Files” into “Larger Data Files”, while “Keeping” “All” the “Data” “Intact”, and, “Re-Calculating” the “Delta Statistics”.
- The “OPTIMIZE” Command can greatly “Improve” the “Query Performance”, especially if the “Data” is “Written” using a “Streaming Job”, where, “Depending” on the “Trigger Interval”, a “Lot” of “Small Data Files” can be “Generated”.
How to “Change” the “File Size” of the “Table” on Which the “OPTIMIZE” Command is “Run”?
- The “Target File Size” of the “Delta Table” on which the “OPTIMIZE” Command is “Run” can be “Changed” by “Tweaking” the “Spark Configuration”, i.e., “delta.targetFileSize”.
- “Setting” a “Value” to the “Spark Configuration”, i.e., “delta.targetFileSize” does “Not Guarantee” that “All” the “Data Files” will “End Up With” the “Specified Size”.
- The “OPTIMIZE” Command will make a “Best-Effort Attempt” to be “True” to the “Target File Size”, but, it “Heavily Depends” on the “Amount” of “Data Processed”, as well as the “Parallelism”.
# Python - set the "Target File Size" to "80 MB" before running "OPTIMIZE"
spark.conf.set("spark.databricks.delta.targetFileSize", "80m")

%sql
OPTIMIZE retailer_db.brnz_StudentWithCdcEnabled;
When Should the “OPTIMIZE” Command Be Used?
- The “OPTIMIZE” Jobs should be “Scheduled Regularly” since the “Query Reads” can “Heavily Benefit” from having “Fewer Data Files” to “Read”.
- “Databricks” “Recommends Running” the “OPTIMIZE” Jobs Daily, but it really “Depends” on the “Frequency” of the “Updates”.
“Drawbacks” of the “OPTIMIZE” Command
- It needs to be “Kept” in “Mind” that the “OPTIMIZE” Command can “Take Some Time”, and, will “Increase” the “Data Processing Costs”.
- Although, with the “OPTIMIZE” Command, the “Scan Time” is “Improved”, since “Fewer Data Files” are being “Read”, the “Whole Dataset” is still being “Read”.
C. “Keep” the “Delta Tables” “Clean” Using the “OPTIMIZE ZORDER BY” Command
- The “Z-Ordering” is a “Technique” that is “Used” to “Co-Locate” the “Related Information” in the “Same Set” of “Data Files”.
- When the “Data Files” are “Written” to a “Delta Table”, the “min”, “max”, “null count”, and “record count” “Statistics” are “Automatically Added” in a “stats” Field on the “add” Action.
- These “Statistics” are “Used” for “Data-Skipping” when “Querying” the “Delta Table”.
- “Data-Skipping” is an “Optimization” that “Aims” to “Skip” the “Irrelevant Data Files” when “Executing” the “Queries” Containing the “WHERE” Clauses.
- By default, the “Statistics” of the “First 32 Columns” of a “Dataset” are “Collected”. This can be “Changed” by “Setting” the “Table Property”, i.e., “delta.dataSkippingNumIndexedCols” to the “Desired Number”, as sketched below.
- It needs to be “Kept” in “Mind” that “Tweaking” the “Table Property”, i.e., “delta.dataSkippingNumIndexedCols” can “Affect” the “Write Performance”, especially for “Long String Columns”. It is “Advised” to “Move” the “Long String Columns” to the “End” of the “Schema”, and, “Set” the “Property” to a “Number” that is “Lower Than” its “Index”.
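As a sketch (the value “5” is just an illustrative column count, applied to the sample table from this article) -
%sql
-- Collect "Statistics" only for the first "5" columns of the table
ALTER TABLE retailer_db.slvr_StudentWithCdcEnabled
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '5');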
“Difference” Between “OPTIMIZE” and “OPTIMIZE Z-ORDER BY” Command
- In the “OPTIMIZE” Command, even though the “Statistics” are “Collected”, those can’t really be “Used”, and, “All” the “Files” are “Read”. That’s because there is “No Explicit Ordering”, and the “Data” is basically “Randomized” between “All” the “Files”.
- By “Adding” a “ZORDER BY <Column Name>” Clause to the “OPTIMIZE” Command, this “Issue” can be easily “Solved”, as sketched below.
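Below is a minimal sketch on the sample table (assuming “Marks” is a “High-Cardinality” Column that is frequently used in “WHERE” Clauses) -
%sql
-- Re-write the data files so that rows with similar "Marks" values are co-located,
-- enabling "Data-Skipping" for queries that filter on "Marks"
OPTIMIZE retailer_db.slvr_StudentWithCdcEnabled
ZORDER BY (Marks);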
“Drawbacks” of the “OPTIMIZE ZORDER BY” Command
- Even though the “Z-Ordering” for “Data-Skipping” looks to be a game changer, it must be “Used Correctly” in order to be “Efficient”.
Below are some of the “Key Considerations” that “Must” be “Kept” in “Mind” when “Using” the “Z-Ordering” -
1. “Z-Ordering” is “Only Suited” for “Columns” with “High Cardinality”. If the “Column” has a “Low Cardinality”, there will be “No Benefit” from the “Data-Skipping”.
2. It is possible to “Specify” “Multiple Columns” in the “ZORDER BY” Clause, but, the “Effectiveness” of the “Data-Skipping” “Decreases” with “Each Extra Column”.
3. The “Developers” need to “Make Sure” to “Use Only” those “Columns” in the “ZORDER BY” Clause for which the “Statistics” are “Collected”.
D. “Keep” the “Delta Tables” “Clean” Using the “Partitioning”
- While “Z-Ordering” “Groups” the “Data” with “Similar Values” into the “Same Data Files”, the “Partitioning” “Groups” the “Data Files” under the “Same Folder”.
- “Contrary” to “Z-Ordering”, “Partitioning” “Works Best” with the “Low-Cardinality Columns”. If the “High-Cardinality Columns” are “Chosen”, then a “Huge Number” of “Partitions” can be “Generated” with a “Lot” of “Small Data Files”, which “Results” in “Performance Issues”.
- The “Grouping” of the “Data Files” under the “Same Folder” by the “Partitioning” is rather “Useful” for “Columns” that have “Low Cardinality”, and are “Very Often Used” in the “WHERE” Clauses in “Large Tables”.
%sql
-- Re-create the table partitioned by the low-cardinality "Subject" column
CREATE OR REPLACE TABLE retailer_db.slvr_StudentWithCdcEnabled
USING DELTA
PARTITIONED BY (Subject)
AS
SELECT * FROM retailer_db.slvr_StudentWithCdcEnabled;
Use “ZORDER” and “Partitioning” Together
- The “OPTIMIZE” Command can be “Paired” with a “WHERE” Clause on a “Partition Column”, if it is “Required” to “Optimize” only a “Subset” of the “Data”.
%sql
-- Set the "Target File Size" of the "Delta Table" to "40 MB"
SET spark.databricks.delta.targetFileSize = 40m;

-- "Optimize" only the "Math" partition, and "Z-Order" it by "Marks"
OPTIMIZE retailer_db.slvr_StudentWithCdcEnabled
WHERE Subject = 'Math'
ZORDER BY (Marks);