Introduction to Delta Lake in Databricks
Disadvantages of Storing Data in a Data Lake
In a Data Lake, such as Azure Data Lake Storage Gen2 in Azure, the underlying data of a table is kept as multiple files of any format, which makes it very difficult to perform the following operations on those files:
- Merge operations
- Updating a record that may be present in any of those files
- Metadata handling
- Version control, etc.
What is Delta Lake?
- Delta Lake is an optimized storage layer on top of the Data Lake.
- Delta Lake brings reliability to the Data Lake, so that any operation that can be performed on a traditional database can also be performed on tables whose underlying data is stored in the Delta Lake.
In What File Format Is the Data Stored in a Delta Lake?
In a Data Lake, the data can be stored in any format, such as CSV, JSON, etc.
- But when data is written to a Delta Lake, it is stored in the form of a Delta table.
- Although a Delta table is not actually a table, it is called one because it has all the features of a relational database table.
- The underlying data of a Delta table is stored in the compressed Parquet file format, i.e., as snappy.parquet files.
From Databricks Runtime 8.0 onwards, any table created in Databricks is a Delta table by default.
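This can be verified by listing the folder that backs a Delta table: it contains the snappy-compressed Parquet data files together with a _delta_log folder holding the transaction log. A minimal sketch, assuming a Delta table has already been written to the path used later in this article:
# List the folder backing a Delta table; it holds the snappy.parquet data files
# and the "_delta_log" folder with the JSON transaction log.
deltaTablePath = "dbfs:/FileStore/tables/training/delta-tables-folder/tbl_demo_customer_external_without_name"
display(dbutils.fs.ls(deltaTablePath))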
Features of Delta Lake
- 1. ACID Transactions: Although a Delta table is not a relational database table, Delta Lake brings the concept of ACID transactions to Spark.
- 2. Scalable Metadata Handling
- 3. Streaming and Batch Unification: A Delta table in a Delta Lake can serve as a batch table, as well as a streaming source and sink.
- 4. Schema Enforcement: Delta Lake automatically handles schema variations in the data being inserted into Delta tables, preventing the insertion of bad records.
- 5. Time Travel: Data versioning in Delta Lake enables rollbacks and a full historical audit trail.
- 6. Upserts and Deletes: Delta Lake supports the merge, update, and delete operations (Time Travel and Upserts are illustrated in the sketch after this list) to enable complex use cases, such as:
- Change Data Capture (CDC)
- Slowly Changing Dimensions (SCD)
- Streaming upserts, and so on.
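As a quick illustration of the Time Travel and Upsert features, the sketch below queries an older version of a Delta table and then merges a small set of changed rows back into it. It uses the managed table created later in this article and assumes ca_address_sk is the key column of the customer address data; adjust the names to your own table.
from pyspark.sql.functions import lit
from delta.tables import DeltaTable

# Time Travel: query an earlier version of the Delta table by version number.
df_version_0 = spark.sql("SELECT * FROM tbl_demo_customer_managed VERSION AS OF 0")
display(df_version_0)

# Upsert: build a small, hypothetical set of changed rows (a few existing rows
# with an updated "ca_city" value), then MERGE them back into the Delta table.
df_updates = spark.table("tbl_demo_customer_managed")\
    .limit(5)\
    .withColumn("ca_city", lit("Phoenix"))

deltaTable = DeltaTable.forName(spark, "tbl_demo_customer_managed")
deltaTable.alias("t")\
    .merge(df_updates.alias("s"), "t.ca_address_sk = s.ca_address_sk")\
    .whenMatchedUpdateAll()\
    .whenNotMatchedInsertAll()\
    .execute()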
Create a Delta Table
df_ReadCustomerAddress = spark.read\
.option("header", "true")\
.option("sep", "|")\
.option("inferSchema", "true")\
.csv("dbfs:/FileStore/tables/retailer/data/customer_address.dat")
display(df_ReadCustomerAddress)
1. Create an External Delta Table
When an external Delta table is created, the location of its underlying data is provided by the user.
A. Create an External Delta Table Without a Name
df_ReadCustomerAddress.write\
.format("delta")\
.mode("append")\
.save("dbfs:/FileStore/tables/training/delta-tables-folder/tbl_demo_customer_external_without_name")
deltaTablePath = "dbfs:/FileStore/tables/training/delta-tables-folder/tbl_demo_customer_external_without_name"
dfTblDemoCustomerExternal = spark.read\
.format("delta")\
.load(deltaTablePath)
display(dfTblDemoCustomerExternal)
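Because this Delta table is not registered under a name in the metastore, it can also be queried directly by its path from SQL, using the delta.`<path>` syntax:
%sql
SELECT * FROM delta.`dbfs:/FileStore/tables/training/delta-tables-folder/tbl_demo_customer_external_without_name`;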
B. Create an External Delta Table With a Name
df_ReadCustomerAddress.write\
.format("delta")\
.mode("append")\
.option("path", "dbfs:/FileStore/tables/training/delta-tables-folder/tbl_demo_customer_external_with_name")\
.saveAsTable("tbl_demo_customer_external")
dfTblDemoCustomerExternal = spark.table("tbl_demo_customer_external")
display(dfTblDemoCustomerExternal)
2. Create a Managed Delta Table
- When a managed Delta table is created, the location of its underlying data is determined by Databricks.
- The user only needs to provide the name of the managed Delta table.
df_ReadCustomerAddress.write\
.format("delta")\
.saveAsTable("tbl_demo_customer_managed")
dfTblDemoCustomerManaged = spark.table("tbl_demo_customer_managed")
display(dfTblDemoCustomerManaged)
To get the location of a managed Delta table, run the DESCRIBE DETAIL command on it.
%sql
DESCRIBE DETAIL tbl_demo_customer_managed;
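The same information can also be fetched from PySpark; a minimal sketch that selects only the location column of the DESCRIBE DETAIL output:
# Fetch only the storage location that Databricks chose for the managed Delta table.
locationRow = spark.sql("DESCRIBE DETAIL tbl_demo_customer_managed")\
    .select("location")\
    .collect()[0]
print(locationRow["location"])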
Query a Delta Table
- Once the content of a Delta table has been read into a DataFrame, any content from the Delta table can be displayed using the select operation on that DataFrame.
from pyspark.sql.functions import col
dfResult = dfTblDemoCustomerManaged.select("*")\
.filter(col("ca_city").startswith("P"))
display(dfResult)
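Since this managed Delta table is registered in the metastore, the same result can also be produced directly in SQL:
%sql
SELECT * FROM tbl_demo_customer_managed
WHERE ca_city LIKE 'P%';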
Drop a Delta Table
1. Drop an External Delta Table
- When an external Delta table is dropped with the standard SQL command DROP TABLE <table-name>, the user needs to manually delete the underlying data, which is stored as compressed Parquet files, from the location that was provided when the external Delta table was created.
%sql
DROP TABLE IF EXISTS tbl_demo_customer_external
deltaTablePath = "dbfs:/FileStore/tables/training/delta-tables-folder/tbl_demo_customer_external"
dbutils.fs.rm(deltaTablePath, recurse = True)
2. Drop a Managed Delta Table
- When a managed Delta table is dropped with the standard SQL command DROP TABLE <table-name>, the underlying data is deleted by Databricks.
- The user does not need to manually delete anything.
%sql
DROP TABLE IF EXISTS tbl_demo_customer_managed
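To confirm that the managed Delta table is gone, the catalog can be checked; this sketch assumes a runtime where spark.catalog.tableExists is available (Spark 3.3 and later):
# Returns False after the DROP, because Databricks removed both the metadata
# and the underlying data files of the managed Delta table.
print(spark.catalog.tableExists("tbl_demo_customer_managed"))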