Introduction to Delta Lake in Databricks

Oindrila Chakraborty
Aug 15, 2023 · 5 min read


Disadvantages of Storing Data in a Data Lake

In a Data Lake, such as Azure Data Lake Storage Gen2 in Azure, the underlying data of a table is kept in the form of multiple files of any format. This makes it very difficult to perform the following operations on those files:

  • Merge operations
  • Updating a record, which may be present in any of those files
  • Metadata handling
  • Version control, etc.

What is Delta Lake?

  • Delta Lake is an optimized storage layer on top of the Data Lake.
  • Delta Lake brings reliability to the Data Lake, so that any operation that can be performed on a traditional database can also be performed on tables whose underlying data is stored in the Delta Lake.

In What File Format is the Data Stored in a Delta Lake?

In a Data Lake, the data can be stored in any format, like CSV, JSON, etc.

  • But when data is stored in a Delta Lake, it is stored in the form of a Delta Table.
  • A Delta Table is not actually a table; it is called so because a Delta Table has all the features of a relational database table.
  • The underlying data of a Delta Table is stored in the compressed Parquet file format, i.e., as snappy-compressed Parquet (.snappy.parquet) files (see the sketch below).

From Databricks Runtime 8.0 onwards, any table that is created in Databricks is a Delta Table by default.
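As a quick check of this, the files backing a Delta Table can be listed directly. The following is a minimal sketch; the path is the external Delta Table location that is created later in this article, so treat it as a placeholder. Any Delta Table location shows .snappy.parquet data files alongside a _delta_log folder that holds the transaction log.

# Placeholder path: the external Delta Table location created later in this article
deltaTablePath = "dbfs:/FileStore/tables/training/delta-tables-folder/tbl_demo_customer_external_without_name"

# Data files are snappy-compressed Parquet; the "_delta_log" folder stores the JSON transaction log
for fileInfo in dbutils.fs.ls(deltaTablePath):
    print(fileInfo.name)
List the Underlying Files of a Delta Table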

Features of Delta Lake

  • 1. ACID Transactions: Although a Delta Table is not a relational database table, Delta Lake introduces the concept of ACID transactions on Spark.
  • 2. Scalable Metadata Handling
  • 3. Streaming and Batch Unification: A Delta Table in a Delta Lake can be a batch table, as well as a streaming source and sink.
  • 4. Schema Enforcement: Delta Lake automatically handles schema variations in the data being inserted into the Delta Tables to prevent the insertion of bad records.
  • 5. Time Travel: Data versioning in the Delta Lake enables rollbacks and a full historical audit trail.
  • 6. Upserts and Deletes: Delta Lake supports the Merge, Update, and Delete operations to enable complex use-cases (see the sketch after this list), like -
  • Change Data Capture (CDC)
  • Slowly Changing Dimensions (SCD)
  • Streaming Upserts, and so on.
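A minimal sketch of two of these features, Time Travel and Merge (upsert), is shown below. It assumes the managed Delta Table tbl_demo_customer_managed that is created later in this article, and that the table has a ca_address_sk key column; the table name, key column, source DataFrame, and version number are all placeholders.

# Time Travel: read the Delta Table as it was at an earlier version (version 0 is a placeholder)
dfVersionZero = spark.sql("SELECT * FROM tbl_demo_customer_managed VERSION AS OF 0")
display(dfVersionZero)

# Upsert (Merge): apply rows from a source DataFrame into the Delta Table
from delta.tables import DeltaTable

# Placeholder source of changed rows; in a real use-case this would be incoming CDC data
dfChanges = spark.table("tbl_demo_customer_managed").limit(10)

deltaTable = DeltaTable.forName(spark, "tbl_demo_customer_managed")

deltaTable.alias("target")\
    .merge(dfChanges.alias("source"), "target.ca_address_sk = source.ca_address_sk")\
    .whenMatchedUpdateAll()\
    .whenNotMatchedInsertAll()\
    .execute()
Time Travel and Merge on a Delta Table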

Create a Delta Table

df_ReadCustomerAddress = spark.read\
.option("header", "true")\
.option("sep", "|")\
.option("inferSchema", "true")\
.csv("dbfs:/FileStore/tables/retailer/data/customer_address.dat")
display(df_ReadCustomerAddress)
Read a CSV File and Create a DataFrame

1. Create an External Delta Table

When an External Delta Table is created, the location of the underlying data for the External Delta Table is provided by the user.

A. Create an External Delta Table Without a Name

df_ReadCustomerAddress.write\
.format("delta")\
.mode("append")\
.save("dbfs:/FileStore/tables/training/delta-tables-folder/tbl_demo_customer_external_without_name")
Create an External Delta Table from the DataFrame, Without Giving a Name to the External Delta Table
deltaTablePath = "dbfs:/FileStore/tables/training/delta-tables-folder/tbl_demo_customer_external_without_name"

dfTblDemoCustomerExternal = spark.read\
.format("delta")\
.load(deltaTablePath)

display(dfTblDemoCustomerExternal)
Read an External Delta Table that was Created Without a Name

B. Create an External Delta Table With a Name

df_ReadCustomerAddress.write\
.format("delta")\
.mode("append")\
.option("path", "dbfs:/FileStore/tables/training/delta-tables-folder/tbl_demo_customer_external_with_name")\
.saveAsTable("tbl_demo_customer_external")
Create an External Delta Table from the DataFrame, With a Name for the External Delta Table
dfTblDemoCustomerExternal = spark.table("tbl_demo_customer_external")

display(dfTblDemoCustomerExternal)
Read an External Delta Table that was Created With a Name

2. Create a Managed Delta Table

  • When a Managed Delta Table is created, the location of the underlying data for the Managed Delta Table is determined by Databricks.
  • The user only needs to provide the name of the Managed Delta Table.
df_ReadCustomerAddress.write\
.format("delta")\
.saveAsTable("tbl_demo_customer_managed")
Create a Managed Delta Table from the DataFrame
dfTblDemoCustomerManaged = spark.table("tbl_demo_customer_managed")

display(dfTblDemoCustomerManaged)
Read the Managed Delta Table

To get the location of a Managed Delta Table, use the DESCRIBE DETAIL command on the Managed Delta Table.

%sql
DESCRIBE DETAIL tbl_demo_customer_managed;
Get the Location of a Managed Delta Table
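The same information is also available from PySpark. A minimal sketch, assuming the managed Delta Table created above: DESCRIBE DETAIL returns a one-row DataFrame, and its location column holds the storage path.

# DESCRIBE DETAIL returns a one-row DataFrame; its "location" column holds the storage path
managedTableLocation = spark.sql("DESCRIBE DETAIL tbl_demo_customer_managed")\
    .select("location")\
    .collect()[0]["location"]

print(managedTableLocation)
Get the Location of a Managed Delta Table from PySpark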

Query a Delta Table

  • Once the content of a Delta Table is read into a DataFrame, any content from the Delta Table can be displayed using the select operation on the DataFrame.
from pyspark.sql.functions import *

dfResult = dfTblDemoCustomerManaged.select("*")\
.filter(col("ca_city").startswith("P"))

display(dfResult)
Display Only the Rows from the Delta Table Where the Value of the Column ca_city Starts With “P”

Drop a Delta Table

1. Drop an External Delta Table

  • When an External Delta Table is dropped using the standard SQL command, i.e., DROP TABLE 'table-name', the user needs to manually delete the underlying data, which is stored as compressed Parquet files, from the location provided at the time of creating the External Delta Table.
%sql
DROP TABLE IF EXISTS tbl_demo_customer_external
Drop an External Delta Table
deltaTablePath = "dbfs:/FileStore/tables/training/delta-tables-folder/tbl_demo_customer_external_with_name"
dbutils.fs.rm(deltaTablePath, recurse = True)
Remove the Underlying Data of the External Delta Table

2. Drop a Managed Delta Table

  • When a Managed Delta Table is dropped using the standard SQL command, i.e., DROP TABLE 'table-name', the underlying data is deleted by Databricks.
  • The user does not need to manually delete anything.
%sql
DROP TABLE IF EXISTS tbl_demo_customer_managed
Drop a Managed Delta Table
