Table of Contents:
1. What is a “Slowly Changing Dimension”?
2. Different Methods of Handling a “Slowly Changing Dimension”
i) SCD Type 1
ii) SCD Type 2
iii) SCD Type 3
3. What is the “Merge”, or “Upsert”, Operation?
4. Challenges of Using the “Merge” Operation in a “Data Lake”
5. Did the Usage of “Delta Lake” Actually Solve the Problem of Performing the “Merge” Operation?
6. Disadvantage of Using “Spark SQL” to Perform the “Merge” Operation
7. Why Is It Necessary for Developers to Know the Usage of the “Merge” Operation in “PySpark”?
8. Create the “Target Delta Table”
i) Create the Database “training”
ii) Create an Empty “Delta Table”
iii) Display the Content of the “Delta Table” (Currently Empty)
9. Perform the “Merge” Operation Using “Spark SQL”
A) First Load
I) Step 1: Create the Source for the First Load
II) Step 2: Perform the “Merge” Operation
B) Second Load
I) Step 1: Create the Source for the Second Load
II) Step 2: Perform the “Merge” Operation
10. Perform the “Merge” Operation Using “PySpark”
A) Third Load
I) Step 1: Create a “Delta Table Instance” Using the “DeltaTableBuilder API”
i) What is a “Delta Table Instance”?
ii) What is the Usage of a “Delta Table Instance”?
iii) How to Convert a “Delta Table Instance” to a “DataFrame”?
II) Step 2: Create the Source for the Third Load
III) Step 3: Perform the “Merge” Operation
B) Fourth Load
I) Step 1: Create a “Delta Table Instance” Using the “DeltaTableBuilder API”
II) Step 2: Create the Source for the Fourth Load
III) Step 3: Perform the “Merge” Operation
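As a minimal illustration of items 8–10 above, the sketch below performs the same upsert first through Spark SQL and then through the PySpark DeltaTable API. It assumes a Databricks-style environment where spark is predefined; the table training.customers and the columns id and name are hypothetical stand-ins.

```python
# Minimal sketch of a Delta merge (upsert), assuming a Databricks/Delta
# runtime. The database name "training" comes from the outline; the
# table and column names ("customers", "id", "name") are hypothetical.
from delta.tables import DeltaTable

source_df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
source_df.createOrReplaceTempView("source")

# Option 1: merge using Spark SQL.
spark.sql("""
    MERGE INTO training.customers AS target
    USING source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Option 2: the same merge using the PySpark DeltaTable API.
target = DeltaTable.forName(spark, "training.customers")
(
    target.alias("t")
    .merge(source_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```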
Table of Contents:
1. What is “Join Selection Strategy”?
2. Why is Learning About “Join Selection Strategies” Important?
3. In Which Phase is the “Join Selection Strategy” Selected?
4. How Many Types of “Join Selection Strategies” are There in “Apache Spark”?
5. How Does “Apache Spark” Decide Which “Join Selection Strategy” to Choose?
6. Dataset Used
7. The “Join Selection Strategies” in “Apache Spark”
A) Sort Merge Join
I) Steps:
i) Phase 1: “Shuffle”
ii) Phase 2: “Sort”
iii) Phase 3: “Merge”
II) For What Type of “Join” Can the “Sort Merge Join” be Used as the “Join Selection Strategy”?
III) How to Disable the “Sort Merge Join” as the Default “Join Selection Strategy”?
B) Shuffle Hash Join
I) Steps:
i) Phase 1: “Shuffle”
ii) Phase 2: “Hash Table Creation”
iii) Phase 3: “Hash Join”
II) Why is the “Shuffle Hash Join” Termed an “Expensive Operation”?
III) When Can the “Shuffle Hash Join” Work as the “Join Selection Strategy”?
IV) For What Type of “Join” Can the “Shuffle Hash Join” be Used as the “Join Selection Strategy”?
C) Broadcast Hash Join
I) Default Size of the “Broadcast Threshold Limit”
II) Can the Default Size of the “Broadcast Threshold Limit” be Changed to Any Other Value?
III) How to Disable the “Broadcast Threshold Limit”?
IV) Why Does Less Shuffling Occur in a “Broadcast Hash Join”?
V) What Happens When Both “DataFrames” in a “Join” are Larger Than the “Broadcast Threshold Limit” and a “Broadcast Hash Join” is Used?
VI) How to Explicitly Define the “Broadcast Hash Join” as the “Join Selection Strategy” During the “Join” Operation?
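A minimal sketch of steering these strategies follows. The configuration keys are standard Apache Spark settings; the DataFrames and the join key are hypothetical.

```python
# Minimal sketch of influencing Spark's join selection strategy.
from pyspark.sql.functions import broadcast

# The default broadcast threshold is 10 MB (10485760 bytes);
# setting it to -1 disables size-based broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

# Prefer Shuffle Hash Join over Sort Merge Join where possible.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

large_df = spark.range(1_000_000).withColumnRenamed("id", "key")
small_df = spark.range(100).withColumnRenamed("id", "key")

# Explicitly request a Broadcast Hash Join with a hint.
joined = large_df.join(broadcast(small_df), "key")
joined.explain()  # the physical plan shows the chosen join strategy
```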
Table of Contents:
1. Why is a “Broadcast Variable” Used?
2. What is a “Broadcast Variable”?
3. When is the Usage of a “Broadcast Variable” Most Suitable?
4. What Type of “Table” can be Sent Using a “Broadcast Variable”?
5. Can a “Large Table” be Sent Using a “Broadcast Variable”?
6. Can the Size of the “Table” Sent Using a “Broadcast Variable” be Larger than the Memory of the “Driver Node”?
7. “Internal Spark Architecture” of Normal “Parallel Processing”
8. Downside of Normal “Parallel Processing”
9. “Internal Spark Architecture” of “Parallel Processing” With a “Broadcast Variable”
10. Does the Data Sent Through a “Broadcast Variable” Get Cached in the “Worker Nodes” of a “Cluster”?
11. How Does Using a “Broadcast Variable” in “Joins” Improve Performance?
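The sketch below shows the basic broadcast-variable pattern from this outline: a small lookup table shipped once to each worker and cached there. The dictionary and column names are hypothetical.

```python
# Minimal sketch of a broadcast variable, assuming a small lookup
# table that fits in memory. Names are hypothetical.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

lookup = {"IN": "India", "US": "United States", "DE": "Germany"}

# Ship one read-only copy of the lookup to each worker node, where it
# is cached for reuse across tasks.
bc_lookup = spark.sparkContext.broadcast(lookup)

df = spark.createDataFrame([(1, "IN"), (2, "US")], ["id", "country_code"])

@udf(StringType())
def country_name(code):
    # Each task reads the local cached copy instead of having the
    # dictionary re-shipped with every task.
    return bc_lookup.value.get(code, "Unknown")

df.withColumn("country", country_name("country_code")).show()
```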
Table of Contents:
1. What is the Importance of a “Partition”?
2. Why is a “Partition Strategy” Needed?
A) Choosing the Right Number of “Partitions”
B) Choosing the Right Size of a “Partition”
3. How to Choose the Right Number of “Partitions”?
4. How Do “Unevenly Distributed Partitions” Actually Decrease the Performance of a “Spark Application”?
5. The Default Number of “Partitions” for Data Generated in the “Spark Environment”
6. The Default Number and Size of “Partitions” for Data Read from “External Storage”
7. Can the Default Number of “Partitions” for Data Read from “External Storage” be Changed?
8. For Which Type of “Files” Read from “External Storage” is the Default Number of “Partitions” Not Applicable?
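A minimal sketch of inspecting and controlling partitions follows. The configuration keys are standard Apache Spark settings; the sizes and counts shown are the documented defaults.

```python
# Data generated inside the Spark environment: the default partition
# count usually equals the total number of cores in the cluster.
df = spark.range(1_000_000)
print(df.rdd.getNumPartitions())

# Data read from external storage is split into partitions of at most
# spark.sql.files.maxPartitionBytes (default 128 MB) for splittable files.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))

# Shuffle operations (joins, aggregations) produce
# spark.sql.shuffle.partitions partitions (default 200).
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Explicitly change the partition count: repartition() can increase or
# decrease (full shuffle); coalesce() only decreases (no full shuffle).
df = df.repartition(8)
df = df.coalesce(4)
```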
Table of Contents:
1. How Do “Delta Tables” Become Slow to Query?
2. How to Speed Up the Query Performance of “Delta Tables”?
3. Keep the “Delta Tables” Clean Using the “VACUUM” Command
A) Important Aspect of the “VACUUM” Command
B) Things to Remember When Copying the “Data Files” of a “Delta Table”
C) How to Disable the “Default Retention Interval”?
D) When is the “VACUUM” Command Used?
4. Keep the “Delta Tables” Clean Using the “OPTIMIZE” Command
A) How to Change the “File Size” of the “Table” on Which the “OPTIMIZE” Command is Run?
B) When is the “OPTIMIZE” Command Used?
C) Drawbacks of the “OPTIMIZE” Command
5. Keep the “Delta Tables” Clean Using the “OPTIMIZE ZORDER BY” Command
A) Difference Between the “OPTIMIZE” and “OPTIMIZE ZORDER BY” Commands
B) Drawbacks of the “OPTIMIZE ZORDER BY” Command
6. Keep the “Delta Tables” Clean Using “Partitioning”
7. Use “ZORDER” and “Partitioning” Together
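A minimal sketch of these maintenance commands, assuming a Databricks/Delta runtime; the table name training.sales and the column customer_id are hypothetical.

```python
# Remove data files no longer referenced by the table and older than
# the retention interval (default 7 days / 168 hours).
spark.sql("VACUUM training.sales RETAIN 168 HOURS")

# To vacuum with a shorter retention interval, the safety check on the
# default retention interval must first be disabled.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Compact many small files into fewer, larger ones.
spark.sql("OPTIMIZE training.sales")

# Compact and co-locate rows with similar values of customer_id, which
# speeds up queries that filter on that column.
spark.sql("OPTIMIZE training.sales ZORDER BY (customer_id)")
```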
Table of Contents:
1. What is “Unity Catalog”?
2. Why is “Unity Catalog” Primarily Used?
3. In What Type of “Databricks Workspace” can “Unity Catalog” be Used?
4. Why is “Unity Catalog” Considered a “Unified Catalog”?
5. What is the “Data Lineage” Feature in “Unity Catalog”?
6. What is the “Data Sharing” Feature in “Unity Catalog”?
7. What is a “Hive Metastore”?
8. Can the “Hive Metastore” of One “Databricks Workspace” be Shared with Another “Databricks Workspace”?
9. Architecture of “Unity Catalog”
A) Metastore
B) User Management
10. Unity Catalog Object Model
11. Who Can Set Up “Unity Catalog” in “Databricks”?
12. Why is the “Access Connector for Azure Databricks” Service Required to Set Up “Unity Catalog”?
13. Steps to Enable “Unity Catalog” in “Databricks”
14. Important Features of “Unity Catalog”
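A minimal sketch of the object model in practice, assuming a Unity Catalog-enabled Databricks workspace; the catalog, schema, table, and group names are hypothetical.

```python
# Unity Catalog addresses objects with a three-level namespace:
# catalog.schema.table.
spark.sql("USE CATALOG main")
spark.sql("CREATE SCHEMA IF NOT EXISTS main.sales")
spark.sql("SELECT * FROM main.sales.orders LIMIT 10")

# Centralized access control via standard SQL GRANT statements.
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data-analysts`")
```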
Table of Contents:
1. Disadvantages of Storing Data in a “Data Lake”
2. What is “Delta Lake”?
3. In What “File Format” is the Data Stored in a “Delta Lake”?
4. Features of “Delta Lake”
5. Create a “Delta Table”
A) Create an “External Delta Table”
i) Create an “External Delta Table” Without a Name
ii) Create an “External Delta Table” With a Name
B) Create a “Managed Delta Table”
6. Query a “Delta Table”
7. Drop a “Delta Table”
A) Drop an “External Delta Table”
B) Drop a “Managed Delta Table”
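A minimal sketch of items 5–7, assuming a Databricks/Delta runtime; the paths and table names are hypothetical.

```python
df = spark.createDataFrame([(1, "Alice")], ["id", "name"])

# A) External Delta table without a name: data written directly to a path.
df.write.format("delta").mode("overwrite").save("/mnt/demo/customers")

# External Delta table with a name: a table registered over an existing path.
spark.sql("""
    CREATE TABLE IF NOT EXISTS customers_ext
    USING DELTA
    LOCATION '/mnt/demo/customers'
""")

# B) Managed Delta table: Spark manages both the metadata and the data files.
df.write.format("delta").mode("overwrite").saveAsTable("customers_managed")

# Query a Delta table by name or by path.
spark.sql("SELECT * FROM customers_ext").show()
spark.read.format("delta").load("/mnt/demo/customers").show()

# Dropping an external table removes only the metadata (the data files
# at the location remain); dropping a managed table removes both.
spark.sql("DROP TABLE customers_ext")
spark.sql("DROP TABLE customers_managed")
```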
Table of Contents:
1. What is “Incremental ETL”?
2. Features of “Incremental ETL”
A) Scalable
B) Idempotent
C) Complicated
3. Features of “Incremental ETL” in Databricks
4. Benefits of the “COPY INTO” Command in “Incremental ETL” in Databricks
A) No Custom Bookkeeping
B) Accessible
C) Scalable
5. Usage of the “COPY INTO” Command in “Incremental ETL” in Databricks
A) Destination
B) Source
C) File Format
6. Benefits of “AutoLoader” in “Incremental ETL” in Databricks
A) Easy to Use
7. “New File Detection Modes” of “AutoLoader” in “Incremental ETL” in Databricks
A) Directory Listing Mode
B) File Notification Mode
8. Data Ingestion Challenges Solved by “AutoLoader” in “Incremental ETL” in Databricks
9. How “AutoLoader” Helps in “Incremental ETL” in Databricks
A) Identify Schema on Stream Initialization
B) Auto-Detect Changes and Evolve Schema to Capture New Fields
C) Add Type Hints for Enforcement When Schema is Known
D) Rescue Data That Does Not Meet Expectations
10. Usage of “AutoLoader” in “Incremental ETL” in Databricks
A) Streaming Loads with AutoLoader
B) Batch Loads with AutoLoader
11. “Checkpoint Location” of “AutoLoader” in “Incremental ETL” in Databricks
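A minimal sketch of the two ingestion options discussed above, assuming a Databricks runtime; all paths and table names are hypothetical.

```python
# Option 1: COPY INTO - idempotently loads only the files that have
# not already been loaded into the destination table.
spark.sql("""
    COPY INTO training.raw_events
    FROM '/mnt/landing/events/'
    FILEFORMAT = JSON
""")

# Option 2: AutoLoader - incremental streaming ingestion that detects
# new files and can infer and evolve the schema.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events/schema")
    .load("/mnt/landing/events/")
)

query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events/")
    # trigger(availableNow=True) drains the available files and stops,
    # turning the stream into a batch-style load.
    .trigger(availableNow=True)
    .toTable("training.raw_events")
)
```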
Table of Contents:
1. What is an “Execution Plan”?
2. Different Stages of the “Logical Plan”
A) Parsed Logical Plan (Unresolved Logical Plan)
B) Analyzed Logical Plan (Resolved Logical Plan)
C) Optimized Logical Plan
3. What is “Physical Plan”?
4. What is the “Catalyst Optimizer”?
5. How Does the “Catalyst Optimizer” Optimize “Queries”?
A) User Input
B) Unresolved Logical Plan
C) Analysis
D) Logical Optimization
E) Physical Planning
F) Cost Model
G) Whole Stage Code Generation
6. Limitation of the “Catalyst Optimizer”
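The sketch below shows how to surface the plans discussed above; the DataFrame is hypothetical.

```python
# explain("extended") prints all four plans: Parsed Logical, Analyzed
# Logical, Optimized Logical, and Physical.
df = spark.range(100).withColumnRenamed("id", "value")
result = df.filter("value > 10").groupBy().sum("value")

result.explain("extended")   # all four plans
result.explain()             # physical plan only
```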
Table of Contents:
1. Limitations of External User-Defined Functions (UDFs)
2. Benefits of Using SQL User-Defined Functions (SQL UDF)
3. Create SQL User-Defined Functions (SQL UDF) as Constants
4. Create SQL User-Defined Functions (SQL UDF) Encapsulating Expressions
5. Create Nested SQL User-Defined Functions (SQL UDF) Encapsulating Expressions
6. Create SQL User-Defined Functions (SQL UDF) Reading from Tables
7. Create SQL Table User-Defined Functions (SQL UDF)
8. Administering SQL User-Defined Functions (SQL UDF)
9. Future Extension to SQL User-Defined Functions (SQL UDF)
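A minimal sketch of SQL UDFs, assuming a runtime that supports the CREATE FUNCTION … RETURN syntax (e.g., Databricks); the function names and logic are hypothetical.

```python
# A SQL UDF as a constant.
spark.sql("CREATE OR REPLACE FUNCTION blue() RETURNS STRING RETURN '0000FF'")

# A SQL UDF encapsulating an expression.
spark.sql("""
    CREATE OR REPLACE FUNCTION to_fahrenheit(celsius DOUBLE)
    RETURNS DOUBLE
    RETURN celsius * 9 / 5 + 32
""")

spark.sql("SELECT blue(), to_fahrenheit(100)").show()
```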
Table of Contents:
1. Handling Bad Records Using the Following “Parsing Modes”:
A) PERMISSIVE (Default)
B) DROPMALFORMED
C) FAILFAST
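A minimal sketch of all three parsing modes; the file path and schema are hypothetical.

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("_corrupt_record", StringType()),  # holds bad rows in PERMISSIVE mode
])

# PERMISSIVE (default): keep bad records, routing them to _corrupt_record.
df = (
    spark.read.schema(schema)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("/mnt/landing/people.json")
)

# DROPMALFORMED: silently drop records that cannot be parsed.
df = spark.read.schema(schema).option("mode", "DROPMALFORMED").json("/mnt/landing/people.json")

# FAILFAST: raise an exception on the first malformed record.
df = spark.read.schema(schema).option("mode", "FAILFAST").json("/mnt/landing/people.json")
```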
Table of Contents:
1. What is the “badRecordsPath” Option?
2. Limitation of the “badRecordsPath” Option
3. Example of the “badRecordsPath” Option for a Bad File (Missing File)
4. Example of the “badRecordsPath” Option for Bad Records
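A minimal sketch of the option, which is available on Databricks; the paths are hypothetical.

```python
# Bad records and unreadable files are written as JSON to the given
# path instead of failing the query.
df = (
    spark.read
    .option("badRecordsPath", "/mnt/bad-records/")
    .json("/mnt/landing/people.json")
)
df.show()

# The captured exceptions land under a timestamped subdirectory of
# /mnt/bad-records/ and can be inspected later.
```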