
Introduction to Databricks Workflows

7 min read · May 20, 2025


What is a Databricks Workflow?

Databricks Workflows is the unified orchestration layer for data, analytics, AI, and GenAI workloads on the Lakehouse platform.

A Databricks Workflow can orchestrate many kinds of workloads and triggering patterns while taking advantage of Unity Catalog and Delta Lake, and it is available within the three popular cloud systems, i.e., AWS, Azure, and Google Cloud Platform.

Databricks Workflow offers the following -

  • Simple authoring
  • Actionable insights
  • Proven reliability

A Databricks Workflow can orchestrate the full range of diverse data workloads on any cloud, including the following -

  • Streams
  • File-Event Triggers, etc.

A Databricks Workflow can also be scheduled to run at a particular time.

Databricks Workflows are fully Managed, reliable, and now also available as Serverless.

Building Blocks of Databricks Workflows

  • Job: A unit of orchestration in Databricks Workflows is called a Job.
    A Job is made up of one or more Tasks. A Task can be a Databricks Notebook, a Python Script, a Python Wheel, a Delta Live Table Pipeline, or even another Job.
    One Databricks Workflow Job can be embedded inside another Databricks Workflow Job.
  • Control Flow: The dependencies and conditions that determine the order in which the Tasks of a Job run.
  • Trigger: What starts a run of a Job, such as a schedule, a file event, or a manual/API invocation.
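As a rough sketch of how these building blocks fit together, the fragment below is a hypothetical Jobs API 2.1 payload: a Job made up of two Tasks connected by a dependency (Control Flow) and started by a cron schedule (Trigger). All names, paths, and the cron expression are made-up placeholders.

```
# Minimal sketch of a Job definition in the Jobs API 2.1 payload format.
# All names, paths, and the cron expression are illustrative placeholders.
job_definition = {
    "name": "daily_sales_pipeline",                      # the Job
    "tasks": [                                           # the Tasks inside the Job
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Workspace/etl/ingest"},
        },
        {
            "task_key": "transform",
            "spark_python_task": {"python_file": "dbfs:/scripts/transform.py"},
            "depends_on": [{"task_key": "ingest"}],      # Control Flow between Tasks
        },
    ],
    "schedule": {                                        # the Trigger (a cron schedule)
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "UTC",
    },
}
```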

Different Types of Computes to Run Jobs in Databricks Workflows

Following are the three different types of Compute on which a Job of a Databricks Workflow can run -

  • 1. Interactive Clusters: Interactive Clusters, or All-Purpose Clusters, can be shared by multiple Users.
    They are best for performing Ad-hoc analysis, data exploration, development, or code debugging.
    Interactive Clusters should not be used in Production as these are not cost-efficient.
  • 2. Job Clusters: An instantiated Job Cluster can be associated with specific Tasks of a Job in a Databricks Workflow.
    Job Clusters are approximately 50% cheaper as these are terminated when the Job ends, thus reducing resource usage and costs.
    However, Job Clusters are subject to the Cloud Provider's start-up time.
    It is possible to re-use the same Job Cluster across the Tasks in a Databricks Workflow for better price-performance.
    They are best for Production-grade workloads and operational use cases.
  • 3. Serverless Workflows: Serverless Workflows is a fully Managed service that is operationally simpler and more reliable.
    Serverless Workflows provide faster Cluster start-up and Auto-Scaling capabilities, so Users get a better experience at a lower cost.
    With out-of-the-box performance optimizations, Serverless Workflows provide an overall lower TCO (Total Cost of Ownership).
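As a hedged sketch, the three compute choices show up in a Job definition roughly as below. The cluster ID, node type, and runtime version are placeholders, and the serverless case assumes a workspace where serverless compute for Jobs is enabled.

```
# 1. Interactive / All-Purpose Cluster: a Task references an existing cluster by ID.
task_on_interactive_cluster = {
    "task_key": "explore",
    "notebook_task": {"notebook_path": "/Workspace/dev/explore"},
    "existing_cluster_id": "0123-456789-abcdefgh",       # placeholder cluster ID
}

# 2. Job Cluster: declared once at the Job level and re-used across Tasks
#    via "job_cluster_key"; it terminates when the Job ends.
job_level_clusters = [
    {
        "job_cluster_key": "shared_etl_cluster",
        "new_cluster": {
            "spark_version": "15.4.x-scala2.12",         # placeholder runtime version
            "node_type_id": "i3.xlarge",                  # placeholder node type
            "num_workers": 2,
        },
    }
]
task_on_job_cluster = {
    "task_key": "transform",
    "notebook_task": {"notebook_path": "/Workspace/etl/transform"},
    "job_cluster_key": "shared_etl_cluster",
}

# 3. Serverless: with serverless compute for Jobs enabled, a Task that specifies
#    no cluster settings runs on compute that Databricks manages automatically.
task_on_serverless = {
    "task_key": "report",
    "notebook_task": {"notebook_path": "/Workspace/etl/report"},
}
```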

When to Leverage Databricks Workflows?

Databricks Workflows can be used in the following scenarios -

  • If the goal is a simple, yet powerful ETL, DLT (i.e., Delta Live Table), or ML orchestration.
  • If %run, i.e., Notebook Workflows, is used.
  • If Apache Airflow is used in a project and the goal is to reduce infrastructure overhead and make orchestration easier.
  • To consolidate the toolset.
  • If the goal is to enable Non-Engineers to orchestrate the code in a project.
  • If the goal is to achieve Cloud-Provider-independent orchestration.

Workflow Tasks as Directed Acyclic Graph (DAG)

A Directed Acyclic Graph (DAG) is a conceptual representation of a series of activities, including data-processing flows. Following are the characteristics of a DAG -

  • Directed: Each Edge has an unambiguous direction; flow moves in a single direction from one Vertex to another.
  • Acyclic: Contains no cycles; it is impossible to loop back to a previous Vertex.
  • Graph: A collection of Vertices connected by Edges.

It is possible to view the Tasks of a Databricks Workflow as a Directed Acyclic Graph (DAG) in the following way -

  • The Tasks are Directed, as each dependency Edge points from one Task to another Task in a specific direction.
  • The Tasks are Acyclic, as there is no cycle among the Tasks, which means that it is impossible to loop back to a previous Task once it is completed.
  • The Tasks of a Databricks Job can be represented as a Graph: a collection of Vertices (Tasks) connected by Edges (dependencies).

Running Multiple Tasks of a Databricks Job as a DAG

  • In a Databricks Job, a Task can be dependent on one or more previous Tasks.
  • Databricks Jobs have introduced Task Orchestration: the ability to run multiple Tasks as a Directed Acyclic Graph (DAG) using the Databricks UI, or the Databricks API.
  • It is possible to define the order of execution of all the Tasks in a Databricks Job by configuring Task Dependencies, forming a DAG of Task execution.
  • There is a maximum number of Tasks that can be created in any Databricks Job.
  • If the use case demands more Tasks in a Databricks Job, one Databricks Job containing a handful of Tasks can itself be used as a Task in another Databricks Job, as shown in the sketch below.
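A minimal sketch, assuming the Jobs API 2.1 payload format: Task Dependencies are expressed with depends_on, and an entire Job can be embedded as a Task with run_job_task. All task keys, paths, and the job ID are made-up placeholders.

```
tasks = [
    {
        "task_key": "load",
        "notebook_task": {"notebook_path": "/Workspace/etl/load"},
    },
    {
        "task_key": "validate",
        "notebook_task": {"notebook_path": "/Workspace/etl/validate"},
        "depends_on": [{"task_key": "load"}],        # runs only after "load" completes
    },
    {
        "task_key": "reporting_job",
        "run_job_task": {"job_id": 987654321},       # an entire Job used as a Task
        "depends_on": [{"task_key": "validate"}],
    },
]
```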

Different Types of DAG Patterns

Following are the different types of DAG patterns -

  • Sequence: In the Sequence DAG pattern, control will move from one Task to the next Task in sequential order.
  • Funnel: In the Funnel DAG pattern, a handful of Tasks must all be completed first before the control moves to the next Task.
  • Fan-out: In the Fan-out DAG pattern, once the primary Task is completed, the control fans out to a handful of downstream Tasks that can run in parallel.
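Expressed with depends_on, the three patterns look roughly like the sketch below (the task keys are placeholders and the task bodies are omitted for brevity).

```
# Sequence: A -> B -> C
sequence = [
    {"task_key": "A"},
    {"task_key": "B", "depends_on": [{"task_key": "A"}]},
    {"task_key": "C", "depends_on": [{"task_key": "B"}]},
]

# Funnel: A and B must both complete before C starts
funnel = [
    {"task_key": "A"},
    {"task_key": "B"},
    {"task_key": "C", "depends_on": [{"task_key": "A"}, {"task_key": "B"}]},
]

# Fan-out: once A completes, B and C fan out and run in parallel
fan_out = [
    {"task_key": "A"},
    {"task_key": "B", "depends_on": [{"task_key": "A"}]},
    {"task_key": "C", "depends_on": [{"task_key": "A"}]},
]
```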

Basics of Databricks Job

A Databricks Job is made up of one or more Tasks.

  • A Databricks Job can consist of a single Task, like a single Delta Live Table Pipeline, or can be a large, Multi-Task Workflow with complex dependencies.

A Databricks Job can be created and run using a variety of tools, like the Databricks UI, the Databricks CLI, Terraform, or by invoking the Databricks Jobs API.
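As an illustration of the API route, the sketch below creates a Job by calling the Jobs API 2.1 create endpoint directly; the workspace URL and token are placeholders, and the same payload could equally be supplied through the CLI or Terraform.

```
import requests

workspace_url = "https://<your-workspace>.cloud.databricks.com"   # placeholder
token = "<personal-access-token>"                                  # placeholder

payload = {
    "name": "example_job",
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Workspace/etl/main"},
        }
    ],
}

# POST the Job definition to the Jobs API; the response contains the new job_id.
response = requests.post(
    f"{workspace_url}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
print(response.json())
```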

Supported Task Types:

  • The list of supported Task types is ever-growing and covers a wide variety of workloads.
    As of now, the supported Task types include a Databricks Notebook, a Python Script, a Python Wheel, a Delta Live Table Pipeline, and another Databricks Job, among others.
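A hedged sketch of how a few of these Task types are declared in a Job payload; the paths, package name, pipeline ID, and job ID are placeholders.

```
task_type_examples = [
    {"task_key": "nb", "notebook_task": {"notebook_path": "/Workspace/etl/nb"}},
    {"task_key": "py", "spark_python_task": {"python_file": "dbfs:/scripts/etl.py"}},
    {"task_key": "whl", "python_wheel_task": {"package_name": "my_pkg",
                                              "entry_point": "main"}},
    {"task_key": "dlt", "pipeline_task": {"pipeline_id": "<pipeline-id>"}},
    {"task_key": "child", "run_job_task": {"job_id": 123456789}},
]
```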

Supported Configuration Items: The following Configuration items can be set and specified in a Databricks Job -

  • Task name
  • Type — The type of the Task
  • Source
  • Path
  • Compute
  • Dependent libraries
  • Parameters
  • Emails
  • Retries
  • Duration threshold
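As a rough mapping, and assuming the Jobs API 2.1 payload format, the sketch below shows where each of the items above lives on a single Task; all values are placeholders.

```
fully_configured_task = {
    "task_key": "transform_orders",                              # Task name
    "notebook_task": {                                           # Type (Notebook)
        "source": "WORKSPACE",                                   # Source
        "notebook_path": "/Workspace/etl/transform",             # Path
        "base_parameters": {"run_date": "2025-05-20"},           # Parameters
    },
    "job_cluster_key": "shared_etl_cluster",                     # Compute
    "libraries": [{"pypi": {"package": "great-expectations"}}],  # Dependent libraries
    "email_notifications": {"on_failure": ["team@example.com"]}, # Emails
    "max_retries": 2,                                            # Retries
    "health": {                                                  # Duration threshold
        "rules": [{"metric": "RUN_DURATION_SECONDS",
                   "op": "GREATER_THAN",
                   "value": 3600}]
    },
}
```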

Supported Programming Languages: The following Programming Languages are supported in a Databricks Job -

  • Python
  • SQL
  • Scala
  • R
  • Java (via a JAR file)

Dependent Libraries: It is possible to install specific Dependent Libraries in a Databricks Job from a wide variety of sources, like -

  • Maven
  • CRAN
  • PyPI
  • Databricks Workspace location
  • DBFS or a Cloud storage location

The Dependent Libraries can be of the following types -

  • JAR
  • Python Egg
  • Python Wheel

It is also possible to upload any of the Dependent Libraries from the local machine.
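A hedged sketch of the library specification in a Job payload; the coordinates, package names, and file paths are placeholders.

```
libraries = [
    {"maven": {"coordinates": "com.example:my-lib_2.12:1.0.0"}},     # Maven
    {"cran": {"package": "forecast"}},                                # CRAN
    {"pypi": {"package": "pandas==2.2.0"}},                           # PyPI
    {"whl": "/Workspace/Shared/libs/my_lib-1.0-py3-none-any.whl"},    # Workspace location
    {"jar": "dbfs:/FileStore/jars/my_lib.jar"},                       # DBFS / Cloud location
]
```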

Configure Notifications:

  • It is possible to configure each Task in a Databricks Job to send Notifications, like sending an Email. This feature is optional.
  • Databricks Workflows can trigger the Notification, like sending an Email, when the Task begins, completes, or fails.
  • It is also possible to configure Notifications for Late Jobs, meaning that when a duration warning threshold is set for a Task and the running time of that Task exceeds the value of that threshold, a Notification is sent.
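A minimal sketch of these settings, assuming the Jobs API 2.1 email notification fields; the addresses and the 30-minute threshold are placeholders.

```
notification_config = {
    "email_notifications": {
        "on_start": ["team@example.com"],                                 # Task begins
        "on_success": ["team@example.com"],                               # Task completes
        "on_failure": ["oncall@example.com"],                             # Task fails
        "on_duration_warning_threshold_exceeded": ["oncall@example.com"], # Late Job
    },
    "health": {
        # Duration warning threshold: warn when the run exceeds 30 minutes.
        "rules": [{"metric": "RUN_DURATION_SECONDS",
                   "op": "GREATER_THAN",
                   "value": 1800}]
    },
}
```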

Retry Policy:

  • It is possible to configure a Databricks Job with a Retry Policy that determines what happens when a run of the Databricks Job fails.
    The Retry Policy specifies the number of times a failed Databricks Job should be retried, and how long to wait before the Databricks Job is retried each time it fails.
  • It is also possible to set the Retry Policy for each Task in a Databricks Job by clicking on the Advanced option and selecting Edit Retry Policy.
  • The Retry Interval is calculated in milliseconds between the start of the failed run and the subsequent retry run.
  • If both a Timeout and a Retry Policy are configured, then the Timeout applies to each Retry.
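A hedged sketch of the Task-level retry and timeout fields, assuming the Jobs API 2.1 names; the values are placeholders.

```
retry_config = {
    "max_retries": 3,                     # retry a failed run up to 3 times
    "min_retry_interval_millis": 60_000,  # wait between the failed run's start and the retry
    "retry_on_timeout": True,             # also retry when the run times out
    "timeout_seconds": 3600,              # the timeout applies to each retry individually
}
```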

Tags

  • Databricks Jobs support Tags to allow customers to easily identify and locate Databricks Jobs by ownership, topic, or department.
  • Tags propagate to the Job Clusters that are created when a Databricks Job runs, allowing the Tags to be used with existing Cluster Monitoring.
  • Tags can be either Key-Value Pairs or Keys only (Labels), as sketched below.
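A minimal sketch of the Job-level tags field; the tag names and values are placeholders, and a Label is represented here as a Key with an empty value.

```
tags = {
    "department": "finance",      # Key-Value Pair
    "topic": "daily-reporting",   # Key-Value Pair
    "production": "",             # Key only (Label), value left empty
}
```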


Written by Oindrila Chakraborty

I have 13+ years of experience in the IT industry. I love to learn about data and work with data. I am happy to share my knowledge with all. Hope this will be of help.
