Introduction to Databricks Workflows
What is a Databricks Workflow?
A Databricks Workflow is a Unified Orchestration service for data, analytics, AI, GenAI, and other workloads on the Lakehouse platform.
A Databricks Workflow can orchestrate a wide variety of workloads, with the benefits of Unity Catalog and Delta Lake, within the three popular cloud platforms, i.e., AWS, Azure, and Google Cloud Platform.
Databricks Workflow offers the following -
- Simple authoring
- Actionable insights
- Proven reliability
A Databricks Workflow can orchestrate diverse workloads over the full data lifecycle on any cloud, and can be triggered in ways such as the following -
- Streams
- File-Event Triggers, etc.
A Databricks Workflow can also be scheduled to run at a particular time (a sketch follows below).
Databricks Workflows are fully Managed and reliable, and are now also available as Serverless.
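As a minimal sketch (assuming the Jobs API 2.1 JSON format), the snippet below expresses a time-based schedule and a file-event trigger as Python dictionaries; the job names, cron expression, and storage URL are hypothetical placeholders.

```python
# Sketch of two Jobs API 2.1 style job specifications: one on a cron schedule,
# one triggered by file arrival. Names, expressions, and the URL are
# hypothetical placeholders.

scheduled_job_spec = {
    "name": "nightly_sales_refresh",              # hypothetical job name
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # run every day at 02:00
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
}

file_triggered_job_spec = {
    "name": "ingest_on_file_arrival",             # hypothetical job name
    "trigger": {
        # Fires a run whenever new files land in the monitored location.
        "file_arrival": {"url": "s3://example-bucket/landing/"},
    },
}
```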
Building Blocks of Databricks Workflows
- Job: A unit of orchestration in Databricks Workflows is called a Job. A Job is made up of one or more Tasks. A Task can be a Databricks Notebook, a Python Script, a Python Wheel, a Delta Live Tables Pipeline, or even another Job; in other words, one Databricks Workflow Job can be embedded inside another Databricks Workflow Job (a sketch follows this list).
- Control Flow: The dependencies between the Tasks of a Job, which determine the order in which the Tasks run.
- Trigger: The condition that starts a run of the Job, such as a schedule or a file-arrival event.
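The snippet below is a minimal sketch of these building blocks in a single job specification, assuming the Jobs API 2.1 JSON format; the task keys, notebook path, and pipeline ID are hypothetical placeholders.

```python
# Sketch of a Job made up of two Tasks with Control Flow between them
# (Jobs API 2.1 style). Task keys, the notebook path, and the pipeline ID
# are hypothetical placeholders.

job_spec = {
    "name": "example_multi_task_job",
    "tasks": [
        {
            # Task 1: a Databricks Notebook.
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Workspace/etl/ingest"},
        },
        {
            # Task 2: a Delta Live Tables pipeline that runs after "ingest".
            # The depends_on entry is the Control Flow between the two Tasks.
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "pipeline_task": {"pipeline_id": "<pipeline-id>"},  # placeholder
        },
    ],
}
```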
Different Types of Computes to Run Jobs in Databricks Workflows
The following are the three types of Compute on which a Databricks Workflow Job can run -
- 1. Interactive Clusters: Interactive Clusters, or All-Purpose Clusters, can be shared by multiple Users. They are best suited for Ad-hoc analysis, data exploration, development, and code debugging. Interactive Clusters should not be used in Production as these are not cost-efficient.
- 2. Job Clusters: An instantiated Job Cluster can be associated with specific Tasks of a Job in a Databricks Workflow. Job Clusters are approximately 50% cheaper because they are terminated when the Job ends, reducing resource usage and costs. However, Job Clusters are subject to the Cloud Provider's start-up time. It is possible to re-use the same Job Cluster across the Tasks in a Databricks Workflow for better price-performance (sketched below). Job Clusters are best suited for Production-grade workloads and operational use-cases.
- 3. Serverless Workflows: A Serverless Workflow is a fully Managed service that is operationally simpler and more reliable. Serverless Workflows provide faster Cluster start-up and Auto-Scaling capabilities, so Users get a better experience at a lower cost. With out-of-the-box performance optimizations, Serverless Workflows provide an overall lower TCO.
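The snippet below is a minimal sketch of defining one Job Cluster and re-using it across two Tasks, assuming the Jobs API 2.1 JSON format; the runtime version, node type, and notebook paths are hypothetical placeholders that depend on the workspace and cloud.

```python
# Sketch of a Job Cluster defined once under "job_clusters" and shared by two
# Tasks via "job_cluster_key" (Jobs API 2.1 style). All values are
# hypothetical placeholders.

job_spec = {
    "name": "shared_job_cluster_example",
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",  # placeholder runtime version
                "node_type_id": "i3.xlarge",          # placeholder (AWS) node type
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "extract",
            "job_cluster_key": "shared_cluster",  # runs on the shared Job Cluster
            "notebook_task": {"notebook_path": "/Workspace/etl/extract"},
        },
        {
            "task_key": "load",
            "depends_on": [{"task_key": "extract"}],
            "job_cluster_key": "shared_cluster",  # re-uses the same Job Cluster
            "notebook_task": {"notebook_path": "/Workspace/etl/load"},
        },
    ],
}
```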
When to Leverage Databricks Workflows?
Databricks Workflows can be used in the following scenarios -
- If the goal is a simple, yet powerful ETL, DLT (i.e., Delta Live Tables), or ML orchestration.
- If %run, i.e., Notebook Workflows, is currently used.
- If Apache Airflow is used in a project, and the goal is to reduce infrastructure overhead and make orchestration easier.
- To consolidate the toolset.
- If the goal is to enable Non-Engineers to orchestrate the code in a project.
- If the goal is to achieve Cloud-Provider-independent orchestration.
Workflow Tasks as Directed Acyclic Graph (DAG)
A Directed Acyclic Graph (DAG) is a conceptual representation of a series of activities, such as data processing flows. The following are the characteristics of a DAG -
- Directed: Each Edge has an unambiguous direction; flow is single-directional from one Vertex to another.
- Acyclic: Contains no cycles; it is impossible to loop back to a previous Vertex.
- Graph: Collection of Vertices connected by Edges.
It is possible to view the Tasks of a Databricks Workflow as a Directed Acyclic Graph (DAG) in the following way -
- The Tasks are Directed, as the flow between them is unambiguous: an Edge points from one Task to another in a specific direction.
- The Tasks are Acyclic, as there is no cycle among the Tasks; it is impossible to loop back to a previous Task once it has completed.
- The Tasks of a Databricks Job form a Graph: a collection of Vertices (Tasks) connected by Edges (dependencies).
Running Multiple Tasks of a Databricks Job as a DAG
- In a Databricks Job, a Task can depend on one or more previous Tasks.
- Databricks Jobs support Task Orchestration, i.e., the ability to run multiple Tasks as a Directed Acyclic Graph (DAG) using the Databricks UI, or the Databricks API.
- It is possible to define the order of execution of all the Tasks in a Databricks Job by configuring Task Dependencies, forming a DAG of Task Execution.
- There is a maximum number of Tasks that can be created in any Databricks Job.
- If the use case demands more Tasks, one Databricks Job, containing a subset of the Tasks, can be used as a Task in another Databricks Job (sketched below).
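The snippet below is a minimal sketch of embedding one Job inside another through a Run Job Task, assuming the Jobs API 2.1 JSON format; the task keys, notebook path, and child job ID are hypothetical placeholders.

```python
# Sketch of a parent Job that runs another, separately defined Job as one of
# its Tasks (Jobs API 2.1 style). The child job ID and paths are hypothetical.

parent_job_spec = {
    "name": "parent_orchestrator",
    "tasks": [
        {
            "task_key": "prepare",
            "notebook_task": {"notebook_path": "/Workspace/etl/prepare"},
        },
        {
            # This Task triggers an entire child Job, which lets a workload
            # with many Tasks be split across several smaller Jobs.
            "task_key": "run_child_job",
            "depends_on": [{"task_key": "prepare"}],
            "run_job_task": {"job_id": 123456789},  # placeholder child Job ID
        },
    ],
}
```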
Different Types of DAG Patterns
Following are the different types of DAG patterns -
- Sequence: In the Sequence DAG pattern, control will move from one Task to the next Task in sequential order.
- Funnel: In the Funnel DAG pattern, a set of Tasks must all complete before control moves to the next Task.
- Fan-out: In the Fan-out DAG pattern, once the primary Task is completed, control fans out to a set of Tasks that can run in parallel (all three patterns are sketched below).
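Below is a minimal sketch of the three patterns expressed through Task dependencies, assuming the Jobs API 2.1 JSON format; the task keys are hypothetical and the task bodies (notebook paths, compute, and so on) are omitted for brevity.

```python
# Sketch of the three DAG patterns as Jobs API 2.1 style task lists.
# Task bodies are omitted; only the dependency structure is shown.

# Sequence: A -> B -> C
sequence_tasks = [
    {"task_key": "A"},
    {"task_key": "B", "depends_on": [{"task_key": "A"}]},
    {"task_key": "C", "depends_on": [{"task_key": "B"}]},
]

# Funnel: A and B must both complete before C starts.
funnel_tasks = [
    {"task_key": "A"},
    {"task_key": "B"},
    {"task_key": "C", "depends_on": [{"task_key": "A"}, {"task_key": "B"}]},
]

# Fan-out: once A completes, B and C run in parallel.
fan_out_tasks = [
    {"task_key": "A"},
    {"task_key": "B", "depends_on": [{"task_key": "A"}]},
    {"task_key": "C", "depends_on": [{"task_key": "A"}]},
]
```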
Basics of Databricks Job
A Databricks Job is made up of one or more Tasks.
- A Databricks Job can consist of a single Task, like a single Delta Live Tables Pipeline, or can be a large, Multi-Task Workflow with complex dependencies.
A Databricks Job can be created and run using a variety of tools, like the Databricks UI, the Databricks CLI, Terraform, or by invoking the Databricks Jobs API (see the sketch below).
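As a minimal sketch of the API route, the snippet below posts a small job specification to the Jobs API 2.1 create endpoint with the requests library; the environment variable names, notebook path, and job name are hypothetical placeholders.

```python
# Sketch of creating a Job by calling the Jobs API 2.1 "create" endpoint.
# The host and token come from hypothetical environment variables, and the
# job specification is a minimal placeholder.

import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. the workspace URL
token = os.environ["DATABRICKS_TOKEN"]  # personal access token

job_spec = {
    "name": "hello_workflows",
    "tasks": [
        {
            "task_key": "hello",
            "notebook_task": {"notebook_path": "/Workspace/demo/hello"},
        }
    ],
}

response = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
response.raise_for_status()
print("Created Job with ID:", response.json()["job_id"])
```

The same kind of specification can equally be applied through the Databricks UI, the Databricks CLI, or Terraform.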
Supported Task Types:
- The list of supported Task types is ever-growing and covers a wide variety of Task types.
The Databricks UI displays the currently supported Task types when a Task is added to a Job; a few common ones are sketched below.
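As a non-exhaustive sketch, the mapping below pairs a few commonly used Task types with the Jobs API 2.1 fields that configure them; treat the exact set as illustrative, since the supported list keeps growing.

```python
# Illustrative (non-exhaustive) mapping of common Task types to the Jobs API
# 2.1 fields that configure them.

common_task_types = {
    "Notebook": "notebook_task",
    "Python script": "spark_python_task",
    "Python wheel": "python_wheel_task",
    "JAR": "spark_jar_task",
    "SQL": "sql_task",
    "Delta Live Tables pipeline": "pipeline_task",
    "dbt": "dbt_task",
    "Run Job (another Job)": "run_job_task",
}
```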
Supported Configuration Items: The following Configuration items can be specified in a Databricks Job (a sketch follows this list) -
- Task name
- Type — The type of the Task
- Source
- Path
- Compute
- Dependent libraries
- Parameters
- Emails
- Retries
- Duration threshold
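The snippet below is a minimal sketch of a single Task that sets the Configuration items above, assuming the Jobs API 2.1 JSON format; every name, path, address, and value is a hypothetical placeholder.

```python
# Sketch of one Task with the Configuration items listed above
# (Jobs API 2.1 style). All names, paths, addresses, and values are
# hypothetical placeholders.

task_spec = {
    "task_key": "transform_sales",                      # Task name
    "notebook_task": {                                   # Type: Notebook
        "notebook_path": "/Workspace/etl/transform",     # Path
        "source": "WORKSPACE",                           # Source
        "base_parameters": {"run_date": "2024-01-01"},   # Parameters
    },
    "job_cluster_key": "shared_cluster",                 # Compute
    "libraries": [                                       # Dependent libraries
        {"pypi": {"package": "great-expectations"}},
    ],
    "email_notifications": {                             # Emails
        "on_failure": ["data-team@example.com"],
    },
    "max_retries": 2,                                    # Retries
    "timeout_seconds": 3600,                             # Duration threshold (here, a hard timeout)
}
```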
Supported Programming Languages: The following Programming Languages are supported in a Databricks Job -
- Python
- SQL
- Scala
- R
- Java (via a JAR file)
Dependent Libraries: It is possible to install specific Dependent Libraries in a Databricks Job from a wide variety of sources, like -
- Maven
- CRAN
- PyPI
- Databricks Workspace location
- DBFS, or Cloud location
The Dependent Libraries can be of the following types -
- JAR
- Python Egg
- Python Wheel
It is also possible to upload any of the Dependent Libraries from the local machine.
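The snippet below is a minimal sketch of the libraries list on a Task, pulling dependencies from several of the sources above, assuming the Jobs API 2.1 JSON format; the package names, coordinates, and storage paths are hypothetical placeholders.

```python
# Sketch of a Task-level "libraries" list drawing from several sources
# (Jobs API 2.1 style). Package names, coordinates, and paths are hypothetical.

task_libraries = [
    {"pypi": {"package": "scikit-learn==1.4.2"}},                     # PyPI
    {"maven": {"coordinates": "com.example:reader:1.0.0"}},           # Maven
    {"cran": {"package": "forecast"}},                                # CRAN
    {"whl": "/Workspace/Shared/libs/my_lib-0.1.0-py3-none-any.whl"},  # Workspace location
    {"jar": "s3://example-bucket/jars/custom-udfs.jar"},              # DBFS, or Cloud location
]
```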
Configure Notifications:
- It is possible to configure each Task in a Databricks Job to send Notifications, like sending an Email. This feature is optional.
- Databricks Workflows can trigger a Notification, like sending an Email, when a Task begins, completes, or fails.
- It is also possible to configure a Notification for Late Jobs: when a duration warning threshold is set for a Task and the running time of that Task exceeds that threshold, a Notification is sent (see the sketch after this list).
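The snippet below is a minimal sketch of Task-level e-mail Notifications, including a late-job warning, assuming the Jobs API 2.1 JSON format; the addresses and threshold value are hypothetical, and the exact field names should be verified against the Jobs API reference.

```python
# Sketch of Task-level e-mail Notifications, plus a duration warning
# threshold for late runs (Jobs API 2.1 style). Addresses and the threshold
# are hypothetical placeholders.

task_spec = {
    "task_key": "score_models",
    "notebook_task": {"notebook_path": "/Workspace/ml/score"},
    "email_notifications": {
        "on_start": ["oncall@example.com"],
        "on_success": ["data-team@example.com"],
        "on_failure": ["oncall@example.com"],
        # Sent when the run exceeds the duration warning threshold below.
        "on_duration_warning_threshold_exceeded": ["oncall@example.com"],
    },
    "health": {
        "rules": [
            # Warn if the run takes longer than 30 minutes (1800 seconds).
            {"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 1800}
        ]
    },
}
```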
Retry Policy:
- It is possible to configure a Databricks Job with a Retry Policy that determines what happens when a run of the Job fails. The Retry Policy specifies the number of times a failed run should be retried, and how long to wait before each retry.
- It is also possible to set the Retry Policy for each Task in a Databricks Job by clicking the Advanced option and selecting Edit Retry Policy.
- The Retry Interval is calculated in milliseconds between the start of the failed run and the subsequent retry run.
- If both the Timeout and the Retry Policy are configured, the Timeout applies to each Retry (see the sketch after this list).
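The snippet below is a minimal sketch of a Task-level Retry Policy combined with a Timeout, assuming the Jobs API 2.1 JSON format; the values are hypothetical placeholders.

```python
# Sketch of a Task-level Retry Policy and Timeout (Jobs API 2.1 style).
# The values below are hypothetical placeholders.

task_spec = {
    "task_key": "load_warehouse",
    "notebook_task": {"notebook_path": "/Workspace/etl/load"},
    "max_retries": 3,                     # retry a failed run up to 3 times
    "min_retry_interval_millis": 60_000,  # minimum wait between the failed run's start and the retry
    "retry_on_timeout": True,             # also retry when the Task times out
    "timeout_seconds": 1800,              # the Timeout applies to each Retry individually
}
```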
Tags
- Databricks Jobs support Tags to allow customers to easily identify and locate Databricks Jobs by ownership, topic, or department.
- Tags propagate to the Job Clusters that are created when a Databricks Job runs, allowing the Tags to be used with existing Cluster Monitoring.
- Tags can be Key-Value Pairs, or Keys only (Labels), as sketched below.
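The snippet below is a minimal sketch of the tags field on a Job, assuming the Jobs API 2.1 JSON format; the tag names and values are hypothetical, and a label is expressed here as a key with an empty value.

```python
# Sketch of Job-level Tags (Jobs API 2.1 style). Tag names and values are
# hypothetical; the "pii" entry shows a key-only tag (Label) with an empty value.

job_spec = {
    "name": "finance_daily_rollup",
    "tags": {
        "department": "finance",   # Key-Value Pair
        "owner": "data-platform",  # Key-Value Pair
        "pii": "",                 # Key only (Label)
    },
    "tasks": [
        {
            "task_key": "rollup",
            "notebook_task": {"notebook_path": "/Workspace/finance/rollup"},
        }
    ],
}
```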