
Introduction to Databricks Asset Bundles (DABs)

16 min read · Jun 22, 2025

Typical Databricks Projects

A Databricks Project consists of multiple components -

  • Code: The code of a Databricks Project can be present in Notebooks, Python files, Wheel files, JARs, DBT and more.
  • Execution Environment: The execution environment of a Databricks Project spans one or more Databricks Workspaces, along with specific Compute Configurations for processing.
  • Resources: A Databricks Project may involve other Databricks resources, such as — Databricks Workflows, the MLFlow Tracking Server and Registry, Delta Live Tables and more.

A Databricks Project typically produces a variety of data products, including -

  • Tables
  • Pipelines
  • Jobs
  • Machine Learning Models
  • Dashboards, and more.

Lastly, the deliverables of a Databricks Project determine the components it needs. For example -

  • A simple report might include one or more Notebooks running on a basic Single Node Compute.
  • On the other hand, a full MLOps Pipeline would require multiple components, such as — MLFlow, Feature Store, Model Serving components, Notebooks and more.

A simple Data Engineering Databricks Project could include Notebooks, Delta Live Tables, Workflows, and Catalogs within Unity Catalog, running on specific Compute Configurations for the project. Hence, the distribution of the components would be -

  • Code: Python / SQL / R notebooks
  • Resources: Workflow, Delta Live Tables (DLT)
  • Execution Environment: Databricks Workspace, Unity Catalog, Compute Configuration(s)

High Level CI/CD Journey to Production for a Databricks Project

  • To start the CI/CD journey of a Databricks Project, a Version Control Strategy must be in place. For example, the Version Control Strategy might include dev, stage and main branches.
  • Now, the CI/CD journey begins by deploying the Databricks Project to the development environment from the dev branch. This allows the developers to test any changes or updates made to the Databricks Project against development data, using the specific development Configurations.
  • After completing development, a Pull Request is created to the stage branch, and the Databricks Project is then deployed to the staging environment. This ensures that testing of the Databricks Project continues on new staging data with the appropriate staging Configurations.
  • Once both the development and staging environments have completed testing and passed all tests, the Databricks Project is deployed to the production environment, i.e., from the main branch. This fully deploys the Databricks Project to the production environment, utilizing production data and the necessary production Configurations.

Some example Configurations for each of the environments -

development: For development, the environment may be configured as follows -

  • Single Node Compute is used, as the data is small and doesn’t require a large Cluster
  • The Databricks Project will be run using the user’s account, i.e., using the developer’s SSO, on the development data

staging: In staging, the aim is to closely resemble the production environment. Hence, the following configurations may be adopted -

  • Serverless Compute is used to scale the resources as needed
  • The Databricks Project will be run using a Service Principal Identity, i.e., Functional SSO (FSSO), on the staging data. Using a Service Principal Identity gives the automated tools access to only the necessary Databricks resources, offering better security than using user or group access

production: For production, the environment may be configured as follows -

  • Serverless Compute continues to be used
  • The Databricks Project will still be run using a Service Principal Identity, i.e., Functional SSO (FSSO), on the production data
  • The Databricks Project will be set to run weekly to update the data for the consumers

How to Orchestrate the CI/CD Journey to Production for a Databricks Project

Before Databricks Asset Bundles were introduced, there were three different methods to orchestrate the CI/CD process for a Databricks Project.

1. Manually: It is possible to manually orchestrate the CI/CD process for a Databricks Project using the UI.

Pros: The User Interface is easy to learn and navigate.

Cons:

  • It is extremely time-consuming to manually deploy the Databricks Project every time.
  • It is also highly error-prone, since it involves human interaction.
  • Most importantly, it is not a viable option for the CI/CD process as the target is to automate the deployment process and easily move the Databricks Project from one environment to the next.

2. Programmatically: It is possible to programmatically orchestrate the CI/CD process for a Databricks Project using Databricks REST API or Databricks SDK.

Pros: The Databricks REST API or Databricks SDK gives low-level control, allowing developers to program everything required for deploying the Databricks Project from one environment to the next.

Cons:

  • This approach requires intermediate to advanced programming knowledge. The developers need to learn many API endpoints if the Databricks REST API is used, or many classes if the Databricks SDK is used.
  • Coding everything to automate the entire CI/CD process can be extremely time-consuming.

3. Terraform: It is also possible to orchestrate the CI/CD process for a Databricks Project using Databricks Terraform Provider.

Pros: It is a very powerful and expressive tool that Databricks Administrators typically use to manage Infrastructure.

Cons:

  • This method might work for some, but it can be challenging for the Data Scientists and Data Engineers, who may not have a deep background in Infrastructure Management.
  • Using this approach also means learning and managing yet another tool.

So, to simplify the CI/CD process of a Databricks Project, i.e., write the code for the Databricks Project once and then easily deploy it to multiple environments while modifying the Configurations for each environment, the following approaches can be utilized -

  • Co-version the code with all Configurations in YAML, a human-readable format that organizes data in a simple, easy-to-read structure using indentation and minimal syntax.
    This makes it easier to understand and edit compared to other formats, like — XML and JSON.
  • Define Databricks resources using existing REST API parameters. This keeps the parameters consistent with other techniques.
  • Ensure user isolation during the deployment. This helps to prevent the users from interfering with each other's work.
  • Specify environment-based overrides and variables. This allows values to be easily overridden based on the target environment, ensuring flexibility and the correct Configurations for each environment.

All of the above can be combined in a single approach provided by Databricks, i.e., Databricks Asset Bundles, or DABs.

Introduction to Databricks Asset Bundles (DABs)

  • DABs enable developers to write code once and deploy it everywhere.
  • DABs, or Databricks Asset Bundles, use YAML files to specify the artifacts, resources and configurations of a Databricks Project.
  • DABs allow all the necessary components of a Databricks Project to be managed and deployed in a consistent and efficient way to multiple environments.

How Do Databricks Asset Bundles (DABs) Work?

  • The new Databricks CLI has specific Bundle Commands to validate, deploy and run Databricks Asset Bundles using a bundle YAML configuration file.

Where Are Databricks Asset Bundles (DABs) Used?

  • DABs are extremely useful during development and CI/CD processes for deploying Databricks assets to target environments while modifying specific Configurations for each environment.

High Level View of CI/CD Pipeline with Databricks Asset Bundles (DABs)

  • Usually, the developers in any organization work in the lower environment of the Databricks Workspace in their respective feature branches, which are built on top of the development branch.
    So, the lead developer can build a Project Bundle in his / her local feature branch, working in the lower environment of the Databricks Workspace for the team.
  • From the feature branch of the lower environment of the Databricks Workspace, the lead developer can either use the Bundle or manually deploy to the development environment, which can include a Development Databricks Workspace or a Development Catalog.
    In this Development Databricks Workspace, the developers can test their changes away from production.
  • After development, the developers can commit their changes to version control within a project repository and push the development changes.
  • After the changes are committed, it is possible to set up a notification or trigger to initiate the CI/CD pipeline.
    In a typical CI/CD pipeline, the changes are first deployed to the staging environment, where the tests are run.
  • Once all the tests pass in the staging environment, the code can then be deployed to the production environment, after following the organization’s processes, such as — code review and approval.

Simple Project Structure for Bundle

Within the project folder, it should be possible to create multiple sub-folders to organize the assets.

A simple project structure should start with the project folder, followed by various folders and files, such as — resources, src, tests, and a databricks.yml file.
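As a rough sketch (the project folder name my_databricks_project is just a placeholder), such a layout might look like this -

my_databricks_project/
├── databricks.yml   <- required Bundle Configuration file
├── resources/       <- additional YAML Configuration files, if needed
├── src/             <- Notebooks, Python files and other source code
└── tests/           <- unit and integration tests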

These folders may contain additional sub-folders to store specific files.

  • resources: The resources folder should contain additional YAML Configuration files for the Databricks Asset Bundles, if needed.
  • src: The src, or source, folder contains the source files needed for the data pipelines, such as — Notebooks, Python files, etc.
  • tests: The tests folder contains the unit and integration tests for the data pipeline.
  • databricks.yml: The databricks.yml file is a required Bundle Configuration file that is used to deploy the Databricks assets.
    The databricks.yml file must be expressed in the YAML format.
    The databricks.yml file contains at minimum the top-level bundle mapping.
    A Databricks Project should contain only one Bundle Configuration file, i.e., databricks.yml file.

Organizing and modularizing the folders and files will help during the development and maintenance as the Databricks Project grows.

This is a simple example. The Databricks Project of any organization may have additional files and folders, or a different organizational structure.

Details of databricks.yml File

The databricks.yml file is the key to the Bundle.

The YAML file contains several top-level mapping keys that are left aligned. The databricks.yml configuration top-level mappings include -

  • bundle
  • resources
  • targets
  • variables
  • workspace
  • permissions
  • artifacts
  • include
  • sync

The simple example of the databricks.yml file below includes the top-level mappings bundle, resources and targets.

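The following is a minimal sketch of what the sample file might contain, based on the description that follows; the Databricks Workspace URLs and the Notebook path are assumed placeholders -

bundle:
  name: oindrila_demo_bundle

resources:
  jobs:
    ingestion_job_dab:
      name: ingestion_job
      tasks:
        - task_key: create_bronze_table
          notebook_task:
            notebook_path: ./src/create_bronze_table.py   # assumed relative path, with the correct extension

targets:
  development:
    mode: development
    default: true
    workspace:
      host: https://<development-workspace-url>   # placeholder
  production:
    mode: production
    workspace:
      host: https://<production-workspace-url>    # placeholder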

bundle: The sample databricks.yml file declares a top-level mapping key, i.e., bundle with a name mapping that specifies the name of the Bundle, i.e., oindrila_demo_bundle.

resources: The other top-level mapping key, i.e., resources, specifies information about the Databricks resources used by the Bundle, such as — Jobs, DLT Pipelines, MLflow experiments and models, and more.
The Databricks Resources are defined using the corresponding Databricks REST API parameters.

  • The top-level mapping key resources includes one or more individual resource declarations, each of which must have a unique name.
    In the sample databricks.yml file, the jobs mapping key has a resource declaration named ingestion_job_dab. This job resource creates a Databricks Job named ingestion_job using the name mapping. This Databricks Job contains one or more Tasks specified using the corresponding REST API parameters.
  • Within the tasks mapping, there is only one task resource declaration under that job resource. The task resource is named create_bronze_table and uses the notebook_path mapping to specify the path of the Databricks Notebook to use.
    It is important to use the Relative Path of the Databricks Notebook within the Databricks Project, with the correct extension for the Databricks Notebook.
    The developers need to be aware that starting December 20th, 2024, the default format for new Databricks Notebooks is the .ipynb format. If the correct extension for the Databricks Notebook is not specified, an error will be returned.
    In the sample databricks.yml file, the Databricks Notebook uses the traditional .py extension.

targets: The top-level mapping key, i.e., targets, sets specific environments and corresponding environment configurations. In the targets mapping, it is also possible to include various other Configurations and Configuration Overrides for that target.
In the sample databricks.yml file, two target environments are included — development and production, each with unique Configurations and Overrides.
For the development environment, the mode mapping is specified as development, the default mapping is set to true, and the Databricks Workspace URL for the development environment is specified as the host mapping under the workspace mapping.
For the production environment, the mode mapping is specified as production, and the Databricks Workspace URL for the production environment is specified as the host mapping under the workspace mapping.

  • mode: The mode mapping represents the different environments of a Databricks Project, such as — development or production.
  • default: The default mapping represents the default target environment, which is set to the development environment in the sample databricks.yml file. This ensures that if the developers have not specified where to deploy the Bundle, it will deploy to the development environment by default.

A Bundle Configuration file must contain only one top-level mapping bundle that associates the Bundle’s contents with the name of the Bundle. Additionally, other Databricks Workspace settings can also be included, like — Cluster ID, Compute ID, Git and a few others, if needed.

The additional mappings beneath the top-level mappings must be indented.

How to Validate, Deploy, Run and Delete Databricks Asset Bundles (DABs)

Validate: To validate the Databricks Asset Bundles (DABs), the Databricks CLI command databricks bundle validate is used.
This command returns warnings if unknown resource properties are found in the Bundle Configuration files.

Deploy: To deploy the Databricks Asset Bundles (DABs) to the Databricks Workspace, the Databricks CLI command databricks bundle deploy is used, and the -t flag specifies the target environment into which the DABs will be deployed.
For the example — databricks bundle deploy -t development, the DABs will be deployed into the development environment.
In the sample databricks.yml file, development is set as the default target environment. So, if no target environment is specified to deploy the DABs after the -t flag in the command databricks bundle deploy, then the DABs will be deployed to the development environment, by default.

  • However, it is best practice to explicitly specify the target environment to which the DABs should be deployed after the -t flag in the command databricks bundle deploy.

Run: Once the Databricks Asset Bundles (DABs) are deployed to the Databricks Workspace, the Databricks CLI command databricks bundle run is used to run the Databricks Jobs specified under the jobs mapping as individual unique job keys in the DABs. The -t flag specifies the target environment in which to run, followed by the unique key that specifies the Databricks Job to run.
The example command databricks bundle run -t development ingestion_job_dab runs the Databricks Job named ingestion_job, specified by the unique job key ingestion_job_dab in the DABs, in the development environment.


Delete: Once a Bundle is no longer needed, it can be deleted using the Databricks CLI command databricks bundle destroy.
By default, the developers will be prompted to confirm permanent deletion of the previously-deployed Databricks Jobs, Pipelines and artifacts.
To skip these prompts and perform automatic permanent deletion, the --auto-approve option is added to the Databricks CLI command databricks bundle destroy.

  • Destroying a Bundle permanently deletes that Bundle’s previously-deployed Databricks Jobs, Pipelines, and artifacts. This action cannot be undone.
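Putting these commands together, a typical sequence for the sample bundle described in this article would be -

databricks bundle validate
databricks bundle deploy -t development
databricks bundle run -t development ingestion_job_dab
databricks bundle destroy --auto-approve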

Substitution of Variables in Databricks Asset Bundles (DABs)

Databricks Asset Bundles (DABs) support substitutions and custom variables, making the Bundle Configuration file more modular and reusable. Both substitutions and custom variables allow dynamic retrieval of values, meaning that settings can be determined at the time the Databricks Asset Bundles (DABs) are deployed and run.

To reference a custom variable in the YAML file, use a dollar sign, i.e., $, followed by curly braces, i.e., {}. Inside the curly braces, place the mapping where the custom variables are declared, followed by a dot, i.e., “.”, and the name of the custom variable. Example: ${var.user_id}, where var is the mapping under which custom variables are declared.

By default, a variety of substitutions are available. Some common ones are -

  • ${bundle.name}
  • ${bundle.target}
  • ${workspace.file_path}
  • ${workspace.root_path}
  • ${resources.jobs.<job_name>.id}
  • ${resources.models.<model_name>.name}
  • ${resources.pipelines.<pipeline_name>.name}
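As an illustrative sketch (not part of the sample file), a built-in substitution such as ${bundle.target} could be used to suffix a Job name per target environment -

resources:
  jobs:
    ingestion_job_dab:
      name: ingestion_job_${bundle.target}   # resolves to, e.g., ingestion_job_development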

Custom Variables

It is possible to use custom variables that are defined in the Databricks Asset Bundles (DABs).

Custom variables enable the dynamic retrieval of values needed for specific scenarios.

By default, a custom variable is assumed to be of type string. This sort of custom variable is called a simple custom variable.

The custom variables are declared in the Bundle Configuration file within the variables mapping. Under the top-level mapping key variables, it is possible to specify any number of custom variables.

  • For example, define a custom variable by the name user_sso_id, followed by a description and its default value mappings respectively.

Complex Custom Variables

  • If complex data needs to be stored in a custom variable, the custom variable can be defined as complex by setting the type mapping to complex.
  • For example, define a complex custom variable by the name cluster_details, followed by its description and default value mappings, where the default value is of complex type and specifies the values for spark_version, node_type_id and num_workers.

Defining and Referencing Simple Custom Variables in databricks.yml File

The Bundles Configuration file can contain only one top-level mapping called variables, under which all the custom variables are defined.

Once a custom variable is defined under the top-level mapping key variables, it can be referenced from other custom variables, or, mappings.

In this example, let’s define a simple custom variable by the name user_sso_id with the default value oindrila8008.

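In YAML, that definition might look like the following (the description text is assumed) -

variables:
  user_sso_id:
    description: The SSO id of the user   # assumed description text
    default: oindrila8008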

user_sso_id will be used to create two new Simple Custom Variables, i.e., catalog_dev_details and catalog_prod_details by appending the strings “_1_dev” and “_3_prod” to it respectively.
To reference the custom variable, i.e., user_sso_id, use the dollar sign, i.e., $, followed by the curly braces, i.e., {}.
Inside the curly braces, use the top-level mapping “var”, followed by a dot, i.e., “.”, and, the name of the custom variable, i.e., user_sso_id.

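A sketch of the resulting variables mapping -

variables:
  user_sso_id:
    default: oindrila8008
  catalog_dev_details:
    default: ${var.user_sso_id}_1_dev    # resolves to oindrila8008_1_dev
  catalog_prod_details:
    default: ${var.user_sso_id}_3_prod   # resolves to oindrila8008_3_prod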

As a result, the value of the Custom Variable, i.e., catalog_dev_details will be oindrila8008_1_dev, and the value of the Custom Variable, i.e., catalog_prod_details will be oindrila8008_3_prod.

This technique allows the custom variables to be referenced or overridden throughout the databricks.yml file.

Defining and Referencing Complex Custom Variables in databricks.yml File

By default, a custom variable is assumed to be of type string, unless it is defined as a complex custom variable by setting the type mapping to complex.

In this example, let’s define a complex custom variable by the name cluster_details with the type mapping set to complex and specify the default values. This allows more structured data to be provided for the custom variable, such as specifying spark_version, node_type_id and num_workers.

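A sketch of this definition, where the actual default values are assumptions made for illustration -

variables:
  cluster_details:
    description: Details of the Cluster to be used   # assumed description text
    type: complex
    default:
      spark_version: 15.4.x-scala2.12   # assumed value
      node_type_id: Standard_DS3_v2     # assumed value
      num_workers: 3                    # assumed value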

The num_workers mapping of cluster_details will be used to create a new simple custom variable, i.e., cluster_worker_details, by appending the string “_worker(s)” to it.
To reference the complex custom variable, i.e., cluster_details, use the dollar sign, i.e., $, followed by the curly braces, i.e., {}.
Inside the curly braces, first use the top-level mapping “var”, followed by a dot, i.e., “.”, and, the name of the complex custom variable, i.e., cluster_details. Then again use a dot, i.e., “.”, and, the name of the mapping inside cluster_details, i.e., num_workers.

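Following that description, the new variable might be declared as -

variables:
  cluster_worker_details:
    default: ${var.cluster_details.num_workers}_worker(s)   # e.g., 3_worker(s) with the assumed default above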

Target Environment Variable Overrides

Custom variables can be used throughout the Bundles Configuration file. One of the places where the custom variables can be used is the targets mapping. This allows the custom variables’ values to be dynamically modified for each of the target environments.

For example, let’s say there is a custom variable by the name target_catalog that is used to populate a Databricks Job Parameter to read and write data from a specific catalog. The target_catalog variable’s value can be modified for each of the target environments.

  • When deploying to the development environment, the target_catalog variable will use the value of the catalog_dev_details variable.
  • When deploying to the production environment, the target_catalog variable will use the value of the catalog_prod_details variable.
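A sketch of how these overrides might be declared (the default value and description shown are assumptions) -

variables:
  target_catalog:
    description: Catalog to read data from and write data to   # assumed description text
    default: ${var.catalog_dev_details}                        # assumed default value

targets:
  development:
    variables:
      target_catalog: ${var.catalog_dev_details}
  production:
    variables:
      target_catalog: ${var.catalog_prod_details}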

If an override is not specified for the custom variable, the default value will be used.
Please be aware that the default value of any custom variable must be defined in the top-level mapping variables for an override to work. If a custom variable is not assigned a default value, the override will not function as expected.


Lookup Variables

For specific object types, i.e., alert, cluster_policy, cluster, dashboard, instance_pool, job, metastore, notification_destination, pipeline, query, service_principal and warehouse, it is possible to define a lookup mapping for a custom variable to retrieve a named object’s ID.

If a lookup mapping is defined for a custom variable, the id of the object with the specified name is used as the value for that custom variable. This ensures the correct id of the object is always used.

  • First, specify the name of the custom variable to create, for example, cluster_id_details.
  • Then use the lookup mapping to define what to look up. For example, look up the name of the cluster created for this example, i.e., Oindrila_Cluster, to retrieve the corresponding id value of that cluster.
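A sketch of such a lookup variable -

variables:
  cluster_id_details:
    description: The id of the cluster named Oindrila_Cluster   # assumed description text
    lookup:
      cluster: Oindrila_Cluster   # the id of this cluster becomes the value of the variable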

Benefits of Using Custom Variables in Bundles Configuration File

There are several benefits to using custom variables in the databricks.yml file -

  • Customizable for Different Environments: Custom variables make it easy to modify the configurations, such as — catalogs, file paths, database connections and other settings for the development, staging and / or production environments.
  • Reusability Across Databricks Projects: It is possible to use the same Databricks Asset Bundles (DABs) across multiple teams or Databricks Workspaces by adjusting only the custom variable’s values, promoting consistency and reducing redundancy.
  • Easy Maintenance and Updates: With custom variables, it is possible to quickly update the configurations of the Databricks Assets mentioned in the databricks.yml file, by modifying the custom variables, ensuring consistency across the environments and reducing the risk of errors.

Written by Oindrila Chakraborty