Introduction to Continuous Integration and Continuous Deployment/Delivery (CI/CD) in Databricks
Introduction to CI/CD
CI/CD is a key subset of the DevOps practice that focuses on -
- automated code integration
- testing
- delivery
Within the DevOps lifecycle, Continuous Integration, or CI, emphasises -
- planning
- development
- environment management
- testing of the pipelines
On the other hand, Continuous Deployment / Delivery, or CD, focuses on -
- automating release processes
- deployment
- operation
- monitoring of these pipelines
Role of CI
CI, or, Continuous Integration involves regularly merging code changes from multiple contributors into a central code repository and running automated tests to ensure code quality.
Code that fails the automated tests does not make it into the central code repository.
Example: Consider three developers (developer A, developer B and developer C) working on a project, implementing updates and fixes. The developers push their code to a version control system, which automatically runs tests before a commit is accepted. This ensures that any issues with the code are caught. If the tests fail, the commit is stopped. If the tests pass, the changes are merged.
The way testing and commits are handled in the version control system depends on the branching strategy defined by the organization.
Deciding the right branching strategy and testing strategy is crucial for maintaining quality and smooth workflows.
Benefits of CI
Following are the four key benefits of CI -
- Early Detection of Issues: By integrating code often, bugs and conflicts are caught early, when they are easier to fix.
- Faster Development Cycle: Frequent integration of code speeds up the delivery of new features and fixes.
- Improved Collaboration and Code Quality: Regular code integration leads to cleaner, more modular code and better teamwork.
- Automated Testing and Validation: Automated tests run with each integration to ensure the code is stable and works with existing features.
High-Level Testing Steps within CI/CD
Testing the code is extremely important within a CI/CD pipeline.
The testing steps to follow are depicted by the testing pyramid, which categorizes tests into three levels, from base to top -
- Unit Tests
- Integration Tests
- System Tests
Unit Tests: The base of the pyramid is Unit Test, which tests individual functions or methods in isolation.
Since Unit Tests target small, individual functions, they can typically be run quickly, frequently and automatically, ensuring that the functions work as expected.
Unit Tests form the foundation of the pyramid because they are inexpensive and provide the broadest coverage.
An example of a Unit Test is testing an individual PySpark method, as in the sketch below.
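For illustration, here is a minimal pytest-based sketch of such a Unit Test. The transformation add_full_name is a hypothetical example, and the test assumes pytest and a local SparkSession are available:

```python
# Minimal Unit Test sketch for a (hypothetical) PySpark transformation.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def add_full_name(df):
    """Hypothetical transformation under test: concatenate first and last name."""
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))


@pytest.fixture(scope="module")
def spark():
    # Local SparkSession so the test can run quickly outside any cluster.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


def test_add_full_name(spark):
    source = spark.createDataFrame([("Ada", "Lovelace")], ["first_name", "last_name"])
    result = add_full_name(source)
    assert result.collect()[0]["full_name"] == "Ada Lovelace"
```

Because the test operates on a tiny in-memory DataFrame, it runs in seconds and can be executed automatically on every commit.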
Integration Tests: Integration Tests test the integration between different components or systems. These tests are typically slower and more costly than Unit Tests, but provide greater assurance that the components work together correctly.
Within Databricks, Integration Tests typically revolve around Notebooks, DLT pipelines, or Workflows.
An example of an Integration Test is checking that a PySpark method works correctly within a DLT pipeline, for instance by validating the pipeline's output table as sketched below.
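A minimal sketch of such a check, run after the pipeline has executed in a test environment; the catalog, schema, table and column names are hypothetical, and the spark session is assumed to come from the fixture above or from the Databricks runtime:

```python
# Minimal Integration Test sketch: validate the table produced by a pipeline run.
def test_silver_orders_table(spark):
    df = spark.read.table("dev_catalog.silver.orders")  # hypothetical pipeline output

    # The table should exist, be non-empty and expose the expected columns.
    assert df.count() > 0
    assert {"order_id", "order_ts", "amount"}.issubset(set(df.columns))

    # Business rule: no negative amounts should survive the cleansing step.
    assert df.filter(df.amount < 0).count() == 0
```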
System Tests: The System Test exercises the entire application, ensuring that all parts function together in a real-world scenario.
These tests are typically slow, costly and often run in Production-like environments.
An example of a System Test is running an end-to-end data pipeline as a Workflow and checking that it produces the desired results, as in the sketch below.
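A minimal sketch, assuming the databricks-sdk package, workspace authentication and a spark session; the job ID and table name are hypothetical:

```python
# Minimal System Test sketch: trigger an existing Workflow (Job) and validate the outcome.
from databricks.sdk import WorkspaceClient


def test_end_to_end_pipeline(spark):
    w = WorkspaceClient()

    # run_now() returns a waiter; result() blocks until the run terminates.
    run = w.jobs.run_now(job_id=123456789).result()
    assert run.state.result_state.value == "SUCCESS"

    # Validate the business outcome produced by the Workflow.
    report = spark.read.table("stage_catalog.gold.daily_revenue")
    assert report.count() > 0
```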
Role of CD
The CD, or Continuous Deployment / Delivery, part is equally important in CI/CD.
Continuous Delivery: Continuous Delivery is all about automating the process of pushing changes from a lower environment to a higher one, for example from Development to the Staging or Pre-Production environment.
This setup allows for seamless updates while retaining the flexibility to deploy to Production manually whenever needed, ensuring smooth releases.
After the Continuous Integration steps have completed and testing has passed, a decision can be made to deploy the data pipelines to Production.
Continuous Deployment: Continuous Deployment takes automation one step further. Once a change passes all the tests, it is automatically deployed to Production, ensuring that new features and fixes are delivered quickly and seamlessly, without manual intervention.
If the Continuous Integration process is in place and well implemented, it is possible to automatically deploy the data pipelines from Development to Staging, and then to Production, avoiding manual deployment tasks. Implementing this technique requires well thought-out tests that determine whether the data pipelines should be deployed.
High-Level CI/CD Workflow Overview
Looking at the high-level overview of CI/CD, the Develop, Test, Build and Version Control steps of Continuous Integration are combined with the Deploy to Stage and Deploy to Production steps of Continuous Deployment / Delivery.
In the end, the CI/CD process streamlines the development and deployment of data pipelines by automating the testing and deployment, which leads to faster and more reliable releases.
This approach minimizes manual errors, improves collaboration and ensures that high-quality software is delivered quickly and consistently.
Isolating Environments for CI/CD
Within CI/CD, isolating environments for the different stages is very important. This ensures that the code is developed and tested in the Development and Staging environments, prior to touching the Production environment.
The minimal setup is two environments, although this can vary depending on the organization's requirements.
Within Databricks, environments can be isolated using separate Workspaces (DEV, STAGE and PROD Workspaces) or separate Catalogs (a DEV, STAGE and PROD Catalog), as sketched below.
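A minimal sketch of catalog-based isolation, where the same pipeline code targets a different Unity Catalog catalog per environment; the environment variable and catalog names are hypothetical, and spark is the session provided by the Databricks runtime:

```python
# Minimal sketch: select the target catalog from an environment flag set by CI/CD.
import os

ENV_TO_CATALOG = {"dev": "dev_catalog", "stage": "stage_catalog", "prod": "prod_catalog"}

env = os.getenv("DEPLOY_ENV", "dev")        # e.g. set by the CI/CD pipeline
catalog = ENV_TO_CATALOG[env]

spark.sql(f"USE CATALOG {catalog}")         # all reads and writes now stay in this environment
orders = spark.read.table("silver.orders")  # resolves to <catalog>.silver.orders
```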
Setting Up Data for CI/CD
DEV Data: Within a CI/CD pipeline, the DEV data can be generated in the following ways -
- Often a small, static subset of Production data is used as DEV data.
- DEV data can also be anonymized, or generated as a synthetic dataset.
DEV data is prepared this way to allow rapid development and testing without compromising privacy or Production data integrity (see the sketch at the end of this section).
STAGE Data: If there is a Staging environment, STAGING data should closely mirror Production data in structure and volume.
With sensitive information anonymized or scrubbed, STAGING data ensures realistic testing and validation.
PRODUCTION Data: PRODUCTION data is live, fully operational and continuously updated, and contains real-world user data.
PRODUCTION Data must be handled with high security, privacy and compliance standards.
Each environment should have data suited to its purpose, balancing realism, security and compliance at every stage. How this is implemented will depend on the organization, the sensitivity of the data, and the complexity of the project.
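As an illustration of preparing DEV data, here is a minimal sketch that samples a Production table and anonymizes sensitive columns before writing the result to a DEV catalog; the catalog, table and column names are hypothetical, and spark is the session provided by the Databricks runtime:

```python
# Minimal sketch: build a small, anonymized DEV copy of a Production table.
from pyspark.sql import functions as F

prod_df = spark.read.table("prod_catalog.sales.customers")

dev_df = (
    prod_df
    .sample(fraction=0.01, seed=42)                    # small, static subset
    .withColumn("email", F.sha2(F.col("email"), 256))  # anonymize PII
    .withColumn("phone", F.lit(None).cast("string"))   # drop sensitive values
)

dev_df.write.mode("overwrite").saveAsTable("dev_catalog.sales.customers")
```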
Deployment Tools in Databricks
There are several tools available to deploy a Databricks project -
1. Databricks REST API: It is possible to use the Databricks REST API to deploy a Databricks project (a sketch of calling it directly follows this list).
The REST API provides direct access to Databricks functionality through HTTP requests.
Using the REST API means manually constructing HTTP requests and handling the returned responses.
2. Databricks SDK: The Databricks SDK helps accelerate development and deployment within the Databricks Data Intelligence Platform (see the SDK sketch after this list).
It covers all publicly exposed Databricks REST API operations and supports multiple programming languages, including Python, Java, Go and R.
3. Databricks CLI: Another tool that can be used to deploy a Databricks project is the Databricks CLI (see the deployment sketch after this list).
The Databricks CLI offers an easy-to-use interface for automating tasks from the command prompt, terminal, or bash scripts.
Additionally, the Databricks CLI supports Databricks Asset Bundles (DABs), which allow infrastructure for Databricks to be written as code.
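A minimal sketch of calling the REST API directly, here listing the jobs in a workspace; the host and token are assumed to come from environment variables populated by the CI/CD system:

```python
# Minimal sketch: list workspace jobs through the Databricks REST API.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # personal access token or service principal token

response = requests.get(
    f"{host}/api/2.1/jobs/list",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
response.raise_for_status()

for job in response.json().get("jobs", []):
    print(job["job_id"], job["settings"]["name"])
```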
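A minimal sketch using the Databricks SDK for Python (the databricks-sdk package), which resolves authentication from the environment or a Databricks configuration profile:

```python
# Minimal sketch: use the Databricks SDK for Python instead of raw HTTP requests.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

print(w.current_user.me().user_name)  # confirm authentication works

# Enumerate the jobs deployed in the workspace.
for job in w.jobs.list():
    print(job.job_id, job.settings.name)
```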
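And a minimal sketch of driving the Databricks CLI and a Databricks Asset Bundle from a Python-based CI script; it assumes the CLI is installed and authenticated, and the target name dev is a hypothetical bundle target:

```python
# Minimal sketch: validate and deploy a Databricks Asset Bundle via the CLI.
import subprocess


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


run(["databricks", "bundle", "validate"])                   # check the bundle configuration
run(["databricks", "bundle", "deploy", "--target", "dev"])  # deploy to the dev target
```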
Using Databricks CLI
There are a variety of ways to use the Databricks CLI, depending on the needs -
1. Databricks Web Terminal: The Databricks Web Terminal can be used to run Databricks CLI commands directly from the web interface.
The Databricks Web Terminal makes it possible to execute shell commands within Databricks, making it easier to manage and interact with the working environment.
By default, the Databricks Web Terminal uses the latest version of the Databricks CLI ensuring that the users have access to the most up-to-date features and functionalities.
Authentication is based on the current user, so the user is automatically authenticated to run commands according to their permissions.
However, keep in mind that this feature must be enabled in the working environment before the Databricks Web Terminal can be used.
2. VS Code: The Databricks CLI can also be installed and run within an Integrated Development Environment (IDE), such as VS Code.
When using VS Code, it is important to authenticate to the Databricks Workspace in order to run Databricks CLI commands.
Additionally, the Databricks extension for VS Code offers extra features to enhance the development process.
3. Databricks Notebooks: Lastly, Databricks CLI commands can also be run from Databricks Notebooks.
Within Databricks Notebooks, it is possible to run shell commands directly using the %sh magic command. This makes it possible to install and use the Databricks CLI, enabling interaction with the working environment from the command line in a Databricks Notebook cell.
To use the Databricks CLI within a Databricks Notebook, the user must authenticate using a token, ensuring secure access to the Databricks Workspace.
That said, using Databricks Notebooks to run the Databricks CLI might not be the best fit for every organization's needs.