Introduction to Unity Catalog in Databricks

Oindrila Chakraborty
9 min read · Aug 14, 2023


What is Unity Catalog?

Unity Catalog is a fine-grained data governance solution for data present in a Data Lake.

Why is Unity Catalog Primarily Used?

Suppose, in a lakehouse project, there is a database in the Hive Metastore of a Databricks Workspace, and, in that database, there are twenty Delta Tables.

If the requirement is to provide a specific set of permissions, like Read Only or Write Only, to a specific Group of Users on one or some of those Delta Tables, or even at the Row Level or Column Level of a Delta Table that contains Personally Identifiable Information, i.e., PII, then the Unity Catalog can simplify the solution to the requirement by implementing the Unified Data Access Control.

So, the primary reason for using the Unity Catalog is that it helps to simplify both the security and the governance of the data by providing a Centralized Place where it is possible to administer, and also audit, the access to the data.
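
For example, one common pattern for the Column-Level protection of the PII is a Dynamic View that masks a sensitive Column for the Users who are not in an approved Group. The following is a minimal sketch in PySpark, assuming hypothetical names for the Catalog (main), the Schema (sales), the Table (customers), and the Group (pii_readers) -

```python
# A minimal sketch of Column-Level PII masking with a Dynamic View.
# The catalog `main`, schema `sales`, table `customers` with an `email`
# PII column, and the account group `pii_readers` are all hypothetical.
spark.sql("""
    CREATE OR REPLACE VIEW main.sales.customers_masked AS
    SELECT
        customer_id,
        -- Only the members of the `pii_readers` group see the raw email
        CASE WHEN is_account_group_member('pii_readers')
             THEN email
             ELSE 'REDACTED'
        END AS email,
        order_total
    FROM main.sales.customers
""")

# Analysts get read access to the masked View, not to the underlying Table
spark.sql("GRANT SELECT ON VIEW main.sales.customers_masked TO `analysts`")
```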

In What Type of Databricks Workspace Can the Unity Catalog be Used?

The Unity Catalog is only available in the Premium pricing tier.

The Unity Catalog is available across multiple Clouds, i.e., on Azure, AWS, and Google Cloud Platform.

Why is Unity Catalog Considered a Unified Catalog?

The Unity Catalog is considered a Unified Catalog in the sense that it can store all the Databricks Workspace Objects, like the data, the Machine Learning Models, the Analytics Artifacts etc., in addition to the Metadata for each of those Databricks Workspace Objects.

Whatever is stored inside the Unity Catalog becomes an Object.

Once an artifact stored inside the Unity Catalog becomes an Object, it is possible to provide selective access to that particular Object.

The Unity Catalog is also considered a Unified Catalog because it is possible to blend in the data from other Catalogs, such as the existing Hive Metastores, into the Unity Catalog.
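
For example, Databricks provides the SYNC command to upgrade the Tables of an existing Hive Metastore into the Unity Catalog. The following is a minimal sketch, assuming a hypothetical External Table, events, in the default Database of the Hive Metastore, with main.default as the destination in the Unity Catalog -

```python
# A minimal sketch of blending an existing Hive Metastore table into the
# Unity Catalog with the SYNC command. The `hive_metastore.default.events`
# source and the `main.default.events` destination are hypothetical;
# SYNC upgrades External Tables.
# DRY RUN only reports what would happen, without making any change.
spark.sql(
    "SYNC TABLE main.default.events "
    "FROM hive_metastore.default.events DRY RUN"
).show(truncate=False)

# Run again without DRY RUN to actually register the Table in Unity Catalog
spark.sql("SYNC TABLE main.default.events FROM hive_metastore.default.events")
```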

What is the Data Lineage Feature in Unity Catalog?

Suppose, in a lakehouse project, a particular Column of a target Delta Table is derived from multiple Columns of three other Delta Tables by using some transformation logic.

In such cases, using the Unity Catalog, it is possible to visualize the Data Lineage of that target Delta Table to get the end-to-end visibility into how the data flows in the lakehouse project from the source layer to the consumption layer.
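
For example, when a query like the following runs on a Unity Catalog-enabled compute, the Table-Level and the Column-Level Lineage of the target Delta Table is captured automatically. The Table and Column names below are hypothetical -

```python
# A minimal sketch with hypothetical Table names. The target Delta Table
# `customer_summary` is derived from three source Delta Tables, and its
# Lineage back to those sources is captured automatically.
spark.sql("""
    CREATE OR REPLACE TABLE main.gold.customer_summary AS
    SELECT c.customer_id,
           c.region,
           SUM(o.amount) AS total_spend,   -- derived from orders.amount
           MAX(l.tier)   AS loyalty_tier   -- derived from loyalty.tier
    FROM   main.silver.customers c
    JOIN   main.silver.orders    o ON o.customer_id = c.customer_id
    JOIN   main.silver.loyalty   l ON l.customer_id = c.customer_id
    GROUP  BY c.customer_id, c.region
""")
# The captured Lineage can then be visualized in the Catalog Explorer,
# under the Lineage tab of the `customer_summary` Table.
```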

What is the Data Sharing Feature in Unity Catalog?

Suppose, in a lakehouse project, there are twenty Delta Tables in a database.

If the requirement is to share the data of those twenty Delta Tables across different platforms, or different Clouds, it can be done by using the Data Sharing feature of the Unity Catalog.

Delta Sharing is an open protocol, developed by Databricks, for secure data sharing with other organizations, or with other teams within the organization, regardless of which computing platform the other teams use.
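
The following is a minimal sketch of how a Share might be created and granted to a Recipient, assuming the Metastore Admin privileges and hypothetical names for the Share, the Tables, and the Recipient -

```python
# A minimal sketch of Delta Sharing, with hypothetical object names.
spark.sql(
    "CREATE SHARE IF NOT EXISTS sales_share "
    "COMMENT 'Tables shared with the partner teams'"
)

# Add some of the Delta Tables of the database to the Share
spark.sql("ALTER SHARE sales_share ADD TABLE main.sales.orders")
spark.sql("ALTER SHARE sales_share ADD TABLE main.sales.customers")

# Create a Recipient, i.e., the other organization, and grant it read access
spark.sql(
    "CREATE RECIPIENT IF NOT EXISTS partner_org "
    "COMMENT 'External data consumer'"
)
spark.sql("GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_org")
```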

What is a Hive Metastore?

In order to manage the structure of the data in a Data Lake, it is possible to register and share the data as Tables in a Hive Metastore (a small registration sketch follows the list below).

A Hive Metastore is a database that holds the Metadata about the data, such as -

  • The schema of the created Tables
  • The paths of the underlying data in the Data Lake of the created Tables
  • The format, like - parquet or delta, in which the underlying data of the created Tables is stored in the Data Lake etc.
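
For example, registering an existing Delta file as a Table stores only the Metadata, i.e., the schema, the path, and the format, in the Hive Metastore. The following is a minimal sketch, assuming a hypothetical Delta file path in an ADLS Gen 2 Data Lake -

```python
# A minimal sketch of registering an existing Delta file as a Table in the
# Hive Metastore. The ADLS Gen 2 path and the Table name are hypothetical;
# only the Metadata (schema, path, format) is stored in the Metastore.
spark.sql("""
    CREATE TABLE IF NOT EXISTS default.trips
    USING DELTA
    LOCATION 'abfss://datalake@mystorageacct.dfs.core.windows.net/bronze/trips'
""")

# Inspect the Metadata that the Hive Metastore holds for the Table
spark.sql("DESCRIBE EXTENDED default.trips").show(truncate=False)
```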

Can the Hive Metastore of One Databricks Workspace be Shared with Another Databricks Workspace?

Every Databricks Workspace in Azure Databricks comes with a managed built-in Metastore.

In a lakehouse project, there will be multiple Databricks Workspaces for different environments, like - development, qa, uat etc., and, the same set of Tables need to be registered in each of those Databricks Workspaces.

A Table that is created in a particular Databricks Workspace cannot be used in another Databricks Workspace, because each Databricks Workspace has a separate Hive Metastore that is accessible only from within that particular Databricks Workspace, and not from the other Databricks Workspaces.

Architecture of Unity Catalog

  • One Unity Catalog can be linked to multiple Databricks Workspaces.

1. Metastore -

Similar to the Hive Metastore, the Unity Catalog also works with a Metastore. The Metastore for the Unity Catalog has to be created manually.

Because of this Metastore, the Unity Catalog has a Centralized Metadata layer, which can be shared across multiple Databricks Workspaces.

The Metadata about the data, along with the Data Lineage and the Audit Logs, will be captured and stored in the Metastore of the Unity Catalog.

2. User Management -

If a specific set of Users, a specific set of Groups, or a Service Principal from the Azure Active Directory has the permission to access a specific Databricks Workspace in a lakehouse project, then those Users, Groups, or the Service Principal can be imported into the User Management of the Unity Catalog for that lakehouse project.

The User Management captures the Users, the Groups, or the Service Principal, and the permissions those have.

Whenever a User, a Group, or a Service Principal tries to access a particular Table inside a Databricks Workspace, that Databricks Workspace talks back to the Unity Catalog to verify whether that User, Group, or Service Principal has the access to that particular Table.

Once the Authentication and the Authorization are both successful, only then will the User, the Group, or the Service Principal be able to view the Rows and the Columns of that particular Table in the Databricks Workspace.

This is how the Unity Catalog actually works.
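
The access verification described above is driven by the privileges granted on the Objects in the Unity Catalog. The following is a minimal sketch of granting, auditing, and revoking the Table-Level access, with hypothetical Table and Group names -

```python
# A minimal sketch of the privilege checks behind this flow, with a
# hypothetical Table `main.finance.invoices` and a group `data_readers`.
# Grant the imported Group read access to one specific Table
spark.sql("GRANT SELECT ON TABLE main.finance.invoices TO `data_readers`")

# Audit which principals currently hold which privileges on the Table
spark.sql("SHOW GRANTS ON TABLE main.finance.invoices").show(truncate=False)

# Revoke the access again if the requirement changes
spark.sql("REVOKE SELECT ON TABLE main.finance.invoices FROM `data_readers`")
```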

Unity Catalog Object Model

In Unity Catalog, the hierarchy of the primary data objects flows from the Metastore to the Tables.

Whenever the Unity Catalog is enabled on a Databricks Workspace, all the data objects present in the Metastore for that Unity Catalog are displayed in the Catalog menu on the left sidebar of that Databricks Workspace.

1. Metastore - The Metastore is the top-level container to store the Metadata in the Unity Catalog.
A Unity Catalog Metastore exposes a three-level namespace, i.e., catalog.schema.table, that organizes the data (a sketch of this hierarchy follows the list).
A Unity Catalog Metastore stores the Metadata about the data assets, i.e., the Tables and the Views, and the permissions that govern the access to the respective data assets.

2. Catalog - A Catalog is the first level of the object hierarchy and is used to organize the data assets.
Users can see all the Catalogs on which they have been assigned the USE CATALOG permission.

3. Schema - A Schema is also known as a Database.
A Schema is the second level of the object hierarchy and is used to organize the Tables and the Views it contains.

4. Table - A Table is at the lowest level in the object hierarchy.
Tables can be External, i.e., stored in External Locations in the Cloud storage of choice, or Managed, i.e., stored in a storage container in the Cloud storage that is created expressly for Azure Databricks.

5. View - A View is a read-only object, created from one or more Tables, that is contained within a Schema.

6. External Location - An External Location is an object, contained within a Unity Catalog Metastore, that combines a Cloud storage path with a Storage Credential that authorizes the access to that path.

7. Storage Credential - A Storage Credential is an object, contained within a Unity Catalog Metastore, that encapsulates a long-term Cloud credential providing the access to the Cloud storage.

8. Function - A Function is a User-Defined Function that is contained within a Schema.

9. Model - A Model is a registered MLflow Model that is contained within a Schema.

10. Share - A Share is a logical grouping of the Tables intended to be shared using Delta Sharing.
A Share is an object that is contained within a Unity Catalog Metastore.

11. Recipient - A Recipient is an object that represents an organization, or a Group of Users, with whom the data is shared using Delta Sharing.
These objects are contained within a Unity Catalog Metastore.

12. Provider - A Provider is an object that represents an organization that has made the data available for sharing using Delta Sharing.
These objects are contained within a Unity Catalog Metastore.
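
The following is a minimal sketch of the Metastore-to-Table hierarchy described above, i.e., creating a Catalog, a Schema, and a Managed Table, and then addressing the Table through the three-level namespace. The Object and Group names are hypothetical -

```python
# A minimal sketch of the three-level namespace, i.e.,
# Metastore -> Catalog -> Schema -> Table, with hypothetical names.
spark.sql("CREATE CATALOG IF NOT EXISTS main")
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data_engineers`")

spark.sql("CREATE SCHEMA IF NOT EXISTS main.bronze")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.bronze TO `data_engineers`")

# A Managed Table: Unity Catalog manages the underlying storage location
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.bronze.raw_events (
        event_id   STRING,
        event_time TIMESTAMP
    )
""")

# Every data asset is addressed as catalog.schema.table
spark.sql("SELECT * FROM main.bronze.raw_events LIMIT 10").show()
```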

Who Can Set Up the Unity Catalog in Databricks?

To set up the Unity Catalog in a Databricks Workspace, the User should have the Account Global Admin privileges on the Azure Subscription, or must have the Owner access privileges.

Why the Access Connector for Azure Databricks Service is Required to Set Up the Unity Catalog?

While implementing the Unity Catalog, the Access Connector for Azure Databricks is used.

The Access Connector for Azure Databricks is an Azure Service that serves as the link between the following two Azure Services -

1. Storage Account - A Storage Account is created, which is used as the Metastore for the Unity Catalog.
The Access Connector for Azure Databricks service will have access to that Storage Account, which is being used as the Metastore.

2. Databricks Workspace - The Access Connector for Azure Databricks will also connect the Storage Account to the Databricks Workspace, on which the Unity Catalog would be applied.

So, any User who is working on the Databricks Workspace will not need direct access to the Storage Account that is used as the Metastore for the Unity Catalog. This is because the Access Connector for Azure Databricks already has access to the Storage Account, which in turn is linked to the Databricks Workspace.

Steps to Enable the Unity Catalog in Azure Databricks

Following are the steps to enable the Unity Catalog in a Databricks Workspace -

1. Create a Resource Group.

2. Create a Storage Account, which is to be used as the Metastore for the Unity Catalog, inside the same Resource Group and under the same Region as the Resource Group.

3. Create a Databricks Workspace of the Premium Tier, on which the Unity Catalog would be set up, inside the same Resource Group and under the same Region as the Resource Group.
Since the Unity Catalog is only available in the Premium Pricing Tier, it must be made sure that Premium is selected as the Pricing Tier of the Databricks Workspace to be created.

4. Create an Access Connector for Azure Databricks, inside the same Resource Group and under the same Region as the Resource Group.
It must be made sure that the Managed Identity feature is turned on while creating the Access Connector for Azure Databricks service.

5. Assign the Access Connector for Azure Databricks the Storage Blob Data Contributor role on the Storage Account that is to be used as the Metastore for the Unity Catalog.

6. Create a Container inside the Storage Account, so that the Metadata can be stored in that particular Container.

7. Open the Databricks Workspace.
Click on the top right corner, where the user's mail id is displayed.
From the menu options, click on Manage Account. This opens a completely new UI. The URL of this new UI is accounts.azuredatabricks.net. This UI is used to set up the Unity Catalog manually.

8. From the left sidebar menu, click on Create a metastore.
The Data page is opened. Click on the Create metastore button.
The Create Metastore page is opened. Provide the following information -
A. Provide the name of the Metastore to be created in the Name textbox.

B. Select the same Region as the Databricks Workspace from the Region dropdown.

C. Provide the path of the Container that is created inside the Storage Account, which is used as the Metastore for the Unity Catalog, in the ADLS Gen 2 path textbox.
The format in which the path of the Container needs to be provided is - container_name@storage_account_name.dfs.core.windows.net.
If there is any particular folder inside the Container where the Metadata can be stored, that folder path can be provided as - container_name@storage_account_name.dfs.core.windows.net/folder_path

D. Copy the Resource ID of the Access Connector for Azure Databricks from its Overview page, and provide that in the Access Connector Id textbox.

E. Finally, click on the Create button.

9. Now, the next part is to select one or multiple desired Databricks Workspaces to assign to the Metastore, and click on the Assign button.

A Dialogue Box is opened with Enable Unity Catalog? as the header. Click on the Enable button.

10. In this way, the Unity Catalog is finally implemented on one or multiple Databricks Workspaces at the same time.

Important Features of Unity Catalog

To configure the Unity Catalog, the User must be a Global Account Admin.

There can be only one Unity Catalog Metastore per Region.

Each Databricks Workspace can be attached to only one Unity Catalog Metastore.

One Unity Catalog Metastore can be attached to multiple Databricks Workspaces.

It is not possible to assign a Unity Catalog Metastore, which is created in Region-A, to a Databricks Workspace, which is created in Region-B.
