Introduction to “Unity Catalog” in Databricks

Oindrila Chakraborty
10 min read · Aug 14, 2023


What is “Unity Catalog”?

  • “Unity Catalog” is a “Fine-Grained Data Governance Solution” for “Data” present in a “Data Lake”.

Why is “Unity Catalog” “Primarily” Used?

  • Suppose, in a “Lake House Project”, there is a “Database” in the “Hive Metastore” of a “Databricks Workspace”, and, in that “Database”, there are “Twenty Delta Tables” present.
  • If the “Requirement” is to “Provide” a “Specific Set” of “Permissions”, like “Read Only”, or, “Write Only”, to a “Specific Group” of “Users” on “One”, or, “Some”, of those “Delta Tables”, or, even at the “Row Level”, or, the “Column Level” of a “Particular Delta Table”, which can “Contain” “Personally Identifiable Information”, i.e., “PII”, then the “Unity Catalog” can “Simplify” the “Solution” to that “Requirement” by “Implementing” a “Unified Data Access Control”.
  • So, the “Primary Reason” for using the “Unity Catalog” is that it “Helps” to “Simplify” the “Security”, as well as, the “Governance” of the “Data” by “Providing” a “Centralized Place”, where it is possible to “Administer” the “Access” to the “Data”, and, also “Audit” the “Data Access”.
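As a sketch of such a “Unified Data Access Control”, table-level access can be granted with SQL `GRANT` statements, and a “PII” column can be hidden behind a dynamic view; all the catalog, schema, table, column, and group names below are hypothetical:

```sql
-- Grant read-only access on one specific Delta Table to one specific group.
GRANT SELECT ON TABLE lakehouse_catalog.sales_db.customer_orders TO `data_analysts`;

-- Column-level control: expose a dynamic view instead of the raw table,
-- so that a PII column is masked for everyone outside a privileged group.
CREATE OR REPLACE VIEW lakehouse_catalog.sales_db.customer_orders_masked AS
SELECT
  order_id,
  order_amount,
  CASE
    WHEN is_account_group_member('pii_readers') THEN email  -- full value for the privileged group
    ELSE '****'                                             -- masked for everyone else
  END AS email
FROM lakehouse_catalog.sales_db.customer_orders;

GRANT SELECT ON TABLE lakehouse_catalog.sales_db.customer_orders_masked TO `data_analysts`;
```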

In “What Type” of “Databricks Workspace” can the “Unity Catalog” be Used?

  • The “Unity Catalog” is “Only Available” in the “Premium” Pricing Tier.
  • The “Unity Catalog” is “Available” across “Multiple Clouds”, i.e., on “Azure”, “Amazon”, and, “Google Cloud Platform”.

Why is “Unity Catalog” Considered a “Unified Catalog”?

  • The “Unity Catalog” is “Considered” as a “Unified Catalog” in the sense that it can “Store” “All” the “Databricks Workspace Objects”, like — “Data”, “Machine Learning Models”, “Analytics Artifacts” etc., in “Addition” to the “Metadata” for “Each” of the “Databricks Workspace Objects”.
  • Whatever is “Stored” inside the “Unity Catalog”, becomes an “Object”.
  • Once an “Artifact” that is “Stored” inside the “Unity Catalog” becomes an “Object”, it is possible to “Provide” the “Selective Access” to that “Particular Object”.
  • The “Unity Catalog” is also “Considered” as a “Unified Catalog” because, it is possible to “Blend” in the “Data” from “Other Catalogs”, such as “Existing Hive Metastores” into the “Unity Catalog”.
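For example, the existing “Hive Metastore” of a workspace surfaces inside the “Unity Catalog” as a catalog named `hive_metastore`, so legacy tables can be queried alongside, or copied into, “Unity Catalog” schemas; the database and table names below are hypothetical:

```sql
-- The legacy Hive Metastore is visible as the catalog "hive_metastore",
-- so its tables can be queried next to Unity Catalog tables.
SELECT * FROM hive_metastore.legacy_db.sales_2022 LIMIT 10;

-- A legacy table can also be copied into a Unity Catalog schema.
CREATE TABLE lakehouse_catalog.sales_db.sales_2022 AS
SELECT * FROM hive_metastore.legacy_db.sales_2022;
```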

What is the “Data Lineage” Feature in “Unity Catalog”?

  • Suppose, in a “Lake House Project”, a “Particular Column” of a “Target Delta Table” is “Derived” from “Multiple Columns” of “Three Other Delta Tables” by using some “Transformation Logic”.
  • In such cases, using the “Unity Catalog”, it is possible to “Visualize” the “Data Lineage” of that “Target Delta Table” to get the “End-To-End Visibility” into “How” the “Data Flows” in the “Lake House Project” from the “Source Layer” to the “Consumption Layer”.
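Besides the lineage graph in the UI, table-level lineage can also be queried from the lineage system table, assuming system tables are enabled on the metastore; the target table name below is hypothetical:

```sql
-- Which upstream tables feed the target Delta Table?
SELECT source_table_full_name, target_table_full_name, entity_type, event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'lakehouse_catalog.gold_db.target_table'
ORDER BY event_time DESC;
```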

What is the “Data Sharing” Feature in “Unity Catalog”?

  • Suppose, in a “Lake House Project”, there are “Twenty Delta Tables” in a “Database”.
  • If the “Requirement” is to “Share” the “Data” of those “Twenty Delta Tables” across “Different Platforms”, or, “Different Clouds”, it can be done by using the “Data Sharing” Feature of the “Unity Catalog”.
  • This Feature is built on “Delta Sharing”, which is a “Protocol”, “Developed” by “Databricks”, for “Secure Data Sharing” with “Other Organizations”, or, with “Other Teams” within the “Organization”, regardless of which “Computing Platform” the “Other Teams” use.
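A minimal “Delta Sharing” sketch in SQL, assuming the sharer has the required privileges; the share, recipient, and table names are hypothetical:

```sql
-- Group the tables to be shared into a "Share".
CREATE SHARE sales_share COMMENT 'Delta Tables shared with a partner team';
ALTER SHARE sales_share ADD TABLE lakehouse_catalog.sales_db.customer_orders;

-- Register the receiving organization as a "Recipient" and grant it the Share.
CREATE RECIPIENT partner_org COMMENT 'External partner on a different platform';
GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_org;
```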

What is a “Hive Metastore”?

  • In order to “Manage” the “Structure” of the “Data” in a “Data Lake”, it is possible to “Register”, and, “Share” the “Data” as “Tables” in a “Hive Metastore”.
  • A “Hive Metastore” is a “Database” that “Holds” the “Metadata” about the “Data”, such as -
  • The “Schema” of the “Created Tables” in the “Hive Metastore”,
  • The “Paths” of the “Underlying Data” in the “Data Lake” of the “Created Tables” in the “Hive Metastore”,
  • The “Format”, like — “parquet”, or, “delta”, in which the “Underlying Data” of the “Created Tables” in the “Hive Metastore” is “Stored” in the “Data Lake” etc.
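All three pieces of “Metadata” can be inspected directly; `DESCRIBE EXTENDED` prints the column schema, the storage location in the “Data Lake”, and the storage format of a registered table (the database and table names below are hypothetical):

```sql
-- Surfaces the metadata the Hive Metastore holds for a table:
-- column schema, the underlying storage path, and the format (parquet, delta, ...).
DESCRIBE EXTENDED my_database.my_delta_table;
```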

Can “Hive Metastore” of “One Databricks Workspace” be “Shared” with “Another Databricks Workspace”?

  • “Every Databricks Workspace” in “Azure Databricks” comes with a “Managed Built-In Metastore”.
  • In a “Lake House Project”, there will be “Multiple Databricks Workspaces” for “Different Environments”, like — “Development”, “QA”, “Staging” etc., and, the “Same Set” of “Tables” need to be “Registered” in “Each” of those “Databricks Workspaces”.
  • The “Table” that is “Created” in a “Particular Databricks Workspace”, can “Not” be “Used” in “Another Databricks Workspace”, because “Each Databricks Workspace” has a “Separate Hive Metastore” that can be “Accessible” only from within that “Particular Databricks Workspace”, and, “Not” from “Other Databricks Workspaces”.

“Architecture” of “Unity Catalog”

  • “One Unity Catalog” can be “Linked” to “Multiple Databricks Workspaces”.

1. Metastore -

  • Similar to the “Hive Metastore”, the “Unity Catalog” also works on the “Metastore”. The “Metastore” for the “Unity Catalog” has to be “Created Manually”.
  • Because of this “Metastore”, the “Unity Catalog” has a “Centralized Metadata Layer”, which can be “Shared” across “Multiple Databricks Workspaces”.
  • The “Metadata” about the “Data”, the “Data Lineage”, and, the “Audit Logs” will be “Captured” and “Stored” in the “Metastore” of the “Unity Catalog”, while the “Data Itself” remains in the “Cloud Storage”.

2. User Management -

  • If a “Specific Set of Users”, a “Specific Set of Groups”, or, a “Service Principal” from the “Azure Active Directory” has the “Permission” to “Access” a “Specific Databricks Workspace” in a “Lake House Project”, those “Users”, “Groups”, or, that “Service Principal” can be “Imported” into the “User Management” of the “Unity Catalog” for that “Lake House Project”.
  • The “User Management” “Captures” the “Users”, the “Groups”, or, the “Service Principal”, and, the “Permissions” those have.

Whenever, a “User”, or, a “Group”, or, a “Service Principal” tries to “Access” a “Particular Table” inside a “Databricks Workspace”, that “Databricks Workspace” will “Talk Back” to the “Unity Catalog” to “Verify” if that “User”, or, the “Group”, or, the “Service Principal” has the “Access” to that “Particular Table”.

Once, the “Authentication” and the “Authorization” are both “Successful”, only then the “User”, or, the “Group”, or, the “Service Principal” will be able to “View” the “Rows”, and, the “Columns” of that “Particular Table” in the “Databricks Workspace”.

This is how the “Unity Catalog” actually works.
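The “Permissions” behind this check can themselves be inspected and revoked with SQL; the table and group names below are hypothetical:

```sql
-- Verify which principals hold which privileges on a table.
SHOW GRANTS ON TABLE lakehouse_catalog.sales_db.customer_orders;

-- Revoke the access again if it is no longer required.
REVOKE SELECT ON TABLE lakehouse_catalog.sales_db.customer_orders FROM `data_analysts`;
```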

Unity Catalog Object Model

In “Unity Catalog”, the “Hierarchy” of the “Primary Data Objects” “Flows” from the “Metastore” to the “Tables”.

Whenever, the “Unity Catalog” is “Enabled” on a “Databricks Workspace”, “All” the “Data Objects” present in the “Metastore” for that “Unity Catalog” are “Displayed” in the “Data Explorer” of that “Databricks Workspace”.

  • 1. Metastore — “Metastore” is the “Top-Level Container” to “Store” the “Metadata” in the “Unity Catalog”.
    A “Unity Catalog Metastore” “Exposes” a “Three-Level Namespace”, i.e., “catalog.schema.table”, that “Organizes” the “Data”.
    A “Unity Catalog Metastore” “Stores” the “Metadata” about the “Data Assets”, i.e., “Tables” and “Views”, and the “Permissions” that “Govern” the “Access” to the “Data Assets”.
  • 2. Catalog — “Catalog” is the “First-Level” of the “Object Hierarchy” that is used to “Organize” the “Data Assets”.
    “Users” can “See” “All” the “Catalogs” on which they have been “Assigned” the “USE CATALOG” Permission.
  • 3. Schema — “Schema” is also known as the “Database”.
    “Schema” is the “Second-Level” of the “Object Hierarchy” that is used to “Organize” the “Tables” and “Views” it “Contains”.
  • 4. Table — “Table” is the “Lowest-Level” in the “Object Hierarchy”.
    “Tables” can be “External”, i.e., “Stored” in the “External Locations” of the “Cloud Storage” of “Choice”, or, “Managed”, i.e., “Stored” in a “Storage Container” of the “Cloud Storage” that is “Created Expressly” for “Azure Databricks”.
  • 5. View — A “View” is a “Read-Only Object” that is “Created” from “One”, or, “More”, “Tables”, and is “Contained” within a “Schema”.
  • 6. External Location — An “External Location” is an “Object”, “Contained” within a “Unity Catalog Metastore”, that “Combines” a “Cloud Storage Path” with a “Reference” to a “Storage Credential” that “Authorizes” the “Access” to that “Path”.
  • 7. Storage Credential — A “Storage Credential” is an “Object”, “Contained” within a “Unity Catalog Metastore”, that “Encapsulates” a “Long-Term Cloud Credential”, which provides the “Access” to the “Cloud Storage”.
  • 8. Function — A “Function” is a “User-Defined Function” that is “Contained” within a “Schema”.
  • 9. Model — A “Model” is a “Registered MLflow Model” that is “Contained” within a “Schema”.
  • 10. Share — A “Share” is a “Logical Grouping” of the “Tables” that is “Intended” to be “Shared” using “Delta Sharing”.
    A “Share” is an “Object” that is “Contained” within a “Unity Catalog Metastore”.
  • 11. Recipient — A “Recipient” is an “Object” that “Represents” an “Organization”, or, a “Group of Users”, with “Whom” the “Data” is “Shared” using “Delta Sharing”.
    These “Objects” are “Contained” within a “Unity Catalog Metastore”.
  • 12. Provider — A “Provider” is an “Object” that “Represents” an “Organization” that has “Made” the “Data” “Available” for “Sharing” using “Delta Sharing”.
    These “Objects” are “Contained” within a “Unity Catalog Metastore”.
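The object hierarchy above surfaces directly in SQL through the “Three-Level Namespace”; the catalog, schema, and table names below are hypothetical:

```sql
-- The three-level namespace in practice: catalog.schema.table
SELECT * FROM lakehouse_catalog.sales_db.customer_orders;

-- Browsing each level of the hierarchy:
SHOW CATALOGS;
SHOW SCHEMAS IN lakehouse_catalog;
SHOW TABLES IN lakehouse_catalog.sales_db;
```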

Who Can “Set Up” the “Unity Catalog” in “Databricks”?

  • To “Set Up” the “Unity Catalog” in a “Databricks Workspace”, the “User” should have the “Account Global Admin” Privileges on the “Azure Subscription”, or, must have the “Owner” Access Privileges.

Why the “Access Connector for Azure Databricks” Service is “Required” to “Set Up” the “Unity Catalog”?

While “Implementing” the “Unity Catalog”, the “Access Connector for Azure Databricks” is used.

The “Access Connector for Azure Databricks” is an “Azure Service”, which “Serves” as the “Link” between the following “Two” “Azure Services” -

  • 1. Storage Account — A “Storage Account” is “Created”, which is “Used” as the “Metastore” for the “Unity Catalog”.
    The “Access Connector for Azure Databricks” Service will have “Access” to that “Storage Account”, which is being “Used” as the “Metastore”.
  • 2. Databricks Workspace — The “Access Connector for Azure Databricks” will also “Connect” the “Storage Account” to the “Databricks Workspace”, on which the “Unity Catalog” would be “Applied”.

So, any “User”, who is “Working” on the “Databricks Workspace”, will “Not Need” a “Direct Access” to the “Storage Account” that is “Used” as the “Metastore” for the “Unity Catalog”. This is because the “Access Connector for Azure Databricks” already has the “Access” to that “Storage Account”, and is, “In Turn”, “Linked” to the “Databricks Workspace”.

“Steps” to “Enable” the “Unity Catalog” in “Databricks”

Following are the “Steps” to “Enable” the “Unity Catalog” in a “Databricks Workspace” -

  • 1. “Create” a “Resource Group”.
  • 2. “Create” a “Storage Account”, that is to be used as the “Metastore” for the “Unity Catalog”, inside the “Same” “Resource Group” and under the “Same” “Region” as the “Resource Group”.
  • 3. “Create” a “Databricks Workspace” on which the “Unity Catalog” would be “Set Up”, inside the “Same” “Resource Group” and under the “Same” “Region” as the “Resource Group”.
    Since the “Unity Catalog” is “Only Available” in the “Premium” Pricing Tier, it must be made sure to “Select” “Premium” as the “Pricing Tier” of the “Databricks Workspace” to be “Created”.
  • 4. “Create” an “Access Connector for Azure Databricks”, inside the “Same” “Resource Group” and under the “Same” “Region” as the “Resource Group”.
    It must be made sure that the “Managed Identity” Feature is “On” while “Creating” the “Access Connector for Azure Databricks” Service.
  • 5. “Assign” the “Access Connector for Azure Databricks” the “Storage Blob Data Contributor” Role on the “Storage Account”, that is to be used as the “Metastore” for the “Unity Catalog”.
  • 6. “Create” a “Container” inside the “Storage Account”, so that, the “Metadata” can be “Stored” in that “Particular Container”.
  • 7. “Open” the “Databricks Workspace”.
    “Click” on the “Top Right Corner”, where the “User Mail Id” is “Displayed”.
    From the “Menu Options”, “Click” on the “Manage Account” Option. This will “Open” a “Completely New UI”. The “URL” of this “New UI” is “accounts.azuredatabricks.net”. This “UI” is used to “Set Up” the “Unity Catalog” “Manually”.
  • 8. From the “Left Side Bar Menu”, “Click” on the “Data” Option to “Create a Metastore”.
    The “Data” Page is “Opened”. “Click” on the “Create metastore” Button.
    The “Create Metastore” Page is “Opened”. “Provide” the following information -
    A. “Provide” the “Name” of the “Metastore” to be “Created” in the “Name” Textbox.
    B. “Select” the “Same” “Region” as the “Databricks Workspace” from the “Region” Drop Down.
    C. “Provide” the “Path” of the “Container” that is “Created” inside the “Storage Account”, which is used as the “Metastore” for the “Unity Catalog”, in the “ADLS Gen 2 path” Textbox.
    The “Format” in which the “Path” of the “Container” needs to be “Provided” is — “container_name@storage_account_name.dfs.core.windows.net”.
    If there is any “Particular Folder”, inside the “Container”, where the “Metadata” can be “Stored”, that “Folder Path” can be “Provided” as — “container_name@storage_account_name.dfs.core.windows.net/folder_path”.
    D. “Copy” the “Resource ID” of the “Access Connector for Azure Databricks” from its “Overview” Page, and, “Provide” that in the “Access Connector Id” Textbox.
    E. Finally, “Click” on the “Create” Button.
  • 9. Now, the “Next Part” is to “Select” “One”, or, “Multiple Desired” “Databricks Workspaces” to “Assign” to the “Metastore”, and, “Click” on the “Assign” Button.
    A “Dialogue Box” is “Opened” with “Enable Unity Catalog?” as the “Header” of the “Dialogue Box”. “Click” on the “Enable” Button.
  • 10. Finally, this way the “Unity Catalog” is “Implemented” on a “Databricks Workspace”.
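The steps above can be verified from a notebook in the assigned workspace; both statements below are standard Databricks SQL:

```sql
-- Confirm that the workspace is now attached to the Unity Catalog metastore.
SELECT current_metastore();

-- A freshly assigned workspace exposes the default catalogs,
-- e.g. "main" and "hive_metastore".
SHOW CATALOGS;
```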

Important Features of “Unity Catalog”

  • To “Configure” the “Unity Catalog”, the “User” must be a “Global Account Admin”.
  • There can be “Only One Unity Catalog Metastore” “Per Region”.
  • “Each Databricks Workspace” can be “Attached” to “Only One Unity Catalog Metastore”.
  • “One Unity Catalog Metastore” can be “Attached” to “Multiple Databricks Workspaces”.
  • It is “Not Possible” to “Assign” a “Unity Catalog Metastore”, which is “Created” in the “Region-A”, to a “Databricks Workspace”, which is “Created” in the “Region-B”.
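Once a workspace is attached to the metastore, new data objects are created top-down through the hierarchy described earlier; a minimal sketch, with hypothetical names:

```sql
-- Catalog -> Schema -> Table, created in order.
CREATE CATALOG IF NOT EXISTS lakehouse_catalog;
CREATE SCHEMA IF NOT EXISTS lakehouse_catalog.sales_db;
CREATE TABLE IF NOT EXISTS lakehouse_catalog.sales_db.customer_orders (
  order_id BIGINT,
  order_amount DOUBLE,
  email STRING
);
```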
