Databricks : Unity Catalog - First Step - Part 2 - Data and Rights Management

We are going to explore data management with the Unity Catalog solution, and more precisely the management of rights (Groups and Users) and of storage (Tables).

We will use a Databricks Account on AWS to perform this discovery.

Note: This work is based on the state of the Unity Catalog solution at the end of Q1 2023 on AWS and Azure.

Unity Catalog and the object hierarchy

Unity Catalog is the Databricks solution that provides unified and centralized governance for all the data managed by Databricks resources, and that secures and facilitates the management and sharing of data with all the internal and external actors of an organization.

You can get a more complete overview by reading the official documentation.

Reminder concerning the hierarchy of objects : schema_01

The objects are :

  • Storage Credential : Object directly associated with the Metastore, used to store the credentials for a cloud provider (for example AWS S3), allowing the Unity Catalog solution to manage data rights
  • External Location : Object associated with the Metastore, used to store the path to a cloud provider location (for example an AWS S3 resource); combined with a Storage Credential, it is used to manage data access
  • Metastore : Top-level object that contains the metadata
  • Catalog : 1st-level object used to organize data by Schema (also called Database)
  • Schema : 2nd- and last-level object used to organize data (it contains tables, views and functions)
  • Table : Object used to define the structure and storage of data
  • View : Object used to encapsulate a query using one or more objects (tables or views)
  • Function : Object used to define operations on the data

When you want to access an object (for example a table), you must use the name of the Catalog and the name of the Schema in which the object is defined. Example : select * from catalog.schema.table
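
For example, a minimal sketch (with placeholder names) showing a fully qualified access and how to define a default Catalog and Schema for the session with the USE statements:

-- Fully qualified access (Catalog.Schema.Table)
SELECT * FROM my_catalog.my_schema.my_table;

-- Define a default Catalog and Schema for the session, then use the short name
USE CATALOG my_catalog;
USE SCHEMA my_schema;
SELECT * FROM my_table;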

Set-up the environment

For this discovery, we are going to set up a certain number of elements to be able to handle the different concepts.

Synthesis

Regarding Databricks resources :

  • The Unity Catalog solution must be activated at the Databricks Account level
  • A Unity Catalog Metastore must be attached to the Databricks Workspace
  • A SQL Warehouse must exist in the Databricks Workspace

Regarding groups and users, we will use the following elements: schema_02

Prerequisites :

  • A user with admin rights on the Databricks Account and on the Databricks Workspace
  • No group named grp_demo must already exist at the Databricks Account level or at the Databricks Workspace level
  • No user named john.do.dbx@gmail.com must already exist at the Databricks Account level or at the Databricks Workspace level

Summary of the actions that will be done :

  • Creation of a group named grp_demo at the Databricks Account level
  • Creation of a user named john.do.dbx@gmail.com at the Databricks Account level
  • Addition of the grp_demo group (created at the Databricks Account level) to the Databricks Workspace
  • Addition of rights on the Databricks Workspace for the grp_demo group
  • Addition of the needed rights on Unity Catalog objects, at the Databricks Workspace level, for the grp_demo group

Regarding Unity Catalog objects, we will use the following elements: schema_03

Prerequisites :

  • Existence of a Metastore named metastore-sandbox with the Storage Credential named sc-metastore-sandbox to store default data in the AWS S3 resource named s3-dbx-metastore-uc
  • Existence of an AWS S3 resource named s3-demo-data-uc
  • Existence of an AWS IAM role named role-databricks-demo-data-uc and an AWS IAM policy named policy-databricks-demo-data-uc allowing the Databricks global role to manage access to the AWS S3 resource s3-demo-data-uc
  • Existence of a Storage Credential named sc-demo-data-uc allowing the Databricks global role to manage access to the AWS S3 resource s3-demo-data-uc

Summary of the actions that will be done :

  • Creation of an External Location to be able to access the data of the AWS S3 resource named s3-demo-data-uc with Unity Catalog
  • Creation of the Catalog ctg_ipp which will contain all the managed elements (stored on the AWS S3 resource named s3-dbx-metastore-uc associated with the Metastore metastore-sandbox)
  • Creation of the Catalog ctg_ext which will contain all the external elements (stored on the AWS S3 resource named s3-demo-data-uc)
  • Creation of the Schema ctg_ipp.sch_bronze, which will contain the objects managed by Unity Catalog whose data is stored on the AWS S3 resource named s3-dbx-metastore-uc (in Delta format only)
  • Creation of the Schema ctg_ext.sch_ref, which will contain the objects used to access the data stored on the AWS S3 resource named s3-demo-data-uc (as CSV files or Delta files)

Setting up

Setting up a dataset on the AWS S3 resource

Content of the file ref_products.csv:

id,lib,brand,os,last_maj
1,Pixel 7 Pro,Google,Android,2023-01-01 09:00:00
2,Iphone 14,Apple,IOS,2023-01-01 09:00:00
3,Galaxy S23,Samsung,Android,2023-01-01 09:00:00

Content of the file fct_transactions.csv :

id_trx,ts_trx,id_product,id_shop,id_client,quantity
1,2023-04-01 09:00:00,1,2,1,1
2,2023-04-01 11:00:00,1,1,1,3
3,2023-04-03 14:00:00,1,2,1,1
4,2023-04-05 08:00:00,3,1,2,9
5,2023-04-06 10:00:00,1,2,1,3
6,2023-04-06 12:00:00,2,2,1,1
7,2023-04-10 18:30:00,2,1,2,11
8,2023-04-10 18:30:00,3,1,2,2

Copying data to the demo directory of the AWS S3 resource named s3-demo-data-uc with the AWS CLI tool :

aws s3 cp ref_products.csv s3://s3-demo-data-uc/demo/ref_products.csv
aws s3 cp fct_transactions.csv s3://s3-demo-data-uc/demo/fct_transactions.csv

Setting up the containers on the Unity Catalog Metastore

The steps are as follows (note: use a user with admin rights on the Metastore):

  1. Creation of an External Location to allow users to store data in the AWS S3 resource named s3-demo-data-uc
  2. Creation of the Catalogs ctg_ipp and ctg_ext
  3. Creation of the Schemas ctg_ipp.sch_bronze and ctg_ext.sch_ref
-- 1. Create external location to access data from s3-demo-data-uc resource
CREATE EXTERNAL LOCATION IF NOT EXISTS el_demo_data_uc
URL 's3://s3-demo-data-uc'
WITH (STORAGE CREDENTIAL `sc-demo-data-uc`)
COMMENT 'External Location to access demo data';

-- 2. Create Catalog ctg_ipp
CREATE CATALOG IF NOT EXISTS ctg_ipp
    COMMENT 'Catalog for managed data';

-- 2. Create Catalog ctg_ext (for external data)
CREATE CATALOG IF NOT EXISTS ctg_ext
    COMMENT 'Catalog for external data'
;

-- 3. Create Schema sch_bronze in the Catalog ctg_ipp to store managed data
CREATE SCHEMA IF NOT EXISTS ctg_ipp.sch_bronze
    COMMENT 'Schema for Bronze Data'
;

-- 3. Create Schema sch_ref in the Catalog ctg_ext to store external data
CREATE SCHEMA IF NOT EXISTS ctg_ext.sch_ref
    COMMENT 'Schema for External Data'
;

Group management

Synthesis

A group is an object used to manage access rights for users or other groups. This makes it possible to organize access by teams or profiles rather than granting rights directly to individual users.

There are two types of groups :

  • A group at the Databricks Account level makes it possible to manage access to data with the Unity Catalog solution (centralization)
  • A local group at the Databricks Workspace level only allows access management at the Workspace level and is not compatible with the Unity Catalog solution (it is mainly used with the Hive Metastore and for Databricks Workspace rights)

Quota : You can have up to a combination of 10,000 users and 5,000 groups in a Databricks Account

When a Unity Catalog Metastore is attached to a Databricks Workspace, it is no longer possible to create a local group directly from the Databricks Workspace interface.

In the context of access management with Unity Catalog, the recommendations are as follows:

  • Manage all rights based on the groups defined at the Databricks Account level
  • A group must be created at the Databricks Account level before it can be added to a Databricks Workspace
  • Adding or removing a member (user or group) from a group must be done at the Databricks Account level
  • A user can be created at the Databricks Workspace level; it will automatically be added at the Databricks Account level (Warning: it is strongly recommended to manage users and groups with an IdP (Identity Provider) (Azure Active Directory, AWS IAM, …) to have centralized and secure management of Databricks accounts)
  • Data access management is done at the Databricks Workspace level (by the Databricks Workspace administrator or the object owner)
  • It is recommended to always define rights at the group level rather than at the user level, to ease the management of rights over time
    • You can manage groups on several levels to organize rights and users
  • When importing (creating) a group at the Databricks Workspace level, the Workspace administrator must define the rights (Entitlements) of the group on the Databricks Workspace
  • If you want to give another user the same rights as the user john.do.dbx@gmail.com, you will only need to add this other user to the same groups as the user john.do.dbx@gmail.com

Regarding a group deletion :

  • If you delete the group from the Databricks Workspace, it will still exist at the Databricks Account level
  • If you remove the group from the Databricks Account, it will automatically be removed from all Databricks Workspaces

Practical application

Prerequisites :

  • Have an account with the Databricks Account administration role/right
  • Have an account with the Databricks Workspace administration role/right
  • Have an existing SQL Warehouse on the Databricks Workspace (so users can run SQL queries on Databricks)

For the creation of the group and the user, the following actions must be done at the Databricks Account level:

  1. Create a user named john.do.dbx@gmail.com
  2. Create a group named grp_demo and add the user john.do.dbx@gmail.com to the group grp_demo

For the user to be able to access the Databricks Workspace resources, the following actions must be done at the Databricks Workspace level:

  3. Import the grp_demo group into the Databricks Workspace
  4. Add Databricks Workspace-level rights to the group

Actions :

  1. Step n°1
    1. Go to Account Administration page > User Management > Users
    2. Click on Add User
    3. Fill in the Email, First name and Last name fields and click on Send invite
  2. Step n°2
    1. Go to Account Administration page > User Management > Groups
    2. Click on Add Group
    3. Fill in the Group name field and click on Save
    4. Click on Add members to add the user john.do.dbx@gmail.com
  3. Step n°3
    1. Go to Workspace page > username > Admin Settings > Groups
    2. Click on Add Group
    3. Choose the group (from the list of existing groups at the Databricks Account level) and click on Add
  4. Step n°4
    1. Go to Workspace page > username > Admin Settings > Groups
    2. Click on the group
    3. Click on Entitlements
      1. Check the Workspace access option to give access to the Databricks Workspace to all group users
      2. Check the Databricks SQL access option to give access to the SQL resources on the Databricks Workspace to all group users

If you want to give access to a SQL Warehouse to a group, follow these steps:

  1. Go to Workspace page > SQL > SQL Warehouses
  2. Click on the button with 3 dots and choose the Permissions option
  3. Add the group and choose the Can use option

Note: It is also possible to add the group at the Databricks Workspace level with the Databricks CLI tool

# Add a group in the Databricks Workspace (the group must exist at the Databricks Account level)
databricks groups create --group-name "grp_demo"

Rights management

Synthesis

Rights management is based on the ANSI SQL syntax, more specifically on the GRANT and REVOKE statements.

You will find an exhaustive list of the rights that can be managed with Unity Catalog on the official documentation

To give a right to a group or a user, the syntax is the following :

GRANT <Rights with comma separator> ON <Type Object> <Name Object> TO <Group or User>;

To delete a right from a group or a user, the syntax is the following :

REVOKE <Rights with comma separator> ON <Type Object> <Name Object> FROM <Group or User>;

Note: It is mandatory to have the USE/USAGE right on the Catalog and on the Schema to access a table, even if you already have the SELECT right on it; otherwise the user (or the group) will not be able to see the contents of the Catalog and of the Schema by default.

For example: if the owner of the table ctg_ext.sch_ref.tbl_demo gives the SELECT right to the group grp_demo, the users of the group grp_demo will not be able to read the data of the table tbl_demo until they also have the USE right on the Catalog ctg_ext and on the Schema sch_ref.
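
As an illustration, a minimal sketch of the grants needed in this example (using the example table tbl_demo from the note above; this is a sketch, assuming the table exists):

-- SELECT alone is not enough, the group also needs the USE rights on the parent Catalog and Schema
GRANT USE CATALOG ON CATALOG ctg_ext TO grp_demo;
GRANT USE SCHEMA ON SCHEMA ctg_ext.sch_ref TO grp_demo;
GRANT SELECT ON TABLE ctg_ext.sch_ref.tbl_demo TO grp_demo;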

It is possible to manage rights at the object (table/view) level, but it is recommended to organize data and rights at the Schema level, when the organization allows it, to ease access management for the different teams and profiles of the organization.

Practical application

We have already created the Catalogs and Schemas with an admin account and we want to give the possibility to the user john.do.dbx@gmail.com to manage the objects in the different Catalogs and Schemas.

By default, the grp_demo group has no rights to the Metastore Catalogs (and cannot view them).

We will perform the actions needed so that the rights are sufficient to carry out the following manipulations:

  1. Creation of new schemas in the Catalog ctg_ipp
  2. Creation of tables in all the Schemas of the Catalogs ctg_ipp and ctg_ext

The actions to be done are the following :

  1. Give use rights on the Catalogs

-- Right to view the Catalog ctg_ipp
GRANT USAGE ON CATALOG ctg_ipp TO grp_demo;
-- Right to view the Catalog ctg_ext
GRANT USAGE ON CATALOG ctg_ext TO grp_demo;

  2. Give schema creation rights at the Catalog ctg_ipp level

-- Right to create new Schemas in the Catalog ctg_ipp
GRANT USE SCHEMA, CREATE SCHEMA ON CATALOG ctg_ipp TO grp_demo;

  3. Give permissions to access and create objects in the Schema ctg_ext.sch_ref, as well as permission to create External Tables based on the storage defined in the External Location object named el_demo_data_uc :

-- Rights to create objects in the Schema ctg_ext.sch_ref
GRANT USE SCHEMA, SELECT, MODIFY, CREATE TABLE ON SCHEMA ctg_ext.sch_ref TO grp_demo;
-- Right to create External Tables with the External Location
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION `el_demo_data_uc` TO grp_demo;

  4. Grant all rights on the Schema ctg_ipp.sch_bronze

-- All privileges on the Schema ctg_ipp.sch_bronze
GRANT ALL PRIVILEGES ON SCHEMA ctg_ipp.sch_bronze TO grp_demo;
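
To check the rights actually granted, you can also use the SHOW GRANTS statement (a quick sketch):

-- Display the privileges granted to the group on a Catalog and on a Schema
SHOW GRANTS `grp_demo` ON CATALOG ctg_ipp;
SHOW GRANTS `grp_demo` ON SCHEMA ctg_ipp.sch_bronze;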

If you want a user to be able to create an External Location object to set up access to a new AWS S3 resource, you must give them the following rights:

-- Right on the Metastore for the group
GRANT CREATE EXTERNAL LOCATION ON METASTORE TO grp_demo;
-- Right on the Storage Credential for the group
GRANT CREATE EXTERNAL LOCATION ON STORAGE CREDENTIAL `sc-demo-data-uc` TO grp_demo;

Example of queries to retrieve all the rights given to the grp_demo group on the objects of the Catalog ctg_ipp and ctg_ext:

-- How to get all the existing privileges for a group ?
-- Get privileges from Catalog
select 'GRANT '||replace(privilege_type,'_',' ')||' ON CATALOG '||catalog_name||' TO '||grantee||';' as grant_query
from system.INFORMATION_SCHEMA.CATALOG_PRIVILEGES
where grantee='grp_demo'
    and grantor <> 'System user'
    and catalog_name in ('ctg_ipp','ctg_ext')
    and inherited_from = 'NONE'
order by catalog_name,privilege_type
;

-- Get privileges from Schema
select 'GRANT '||replace(privilege_type,'_',' ')||' ON SCHEMA '||catalog_name||'.'||schema_name||' TO '||grantee||';' as grant_query
from system.INFORMATION_SCHEMA.SCHEMA_PRIVILEGES
where grantee='grp_demo'
  and grantor <> 'System user'
  and catalog_name in ('ctg_ipp','ctg_ext')
  and inherited_from = 'NONE'
order by catalog_name, schema_name,privilege_type;

-- Get privileges from Table
select 'GRANT '||replace(privilege_type,'_',' ')||' ON TABLE '||table_catalog||'.'||table_schema||'.'||table_name||' TO '||grantee||';' as grant_query
from system.INFORMATION_SCHEMA.TABLE_PRIVILEGES
where grantee = 'grp_demo'
  and grantor <> 'System user'
  and table_catalog in ('ctg_ipp','ctg_ext')
  and inherited_from = 'NONE'
order by table_catalog,table_schema,table_name,privilege_type;

Storage management

Synthesis

The Table object is used to define the structure and storage of data.

Tables and views must be created in a Schema.

By default, if nothing is specified when creating the Catalog and Schema objects, all the data will be managed by Unity Catalog based on the storage (AWS S3 resource) defined at the Metastore level. If the MANAGED LOCATION option is defined at the Catalog level, then all the child elements (Schemas and Tables) will use this storage instead of the storage defined at the Metastore level. If the MANAGED LOCATION option is defined at the Schema level, then all the child elements (Tables) will use this storage instead of the storage defined at the Catalog or Metastore level.

Syntax example :

-- Catalog creation
CREATE CATALOG XXX
    MANAGED LOCATION 's3://YYYYYYYYY'
    COMMENT 'ZZZZZZZZ'
;

-- Schema creation
CREATE SCHEMA XXX
    MANAGED LOCATION 's3://YYYYYYYYY'
    COMMENT 'ZZZZZZZZ'
;

Schema and Catalog are just logical containers used to organize data. No directory is created when these objects are created.

There are two types of tables:

  • Managed Table: Metadata and data are managed by Unity Catalog and the storage format used is the Delta format.
  • External Table: Only metadata is managed by Unity Catalog. The storage format can be one of the following formats “Delta, CSV, JSON, Avro, Parquet, ORC or Text”

Regarding Managed Table:

  • When a Managed Table is created, the data is created in a sub-directory of the storage defined at the Metastore level (in the case where the “MANAGED LOCATION” option is not defined at the Catalog or parent Schema level)
  • The path will be the following: <Metastore S3 Path>/<Metastore ID>/tables/<Table ID> (the actual path of a table can be retrieved as shown in the sketch after this list)
    • Example: s3://s3-dbx-metastore-uc/metastore-sandbox/13a746fa-c056-4b32-b6db-9d31c0d1eecf/tables/5c725019-fe8f-4778-a97c-c640eaf9b13e
      • Metastore S3 Path: s3://s3-dbx-metastore-uc/metastore-sandbox
      • Metastore ID: 13a746fa-c056-4b32-b6db-9d31c0d1eecf
      • Table ID: 5c725019-fe8f-4778-a97c-c640eaf9b13e
    • From the storage point of view, all the managed tables of the Metastore are stored by default in the tables/ directory, each with a unique identifier defined at creation time
  • When deleting the table, the metadata as well as the data (files) will be deleted
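
To retrieve the actual storage path of a Managed Table, the DESCRIBE DETAIL statement can be used (a quick sketch, based on the ctg_ipp.sch_bronze.fct_transactions table created later in this article):

-- The 'location' column of the result contains the <Metastore S3 Path>/<Metastore ID>/tables/<Table ID> path
DESCRIBE DETAIL ctg_ipp.sch_bronze.fct_transactions;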

Regarding the External Table :

  • When creating an External Table, you must define the full data access path (Unity Catalog manages the rights based on a Storage Credential object and an External Location object)
    • If files already exist, the definition of the table (source data format, schema, etc.) must be compatible with the existing data
    • If no file exists, the Delta format log elements will be created in the _delta_log subdirectory of the defined path and the data files will be created when adding data to the table
  • When deleting the table, only the metadata is deleted. The data (files) are not impacted by the deletion.
  • Some limitations:
    • If the table is based on a CSV file, no action other than reading will be allowed
    • If the table is based on data (files) in CSV format (or any other non-Delta format), written by a Spark process for example, it will be possible to insert data but not to update or delete it
    • If the table is based on data in Delta format, it will be possible to do exactly the same actions as with a Managed Table
    • It is not possible to define two different external tables using exactly the same data files path as external storage

Practical application

Creating an External Table and a Managed Table :

  1. Creation of an External Table named ref_products in the Schema sch_ref of the Catalog ctg_ext from a CSV file named ref_products.csv
  2. Creation of a Managed Table named fct_transactions in the Schema sch_bronze of the Catalog ctg_ipp
  3. Insertion of data into the fct_transactions table from a CSV file named fct_transactions.csv
-- 1. Creation of external table ref_products
CREATE TABLE ctg_ext.sch_ref.ref_products (
    id int
    ,lib string
    ,brand string
    ,os string
    ,last_maj timestamp
)
USING CSV
OPTIONS (path "s3://s3-demo-data-uc/demo/ref_products.csv",
        delimiter ",",
        header "true")
COMMENT 'Product referential (external)'
;


-- 2. Creation of managed table fct_transactions
CREATE TABLE ctg_ipp.sch_bronze.fct_transactions (
    id_trx string
    ,ts_trx string
    ,id_product string
    ,id_shop string
    ,id_client string
    ,quantity string
)
COMMENT 'Bronze Transactions Data'
;


-- 3. Insert data from CSV files into the fct_transactions table
COPY INTO ctg_ipp.sch_bronze.fct_transactions
  FROM 's3://s3-demo-data-uc/demo/fct_transactions.csv'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('encoding' = 'utf8'
                ,'inferSchema' = 'false'
                ,'nullValue' = 'N/A'
                ,'mergeSchema' = 'false'
                ,'delimiter' = ','
                ,'header' = 'true'
                ,'mode' = 'failfast')
;

Example of errors when updating or deleting data from an External Table based on a CSV file:

  4. Updating data in the ref_products table (fails with an error)
  5. Deleting data from the ref_products table (fails with an error)

-- 4. Error : Try to update the data in the ref_products table
UPDATE ctg_ext.sch_ref.ref_products
SET last_maj = current_timestamp()
WHERE id = 1;
-- Result : UPDATE destination only supports Delta sources


-- 5. Error : Try to delete the data from ref_products table
DELETE FROM ctg_ext.sch_ref.ref_products
WHERE id = 1;
-- Result : Could not verify permissions for DeleteFromTable

Example of data storage management for an External Table (in Delta format) when it is created and deleted:

  6. Creating an external table with the appendOnly property
  7. Checking the list of files on the AWS S3 resource
  8. Inserting data through the Unity Catalog Metastore
  9. Checking the list of files on the AWS S3 resource
  10. Reading the Delta data history
  11. Deleting data without using the Unity Catalog Metastore table (the appendOnly property blocks the execution of this action)
  12. Deleting the external table
  13. Checking the existence of the files on the AWS S3 resource

-- 6. Create an external table named "ext_tbl" in schema "ctg_ext.sch_ref" in the location "s3://s3-demo-data-uc/demo/ext_tbl" with the delta properties "appendOnly"
CREATE TABLE ctg_ext.sch_ref.ext_tbl (
    col1 int
    ,col2 timestamp
)
COMMENT 'Test External Table with TBLProperties'
LOCATION 's3://s3-demo-data-uc/demo/ext_tbl'
TBLPROPERTIES('delta.appendOnly' = 'true')
;


-- 7. Check if the file exists on AWS S3 (with AWS CLI)
-- aws s3 ls s3://s3-demo-data-uc/demo/ --recursive | grep "ext_tbl" | sed 's/[[:space:]][[:space:]]*/|/g' | cut -d "|" -f 4
/* Result : 
demo/ext_tbl/_delta_log/
demo/ext_tbl/_delta_log/.s3-optimization-0
demo/ext_tbl/_delta_log/.s3-optimization-1
demo/ext_tbl/_delta_log/.s3-optimization-2
demo/ext_tbl/_delta_log/00000000000000000000.crc
demo/ext_tbl/_delta_log/00000000000000000000.json
*/


-- 8. Insert Data into the "ext_tbl" table
INSERT INTO ctg_ext.sch_ref.ext_tbl VALUES (1,current_timestamp());

-- 9. Check if the file exists on AWS S3 (with AWS CLI)
-- aws s3 ls s3://s3-demo-data-uc/demo/ --recursive | grep "ext_tbl" | sed 's/[[:space:]][[:space:]]*/|/g' | cut -d "|" -f 4
/* Result : 
demo/ext_tbl/_delta_log/
demo/ext_tbl/_delta_log/.s3-optimization-0
demo/ext_tbl/_delta_log/.s3-optimization-1
demo/ext_tbl/_delta_log/.s3-optimization-2
demo/ext_tbl/_delta_log/00000000000000000000.crc
demo/ext_tbl/_delta_log/00000000000000000000.json
demo/ext_tbl/_delta_log/00000000000000000001.crc
demo/ext_tbl/_delta_log/00000000000000000001.json
demo/ext_tbl/part-00000-097fb1f5-1d7e-464a-a65b-1303bd06e2b0.c000.snappy.parquet
*/

-- 10. Check Delta History
DESCRIBE HISTORY ctg_ext.sch_ref.ext_tbl;
/* Result : 
Version 1 : WRITE
Version 0 : CREATE TABLE
*/

-- 11. Delete data without using the Unity Catalog Table
DELETE FROM DELTA.`s3://s3-demo-data-uc/demo/ext_tbl` WHERE col1 = 1;
/* Result : This table is configured to only allow appends.*/


-- 12. Delete the table "ext_tbl"
DROP TABLE ctg_ext.sch_ref.ext_tbl;

-- 13. Check if the file exists on AWS S3 (with AWS CLI)
-- aws s3 ls s3://s3-demo-data-uc/demo/ --recursive | grep "ext_tbl" | sed 's/[[:space:]][[:space:]]*/|/g' | cut -d "|" -f 4
/* Result : 
demo/ext_tbl/_delta_log/
demo/ext_tbl/_delta_log/.s3-optimization-0
demo/ext_tbl/_delta_log/.s3-optimization-1
demo/ext_tbl/_delta_log/.s3-optimization-2
demo/ext_tbl/_delta_log/00000000000000000000.crc
demo/ext_tbl/_delta_log/00000000000000000000.json
demo/ext_tbl/_delta_log/00000000000000000001.crc
demo/ext_tbl/_delta_log/00000000000000000001.json
demo/ext_tbl/part-00000-097fb1f5-1d7e-464a-a65b-1303bd06e2b0.c000.snappy.parquet
*/

Remarks :

  • When creating a table, only the _delta_log subdirectory is created to contain the information of version n°0
  • When defining a property with the “TBLPROPERTIES” option on the external table, it is also saved on the storage.
  • After deleting the table, the data is no longer accessible through the table but is still accessible through the AWS S3 resource path (see the sketch below)
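
For example, a minimal sketch showing that the files can still be queried directly through their path after the DROP (assuming the user has the necessary rights on the External Location):

-- Read the Delta data directly from the storage path (the ext_tbl table no longer exists in Unity Catalog)
SELECT * FROM delta.`s3://s3-demo-data-uc/demo/ext_tbl`;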

Regarding the properties that can be used with the Delta format, you can find more information in the official documentation.

Clean environment

You will find, below, all the instructions needed to clean the environment.

Drop elements from Unity Catalog using SQL statements:

-- Delete the Catalog with CASCADE option (to delete all objects)
DROP CATALOG IF EXISTS ctg_ipp CASCADE;
DROP CATALOG IF EXISTS ctg_ext CASCADE;

-- Delete the External Location
DROP EXTERNAL LOCATION IF EXISTS el_demo_data_uc force;

Note regarding the removal of the External Location object:

  • You must be the owner of the External Location
  • The External Location must not be used by any table, otherwise you must use the force option

Deleting data used on the AWS S3 resource:

# Delete all files stored in the AWS S3 resource
aws s3 rm "s3://s3-demo-data-uc/demo/" --recursive

Deleting the group from the Databricks Account (this will automatically delete the group at the Databricks Workspace level):

  1. Go to the Databricks Account page > User management > Groups
  2. Use the search bar to find the desired group
  3. Click on the button with the 3 dots associated with the group and choose the Delete option
  4. Click on Confirm Delete

Deleting the user at the Databricks Account level:

  1. Go to the Databricks Account page > User management > Users
  2. Use the search bar to find the desired user
  3. Click on the user’s name
  4. Click on the button with the 3 dots and choose the Delete user option
  5. Click on Confirm Delete

Conclusion

We were able to see a first overview of some features of the Unity Catalog solution, mainly concerning the management of rights and storage with External Table and Managed Table.

Rights management is simplified by the use of ANSI SQL syntax, the use of user groups and the centralization of access management within Unity Catalog.

Group management requires being an administrator on the Databricks Account (creation and deletion of groups, addition and removal of users in groups) and an administrator on the Databricks Workspace to add to, or remove from, the Databricks Workspace the groups existing at the Databricks Account level.

Warning : Rights and group management are not compatible between Unity Catalog and the Hive Metastore; it is recommended to migrate the elements from the Hive Metastore to Unity Catalog by recreating the necessary elements.

If you want to implement an open data lake while minimizing the use of Databricks, it is possible to use Unity Catalog with external tables in Delta format to benefit from as much functionality as possible while using other tools to access the Delta data directly (AWS EMR, AWS Glue, AWS Athena, …).

However, if you wish to use the Unity Catalog solution, it is strongly recommended to centralize all data governance in order to be able to manage all access within the Unity Catalog solution.

From the Unity Catalog solution, it is possible to access the data using Clusters, SQL Warehouses or the Delta Sharing functionality, which multiplies the possible usages of the data.

Resources

Creation of a SQL Warehouse

The Databricks CLI does not support managing SQL Warehouses at the moment, so we use the Databricks REST API instead.

Prerequisites :

  • Have the Curl tool installed

Actions:

  1. Creation of a .netrc file to manage access to the REST API:
# Create the folder where the .netrc file will be stored
mkdir ~/.databricks

# Create the .netrc file
echo "machine <url workspace databricks without https://>
login token
password <token user databricks>
" > ~/.databricks/.netrc

# Create an alias to use the tool curl with .netrc file
echo "alias dbx-api='curl --netrc-file ~/.databricks/.netrc'" >> ~/.alias
  2. Creation of a SQL Warehouse
export DBX_API_URL="<url workspace databricks>"

# Get the list of the existing SQL Warehouses (check if the SQL Warehouse exists)
dbx-api -X GET ${DBX_API_URL}/api/2.0/sql/warehouses/ | jq .

# Create the temporary config file for the SQL Warehouse
cat > tmp_databricks_create_warehouse.json <<EOF
{
  "name": "DEMO_WH_XXS",
  "cluster_size": "2X-Small",
  "min_num_clusters": 1,
  "max_num_clusters": 1,
  "auto_stop_mins": 20,
  "auto_resume": true,
  "tags": {
    "custom_tags": [
      {
        "key": "owner",
        "value": "admin@tech.demo"
      },
      {
        "key": "project",
        "value": "databricks-demo"
      }
    ]
  },
  "spot_instance_policy":"COST_OPTIMIZED",
  "enable_photon": "true",
  "enable_serverless_compute": "false",
  "warehouse_type": "CLASSIC",
  "channel": {
    "name": "CHANNEL_NAME_CURRENT"
  }
}
EOF


# Creation of the SQL Warehouse
dbx-api -X POST -H "Content-Type: application/json" -d "@tmp_databricks_create_warehouse.json" ${DBX_API_URL}/api/2.0/sql/warehouses/

# Delete the temporary config file
rm tmp_databricks_create_warehouse.json

Some actions that may be useful :

# Export the SQL Warehouse ID
export SQL_WH_ID="<SQL Warehouse ID>"

# Check the state of the SQL Warehouse
dbx-api -X GET ${DBX_API_URL}/api/2.0/sql/warehouses/${SQL_WH_ID} | jq '.state'

# Stop the SQL Warehouse (by default, it starts at creation)
dbx-api -X POST ${DBX_API_URL}/api/2.0/sql/warehouses/${SQL_WH_ID}/stop

# Start the SQL Warehouse
dbx-api -X POST ${DBX_API_URL}/api/2.0/sql/warehouses/${SQL_WH_ID}/start

# Delete the SQL Warehouse
dbx-api -X DELETE ${DBX_API_URL}/api/2.0/sql/warehouses/${SQL_WH_ID}