Query Data Lake with SQL Database in Azure Synapse – Part 2

Here, I presented how to query a Data Lake using Azure Synapse Serverless Pool and the Lake database. Azure Synapse workspace enables us to create a SQL Database on top of a Data Lake, but this isn’t managed by Spark. In this type of database, we can only use a serverless SQL pool.

In this follow-up to the previous article, I’ll show you how to query data in a Synapse serverless SQL pool using SQL database.

Synapse SQL Database

What is possible with SQL Database?

  • Create external tables
  • Create views
  • Create stored procedures
  • Use OPENROWSET command to query files

In a nutshell, we can create a logical data warehouse with a serverless pool on top of a Data Lake. This allows any tool to work with files in a Data Lake like with SQL tables. Once we create our data model, we can use a SQL client, Power BI, SQL Management Studio, Python, etc. to query files in the Data Lake.

How to query data in Synapse SQL pool

The simplest way to query data in a Data Lake in Azure is to use the OPENROWSET command. As it’s presented below you can query parquet files from the files kept in the Azure Data Lake container.

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://xxx.dfs.core.windows.net/silver/movies/netflix/',
        FORMAT = 'PARQUET'
    ) AS [result]

Using Synapse workspace GUI you can navigate to the specific location and select a file that you want to display.

Synapse Workspace GUI

However, this method requires your knowledge of the files that you keep in the Data Lake. To avoid navigating through directories every time you want to query data, you can create a query-based view or tables.

CREATE VIEW vnetflix
AS
SELECT
     *
FROM
    OPENROWSET(
        BULK 'https://xxx.dfs.core.windows.net/silver/movies/netflix/netflix_titles.parquet',
        FORMAT = 'PARQUET'
    ) AS [result]

After executing the query you should see a new view in the Synapse workspace in the SQL database tab. As you can see, columns are the varchar data type. It’s not perfectly accurate, but we can query data using the view name. We can set appropriate data types using the WITH command.

WITH (
        vendorID varchar(4),
        tpepPickupDateTime datetime2,
        passengerCount int
    ) AS result;
SQL Database

Another option in SQL database is to create an external table where we can define source files’ location and data types. To perform mathematical operations and improve query performance, data types are important.

For instance, according to Microsoft’s recommendation integer-base data types perform better in SORT, JOINs, and GROUP BY. To create an external table you need to set up a data source and a format file like in the code below.

CREATE EXTERNAL DATA SOURCE silver WITH (
    LOCATION = 'https://xxx.dfs.core.windows.net/silver'

);

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH ( 
        FORMAT_TYPE = PARQUET
     );



CREATE EXTERNAL TABLE netflix (show_id  VARCHAR(256),
type  VARCHAR(256),
title  VARCHAR(256),
director  VARCHAR(256),
cast  VARCHAR(256),
country  VARCHAR(256),
date_added  DATE,
release_year  BIGINT,
rating  DECIMAL(10,2),
duration  VARCHAR(256),
listed_in  VARCHAR(256),
description  VARCHAR(256))
    WITH (
        LOCATION = 'movies/netflix',
        DATA_SOURCE = silver,
        FILE_FORMAT = ParquetFormat
    );

Created tables can be seen in Synapse workspace, as shown in the screen below.

SQL database Workspace

As presented, Synapse allows us to create a logical data model based on files stored in the Data Lake. We can use tables to import data into Power BI, ad-hoc analysis, or python scripts.

When you need low latency, such as Power BI direct queries, you should avoid using Synapse Serverless.

File formats

Synapse serverless supports popular data formats like Parquet, Delta Lake, and CSV. Essentially, the Delta format stores data as parquet, but has additional features that provide transaction logs and more flexibility when it comes to editing the content. This includes updating, deleting, and merging. It’s more effective to use Paquet compared to CSV files because Parquet has columnar compression, we can query only a subset of columns, and keeps metadata information. 

To see the main differences between Parquet and CSV I’ll execute two queries.

CSV file
SELECT
    count(*)
FROM
   csv_netflix

Total size of data scanned is 4 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.

SELECT
    count(*)
FROM
   parquet_netflix

Total size of data scanned is 1 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.

You can see that the SQL engine needs to scan the entire CSV file and only metadata information from the parquet file. This behavior is significant from a cost perspective, we pay for processed data in Synapse Serverless.

SQL Stored procedures

The last issue is how to automate table creation and schema inference. For a table with a few columns, it’s not a big deal, but for a large number of tables, it will take a lot of time. To avoid manual work I created a python script that reads metadata from parquet files and creates table definitions based on it.

def read_metdata(file, abfs):
    print(f"-----------------{file}")
    pfile = pq.read_table(file, filesystem=abfs)
    columns = []
    for x in pfile.schema:
        columns.append(f"{x.name}  {convert_to_sql(x.type)}")
    tab_name = file[:file.find(".")].split("/")[-2]
    location = "/".join(file.split("/")[1:-1])
    sql = ""
    sql = f"CREATE EXTERNAL TABLE {tab_name} ("
    sql += ",\n".join(columns)
    sql += f""")
    WITH (
        LOCATION = '{location}',
        DATA_SOURCE = silver,
        FILE_FORMAT = ParquetFormat
    );"""
    return sql

The code traverses folders in Data Lake, reads metadata and names parent folders that act as table names. You can find all the code in my git repository.

https://github.com/MariuszKu/azure_codes

Summary

Using SQL databases in Synapse Serverless pools is a good option if you aren’t planning to implement Spark pools. You can load your files to your Data Lake via Azure Data Factory, Azure Synapse integration pipelines, third-party ETL tools, or python scripts and then create a metadata layer on the top of your Data Lake and use SQL-endpoint to query your data.