Azure Synapse SQL endpoint is a serverless query engine that enables you to read the files on Azure storage. You can leverage Synapse SQL in Azure SQL Managed Instance with linked services to implement Polybase-like scenarios.

Polybase technology enables SQL Server and SQL DW to directly query a large amount of big data stored on the external storage. You can run standard SQL queries over different files (such as CSV or parquet) that are placed in Azure Storage or Hadoop without need to load the data in SQL database tables. Polybase is currently not available in Azure SQL (database or managed instance). You can vote for this feature request on Azure feedback site. If you need to load data from Azure storage you need to use OPENROWSET(BULK) over Azure storage that works only with CSV and can read a single file. As an alternative, you can load external data into a table using some external process such as Azure Data Factory.

Azure Synapse Analytics is a big data analytics service that enables you to query and analyze data on Azure storage. With Synapse Analytics workspace, you are getting one SQL serverless query endpoint that enables you analyze your data using T-SQL language. There is no upfront cost for this default serverless endpoint. You are charged per each query depending on the amount of data that the query process (few dollars per TB processed). This looks like a perfect match for Polybase-like scenarios because it can queries the files on Azure Data Lake. However, this functionality is not available in Azure SQL.

Although you cannot use Polybase on Azure SQL, you have a workaround if you are using Azure SQL Managed Instance and you need to implement Polybase-like scenario on Azure storage. Managed Instance enables you to run SQL queries on Synapse SQL query endpoint using Linked servers.

Azure Managed Instance can access Azure Storage using Synapse

Since Synapse SQL query endpoint is T-SQL compliant SQL endpoint, you can create Linked service to Synapse SQL and run remote queries from Managed Instance. In this article you will see how to implement that kind of solution.

Azure Synapse Analytics – SQL endpoint

Synapse SQL endpoint in Synapse is serverless and stateless SQL query engine that allows you to query your data in query-as-a-service model. You just need to send a T-SQL query and get the results.

If you don’t have Synapse Analytics workspace, you can easily deploy it using Azure portal or this Deployment template. Synapse workspace automatically deploys one serverless Synapse SQL endpoint that is everything we need for this kind of analysis. With SQL endpoint you are getting the functionally that looks like standalone Polybase service, but it is much more.

You are not reserving resources or paying for default Synapse Workspace if you are not using it. Synapse SQL endpoint in the workspace is charged per query. If you are not executing queries on default endpoint you are not paying anything. There is no additional fixed cost of default Synapse Workspace.

In Synapse SQL we can query the files on Azure Data Lake storage using OPENROWSET function:

select *
from openrowset(bulk 'https://****.blob.core.windows.net/year=*/*/*.parquet',
                format='parquet') as cases

OPENROWSET function very similar to the OPENROWSET function that you use in SQL Server and Azure SQL with some enhancements:

  1. It can read both csv and parquet files on Azure storage
  2. It can reference set of files using wildcards (.parquet or /year=/month=/.parquet)
  3. It can automatically get the schema from underlying parquet files without need to specify format file

In addition to OPENROWSET, you can create external table on top of set of files on azure storage and query them via standard external table interface. The following script creates external table on a set of CSV files placed on the paths that match the pattern csv/population/year=/month= on Azure storage referenced via DATA_SOURCE:

CREATE EXTERNAL TABLE csv.population
(
    country_code VARCHAR (5),
    country_name VARCHAR (100),
    year smallint,
    population bigint
)
WITH (
    LOCATION = 'csv/population/year=*/month=*/*.csv',
    DATA_SOURCE = AzureDataSource,
    FILE_FORMAT = QuotedCsvWithHeader
);

External tables in Synapse SQL are very similar to Polybase external tables. You can see how to create objects in SynapseSQL database in this setup script.

This is very useful set of features and in the next few sections we will see how to use them in Azure SQL Managed Instance.

Linked servers in Managed Instance

Linked servers in Managed Instance enable you to link remote SQL Server or Azure SQL endpoint and send remote queries or join remote tables with the local ones. Since default SQL Serverless query endpoint in Azure Synapse Analytics is T-SQL compatible query engine, you can create linked server pointing to Synapse SQL in the workspace created.

EXEC master.dbo.sp_addlinkedserver
                        @server = N'SynapseSQL',@srvproduct=N'', @provider=N'SQLNCLI11',
                        @datasrc=N'mysynapseworkspace-ondemand.sql.azuresynapse.net', 
                        @catalog=N'master';

EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'SynapseSQL', @useself=N'False',
                       @locallogin=NULL,
                       @rmtuser=N'<your user name here>',@rmtpassword='<your password>'
GO

EXEC master.dbo.sp_serveroption @server=N'SynapseSQL', 
                       @optname=N'remote proc transaction promotion', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'SynapseSQL', 
                       @optname=N'rpc', @optvalue=N'true'
GO
EXEC master.dbo.sp_serveroption @server=N'SynapseSQL',
                       @optname=N'rpc out', @optvalue=N'true'
GO

You can also use UI in SSMS to create this linked server and just these parameters. Once you setup your linked server to Synapse SQL default query endpoint, you can start reading external data.

Executing queries on Azure storage data

Managed Instance has EXEC function that enables you to execute T-SQL query on remote linked server. We can use this function to send a query that will be executed on SynapseSQL endpoint and return results of the scan. This following query returns results of the query that is reading the file content from Azure Data Lake storage and return the result set to client via Managed Instance endpoint:

SET QUOTED_IDENTIFIER OFF
EXEC("
select *
from openrowset(bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',
                format='parquet') as cases
") AT SynapseSQL

NOTE: Make sure that you have configured RPC options in the Linked server configuration using sp_serveroption procedure in the setup script above, or you will get the messages like: Server ‘SynapseSQL’ is not configured for RPC. You need to set ‘rpc’ and ‘rpc out’ options to value true.

You can create complex ad-hoc reports and fetch the data via SynapseSQL to Managed Instance as shown in the following example:

SET QUOTED_IDENTIFIER OFF

EXEC("
use SampleDB;
SELECT
nyc.filename() AS [filename],COUNT_BIG(*) AS [rows]
FROM
OPENROWSET(
    BULK 'parquet/taxi/year=2017/month=9/*.parquet',
    DATA_SOURCE = 'SqlOnDemandDemo',
    FORMAT='PARQUET'
) nyc
GROUP BY nyc.filename();
") AT SynapseSQL;

This query will read a set of parquet files and return the results to the client. Computation will be done on remote SynapseSQL endpoint without affecting the workload on your Managed Instance.

#azure sql #linked server #managed instance #polybase #synapse

Azure SQL Managed Instance can query Azure Storage using Synapse
23.60 GEEK