Fabric: Query a SQL Endpoint from a Notebook

Let’s analyse why we would want to query a SQL Endpoint. Once we understand why, we can dig into how to make a query against one.

We use notebooks to connect directly to a lakehouse. Except for the T-SQL notebook, notebooks have a default lakehouse and work directly with it from Spark. However, accessing other data objects may be more complex.
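
As a quick illustration, with a default lakehouse attached, its tables are available to Spark directly. The table name below is only illustrative, and Scala is used for consistency with the rest of the article:

// Minimal sketch: read a table from the notebook's default lakehouse. Table name is illustrative.
val sales = spark.sql("SELECT * FROM fact_sale_1y_full LIMIT 10")
display(sales)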

Why execute a Query on a SQL Endpoint from a Notebook

Needing to query a SQL Endpoint from a notebook is not very common, but there are many reasons it can happen.

  • Most data objects have a SQL Endpoint: Data Warehouse, Fabric Database, mirrored databases. If you need to access data from these objects, you will need to query the SQL Endpoint.
  • I wrote before about the delay SQL Endpoints can have in reflecting lakehouse data. You may want to query the SQL Endpoint to analyse this.
  • There are some queries and features we can use through a SQL Endpoint but can’t execute directly against a lakehouse.

These are only a few examples I identified. Let’s talk in the comments about what other scenarios may create this need.

Querying the SQL Endpoint: The method to use

A SQL Endpoint uses TDS, the same connection protocol as SQL Server and the other Microsoft SQL flavours. Most software development languages are capable of connecting to a Microsoft SQL flavour.

Notebooks use powerful languages, such as pySpark and Scala, so there are multiple ways to make the connection to a SQL Endpoint. For example, we can use ODBC, JDBC or many other connection options.
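
For comparison, this is roughly what the JDBC route could look like in Scala. It is only a minimal sketch, not the method this article focuses on: it assumes the Microsoft SQL Server JDBC driver is available in the Spark runtime, and the server name, endpoint name, access token and query below are all placeholders.

// Hypothetical JDBC connection to a SQL Endpoint; server, database, token and query are placeholders.
val jdbcUrl = "jdbc:sqlserver://<endpoint-server>.datawarehouse.fabric.microsoft.com:1433;database=<endpoint-name>;encrypt=true"
val accessToken = "<Microsoft Entra access token for the database resource>"

val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("accessToken", accessToken)                      // token authentication supported by the mssql JDBC driver
  .option("query", "select top 10 * from dbo.some_table")  // illustrative query
  .load()

df.show()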

However, I was looking for a more “Fabric native” method. There is one. It’s a bit strange, but at the moment, it is the most “Fabric native” method available.

This method requires libraries that are only available in the Scala language, so the notebook will need to be in Scala.

Querying the SQL Endpoint in Scala

This is the basic code to query a SQL Endpoint using Scala:

import com.microsoft.spark.fabric.tds.implicits.read.FabricSparkTDSImplicits._
import com.microsoft.spark.fabric.Constants

val df = spark.read
  .option(Constants.WorkspaceId, datausageWorkspaceId)   // workspace Id (treated as the server)
  .option(Constants.DatabaseName, SQLEndpoint_name)      // SQL Endpoint name (treated as the database)
  .synapsesql(qry.query)                                 // the T-SQL query text

These are the details about this code:

  • The two imports only work in Scala.
  • The two constants contain the option names we need to use for the read operation.
  • We need the Workspace Id and the SQL Endpoint name.
  • The workspace is treated as the server and the SQL Endpoint as the database.
  • At the end, synapsesql executes the query.
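
As a hedged example of what to do with the returned DataFrame, and going back to the sync delay scenario mentioned earlier, we could compare the count returned by the SQL Endpoint with the count read directly from the lakehouse. The table name below is illustrative, and it assumes the query selected from a single table:

// Compare what the SQL Endpoint returns with what the lakehouse holds right now.
// df is the DataFrame returned by synapsesql above; the table name is illustrative.
val endpointCount = df.count()
val lakehouseCount = spark.read.table("fact_sale_1y_full").count()
println(s"SQL Endpoint rows: $endpointCount, Lakehouse rows: $lakehouseCount")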

Interaction between pySpark and Scala

My entire environment uses pySpark, while this solution requires Scala.

In some situations, we may need to create an interaction between the pySpark and Scala environments.

What we can do:

  • The parameters cell can be in Scala and the notebook can still be called by pySpark notebooks. No problem with this (a minimal sketch follows these lists).
  • The Scala cells can call pySpark notebooks.
  • By “call” I mean either ‘%Run’ or ‘notebook.run’.

What we can’t do:

  • Scala cells can’t access pySpark variables, and vice versa.
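
As an illustration of the first point above, this is what a Scala parameters cell might look like. The parameter names and default values are hypothetical; a pySpark notebook calling this one can override them by name:

// Parameters cell (toggled as a parameter cell in the cell options). Values are defaults;
// a calling pySpark notebook can override them by name. Names are hypothetical.
val workspaceId: String = ""
val sqlEndpointName: String = ""
val customerKey: Int = 0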

Exchanging values between languages

You may need to exchange values between the two languages from time to time. It depends on each scenario.

The solution for this is to use Spark configurations. In one language you set a configuration with the variable value, and in the other language you read the configuration to retrieve the value.

Example – Set the configuration in pySpark:

spark.conf.set("lakemirrorSQLEndpointId", lakemirrorSQLEndpointId)

Retrieve the configuration in Scala:

val lakemirrorSQLEndpointId = spark.conf.get("lakemirrorSQLEndpointId")

Exceptions to the basic syntax

The “synapsesql” method doesn’t accept every syntax supported by a SQL Endpoint.

A query may be perfectly fine and acceptable to the SQL Endpoint, but “synapsesql” will reject it. It doesn’t help that the error message is completely meaningless, sending us in the wrong direction.
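
As a hypothetical example, the query below is perfectly valid T-SQL for the SQL Endpoint, but sending the CTE straight into “synapsesql” is exactly the kind of pattern that gets rejected. It assumes the imports shown earlier are in place and that wksId and EndPointName hold the workspace Id and endpoint name; the table and column names are illustrative:

// Valid T-SQL, but passing the CTE directly to synapsesql may fail with an unhelpful error.
val rejectedQuery =
  """with filtered as (select * from fact_sale_1y_full where CustomerKey = 204)
    |select count(*) as total from filtered""".stripMargin

val df = spark.read
  .option(Constants.WorkspaceId, wksId)
  .option(Constants.DatabaseName, EndPointName)
  .synapsesql(rejectedQuery)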

Another syntax option: Breaking the query in two

The query can be broken down into two parts. This is only one of the ways to solve the problem. Let’s analyse how this syntax works:

val df = spark.read.option(Constants.WorkspaceId, wksId).option(Constants.DatabaseName, EndPointName).option("prepareQuery", firstpart).synapsesql(secondpart)

This option breaks the query into two parts. The first part is sent through the “prepareQuery” option, and the second part is sent using the “synapsesql” method.

The relation between one part and the other is made using CTEs. Other options, such as temporary views, would be way more expensive. The solution using CTEs looks like the code below:

%%spark

import com.microsoft.spark.fabric.tds.implicits.read.FabricSparkTDSImplicits._
import com.microsoft.spark.fabric.Constants

val df = spark.read.option(Constants.WorkspaceId, "bb457c73-bf10-4f53-8933-8c409192747a")
.option(Constants.DatabaseName, "interlake")
.option("prepareQuery", "with qry as (select * from fact_sale_1y_full where CustomerKey=204)")
.synapsesql("select count(*) as total from qry")

display(df)

PrepareQuery and SynapseSQL difference

The query sent to “synapsesql” is processed by Spark. Spark tries to optimize it and makes additional checks, so some native T-SQL syntax is not supported by “synapsesql”.

The “prepareQuery” option, on the other hand, sends the query “as is” to the target, so that part of the query is not affected by any Spark processing.

There are many constructs which can cause this problem and require the query to be broken down with CTEs and sent in two pieces (a sketch follows the list):

  • Common Table Expressions (CTEs)
  • Deeply nested queries
  • Window functions
  • Custom SQL Functions
  • Dynamic SQL
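
To make the list concrete, here is a hedged sketch of the same two-part technique applied to a window function, reusing the workspace and endpoint from the earlier example. The window function lives inside the CTE sent through “prepareQuery”, and “synapsesql” receives only a simple outer select; the SaleKey column is illustrative:

%%spark

import com.microsoft.spark.fabric.tds.implicits.read.FabricSparkTDSImplicits._
import com.microsoft.spark.fabric.Constants

// The window function is hidden inside the prepareQuery part; synapsesql only sees a plain select.
val df = spark.read.option(Constants.WorkspaceId, "bb457c73-bf10-4f53-8933-8c409192747a")
.option(Constants.DatabaseName, "interlake")
.option("prepareQuery", "with ranked as (select CustomerKey, SaleKey, row_number() over (partition by CustomerKey order by SaleKey desc) as rn from fact_sale_1y_full)")
.synapsesql("select CustomerKey, SaleKey from ranked where rn = 1")

display(df)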

Summary

After reviewing some reasons that would require executing a query on a SQL Endpoint from a notebook, we analysed an interesting method to do this.

This method still has limitations in relation to the language, but it’s interesting to notice how it uses Fabric libraries directly.

This is far from being the only method to execute these queries; the alternatives depend on the driver used to connect to the SQL Endpoint.

About the author

Dennes Torres

Dennes Torres is a Data Platform MVP and Software Architect living in Malta who loves SQL Server and software development and has more than 20 years of experience. Dennes can improve Data Platform Architectures and transform data in knowledge. He moved to Malta after more than 10 years leading devSQL PASS Chapter in Rio de Janeiro and now is a member of the leadership team of MMDPUG PASS Chapter in Malta organizing meetings, events, and webcasts about SQL Server. He is an MCT, MCSE in Data Platforms and BI, with more titles in software development. You can get in touch on his blog https://dennestorres.com or at his work https://dtowersoftware.com