Let’s analyse why we would like to query a SQL Endpoint. Once we understand why, we can dig into how to run a query against one.
We use notebooks to connect directly to a lakehouse. Except for the T-SQL notebook, notebooks have a default lakehouse and work with it directly from Spark. However, accessing other data objects may be more complex.
Why execute a Query on a SQL Endpoint from a Notebook
Querying a SQL Endpoint from a notebook is not an everyday need, but there are several scenarios where it becomes necessary.
- Most data objects have a SQL Endpoint: Data Warehouse, Fabric Database, mirrored database. If you need to access data from these objects, you will need to query the SQL Endpoint.
- I wrote before about the delay SQL Endpoints can have in reflecting lakehouse data. You may want to query the SQL Endpoint to analyse this.
- There are some queries and features we can use through a SQL Endpoint but can’t execute directly against a lakehouse.
These are only a few examples I identified. Let’s talk in the comments about what other scenarios may create this need.
Querying the SQL Endpoint: The method to use
A SQL Endpoint uses TDS, the same connection protocol as SQL Server and other Microsoft SQL flavours. Most software development languages are capable of making a connection to a Microsoft SQL flavour.
The notebooks use powerful languages, such as pySpark and Scala, so there are multiple methods to make the connection to a SQL Endpoint. For example, we can use ODBC, JDBC or many other connection options.
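For illustration, here is a minimal sketch of one of these generic routes, using Spark's JDBC reader from Scala. The server name, database name and authentication details are placeholder assumptions you would replace with your own values, and the exact connection settings depend on your environment.

// Minimal sketch: reading from a SQL Endpoint with Spark's generic JDBC reader
// The server, database and access token below are hypothetical placeholders
val sqlEndpointServer = "<your-sql-endpoint>.datawarehouse.fabric.microsoft.com" // assumption
val databaseName = "interlake" // assumption

val jdbcUrl = s"jdbc:sqlserver://$sqlEndpointServer:1433;database=$databaseName;encrypt=true"

val df = spark.read
    .format("jdbc")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("url", jdbcUrl)
    .option("query", "SELECT TOP 10 * FROM fact_sale_1y_full")
    .option("accessToken", "<Entra access token>") // assumption: token-based authentication
    .load()

display(df)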
However, I was looking for a more “Fabric native” method. There is one. It’s a bit strange, but at the moment, it is the most “Fabric native” method available.
This method requires libraries that are only available in Scala, so the notebook needs to be written in Scala.
Querying the SQL Endpoint in Scala
This is the basic code to query a SQL Endpoint using Scala:
import com.microsoft.spark.fabric.tds.implicits.read.FabricSparkTDSImplicits._
import com.microsoft.spark.fabric.Constants

val df = spark.read
    .option(Constants.WorkspaceId, datausageWorkspaceId)
    .option(Constants.DatabaseName, SQLEndpoint_name)
    .synapsesql(qry.query)
These are the details about this code:
- The two imports only work in Scala
- The two constants contain the strings with the option names we need to use for the read operation
- We need to have the Workspace Id and SQL Endpoint name
- The Workspace is considered the server and the SQL Endpoint is considered the database
- At the end, we execute the query with the “synapsesql” method
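Putting it all together, here is a minimal, self-contained sketch of the same pattern. The workspace ID, SQL Endpoint name and query text are hypothetical placeholders you would replace with your own values.

import com.microsoft.spark.fabric.tds.implicits.read.FabricSparkTDSImplicits._
import com.microsoft.spark.fabric.Constants

// Hypothetical values: replace with your own workspace ID and SQL Endpoint name
val wksId = "00000000-0000-0000-0000-000000000000"
val endpointName = "interlake"
val query = "SELECT TOP 10 * FROM fact_sale_1y_full"

val df = spark.read
    .option(Constants.WorkspaceId, wksId)         // the workspace acts as the server
    .option(Constants.DatabaseName, endpointName) // the SQL Endpoint acts as the database
    .synapsesql(query)

display(df)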
Interaction between pySpark and Scala
My entire environment uses pySpark, while this solution requires Scala.
In some situations, we may need to create an interaction between the pySpark environment and Scala environment.
What we can do:
- The parameters cell can be in Scala and the notebook can be called by pySpark notebooks. No problem with this.
- The Scala cells can call pySpark notebooks.
- By “call” I mean either the “%run” magic or “notebook.run”
What we can’t do:
- Scala cells can’t access pySpark variables, nor the other way around
Exchanging values between languages
You may need to exchange values between the two languages from time to time, depending on the scenario.
The solution for this is to use Spark configurations. In one language, you set a configuration with the variable value; in the other language, you read the configuration to retrieve the value.
Example – Set the configuration in pySpark:
spark.conf.set("lakemirrorSQLEndpointId",lakemirrorSQLEndpointId)
Retrieve the configuration in Scala:
val lakemirrorSQLEndpointId=spark.conf.get("lakemirrorSQLEndpointId")
Exceptions to the basic syntax
The “synapsesql” method doesn’t accept all the syntax supported by a SQL Endpoint.
A query may be perfectly valid and acceptable to the SQL Endpoint, but “synapsesql” will reject it. It doesn’t help that the error message is completely meaningless, taking us in the wrong direction.
Another syntax option: Breaking the query in two
The query can be broken down into two parts. This is only one of the ways to solve the problem. Let’s analyse how this syntax works.
val df = spark.read
    .option(Constants.WorkspaceId, wksId)
    .option(Constants.DatabaseName, EndPointName)
    .option("prepareQuery", firstpart)
    .synapsesql(secondpart)
This option breaks the query into two parts: the first part is sent in the “prepareQuery” option, and the second part is sent using the “synapsesql” method.
The relation between the two parts is made using CTEs. Other options, such as temporary views, would be way more expensive. The solution using CTEs looks like the code below:
%%spark
import com.microsoft.spark.fabric.tds.implicits.read.FabricSparkTDSImplicits._
import com.microsoft.spark.fabric.Constants

val df = spark.read
    .option(Constants.WorkspaceId, "bb457c73-bf10-4f53-8933-8c409192747a")
    .option(Constants.DatabaseName, "interlake")
    .option("prepareQuery", "with qry as (select * from fact_sale_1y_full where CustomerKey=204)")
    .synapsesql("select count(*) as total from qry")

display(df)
The difference between prepareQuery and synapsesql
The “synapsesql” method is processed by Spark. Spark tries to optimize the query and makes additional checks. As a result, some native syntax is not supported in “synapsesql”.
The “prepareQuery” option, on the other hand, sends the query “as is” to the target, so the query is not affected by any Spark processing.
There are many syntaxes which can cause this problem and need to be wrapped in a CTE and sent in two pieces (a sketch follows the list below):
- Common Table Expressions (CTEs)
- Deeply nested queries
- Window functions
- Custom SQL Functions
- Dynamic SQL
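As an example, here is a sketch of a window function (one of the cases above) wrapped in a CTE and sent through “prepareQuery”, following the same pattern as the earlier example. The workspace ID, SQL Endpoint name and the InvoiceDateKey column are hypothetical placeholders.

import com.microsoft.spark.fabric.tds.implicits.read.FabricSparkTDSImplicits._
import com.microsoft.spark.fabric.Constants

// Hypothetical workspace ID and SQL Endpoint name
val wksId = "00000000-0000-0000-0000-000000000000"
val endpointName = "interlake"

// The window function lives inside the CTE sent through "prepareQuery";
// "synapsesql" receives only a simple SELECT over the CTE
val df = spark.read
    .option(Constants.WorkspaceId, wksId)
    .option(Constants.DatabaseName, endpointName)
    .option("prepareQuery",
        """with ranked as (
          |    select CustomerKey,
          |           InvoiceDateKey,
          |           row_number() over (partition by CustomerKey order by InvoiceDateKey desc) as rn
          |    from fact_sale_1y_full
          |)""".stripMargin)
    .synapsesql("select * from ranked where rn = 1")

display(df)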
Summary
After analysing some reasons that would require executing a query on a SQL Endpoint from a notebook, we analysed an interesting method to do this.
This method still has limitations regarding the language, but it’s interesting to notice how it uses Fabric libraries directly.
This is far from being the only method to execute these queries; it all depends on the driver used to connect to the SQL Endpoint.