The series so far:
- SQL Server R Services: The Basics
- SQL Server R Services: Digging into the R Language
- SQL Server R Services: Working with ggplot2 Statistical Graphics
- SQL Server R Services: Working with Data Frames
- SQL Server R Services: Generating Sparklines and Other Types of Spark Graphs
- SQL Server R Services: Working with Multiple Data Sets
With the release of SQL Server 2016, you can now run R scripts within the SQL Server environment, using data that you retrieve directly from a SQL Server database. The ability to run R scripts comes through a new feature: SQL Server R Services. The key to working with R Services is the sys.sp_execute_external_script system stored procedure, which Microsoft also introduced in SQL Server 2016. Unfortunately, Microsoft documentation is a bit unclear on some of the finer points of using the procedure, and it’s up to you to read between the lines, while trying to make sense of the bigger picture.
To help you better understand the sp_execute_external_script stored procedure, this article walks you through the process of running R scripts based on SQL Server data. This article is the first in a series about using SQL Server R Services to write R scripts. Although most of the focus of this article is on the stored procedure itself, in subsequent articles, we’ll dig deeper into the R language, after you have a foundation in how to run R scripts that incorporate SQL Server data.
To help you get started, this article walks you through a series of examples that demonstrate the sp_execute_external_script stored procedure. I created the examples on a local instance of SQL Server 2016, using data from the AdventureWorks2014 database. The R scripts in the examples are very simple and are meant only to introduce you to basic concepts. As we progress through the series, we’ll go into greater depth into R scripting in order to demonstrate more advanced analytical capabilities.
Getting started with R scripting
The sp_execute_external_script stored procedure provides the structure necessary to run R scripts against SQL Server data, directly within the SQL Server environment. To run an R script, you must call the system stored procedure and provide the necessary parameters, adhering to the procedure’s requirements when assigning values. One of those values is the R script itself.
Currently, the stored procedure supports only the R language, but Microsoft plans to add support for Python in SQL Server 2017. Because of this expansion, Microsoft will be changing the name of SQL Server R Services to SQL Server Machine Learning Services and Microsoft R Server to Microsoft Machine Learning Server. For now, we’ll stick with the original names.
Before you can use the sp_execute_external_script stored procedure, you must enable your SQL Server instance to support external scripts, but before you can do that, you must ensure that R Services is installed correctly and up and running. Unfortunately, this in itself can be a tricky and frustrating process, especially if service packs or cumulative updates are involved. If you have not set up R Services or enabled external scripting and are not sure how to go about this, a good place to start is with the MSDN article Set up SQL Server R Services (In-Database).
Once you have R Services ready to go, you should be able to try out the sp_execute_external_script stored procedure. The following syntax show the procedure’s basic elements:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
sp_execute_external_script @language = N'language', @script = N'script', [ @input_data_1 = 'SelectStatement' [,] ] [ @input_data_1_name = N'InputDataSet' [,] ] [ @output_data_1_name = N'OutputDataSet' [,] ] [ @params = N'@ParameterName DataType [ OUT | OUTPUT ] [ ,...n ]' [,] ] [ @ParameterName = 'value' [ OUT | OUTPUT ] [,] [ ,...n ] ] [ WITH <execute_option> ] [;] <execute_option>::= { { RESULT SETS UNDEFINED } | { RESULT SETS NONE } | { RESULT SETS ( <result_sets_definition> ) } } |
I’ve simplified the syntax a bit for this article, but it still contains most of the basic components. One thing that’s missing is the breakdown for the <result_sets_definition> element, which is used to provide column names for the data outputted by the R script, as well as to take other steps. We’ll get into the specifics of all this later in the article. In the meantime, if you want to view a more complete version of the syntax, see the Microsoft document sp_execute_external_script (Transact-SQL). Note, however, that the last time I looked at this information, the syntax appeared incorrect, in terms of how some of the optional elements have been bracketed, so proceed with caution.
With that in mind, let’s start with a very basic example of how to use the sp_execute_external_script stored procedure so you can start to get a feel for the way it all works. At a minimum, when calling the procedure, you must provide values for the @language and @script parameters, as shown in the following example:
1 2 3 4 |
EXEC sp_execute_external_script @language = N'R', @script = N'OutputDataSet <- data.frame(seq(1,4,0.5));'; GO |
For the @language parameter, we must specify the value R (for the R language), which is currently the only value accepted for this parameter. Once support for Python has been added, we’ll be able to specify that language instead.
Next comes the value for the @script parameter, which defines the R script that we will ultimately run. In this case, the script is broken into two elements, separated by the back arrow (<-) assignment operator. Essentially, we’re assigning the value on the right side of the operator to the variable on the left side of the operator, similarly to how we would assign a value to a T-SQL variable.
In this case, we’re assigning the value to the OutputDataSet variable on the left side of the operator. This is the default name used to identify the output data set that is generated when we run the R script. (As you’ll see later in the article, we can provide a different name for the output data set.)
On the right side of the assignment operator is an expression that defines the data we want to return. The seq function generates a sequence of numbers starting with 1, ending with 4, and incremented by 0.5. The results of that function are converted to a data frame (data.frame), which provides a table-like structure for handling the data.
When we run the stored procedure, the data frame containing the numeric sequence is assigned to the OutputDataSet variable, which is returned by the stored procedure when we run it, giving us the results shown in the following table:
(No column name) |
1 |
1.5 |
2 |
2.5 |
3 |
3.5 |
4 |
Of course, you’ll want to do far more with your R scripts than what we’ve done here, but this should give you a general idea of how you can use the sp_execute_external_script stored procedure to run R scripts. No doubt most R scripts you work with will be much more involved. In fact, because of the potential complexities, I find that it is often easier to put the script in a T-SQL variable and then call that variable from within the stored procedure, as shown in the following example:
1 2 3 4 5 6 |
DECLARE @rscript NVARCHAR(MAX); SET @rscript = N'OutputDataSet <- data.frame(seq(1,4,0.5));'; EXEC sp_execute_external_script @language = N'R', @script = @rscript; GO |
The stored procedure will return the same results as the preceding example, only now it’s easier to read and update the R script, or at least it will be as they become more complex.
Using SQL Server data in R scripts
As fun as it might be to run R scripts alone in SQL Server, the real power that comes with R Services is the ability to incorporate SQL Server data into those scripts. For that, we need to include the procedure’s @input_data_1 parameter, which specifies the SELECT statement to use to retrieve data from our database, as shown in the following example:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
DECLARE @rscript NVARCHAR(MAX); SET @rscript = N'OutputDataSet <- InputDataSet;'; DECLARE @sqlscript NVARCHAR(MAX); SET @sqlscript = N' SELECT FirstName, LastName, SalesYTD FROM Sales.vSalesPerson WHERE SalesYTD > 2000000 ORDER BY SalesYTD DESC;'; EXEC sp_execute_external_script @language = N'R', @script = @rscript, @input_data_1 = @sqlscript; GO |
As with the R script, I’m using a T-SQL variable (@sqlscript) to store the SELECT statement. I then assign that variable to the @input_data_1 parameter.
I’ve also updated the R script itself to assign the InputDataSet value to the OutputDataSet variable. InputDataSet is the default name of the variable used to represent the query specified in the @input_data_1 parameter. In other words, I’m assigning the data returned by the SELECT statement to the OutputDataSet variable, giving us the results shown in the following table:
(No column name) | (No column name) | (No column name) |
Linda | Mitchell | 4251368.5497 |
Jae | Pak | 4116871.2277 |
Michael | Blythe | 3763178.1787 |
Jillian | Carson | 3189418.3662 |
Ranjit | Varkey Chudukatil | 3121616.3202 |
José | Saraiva | 2604540.7172 |
Shu | Ito | 2458535.6169 |
Tsvi | Reiter | 2315185.611 |
All we’ve done here is to pass our T-SQL query results to the R script, which returns those same results as that of the SELECT statement. Of course, this is something we could have done without using an R script; however, by being able to pass our query results to the R script, we can then use the analytical power built into R to do something with that data, which is the whole point of SQL Server R Services and the sp_execute_external_script stored procedure.
For example, the following R script divides the SalesYTD totals by 7 and then rounds that figure to two decimal points:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
DECLARE @rscript NVARCHAR(MAX); SET @rscript = N' OutputDataSet <- InputDataSet; OutputDataSet[,3] <- round(InputDataSet$SalesYTD / 7, 2);'; DECLARE @sqlscript NVARCHAR(MAX); SET @sqlscript = N' SELECT FirstName, LastName, SalesYTD FROM Sales.vSalesPerson WHERE SalesYTD > 2000000 ORDER BY SalesYTD DESC;'; EXEC sp_execute_external_script @language = N'R', @script = @rscript, @input_data_1 = @sqlscript; GO |
The idea here is it find the average monthly sales based on the assumption that seven months have passed so far this year and then round that figure. To do so, I’ve added a second line to the R script. To the left of the assignment operator, I’ve specified OutputDataSet[,3] to indicate that we want to update the third column within data frame.
To the right of the assignment operator, we reference the InputDataSet variable again, but this time add a dollar sign ($) and the name of the source column (SalesYTD). We then divide this amount by 7 and use the round function to round the amount to two decimal points, giving us the results shown in the following table:
(No column name) | (No column name) | (No column name) |
Linda | Mitchell | 607338.36 |
Jae | Pak | 588124.46 |
Michael | Blythe | 537596.88 |
Jillian | Carson | 455631.2 |
Ranjit | Varkey Chudukatil | 445945.19 |
José | Saraiva | 372077.25 |
Shu | Ito | 351219.37 |
Tsvi | Reiter | 330740.8 |
This, of course, is an extremely simple example of R (and something we could have also done in T-SQL), but it demonstrates how you can incorporate SQL Server data into your R scripts. Later in the series, we’ll dig a lot more deeply into R, but for now, we’ll continue to focus on the sp_execute_external_script stored procedure.
Naming data elements
The examples we’ve looked at so far have used the default names for the input and output data sets (InputDataSet and OutputDataSet, respectively); however, the sp_execute_external_script stored procedure lets us provide our own names. For example, to specify the name of the input data set (the SELECT statement), we include the @input_data_1_name parameter when calling the stored procedure, as shown in the following example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
DECLARE @rscript NVARCHAR(MAX); SET @rscript = N' OutputDataSet <- SqlData; OutputDataSet[,3] <- round(SqlData$SalesYTD / 7, 2);'; DECLARE @sqlscript NVARCHAR(MAX); SET @sqlscript = N' SELECT FirstName, LastName, SalesYTD FROM Sales.vSalesPerson WHERE SalesYTD > 2000000 ORDER BY SalesYTD DESC;'; EXEC sp_execute_external_script @language = N'R', @script = @rscript, @input_data_1 = @sqlscript, @input_data_1_name = N'SqlData'; GO |
In this case, the @input_data_1_name parameter specifies that the stored procedure should use SqlData to refer to the input data set, rather than the default name, InputDataSet. Notice that I’ve updated all references to the InputDataSet variable in the R script to SqlData. The stored procedure will still return the same results, despite having changed the references within the R script.
This process works much the same for the output data set, only this time we include the @output_data_1_name parameter when calling the stored procedure:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
DECLARE @rscript NVARCHAR(MAX); SET @rscript = N' MonthlySales <- SqlData; MonthlySales[,3] <- round(SqlData$SalesYTD / 7, 2);'; DECLARE @sqlscript NVARCHAR(MAX); SET @sqlscript = N' SELECT FirstName, LastName, SalesYTD FROM Sales.vSalesPerson WHERE SalesYTD > 2000000 ORDER BY SalesYTD DESC;'; EXEC sp_execute_external_script @language = N'R', @script = @rscript, @input_data_1 = @sqlscript, @input_data_1_name = N'SqlData', @output_data_1_name = N'MonthlySales'; GO |
In this case, the @output_data_1_name parameter specifies MonthlySales for the output data frame. To accommodate this new name, I changed all references in the R script from OutputDataSet to MonthlySales. Once again, the stored procedure still returns the same results.
Up to this point, none of the examples included column names for the returned data. To do so, we must add a WITH RESULT SETS clause when calling the sp_execute_external_script stored procedure. In the clause, we specify the column names and their data types, as shown in the following example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
DECLARE @rscript NVARCHAR(MAX); SET @rscript = N' MonthlySales <- SqlData; MonthlySales[,3] <- round(SqlData$SalesYTD / 7, 2);'; DECLARE @sqlscript NVARCHAR(MAX); SET @sqlscript = N' SELECT FirstName, LastName, SalesYTD FROM Sales.vSalesPerson WHERE SalesYTD > 2000000 ORDER BY SalesYTD DESC;'; EXEC sp_execute_external_script @language = N'R', @script = @rscript, @input_data_1 = @sqlscript, @input_data_1_name = N'SqlData', @output_data_1_name = N'MonthlySales' WITH RESULT SETS( (FirstName NVARCHAR(50), LastName NVARCHAR(50), MonthlyAvg MONEY)); GO |
Now the results include the specified column names:
FirstName | LastName | MonthlyAvg |
Linda | Mitchell | 607338.36 |
Jae | Pak | 588124.46 |
Michael | Blythe | 537596.88 |
Jillian | Carson | 455631.2 |
Ranjit | Varkey Chudukatil | 445945.19 |
José | Saraiva | 372077.25 |
Shu | Ito | 351219.37 |
Tsvi | Reiter | 330740.8 |
Instead of including the column names when adding the WITH RESULT SETS clause, we could have specified UNDEFINED to indicate that no column names should be included, but it wasn’t necessary because this is the default behavior. Another option is to specify NONE, which prevents the stored procedure from returning the data assigned to the MonthlySales variable.
Defining R input parameters
The sp_execute_external_script stored procedure also lets us use input parameters to pass data into our SELECT statement or R script. To do so, we must first use the @params parameter to declare the variables and then, as a separate step, assign a value to each parameter.
In the following EXECUTE statement, I’ve included the @params parameter and declared two input parameters: @TotalSales and @TotalMonths:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
DECLARE @rscript NVARCHAR(MAX); SET @rscript = N' MonthlySales <- SqlData; MonthlySales[,3] <- round(SqlData$SalesYTD / TotalMonths, 2);'; DECLARE @sqlscript NVARCHAR(MAX); SET @sqlscript = N' SELECT FirstName, LastName, SalesYTD FROM Sales.vSalesPerson WHERE SalesYTD > @TotalSales ORDER BY SalesYTD DESC;'; EXEC sp_execute_external_script @language = N'R', @script = @rscript, @input_data_1 = @sqlscript, @input_data_1_name = N'SqlData', @output_data_1_name = N'MonthlySales', @params = N'@TotalSales MONEY, @TotalMonths TINYINT', @TotalSales = 2000000, @TotalMonths = 7 WITH RESULT SETS( (FirstName NVARCHAR(50), LastName NVARCHAR(50), MonthlyAvg MONEY)); GO |
When defining the @params parameter, I assigned the input parameters as a single string value, using a comma to separate the individual variable declarations. Each declaration includes the name of the input variable and its data type. (We can also specify nullability.) After the @params parameter, I set the value for each input parameter. In this case, I’m using literal values, but as you’ll see later in the article, we can assign other types of values to the input parameters.
In addition to defining the input parameters, I also updated the R script to reference the @TotalMonths parameter and the SELECT statement to reference the @TotalSales parameter. Notice that for the R script, I do not precede the parameter name with an at (@) symbol and instead treat it just like a variable.
Adding R scripts to stored procedures
You can include the sp_execute_external_script stored procedure in a user-defined stored procedure, making it possible to call your R script just like any other database object. The only tricky part is to connect the input parameters of the user-defined stored procedure to the input parameters of the sp_execute_external_script stored procedure.
Let’s look at an example of how this works, based on the preceding code. In the following CREATE PROCEDURE statement, I include the @MinSales and @MonthsYTD input parameters as part of the procedure definition:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
DROP PROCEDURE IF EXISTS dbo.GetMonthlyAverages; GO CREATE PROCEDURE dbo.GetMonthlyAverages (@MinSales MONEY, @MonthsYTD TINYINT) AS SET NOCOUNT ON; DECLARE @rscript NVARCHAR(MAX); SET @rscript = N' MonthlySales <- SqlData; MonthlySales[,3] <- round(SqlData$SalesYTD / TotalMonths, 2);'; DECLARE @sqlscript NVARCHAR(MAX); SET @sqlscript = N' SELECT FirstName, LastName, SalesYTD FROM Sales.vSalesPerson WHERE SalesYTD > @TotalSales ORDER BY SalesYTD DESC;'; EXEC sp_execute_external_script @language = N'R', @script = @rscript, @input_data_1 = @sqlscript, @input_data_1_name = N'SqlData', @output_data_1_name = N'MonthlySales', @params = N'@TotalSales MONEY, @TotalMonths TINYINT', @TotalSales = @MinSales, @TotalMonths = @MonthsYTD WITH RESULT SETS( (FirstName NVARCHAR(50), LastName NVARCHAR(50), MonthlyAvg MONEY)); GO |
Notice that I assigned the @MinSales parameter to the @TotalSales parameter and the @MonthsYTD parameter to the @TotalMonths parameter. In this way, the values provided for the @MinSales and @MonthsYTD parameters when calling the GetMonthlyAverages stored procedure are passed to the @TotalSales and the @TotalMonths parameters. We can demonstrate this by running the following EXECUTE statement:
1 |
EXEC dbo.GetMonthlyAverages @MinSales = 2000000, @MonthsYTD = 7; |
We still get the same results as in the previous example, only now we get to persist the structure and have the ability to pass in different values when calling the GetMonthlyAverages procedure.
Defining R output parameters
One other point about the sp_execute_external_script stored procedure worth noting: You can define an output parameter for returning a scalar value in addition to or in place of the data frame returned by the R script.
The following example updates the preceding one to return only a scalar value, using the @mean variable and @MeanOut output parameter:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
DROP PROCEDURE IF EXISTS dbo.GetMonthlySalesMean; GO CREATE PROCEDURE dbo.GetMonthlySalesMean (@MinSales MONEY, @MonthsYTD TINYINT) AS SET NOCOUNT ON; DECLARE @rscript NVARCHAR(MAX); SET @rscript = N' sales <- SqlData; sales[,3] <- SqlData$SalesYTD / TotalMonths; MeanOut <- mean(sales[,3]);'; DECLARE @sqlscript NVARCHAR(MAX); SET @sqlscript = N' SELECT FirstName, LastName, SalesYTD FROM Sales.vSalesPerson WHERE SalesYTD > @TotalSales ORDER BY SalesYTD DESC;'; DECLARE @mean MONEY; EXEC sp_execute_external_script @language = N'R', @script = @rscript, @input_data_1 = @sqlscript, @input_data_1_name = N'SqlData', @params = N'@TotalSales MONEY, @TotalMonths TINYINT, @MeanOut MONEY OUTPUT', @TotalSales = @MinSales, @TotalMonths = @MonthsYTD, @MeanOut = @mean OUTPUT WITH RESULT SETS NONE; SELECT @mean AS SalesMean; GO |
First, I declared the @mean variable, using the MONEY data type. I then added the @MeanOut parameter declaration to the @params parameter value, again using the MONEY data type. As part of the declaration, I included the OUTPUT keyword. Next, I assigned the @mean variable to the @MeanOut parameter, again including the OUTPUT keyword.
Next, I simplified the second line of the R script by removing the round function, and then added a third line that uses the mean function to find the mean value of the SalesYTD averages. I then assigned this value to the @MeanOut variable so that result will be returned.
Another step I took was to set the WITH RESULT SETS clause NONE so the data frame itself is not returned. I also removed the @output_data_1_name parameter because it cannot be used when specifying NONE for the WITH RESULT SETS clause. If you include this parameter, SQL Server will return an error.
We can now try out our new procedure by running the following EXECUTE statement:
1 |
EXEC dbo.GetMonthlySalesMean @MinSales = 2000000, @MonthsYTD = 7; |
The stored procedure returns only a scalar value, which on my system is 461084.189.
Working with the R language
If you followed along with all the examples in this article, you should have a good idea of how to use SQL Server to run an R script that incorporates SQL Server data. Regardless of how complex your R scripts become, the basic stricture for running them is the same. Essentially, you define a SELECT statement, assign the results to a variable, and use the variable in your R script. In the R script, you can apply whatever logic you want, and then output the final data to the output variable, which is returned when you run the sp_execute_external_script stored procedure.
Now that you have a foundation for how to work with R in SQL Server, you should be ready to implement more robust scripts. In the articles to follow in this series, we’ll dive deeper into R scripting so you can perform more complex analytics. We’ll also look at how to render those results into meaningful visualizations. As you’ll discover, the R examples we’ve looked at so far barely scratch the surface of R’s potential. But they should at least provide you with a good foundation for moving forward.
Load comments