What is Python?
Python is a dynamic, interpreted language which was originally designed by Guido van Rossum with code readability as one of its primary goals. As part of that, one its most unusual features is the use of meaningful white space. Python uses code blocks that are shown and enforced through indentation. Although some of its extensions and libraries are specific to one operating system, the core language is available for all major operating systems and most Python scripts will run unaltered from one system to another.
Why would a SQL Server DBA be interested in Python?
The Microsoft T-SQL programming language that is built into SQL Server along with the capabilities of included tools such as SSIS and SSRS form a remarkably versatile platform without any third party tools whatsoever. In fact, T-SQL itself is Turing complete.
However, they are not the best tool for all tasks. For instance, SQL Server by itself does not have full support for regular expressions. While it is fully capable of interacting with the file system there are generally better tools for any complex manipulation of the operating system. Pure SQL is also not the strongest option for complex mathematics or complicated file munging.
Python is one tool capable of handling those tasks. It is certainly not the only option, with Microsoft C# and PowerShell also being excellent options for some of these tasks. Perl and Lisp are amongst many other options, each with their own strong points.
Python has several strong points which make it worthy of attention. It is free and open source software, but far more significantly it is easy to learn and easy to read even for people who are not Python programmers. It is capable of being used to write full scale applications and server software, but for most DBAs it is more significant that it is excellent for jotting off quick utility scripts to solve a specific problem. For instance, it is excellent for writing scripts to reorganize files into a new directory structure and automatically verify that all the automated backup files actually are where they are supposed to be each morning.
How Python can be used with SQL Server
Python can be used to parse data that is not in a standard format so it can be imported into SQL Server.
This is one of my primary uses for it for in connection with SQL. I have received numerous files of data that was either structured in a nonstandard format or completely unstructured but with data such as e-mail addresses that need to be pulled out en mass. While it is possible to do these in SSIS or even pull it into a SQL Temp Table line by line and then parse it using pure T-SQL it is generally easier and faster to use Python or occasionally C# depending on the file type.
For instance, a file that is structured like this:
1 2 3 4 5 |
a, a, b, b, b, c, c, c, c d, d, d e |
… does not work easily with tools such as bulk insert or SSIS. [2] However Python can be used to easily determine the needed number of commas and add them to all lines, such as:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
#Standardizes the number of commas per line prior to being read by #programs such as BCP and required for compliance with RFC4180 for CSV files #initialize the variables, change FileNames as appropriate InFileName = r'C:\test\testimport.csv' OutFileName = r'C:\test\testimport2.csv' NumCommas = 0 File = open(InFileName) for line in File: if line.count(',') > NumCommas: NumCommas = line.count(',') #return to the start of the file File.seek(0) OutFile = open(OutFileName, 'w') for line in File: OutFile.write(line.rstrip() + ',' * (NumCommas - line.count(',')) + '\n') OutFile.close() File.close() |
Afterwards the output file is quite amenable to Bulk Insert.
There are several choices to actually connect with SQL Server within python. The options include the default odbc which comes as a standard library, the win32com client tools, mxODBC (commercial product) and pyODBC. While the odbc module will handle almost any requirement, the other options add additional features which can simplify development.
Pyodbc, for example, permits access of values by name and makes cursors iterable. These changes from the default odbc package make it much easier to use. As an example, pyodbc can be used to connect to a SQL Server and then extract all e-mail addresses from a free form memo field.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 |
import pyodbc as p import re #RegEx library server = 'TWISEMAN-PC' database = 'test' table = 'salesData' field = 'memo' connStr = ( r'DRIVER={SQL Server};SERVER=' + server + ';DATABASE=' + database + ';' + 'Trusted_Connection=yes' ) #e-mail regex examples available at #http://www.regular-expressions.info/email.html pattern = r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' emailMatcher = re.compile(pattern, re.I) emails = [] #create a list conn = p.connect(connStr) dbCursor = conn.cursor() sql = ('select ' + field + ' as memo ' ' from '+ table+ ' where ' + field + ' like \'%@%\'') dbCursor = conn.cursor() dbCursor.execute(sql) for row in dbCursor: newEmails = emailMatcher.finditer(row.memo) if newEmails is not None: emails.extend([addr.group() for addr in newEmails]) conn.close() print( emails) |
A similar script can scan an unstructured text document to extract e-mail addresses and insert them into a SQL table. Of course, it is often more useful to be able to call stored procedures. This can be done with pyodbc by sending the exec command for the procedure with all parameters through the execute method. Many of the libraries automatically begin transactions, so those most be committed or rolled back in the code, though some such as pyodbc permit a connection to be opened in autocommit mode.
How to acquire Python
Python is free and open source software maintained by the Python Software Foundation (PSF). The latest version can be downloaded directly from the PSF at Python.ORG Python Programming Language — Official Website
For many people that is not the best option though. For one thing, depending on what libraries and compatibility with other older code is needed, it may be wise to use a slightly more established version than the latest release. Currently, both the 2.7.x and 3.x lines are being maintained. The 3.x line is the more modern version with many enhancements, but not all libraries have been brought up to date and it is not entirely backwards compatible with the 2.7 line. Python 3 has many, but since it breaks backwards compatibility, a need to work with an older code base or with libraries that have not been updated may make 2.7 the better choice in some cases.
Also, it may be easier to download a pre-packaged distribution such as WinPython, ActivePython, or the commercially supported Enthought Canopy. They come with many packages and libraries that are not part of the core Python Distribution as well as pre-configured development environments which are more sophisticated than IDLE. Getting one of the distributions can save a lot of time downloading and installing individual components later.
There are many options for Python IDE’s and editors. Personally, I tend to use Spyder when I want to work with Python interactively or write short, simple scripts. It is lightweight and fast, but has some basic editing and introspection features that are convenient. When I am writing a longer or more complicated script/program I tend to use Eclipse with Pydev. It is loads slowly but has very sophisticated introspection and project management abilities. It is largely a matter of taste.
How to learn and get help with Python
There are many excellent resources on learning python and getting help. Once the absolute basics have been mastered, one of the best sources of information is the Doc strings for the entity being called. Many IDE’s provide user friendly access to the doc strings, but they can always be called by typing:
1 |
print([object_name].__doc__)" |
at an interactive prompt.
A few of the more useful resources for learning Python are:
- Dive into Python
- How to Think Like a Computer Scientist
- Think Stats: Probability and Statistics for Programmers
Also, there are some more generalized programming resources with excellent Python sections, such as Stack Overflow.
Further Reading:
- Comparing Python and PowerShell DBA Scripting – An excellent article by Ron Dameron comparing Python and Powershell with examples in each.
- DBA Script Exodus to Python or Ruby Expected in the Near Term – A broad overview of the value of scripting languages to the DBA.
Load comments