Well, hello there. It’s been a long (long!) time since I posted anything here. Since we’re well overdue for new content, I thought I’d post something quick and useful.
Any programmer who’s tried to write SQL queries is well aware that SQL does it’s own, limited, pattern matching for values in strings. As useful and powerful as SQL can be, it really needs better facilities for managing strings. In Microsoft SQL Server, that need can be answered through the use of a CLR function. That is, user written C# or VB (or any other CLR supported language) functions that can be called from your SQL statement to manipulate strings, implement complex business logic, or pretty much anything else. In this case, we’re going to add regular expression handling to SQL Server.
This journey began with the need to extract a business license number from a comment field. Unfortunately, the comment field held all sorts of information, such as who authorized the license, dates of one sort or another, phone numbers, and an unexpected multitude of ways to abbreviate “business license”. Not the sort of thing that native SQL could do – but could easily be handled through regular expression pattern matching. Unfortunately, SQL Server doesn’t support regular expressions. But – we can fix that!
There is a lot of information about writing CLR functions for SQL Server and much of it revolves around implementing regular expressions. This site is one of the best. However, the assembly presented there is in VB and I’m a C# guy. Their implementation also changes the order of parameters from the .NET Regular Expression object’s order, and I wanted to preserve the parameter order so that general reference material on .NET regular expressions would still be relevant – i.e. no hidden gotchas.
So, we’re off to writing our own CLR function for SQL. This turned out to be far simpler than expected, however I did run into one hiccup. The CLR function you create must be compiled for .NET 3.5 for SQL Server 2008. SQL Server 2012 will, I believe support .NET 4.0 and later. Several of the CLR regular expression implementations that you can find on the web include a step for creating a Regex object. My initial thoughts went along the lines of creating and then caching a Regex object in a static variable, and recreating the object whenever the search pattern changed. Ultimately, in the name of simplicity, I ended up using the Regex static methods and skipped manually creating any objects. Here’s my code for the Regex implementation:
using System.Collections.Generic; using System.Linq; using System.Text; using System.Threading; using System.Data.Sql; using Microsoft.SqlServer.Server; using System.Data.SqlTypes; using System.Runtime.InteropServices; using System.Text.RegularExpressions; using System.Collections; namespace SQLRegularExpression { public static class RegExFunctions { [SqlFunction(IsDeterministic=true, IsPrecise=true)] public static SqlInt32 RegExOptions(SqlBoolean IgnoreCase, SqlBoolean Multiline, SqlBoolean ExplicitCapture, SqlBoolean Compiled, SqlBoolean Singleline, SqlBoolean IgnorePatternWhitespace, SqlBoolean RightToLeft, SqlBoolean ECMAScript, SqlBoolean CultureInvariant) { RegexOptions options; options = (IgnoreCase ? RegexOptions.IgnoreCase : RegexOptions.None) | (Multiline ? RegexOptions.Multiline : RegexOptions.None) | (ExplicitCapture ? RegexOptions.ExplicitCapture : RegexOptions.None) | (Compiled ? RegexOptions.Compiled : RegexOptions.None) | (Singleline ? RegexOptions.Singleline : RegexOptions.None) | (IgnorePatternWhitespace ? RegexOptions.IgnorePatternWhitespace : RegexOptions.None) | (RightToLeft ? RegexOptions.RightToLeft : RegexOptions.None) | (ECMAScript ? RegexOptions.ECMAScript : RegexOptions.None) | (CultureInvariant ? RegexOptions.CultureInvariant : RegexOptions.None); return (Int32) options; } [SqlFunction(IsDeterministic=true, IsPrecise=true)] public static SqlString RegExMatch(SqlString input, SqlString pattern, SqlInt32 options) { if (input.IsNull || pattern.IsNull) return null; return Regex.Match(input.Value, pattern.Value, (RegexOptions)options.Value).Value; } [SqlFunction(IsDeterministic=true, IsPrecise=true)] public static SqlInt32 RegExIndex(SqlString input, SqlString pattern, SqlInt32 options) { if (input.IsNull || pattern.IsNull) return 0; return Regex.Match(input.Value, pattern.Value, (RegexOptions)options.Value).Index; } [SqlFunction(IsDeterministic=true, IsPrecise=true)] public static SqlBoolean RegExIsMatch(SqlString input, SqlString pattern, SqlInt32 options) { if (input.IsNull || pattern.IsNull) return false; return Regex.IsMatch(input.Value, pattern.Value, (RegexOptions)options.Value); } [SqlFunction(IsDeterministic = true, IsPrecise = true)] public static SqlString RegExReplace(SqlString input, SqlString pattern, SqlString replacement, SqlInt32 options) { if (input.IsNull || pattern.IsNull) return null; return Regex.Replace(input.Value, pattern.Value, replacement.Value, (RegexOptions)options.Value); } [SqlFunction(IsDeterministic = true, IsPrecise = true)] public static SqlString RegExSqlReplace(SqlString input, SqlString pattern, SqlString replacement) { if (input.IsNull || pattern.IsNull) return null; return Regex.Replace(input.Value, pattern.Value, replacement.Value, RegexOptions.IgnoreCase | RegexOptions.Multiline); } [SqlFunction(IsDeterministic=true, IsPrecise=true)] public static SqlString RegExEscape(SqlString input) { if (input.IsNull) return null; return Regex.Escape(input.Value); } [SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true, IsPrecise = true, Name = "RegExSplit", SystemDataAccess = SystemDataAccessKind.None, FillRowMethodName = "RegExSplitRow")] public static IEnumerable RegExSplit(SqlString input, SqlString pattern, SqlInt32 options) { if (input.IsNull || pattern.IsNull) return null; return Regex.Split(input.Value, pattern.Value, (RegexOptions)options.Value); } public static void RegExSplitRow(Object input, ref SqlString match) { match = new SqlString(input.ToString()); } [SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true, IsPrecise = true, Name = "RegExMatches", SystemDataAccess = SystemDataAccessKind.None, FillRowMethodName = "RegExMatchesRow")] public static IEnumerable RegExMatches(SqlString input, SqlString pattern, SqlInt32 options) { if (input.IsNull || pattern.IsNull) return null; return Regex.Matches(input.Value, pattern.Value, (RegexOptions)options.Value); } public static void RegExMatchesRow(Object input, ref SqlString match, ref SqlInt32 matchIndex, ref SqlInt32 matchLength) { Match m = (Match)input; match = new SqlString(m.Value); matchIndex = m.Index; matchLength = m.Length; } } }
That’s it. So what’s going on here that makes this class callable from SQL server? Well, there are a couple of things. First and foremost is the attribute declaring that each of the main interface functions is a ‘SQLFunction‘. The properties of the SQLFunction attribute class help define to SQL Server how and what the method it’s applied to interfaces with SQL Server. While we’re not going to talk about this in detail, a little later we will examine the “FillRowMethodName” property. The other point to mention is the use of the SqlXxxx variable types both for paramters and return values. These types contain extra information for dealing with Null values from SQL Server. Compile this with an output type of ‘Class Library’ and then copy the .dll file onto your SQL server machine.
On the SQL Server side, we need to register our new assembly and define the methods as functions. This script locates the .dll file you just copied to the SQL Server machine and declares each function. You will need to put in the path to the location of the .dll in the ‘Create Assembly’ statement in the middle of the script.
sp_configure 'clr enabled', 1 RECONFIGURE WITH OVERRIDE -- ************************************************************************************************ -- Remove Function Entry Points -- ************************************************************************************************ IF EXISTS ( SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.RegExOptions') ) DROP FUNCTION dbo.RegExOptions IF EXISTS ( SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.RegExSplit') ) DROP FUNCTION dbo.RegExSplit go IF EXISTS ( SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.RegExEscape') ) DROP FUNCTION dbo.RegExEscape go IF EXISTS ( SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.RegExReplace') ) DROP FUNCTION dbo.RegExReplace go -- IF EXISTS ( SELECT 1 -- FROM sys.objects -- WHERE object_id = OBJECT_ID(N'dbo.RegExReplacex') ) -- DROP FUNCTION dbo.RegExReplacex -- go IF EXISTS ( SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.RegExIndex') ) DROP FUNCTION dbo.RegExIndex go IF EXISTS ( SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.RegExIsMatch') ) DROP FUNCTION dbo.RegExIsMatch go IF EXISTS ( SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.RegExMatch') ) DROP FUNCTION dbo.RegExMatch go IF EXISTS ( SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.RegExMatches') ) DROP FUNCTION dbo.RegExMatches go -- ************************************************************************************************ -- Remove the assembly from SQL Server -- ************************************************************************************************ IF EXISTS ( SELECT 1 FROM sys.assemblies asms WHERE asms.name = N'RegexFunction ' ) DROP ASSEMBLY [RegexFunction] -- ************************************************************************************************ -- Create Assembly reference -- ************************************************************************************************ CREATE ASSEMBLY RegexFunction FROM 'C:\\path\\to\\your\\assembly\\SQLRegularExpression.dll' GO -- ************************************************************************************************ -- Create Function Entry Points -- ************************************************************************************************ CREATE FUNCTION [dbo].[RegExOptions] ( @IgnoreCase bit, @MultiLine bit, @ExplicitCapture bit, @Compiled bit, @SingleLine bit, @IgnorePatternWhitespace bit, @RightToLeft bit, @ECMAScript bit, @CultureInvariant bit ) returns int AS EXTERNAL NAME RegexFunction.[SQLRegularExpression.RegExFunctions].RegExOptions go CREATE FUNCTION [dbo].[RegExIsMatch] ( @Input NVARCHAR(MAX), @Pattern NVARCHAR(4000), @Options int ) RETURNS BIT AS EXTERNAL NAME RegexFunction.[SQLRegularExpression.RegExFunctions].RegExIsMatch GO CREATE FUNCTION [dbo].[RegExIndex] ( @Input NVARCHAR(MAX), @Pattern NVARCHAR(4000), @Options int ) RETURNS int AS EXTERNAL NAME RegexFunction.[SQLRegularExpression.RegExFunctions].RegExIndex GO CREATE FUNCTION [dbo].[RegExMatch] ( @Input NVARCHAR(MAX), @Pattern NVARCHAR(4000), @Options int ) RETURNS NVARCHAR(MAX) AS EXTERNAL NAME RegexFunction.[SQLRegularExpression.RegExFunctions].RegExMatch GO CREATE FUNCTION [dbo].[RegExEscape] ( @Input NVARCHAR(MAX) ) RETURNS NVARCHAR(MAX) AS EXTERNAL NAME RegexFunction.[SQLRegularExpression.RegExFunctions].RegExEscape GO CREATE FUNCTION [dbo].[RegExSplit] ( @Input NVARCHAR(MAX), @Pattern NVARCHAR(4000), @Options int ) RETURNS TABLE (Match NVARCHAR(MAX)) AS EXTERNAL NAME RegexFunction.[SQLRegularExpression.RegExFunctions].RegExSplit GO CREATE FUNCTION [dbo].[RegExReplace] ( @Input NVARCHAR(MAX), @Pattern NVARCHAR(4000), @Repacement NVARCHAR(MAX), @Options int ) RETURNS NVARCHAR(MAX) AS EXTERNAL NAME RegexFunction.[SQLRegularExpression.RegExFunctions].RegExReplace GO CREATE FUNCTION [dbo].[RegExMatches] ( @Input NVARCHAR(MAX), @Pattern NVARCHAR(4000), @Options int ) RETURNS TABLE (Match NVARCHAR(MAX), MatchIndex INT, MatchLength INT) AS EXTERNAL NAME RegexFunction.[SQLRegularExpression.RegExFunctions].RegExMatches GO
The one thing that astonished me about all this was that it worked the first time. That never happens! Anyway, there is one thing about these that I would like to point out: the two functions that return tables instead of a value. These are RegExSplit and RegExMatches. If you scroll up and look at the C# definitions, you will see a couple of things. First, these two functions return an IEnumerable instead of a value, and that each have the SQLFunction attribute parameter “FillRowMethodName” set to a function also defined in the C# code. What happens with these is that the base function does it’s thing and returns a list (IEnumerable) of results. The SQL interface process then calls the method defined in the “FillRowMethodName” property for each result entry in the IEnumerable. The first paramter to the fill row method is the enumerated object returned by the base call (RegExSplit or RegExMatches), and the second and later parameters are all defined as ‘ref’ parameters. You’ll notice that the SQL declaration for these functions returns a table whose columns exactly match those ‘ref’ parameters on the fill row method. So, each row of the result table is the product of calling the fill row method for each entry in the IEnumerable list returned by the base method.
So, how to use this? You can use the declared functions in the same way that you would use any other SQL Function. In this case, I’m using the RegExMatch function to extract the aforementioned business license number from the comments field, and using the RegExReplace function to remove terminal abbreviations for Street or Road (‘ST’ or ‘RD’, etc.) from the street name.
select dbo.RegExMatch(comments, '\d{7}|d{6}', 0) as license, street_number, dbo.RegExReplace(street_name, '\bLN\Z|\bAV\Z|\bDR\Z|\bPL\Z|\bRD\Z|\bCR\Z|\bCT\Z|\bBL\Z|\bST\Z|\bWY\Z|\bHY\Z', '', 0) as name, city, state, zip from mytable z
A final word on the use of CLR functions. Depending on your security setup, the use of external functions can be considered a security risk. You will need to evaluate the business need for functions like this against the risks involved. You should also be wary of using CLR functions in SQL Server to implement functionality that architecturally belongs in the business or presentation layers. This I would strongly encourage you to not do. Put things where they belong, even if it’s difficult, and you’ll be rewarded with a better performing, more maintainable system overall.