Feeds:
Posts
Comments

Posts Tagged ‘software’

Well, hello there. It’s been a long (long!) time since I posted anything here. Since we’re well overdue for new content, I thought I’d post something quick and useful.

Any programmer who’s tried to write SQL queries is well aware that SQL does it’s own, limited, pattern matching for values in strings. As useful and powerful as SQL can be, it really needs better facilities for managing strings. In Microsoft SQL Server, that need can be answered through the use of a CLR function. That is, user written C# or VB (or any other CLR supported language) functions that can be called from your SQL statement to manipulate strings, implement complex business logic, or pretty much anything else. In this case, we’re going to add regular expression handling to SQL Server.

This journey began with the need to extract a business license number from a comment field. Unfortunately, the comment field held all sorts of information, such as who authorized the license, dates of one sort or another, phone numbers, and an unexpected multitude of ways to abbreviate “business license”. Not the sort of thing that native SQL could do – but could easily be handled through regular expression pattern matching. Unfortunately, SQL Server doesn’t support regular expressions. But – we can fix that!

There is a lot of information about writing CLR functions for SQL Server and much of it revolves around implementing regular expressions. This site is one of the best. However, the assembly presented there is in VB and I’m a C# guy. Their implementation also changes the order of parameters from the .NET Regular Expression object’s order, and I wanted to preserve the parameter order so that general reference material on .NET regular expressions would still be relevant – i.e. no hidden gotchas.

So, we’re off to writing our own CLR function for SQL. This turned out to be far simpler than expected, however I did run into one hiccup. The CLR function you create must be compiled for .NET 3.5 for SQL Server 2008. SQL Server 2012 will, I believe support .NET 4.0 and later. Several of the CLR regular expression implementations that you can find on the web include a step for creating a Regex object. My initial thoughts went along the lines of creating and then caching a Regex object in a static variable, and recreating the object whenever the search pattern changed. Ultimately, in the name of simplicity, I ended up using the Regex static methods and skipped manually creating any objects. Here’s my code for the Regex implementation:

using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;

using System.Data.Sql;
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;
using System.Runtime.InteropServices;
using System.Text.RegularExpressions;
using System.Collections;

namespace SQLRegularExpression
{
    public static class RegExFunctions
    {

        [SqlFunction(IsDeterministic=true, IsPrecise=true)]
        public static SqlInt32 RegExOptions(SqlBoolean IgnoreCase, SqlBoolean Multiline, SqlBoolean ExplicitCapture, SqlBoolean Compiled,
            SqlBoolean Singleline, SqlBoolean IgnorePatternWhitespace, SqlBoolean RightToLeft, SqlBoolean ECMAScript, SqlBoolean CultureInvariant)
        {
            RegexOptions options;

            options = (IgnoreCase ? RegexOptions.IgnoreCase : RegexOptions.None)
                | (Multiline ? RegexOptions.Multiline : RegexOptions.None)
                | (ExplicitCapture ? RegexOptions.ExplicitCapture : RegexOptions.None)
                | (Compiled ? RegexOptions.Compiled : RegexOptions.None)
                | (Singleline ? RegexOptions.Singleline : RegexOptions.None)
                | (IgnorePatternWhitespace ? RegexOptions.IgnorePatternWhitespace : RegexOptions.None)
                | (RightToLeft ? RegexOptions.RightToLeft : RegexOptions.None)
                | (ECMAScript ? RegexOptions.ECMAScript : RegexOptions.None)
                | (CultureInvariant ? RegexOptions.CultureInvariant : RegexOptions.None);

            return (Int32) options;
        }

        [SqlFunction(IsDeterministic=true, IsPrecise=true)]
        public static SqlString RegExMatch(SqlString input, SqlString pattern, SqlInt32 options)
        {
            if (input.IsNull || pattern.IsNull) return null;
            return Regex.Match(input.Value, pattern.Value, (RegexOptions)options.Value).Value;
        }

        [SqlFunction(IsDeterministic=true, IsPrecise=true)]
        public static SqlInt32 RegExIndex(SqlString input, SqlString pattern, SqlInt32 options)
        {
            if (input.IsNull || pattern.IsNull) return 0;
            return Regex.Match(input.Value, pattern.Value, (RegexOptions)options.Value).Index;
        }

        [SqlFunction(IsDeterministic=true, IsPrecise=true)]
        public static SqlBoolean RegExIsMatch(SqlString input, SqlString pattern, SqlInt32 options)
        {
            if (input.IsNull || pattern.IsNull) return false;
            return Regex.IsMatch(input.Value, pattern.Value, (RegexOptions)options.Value);
        }

        [SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlString RegExReplace(SqlString input, SqlString pattern, SqlString replacement, SqlInt32 options)
        {
            if (input.IsNull || pattern.IsNull) return null;
            return Regex.Replace(input.Value, pattern.Value, replacement.Value, (RegexOptions)options.Value);
        }

        [SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlString RegExSqlReplace(SqlString input, SqlString pattern, SqlString replacement)
        {
            if (input.IsNull || pattern.IsNull) return null;
            return Regex.Replace(input.Value, pattern.Value, replacement.Value, RegexOptions.IgnoreCase | RegexOptions.Multiline);
        }

        [SqlFunction(IsDeterministic=true, IsPrecise=true)]
        public static SqlString RegExEscape(SqlString input)
        {
            if (input.IsNull) return null;
            return Regex.Escape(input.Value);
        }

        [SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true, IsPrecise = true, Name = "RegExSplit",
            SystemDataAccess = SystemDataAccessKind.None, FillRowMethodName = "RegExSplitRow")]
        public static IEnumerable RegExSplit(SqlString input, SqlString pattern, SqlInt32 options)
        {
            if (input.IsNull || pattern.IsNull) return null;
            return Regex.Split(input.Value, pattern.Value, (RegexOptions)options.Value);
        }

        public static void RegExSplitRow(Object input, ref SqlString match)
        {
            match = new SqlString(input.ToString());
        }

        [SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true, IsPrecise = true, Name = "RegExMatches",
            SystemDataAccess = SystemDataAccessKind.None, FillRowMethodName = "RegExMatchesRow")]
        public static IEnumerable RegExMatches(SqlString input, SqlString pattern, SqlInt32 options)
        {
            if (input.IsNull || pattern.IsNull) return null;
            return Regex.Matches(input.Value, pattern.Value, (RegexOptions)options.Value);
        }

        public static void RegExMatchesRow(Object input, ref SqlString match, ref SqlInt32 matchIndex, ref SqlInt32 matchLength)
        {
            Match m = (Match)input;
            match = new SqlString(m.Value);
            matchIndex = m.Index;
            matchLength = m.Length;
        }
    }
}

That’s it. So what’s going on here that makes this class callable from SQL server? Well, there are a couple of things. First and foremost is the attribute declaring that each of the main interface functions is a ‘SQLFunction‘. The properties of the SQLFunction attribute class help define to SQL Server how and what the method it’s applied to interfaces with SQL Server. While we’re not going to talk about this in detail, a little later we will examine the “FillRowMethodName” property. The other point to mention is the use of the SqlXxxx variable types both for paramters and return values. These types contain extra information for dealing with Null values from SQL Server. Compile this with an output type of ‘Class Library’ and then copy the .dll file onto your SQL server machine.

On the SQL Server side, we need to register our new assembly and define the methods as functions. This script locates the .dll file you just copied to the SQL Server machine and declares each function. You will need to put in the path to the location of the .dll in the ‘Create Assembly’ statement in the middle of the script.

sp_configure 'clr enabled', 1
RECONFIGURE WITH OVERRIDE
 
-- ************************************************************************************************
-- Remove Function Entry Points
-- ************************************************************************************************
 
IF EXISTS ( SELECT   1
            FROM     sys.objects
            WHERE    object_id = OBJECT_ID(N'dbo.RegExOptions') )
   DROP FUNCTION dbo.RegExOptions
 
IF EXISTS ( SELECT   1
            FROM     sys.objects
            WHERE    object_id = OBJECT_ID(N'dbo.RegExSplit') )
   DROP FUNCTION dbo.RegExSplit
go
 
IF EXISTS ( SELECT   1
            FROM     sys.objects
            WHERE    object_id = OBJECT_ID(N'dbo.RegExEscape') )
   DROP FUNCTION dbo.RegExEscape
go
 
IF EXISTS ( SELECT   1
            FROM     sys.objects
            WHERE    object_id = OBJECT_ID(N'dbo.RegExReplace') )
   DROP FUNCTION dbo.RegExReplace
go
-- IF EXISTS ( SELECT   1
--             FROM     sys.objects
--             WHERE    object_id = OBJECT_ID(N'dbo.RegExReplacex') )
--    DROP FUNCTION dbo.RegExReplacex
-- go
 
IF EXISTS ( SELECT   1
            FROM     sys.objects
            WHERE    object_id = OBJECT_ID(N'dbo.RegExIndex') )
   DROP FUNCTION dbo.RegExIndex
go
 
IF EXISTS ( SELECT   1
            FROM     sys.objects
            WHERE    object_id = OBJECT_ID(N'dbo.RegExIsMatch') )
   DROP FUNCTION dbo.RegExIsMatch
go
 
IF EXISTS ( SELECT   1
            FROM     sys.objects
            WHERE    object_id = OBJECT_ID(N'dbo.RegExMatch') )
   DROP FUNCTION dbo.RegExMatch
go
 
IF EXISTS ( SELECT   1
            FROM     sys.objects
            WHERE    object_id = OBJECT_ID(N'dbo.RegExMatches') )
   DROP FUNCTION dbo.RegExMatches
go
 
-- ************************************************************************************************
-- Remove the assembly from SQL Server
-- ************************************************************************************************

IF EXISTS ( SELECT   1
            FROM     sys.assemblies asms
            WHERE    asms.name = N'RegexFunction ' ) 
   DROP ASSEMBLY [RegexFunction]

-- ************************************************************************************************
-- Create Assembly reference
-- ************************************************************************************************

CREATE ASSEMBLY RegexFunction 
    FROM 'C:\\path\\to\\your\\assembly\\SQLRegularExpression.dll'
GO

-- ************************************************************************************************
-- Create Function Entry Points
-- ************************************************************************************************
 
CREATE FUNCTION [dbo].[RegExOptions]
      (
      @IgnoreCase bit,
        @MultiLine bit,
        @ExplicitCapture bit,
        @Compiled  bit,
        @SingleLine  bit,
        @IgnorePatternWhitespace  bit,
        @RightToLeft  bit,
        @ECMAScript  bit,
        @CultureInvariant  bit
        )
returns int
AS EXTERNAL NAME
   RegexFunction.[SQLRegularExpression.RegExFunctions].RegExOptions
 
go
 
CREATE FUNCTION [dbo].[RegExIsMatch]
   (
    @Input NVARCHAR(MAX),
    @Pattern NVARCHAR(4000),
    @Options int
   )
RETURNS BIT
AS EXTERNAL NAME
   RegexFunction.[SQLRegularExpression.RegExFunctions].RegExIsMatch
GO
 
CREATE FUNCTION [dbo].[RegExIndex]
   (
    @Input NVARCHAR(MAX),
    @Pattern NVARCHAR(4000),
    @Options int
   )
RETURNS int
AS EXTERNAL NAME
   RegexFunction.[SQLRegularExpression.RegExFunctions].RegExIndex
GO
 
CREATE FUNCTION [dbo].[RegExMatch]
   (
    @Input NVARCHAR(MAX),
    @Pattern NVARCHAR(4000),
    @Options int
 )
RETURNS NVARCHAR(MAX)
AS EXTERNAL NAME
   RegexFunction.[SQLRegularExpression.RegExFunctions].RegExMatch
GO
 
CREATE FUNCTION [dbo].[RegExEscape]
   (
    @Input NVARCHAR(MAX)
   )
RETURNS NVARCHAR(MAX)
AS EXTERNAL NAME
   RegexFunction.[SQLRegularExpression.RegExFunctions].RegExEscape
GO

CREATE FUNCTION [dbo].[RegExSplit]
   (
    @Input NVARCHAR(MAX),
    @Pattern NVARCHAR(4000),
    @Options int
   )
RETURNS TABLE (Match NVARCHAR(MAX))
AS EXTERNAL NAME
   RegexFunction.[SQLRegularExpression.RegExFunctions].RegExSplit
GO
 
CREATE FUNCTION [dbo].[RegExReplace]
   (
    @Input NVARCHAR(MAX),
    @Pattern NVARCHAR(4000),
    @Repacement NVARCHAR(MAX),
    @Options int
   )
RETURNS  NVARCHAR(MAX)
AS EXTERNAL NAME
   RegexFunction.[SQLRegularExpression.RegExFunctions].RegExReplace
GO
 
CREATE FUNCTION [dbo].[RegExMatches]
   (
    @Input NVARCHAR(MAX),
    @Pattern NVARCHAR(4000),
    @Options int
   )
RETURNS TABLE (Match NVARCHAR(MAX), MatchIndex INT, MatchLength INT)
AS EXTERNAL NAME
   RegexFunction.[SQLRegularExpression.RegExFunctions].RegExMatches
GO

The one thing that astonished me about all this was that it worked the first time. That never happens! Anyway, there is one thing about these that I would like to point out: the two functions that return tables instead of a value. These are RegExSplit and RegExMatches. If you scroll up and look at the C# definitions, you will see a couple of things. First, these two functions return an IEnumerable instead of a value, and that each have the SQLFunction attribute parameter “FillRowMethodName” set to a function also defined in the C# code. What happens with these is that the base function does it’s thing and returns a list (IEnumerable) of results. The SQL interface process then calls the method defined in the “FillRowMethodName” property for each result entry in the IEnumerable. The first paramter to the fill row method is the enumerated object returned by the base call (RegExSplit or RegExMatches), and the second and later parameters are all defined as ‘ref’ parameters. You’ll notice that the SQL declaration for these functions returns a table whose columns exactly match those ‘ref’ parameters on the fill row method. So, each row of the result table is the product of calling the fill row method for each entry in the IEnumerable list returned by the base method.

So, how to use this? You can use the declared functions in the same way that you would use any other SQL Function. In this case, I’m using the RegExMatch function to extract the aforementioned business license number from the comments field, and using the RegExReplace function to remove terminal abbreviations for Street or Road (‘ST’ or ‘RD’, etc.) from the street name.

select 
    dbo.RegExMatch(comments, '\d{7}|d{6}', 0) as license, 
    street_number, 
    dbo.RegExReplace(street_name, '\bLN\Z|\bAV\Z|\bDR\Z|\bPL\Z|\bRD\Z|\bCR\Z|\bCT\Z|\bBL\Z|\bST\Z|\bWY\Z|\bHY\Z', '', 0) as name,
    city,
    state,
    zip
from mytable z

A final word on the use of CLR functions. Depending on your security setup, the use of external functions can be considered a security risk. You will need to evaluate the business need for functions like this against the risks involved. You should also be wary of using CLR functions in SQL Server to implement functionality that architecturally belongs in the business or presentation layers. This I would strongly encourage you to not do. Put things where they belong, even if it’s difficult, and you’ll be rewarded with a better performing, more maintainable system overall.

Read Full Post »