By Rudderstack Team

SQL Pattern Matching Guide

Pattern matching over strings in SQL is a frequent need, much more frequent than some may think. Queries aren’t just for compiling demanding aggregate calculations, advanced joins, and table partitioning. This applies especially to data scientists and data engineers who often have to deal with raw, unstructured data. For these professional pattern matching operations, from the most basic to more advanced, are probably an everyday task.

The demand for efficient SQL pattern matching becomes immediate as the velocity of available data increases, especially within data-driven companies where decision-making is highly dependent on data.

The simplest and most common case in which such an operation is needed is when one simply wants to search in his database for data that match exactly or are similar to a specific pattern. However, this becomes extremely useful when you do not actually know the exact word or phrase you seek. E.g., Those whose name is ‘John’ or contains the string ‘John’, for example, ‘Johnpaul’.

In this post, we will go through all the capabilities that the majority of SQL dialects offer regarding both the aforementioned tasks when using Amazon Redshift, PostgreSQL, Google BigQuery, or SQL Server.

So, let’s get started!

When it comes to pattern matching, the usual options are 3: LIKE operator, SIMILAR TO operator, which is available only on some SQL dialects, and Regular Expressions.

LIKE operator

We will start our exploration with LIKE. It is probably the simplest of all expressions and is present in many database systems, including PostgreSQL, MS SQL Server, Redshift, and BigQuery.

When it comes to BigQuery, the LIKE operator is available for Legacy SQL, the default for BigQuery, and the updated SQL.

Moreover, according to PostgreSQL and Redshift official documentation, LIKE is more secure than SIMILAR TO, which is available only for PostgreSQL and Redshift, or POSIX-style regular expressions. Since SIMILAR TO is based on regular expressions, both methods are prone to the same security hazards and, more specifically, the exploitation of the fact that most regex can be pushed to extreme situations that make them work slowly. Entering these extreme situations, a program can hang for a long time.

In order to use LIKE one has to use the following syntax: <string> LIKE <pattern>.

This syntax remains the same for all the aforementioned database systems.

So, LIKE is an expression that is evaluated either as true or false depending on whether the string matches the given pattern.

Also, LIKE is often combined with other special characters, i.e. ‘%’ and ‘_’, which are called wildcards. When the underscore is present in a pattern, it means that it matches any single character while percent matches any sequence of characters of any length.

In order to clarify things a bit more, below we present some common examples of LIKE usages:

SELECT name
FROM customers
WHERE location LIKE%Yor%

This query will retrieve customers that come from New York, Yorkshire or any other city that includes ‘Yor’ in its name. In contrast, if we would like to retrieve only customers located in cities starting with Yor, this query would be sufficient:

SELECT name
FROM customers
WHERE location LIKE ‘Yor%

Equivalently one can use the underscore wildcard, which, as mentioned before, it matches any single character. With the following query, you will retrieve customers that are located, for example, in either Granada (Spain) or Grenada (Caribbean Island).

SELECT name
FROM customers
WHERE location LIKE ‘Gr_nada’

In Google BigQuery, an operator alternative to LIKE is CONTAINS. Its functionality is quite the same except the fact that with CONTAINS, you do not have to use the % wildcard mentioned above as, by default, CONTAINS matches anywhere within the string.

SIMILAR TO operator

In order to use SIMILAR TO, one has to use the following syntax: <string> SIMILAR TO <pattern>.

SIMILAR TO can be used only on PostgreSQL and Redshift, and for the time being there is no equivalent operator in the rest of the database systems like BigQuery and Microsoft SQL Server.

SIMILAR TO works in quite the same way as LIKE, as it returns true or false depending on whether the string matches the supplied pattern entirely. SIMILAR TO also supports features borrowed by regular expressions, including the following:

  • . denotes the matching of any single character. E.g., ‘a.c’ matches ‘abc’, ‘adb’ etc.
  • | is used to denote two alternatives. E.g. ‘blendo’ SIMILAR TO ‘%(b | z)%’ returns True
  • * denotes repetition of the previous symbol zero or more times. E.g. both ‘bbblendo’ SIMILAR TO ‘%b*le%’ and ‘lendo’ SIMILAR TO ‘%b*le%’ return True.
  • {m} denotes repetition of the previous symbol exactly m times. E.g. ‘blenndo’ SIMILAR TO ‘%len{2}%’ returns True.
  • {m,} denotes repetition of the previous symbol at least m times. E.g. ‘blennndo’ SIMILAR TO ‘%len{2,}%’ returns True.
  • {m,n} denotes repetition of the previous item at least m times and no more than n times. E.g. ‘bleeendo’ SIMILAR TO ‘%le{2,4}n%’ also returns True.

Regular expressions

In cases where LIKE or SIMILAR TO does not cover your needs, regular expressions, or simply regex, probably will as they provide you with a much more powerful way of SQL pattern matching. Those familiar with UNIX systems will probably have already used regex at some point, as they are part of many UNIX tools, including sed, awk, and grep.

So, what is a regular expression? A regular expression is a sequence of characters that comply with certain formation rules and can be seen as the abbreviation of the whole set of strings. In this case, a string is said to match the regular expression only if it is included in the set of strings the regex describes.

To begin with, all the features described in the SIMILAR TO section, i.e |, *, {m}, {m,} and {m,n}, can be used for formulating a regex.

The simplest case is when you want to check if a string contains or does not contains a specified set of characters. In PostgreSQL and Redshift, the operator widely used is the tilde (~). The tilde can be combined with other operators such as (*) and (!) to form different matching conditions.

Some examples are presented below:

Especially, for BigQuery the function used for pattern matching using regular expressions is the REGEX_MATCH. This function considers the <string>, or more generally the column name, and the regex pattern.

Other important features of regular expressions are the so-called bracket expressions. The [ <set of chars> ] denotes a list of characters and matches every single character from this list. Two characters are separated within a bracket list with the minus symbol (-), abbreviating the full range of characters between them. For example [A-D] is a shorthand for ‘A,B,C,D’ while [0-4] a shorthand for ‘0,1,2,3,4’. When the power sign (^) is included before the bracket expression, it matches every character which is NOT included in the list.

It is worth mentioning that unlike LIKE, regular expressions match anywhere within the given string unless specified otherwise by explicitly anchoring the match to the beginning or the end of the string. This can be done by adding ^ or $ at the beginning of the end of the regex to anchor the match at each position accordingly. So, for example, while ‘blendo’ ~ ‘nd’ will returns true, ‘blendo’ ~ ‘^nd’ will return False because in order to match, it expects to match ‘nd’ at the beginning of the string.

In PostgreSQL, there is also a special function called substring(<string> from <pattern>) which takes as input a string and a regular expression and extracts the first substring that is found to match the provided expression. So, for example, substring(‘blendo’ from ‘e.d’) will return ‘end’.

If your needs cannot be satisfied by any of these alternatives, a user-defined function in Perl is probably your way to go.

Outro

As you can easily tell, using the SQL pattern matching capabilities most database systems offer can help database users perform partial matches of data values and obtain answers to questions beyond simple text queries and conventional searching and comparing operations.

No matter what database system best fits your needs, the majority of them satisfactorily support all these advanced concepts previously mentioned in this post.

Get Started Image

Get started today

Start building smarter customer data pipelines today with RudderStack. Our solutions engineering team is here to help.