By Rudderstack Team

SQL Pattern Matching Guide

Pattern matching over strings in SQL is a frequent need, much more frequent than some may think. Queries aren’t just for compiling demanding aggregate calculations, advanced joins, and table partitioning. This applies especially to data scientists and data engineers who often have to deal with raw, unstructured data. For these professional pattern matching operations, from the most basic to more advanced, are probably an everyday task.

The demand for efficient SQL pattern matching becomes immediate as the velocity of available data increases, especially within data-driven companies where decision-making is highly dependent on data.

The simplest and most common case in which such an operation is needed is when one simply wants to search in his database for data that match exactly or are similar to a specific pattern. However, this becomes extremely useful when you do not actually know the exact word or phrase you seek. E.g., Those whose name is ‘John’ or contains the string ‘John’, for example, ‘Johnpaul’.

In this post, we will go through all the capabilities that the majority of SQL dialects offer regarding both the aforementioned tasks when using Amazon Redshift, PostgreSQL, Google BigQuery, or SQL Server.

So, let’s get started!

When it comes to pattern matching, the usual options are 3: LIKE operator, SIMILAR TO operator, which is available only on some SQL dialects, and Regular Expressions.

LIKE operator

We will start our exploration with LIKE. It is probably the simplest of all expressions and is present in many database systems, including PostgreSQL, MS SQL Server, Redshift, and BigQuery.

When it comes to BigQuery, the LIKE operator is available for Legacy SQL, the default for BigQuery, and the updated SQL.

Moreover, according to PostgreSQL and Redshift official documentation, LIKE is more secure than SIMILAR TO, which is available only for PostgreSQL and Redshift, or POSIX-style regular expressions. Since SIMILAR TO is based on regular expressions, both methods are prone to the same security hazards and, more specifically, the exploitation of the fact that most regex can be pushed to extreme situations that make them work slowly. Entering these extreme situations, a program can hang for a long time.

In order to use LIKE one has to use the following syntax: <string> LIKE <pattern>.

This syntax remains the same for all the aforementioned database systems.

So, LIKE is an expression that is evaluated either as true or false depending on whether the string matches the given pattern.

Also, LIKE is often combined with other special characters, i.e. ‘%’ and ‘_’, which are called wildcards. When the underscore is present in a pattern, it means that it matches any single character while percent matches any sequence of characters of any length.

In order to clarify things a bit more, below we present some common examples of LIKE usages:

FROM customers
WHERE location LIKE%Yor%

This query will retrieve customers that come from New York, Yorkshire or any other city that includes ‘Yor’ in its name. In contrast, if we would like to retrieve only customers located in cities starting with Yor, this query would be sufficient:

FROM customers
WHERE location LIKE ‘Yor%

Equivalently one can use the underscore wildcard, which, as mentioned before, it matches any single character. With the following query, you will retrieve customers that are located, for example, in either Granada (Spain) or Grenada (Caribbean Island).