Building advanced SQL search from a user text input

Gajus Kuizinas

We are going to build a program that parses a user search query and serializes it into a SQL statement. Here is an example of a search query:

firstName:"Thomas" -jefferson birthdate:[1960 TO 1970] (profession:"inventor" OR profession:"engineer")

What this says is:

  • Find records with first name "Thomas"
  • Exclude mentions of "jefferson"
  • Narrow down results to where the birthdate is between 1960 and 1970
  • Narrow down results to where the profession is either "inventor" or "engineer"

In SQL, this could look something like:

"firstName" = 'Thomas' AND
"searchName" ILIKE '%jefferson%' AND
"birthdate" BETWEEN 1960 AND 1970 AND
(
  "profession" = 'inventor' OR
  "profession" = 'engineer'
)

(More about searchName later in the article.)

It might not be the most succinct form of SQL to do the job, but it is good enough.

Deciding between using a dedicated search software vs. doing it in SQL

I would like to briefly stop and address the elephant in the room. Historically, I would have advised against doing this in SQL, recommending instead dedicated search software such as Algolia, MeiliSearch, Typesense, or even Elasticsearch. This is because there are at least two problems with doing this in SQL:

  1. Safety. Historically, there was no (easy) way to do this safely, and a sloppy implementation could quickly lead to security issues.
  2. Performance. DBAs hate dynamic queries because they are hard to optimize for.

In today's article, I introduce a solution to #1, and #2 has been largely addressed thanks to hyper-scalable SQL backends, such as AlloyDB.

Dedicated search software is still likely to be faster, because it is easier to scale and distribute. However, depending on the context, the difference is likely to be marginal, and the upside of doing this in SQL is that you have a single source of truth: no more delays syncing data between your source of truth and a dedicated search service.

Parsing the search query

The first thing that we need to do is to parse the query. Rolling your own parser is bound to be a painful experience because writing and testing parsers is hard. Instead, we are going to use Liqe – a lightweight and performant Lucene-like syntax parser and serializer.

In short, Liqe takes text input in the Liqe query language (which you are already familiar with from the example at the start of the article) and converts it into an abstract syntax tree (AST), e.g., foo:bar becomes:

{
  expression: {
    location: {
      start: 4,
    },
    quoted: false,
    type: 'LiteralExpression',
    value: 'bar',
  },
  field: {
    location: {
      start: 0,
    },
    name: 'foo',
    path: ['foo'],
    quoted: false,
    type: 'Field',
  },
  location: {
    start: 0,
  },
  operator: {
    location: {
      start: 3,
    },
    operator: ':',
    type: 'ComparisonOperator',
  },
  type: 'TagExpression',
}

A lot is going on in this example, but the key takeaway is that we can extract search attributes from it. It is important to note that Liqe does not dictate how you implement your search engine; how you interpret the AST is up to you. That being said, Liqe comes with a built-in in-memory search engine that you can refer to when implementing your own.

Now that we have the AST, it is time to build SQL with it.
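For illustration, here is a hypothetical helper (not part of Liqe) that walks an AST of the shape shown above and collects the explicit field names, e.g., to validate them against an allowlist before serializing:

```typescript
// Minimal structural type covering only the AST properties read below;
// real code would use the LiqeQuery type exported by liqe.
type AstNode = {
  type: string;
  field?: {
    name?: string;
    type: string;
  };
  left?: AstNode;
  right?: AstNode;
  operand?: AstNode;
  expression?: AstNode;
};

// Recursively collects every explicit field name referenced in a query AST.
const collectFieldNames = (ast: AstNode, names: string[] = []): string[] => {
  if (ast.type === 'TagExpression' && ast.field?.type === 'Field' && ast.field.name) {
    names.push(ast.field.name);
  }

  for (const child of [ast.left, ast.right, ast.operand, ast.expression]) {
    if (child) {
      collectFieldNames(child, names);
    }
  }

  return names;
};
```

Note that implicit fields (such as the bare -jefferson term) are skipped here, since they carry no field name.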

Building SQL

We are using the Slonik query builder to construct the SQL. The key thing to know about Slonik is that it allows you to compose SQL fragments safely.

To convert AST tokens to SQL, we need to write a serializer that handles all AST tokens. A good place to start is the built-in serializer, as it demonstrates how to convert the AST back to the original text query. However, note that the built-in serializer is more complex than our SQL serializer needs to be, as it has to preserve whitespace, which we don't care about when converting the AST to SQL.

Our text query can be handled with a serializer as simple as:

import {
  type LiqeQuery,
} from 'liqe';
import {
  type QueryResultRow,
  type SqlSqlToken,
  type IdentifierSqlToken,
  sql,
} from 'slonik';

export const createBuildSearchSqlFragment = (searchQuery: LiqeQuery) => {
  const serializeTagExpression = (ast: LiqeQuery): SqlSqlToken<QueryResultRow> => {
    if (ast.type !== 'TagExpression') {
      throw new Error('Expected a tag expression.');
    }

    const {
      field,
      expression,
    } = ast;

    let column: IdentifierSqlToken;

    if (field.type === 'ImplicitField') {
      column = sql.identifier([
        'searchName',
      ]);
    } else if (field.name === 'firstName') {
      column = sql.identifier([
        'firstName',
      ]);
    } else if (field.name === 'birthdate') {
      column = sql.identifier([
        'birthdate',
      ]);
    } else if (field.name === 'profession') {
      column = sql.identifier([
        'profession',
      ]);
    } else {
      throw new Error('Unexpected field name.');
    }

    if (expression.type !== 'LiteralExpression') {
      throw new Error('Expected a literal expression.');
    }

    if (expression.quoted) {
      return sql`${column} = ${expression.value}`;
    }

    return sql`${column} ILIKE ${'%' + expression.value + '%'}`;
  };

  const serialize = (ast: LiqeQuery): SqlSqlToken<QueryResultRow> => {
    if (ast.type === 'TagExpression') {
      return serializeTagExpression(ast);
    }

    if (ast.type === 'LogicalExpression') {
      let operatorToken;

      if (ast.operator.operator === 'AND') {
        operatorToken = sql`AND`;
      } else if (ast.operator.operator === 'OR') {
        operatorToken = sql`OR`;
      } else {
        throw new Error('Unexpected operator.');
      }

      return sql`${serialize(ast.left)} ${operatorToken} ${serialize(ast.right)}`;
    }

    if (ast.type === 'UnaryOperator') {
      return sql`NOT ${serialize(ast.operand)}`;
    }

    if (ast.type === 'ParenthesizedExpression') {
      return sql`(${serialize(ast.expression)})`;
    }

    throw new Error('Missing AST type.');
  };

  return serialize(searchQuery);
};

This already converts the original text query into the desired SQL.

The only thing this example is missing is handling of the birthdate range (birthdate:[1960 TO 1970]). Therefore, if you used the above code as is, you'd get an "Expected a literal expression." error. To fix it, you must add another condition to serializeTagExpression (before the LiteralExpression check) that converts the RangeExpression token into SQL. It is as simple as:

if (expression.type === 'RangeExpression') {
  return sql`${column} BETWEEN ${expression.range.min} AND ${expression.range.max}`;
}

You still need to handle inclusivity, but this is enough to get started. And since AST types are finite, you can ensure that your implementation handles all possible text query permutations simply by implementing all 11 AST types.
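As a sketch of inclusivity handling, assuming Liqe's RangeExpression carries range: { min, max, minInclusive, maxInclusive } (check the types exported by liqe), you could pick the comparison operators like this. The example returns plain strings for readability; real code would compose Slonik sql fragments so that the values stay parameterized:

```typescript
type Range = {
  min: number;
  max: number;
  minInclusive: boolean;
  maxInclusive: boolean;
};

// Maps the inclusivity flags to the matching comparison operators.
// Illustration only: real code should bind min/max as parameters
// (e.g., via Slonik's sql tag) rather than interpolate them.
const serializeRange = (column: string, range: Range): string => {
  const lower = range.minInclusive ? '>=' : '>';
  const upper = range.maxInclusive ? '<=' : '<';

  return `${column} ${lower} ${range.min} AND ${column} ${upper} ${range.max}`;
};
```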

And that's really it – now you just use the generated SQL fragment to extend your base query, e.g.,

import {
  parse,
} from 'liqe';

const textQuery = 'firstName:"Thomas" -jefferson birthdate:[1960 TO 1970] (profession:"inventor" OR profession:"engineer")';

sql`
  SELECT *
  FROM (
    SELECT
      id,
      first_name || ' ' || last_name AS "searchName",
      first_name AS "firstName",
      birthdate,
      profession
    FROM person
  ) AS t1
  WHERE ${createBuildSearchSqlFragment(parse(textQuery))}
`

Mapping fields

Note how our original example does not blindly allow searching arbitrary fields. Instead, it maps firstName, birthdate, and profession to specific identifiers in our SQL query. How you choose to handle unknown fields (e.g., ignore them or throw an error) is up to you.

Note about searchName

A common requirement is to search any field that contains a substring. In our original example, this is the case with the -jefferson fragment, which does not specify a field. There are many ways to implement this, but by far the simplest is to concatenate all the searchable columns into a single column, as we did with searchName.
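For example, in PostgreSQL (12+) you could maintain such a concatenated column with a generated column. The table and column names here mirror the earlier example and are illustrative:

```sql
ALTER TABLE person
ADD COLUMN "searchName" text
GENERATED ALWAYS AS (first_name || ' ' || last_name) STORED;
```

A stored generated column keeps the value in sync automatically and can be indexed (e.g., with a trigram index) to speed up ILIKE searches.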

Protecting against misuse

Now that you have a parser that converts a text query into an AST (Liqe) and a SQL builder that safely composes tokens (Slonik), you can run arbitrary user-input search queries against your SQL database without risking SQL injection. However, just because someone cannot inject arbitrary SQL doesn't mean they cannot use your powerful search interface as a vector for a denial-of-service attack, e.g., a query such as:

a OR b OR c OR d OR e OR ...

which, in our example, would translate to:

"searchName" ILIKE '%a%' OR
"searchName" ILIKE '%b%' OR
"searchName" ILIKE '%c%' OR
"searchName" ILIKE '%d%' ...

Whether intentional or accidental, such a query would be computationally expensive. At the very least, you should implement a guard that limits maximum query execution time (e.g., PostgreSQL's statement_timeout). However, since you have the AST of the query, you can also calculate the complexity of the query and set an upper limit, e.g., by counting how many unquoted, ORed terms the query includes. The exact implementation will vary depending on the use case.
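As a sketch, a complexity guard could walk the AST and count the unquoted literal terms, each of which serializes to an expensive ILIKE '%...%' condition. The node shape and helper names below are illustrative:

```typescript
// Minimal structural type for the AST nodes traversed here.
type Node = {
  type: string;
  quoted?: boolean;
  left?: Node;
  right?: Node;
  operand?: Node;
  expression?: Node;
};

// Counts unquoted literal expressions, i.e., terms that will serialize
// to ILIKE '%...%' conditions in our serializer.
const countIlikeTerms = (ast: Node): number => {
  if (ast.type === 'LiteralExpression') {
    return ast.quoted ? 0 : 1;
  }

  return [ast.left, ast.right, ast.operand, ast.expression]
    .filter((child): child is Node => Boolean(child))
    .reduce((total, child) => total + countIlikeTerms(child), 0);
};

// Rejects queries above an arbitrary complexity limit before they reach SQL.
const assertQueryComplexity = (ast: Node, limit: number): void => {
  if (countIlikeTerms(ast) > limit) {
    throw new Error('Search query is too complex.');
  }
};
```

Calling assertQueryComplexity before serializing lets you fail fast with a user-facing error instead of tying up the database.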

Closing thoughts

Thanks to Slonik and Liqe, we can parse and run complex search queries against our SQL database, something that historically was not easily achievable. And thanks to technologies such as AlloyDB, we can run even the most complex queries quickly. There are still use cases (typo tolerance, faceting, synonyms) that are better handled by specialized search software. However, projects that do not require such features can now implement custom search using their existing database while keeping design complexity at bay.

2022
