An idiot’s guide to fulltext search in PostgreSQL.

I love PostgreSQL. It’s probably the most powerful open-source database system out there. Recent features to handle JSON and geospatial data are allowing it to supplant specialized database systems and become closer to a one-DB-fits-all solution. One feature that I’ve recently been able to exploit is its fulltext search engine. It allowed me to easily move from a terrible search implementation (using regular expressions) to one that actually meets users’ expectations.

In this article, I will walk through a basic fulltext search configuration, as well as highlight a few potential improvements that can be made if you’re so inclined.

Many of the features discussed in this post are only available as of PostgreSQL 9.6. Earlier versions have some rudimentary fulltext functionality, but a lot of the more powerful tools we’ll be using are fairly new.

The Wrong Way To Search

Let’s start with a basic table:

create table test_table (id serial, title text, description text);

And populate it with some bogus data:

insert into test_table (title, description) values ('Hello world!', 'This is an example.'), ('Filler text', 'foo bar baz');

How might we go about doing a simple search on this table? The naive strategy, which I originally might have used, would go something like this:

select * from test_table where title ~* 'hello' or description ~* 'hello';

This is the result (which is correct, in this case):

id  | title        | description
----+--------------+---------------------
1   | Hello world! | This is an example.
(1 row)

However, consider that a human searching for something might not always know exactly what they’re looking for. Let’s put a new row in that table:

insert into test_table (title, description) values ('A helpful example', 'This is where things get complicated');

Let’s say a user kind of knows what they’re looking for: they enter the query “gets complicated”:

select * from test_table where description ~* 'gets complicated';

There is no result. Why? Because we have no way to identify that “gets” and “get” are semantically equivalent.

Also consider that a regex-based search is at least linear in the length of each string we check (unless you enable trigram indices, apparently, but that’s out of scope of this article). If you’re searching a large table, you’re going to be in big trouble.

Fulltext search solves both of these problems: It treats semantically equivalent words as matches, and can be drastically sped up by virtue of indices.

A Better Solution

To handle fulltext searches, there are two main things we have to do: add a tsvector to the table to store keywords, and write a tsquery when we want to perform a search.

A tsvector is essentially a list of keywords. We can assign one to each row in the table. Postgres will automatically filter out common words and replace similar words with roots or synonyms. Here’s an example query, to demonstrate how a human-written string is translated into an indexable tsvector:

select to_tsvector('This is where things get complicated');

to_tsvector
-------------------------------
'complic':6 'get':5 'thing':4
(1 row)

We can see that the input got filtered down into three keywords. One of them, “complicated,” got trimmed down to the root “complic.” The rules for how this translation takes place are defined by the regconfig.

Your default regconfig is probably either “english” or “simple.” It can be retrieved by selecting get_current_ts_config(), or set with the default_text_search_config directive. If you need to change it per-query, you can stick it in any call to to_tsvector or its related functions, e.g. to_tsvector('english', 'hello world'). In most cases, you’ll want “english” (or some other human language) rather than “simple” (which doesn’t handle magic such as removing meaningless words and distilling words into lexemes). Read the documentation if you think you’ll need to do anything funky with regconfig.

Now that we understand what we’re doing, let’s add a tsvector column to the table:

alter table test_table add column tsv tsvector;

Note: It isn’t always necessary to create a dedicated tsvector column (i.e., you can skip right ahead to creating a fulltext index on a concatenation of existing columns), but I’ve done it here because it’s a more flexible option and it’s easier to reason about.

How do we populate this column? Well, it’s quite easy to add a trigger that does it for us. We just tell it the name of our tsvector column, the language to use for processing, and then list all of the fields that should be considered for keywords:

create trigger test_tsv before insert or update on test_table
for each row execute procedure tsvector_update_trigger(
  tsv, 'pg_catalog.english', title, description
);

Note that ‘pg_catalog.english’ is actually our regconfig showing up again. It is mandatory to supply a regconfig to tsvector_update_trigger, and you must fully qualify it with the schema name (pg_catalog).

Now, any new or updated tuples will be given a tsvector containing the title and description. Let’s see it in action:

insert into test_table (title, description) values ('We have tsvectors now!', 'Pretty cool, eh?') returning tsv;

tsv
-----------------------------------------
'cool':6 'eh':7 'pretti':5 'tsvector':3
(1 row)

We can do a quick hack to trigger the hook and create vectors for our pre-existing rows. Be careful doing this in production.

update test_table set id=id;

Creating the tsvectors was half the battle. Now, how do we query them? The quite intuitive answer: tsquery.

A tsquery, like a tsvector, is essentially a list of keywords. However, we can do some fun things, like apply boolean operations and specify that words have to be in a certain order. Here’s a list of a few useful operators we can abuse:

Syntax Description
foo <-> bar Matches “foo” followed by “bar”
foo <2> baz Matches “foo” with “baz” two words later (i.e., one word in between). Replace the 2 with another number to match a larger “gap.”
foo & bar Matches a vector that contains both “foo” and “bar,” in any order and with anything in between.
foo | bar Matches a vector that contains either “foo” or “bar.”
!baz Matches a vector that doesn’t contain “baz.”

Here’s an example of a simple tsquery, to see how it is interpreted:

select to_tsquery('foo <-> baz');

to_tsquery
-----------------
'foo' <-> 'baz'

And here’s how we would use it to search (say hello to the @@ operator):

select id, title, description from test_table where tsv @@ to_tsquery('foo <-> bar');

id  | title       | description 
----+-------------+-------------
2   | Filler text | foo bar baz 

Of course, we’re not going to get our users to type their queries in this format. Fortunately, Postgres gives us a few helper functions for creating queries. Let’s take a look at both plainto_tsquery and phraseto_tsquery, and see how they differ:

select plainto_tsquery('foo baz'), phraseto_tsquery('foo baz');
 plainto_tsquery | phraseto_tsquery 
-----------------+------------------
 'foo' & 'baz'   | 'foo' <-> 'baz'
(1 row)

Most noticeably, plainto_tsquery just matches for the boolean conjunction of all words in the input, while phraseto_tsquery matches the exact ordering. Choose the one most appropriate for your application. You can easily make queries using a condition like where tsv @@ plainto_tsquery(user_input_here).

At this point, you’re able to handle simple fulltext queries. Congratulations. If you’re interested, read on for some interesting tweaks and optimizations.

Even Better

The next thing you’re going to want to add is an index. The Postgres documentation recommends using GIN indices for tsvectors (the reasons why are outside the scope of this post):

create index tsv_index on test_table using gin (tsv);

This will speed up your queries significantly, and is quite important if your tables are large.

There are several other features you might want to explore. For sake of brevity, I’ll just list a couple interesting ones and provide links to the appropriate documentation:

There are many more features I’ve omitted here, so don’t be afraid to read through that entire chapter of the documentation to learn more.

Conclusions

There are lots of reasons why we want to avoid substring-based or regex-based searching in our database. Perhaps most importantly, the results are often unintuitive to the user.

Postgres gives us an easy way to harness powerful fulltext searching using tsvectors and tsqueries. There’s absolutely no reason not to use them — you’ll avoid needing to configure a separate system for searching, you’ll easily be able to keep your keyword vectors and indices consistent with the latest database contents, and most importantly you’ll be giving your users the search results they expect.

Published by

Travis Mick

Travis is the chief architect of systems and software at Zeall.us. A background in network security research has fostered in him a passion for values such as digital privacy, net neutrality, and intellectual freedom. In a world where these causes are increasingly important, he aims to both raise awareness of them and further their goals through technology.

Leave a Reply