Often when building an application we have a large amount of data that we want to search over. Data entry and basic analysis can be handled with CRUD operations, but a common requirement is a more sophisticated algorithm for searching text documents. There are powerful standalone tools for this task, such as Elasticsearch, but these applications can be difficult to set up and manage. In this article we are going to cover how to use the tools provided by the Postgres database to create a search engine.
Term Frequency–Inverse Document Frequency
Term frequency–inverse document frequency (TF-IDF) is the algorithm used by many search engines to filter and order results over plain text documents. A document in this context is a large piece of text that we want to search over; an example of a document could be a blog or wiki article.
The Wikipedia page has a detailed breakdown of the TF-IDF algorithm, but we will go over it in this article as well. In short, the algorithm takes the term frequency of a word in a document and multiplies it by the inverse document frequency of the same word over all documents.
Term Frequency (TF)
Term frequency is calculated on a per-document basis: it is the number of times a word appears in a document divided by the total number of words in that document. This means that if a word is mentioned more often in a document, that document will be ranked as more relevant in the search order.
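To make this concrete, here is a minimal sketch of the calculation in Python (purely illustrative; Postgres will do all of this for us later):

```python
def term_frequency(term, document):
    """Number of times `term` appears, divided by the document's word count."""
    words = document.lower().split()
    return words.count(term.lower()) / len(words)

# 'hello' is 1 of the 4 words in this document
print(term_frequency("hello", "hello great big world"))  # 0.25
```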
Inverse Document Frequency (IDF)
Inverse document frequency is a measure of how common a term is across all documents. It is a method of working out how ‘important’ a word is over all of our documents. For example, the word ‘the’ will appear incredibly often in a list of English-language documents, which makes it unimportant for search queries: if we search for the word ‘the’, the number of documents retrieved will barely be reduced, since most documents contain that word. Therefore the word ‘the’ will have a very low IDF value.
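The same idea as a short Python sketch (again illustrative only; one common formulation takes the natural log of total documents over documents containing the term):

```python
import math

def inverse_document_frequency(term, documents):
    """log(total documents / documents containing the term)."""
    matching = sum(1 for doc in documents if term in doc.lower().split())
    return math.log(len(documents) / matching) if matching else 0.0

docs = ["the cat sat", "the dog ran", "a fish swam"]
# 'the' appears in two of three documents, 'cat' in only one,
# so 'cat' scores higher and is the more useful search term
print(inverse_document_frequency("the", docs) < inverse_document_frequency("cat", docs))  # True
```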
Combining these together
Given these two metrics we can define a sort order that rates each document by its relevance to the provided search query. The term frequency measure ensures that documents that mention your search terms often are rated higher. Meanwhile, the inverse document frequency measure ensures that terms that appear in only a few documents count for much more than terms that appear everywhere, reducing the influence of ‘junk’ search terms.
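Putting the two together, a toy ranking function might look like the following Python sketch (a self-contained illustration of the idea, not what Postgres executes internally):

```python
import math

def tfidf_rank(query, documents):
    """Score each document by the summed TF-IDF of the query terms."""
    terms = query.lower().split()

    def idf(term):
        matching = sum(1 for doc in documents if term in doc.lower().split())
        return math.log(len(documents) / matching) if matching else 0.0

    scores = []
    for doc in documents:
        words = doc.lower().split()
        # TF of each query term in this document, weighted by that term's IDF
        score = sum((words.count(t) / len(words)) * idf(t) for t in terms)
        scores.append((score, doc))
    return sorted(scores, reverse=True)

docs = ["hello world hello", "hello there", "unrelated text entirely"]
for score, doc in tfidf_rank("hello world", docs):
    print(round(score, 3), doc)
```

The document that contains both query terms, and mentions them often, comes out on top, while the document containing neither scores zero.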
Real world example
Now that we have gone over the theory of how document search works, we can implement it in Postgres. If all that theory was a bit confusing then don’t worry, since the Postgres text search features implement it all for us!
Configuring Postgres
Postgres has three different concepts for interacting with TF-IDF data. The first is the tsvector type, which is used to store the term frequencies for each document we have stored. The second is the tsquery type, which is used to query the vectors that we have created. The last piece of the puzzle is a GIN index, an inverted index over the terms in the table that makes querying those vectors fast.
First of all we need to have a data set to search over. For this article we are going to make a basic blog database table that would allow us to search over the contents of the blog articles that are posted.
CREATE TABLE Articles (
id SERIAL PRIMARY KEY,
title TEXT,
blurb TEXT,
contents TEXT
);
Now that we have created our table, we need to add an extra column to store a tsvector for each row.
ALTER TABLE Articles ADD COLUMN search_vector tsvector;
CREATE INDEX ix_search_vector ON Articles USING GIN (search_vector);
This will create a new column to store the tsvector, as well as an index to speed up our queries. You might notice that it is a GIN index, an inverted index that is designed to speed up text search queries.
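The idea behind an inverted index can be sketched in a few lines of Python (a hypothetical illustration of the concept, not how GIN is actually implemented): instead of mapping each document to its words, we map each word to the set of documents containing it.

```python
from collections import defaultdict

documents = {
    1: "hello world",
    2: "another article about the world",
}

# Map each word to the set of document ids containing it, so looking up
# a term does not require scanning every document
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)

print(sorted(inverted_index["world"]))  # [1, 2]
```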
Now we have an indexed column to store our data we can add a trigger to populate the vector whenever a new article is added to our table. To do this we are going to use the following snippet to make a trigger function.
CREATE OR REPLACE FUNCTION update_article_content() RETURNS trigger AS $$
BEGIN
    -- Weight the title highest ('A'), then the blurb, then the contents
    new.search_vector := setweight(to_tsvector(coalesce(new.title, '')), 'A') ||
                         setweight(to_tsvector(coalesce(new.blurb, '')), 'B') ||
                         setweight(to_tsvector(coalesce(new.contents, '')), 'C');
    return new;
END;
$$ LANGUAGE plpgsql;
Going over this code, the important part of the logic is defined in the snippet below.
setweight(to_tsvector(coalesce(new.title, '')), 'A')
- coalesce is a function that checks whether the value passed in is null and, if it is, provides a default instead.
- to_tsvector creates a new tsvector for the value that is passed in.
- setweight adds a weighting to the created vector to rank its importance with regard to the other fields. The four weightings that can be applied are A, B, C and D, with A being the most important.
Following on from this, we can combine multiple tsvector values together with the || operator to create a single tsvector value that can be used for our article search.
Now that we have a trigger function to populate the column on our table, we need to call it from a trigger.
CREATE TRIGGER article_search_vector_update
BEFORE INSERT OR UPDATE
ON Articles
FOR EACH ROW EXECUTE PROCEDURE update_article_content();
Running a Search Query
Now that we have got all of the configuration out of the way, we can insert some data into our database and try it out. First of all we can create some test data.
INSERT INTO Articles (title, blurb, contents) VALUES (
'Hello World',
'This is a hello world article',
'Some great contents to the hello world article'
);
INSERT INTO Articles (title, blurb, contents) VALUES (
'Another article',
'This is our second article',
'Hello world this is another article'
);
INSERT INTO Articles (title, blurb, contents) VALUES (
'Last article',
'This is our third article',
'This one will not show up'
);
Now we can get to searching. The query that we are going to be using is the following.
SELECT title, ts_rank(search_vector, websearch_to_tsquery('hello world')) AS rank
FROM Articles
ORDER BY rank DESC;
Again we can go over the important bits of this query.
- websearch_to_tsquery('hello world') is an easy way to create a new tsquery from a plain text string. This function is designed to take human-written input and convert it into a query that Postgres can understand. This input will look for articles that contain both the words hello and world.
- ts_rank is a function that takes a vector and a query and scores them based on relevance. We can use this to order our documents by their importance to the user.
And with that we can see our search results.
| title | rank |
|---|---|
| Hello World | 0.99954385 |
| Another article | 0.19820644 |
| Last article | 0 |
The first article, which has the search terms in its title, is a near perfect match. By comparison, the article that only mentions the query in its contents is rated as a less important match. Lastly, the article without the search terms in its title or contents has a rank of zero, indicating no match with the query.
With more development we could spruce up the SQL we used to run the query, in order to remove irrelevant results or provide additional filtering of our articles. However, with what we have now we can create a working search for our blog articles.
Summary
It can be very easy to make a high quality search just by leveraging the power of Postgres. While it is not as powerful as a dedicated search service, it is significantly easier to configure and manage than the alternatives. Being integrated with the rest of the database allows for simple management in terms of both code and infrastructure.