Elastic vs. PostgreSQL for text lookups

During one of my missions, I could compare two database management systems and their performance on text lookups: Elasticsearch and PostgreSQL. Here is the what I found.

The initial project in which I participate is as follows: A Named Entity Recognition (NER) algorithm detects company names in documents. Then, these company names needs to be looked-up in an external database joined to the project. This is where I come in. I work on the following mission: join an external database to the project and compare the performance of several database management systems, choose the one that is more adaptable. Finally, find an algorithm that is able to do a fuzzy lookups in case the NER algo didn’t detect the company name well.

I compare a relational and a NoSQL database systems for the task: PostreSQL and Elasticsearch respectively. The data I need to put in a database is the Sirene database. This free, governmental database contains information about all enterprises residing in France, describing them by their unique SIRET numbers, providing information about their sector of activity and contact information along other details. So the aim of the project is to find a quick and reliable way to look-up the detected company names (by the NER algo) in one of these database solutions.

There are several difficulties related to the lookups. First, we need a quick way to look up an enterprise in the database as the Sirene data contains about 28 million enterprises with more than 50 columns of mostly text data. Second, the name of the enterprises detected by the NER algo is often incorrectly written. This can have many reasons. It can be misspelled or the NER algorithm fails to detect the exact name. Therefore, this mission is twofold: firstly to compare a relational and NoSQL database in the performance of look-ups, secondly to find an algorithm that is able to do the fuzzy searches and the non-exact look-ups well. Altogether, a quick and reliable database solution is needed that is able to do fuzzy look-ups rapidly and well. Therefore, I compare two database systems: PostgreSQL, so a relational database management system, and Elasticsearch, that is a NoSQL database.

First I talk about PostgreSQL, followed by Elasticsearch and the way I implement the lookups, and finally I present the results.

PostgreSQL

PostgreSQL is an open-sourced relational database management system (RDBMS). RDBMS is based on the relations in the data, therefore it organises data into tables with relations between tables as primary and secondary keys. PostgreSQL uses the SQL language, therefore a declarative language, in which we declare what we are searching for and not how to find it.

We can define the way data is stored and therefore searched, by indexing data with various methods, such as hash, tree, n-gram and generalised indexes. The way we define the indexing of the data will greatly contribute to the performance of a look-up. For instance, for exact name searches hash index will be extremely fast, however, if we conduct range searches, hash indexing will be very slow and tree indexing should be used. Therefore, defining the indexing structure will define what and how quick we are able to retrieve from the database stored in
PostgreSQL.

I used the following indices:

  • Hash index:

    With a hash index, data is accessed through an in-memory hash table, which is an array of buckets containing the pointers of each data point, allocated to a given bucket by the hash function. The hash function is a deterministic function that might allocate two different index keys to the same bucket. Upon retrieval, the hash index is identified of the searched data, and the pointer is retrieved. Therefore the hash index is extremely fast in retrieving exact matches of the index key, however, not applicable in any other retrieval process. The following figure show a schema of the hash index.

  • GIN index:

    PostgreSQL proposes Generalized Inverted Index (GIN) for full text searches. It is a binary search tree, however, instead of indexing the search column’s values (for instance the name of the company, e.x. “BNP Paribas”, it indexes the individual values of it, that is “BNP” and “Paribas”. Therefore the GIN index is a binary tree with all unique words appearing in the search column in the tree. This index speeds up considerably the text search.

  • Trigram indices:

    PostgreSQL equally proposes trigram searches in order to be able to fuzzy-lookup a given text. The method works with the pg-trgm module. Combined with the GIN index, this method saves the trigrams of the text search column in a separate column, then it uses a GIN index on this trigram column. This method returns all matches superior to a given threshold of similarity, defined as the Levenstein distance by default. The Levenshtein distance is a string metric for measuring the similarity between two sequences of characters. Between two words, it is the minimum number of single character edits required to change one word into the other. (i.e. insertions, deletions or  substitutions).

Installation of PostgreSQL (mac):

$ brew install postgresql

Start PostgreSQL:

$ brew services start postgresql

$ psql postgres

Create a password:

CREATE ROLE username WITH LOGIN PASSWORD 'quoted password'

Then create a database and a table:

CREATE DATABASE SIRENE;

CREATE TABLE SIRENE.sirene_data (siren VARCHAR(9),
nic VARCHAR(5),
siret VARCHAR(14),
statutDiffusionEtablissement VARCHAR(1),
dateCreationEtablissement DATE,
...
caractereEmployeurEtablissement VARCHAR(1),
CONSTRAINT p_key PRIMARY KEY (siret)
);

And copy the data from the csv downloaded from the sirene website to the table: 

COPY SIRENE.sirene_table(
siren,
nic,
siret,
statutDiffusionEtablissement,
...
caractereEmployeurEtablissement)
FROM ‘./data/StockEtablissement_utf8.csv' DELIMITER ',' CSV HEADER;

As an UI for PostgreSQL, I used pgadmin.

Elasticsearch

Elasticsearch (ES) is an open source, distributable, schema-less, REST-based and highly scalable full text search engine built on top of Apache Lucene, written in Java. ES is a document-orientated data store where objects, which are called documents, are stored and retrieved in the form of JSON. Elasticsearch became popular because of the very quick and efficient text search it provides. The search is based on an initial analysis that might include string cleaning, lowercasing, tokenisation and lemmatisation. We define the analysis for the attribute in which we wish to search later on. Then ES saves the analysed version of this attribute, and compute the inverted index the following schema shows.

Now the following figure serves as an example on how ES computes the inverted indexes.

Suppose we have two documents, “BNP Paribas” and “BNP Paribas France” and we define uniquely lowercasing as the analysis step. Then, ES will transform the two documents according to the analysis step (lowercasing in our example) and save the inverted index, Document:Position for each term. Next, suppose we want to search the term “BNP”. ES does the same analysis on the search term as on the documents, and looks up ‘bnp” in the inverted index table. It identifies immediately that document 1 and 2 has the term “bnp” as their first word.

Now suppose that we have many documents containing the word “BNP”. Instead of returning all such document, ES weights these documents according to the TF-IDF index (term frequencies inverse document frequencies). The weights of the TF-IDF are calculated as the the product of the term frequency (TF = The number of times a term appears in a document / The total number of terms in the document) and Inverse Document Frequency (IDF = The total number of documents / The number of documents with the t term in it).

To download Elasticsearch:

brew install elasticsearch

We may need to edit the ES configuration (node and cluster names). I used Postman to try the Elasticsearch API calls and Dejavu to visualise the data.

Finally, I sent the data to ES from Python first by converting it to a json format by the aid of the below functions:

def row_to_json(df, row):
    """
    Creates a json formated dictionary from a dataframe row and dumps it at the folder location. 

    Inputs: 
    df: pandas dataframe 
    row: int (row number)

    Returns: dict containing the json format
    """
    row = df[row, :]
    output = row.to_dict()

    for key in output:
        output[key] = str(output[key][0])

    return output
def df_to_json(df): 
    """
    Creates a json file from a pandas dataframe and dumps it on the folder location.
    
    Inputs: 
        df: pandas dataframe 
    Returns 0 (dumps the json file)
    """
    
    with open('sirene_json.json', 'w') as fp:
        for row in range(df.shape[0]):
            if row % 100000 == 0: 
                print('{} rows are done.'.format(row))
            out = row_to_json(df, row)
            json.dump(out, fp)
    return 0 


import csv
import json

csvfile = open('/Users/reka/Downloads/StockUniteLegale_utf8.csv', 'r')
print('Csv is ok to go.')
jsonfile = open('sirene_json.json', 'w')

fieldnames = ("siren","nic","siret", ...., "caractereEmployeurEtablissement")
reader = csv.DictReader(csvfile, fieldnames)
for row in reader:
json.dump(row, jsonfile)
jsonfile.write('\n')

Then by sending the json from Python to ES:

from elasticsearch import Elasticsearch

es = Elasticsearch()

for row in range(sirene_table1.shape[0]):
doc = row_to_json(sirene_table1, row)
res = es.index(index="sireneindex", doc_type='sirene_data', id=row, body=doc)

Implementation

The way I evaluate the search in two databases is to run several variants of look-ups on 100 entity names identified by the NER algorithm, in order to compute the top-5 recall. I try many versions of look-ups for the two databases and detail in what follows the methods that work best.

PostgreSQL implementation

PostgreSQL needs four index for the best-performing look-up. First, for exact searches, I put a hash index on the official name column. Next, to be able to find LIKE matches (so for instance to find “BNP Paribas” in the name “France, BNP Paribas”, I need to put a left hash index on this column as well. For fuzzy searches, I need to create the GIN index and use the pg-trgm package, that creates a column with the trigrams of the names. First, I clean the detected entity names and remove unnecessary elements (such as “www”, etc.), then I look up the cleaned names by first an exact match followed by a LIKE and LEFT LIKE match. If there is no results returned, I run the fuzzy search with the help of the pg-trgm package. This package returns all names that have trigrams with a higher similarity than the Levenstein distance. This is set to 0.5 by default, however, it is difficult to set this level manually. If set too high, no results are returned while if set too low, a huge number of names are returned. Finally, if too many results are returned, I consider the DamerauLevenshtein distance and choose the 5 most similar names to the searched term. One downside of the PostgreSQL is that it does not rank the results. If we limit the number of results with the “LIMIT 100” command, it will only return the first 100 names but not the 100 most relevant names.

CREATE INDEX b_index ON sirene_data USING btree (denominationunitelegale);
CREATE INDEX trgm_idx ON sirene_data USING GIST ( denominationunitelegale gist_trgm_ops);
CREATE INDEX trgm_idx ON sirene_data USING GIN ( denominationunitelegale gist_trgm_ops);

During my project, I access the PostgreSQL from Python. To do this, I use the psycopg2 package. The code to connect:

import psycopg2 
password = ‘password_to_sql_server' 
conn = psycopg2.connect("dbname='SIRENE' 
user='user' host='localhost' password={}".format(password)) 
cur = conn.cursor() 
# and at the end to deconnect: 
# cur.close()

ES implementation

The best option is to define two analysers, an analyser that cleans and lowercase the name fields, and a second n-gram analyser that breaks down the names to n-grams and creates the inverted index from them. This way ES can efficiently run an exact match, and if the exact match does not return any results, ES runs the fuzzy match, realised by the n-grams. We can immediately notice two differences between the PostgreSQL and ES. First, we do not need to clean the names given by the NER algorithm for ES, since it does it by default with the same analyser as applied to the saved data. Second, as ES ranks results according to the TF-IDF, I can choose to return only the 5 most relevant results and no additional ranking is necessary.

To create index with standard analyser (from Postman):

PUT sirene_data{
  "settings": {
     "analysis": {
         "analyzer": {
             "my_custom_analyzer": {
                 "type": "custom",
                 "tokenizer": "standard",
                 "char_filter": ["html_strip" ],
                 "filter": [
                     "lowercase",
                     "asciifolding"
                           ]
                                    }
                      }
                   }
               }
         }

and to define mapping:

Define the mapping
PUT sirene_data/_mapping {
   "properties":{
      "_id" : {"type": "long"},
               "activitePrincipaleUniteLegale" :{"type": "string »,
               "anneeCategorieEntreprise" : {"type":"string"},
                ...
               "anneeEffectifsUniteLegale": {"type": "string"},
               "caractereEmployeurUniteLegale": {"type" : "string"},
               "unitePurgeeUniteLegale" : {"type":"string"}
                }
              }
           }
       }

Finally to define index with ngram analyser:

PUT sirene_ngram{
  "settings": {
     "analysis": {
        "analyzer": {
           "my_analyzer": {
               "tokenizer": "my_tokenizer"
                           }
                     },
           "tokenizer": {
               "my_tokenizer": {
                   "type": "ngram",
                   "min_gram": 3,
                   "max_gram": 3,
                   "token_chars": ["letter ,"digit"]
                                }
                      }
                   }
                }
            }

And to define the corresponding mapping:

PUT sirene_ngram/_mapping {
  "properties":{
     "_id" : {"type": "long"},
     "activitePrincipaleUniteLegale" :{"type": "string",
     "anneeCategorieEntreprise": {"type":"string"},
      ...
     "anneeEffectifsUniteLegale": {"type": "string"},
     "caractereEmployeurUniteLegale": {"type" : "string"},
     "unitePurgeeUniteLegale": {"type": "string"}
               }
           }
       }
   }

Results

Interestingly, there is almost no difference in the capability of retrieving results between the two databases. PostgreSQL and ES had a top-5 recall of 0.82 and 0.83 respectively. However, there is a huge difference between the search time and the time needed for the preparation of the data. ES looked up 100 company names just over 2 seconds while the same lookup took over 780 seconds in PostgreSQL!

Why did PostgreSQL perform so slowly compared to ES?
  • Ways to retrieve data: ES is simply more adaptable for full-text search. As I already mentioned, ES works with the inverted index that speeds up the full text search remarkably.

  • Ranking of results: ES ranks results according to their relevance whereas PostgreSQL does not provide such a functionality. We have two options for this latter: limit the number of returned examples, in returning the first few results found. This saves time, however, the most relevant results might not be the first results. As another option, we can return all result and rank them according to a text-similarity function, however, this option is time-consuming.

  • Preprocessing of the data: ES pre-processes data according to the analyser. It processes the data in the base and the search query in the same way. However, PostgreSQL does no such thing, therefore, an initial string cleaning of the search name is needed.

Altogether, ES is more suitable for full-text searches. This mission was a good example to understand the different, sometimes opposing philosophies database systems are built on. It is important to mention that Elasticsearch is in no way superior to PostreSQL, it is simply more adaptable for the full-text searches.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Create a website or blog at WordPress.com

Up ↑

%d bloggers like this: