Evaluating Search Performance: Elasticsearch vs. Algolia

Introduction

Search is everywhere—whether you're browsing an online store, digging through academic papers, or trying to find the right document at work, you rely on search systems to deliver the right results quickly. As the volume of digital information keeps growing, strong information retrieval tools have become essential for making that data useful.

Among the many search engine solutions available, Algolia and Elasticsearch stand out as two of the most widely used platforms. Both offer powerful tools for optimizing search functionality, but they differ significantly in how they approach indexing, querying, and ranking.

In this blog, we will compare these two solutions. Through the use of a standardized dataset, we will evaluate their performance across several key metrics. This comparison is valuable because both search engines could be implemented in the retrieval step of a Retrieval-Augmented Generation (RAG) system, as well as in traditional enterprise search environments and other similar applications.

Understanding Algolia and Elasticsearch

In a previous article, we covered the pricing and features of both search engines. In this section, we will explore how the search engines work in more detail.

Algolia

Data Storage

In Algolia, the basic units for ingested data are called “records”. A record is an object-like collection of attributes, each consisting of a name and a value. Depending on the Algolia plan, a single record may be limited to 10KB (in the Free Build plan) or 100KB (in the Grow, Premium, and Elevate plans).

As part of the ingestion process, Algolia automatically creates an objectID field for each record.

Below is an example of a corpus record saved to Algolia:

{
  "_id": "w5kjmw88",
  "title": "Weathering the pandemic: How the Caribbean Basin can use viral and environmental patterns to predict, prepare, and respond to COVID-19",
  "text": "The 2020 coronavirus pandemic is developing at different paces throughout the world. Some areas...",
  "objectID": "1f636f408f90ea_dashboard_generated_id"
}
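
As an illustration, below is a minimal sketch of how such records could be pushed to an index with the Algolia Python client (v3-style API). The application ID, API key, and index name are placeholders, and the exact syntax may differ depending on the client version you use.

from algoliasearch.search_client import SearchClient

# Placeholder credentials and index name.
client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")
index = client.init_index("trec-covid")

records = [
    {
        "_id": "w5kjmw88",
        "title": "Weathering the pandemic: How the Caribbean Basin can use viral and environmental patterns...",
        "text": "The 2020 coronavirus pandemic is developing at different paces throughout the world...",
    },
]

# Ask Algolia to generate an objectID for each record that does not supply one.
index.save_objects(records, {"autoGenerateObjectIDIfNotExist": True})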

Searching

Algolia queries do not support Boolean operators such as AND, OR, and NOT. Instead, multi-word queries behave like an implicit AND: every word in the query must appear in a record for it to be returned as a search result.

That being said, Algolia does offer some flexibility with its search functionality, such as the ability to use stopwords and specify searchable fields, among other features.

Algolia also offers semantic search capabilities, but this feature is only available with its premium Elevate plan.
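
For reference, a plain lexical query with the Python client looks roughly like the sketch below; credentials are placeholders, and the request options shown are only a small subset of what Algolia supports.

from algoliasearch.search_client import SearchClient

client = SearchClient.create("YOUR_APP_ID", "YOUR_SEARCH_API_KEY")
index = client.init_index("trec-covid")

# Multi-word queries behave like an implicit AND: every (non-stopword) term must match.
response = index.search("coronavirus respond to the weather", {"hitsPerPage": 10})

for hit in response["hits"]:
    print(hit["objectID"], hit.get("title", ""))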

Ranking

Once results are retrieved, Algolia ranks them and breaks ties based on an ordered list of criteria.

Elasticsearch

Data Storage

In Elasticsearch, data is stored as "documents," which can be either structured or unstructured, with a maximum document size limit of 100 MB. For consistency with Algolia during the evaluation, we upload each corpus item, with its _id, title, and text fields, as an individual document. This ensures a fair comparison between the two search engines.
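
A minimal sketch of this ingestion step with the official Elasticsearch Python client is shown below; the cluster URL and index name are placeholders, and the single-item corpus stands in for the full list of corpus documents.

from elasticsearch import Elasticsearch, helpers

# Placeholder connection settings; adjust for your cluster.
es = Elasticsearch("http://localhost:9200")

corpus = [
    {
        "_id": "0a0i6vjn",
        "title": "Zambia's National Cancer Centre response to the COVID-19 pandemic...",
        "text": "The COVID-19 pandemic has overwhelmed health systems around the globe...",
    },
]

# Mirror the Algolia setup: one document per corpus item, keeping the title and text fields.
actions = (
    {
        "_index": "trec-covid",
        "_id": doc["_id"],
        "_source": {"title": doc["title"], "text": doc["text"]},
    }
    for doc in corpus
)
helpers.bulk(es, actions)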

Searching

In contrast to Algolia’s AND-like approach, Elasticsearch uses the OR operator for all words in a query. This means that as long as any word in the query is found in the document, it will be returned as a result.

In Elasticsearch, searches can be performed either lexically or semantically. To enable semantic search, the fields being queried must first be semantically indexed.
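
As an example, a lexical query over both fields can be issued with a multi_match query, which defaults to OR semantics and is scored with BM25. The semantic setup (indexing the fields with an embedding model such as Elastic's ELSER) is not shown here.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster URL

# Lexical search: multi_match uses OR semantics across query terms by default,
# and BM25 ranks documents that match more (and rarer) terms higher.
response = es.search(
    index="trec-covid",
    query={
        "multi_match": {
            "query": "how does the coronavirus respond to changes in the weather",
            "fields": ["title", "text"],
        }
    },
    size=10,
)

for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])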

Ranking

Elasticsearch ranks results by relevance using the BM25 scoring algorithm after the search is performed, so a document that matches only a single word from the query is unlikely to appear among the top results.

Evaluating Retrieval Performance

Choosing an Evaluation Dataset

For the evaluation, we selected the TREC-COVID dataset, which is included in both the MTEB and BEIR benchmarks.

BEIR Benchmark

BEIR (Benchmarking IR) is a standard benchmark for evaluating retrieval task performance, primarily using NDCG and Recall as the key metrics.

BEIR provides 18 datasets covering a wide range of domains, used to test various retrieval tasks, including fact-checking, question answering, document retrieval, and recommendation systems.

Each BEIR dataset consists of a corpus, queries, and relevance judgments (qrels) for those queries (a sample of the data format is shown later).

MTEB Benchmark

MTEB (Massive Text Embedding Benchmark) is a standard benchmark for measuring text embedding performance across 8 task types and 56 datasets. MTEB is also a superset of the BEIR benchmark, meaning the MTEB retrieval task datasets reuse those from BEIR.

TREC-COVID Dataset

Among all the datasets in MTEB and BEIR, the TREC-COVID dataset was chosen specifically for its relatively high labeling rate.

The original dataset consists of a corpus containing 171,332 documents, 50 queries, and 24,763 qrels.

For the evaluation experiments, we removed documents that exceeded the 10 KB record size limit of Algolia's Free plan. As a result, all but 12 documents from the original corpus were used for both search engines, and the qrels referencing the excluded documents were also omitted from the evaluation.
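
To make this concrete, here is a minimal sketch of how the dataset can be loaded and filtered in Python with the beir package. The download URL follows BEIR's standard dataset hosting pattern, and the 10 KB cutoff is approximated as 10,000 bytes of serialized JSON; treat both as assumptions rather than the exact steps in our evaluation code.

import json

from beir import util
from beir.datasets.data_loader import GenericDataLoader

# Download TREC-COVID in BEIR format (corpus.jsonl, queries.jsonl, qrels/test.tsv).
dataset_url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip"
data_path = util.download_and_unzip(dataset_url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Approximate Algolia's Free-plan record limit and drop oversized documents,
# along with their qrels, so both engines are evaluated on the same corpus.
MAX_RECORD_BYTES = 10_000  # assumption: 10 KB measured as serialized JSON
oversized = {
    doc_id
    for doc_id, doc in corpus.items()
    if len(json.dumps({"_id": doc_id, **doc}).encode("utf-8")) > MAX_RECORD_BYTES
}
corpus = {doc_id: doc for doc_id, doc in corpus.items() if doc_id not in oversized}
qrels = {
    qid: {doc_id: rel for doc_id, rel in judgments.items() if doc_id not in oversized}
    for qid, judgments in qrels.items()
}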

Below are sample records from the dataset, which include the corpus, queries, and relevance judgments (qrels):

Corpus

Each corpus item includes the following fields: "_id", "title" and "text". Some "title" and "text" fields are empty, so both fields were set to be searchable.

{
  "_id": "0a0i6vjn",
  "title": "Zambia’s National Cancer Centre response to the COVID-19 pandemic—an opportunity for improved care",
  "text": "The COVID-19 pandemic has overwhelmed health systems around the globe even in countries with strong economies..."
}
{
  "_id": "d1stzy8w",
  "title": "Susceptibility of tree shrew to SARS-CoV-2 infection",
  "text": "Since SARS-CoV-2 became a pandemic event in the world, it has not only caused huge economic losses, but also..."
}
{
  "_id": "6jej7l24",
  "title": "Diagnosing rhinitis: viral and allergic characteristics.",
  "text": ""
}

Queries

The fields for a single query item are: “_id” and “text”.

{
    "_id": "1",
    "text": "what is the origin of COVID-19"
},
{
    "_id": "2",
    "text": "how does the coronavirus respond to changes in the weather"
},
{
    "_id": "3",
    "text": "will SARS-CoV2 infected people develop immunity? Is cross protection possible?"
}

Relevance Judgements (qrels)

Each qrel entry identifies a query, a corpus document ID, and the relevance score of that document for the query. Since each query has many judged documents, the 50 queries account for 24,763 qrels.

query-id	corpus-id	score
1	        005b2j4b	2.0
1	        00fmeepz	1.0
1	        g7dhmyyo	2.0

The score values are 0, 1, or 2, where:

  • 0 indicates the document is not relevant,
  • 1 indicates the document is somewhat relevant,
  • 2 indicates the document is relevant to the query.

For the chosen metric evaluations, the difference between scores of 1 and 2 will primarily affect the results of the NDCG metric.

Choosing Evaluation Metrics and Demonstrating Metric Calculations

We have chosen to evaluate the following common metrics for retrieval tasks: Precision, Recall, NDCG, and MAP. The evaluations were performed using pytrec_eval.

Precision and Recall are not rank-aware, while NDCG and MAP are. This means that Precision and Recall scores indicate whether the correct sources were surfaced, while NDCG and MAP also account for the ordering of relevant results.

NOTE: In the metric equations below, Metric@K refers to the value of the metric calculated for the top K retrieved results.

Sample Data for Demonstrating Metric Calculations

To illustrate the calculation of these metrics, we provide sample results from Algolia searches on queries with IDs "1", "2", and "3." In these examples, we assume the search engine returns up to 10 results per query.

The format of the queries and qrels data below follows the structure accepted by pytrec_eval.

Queries

{'_id': '1', 'text': 'what is the origin of COVID-19'}
{'_id': '2', 'text': 'how does the coronavirus respond to changes in the weather'}
{'_id': '3', 'text': 'will SARS-CoV2 infected people develop immunity? Is cross protection possible?'}

Qrels

Below is a sample of the qrels for queries "1", "2", and "3". This representation is reordered and truncated to focus on query "1" and to highlight the corresponding Algolia search results, making the metric calculations easier to follow.

(The full qrels for the queries “1”, “2” and “3” can be found here.)

{
    "1": {
        "dckuhrlf": 0,
        "96zsd27n": 0,
        "0paafp5j": 0,
        "fqs40ivc": 1,
        "hmvo5b0q": 1,
        "l2wzr3w1": 1,
        "41378qru": 0,
        "dv9m19yk": 1,
        "ipl6189w": 0,
        "084o1dmp": 0,
        "08ds967z": 1,
        ...
        },
    "2": {...},
    "3": {...}
}

Search Results (Limited to the Top 10 Results as Determined by the Engine)

pytrec_eval interprets search results similarly to how it handles qrels, where each result has an associated relevance score. While search engines may use different scoring systems or scales, pytrec_eval evaluates these scores based on their relative rank. In other words, it focuses on the rank order of results, with higher scores indicating greater relevance. The absolute value of the score itself is less important than its position relative to other results in the ranked list.

In the case of the Algolia search results below, the corpus ID with the highest score, "dv9m19yk", was ranked as the top result, while "26276rpr" was ranked at the bottom of the top 10 results.

{
    "1": {
        "26276rpr": 2,
        "dckuhrlf": 3,
        "96zsd27n": 4,
        "0paafp5j": 5,
        "hmvo5b0q": 6,
        "fqs40ivc": 7,
        "l2wzr3w1": 8,
        "41378qru": 9,
        "ipl6189w": 10,
        "dv9m19yk": 11
    },
    "2": {},
    "3": {}
}

From the results above, we can see that Algolia did not return any results for queries "2" and "3".

We chose to use incrementing numbers and avoid duplicate scores because Algolia doesn’t return an explicit search score. Instead, we rely on the order in which results are returned, which is determined by a series of tie-breakers. You can learn more about these tie-breakers in the documentation.

Additionally, we did not assign scores of 0 or 1 in the run, since Algolia guarantees that every returned document contains the query terms; we therefore treat all returned results as relevant.

With the sample data above, we can now move forward with calculating values for both non-rank-aware metrics (Precision and Recall) and rank-aware metrics (MAP and NDCG).
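
Before walking through the calculations by hand, here is a minimal sketch of how this sample run could be scored with pytrec_eval. It assumes qrels holds the full judgments for queries "1", "2", and "3" (loaded from the dataset as shown earlier); the measure strings follow the trec_eval naming that pytrec_eval expects.

import pytrec_eval

# Retrieval run from the Algolia sample above: higher score means an earlier rank.
run = {
    "1": {
        "dv9m19yk": 11, "ipl6189w": 10, "41378qru": 9, "l2wzr3w1": 8, "fqs40ivc": 7,
        "hmvo5b0q": 6, "0paafp5j": 5, "96zsd27n": 4, "dckuhrlf": 3, "26276rpr": 2,
    },
    "2": {},
    "3": {},
}

# `qrels` maps query_id -> {corpus_id: graded relevance (0, 1, or 2)} for the three queries.
measures = {"P.10", "recall.10", "map_cut.10", "ndcg_cut.10"}
evaluator = pytrec_eval.RelevanceEvaluator(qrels, measures)
scores = evaluator.evaluate(run)

# Per-query values (e.g. scores["1"]["P_10"]); averaging over queries gives the
# Precision@10, Recall@10, MAP@10, and NDCG@10 figures worked out below.
for metric in ("P_10", "recall_10", "map_cut_10", "ndcg_cut_10"):
    print(metric, sum(per_query[metric] for per_query in scores.values()) / len(scores))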

Non-Rank-Aware Metrics

Precision

Precision measures how many of the retrieved items are relevant, indicating the percentage of retrieved results that are considered correct according to the ground truth (qrels).

Precision@K = \frac{TP}{TP+FP} = \frac{TP}{K} = \frac{\text{Number of relevant items in K}}{\text{Total number of items in K}}

Where:

  • TP = True Positives
  • FP = False Positives
  • K = The number of retrieved results

Example calculation:

For Query 1, there were 4 true positive results. Since we're calculating Precision@10, we have:

Precision@10 = \frac{4}{10} = 0.4

For Queries 2 and 3, since no relevant results were retrieved, Precision@10 = 0 for both queries.

Average Precision@10:

\text{Average } Precision@10 = \frac{0.4 + 0 + 0}{3} = 0.13333

Recall

Recall measures how many of the relevant items were retrieved, indicating the percentage of all relevant items that were retrieved.

Recall@K = \frac{TP}{TP+FN} = \frac{\text{Number of relevant items in K}}{\text{Total number of relevant items}}

Where:

  • TP = True Positives
  • FN = False Negatives

Example calculation:

For Query 1, there were 4 true positive results. The total number of relevant items is 637, which comes from the sum of items marked as somewhat relevant (score of 1) and relevant (score of 2) in the qrels. So, the calculation for Recall@10 is:

Recall@10 = \frac{4}{637} = 0.00628

For Queries 2 and 3, since no relevant results were retrieved, Recall@10 = 0 for both queries.

Average Recall@10:

\text{Average } Recall@10 = \frac{0.00627943 + 0 + 0}{3} = 0.00209
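
The two calculations above can be reproduced with a small helper like the one below; ranked_ids is the ordered list of corpus IDs returned for a query and qrels_for_query is that query's entry in the qrels (both variable names are illustrative only).

def precision_recall_at_k(ranked_ids, qrels_for_query, k=10):
    """Precision@K and Recall@K for a single query."""
    relevant = {doc_id for doc_id, rel in qrels_for_query.items() if rel > 0}
    retrieved = ranked_ids[:k]
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / k                                # relevant items in K / K
    recall = hits / len(relevant) if relevant else 0.0  # relevant items in K / all relevant items
    return precision, recall

# For Query 1: 4 of the top 10 results are relevant and 637 documents are relevant in total,
# giving Precision@10 = 4 / 10 = 0.4 and Recall@10 = 4 / 637 ≈ 0.00628.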

Rank-Aware Metrics

MAP

Mean Average Precision (MAP) evaluates the system’s ability to return relevant items and rank them appropriately, with the most relevant items appearing at the top of the list.

To calculate MAP, we first need to compute the Average Precision (AP) for a single query. AP@K sums the precision at each rank k (from 1 to K) at which a relevant item is retrieved, then normalizes by the total number of relevant documents for the query.

AP@K = \frac{1}{N}\sum_{k=1}^{K} \text{Precision}(k) \times \text{rel}(k)

Here:

  • rel(k) represents a binary relevance function, where the value is 1 if the k-th retrieved item is relevant, and 0 if it is not.
  • N is the total number of relevant documents for the query.

The final MAP score is the average of the AP values over all queries.

MAP@K = \frac{1}{U}\sum_{u=1}^{U} AP@K_{u}

Where:

  • U is the number of queries.

Example calculation:

For Query 1, the table below shows the Precision@K, rel(k), and AP@K for the ordered search results.

Rank (k)   Corpus ID   rel(k)   Precision@k   Precision(k) * rel(k)
1          dv9m19yk    1        1 / 1         1
2          ipl6189w    0        1 / 2         0
3          41378qru    0        1 / 3         0
4          l2wzr3w1    1        2 / 4         0.5
5          fqs40ivc    1        3 / 5         0.6
6          hmvo5b0q    1        4 / 6         0.66667
7          0paafp5j    0        4 / 7         0
8          96zsd27n    0        4 / 8         0
9          dckuhrlf    0        4 / 9         0
10         26276rpr    0        4 / 10        0

This results in:

AP@10 = \frac{1 + 0 + 0 + 0.5 + 0.6 + 0.66667 + 0 + 0 + 0 + 0}{637} = 0.00434

For Queries 2 and 3, since no relevant items were retrieved, AP@10 = 0 for both queries.

Average AP@10 (or simply MAP@10):

MAP@10 = \frac{0.00434 + 0 + 0}{3} = 0.00145
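
The same AP@K calculation, with the recall-normalized denominator N used above, can be sketched as follows (variable names are illustrative only):

def average_precision_at_k(ranked_ids, qrels_for_query, k=10):
    """AP@K for a single query, normalized by the total number of relevant documents N."""
    relevant = {doc_id for doc_id, rel in qrels_for_query.items() if rel > 0}
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:                # rel(k) = 1
            hits += 1
            precision_sum += hits / rank      # Precision(k) * rel(k)
    return precision_sum / len(relevant)

# For Query 1: (1/1 + 2/4 + 3/5 + 4/6) / 637 ≈ 0.00434; MAP@10 then averages AP@10 over all queries.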

NDCG

Normalized Discounted Cumulative Gain (NDCG) measures a system’s ability to rank items based on their relevance. Unlike other metrics, NDCG accounts for how relevant the retrieved items are, using relevance values from the qrels.

NDCG is calculated by first determining the Discounted Cumulative Gain (DCG) and then normalizing it by the Ideal Discounted Cumulative Gain (IDCG).

Formula:

NDCG@K = \frac{DCG@K}{IDCG@K}

DCG Calculation:

DCG@K = \sum_{i=1}^{K} \frac{rel_{i}}{\log_{2}(i+1)}

Where:

  • rel_i is the graded relevance of the i-th result.
  • log_2(i + 1) is the logarithmic discount factor.

Example calculation:

For Query 1, we first calculate the DCG@10.

Rank (i)   Corpus ID   rel(i)   rel(i) / log2(i + 1)
1          dv9m19yk    1        1
2          ipl6189w    0        0
3          41378qru    0        0
4          l2wzr3w1    1        0.4307
5          fqs40ivc    1        0.3869
6          hmvo5b0q    1        0.3562
7          0paafp5j    0        0
8          96zsd27n    0        0
9          dckuhrlf    0        0
10         26276rpr    0        0

By summing the last column, we get:

DCG@10 = 1 + 0 + 0 + 0.4307 + 0.3869 + 0.3562 + 0 + 0 + 0 + 0 = 2.1738

IDCG@10 represents the ideal scenario in which the most relevant documents for Query 1 are ranked highest. Since 305 documents have a relevance of 2, each of the top 10 positions in the ideal ranking has a relevance of 2. We calculate IDCG@10 as follows:

IDCG@10 = \frac{2}{\log_2(2)} + \frac{2}{\log_2(3)} + \frac{2}{\log_2(4)} + \frac{2}{\log_2(5)} + \frac{2}{\log_2(6)} + \frac{2}{\log_2(7)} + \frac{2}{\log_2(8)} + \frac{2}{\log_2(9)} + \frac{2}{\log_2(10)} + \frac{2}{\log_2(11)} = 9.0886

Finally, we compute NDCG@10 for Query 1:

NDCG@10 = \frac{DCG@10}{IDCG@10} = 0.23915

For Queries 2 and 3, since no relevant items were retrieved, NDCG@10 = 0 for both queries.

Average NDCG@10:

\text{Average } NDCG@10 = \frac{0.23915 + 0 + 0}{3} = 0.07972
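
The DCG, IDCG, and NDCG steps above can be condensed into a short helper; as before, the variable names are illustrative and the graded relevance values come straight from the qrels.

import math

def ndcg_at_k(ranked_ids, qrels_for_query, k=10):
    """NDCG@K for a single query, using graded relevance (0, 1, or 2) from the qrels."""
    gains = [qrels_for_query.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(gains, start=1))

    # Ideal ranking: the k highest relevance grades available for this query.
    ideal_gains = sorted(qrels_for_query.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal_gains, start=1))

    return dcg / idcg if idcg > 0 else 0.0

# For Query 1: DCG@10 ≈ 2.1738, IDCG@10 ≈ 9.09, so NDCG@10 ≈ 0.239.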

Running Full Evaluations

Now that we’ve covered the selected metrics and how they’re calculated, we’ll apply them to the full dataset to compare the performance of Elasticsearch lexical search, Elasticsearch semantic search, default Algolia Search, and Algolia Search with stopwords enabled. This evaluation will help us understand how each method performs across different configurations using the same set of queries and ground-truth relevance labels.

Code Reference

You can access the full code in our GitHub repository.

Results

We evaluated the following metrics: Precision, Recall, NDCG, and MAP at different values of K: 1, 5, 10, 15, 25, 35, 45, and 55.

Results for Elasticsearch lexical search:

K    Precision   Recall    NDCG      MAP
1    0.76000     0.00203   0.72000   0.00203
5    0.71200     0.00880   0.66893   0.00777
10   0.65400     0.01589   0.61587   0.013061
15   0.63067     0.02308   0.59091   0.018215
25   0.58560     0.03441   0.55259   0.02652
35   0.56000     0.04560   0.52855   0.03396
45   0.53956     0.05595   0.50753   0.04088
55   0.51273     0.06416   0.48471   0.04630

Results for Elasticsearch semantic search:

K    Precision   Recall    NDCG      MAP
1    0.94000     0.00246   0.87000   0.00246
5    0.83600     0.01099   0.78145   0.01033
10   0.78800     0.02036   0.75003   0.01865
15   0.52533     0.02036   0.58140   0.01865
25   0.31520     0.02036   0.41907   0.01865
35   0.22514     0.02036   0.33592   0.01865
45   0.17511     0.02036   0.28382   0.01865
55   0.14327     0.02036   0.24802   0.01865

Results for default Algolia search:

K    Precision   Recall    NDCG      MAP
1    0.26000     0.00042   0.20000   0.00042
5    0.20800     0.00200   0.17705   0.00157
10   0.19200     0.00349   0.16598   0.00266
15   0.17333     0.00485   0.15414   0.00361
25   0.12640     0.00588   0.12298   0.00434
35   0.09029     0.00588   0.09858   0.00434
45   0.07022     0.00588   0.08329   0.00434
55   0.05745     0.00588   0.07269   0.00434

Results for Algolia with stopwords:

K    Precision   Recall    NDCG      MAP
1    0.38000     0.00071   0.31000   0.00071
5    0.31600     0.00350   0.27898   0.00272
10   0.27400     0.00596   0.25137   0.00440
15   0.25733     0.00829   0.23713   0.00577
25   0.19920     0.01061   0.19793   0.00721
35   0.14229     0.01061   0.15866   0.00721
45   0.11067     0.01061   0.13405   0.00721
55   0.0905      0.01061   0.11699   0.00721

Results Interpretation

Elasticsearch

Overall, Elasticsearch performs better than Algolia.

When comparing lexical and semantic search within Elasticsearch, semantic search shows stronger performance at lower K-values. From around K = 15 onward, however, lexical search begins to overtake it, and by K = 25 it leads on all four metrics.

Looking more closely, we notice that for Elasticsearch semantic search, both Recall and MAP remain constant at and beyond K = 15, matching the earlier Recall@10 and MAP@10 scores. This is likely because the semantic search returned no more than about 10 results for many queries, so no additional relevant documents are found at higher values of K and the scores stop increasing beyond that point.

Since these two approaches were used exclusively in this evaluation, a potential improvement could be to combine semantic and lexical signals. This hybrid approach might offer the precision of semantic search at the top ranks while maintaining broader coverage through lexical matching.
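
One simple, engine-agnostic way to do this, which we did not evaluate here, is to merge the lexical and semantic result lists client-side with reciprocal rank fusion (RRF); recent Elasticsearch versions also offer server-side rank fusion. The sketch below uses illustrative variable names only.

def reciprocal_rank_fusion(lexical_ids, semantic_ids, k=60, top_n=10):
    """Merge two ranked lists of corpus IDs using reciprocal rank fusion."""
    scores = {}
    for ranked in (lexical_ids, semantic_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Example: fused = reciprocal_rank_fusion(lexical_run["1"], semantic_run["1"])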

Algolia

From the results, we can observe that Algolia’s scores are significantly lower compared to Elasticsearch.

Similar to Elasticsearch, we notice that Algolia's Recall and MAP scores remain constant at and beyond K = 25, likely for the same reason.

Interestingly, we observe a slight improvement in Algolia’s metric scores when stopwords are applied. By excluding certain words from the strict AND condition, Algolia broadens its search criteria, allowing more relevant results to surface. However, this enhancement is limited. Algolia’s default behavior still requires that all (non-stopword) query terms appear in the result, which can restrict its ability to retrieve relevant records in some cases. This is evident in the following example:

Query “2”: “How does the coronavirus respond to changes in the weather”

Returns: Nothing

Here, Algolia returns no results because no records contain all of the terms in the query.

Modified Query “2”: “coronavirus respond to the weather”

Returns: records “w5kjmw88”, “gan10za0”

In this case, Algolia returns results because these records contain the specified terms.

This suggests that Algolia is better suited for keyword-based searches. However, it’s worth noting that Algolia does offer a semantic search feature called NeuralSearch, available with their most expensive plan: Elevate. Implementing this feature could potentially improve Algolia's results for more complex queries.

Conclusion

In this article, we compared the querying performance of Algolia and Elasticsearch. While Elasticsearch performed better, its results were not perfect. Algolia, on the other hand, demonstrated limitations, particularly in handling more complex queries due to its strict search behavior.

When ranking the methods based on performance, we found the following:

  1. (Tie): Elasticsearch lexical and semantic search – semantic search excels at low K (up to about K = 10), while lexical search performs better from roughly K = 15 onward.
  2. Algolia search with stopwords enabled
  3. Algolia default search

To improve performance in enterprise search, both Algolia and Elasticsearch could benefit from query preprocessing techniques, especially for complex queries, helping to deliver more relevant and accurate results across diverse use cases.

References:

  • BEIR: Benchmarking Information Retrieval (BEIR)
  • Elasticsearch Labs: The BEIR benchmark & Elasticsearch search relevance evaluation
  • Hugging Face: MTEB: Massive Text Embedding Benchmark
  • Algolia Documentation: Prepare your records for indexing
  • Built In: Mean Average Precision (mAP) Explained
  • Weaviate: Evaluation Metrics for Search and Recommendation Systems