Introduction
Search is everywhere—whether you're browsing an online store, digging through academic papers, or trying to find the right document at work, you rely on search systems to deliver the right results quickly. As the volume of digital information keeps growing, strong information retrieval tools have become essential for making that data useful.
Among the many search engine solutions available, Algolia and Elasticsearch stand out as two of the most widely used platforms. Both offer powerful tools for optimizing search functionality, but they differ significantly in how they approach indexing, querying, and ranking.
In this blog, we will compare these two solutions. Through the use of a standardized dataset, we will evaluate their performance across several key metrics. This comparison is valuable because both search engines could be implemented in the retrieval step of a Retrieval-Augmented Generation (RAG) system, as well as in traditional enterprise search environments and other similar applications.
Understanding Algolia and Elasticsearch
In a previous article, we covered the pricing and features of both search engines. In this section, we will explore how the search engines work in more detail.
Algolia
Data Storage
In Algolia, the basic units for ingested data are called “records”. A record is an object-like collection of attributes, each consisting of a name and a value. Depending on the Algolia plan, a single record may be limited to 10 KB (in the Free Build plan) or 100 KB (in the Grow, Premium, and Elevate plans).
As part of the ingestion process, Algolia automatically creates an objectID field for each record.
Below is an example of a corpus record saved to Algolia:
{
"_id": "w5kjmw88",
"title": "Weathering the pandemic: How the Caribbean Basin can use viral and environmental patterns to predict, prepare, and respond to COVID-19",
"text": "The 2020 coronavirus pandemic is developing at different paces throughout the world. Some areas...",
"objectID": "1f636f408f90ea_dashboard_generated_id"
}
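To make the ingestion step concrete, here is a minimal sketch of how such records could be pushed to Algolia using the algoliasearch Python client (v3-style API); the application ID, API key, and index name are placeholders, and the record content is abbreviated:

from algoliasearch.search_client import SearchClient

# Placeholder credentials and index name.
client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")
index = client.init_index("trec-covid")

records = [
    {
        "_id": "w5kjmw88",
        "title": "Weathering the pandemic: How the Caribbean Basin can use viral and environmental patterns...",
        "text": "The 2020 coronavirus pandemic is developing at different paces throughout the world...",
    },
]

# Algolia generates an objectID for each record when none is supplied.
index.save_objects(records, {"autoGenerateObjectIDIfNotExist": True})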
Searching
Algolia querying does not support boolean operators such as AND, OR, and NOT. Instead, multi-word queries behave like an implicit AND: every word in the query must appear in a record for that record to be returned as a search result.
That being said, Algolia does offer some flexibility with its search functionality, such as the ability to use stopwords and specify searchable fields, among other features.
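As an illustration, here is a minimal sketch of configuring searchable fields and stopword handling, then running a keyword query, with the algoliasearch Python client (v3-style API); the credentials and index name are placeholders:

from algoliasearch.search_client import SearchClient

client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")
index = client.init_index("trec-covid")

# Restrict matching to the title and text fields and enable built-in stopword removal.
index.set_settings({
    "searchableAttributes": ["title", "text"],
    "removeStopWords": True,
})

# Every (non-stopword) query term must appear in a record for it to be returned.
response = index.search("coronavirus weather", {"hitsPerPage": 10})
for hit in response["hits"]:
    print(hit["objectID"], hit.get("title"))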
Algolia also offers semantic search capabilities, but this feature is only available with its premium Elevate plan.
Ranking
Once results are retrieved, Algolia ranks them and applies tie-breaking based on an ordered list of criteria.
Elasticsearch
Data Storage
In Elasticsearch, data is stored as "documents," which can be either structured or unstructured, with a maximum document size limit of 100 MB. For consistency with Algolia during the evaluation, we will upload items with the fields _id, title, and text as individual documents. This ensures a fair comparison between the two search engines.
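For illustration, a minimal sketch of bulk-indexing such documents with the elasticsearch Python client might look as follows; the cluster URL and index name are placeholders, and the corpus list is abbreviated:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder cluster URL

corpus = [
    {"_id": "0a0i6vjn", "title": "Zambia's National Cancer Centre response...", "text": "The COVID-19 pandemic has overwhelmed health systems..."},
]

# Reuse each corpus _id as the Elasticsearch document _id so that search hits
# map directly back to the dataset's qrels.
actions = (
    {"_index": "trec-covid", "_id": doc["_id"], "_source": {"title": doc["title"], "text": doc["text"]}}
    for doc in corpus
)
helpers.bulk(es, actions)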
Searching
In contrast to Algolia’s AND-like approach, Elasticsearch uses the OR operator for all words in a query. This means that as long as any word in the query is found in the document, it will be returned as a result.
In Elasticsearch, searches can be performed either lexically or semantically. To enable semantic search, the fields being queried must first be semantically indexed.
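For example, a lexical query over both fields could be issued as follows with the elasticsearch Python client; this is a sketch with placeholder connection details, and a semantic query would additionally require the queried fields to be indexed with embeddings:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster URL

# Lexical search: any document containing at least one query term can match,
# and BM25 determines how the matches are ranked.
response = es.search(
    index="trec-covid",
    query={
        "multi_match": {
            "query": "how does the coronavirus respond to changes in the weather",
            "fields": ["title", "text"],
        }
    },
    size=10,
)
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])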
Ranking
Elasticsearch compensates for this broad matching with ranking: documents that contain only a single word from the query are still returned, but they are less likely to appear among the top results. After the search is performed, results are sorted by relevance using the BM25 ranking algorithm.
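For reference, BM25 scores a document D against a query Q roughly as follows, where f(q_i, D) is the frequency of query term q_i in D, |D| is the document length, avgdl is the average document length in the index, and k1 and b are tunable parameters (Elasticsearch defaults to k1 = 1.2 and b = 0.75):

score(D, Q) = Σ_i IDF(q_i) · f(q_i, D) · (k1 + 1) / ( f(q_i, D) + k1 · (1 − b + b · |D| / avgdl) )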
Evaluating Retrieval Performance
Choosing an Evaluation Dataset
For the evaluation, we selected the TREC-COVID dataset, which is included in both the MTEB and BEIR benchmarks.
BEIR Benchmark
BEIR (Benchmarking Information Retrieval) is a standard benchmark for evaluating retrieval task performance, primarily using NDCG and Recall as the key metrics.
BEIR provides 18 datasets covering a wide range of domains, used to test various retrieval tasks, including fact-checking, question answering, document retrieval, and recommendation systems.
Each BEIR dataset consists of a corpus, queries, and relevance judgments (qrels) for those queries (a sample of the data format is shown later).
MTEB Benchmark
MTEB (Massive Text Embedding Benchmark) is a standard benchmark for measuring text embedding performance across 8 task types and 56 datasets. MTEB is also a superset of the BEIR benchmark: its retrieval task reuses the BEIR datasets.
TREC-COVID Dataset
Among all the datasets in MTEB and BEIR, the TREC-COVID dataset was chosen specifically for its relatively high labeling rate (a comparatively large proportion of query-document pairs have relevance judgments).
The original dataset consists of a corpus containing 171,332 documents, 50 queries, and 24,763 qrels.
For the evaluation experiments, we removed documents that exceeded the 10 KB record size limit of Algolia’s Free Build plan. Only 12 documents were affected, so nearly all of the original corpus was used for both search engines. The qrels corresponding to these excluded documents were also omitted from the evaluation.
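As a rough sketch of this preprocessing step (assuming the corpus and qrels have already been loaded into Python lists, and approximating record size as the length of the UTF-8 encoded JSON payload):

import json

MAX_RECORD_BYTES = 10_000  # 10 KB limit on Algolia's Free Build plan

def within_limit(doc: dict) -> bool:
    # Approximate the Algolia record size as the serialized JSON payload.
    return len(json.dumps(doc).encode("utf-8")) <= MAX_RECORD_BYTES

filtered_corpus = [doc for doc in corpus if within_limit(doc)]
kept_ids = {doc["_id"] for doc in filtered_corpus}

# Drop qrels that point to excluded documents.
filtered_qrels = [q for q in qrels if q["corpus-id"] in kept_ids]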
Below are sample records from the dataset, which include the corpus, queries, and relevance judgments (qrels):
Corpus
Each corpus item includes the following fields: "_id", "title" and "text". Some "title" and "text" fields are empty, so both fields were set to be searchable.
{
"_id": "0a0i6vjn",
"title": "Zambia’s National Cancer Centre response to the COVID-19 pandemic—an opportunity for improved care",
"text": "The COVID-19 pandemic has overwhelmed health systems around the globe even in countries with strong economies..."
}
{
"_id": "d1stzy8w",
"title": "Susceptibility of tree shrew to SARS-CoV-2 infection",
"text": "Since SARS-CoV-2 became a pandemic event in the world, it has not only caused huge economic losses, but also..."
}
{
"_id": "6jej7l24",
"title": "Diagnosing rhinitis: viral and allergic characteristics.",
"text": ""
}
Queries
The fields for a single query item are: “_id” and “text”.
{
"_id": "1",
"text": "what is the origin of COVID-19"
},
{
"_id": "2",
"text": "how does the coronavirus respond to changes in the weather"
},
{
"_id": "3",
"text": "will SARS-CoV2 infected people develop immunity? Is cross protection possible?"
}
Relevance Judgements (qrels)
Each qrel entry identifies a query, a corpus document ID, and the relevance score of that document for the query. Since a single query is typically judged against many documents, the 50 queries account for 24,763 qrels.
query-id corpus-id score
1 005b2j4b 2.0
1 00fmeepz 1.0
1 g7dhmyyo 2.0
The score values are 0, 1, or 2, where:
- 0 indicates the document is not relevant,
- 1 indicates the document is somewhat relevant,
- 2 indicates the document is relevant to the query.
For the chosen metric evaluations, the difference between scores of 1 and 2 will primarily affect the results of the NDCG metric.
Choosing Evaluation Metrics and Demonstrating Metric Calculations
We have chosen to evaluate the following common metrics for retrieval tasks: Precision, Recall, NDCG, and MAP. The evaluations were performed using pytrec_eval.
Precision and Recall are not rank-aware, while NDCG and MAP are. This means that Precision and Recall scores indicate whether the correct sources were surfaced, while NDCG and MAP also account for the ordering of relevant results.
NOTE: In the metric equations below, Metric@K refers to the value of the metric calculated for the top K retrieved results.
Sample Data for Demonstrating Metric Calculations
To illustrate the calculation of these metrics, we provide sample results from Algolia searches on queries with IDs "1", "2", and "3." In these examples, we assume the search engine returns up to 10 results per query.
The format of the queries and qrels data below follows the structure accepted by pytrec_eval.
Queries
{'_id': '1', 'text': 'what is the origin of COVID-19'}
{'_id': '2', 'text': 'how does the coronavirus respond to changes in the weather'}
{'_id': '3', 'text': 'will SARS-CoV2 infected people develop immunity? Is cross protection possible?'}
Qrels
Below is a sample of the qrels for queries "1", "2", and "3." The entries are reordered and truncated to focus on query "1," making it easier to follow the metric calculations against the corresponding Algolia search results.
(The full qrels for the queries “1”, “2” and “3” can be found here.)
{
"1": {
"dckuhrlf": 0,
"96zsd27n": 0,
"0paafp5j": 0,
"fqs40ivc": 1,
"hmvo5b0q": 1,
"l2wzr3w1": 1,
"41378qru": 0,
"dv9m19yk": 1,
"ipl6189w": 0,
"084o1dmp": 0,
"08ds967z": 1,
...
},
"2": {...},
"3": {...}
}
Search Results (Limited to the Top 10 Results as Determined by the Engine)
pytrec_eval interprets search results similarly to how it handles qrels, where each result has an associated relevance score. While search engines may use different scoring systems or scales, pytrec_eval evaluates these scores based on their relative rank. In other words, it focuses on the rank order of results, with higher scores indicating greater relevance. The absolute value of the score itself is less important than its position relative to other results in the ranked list.
In the case of the Algolia search results below, the corpus ID with the highest score, "dv9m19yk", was ranked as the top result, while "26276rpr" was ranked at the bottom of the top 10 returned results.
{
"1": {
"26276rpr": 2,
"dckuhrlf": 3,
"96zsd27n": 4,
"0paafp5j": 5,
"hmvo5b0q": 6,
"fqs40ivc": 7,
"l2wzr3w1": 8,
"41378qru": 9,
"ipl6189w": 10,
"dv9m19yk": 11
},
"2": {},
"3": {}
}
From the results above, we can see that Algolia did not return any results for queries "2" and "3.”
We chose to use incrementing numbers and avoid duplicate scores because Algolia doesn’t return an explicit search score. Instead, we rely on the order in which results are returned, which is determined by a series of tie-breakers. You can learn more about these tie-breakers in the documentation.
Additionally, we avoided scores of 0 or 1, since Algolia guarantees that every returned document contains the query terms; we therefore treat all returned results as relevant.
With the sample data above, we can now move forward with calculating values for both non-rank-aware metrics (Precision and Recall) and rank-aware metrics (MAP and NDCG).
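Before walking through the calculations by hand, here is a minimal pytrec_eval sketch that evaluates the sample above. It is restricted to query "1" and to the truncated qrels shown earlier, so the resulting numbers will differ from those computed against the full qrels (for example, Recall here is based only on the 5 relevant entries present in the truncated dict rather than all 637):

import pytrec_eval

# Truncated qrels and the Algolia run for query "1", copied from the samples above.
qrels = {
    "1": {
        "dckuhrlf": 0, "96zsd27n": 0, "0paafp5j": 0, "fqs40ivc": 1,
        "hmvo5b0q": 1, "l2wzr3w1": 1, "41378qru": 0, "dv9m19yk": 1,
        "ipl6189w": 0, "084o1dmp": 0, "08ds967z": 1,
    }
}
run = {
    "1": {
        "26276rpr": 2, "dckuhrlf": 3, "96zsd27n": 4, "0paafp5j": 5,
        "hmvo5b0q": 6, "fqs40ivc": 7, "l2wzr3w1": 8, "41378qru": 9,
        "ipl6189w": 10, "dv9m19yk": 11,
    }
}

# Measure names follow trec_eval conventions ("P_10", "recall_10", etc.).
evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {"P_10", "recall_10", "map_cut_10", "ndcg_cut_10"}
)
print(evaluator.evaluate(run)["1"])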
Non-Rank-Aware Metrics
Precision
Precision measures how many of the retrieved items are relevant, indicating the percentage of retrieved results that are considered correct according to the ground truth (qrels).
Where:
- TP = True Positives
- FP = False Positives
- K = The number of retrieved results
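Putting these definitions together, Precision at cutoff K can be written as:

Precision@K = TP / (TP + FP) = TP / K

since the top K retrieved results consist of exactly TP + FP items.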
Example calculation:
For Query 1, there were 4 true positive results. Since we're calculating Precision@10, we have: Precision@10 = 4 / 10 = 0.4.
For Queries 2 and 3, since no relevant results were retrieved, Precision@10 = 0 for both queries.
Average Precision@10 = (0.4 + 0 + 0) / 3 ≈ 0.133.
Recall
Recall measures how many of the relevant items were retrieved, indicating the percentage of all relevant items that were retrieved.
Where:
- TP = True Positives
- FN = False Negatives
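Putting these definitions together, Recall at cutoff K can be written as:

Recall@K = TP / (TP + FN)

where TP + FN is the total number of relevant items for the query.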
Example calculation:
For Query 1, there were 4 true positive results. The total number of relevant items is 637, which comes from the sum of items marked as somewhat relevant (score of 1) and relevant (score of 2) in the qrels. So, the calculation for Recall@10 is: Recall@10 = 4 / 637 ≈ 0.0063.
For Queries 2 and 3, since no relevant results were retrieved, Recall@10 = 0 for both queries.
Average Recall@10 = (0.0063 + 0 + 0) / 3 ≈ 0.0021.
Rank-Aware Metrics
MAP
Mean Average Precision (MAP) evaluates the system’s ability to return relevant items and rank them appropriately, with the most relevant items appearing at the top of the list.
To calculate MAP, we first need to compute the Average Precision (AP) for a single query. AP@K averages the precision values at each rank k = 1, …, K at which a relevant item is retrieved, where K is the cutoff used in MAP@K.
Here:
- rel(k) represents a binary relevance function, where the value is 1 if the k-th retrieved item is relevant, and 0 if it is not.
- N is the total number of relevant documents for the query.
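Written out, the Average Precision at cutoff K is:

AP@K = (1 / N) · Σ_{k=1}^{K} Precision(k) · rel(k)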
The final MAP score is the average of the AP values across all queries.
Where:
- U is the number of queries.
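Expressed as a formula:

MAP@K = (1 / U) · Σ_{u=1}^{U} AP@K(u)

where AP@K(u) is the Average Precision of the u-th query.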
Example calculation:
For Query 1, the table below shows the Precision@K, rel(k), and AP@K for the ordered search results.
Rank (k) | Corpus ID | rel(k) | Precision(k) | Precision(k) * rel(k) |
1 | dv9m19yk | 1 | 1 / 1 | 1 |
2 | ipl6189w | 0 | 1 / 2 | 0 |
3 | 41378qru | 0 | 1 / 3 | 0 |
4 | l2wzr3w1 | 1 | 2 / 4 | 0.5 |
5 | fqs40ivc | 1 | 3 / 5 | 0.6 |
6 | hmvo5b0q | 1 | 4 / 6 | 0.66667 |
7 | 0paafp5j | 0 | 4 / 7 | 0 |
8 | 96zsd27n | 0 | 4 / 8 | 0 |
9 | dckuhrlf | 0 | 4 / 9 | 0 |
10 | 26276rpr | 0 | 4 / 10 | 0 |
This results in: AP@10 = (1 + 0.5 + 0.6 + 0.66667) / 637 ≈ 0.0043.
For Queries 2 and 3, since no relevant items were retrieved, AP@10 = 0 for both queries.
Average AP@10 (or simply MAP@10) = (0.0043 + 0 + 0) / 3 ≈ 0.0014.
NDCG
Normalized Discounted Cumulative Gain (NDCG) measures a system’s ability to rank items based on their relevance. Unlike other metrics, NDCG accounts for how relevant the retrieved items are, using relevance values from the qrels.
NDCG is calculated by first determining the Discounted Cumulative Gain (DCG) and then normalizing it by the Ideal Discounted Cumulative Gain (IDCG).
Formula:
NDCG@K = DCG@K / IDCG@K
DCG Calculation:
DCG@K = Σ_{i=1}^{K} rel(i) / log2(i + 1)
Where:
- rel(i) is the relevance of the i-th result.
- log2(i + 1) is the logarithmic discount factor.
Example calculation:
For Query 1, we first calculate the DCG@10.
Rank (i) | corpus_id | rel | rel(i) / log2(i + 1) |
1 | dv9m19yk | 1 | 1 |
2 | ipl6189w | 0 | 0 |
3 | 41378qru | 0 | 0 |
4 | l2wzr3w1 | 1 | 0.4307 |
5 | fqs40ivc | 1 | 0.3869 |
6 | hmvo5b0q | 1 | 0.3562 |
7 | 0paafp5j | 0 | 0 |
8 | 96zsd27n | 0 | 0 |
9 | dckuhrlf | 0 | 0 |
10 | 26276rpr | 0 | 0 |
By summing the last column, we get: DCG@10 = 1 + 0.4307 + 0.3869 + 0.3562 ≈ 2.174.
IDCG@10 for Query 1 represents the ideal scenario where the most relevant documents are ranked highest. There are 305 documents with relevance 2, so the ideal top 10 results each have a relevance of 2. We calculate IDCG@10 as follows: IDCG@10 = Σ_{i=1}^{10} 2 / log2(i + 1) ≈ 9.09.
Finally, we compute NDCG@10 for Query 1: NDCG@10 = DCG@10 / IDCG@10 ≈ 2.174 / 9.09 ≈ 0.239.
For Queries 2 and 3, since no relevant items were retrieved, NDCG@10 = 0 for both queries.
Average NDCG@10 = (0.239 + 0 + 0) / 3 ≈ 0.080.
Running Full Evaluations
Now that we’ve covered the selected metrics and how they’re calculated, we’ll apply them to the full dataset to compare the performance of Elasticsearch lexical search, Elasticsearch semantic search, default Algolia Search, and Algolia Search with stopwords enabled. This evaluation will help us understand how each method performs across different configurations using the same set of queries and ground-truth relevance labels.
Code Reference
You can access the full code in our GitHub repository.
Results
We evaluated the following metrics: Precision, Recall, NDCG, and MAP at different values of K: 1, 5, 10, 15, 25, 35, 45, and 55.
Results for Elasticsearch lexical search:
K | Precision | Recall | NDCG | MAP |
1 | 0.76000 | 0.00203 | 0.72000 | 0.00203 |
5 | 0.71200 | 0.00880 | 0.66893 | 0.00777 |
10 | 0.65400 | 0.01589 | 0.61587 | 0.013061 |
15 | 0.63067 | 0.02308 | 0.59091 | 0.018215 |
25 | 0.58560 | 0.03441 | 0.55259 | 0.02652 |
35 | 0.56000 | 0.04560 | 0.52855 | 0.03396 |
45 | 0.53956 | 0.05595 | 0.50753 | 0.04088 |
55 | 0.51273 | 0.06416 | 0.48471 | 0.04630 |
Results for Elasticsearch semantic search:
K | Precision | Recall | NDCG | MAP |
1 | 0.94000 | 0.00246 | 0.87000 | 0.00246 |
5 | 0.83600 | 0.01099 | 0.78145 | 0.01033 |
10 | 0.78800 | 0.02036 | 0.75003 | 0.01865 |
15 | 0.52533 | 0.02036 | 0.58140 | 0.01865 |
25 | 0.31520 | 0.02036 | 0.41907 | 0.01865 |
35 | 0.22514 | 0.02036 | 0.33592 | 0.01865 |
45 | 0.17511 | 0.02036 | 0.28382 | 0.01865 |
55 | 0.14327 | 0.02036 | 0.24802 | 0.01865 |
Results for default Algolia search:
K | Precision | Recall | NDCG | MAP |
1 | 0.26000 | 0.00042 | 0.20000 | 0.00042 |
5 | 0.20800 | 0.00200 | 0.17705 | 0.00157 |
10 | 0.19200 | 0.00349 | 0.16598 | 0.00266 |
15 | 0.17333 | 0.00485 | 0.15414 | 0.00361 |
25 | 0.12640 | 0.00588 | 0.12298 | 0.00434 |
35 | 0.09029 | 0.00588 | 0.09858 | 0.00434 |
45 | 0.07022 | 0.00588 | 0.08329 | 0.00434 |
55 | 0.05745 | 0.00588 | 0.07269 | 0.00434 |
Results for Algolia with stopwords:
K | Precision | Recall | NDCG | MAP |
1 | 0.38000 | 0.00071 | 0.31000 | 0.00071 |
5 | 0.31600 | 0.00350 | 0.27898 | 0.00272 |
10 | 0.27400 | 0.00596 | 0.25137 | 0.00440 |
15 | 0.25733 | 0.00829 | 0.23713 | 0.00577 |
25 | 0.19920 | 0.01061 | 0.19793 | 0.00721 |
35 | 0.14229 | 0.01061 | 0.15866 | 0.00721 |
45 | 0.11067 | 0.01061 | 0.13405 | 0.00721 |
55 | 0.0905 | 0.01061 | 0.11699 | 0.00721 |
Results Interpretation
Elasticsearch
Overall, Elasticsearch performs better than Algolia.
When comparing regular lexical search and semantic search within Elasticsearch, semantic search shows stronger performance at lower K-values. However, starting from K=15 and beyond, the regular lexical search begins to outperform semantic search across all metrics.
Looking more closely, we notice that for Elasticsearch semantic search, both Recall and MAP scores remain constant at and beyond K = 15, matching the Recall@10 and MAP@10 values. This is likely because the semantic search returns no more than 15 results for many queries. As a result, the number of relevant results found, and therefore the scores, stops increasing beyond that point.
Since these two approaches were used exclusively in this evaluation, a potential improvement could be to combine semantic and lexical signals. This hybrid approach might offer the precision of semantic search at the top ranks while maintaining broader coverage through lexical matching.
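As a rough sketch of what such a hybrid query could look like (assuming a recent Elasticsearch version where the semantic_text field type and the semantic query are available, and that a hypothetical text_semantic field was added to the index mapping):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster URL

# Hypothetical hybrid query: a lexical multi_match clause combined with a
# semantic clause on a "text_semantic" semantic_text field (assumed to exist).
response = es.search(
    index="trec-covid",
    query={
        "bool": {
            "should": [
                {"multi_match": {"query": "origin of COVID-19", "fields": ["title", "text"]}},
                {"semantic": {"field": "text_semantic", "query": "origin of COVID-19"}},
            ]
        }
    },
    size=10,
)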
Algolia
From the results, we can observe that Algolia’s scores are significantly lower compared to Elasticsearch.
Similar to Elasticsearch, we notice that Algolia’s Recall and MAP scores remain constant at and beyond K = 25, likely for the same reason.
Interestingly, we observe a slight improvement in Algolia’s metric scores when stopwords are applied. By excluding certain words from the strict AND condition, Algolia broadens its search criteria, allowing more relevant results to surface. However, this enhancement is limited. Algolia’s default behavior still requires that all (non-stopword) query terms appear in the result, which can restrict its ability to retrieve relevant records in some cases. This is evident in the following example:
Query “2”: “How does the coronavirus respond to changes in the weather”
Returns: nothing
Here, Algolia returns no results because no records contain all of the terms in the query.
Modified Query “2”: “coronavirus respond to the weather”
Returns: records “w5kjmw88”, “gan10za0”
In this case, Algolia returns results because these records contain the specified terms.
This suggests that Algolia is better suited for keyword-based searches. However, it’s worth noting that Algolia does offer a semantic search feature called NeuralSearch, available with their most expensive plan: Elevate. Implementing this feature could potentially improve Algolia's results for more complex queries.
Conclusion
In this article, we compared the querying performance of Algolia and Elasticsearch. While Elasticsearch performed better, its results were not perfect. Algolia, on the other hand, demonstrated limitations, particularly in handling more complex queries due to its strict search behavior.
When ranking the methods based on performance, we found the following:
- (Tie) Elasticsearch lexical and semantic search: semantic search excels at smaller K (below 15), while lexical search performs better from K = 15 onward.
- Algolia search with stopwords enabled
- Algolia default search
To improve performance in enterprise search, both Algolia and Elasticsearch could benefit from query preprocessing techniques, especially for complex queries, helping deliver more relevant and accurate results across diverse use cases.
References:
BEIR: Benchmarking Information Retrieval
MTEB: Massive Text Embedding Benchmark (Hugging Face)
Prepare your records for indexing (Algolia Documentation)
Mean Average Precision (mAP) Explained (Built In)
Evaluation Metrics for Search and Recommendation Systems (Weaviate)