Introduction
Imagine you're working at a small or midsized business (SMB) with around 100 employees. As the business grows, new and current employees alike need up-to-date access to company information.
That information may live in customer service tools (e.g., Intercom, Salesforce), work documents (e.g., Google Drive, Microsoft Office 365), databases (e.g., PostgreSQL, MySQL), messaging apps (e.g., Gmail, Slack, Zoom), and other unstructured files (e.g., PDF, TXT).
The business therefore needs a way to streamline information retrieval across all of these data source platforms.
Example Scenario
To estimate prices and determine which plans suit SMBs in a similar situation, assume that your company requires connectors to five data sources and holds an estimated ~150,000 documents (structured and unstructured) that need about 350 GB of storage. Each of the 100 employees is expected to run an average of 10 queries per workday, for a total of 1,000 queries per day. Employees may ask natural language questions, so the search service must support semantic search. Finally, assume that full data synchronization and updates run for one hour per week, totaling 52 hours per year.
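The scenario's figures can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers stated above; the average document size is a derived value, and treating 1 GB as 1,000 MB is an assumption, since the scenario does not say which unit convention it uses.

```python
# Figures taken from the example scenario.
EMPLOYEES = 100
QUERIES_PER_EMPLOYEE_PER_DAY = 10
DOCUMENTS = 150_000
STORAGE_GB = 350
SYNC_HOURS_PER_WEEK = 1
WEEKS_PER_YEAR = 52


def scenario_summary() -> dict:
    """Derive daily query load, yearly sync time, and average document size."""
    queries_per_day = EMPLOYEES * QUERIES_PER_EMPLOYEE_PER_DAY
    sync_hours_per_year = SYNC_HOURS_PER_WEEK * WEEKS_PER_YEAR
    # Assumption: decimal units (1 GB = 1,000 MB).
    avg_document_mb = STORAGE_GB * 1000 / DOCUMENTS
    return {
        "queries_per_day": queries_per_day,
        "sync_hours_per_year": sync_hours_per_year,
        "avg_document_mb": avg_document_mb,
    }


if __name__ == "__main__":
    for name, value in scenario_summary().items():
        print(f"{name}: {value:,.2f}")
```

The derived average of roughly 2.3 MB per document is only a mean. Individual files can be far larger, which matters when comparing per-item ingestion limits.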
Article Objective
This article will explore some of the most common retrieval solutions and evaluate how each can address the needs of a growing business. We will provide a comparative overview of the storage and search features of five passage retrieval options and one document retrieval option, along with a short description of each.
Understanding Document and Passage Retrieval
Before comparing the methods, it’s important to clarify what is meant by document and passage retrieval. Document retrieval refers to the process of retrieving entire documents by searching through them as a whole. While this can be useful, it may not always be practical, as it can be difficult to quickly locate specific information within a large document. On the other hand, passage retrieval focuses on retrieving smaller, specific sections of a document that match the query, making it easier to quickly find the information you need.
Document retrieval can still be relevant because it can be adapted to function similarly to passage retrieval by segmenting documents into smaller chunks during preprocessing. While this approach can simulate passage retrieval, it may require more effort than necessary and could introduce additional complexity. For this reason, we have decided to include a document retrieval method, as it can still be an option in certain scenarios where preprocessing and chunking are feasible.
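To make the chunking idea concrete, here is a minimal sketch of splitting a document into overlapping word-based passages before indexing. The function name, the chunk size, and the overlap are illustrative assumptions, not settings taken from any of the services discussed.

```python
def chunk_document(text: str, max_words: int = 100, overlap: int = 20) -> list[str]:
    """Split text into passages of at most max_words words.

    Consecutive passages share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one passage.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last words are already covered
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"word{i}" for i in range(250))
    for i, passage in enumerate(chunk_document(sample)):
        print(i, len(passage.split()), passage.split()[0])
```

Each passage can then be indexed as its own record, ideally with a reference back to the parent document so results can link to the full source.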
Service Features Overview
Below is a brief side-by-side comparison of the data ingestion and processing features offered by each service. For more detailed information, please refer to the specific section.
Overview of Data & Processing Features
Service / Product Name | Supported Data Types | Max Ingestible Data Size (per Item) | OCR and ASR Support | Image Analysis Capabilities | Role-Based Access Control (RBAC) | Multilingual Capabilities | Scalability |
Amazon Kendra | Structured, Unstructured | Using a connector: 50 MB / Document; using Batch API: 5 MB / Document | Both | No[^1] | Yes | Yes | Yes |
Azure AI Search | Structured, Unstructured | Regular: 16 MB / Document; Blob: 16 - 256 MB / File [^2] | OCR | Yes | Yes | Yes | Yes |
Coveo | Structured, Unstructured | 6 MB / Item; 256 MB / File Container [^3] | OCR | Yes | Yes | Yes | Not specified |
Elasticsearch | Structured, Unstructured | 100 MB / Document | None | No[^4] | Yes | Implementable [^5] | Yes |
Elastic Enterprise Search | Structured, Unstructured | 100 MB / Document | None | No[^4] | Yes | Implementable [^5] | Yes |
Algolia | Structured only | 10 or 100 KB / Record [^7] | None | No | Yes | Yes | Yes |
Supported Data Types
It’s important to consider whether or not the service supports structured, unstructured, or both types of data.
- Structured data refers to data with a fixed schema and predefined format. It is typically stored in tabular formats such as SQL databases, Excel spreadsheets, or predefined JSON objects.
- Unstructured data has no predefined format, offering more flexibility. This includes emails, text documents, images, videos, and message text files.
Max Ingestible Data Size (per Item)
This feature indicates the maximum amount of data that can be ingested in a single operation or document. The data size limit can vary depending on the connector, API method, or file type used. It's important to consider these limits when planning data ingestion, as large items may need to be split into smaller chunks before they can be processed.
[^2] Azure AI Search has multiple plans. The Free and Basic plans support 16 MB per blob file; S1 supports 128 MB per blob file; S2, S3, L1, and L2 support 256 MB per blob file. Click here to learn more about indexer limits for the plans.
[^3] Coveo file containers are temporary, private, and encrypted Amazon S3 data structures that you can use to safely upload content. They are typically used for pushing batches of items or larger items.
[^7] Algolia records are object-like collections of attributes, each with a name and a value. Depending on the Algolia plan, a single record may be limited to 10 KB (in the Free Build plan) or 100 KB (in the Grow, Premium, and Elevate plans).
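As a rough planning aid, the per-item limits from the table can be encoded and checked against a file size. This is a sketch only: the dictionary keys are labels made up for this example, the limits are copied from the table above, and treating 1 MB as 1,024 × 1,024 bytes is an assumption, since each vendor may count sizes differently.

```python
KB = 1024        # assumption: binary units
MB = 1024 * KB

# Per-item limits copied from the comparison table (labels are illustrative).
PER_ITEM_LIMIT_BYTES = {
    "Amazon Kendra (connector)": 50 * MB,
    "Amazon Kendra (Batch API)": 5 * MB,
    "Azure AI Search (regular)": 16 * MB,
    "Coveo (item)": 6 * MB,
    "Elasticsearch": 100 * MB,
    "Algolia (Free Build)": 10 * KB,
    "Algolia (paid plans)": 100 * KB,
}


def services_accepting(size_bytes: int) -> list[str]:
    """Return the table entries whose per-item limit fits an item of this size."""
    return [name for name, limit in PER_ITEM_LIMIT_BYTES.items() if size_bytes <= limit]


def parts_needed(size_bytes: int, limit_bytes: int) -> int:
    """Minimum number of pieces if an item is split purely by byte count."""
    return max(1, -(-size_bytes // limit_bytes))  # ceiling division


if __name__ == "__main__":
    item = 20 * MB
    print(services_accepting(item))
    print(parts_needed(item, PER_ITEM_LIMIT_BYTES["Coveo (item)"]))
```

Splitting purely by byte count is a simplification. In practice, chunks are usually cut on content boundaries such as pages, sections, or paragraphs.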
OCR and ASR Support
You might want to ingest and search more complex data, such as PDFs, images containing text, or audio files. This is where OCR and ASR support can help. Services that support these file types typically apply OCR or ASR during ingestion and store the extracted text as metadata or as separate searchable fields.
- OCR (Optical Character Recognition) is the process of extracting text from images that contain text, such as scanned pages of a book.
- ASR (Automatic Speech Recognition) converts human speech into written text, often used for meeting or presentation audio recordings.
Image Analysis Capabilities
This feature refers to the ability of the service to analyze and extract insights from images, often through techniques such as image recognition, visual search, and similarity search. Services that support image analysis can identify objects, text, and patterns within images, allowing for more advanced search capabilities beyond traditional text-based queries.
Some of the products do not provide native support, but tutorials show how to implement it with additional code and additional services.
[^1] Amazon Kendra provides example code and a setup tutorial for building an image search engine with the help of Amazon Rekognition, although this involves training a model. The resulting search engine finds images based on a text description of what you are looking for. Click here for more details.
[^4] Elasticsearch provides example code to set up image similarity search using image embeddings. Click here for more details.
Role-Based Access Control (RBAC)
This feature describes the ability to enforce control over data access, restricting it based on the user's role. For example, certain documents or content may only be accessible to higher-level managers, ensuring that sensitive information is not readily available to all employees.
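Conceptually, role-based filtering can be sketched as attaching a set of allowed roles to each document and dropping results the current user may not see. The `Document` class and role names below are hypothetical. Each service has its own access control model, and this sketch does not represent any of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    allowed_roles: frozenset  # roles permitted to see this document


def visible_results(results: list, user_roles: set) -> list:
    """Keep only the documents that share at least one role with the user."""
    roles = set(user_roles)
    return [doc for doc in results if doc.allowed_roles & roles]


if __name__ == "__main__":
    results = [
        Document("Holiday policy", frozenset({"employee", "manager"})),
        Document("Salary bands", frozenset({"manager"})),
    ]
    print([d.title for d in visible_results(results, {"employee"})])
    print([d.title for d in visible_results(results, {"manager"})])
```

Filtering should happen inside the search service, or before results leave the backend, so that restricted titles and snippets never reach an unauthorized user.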
Multilingual Capabilities
If your company operates in a multilingual environment or deals with multilingual documents, consider services that support indexing and searching across multiple languages.
[^5] Elasticsearch does not offer multilingual search as a turnkey feature, but it can be implemented, for example with its built-in language analyzers. Click here for more details.
Scalability
As your data grows, it's important to consider solutions that can scale effectively, allowing the system to handle larger volumes of documents and information.
Overview of Search & UI Features
Service / Product Name | Search Method (Keyword or Semantic) | Autocomplete / Query Suggestions | Relevance Tuning | Includes UI (Full Application or UI Components) |
Amazon Kendra | Both | Yes | Yes | Both |
Azure AI Search | Both | Yes | Yes | None |
Coveo | Both | Yes | Yes | UI Components |
Elasticsearch | Both | Yes | Yes | None |
Elastic Enterprise Search | Both | Yes | Yes | UI Components |
Algolia | Both[^1] | Yes | Yes | Both |
Search Method (Keyword or Semantic)
Ideally, a service offers both keyword and semantic search, which makes searching more powerful.
- Keyword Search: This is the traditional method of searching, where search results are based on exact word matches. The search engine looks for documents containing the specified keywords, making it suitable for simple, direct queries. It's fast but may miss context or variations of the term.
- Semantic Search: In contrast, semantic search understands the meaning behind words and can return results based on the intent rather than exact matches. For example, when searching for "pants," the system might also show results for "jeans" or "trousers." This method often uses techniques like vector searching or natural language processing (NLP) to map words to semantic meanings, improving search relevance and flexibility.
[^1] Algolia can support semantic search, but through the NeuralSearch service, which is available only with their Elevate plan.
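The difference between the two search methods can be illustrated with the "pants" example. In this toy sketch, keyword search matches exact tokens, while semantic search compares vectors by cosine similarity. The two-dimensional vectors are invented by hand for illustration. A real system would obtain embeddings from a trained model.

```python
import math

DOCS = {
    "doc1": "blue jeans on sale",
    "doc2": "formal trousers for work",
    "doc3": "ceramic coffee mugs",
}

# Hypothetical hand-made "embeddings": axis 0 ~ clothing, axis 1 ~ kitchenware.
DOC_VECTORS = {"doc1": (0.9, 0.1), "doc2": (0.8, 0.2), "doc3": (0.05, 0.95)}
QUERY_VECTORS = {"pants": (1.0, 0.0)}


def keyword_search(query: str, docs: dict) -> list:
    """Return ids of documents containing at least one exact query token."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items() if terms & set(text.lower().split())]


def cosine(a: tuple, b: tuple) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def semantic_search(query_vec: tuple, doc_vecs: dict, threshold: float = 0.8) -> list:
    """Return ids of documents whose vector is close to the query, best first."""
    scored = sorted(((cosine(query_vec, v), d) for d, v in doc_vecs.items()), reverse=True)
    return [doc_id for score, doc_id in scored if score >= threshold]


if __name__ == "__main__":
    print(keyword_search("pants", DOCS))                          # no exact match
    print(semantic_search(QUERY_VECTORS["pants"], DOC_VECTORS))   # jeans and trousers
```

Keyword search finds nothing for "pants" because no document contains that exact word, while the vector comparison surfaces the jeans and trousers documents.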
Autocomplete / Query Suggestions
Autocomplete and query suggestions show users suggested queries as they type, based on previous searches, popular terms, or context, helping them find accurate results more quickly.
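A very simple form of query suggestion is prefix matching against past queries, ranked by popularity. The sketch below assumes a hypothetical log of query counts. Production services typically use far richer signals.

```python
def suggest(prefix: str, query_counts: dict, limit: int = 3) -> list:
    """Suggest past queries starting with `prefix`, most popular first."""
    prefix = prefix.lower().strip()
    if not prefix:
        return []
    matches = [(q, n) for q, n in query_counts.items() if q.lower().startswith(prefix)]
    # Sort by descending count, then alphabetically for a stable order.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:limit]]


if __name__ == "__main__":
    # Hypothetical query log: query text -> number of times it was searched.
    counts = {
        "vacation policy": 42,
        "vacation request form": 17,
        "vpn setup": 30,
        "expense report": 12,
    }
    print(suggest("vac", counts))
    print(suggest("v", counts, limit=2))
```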
Relevance Tuning
Relevance tuning refers to the ability to adjust the weight of specific fields and prioritize certain types of documents based on their importance.
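As a minimal illustration of field weighting, the sketch below scores documents by counting query-term occurrences per field and multiplying by a per-field weight. The field names and weights are hypothetical, and real engines use more sophisticated scoring functions than raw term counts.

```python
def score(doc: dict, query: str, weights: dict) -> float:
    """Weighted count of query-term occurrences across the weighted fields."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in weights.items():
        tokens = doc.get(field, "").lower().split()
        total += weight * sum(tokens.count(term) for term in terms)
    return total


def rank(docs: list, query: str, weights: dict) -> list:
    """Return documents ordered from highest to lowest score."""
    return sorted(docs, key=lambda d: score(d, query, weights), reverse=True)


if __name__ == "__main__":
    docs = [
        {"id": "a", "title": "Office map",
         "body": "Where to find the expense policy and the expense form"},
        {"id": "b", "title": "Expense policy", "body": "Rules for travel"},
    ]
    # Hypothetical tuning: a match in the title counts three times as much.
    print([d["id"] for d in rank(docs, "expense", {"title": 3.0, "body": 1.0})])
    print([d["id"] for d in rank(docs, "expense", {"title": 1.0, "body": 1.0})])
```

With the title boosted, the document titled "Expense policy" ranks first. With equal weights, the document that merely mentions the term more often in its body wins.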
Includes UI (Full Application or UI Components)
This feature indicates whether the service includes a full user interface (UI) application or provides UI components that can be integrated into existing systems. A full application would be a ready-to-use platform with all the necessary features, while UI components are smaller, customizable elements like search bars, filters, and result displays that can be embedded into a larger application. This is important to consider if you're looking to build a solution from scratch or if you need a complete out-of-the-box solution.
Overview of Commonly Supported Data Sources
In addition to common file types like PDF, HTML, XML, CSV, PPT, DOC, DOCX, and TXT (except Algolia, which only supports JSON and CSV), the following data sources are also supported. (For a full list, see the links attached to each service name).
Service / Product Name | Web Crawler | Cloud Storage & Services | Databases | Communication & Collaboration Platforms | Development & Infrastructure | Business & Workflow Tools | Others |
Amazon Kendra | Amazon Kendra Web Crawler | Amazon S3; Confluence; Dropbox; Google Drive; Microsoft (OneDrive, SharePoint) | Amazon RDS; MySQL; Oracle; PostgreSQL | Gmail; Jira; Microsoft (Exchange, Teams); Salesforce; Slack | Docker; GitHub; Kubernetes | Box; Salesforce; Trello | - |
Azure AI Search | Selenium; Website Crawler | Amazon S3; Confluence; Google Drive; Microsoft (OneDrive, SharePoint, Azure Blob Storage, Azure SQL Database) | Amazon RDS; Microsoft SQL; MySQL; Oracle; PostgreSQL | Gmail; Jira; Microsoft (Exchange, Teams); Slack; Trello; Twitter | Git | Box; Salesforce; SAP | - |
Coveo | Coveo Web Source Crawler module | Amazon S3; Box; Confluence; Google (Drive); Microsoft (SharePoint) | Databricks; Microsoft SQL; MySQL; PostgreSQL; Oracle | Jira; Slack | - | Box; Salesforce; SAP | REST APIs; YouTube |
Elasticsearch | - | Amazon S3; Azure Blob Storage; Confluence; Dropbox; Google (Cloud, Drive); OneDrive; SharePoint | Microsoft SQL; MongoDB; MySQL; Oracle; PostgreSQL; Redis | Gmail; Jira; Outlook; Microsoft Teams; Salesforce; Slack; Zoom | GitHub | Box; Salesforce | Notion |
Elastic Enterprise Search | Elastic Web Crawler | Amazon S3; Azure Blob Storage; Confluence; Dropbox; Google (Cloud, Drive); OneDrive; SharePoint | Microsoft SQL; MongoDB; MySQL; Oracle; PostgreSQL; Redis | Gmail; Jira; Outlook; Microsoft Teams; Salesforce; Slack; Zoom | GitHub | Box; Jira; Salesforce; Teams | Notion |
Algolia (sources listed only on the dashboard) | Algolia Web Crawlers | Google (Analytics 4, BigQuery, BigQuery Export) | MySQL | - | Firebase; Segment; Google Tag Manager | BigCommerce; Salesforce; Shopify | The only file formats accepted by Algolia are JSON and CSV (others accept generic files) |
Passage Retrieval Services
Below are five passage retrieval solutions to consider for enterprise search.
1. Amazon Kendra
Amazon Kendra is a cloud-based AI-powered enterprise search service from Amazon Web Services (AWS) designed to help organizations quickly search, discover, and retrieve relevant information from a variety of data sources.
Features
Amazon Kendra leverages machine learning to process queries, enabling natural language querying rather than relying on exact keyword matches. It uses models for reading comprehension and semantic matching to improve search results.
Kendra includes an Experience Builder, allowing users to create applications using UI components. Its search index can also be integrated into custom retrieval-augmented generation (RAG) applications.
As highlighted in the features comparison table, Amazon Kendra supports ASR and OCR indexing capabilities. This is achieved through integration with Amazon Textract and Amazon Transcribe, which process files, extract text, and index it to make the content searchable.
Amazon Kendra also provides APIs so that developers can integrate it into their own RAG systems. For more information, check out the Amazon Kendra API documentation.
Estimated Pricing
Amazon Kendra pricing is based on storage units, query units, and connectors. The minimum cost includes a Base Index with one Storage Unit and one Query Unit, with additional charges for extra units or connectors.
Kendra offers three types of indexes, ranked from least to most advanced: Basic Developer Edition (for proof-of-concept solutions), Basic Enterprise Edition (which includes semantic search capabilities for production workloads), and Gen AI Enterprise Edition (for the highest accuracy).
Based on the example case, the Basic Enterprise Edition would be a suitable option, given its storage and querying capabilities. The estimated annual cost for this plan is $79,739.20.
For more information on pricing and features, visit this link.
2. Azure AI Search
Azure AI Search (formerly called Azure Cognitive Search) is a cloud-based enterprise information retrieval system that can be used for RAG-based applications, document search, and data exploration.
Features
Azure AI Search offers hybrid searching, combining both keyword and semantic search capabilities. Its ingestion process includes an integrated vectorization extension, which provides end-to-end data processing tailored for RAG usage, including chunking, vectorizing, and indexing.
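To give an intuition for how a hybrid system can merge a keyword ranking with a semantic ranking, the sketch below implements reciprocal rank fusion (RRF), one common fusion technique. It is shown as a generic illustration under assumed inputs, not as a description of any service's internal implementation. The constant `k = 60` is a conventional default rather than a value taken from this article.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of document ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by several methods rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_ranking = ["a", "b", "c"]    # hypothetical keyword results
    semantic_ranking = ["c", "a", "d"]   # hypothetical semantic results
    print(reciprocal_rank_fusion([keyword_ranking, semantic_ranking]))
```

Documents "a" and "c" appear in both lists and therefore outrank "b" and "d", which each appear in only one.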
While Azure AI Search does not provide a built-in UI for interaction, it offers APIs for searching and accessing search results. For more details, check out the Azure AI Search REST API documentation.
Estimated Pricing
Azure AI Search offers four tier options: Free, Basic, Standard, and Storage Optimized, each with fixed pricing. To determine the most suitable plan, businesses need to consider their storage, number of indices, and scale-out limits, as pricing is based on storage rather than querying. For more details about the pricing plans, visit this link.
For the example case, the Standard S2 plan would be the most appropriate choice, as it supports up to 512 GB of storage, meeting the data requirements. The estimated annual cost for this plan is $11,773.44.
3. Coveo
Coveo provides enterprise search software specifically designed for commerce, websites, workplace (enterprise search), and service (customer self-service).
Features
Coveo’s Enterprise AI-Search is a hybrid search system that combines keyword search with semantic search capabilities.
Access to Coveo’s index searching is available through the Search API or by using Coveo’s provided UI visual search components.
For more details, check out the Coveo API documentation.
Estimated Pricing
Coveo offers both a Pro and an Enterprise plan. Pricing for the Enterprise plan is not publicly disclosed and must be discussed on a case-by-case basis. Forum reports, however, describe Coveo as a very expensive service, potentially more costly than Algolia.
Pricing may ultimately depend on factors specific to the organization’s needs. For more details, visit the Coveo pricing page.
4. Elasticsearch
Elasticsearch, developed by Elastic, is a popular open-source search and analytics engine built on Apache Lucene. It is used for storing, searching, and analyzing both structured and unstructured data, including logs, metrics, and security events.
Features
Elasticsearch offers full-text search in near real-time and supports semantic search using natural language processing (NLP) and vector search. It can also serve as a vector database, storing and querying vectorized data using embeddings from third-party NLP models.
Elasticsearch provides a REST API for interacting with the search engine. For more details, check out the Elasticsearch REST API documentation.
Estimated Pricing
Elasticsearch itself is free to use when it is run independently of Elastic Cloud.
The actual cost therefore depends on the chosen deployment method and hosting options, and on the effort of managing the cluster yourself. Self-hosting is not guaranteed to be more cost-effective than the Elastic Cloud alternative, Elastic Enterprise Search.
5. Elastic Enterprise Search
Not to be confused with Elasticsearch, Elastic Enterprise Search is an additional service offered through Elastic Cloud that uses Elasticsearch under the hood and provides additional features.
Features
One additional feature of Elastic Enterprise Search is its built-in Search UI, which allows for easy integration of pre-made search bars into applications or websites.
Regarding data sources and ingestion, Elastic Enterprise Search offers an Elastic Web Crawler for data ingestion from websites. Its connectors are managed by the Elastic Cloud service, rather than being self-managed, making setup and maintenance simpler.
For more details, check out the Elastic Enterprise Search API documentation.
Estimated Pricing
Since Elastic Enterprise Search is an additional service within Elastic Cloud, pricing depends on the chosen Elastic Cloud plan. The minimum cost is $95, $109, $125, or $175 per month, depending on the plan.
Given the conditions of our example business, the Platinum Plan would be the most suitable option due to its support for semantic searching. To estimate costs, visit the Elastic Cloud pricing page.
Document Retrieval Services
1. Algolia
Algolia is a hosted search engine commonly used for e-commerce, SaaS applications, and knowledge bases.
It’s important to note that Algolia supports indexing and querying over structured data only. This means it may not be ideal for unstructured data retrieval, as unstructured data would need to be converted into structured formats for use with Algolia.
Features
Algolia provides lexical searching and can also support semantic searching, but only through its NeuralSearch service, which is available exclusively with the Elevate plan. Additionally, Algolia offers UI widgets and API clients for easy integration.
For more details, check out the Algolia REST API documentation.
Estimated Pricing
Algolia has a reputation for being a very expensive search engine.
Algolia's pricing plans include Free, Grow, Premium, and Elevate. Since we want to search semantically, the NeuralSearch service is required. It is only available with the Elevate plan, whose pricing is not publicly disclosed and must be discussed with sales. To learn more about the plans, check out this link.
Choosing the Right Method: Pros and Cons
Pros & Cons and Estimated Pricing of Services
The features table provides an overview of some general features to consider when choosing a service. These are not the only features of the services but highlight key aspects that might influence your decision. We also include the estimated pricing overview.
Service / Product Name | Advantages | Disadvantages | Estimated Pricing (per year) |
Amazon Kendra | Supports Automatic Speech Recognition (ASR); Image search can be added with Amazon Rekognition; Easy to set up with a user-friendly interface | Can be expensive for small businesses with large data volumes due to pricing based on indexed documents and monthly queries | $79,739.20 |
Azure AI Search | Provides image analysis | Expensive for large-scale use due to fixed monthly pricing structure; No built-in search UI components; Additional features may increase overall cost | $11,773.44 |
Coveo | Provides image analysis | Cumbersome and time-consuming setup for some users; Steep learning curve for new developers; Slow indexing | Consult sales |
Elasticsearch | Near real-time search; No direct cost for the service itself | Resource-intensive and self-managed; Complex setup and maintenance; No search UI components; No web crawler; Better suited for small use cases, may struggle with large-scale or streaming data | Depends on own infrastructure |
Elastic Enterprise Search | Near real-time search | Can be more expensive than Elasticsearch (requires Elastic Cloud service) | Consult sales or check out the price estimator |
Algolia | Straightforward UI; Easy setup and configuration | Requires structured records; Very expensive; Limited record size; High cost for scaling | Consult sales |
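For the two services with a disclosed estimate, the annual figures can be broken down into a monthly cost and a rough cost per query. The sketch below uses the annual estimates from the table and the 1,000 queries per day from the example scenario. Counting 365 query days per year is an assumption, and fewer days (for example, workdays only) would raise the per-query figure.

```python
# Annual estimates copied from the pricing table.
ANNUAL_ESTIMATE_USD = {
    "Amazon Kendra": 79_739.20,
    "Azure AI Search": 11_773.44,
}
QUERIES_PER_DAY = 1_000  # from the example scenario


def monthly_cost(annual_usd: float) -> float:
    """Spread an annual estimate evenly over 12 months."""
    return annual_usd / 12


def cost_per_query(annual_usd: float, queries_per_day: int = QUERIES_PER_DAY,
                   days_per_year: int = 365) -> float:
    """Rough cost per query. Assumption: queries run on `days_per_year` days."""
    return annual_usd / (queries_per_day * days_per_year)


if __name__ == "__main__":
    for service, annual in ANNUAL_ESTIMATE_USD.items():
        print(f"{service}: ${monthly_cost(annual):,.2f}/month, "
              f"${cost_per_query(annual):.3f}/query")
```

These figures compare only the estimated subscription cost. They leave out engineering time, which matters for API-only options that require building a search interface.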
Choosing a service
Most retrieval services offer similar capabilities, but depending on your business needs, some may be better suited than others:
- Amazon Kendra is ideal for businesses handling image and audio files, thanks to its OCR and ASR support. It also offers a no-code UI for customizing search applications.
- Azure AI Search provides a more affordable alternative to Kendra, especially for businesses that only need an API rather than a full pre-built search application.
- Elasticsearch is a great choice for businesses with development expertise that prefer a self-managed, open-source search solution.
- Algolia is suitable for businesses that need a search interface for structured data, though it may come at a high cost.
Conclusion
When selecting an enterprise search solution, factors like storage, data type, connectors, search capabilities, and query frequency are critical. This article provides an overview of popular options to help guide your decision.
If you're interested in how search engines like Algolia and Elasticsearch handle natural language queries, check out this article, where we evaluate them using the TREC-COVID dataset and standard retrieval metrics.