Enterprise Information Retrieval Solution

Introduction

Imagine you're working at a small or midsized business (SMB) with around 100 employees. As the business grows, both new and current employees constantly need up-to-date access to company information.

This information lives in many data sources, including customer service tools (e.g., Intercom, Salesforce), work documents (e.g., Google Drive, Microsoft Office 365), databases (e.g., PostgreSQL, MySQL), messaging apps (e.g., Gmail, Slack, Zoom), and other unstructured files (e.g., PDF, TXT).

This creates a need for a method to streamline enterprise information retrieval from various data source platforms.

Example Scenario

To estimate prices and determine which plans suit an SMB in this situation, assume that your company needs connectors to five data sources and holds roughly 150,000 documents (structured and unstructured) requiring about 350 GB of storage. Each employee is expected to run an average of 10 queries per workday, for a total of 1,000 queries per day. Employees may ask natural language questions, so the search service must support semantic search. Finally, assume that a full data synchronization and update runs for one hour per week, or 52 hours per year.
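The scenario's figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the variable names are illustrative, and the average document size is a derived figure that assumes decimal units (1 GB = 1,000 MB).

```python
# Workload figures from the example scenario (names are illustrative).
EMPLOYEES = 100
QUERIES_PER_EMPLOYEE_PER_DAY = 10
DOCUMENTS = 150_000
STORAGE_GB = 350
SYNC_HOURS_PER_WEEK = 1
WEEKS_PER_YEAR = 52

queries_per_day = EMPLOYEES * QUERIES_PER_EMPLOYEE_PER_DAY
sync_hours_per_year = SYNC_HOURS_PER_WEEK * WEEKS_PER_YEAR

# Assumption: 1 GB = 1,000 MB (decimal units).
avg_document_mb = STORAGE_GB * 1000 / DOCUMENTS

print(f"Queries per day: {queries_per_day}")
print(f"Sync hours per year: {sync_hours_per_year}")
print(f"Average document size: {avg_document_mb:.2f} MB")
```

The average hides the spread: individual files can be far larger than the mean, which matters for the per-item ingestion limits discussed later.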

Article Objective

This article will explore some of the most common retrieval solutions and evaluate how each can address the needs of a growing business. We will provide a comparative overview of the storage and search features of five passage retrieval options and one document retrieval option, along with a short description of each.

Understanding Document and Passage Retrieval

Before comparing the methods, it’s important to clarify what is meant by document and passage retrieval. Document retrieval refers to the process of retrieving entire documents by searching through them as a whole. While this can be useful, it may not always be practical, as it can be difficult to quickly locate specific information within a large document. On the other hand, passage retrieval focuses on retrieving smaller, specific sections of a document that match the query, making it easier to quickly find the information you need.

Document retrieval can still be relevant because it can be adapted to function similarly to passage retrieval by segmenting documents into smaller chunks during preprocessing. While this approach can simulate passage retrieval, it may require more effort than necessary and could introduce additional complexity. For this reason, we have decided to include a document retrieval method, as it can still be an option in certain scenarios where preprocessing and chunking are feasible.
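As a rough illustration of the chunking step described above, the sketch below splits a document into overlapping word windows. The function name, chunk size, and overlap are assumptions chosen for the example; real pipelines often split on sentences, headings, or tokens instead.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# A tiny 10-word "document" keeps the output easy to inspect.
document = " ".join(f"word{i}" for i in range(10))
passages = chunk_words(document, chunk_size=4, overlap=1)
for passage in passages:
    print(passage)
```

Each chunk would then be indexed as its own record, usually with a reference back to the source document so that results can link to the original.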

Service Features Overview

Below is a brief side-by-side comparison of the data ingestion and processing features offered by each service. For more detailed information, please refer to the specific section.

Overview of Data & Processing Features

| Service / Product Name | Supported Data Types | Max Ingestible Data Size (per Item) | OCR and ASR Support | Image Analysis Capabilities | Role-Based Access Control (RBAC) | Multilingual Capabilities | Scalability |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Amazon Kendra | Structured, Unstructured | Using a connector: 50 MB / document; using Batch API: 5 MB / document | Both | No[^1] | Yes | Yes | Yes |
| Azure AI Search | Structured, Unstructured | Regular: 16 MB / document; Blob: 16 - 256 MB / file [^2] | OCR | Yes | Yes | Yes | Yes |
| Coveo | Structured, Unstructured | 6 MB / item; 256 MB / file container [^3] | OCR | Yes | Yes | Yes | Not specified |
| Elasticsearch | Structured, Unstructured | 100 MB / document | None | No[^4] | Yes | Implementable [^5] | Yes |
| Elastic Enterprise Search | Structured, Unstructured | 100 MB / document | None | No[^4] | Yes | Implementable [^5] | Yes |
| Algolia | Structured only | 10 or 100 KB / record [^7] | None | No | Yes | Yes | Yes |

Supported Data Types

It’s important to consider whether or not the service supports structured, unstructured, or both types of data.

  • Structured data refers to data with a fixed schema and predefined format. It is typically stored in tabular formats such as SQL databases, Excel spreadsheets, or predefined JSON objects.
  • Unstructured data has no predefined format, offering more flexibility. This includes emails, text documents, images, videos, and message text files.

Max Ingestible Data Size (per Item)

This feature indicates the maximum amount of data that can be ingested in a single operation or document. The data size limit can vary depending on the connector, API method, or file type used. It's important to consider these limits when planning data ingestion, as large items may need to be split into smaller chunks before they can be processed.

[^2] Azure AI Search has multiple plans. The Free and Basic plans support 16 MB per blob file; S1 supports 128 MB per blob file; S2, S3, L1, and L2 support 256 MB per blob file. See the Azure AI Search documentation on indexer limits for details on each plan.

[^3] Coveo file containers are temporary, private, and encrypted Amazon S3 data structures that you can use to safely upload content. They are typically used for pushing batches of items or larger items.

[^7] Algolia records are object-like collections of attributes, each with a name and a value. Depending on the Algolia plan, a single record may be limited to 10 KB (in the Free Build plan) or 100 KB (in the Grow, Premium, and Elevate plans).
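To make the effect of these limits concrete, the sketch below estimates how many pieces a single item would need to be split into for several of the limits listed in the table. The limits are copied from the table; the 12 MB item, the helper name, and the use of binary megabytes are assumptions for illustration.

```python
import math

KB = 1024       # assumption: binary kilobytes
MB = 1024 * KB  # assumption: binary megabytes

# Per-item limits taken from the comparison table above.
MAX_ITEM_BYTES = {
    "Amazon Kendra (Batch API)": 5 * MB,
    "Azure AI Search (regular document)": 16 * MB,
    "Coveo (item)": 6 * MB,
    "Elasticsearch": 100 * MB,
    "Algolia (paid plans)": 100 * KB,
}


def pieces_needed(item_bytes: int, limit_bytes: int) -> int:
    """Return how many pieces an item must be split into to fit the limit."""
    return max(1, math.ceil(item_bytes / limit_bytes))


item_size = 12 * MB  # a hypothetical 12 MB document
for service, limit in MAX_ITEM_BYTES.items():
    print(f"{service}: {pieces_needed(item_size, limit)} piece(s)")
```

This only counts pieces by raw size. In practice a document would be split on meaningful boundaries such as pages or sections, so the real number of pieces may be higher.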

OCR and ASR Support

You may want to ingest and search more complex data, such as PDFs, images containing text, or audio files. This is where OCR and ASR support helps. Services that support these file types typically apply OCR or ASR and store the extracted text as metadata or as separate searchable fields.

  • OCR (Optical Character Recognition) is the process of extracting text from images that contain text, such as scanned pages of a book.
  • ASR (Automatic Speech Recognition) converts human speech into written text, often used for meeting or presentation audio recordings.

Image Analysis Capabilities

This feature refers to the ability of the service to analyze and extract insights from images, often through techniques such as image recognition, visual search, and similarity search. Services that support image analysis can identify objects, text, and patterns within images, allowing for more advanced search capabilities beyond traditional text-based queries.

Some of the products don't provide native support, but tutorials show how to implement it with additional code and attached services.

[^1] Amazon Kendra provides example code and a setup tutorial for building an image search engine with the help of Amazon Rekognition, although this involves training a model. The resulting engine finds images from text descriptions of what you're looking for.

[^4] Elasticsearch provides example code to set up image similarity search using image embeddings.

Role-Based Access Control (RBAC)

This feature describes the ability to enforce control over data access, restricting it based on the user's role. For example, certain documents or content may only be accessible to higher-level managers, ensuring that sensitive information is not readily available to all employees.
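A minimal sketch of the idea, assuming each document carries a set of roles allowed to see it. The `Document` class, role names, and `filter_by_role` helper are hypothetical; managed services implement this through their own access control lists and identity integrations.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    # Roles that are allowed to see this document (hypothetical field).
    allowed_roles: set[str] = field(default_factory=set)


def filter_by_role(results: list[Document], user_roles: set[str]) -> list[Document]:
    """Keep only documents that share at least one role with the user."""
    return [doc for doc in results if doc.allowed_roles & user_roles]


results = [
    Document("Employee handbook", {"employee", "manager"}),
    Document("Salary bands", {"manager"}),
]

visible = filter_by_role(results, {"employee"})
print([doc.title for doc in visible])
```

Filtering at query time like this keeps restricted documents out of the result list entirely, rather than hiding them in the UI afterwards.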

Multilingual Capabilities

If your company operates in a multilingual environment or deals with multilingual documents, consider services that support indexing and searching across multiple languages.

[^5] Elasticsearch does not offer multilingual search as a single built-in feature, but it can be implemented, for example with language-specific analyzers or multilingual embedding models.

Scalability

As your data grows, it's important to consider solutions that can scale effectively, allowing the system to handle larger volumes of documents and information.

Overview of Search & UI Features

| Service / Product Name | Search Method (Keyword or Semantic) | Autocomplete / Query Suggestions | Relevance Tuning | Includes UI (Full Application or UI Components) |
| --- | --- | --- | --- | --- |
| Amazon Kendra | Both | Yes | Yes | Both |
| Azure AI Search | Both | Yes | Yes | None |
| Coveo | Both | Yes | Yes | UI Components |
| Elasticsearch | Both | Yes | Yes | None |
| Elastic Enterprise Search | Both | Yes | Yes | UI Components |
| Algolia | Both[^1] | Yes | Yes | Both |

Search Method (Keyword or Semantic)

Ideally, a service offers both keyword and semantic search, since the two methods complement each other.

  • Keyword Search: This is the traditional method of searching, where search results are based on exact word matches. The search engine looks for documents containing the specified keywords, making it suitable for simple, direct queries. It's fast but may miss context or variations of the term.
  • Semantic Search: In contrast, semantic search understands the meaning behind words and can return results based on the intent rather than exact matches. For example, when searching for "pants," the system might also show results for "jeans" or "trousers." This method often uses techniques like vector searching or natural language processing (NLP) to map words to semantic meanings, improving search relevance and flexibility.
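The difference between the two methods can be sketched with a toy example. The three-dimensional vectors below are made up by hand purely for illustration; a real system would obtain embeddings from a trained model, and the similarity threshold is likewise an assumption.

```python
import math

# Toy 3-dimensional "embeddings"; the values are invented for illustration.
EMBEDDINGS = {
    "pants":    [0.90, 0.10, 0.00],
    "jeans":    [0.85, 0.20, 0.05],
    "trousers": [0.88, 0.12, 0.02],
    "laptop":   [0.00, 0.10, 0.95],
}


def keyword_search(query: str, terms: list[str]) -> list[str]:
    """Return only the terms that match the query exactly."""
    return [term for term in terms if term == query]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def semantic_search(query: str, threshold: float = 0.9) -> list[str]:
    """Return terms whose embedding is close to the query's embedding."""
    query_vector = EMBEDDINGS[query]
    return [
        term for term, vector in EMBEDDINGS.items()
        if cosine(query_vector, vector) >= threshold
    ]


print(keyword_search("pants", list(EMBEDDINGS)))
print(semantic_search("pants"))
```

Keyword search returns only the exact match, while the vector comparison also surfaces "jeans" and "trousers" and still excludes the unrelated "laptop".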

[^1] Algolia can support semantic search, but through the NeuralSearch service, which is available only with their Elevate plan.

Autocomplete / Query Suggestions

Autocomplete and query suggestions show users suggested queries as they type, based on previous searches, popular terms, or context, helping them find accurate results more quickly.
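One simple way to produce suggestions is to rank previous queries that start with what the user has typed so far. The sketch below assumes a plain list of past queries and a hypothetical `suggest` helper; production services typically use dedicated index structures and also weigh context and popularity.

```python
from collections import Counter


def suggest(prefix: str, past_queries: list[str], limit: int = 3) -> list[str]:
    """Return the most frequent past queries that start with the prefix."""
    prefix = prefix.lower().strip()
    counts = Counter(query.lower().strip() for query in past_queries)
    matches = [(query, n) for query, n in counts.items() if query.startswith(prefix)]
    # Sort by descending frequency, then alphabetically for stable output.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [query for query, _ in matches[:limit]]


history = [
    "vacation policy",
    "vacation request form",
    "vacation policy",
    "vpn setup",
    "Vacation Policy",
]

print(suggest("vac", history))
print(suggest("v", history))
```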

Relevance Tuning

Relevance tuning refers to the ability to adjust the weight of specific fields and prioritize certain types of documents based on their importance.
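The effect of field weights can be shown with a small scoring function. This is a sketch under simple assumptions: the score is a weighted count of query-term occurrences, and the field names and weights are hypothetical. Real engines use more elaborate ranking functions, but the boosting principle is the same.

```python
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}  # hypothetical boost values


def score(doc: dict[str, str], query: str, weights: dict[str, float]) -> float:
    """Sum weighted counts of query-term occurrences across fields."""
    terms = query.lower().split()
    total = 0.0
    for field_name, weight in weights.items():
        words = doc.get(field_name, "").lower().split()
        total += weight * sum(words.count(term) for term in terms)
    return total


docs = [
    {"title": "Expense policy", "body": "How to file travel costs."},
    {"title": "Travel guide", "body": "The expense policy covers expense reports."},
]

query = "expense policy"
boosted = sorted(docs, key=lambda d: score(d, query, FIELD_WEIGHTS), reverse=True)
unboosted = sorted(docs, key=lambda d: score(d, query, {"title": 1.0, "body": 1.0}), reverse=True)

print("With title boost:", [d["title"] for d in boosted])
print("Without boost:   ", [d["title"] for d in unboosted])
```

With the title boost, the document titled "Expense policy" ranks first. With equal weights, the document that merely mentions the terms more often in its body overtakes it.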

Includes UI (Full Application or UI Components)

This feature indicates whether the service includes a full user interface (UI) application or provides UI components that can be integrated into existing systems. A full application would be a ready-to-use platform with all the necessary features, while UI components are smaller, customizable elements like search bars, filters, and result displays that can be embedded into a larger application. This is important to consider if you're looking to build a solution from scratch or if you need a complete out-of-the-box solution.

Overview of Commonly Supported Data Sources

In addition to common file types like PDF, HTML, XML, CSV, PPT, DOC, DOCX, and TXT (except Algolia, which only supports JSON and CSV), the following data sources are also supported. (For a full list, see the links attached to each service name).

| Service / Product Name | Web Crawler | Cloud Storage & Services | Databases | Communication & Collaboration Platforms | Development & Infrastructure | Business & Workflow Tools | Others |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Amazon Kendra | Amazon Kendra Web Crawler | Amazon S3, Confluence, Dropbox, Google Drive, Microsoft (OneDrive, SharePoint) | Amazon RDS, MySQL, Oracle, PostgreSQL | Gmail, Jira, Microsoft (Exchange, Teams), Salesforce, Slack | Docker, GitHub, Kubernetes | Box, Salesforce, Trello | - |
| Azure AI Search | Selenium, Website Crawler | Amazon S3, Confluence, Google Drive, Microsoft (OneDrive, SharePoint, Azure Blob Storage, Azure SQL Database) | Amazon RDS, Microsoft SQL, MySQL, Oracle, PostgreSQL | Gmail, Jira, Microsoft (Exchange, Teams), Slack, Trello, Twitter | Git | Box, Salesforce, SAP | - |
| Coveo | Coveo Web Source Crawler module | Amazon S3, Box, Confluence, Google (Drive), Microsoft (SharePoint) | Databricks, Microsoft SQL, MySQL, PostgreSQL, Oracle | Jira, Slack | - | Box, Salesforce, SAP | REST APIs, YouTube |
| Elasticsearch | - | Amazon S3, Azure Blob Storage, Confluence, Dropbox, Google (Cloud, Drive), OneDrive, SharePoint | Microsoft SQL, MongoDB, MySQL, Oracle, PostgreSQL, Redis | Gmail, Jira, Outlook, Microsoft Teams, Salesforce, Slack, Zoom | GitHub | Box, Salesforce | Notion |
| Elastic Enterprise Search | Elastic Web Crawler | Amazon S3, Azure Blob Storage, Confluence, Dropbox, Google (Cloud, Drive), OneDrive, SharePoint | Microsoft SQL, MongoDB, MySQL, Oracle, PostgreSQL, Redis | Gmail, Jira, Outlook, Microsoft Teams, Salesforce, Slack, Zoom | GitHub | Box Business, Jira, Salesforce, Teams | Notion |
| Algolia (sources listed only on the dashboard) | Algolia Web Crawlers | Google (Analytics 4, BigQuery, BigQuery Export) | MySQL | - | Firebase, Segment, Google Tag Manager | BigCommerce, Salesforce, Shopify | The only file formats accepted by Algolia are JSON and CSV (others accept generic files) |

Passage Retrieval Services

Below are five passage retrieval solutions to consider for enterprise search.

1. Amazon Kendra

Amazon Kendra is a cloud-based AI-powered enterprise search service from Amazon Web Services (AWS) designed to help organizations quickly search, discover, and retrieve relevant information from a variety of data sources.

Features

Amazon Kendra leverages machine learning to process queries, enabling natural language querying rather than relying on exact keyword matches. It uses models for reading comprehension and semantic matching to improve search results.

Kendra includes an Experience Builder, allowing users to create applications using UI components. Its search index can also be integrated into custom retrieval-augmented generation (RAG) applications.

As highlighted in the features comparison table, Amazon Kendra supports ASR and OCR indexing capabilities. This is achieved through integration with Amazon Textract and Amazon Transcribe, which process files, extract text, and index it to make the content searchable.

Designed for easy integration into developer RAG systems, Amazon Kendra also provides APIs for developers. For more information, check out the Amazon Kendra API documentation.

Estimated Pricing

Amazon Kendra pricing is based on storage units, query units, and connectors. The minimum cost includes a Base Index with one Storage Unit and one Query Unit, with additional charges for extra units or connectors.

Kendra offers three types of indexes, ranked from least to most advanced: Basic Developer Edition (for proof-of-concept solutions), Basic Enterprise Edition (which includes semantic search capabilities for production workloads), and Gen AI Enterprise Edition (for the highest accuracy).

Based on the example case, the Basic Enterprise Edition would be a suitable option, given its storage and querying capabilities. The estimated annual cost for this plan is $79,739.20.

For more information on pricing and features, see the Amazon Kendra pricing page.

2. Azure AI Search

Azure AI Search (formerly called Azure Cognitive Search) is a cloud-based enterprise information retrieval system that can be used for RAG-based applications, document search, and data exploration.

Features

Azure AI Search offers hybrid searching, combining both keyword and semantic search capabilities. Its ingestion process includes an integrated vectorization extension, which provides end-to-end data processing tailored for RAG usage, including chunking, vectorizing, and indexing.
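Hybrid search has to merge a keyword ranking with a semantic (vector) ranking. A widely used technique for this is Reciprocal Rank Fusion (RRF). The sketch below illustrates the idea and is not Azure AI Search's actual implementation; the document IDs and the constant `k = 60` are assumptions chosen for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_ranking = ["doc_a", "doc_b", "doc_c"]  # e.g. from keyword search
vector_ranking = ["doc_b", "doc_d", "doc_a"]   # e.g. from vector search

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Documents that appear near the top of both lists, such as `doc_b` and `doc_a` here, end up ahead of documents found by only one method.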

While Azure AI Search does not provide a built-in UI for interaction, it offers APIs for searching and accessing search results. For more details, check out the Azure AI Search REST API documentation.

Estimated Pricing

Azure AI Search offers four tier options: Free, Basic, Standard, and Storage Optimized, each with fixed pricing. To determine the most suitable plan, businesses need to consider their storage, number of indices, and scale-out limits, as pricing is based on storage rather than querying. For more details about the pricing plans, see the Azure AI Search pricing page.

For the example case, the Standard S2 plan would be the most appropriate choice, as it supports up to 512 GB of storage, meeting the data requirements. The estimated annual cost for this plan is $11,773.44.

3. Coveo

Coveo provides enterprise search software specifically designed for commerce, websites, workplace (enterprise search), and service (customer self-service).

Features

Coveo’s Enterprise AI-Search is a hybrid search system that combines keyword search with semantic search capabilities.

Access to Coveo’s index searching is available through the Search API or by using Coveo’s provided UI visual search components.

For more details, check out the Coveo API documentation.

Estimated Pricing

Coveo offers both a Pro plan and an Enterprise plan. The Enterprise plan's pricing is not publicly disclosed and must be discussed on a case-by-case basis. However, forum reports suggest that Coveo can be a very expensive service, potentially more costly than Algolia.

Pricing may ultimately depend on factors specific to the organization’s needs. For more details, visit the Coveo pricing page.

4. Elasticsearch

Elasticsearch, developed by Elastic, is a popular open-source search and analytics engine built on Apache Lucene. It is used for storing, searching, and analyzing both structured and unstructured data, including logs, metrics, and security events.

Features

Elasticsearch offers full-text search in near real-time and supports semantic search using natural language processing (NLP) and vector search. It can also act as a vector database, storing and querying vectorized data using embeddings from third-party NLP models.
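For a sense of what a vector query looks like, the sketch below builds a request body for Elasticsearch's approximate k-nearest-neighbor (`knn`) search option. The field names (`passage_embedding`, `title`, `passage`) and the short query vector are hypothetical; in practice the vector comes from an embedding model and its length must match the mapped `dense_vector` field. The body is only constructed and printed here, not sent to a cluster.

```python
import json


def build_knn_query(query_vector: list[float], k: int = 5, num_candidates: int = 50) -> dict:
    """Build a search body that uses Elasticsearch's knn search option."""
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": "passage_embedding",  # hypothetical dense_vector field
            "query_vector": query_vector,
            "k": k,                            # number of nearest neighbors to return
            "num_candidates": num_candidates,  # candidates examined per shard
        },
        "_source": ["title", "passage"],  # hypothetical fields to return
    }


body = build_knn_query([0.12, -0.45, 0.33], k=3, num_candidates=30)
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent to an index's `_search` endpoint. Check the documentation for your Elasticsearch version, since vector search options have changed across releases.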

Elasticsearch provides a REST API for interacting with the search engine. For more details, check out the Elasticsearch REST API documentation.

Estimated Pricing

Elasticsearch, when used independently from Elastic Cloud, is free to use.

Pricing for Elasticsearch largely depends on the chosen deployment method, self-management, and hosting options. Self-hosting is not always more cost-effective than the Elastic Cloud alternative, Elastic Enterprise Search, once infrastructure and maintenance costs are taken into account.

5. Elastic Enterprise Search

Not to be confused with Elasticsearch, Elastic Enterprise Search is an additional service offered through Elastic Cloud that uses Elasticsearch under the hood and provides additional features.

Features

One additional feature of Elastic Enterprise Search is its built-in Search UI, which allows for easy integration of pre-made search bars into applications or websites.

Regarding data sources and ingestion, Elastic Enterprise Search offers an Elastic Web Crawler for data ingestion from websites. Its connectors are managed by the Elastic Cloud service, rather than being self-managed, making setup and maintenance simpler.

For more details, check out the Elastic Enterprise Search API documentation.

Estimated Pricing

Since Elastic Enterprise Search is an additional service within Elastic Cloud, pricing depends on the chosen Elastic Cloud plan. The minimum cost is $95, $109, $125, or $175 per month, depending on the plan.
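To compare these monthly minimums with the annual estimates given for the other services, they can be converted to yearly figures. The sketch below simply multiplies each monthly price quoted above by 12; it assumes the minimum configuration runs all year and ignores any usage-based charges.

```python
# Minimum monthly prices quoted above, in US dollars.
MONTHLY_MINIMUMS_USD = [95, 109, 125, 175]
MONTHS_PER_YEAR = 12

annual_minimums = {
    monthly: monthly * MONTHS_PER_YEAR for monthly in MONTHLY_MINIMUMS_USD
}

for monthly, annual in annual_minimums.items():
    print(f"${monthly}/month -> ${annual:,}/year")
```

These are floor prices only. The storage and search requirements of the example scenario would likely push the real cost higher, which is why the price estimator is the better guide.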

Given the conditions of our example business, the Platinum Plan would be the most suitable option due to its support for semantic searching. To estimate costs, visit the Elastic Cloud pricing page.

Document Retrieval Services

1. Algolia

Algolia is a hosted search engine commonly used for e-commerce, SaaS applications, and knowledge bases.

It’s important to note that Algolia supports indexing and querying over structured data only. This means it may not be ideal for unstructured data retrieval, as unstructured data would need to be converted into structured formats for use with Algolia.

Features

Algolia provides lexical searching and can also support semantic searching, but only through its NeuralSearch service, which is available exclusively with the Elevate plan. Additionally, Algolia offers UI widgets and API clients for easy integration.

For more details, check out the Algolia REST API documentation.

Estimated Pricing

Algolia is known for being a very expensive search engine.

Algolia’s pricing plans include Free, Grow, Premium, and Elevate. Since we need semantic search, the NeuralSearch service is required, and it is available only with the Elevate plan. Pricing for the Elevate plan is not publicly disclosed and must be discussed with Algolia. To learn more about the plans, see the Algolia pricing page.

Choosing the Right Method: Pros and Cons

Pros & Cons and Estimated Pricing of Services

The table below summarizes key advantages and disadvantages to consider when choosing a service, along with estimated pricing. These are not the only features of each service, but they highlight aspects that might influence your decision.

| Service / Product Name | Advantages | Disadvantages | Estimated Pricing (per year) |
| --- | --- | --- | --- |
| Amazon Kendra | Supports Automatic Speech Recognition (ASR); provides image analysis; easy to set up with a user-friendly interface | Can be expensive for small businesses with large data volumes due to pricing based on indexed documents and monthly queries | $79,739.20 |
| Azure AI Search | Provides image analysis | Expensive for large-scale use due to fixed monthly pricing structure; no built-in search UI components; additional features may increase overall cost | $11,773.44 |
| Coveo | Provides image analysis | Cumbersome and time-consuming setup for some users; high learning curve for new developers; slow indexing | Consult sales |
| Elasticsearch | Near real-time search; no direct cost for the service itself | Resource-intensive and self-managed; complex setup and maintenance; no search UI components; no web crawler; better suited for small use cases, may struggle with large-scale or streaming data | Depends on own infrastructure |
| Elastic Enterprise Search | Near real-time search | Can be more expensive than Elasticsearch (requires Elastic Cloud service) | Consult sales or check out the price estimator |
| Algolia | Straightforward UI; easy setup and configuration | Requires structured records; very expensive; limited record size; high cost for scaling | Consult sales |

Choosing a service

Most retrieval services offer similar capabilities, but depending on your business needs, some may be better suited than others:

  • Amazon Kendra is ideal for businesses handling image and audio files, thanks to its OCR and ASR support. It also offers a no-code UI for customizing search applications.
  • Azure AI Search provides a more affordable alternative to Kendra, especially for businesses that only need an API rather than a full pre-built search application.
  • Elasticsearch is a great choice for businesses with development expertise that prefer a self-managed, open-source search solution.
  • Algolia is suitable for businesses that need a search interface for structured data, though it may come at a high cost.

Conclusion

When selecting an enterprise search solution, factors like storage, data type, connectors, search capabilities, and query frequency are critical. This article provides an overview of popular options to help guide your decision.

If you're interested in how search engines like Algolia and Elasticsearch handle natural queries, check out this article, where we evaluate them using the TREC-COVID dataset and standard retrieval metrics.