ElasticSearch - Part 1 - Index Configuration

Requests for Google-like search functionality from a single text box are common. In this series of blog posts I will describe how to implement this using ElasticSearch and provide code examples.

The first step in this process is to determine which content will be searched over. For the sake of this blog post let's consider the website "www.chanderyoga.com". In the case of ChanderYoga the content to search over consists of the asanas, mudras, pranayamas, and blog posts. All properties of these resources should be searchable except for the "Id" property. Below are the basic properties of these resources.

Mudras/Pranayamas: Id, Author, Name, Description, Sequence (Instructions), Benefits

Asanas: Id, Author, Sanskrit Name, English Name, Description, Sequence (Instructions), Benefits, Breathing, Chakra, Categories, Anatomy Focuses, Contraindications (reasons not to practice this asana), Therapeutic Effects

Blog Posts: Id, Author, Title, Text (Description), Tags
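
To make this concrete, here is what a hypothetical blog post document might look like, using the lowercase JSON field names that the mappings later in this post expect (the values themselves are invented):

{
    "id": "42",
    "author": "Chander Dhall",
    "title": "Five Poses for Better Posture",
    "text": "A short sequence of standing asanas to practice at your desk.",
    "tags": [ "asanas", "posture", "beginners" ]
}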

In an effort to improve the user experience we need the search to match on more than just full words; we also need starts-with and contains search functionality. Since a single text box determines the value being searched for, we need to construct queries that allow multiple types of matches to take place, instead of creating a different query for each type of match. To do this we will use ElasticSearch's analysis and mapping features to configure our indices.

First, we need to allow full word matching without case sensitivity. Different properties will need different types of analysis to get the desired effect. A name or a title should use the whitespace tokenizer, while a field containing sentences should use the standard tokenizer. The standard tokenizer uses an algorithm meant to handle European language grammar, and while that is great for large bodies of English text, it might strip out valuable characters from a name or title. Here are two custom analyzers that should fulfill these needs.

"whitespace_lowercase":{
  "tokenizer":"whitespace",  
   "filter":["lowercase"]
},

"standard_lowercase":{
  “tokenizer":"standard"
}

The names "whitespace_lowercase" and "standard_lowercase" are names I have assigned to the combination of the existing tokenizers and filters. The lowercase filter is not necessary for the "standard_lowercase" analyzer because the standard tokenizer does this automatically. 

At some point in the future we may want to sort on the original value of fields like Name, Author, or Title, so we should also index that value with the keyword tokenizer and a lowercase filter (the lowercase filter is there in case we ever want to search against this field as well).

"keyword_lowercase":{
  "tokenizer":"keyword",  
   "filter":["lowercase"]
},
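
To sanity-check analyzers like these you can use ElasticSearch's _analyze API. Below is a minimal sketch, assuming the analyzers above have been added to a hypothetical index named "blog_posts" and that you are on the ElasticSearch 1.x API this post targets (newer versions expect a JSON request body instead of query-string parameters):

# Whitespace tokenizer + lowercase filter: ["utthita", "trikonasana"]
curl -XGET 'http://localhost:9200/blog_posts/_analyze?analyzer=whitespace_lowercase' -d 'Utthita Trikonasana'

# Standard tokenizer drops the trailing punctuation: ["asanas", "improve", "posture"]
curl -XGET 'http://localhost:9200/blog_posts/_analyze?analyzer=standard_lowercase' -d 'Asanas improve posture.'

# Keyword tokenizer keeps the whole value as one lowercased token: ["utthita trikonasana"]
curl -XGET 'http://localhost:9200/blog_posts/_analyze?analyzer=keyword_lowercase' -d 'Utthita Trikonasana'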

The edge ngram tokenizer and filter are built specifically for starts-with searching. In this example we will create an edge ngram tokenizer and then use it in an analyzer.

Tokenizer:
"edgeNGram_tokenizer":{
    "type":"edgeNGram",
    "min_gram":"1",
    "max_gram":"25",
    "token_chars": [ "letter", "digit", "punctuation", "symbol" ]
}

Analyzer:
"starts_with":{
    "tokenizer":"edgeNGram_tokenizer",
    "filter":["lowercase"]
}

Our "edgeNGram_tokenizer" allows all characters except for whitespace characters, and ranges from 1 to 25 characters. The phrase "The quick brown fox", would be turned into the following tokens if analyzed by the "starts_with" analyzer: "t", "th", "the", "q", "qu", "qui", "quic", "quick", "b", "br", "bro", "brow", "brown", "f", "fo", "fox". 

Finally, we need to configure an analyzer to allow contains searching. For this I have elected to use an analysis pattern that creates "shingles". Shingles are groups of adjacent tokens; combined with a character ngram tokenizer, they give us fragments of text that do not necessarily start at the beginning of a word. First I will create the custom filter I want to use:

"shingle_filter":{
    "type":"shingle",
    "min_shingle_size":"2",
    "max_shingle_size":"5",
    "output_unigrams":false
}

The tokens fed into this filter will be grouped together, with a minimum of 2 and a maximum of 5 tokens per shingle. This filter supplies a wide range of potential values to match on. I will be combining the "shingle_filter" with an ngram tokenizer:

Tokenizer:
"bigram_tokenizer":{
    "type":"ngram",
    "min_gram":"2",
    "max_gram":"2",
    "token_chars": [ "letter", "digit", "punctuation", "symbol" ]
}

Analyzer:
"contains_shingle":{
    "tokenizer":"bigram_tokenizer",
    "filter":["lowercase","shingle_filter"]
}

The tokenizer creates tokens of two characters, breaking at whitespace characters. The analyzer produces the following tokens from "Chander Dhall": "ch ha", "ch ha an", "ch ha an nd", "ch ha an nd de", "ha an", "ha an nd", "ha an nd de", "ha an nd de er", "an nd", "an nd de", "an nd de er", "an nd de er dh", "nd de", "nd de er", "nd de er dh", "nd de er dh ha", "de er", "de er dh", "de er dh ha", "de er dh ha al", "er dh", "er dh ha", "er dh ha al", "er dh ha al ll", "dh ha", "dh ha al", "dh ha al ll", "ha al", "ha al ll", "al ll". While this looks very strange, the same analysis will be applied to the search text as well; this gives us contains matching on a minimum of three characters and can allow for some typos.
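
To see why a contains search matches, run a likely search term through the same analyzer (same hypothetical "blog_posts" index as before):

# "dhal" becomes the shingles "dh ha", "dh ha al", and "ha al",
# all of which also appear in the indexed tokens for "Chander Dhall" above
curl -XGET 'http://localhost:9200/blog_posts/_analyze?analyzer=contains_shingle' -d 'dhal'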

Now that we have the necessary analysis available we can map the analyzers to the fields. A field like the title of a blog post will have a mapping entry like this:

"title": {
    "type": "string",
    "analyzer": "whitespace_lowercase",
    "fields": {
        "raw": {
            "type": "string",
            "analyzer": "keyword_lowercase",
            "norms": {
                "enabled": false
            }
        },
        "starts_with": {
            "type": "string",
            "index_analyzer": "starts_with",
            "search_analyzer": "whitespace_lowercase"
        },
        "contains_shingle": {
            "type": "string",
            "analyzer": "contains_shingle"
        }
    }
}

This will actually create four fields in ElasticSearch: "title", "title.raw", "title.starts_with", and "title.contains_shingle". The raw field is primarily for sorting purposes, and search requests can be written against any or all of these fields. You may notice that at the top level of "title" we use the "analyzer" setting, while in the "starts_with" field we use the "index_analyzer" and "search_analyzer" settings. Using the two distinct settings allows me to control how the text is analyzed when it is indexed versus how it is analyzed at search time. If only "analyzer" is specified, that analyzer is used at both index and search time; if only one of "index_analyzer" or "search_analyzer" is specified, the other falls back to the index's default analyzer (which defaults to standard).
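
Query construction is not the focus of this post, but as a quick illustration, a single search request can target all three searchable variants of the title at once. This is only a sketch; the "blog_posts" index name and the search term are assumptions:

curl -XGET 'http://localhost:9200/blog_posts/post/_search' -d '{
    "query": {
        "multi_match": {
            "query": "trik",
            "fields": [ "title", "title.starts_with", "title.contains_shingle" ]
        }
    }
}'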

Some fields we never want to search against, like Id.

"id": {
    "type": "string",
    "index": "not_analyzed",
    "norms": {
        "enabled": false
    }
}

In the Asana resource there are many fields that have the same relevance to the search results, and I have no requirement to search against them individually. ElasticSearch gives us the ability to group these together through copy_to.

"sanskrit_name": {
        "type": "string",
        "index": "not_analyzed",
        "copy_to": "name",
        "norms": {
        "enabled": false
    }
},
"english_name": {
        "type": "string",
        "index": "not_analyzed",
        "copy_to": "name",
        "norms": {
        "enabled": false
    }
},
"name": {
    "type": "string",
    "analyzer": "whitespace_lowercase",
    "fields": {
        "starts_with": {
            "type": "string",
            "index_analyzer": "starts_with",
            "search_analyzer": "whitespace_lowercase"
        },
        "contains_shingle": {
            "type": "string",
            "analyzer": "contains_shingle"
        }
    }
}

The "sanskrit_name" and "english_name" field values are held in the original field and a copy of that value is passed to the "name" field where it is analyzed for search purposes.

Full examples of index settings documents can be found in ChanderYoga.ElasticSearch.Startup.Resources. Here is the development environment version of the blog posts index.

{
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        },
        "analysis": {
            "filter": {
                "shingle_filter": {
                    "type": "shingle",
                    "min_shingle_size": "2",
                    "max_shingle_size": "5",
                    "output_unigrams": false
                }
            },
            "tokenizer": {
                "edgeNGram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": "1",
                    "max_gram": "25",
                    "token_chars": [ "letter", "digit", "punctuation", "symbol" ]
                },
                "bigram_tokenizer": {
                    "type": "ngram",
                    "min_gram": "2",
                    "max_gram": "2",
                    "token_chars": [ "letter", "digit", "punctuation", "symbol" ]
                }
            },
            "analyzer": {
                "whitespace_lowercase": {
                    "tokenizer": "whitespace",
                    "filter": [ "lowercase" ]
                },
                "standard_lowercase": {
                    "tokenizer": "standard",
                    "filter": [ "lowercase" ]
                },
                "keyword_lowercase": {
                    "tokenizer": "keyword",
                    "filter": [ "lowercase" ]
                },
                "starts_with": {
                    "tokenizer": "edgeNGram_tokenizer",
                    "filter": [ "lowercase" ]
                },
                "contains_shingle": {
                    "tokenizer": "bigram_tokenizer",
                    "filter": [ "lowercase", "shingle_filter" ]
                }
            }
        }
    },
    "mappings": {
        "post": {
            "dynamic": "strict",
            "Id": {
                "path": "id"
            },
            "_all": {
                "enabled": "false"
            },
            "properties": {
                "id": {
                    "type": "string",
                    "index": "not_analyzed",
                    "norms": {
                        "enabled": false
                    }
                },
                "title": {
                    "type": "string",
                    "analyzer": "whitespace_lowercase",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "analyzer": "keyword_lowercase",
                            "norms": {
                                "enabled": false
                            }
                        },
                        "starts_with": {
                            "type": "string",
                            "index_analyzer": "starts_with",
                            "search_analyzer": "whitespace_lowercase"
                        },
                        "contains_shingle": {
                            "type": "string",
                            "analyzer": "contains_shingle"
                        }
                    }
                },
                "tags": {
                    "type": "string",
                    "analyzer": "whitespace_lowercase",
                    "fields": {
                        "starts_with": {
                            "type": "string",
                            "index_analyzer": "starts_with",
                            "search_analyzer": "whitespace_lowercase"
                        },
                        "contains_shingle": {
                            "type": "string",
                            "analyzer": "contains_shingle"
                        }
                    }
                },
                "text": {
                    "type": "string",
                    "analyzer": "standard_lowercase",
                    "fields": {
                        "starts_with": {
                            "type": "string",
                            "index_analyzer": "starts_with",
                            "search_analyzer": "standard_lowercase"
                        },
                        "contains_shingle": {
                            "type": "string",
                            "analyzer": "contains_shingle"
                        }
                    }
                },
                "author": {
                    "type": "string",
                    "analyzer": "whitespace_lowercase",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "analyzer": "keyword_lowercase",
                            "norms": {
                                "enabled": false
                            }
                        },
                        "starts_with": {
                            "type": "string",
                            "index_analyzer": "starts_with",
                            "search_analyzer": "whitespace_lowercase"
                        },
                        "contains_shingle": {
                            "type": "string",
                            "analyzer": "contains_shingle"
                        }
                    }
                }
            }
        }
    }
}
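
Assuming the document above is saved to a file, the index can be created in a single call (the index and file names here are just examples):

curl -XPUT 'http://localhost:9200/blog_posts' -d @blog_posts_index.json

# Verify the mapping was applied
curl -XGET 'http://localhost:9200/blog_posts/_mapping?pretty'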
