Priya Dhingra
4 min read · Nov 21, 2021

How is data organized in Elasticsearch (ES)?

Glossary

  1. Cluster: A group of one or more connected Elasticsearch nodes
  2. Document: JSON object containing data stored in Elasticsearch
  3. Index: Collection of JSON documents
  4. Indexing: The process of adding one or more JSON documents to Elasticsearch.
  5. Inverted Index: An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.
  6. Lucene: Apache Lucene is an open-source Java search library. Elasticsearch is built on top of Lucene.
  7. Node: A single Elasticsearch server. One or more nodes can form a cluster.
  8. Primary Shard: Lucene instance containing some or all data for an index. When you index a document, Elasticsearch adds the document to the primary shard before its replica shards.
  9. Priority Queue: A sorted list that holds the top-n matching documents.
  10. Replica Shard: Copy of a primary shard. Replica shards can improve search performance and resiliency by distributing data across multiple nodes.
  11. Shard: Lucene instance containing some or all data for an index. A shard is either a primary or a replica.
  12. Tokenization: The process of breaking a string into smaller strings, called tokens or terms, based on a certain rule. For example, the whitespace tokenizer breaks a string on whitespace: input “quick brown fox” produces output [quick, brown, fox].
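
The whitespace tokenization described in the glossary can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch implements its tokenizers; the function name `whitespace_tokenize` is made up for this example.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split a string into tokens on runs of whitespace."""
    # str.split() with no arguments splits on any whitespace
    # and drops empty strings.
    return text.split()


tokens = whitespace_tokenize("quick brown fox")
print(tokens)  # ['quick', 'brown', 'fox']
```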

Assume you have your customer data, such as customerIds, names, and pincodes, and you want to create a search engine that can quickly search this data. Let’s have a look at how this data would be arranged in ES to make searching more efficient.

Let’s start by deciding on the design of our ES document. To begin, consider the JSON sample below. We could have a zillion such documents in our datastore.

{
  "customerId": "1A",
  "name": "James",
  "street": "308 3rd Avenue",
  "city": "Seattle",
  "state": "WA"
}

Let’s say we only wanted to support searches on the customer’s name and ID in our application, which would mean only these two fields need to be indexed. By default, Elasticsearch indexes all data in every field, and each indexed field has a dedicated, optimized data structure. For example, text fields are stored in inverted indices, and numeric and geo fields are stored in BKD trees. The more fields are indexed, the more resources they demand, so we should choose which fields to index based on the use-case requirements.
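
One way to limit indexing to the two searchable fields is an explicit mapping that sets the `index` mapping parameter to false on the others. The sketch below only builds such a request body in Python. The index name `customers` and the chosen field types are assumptions made for this example, not something the sample data dictates.

```python
import json

# Hypothetical mapping for a "customers" index: customerId and name stay
# searchable, while the address fields are stored in _source but are not
# added to the search index.
customer_mapping = {
    "mappings": {
        "properties": {
            "customerId": {"type": "keyword"},
            "name": {"type": "text"},
            "street": {"type": "text", "index": False},
            "city": {"type": "keyword", "index": False},
            "state": {"type": "keyword", "index": False},
        }
    }
}

# Fields without an explicit "index": False are indexed by default.
properties = customer_mapping["mappings"]["properties"]
searchable = [name for name, cfg in properties.items() if cfg.get("index", True)]
print(searchable)  # ['customerId', 'name']

# Body that could be sent with a create-index request (PUT /customers).
print(json.dumps(customer_mapping, indent=2))
```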

An index is basically a collection of documents. Here we decided to index two fields, customer ID and customer name. For each of them, ES tokenizes the field value in every document and builds an inverted index that maps each term to the documents containing it.

The inverted index for the customer name field would look like the following:

“James” -> Doc1, Doc8, Doc11 … [This denotes that “James” appears in the “name” field of Doc1, Doc8 and Doc11]

“Alice” -> Doc23, Doc48, Doc56 …

“Harry” -> Doc12, Doc90, Doc30 …

The inverted index for the customer ID field would look like the following:

“1A” -> Doc1, Doc8, Doc11 …

“1B” -> Doc23, Doc48, Doc56 …

“1C” -> Doc12, Doc90, Doc30 …
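
To make the term-to-documents mapping concrete, here is a minimal Python sketch that builds an inverted index for one field. The sample documents and the helper `build_inverted_index` are hypothetical, and the sketch splits on whitespace only; a real analyzer such as the standard analyzer would also lowercase the terms.

```python
from collections import defaultdict

# Hypothetical sample documents keyed by document ID.
docs = {
    "Doc1": {"customerId": "1A", "name": "James"},
    "Doc8": {"customerId": "1A", "name": "James"},
    "Doc23": {"customerId": "1B", "name": "Alice"},
}


def build_inverted_index(docs, field):
    """Map each whitespace-separated term of `field` to the doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in doc[field].split():
            index[term].add(doc_id)
    return index


name_index = build_inverted_index(docs, "name")
id_index = build_inverted_index(docs, "customerId")

print(sorted(name_index["James"]))  # ['Doc1', 'Doc8']
print(sorted(id_index["1B"]))       # ['Doc23']
```

A search for a name then becomes a dictionary lookup instead of a scan over every document.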

An index stores the JSON docs together with the inverted indices built for their indexed fields. Because an index can hold any number of documents, it is further broken down into shards. A shard holds a smaller subset of the index’s documents, and an index can consist of multiple shards. Each primary shard can have one or more replica shards, each storing a copy of the content in the primary shard. A replica shard serves as a backup: if the primary shard fails, a replica is promoted to primary.
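
How a document ends up in a particular shard can be sketched as follows. In simplified form, Elasticsearch hashes a routing value (the document ID by default) and takes the result modulo the number of primary shards. The code below assumes that simplified formula and uses `zlib.crc32` purely as a stand-in hash; Elasticsearch uses a different hash function, so the shard numbers printed here will not match a real cluster.

```python
import zlib


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Simplified routing: hash the routing value, modulo the primary shard count."""
    # zlib.crc32 is a stand-in hash for this sketch, not what Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


for doc_id in ["Doc1", "Doc8", "Doc23"]:
    print(doc_id, "-> shard", pick_shard(doc_id, 3))
```

Because the result depends on the number of primary shards, the same document ID always maps to the same shard as long as that number stays fixed.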

An Elasticsearch cluster consists of a number of servers (nodes) working together as one. Clustering is a technology which enables Elasticsearch to scale up to hundreds of nodes that together are able to store many terabytes of data and respond coherently to large numbers of requests at the same time.

Search or indexing requests will usually be load-balanced across the Elasticsearch data nodes, and the node that receives the request will relay requests to other nodes as necessary and coordinate the response back to the user.

All of the indexed JSON documents in an ES data source are distributed over one or more nodes. A node can hold numerous shards, but a primary shard and its replica are never placed on the same node. This is done so that if one node goes down for any reason, the replica on a different node can take over as the primary.

ES data distribution: a primary shard and its replica never reside on the same node
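
The placement rule can be illustrated with a toy allocator. This is a sketch under simple assumptions (one replica per primary, round-robin placement, hypothetical node names); the real Elasticsearch allocator weighs many more factors.

```python
def allocate(num_shards: int, nodes: list[str]) -> dict[str, list[str]]:
    """Place each primary round-robin and its single replica on the next node."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate primary and replica")
    layout = {node: [] for node in nodes}
    for shard in range(num_shards):
        primary_node = nodes[shard % len(nodes)]
        # The replica always goes to a different node than its primary.
        replica_node = nodes[(shard + 1) % len(nodes)]
        layout[primary_node].append(f"P{shard}")
        layout[replica_node].append(f"R{shard}")
    return layout


layout = allocate(3, ["node1", "node2", "node3"])
for node, shards in layout.items():
    print(node, shards)
# node1 ['P0', 'R2']
# node2 ['R0', 'P1']
# node3 ['R1', 'P2']
```

If node1 goes down in this layout, R0 on node2 is still available to take over for P0.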

How is a search query processed in a distributed ES environment?

Let’s suppose we have 3 nodes in the cluster and Node 3 is the first to receive a search request.

Search Process
  1. The client sends a search request to Node 3, which creates an empty priority queue of size from + size. Both the “from” and “size” parameters are part of the search request.
  2. Node 3 forwards the search request to a primary or replica copy of every shard in the index. Each shard executes the query locally and adds the results into a local sorted priority queue of size from + size.
  3. Each shard returns the doc IDs and sort values of all the docs in its priority queue to the coordinating node, Node 3, which merges these values into its own priority queue to produce a globally sorted list of results.

When a search request is sent to a node, that node becomes the coordinating node. It is the job of this node to broadcast the search request to all involved shards, and to gather their responses into a globally sorted result set that it can return to the client.
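
The search process described above can be sketched in Python with `heapq` standing in for the priority queues. The scores, document IDs, and function names are hypothetical, and the sketch covers only the step where doc IDs and sort values are collected and merged.

```python
import heapq


def shard_top_hits(shard_docs, from_, size):
    """Each shard returns its own top (from + size) hits as (score, doc_id) pairs."""
    return heapq.nlargest(from_ + size, shard_docs)


def coordinate_search(shards, from_, size):
    """Merge per-shard hits into a global ranking and return the requested page."""
    merged = []
    for shard_docs in shards:
        merged.extend(shard_top_hits(shard_docs, from_, size))
    # Keep the global top (from + size), then skip the first `from_` hits.
    top = heapq.nlargest(from_ + size, merged)
    return top[from_:from_ + size]


# Hypothetical (score, doc_id) pairs held by three shards.
shards = [
    [(0.9, "Doc1"), (0.2, "Doc8")],
    [(0.7, "Doc23"), (0.4, "Doc48")],
    [(0.8, "Doc12"), (0.1, "Doc90")],
]

print(coordinate_search(shards, from_=1, size=2))
# [(0.8, 'Doc12'), (0.7, 'Doc23')]
```

Each shard has to return from + size hits rather than just size, because the coordinating node cannot know in advance which shard holds the documents that belong on the requested page.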