Elasticsearch: Finding Missing Documents between 2 indices

The Challenge

Reindexing in Elasticsearch is a routine task, but it often comes with a common pitfall: document count discrepancies. It's not uncommon to find fewer documents in the destination index than in the source after a reindexing operation, raising valid concerns about data integrity.

In this article, I will present the approach that I take to efficiently find these missing documents.

Reindexing: The Good, The Bad, and The Ugly

First, let's talk about reindexing. It's a necessary evil in the world of Elasticsearch. Whether you're changing mappings, optimizing performance, or just doing some spring cleaning, reindexing is part of the job.

The problem often arises when you set wait_for_completion: false on your reindex request. It's great for not timing out on large indices, because Elasticsearch returns a task ID right away and runs the reindex in the background, but it can leave you in the dark if something goes wrong unless you check on the task yourself. Even worse, sometimes the task completes successfully, but you still end up with mismatched document counts.
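
One way to get out of the dark is to fetch the task from the Tasks API (GET _tasks/<task_id>) and look at the reindex summary it carries. The sketch below assumes the documented shape of that response (completed, error, and a response object with total, created, updated, version_conflicts and failures). summarizeReindexTask is a hypothetical helper, not part of the client, and it takes an already-fetched response body so it can be tried without a cluster.

```javascript
// Hypothetical helper: condenses the body of GET _tasks/<task_id> for a
// reindex task into a few fields worth checking before trusting the result.
function summarizeReindexTask(taskResponse) {
  if (!taskResponse.completed) {
    return { done: false, ok: false, reason: 'task is still running' };
  }
  if (taskResponse.error) {
    // A failed task reports an error object instead of a reindex summary.
    return { done: true, ok: false, reason: 'task failed' };
  }
  const {
    total = 0,
    created = 0,
    updated = 0,
    version_conflicts: versionConflicts = 0,
    failures = [],
  } = taskResponse.response ?? {};

  const ok = failures.length === 0 && versionConflicts === 0;
  return {
    done: true,
    ok,
    reason: ok ? 'no failures reported' : 'failures or version conflicts reported',
    total,
    written: created + updated,
    versionConflicts,
    failureCount: failures.length,
  };
}

// Example with a hand-written response body (the numbers are made up).
const summary = summarizeReindexTask({
  completed: true,
  response: { total: 100, created: 97, updated: 0, version_conflicts: 3, failures: [] },
});
console.log(summary.ok, summary.written, summary.versionConflicts); // false 97 3
```

Even a clean summary is not proof that the counts match, which is why comparing IDs is still worth doing.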

My preferred approach

What if we could compare the document IDs between the source and destination indices? Here's the approach I came up with:

Grab all the document IDs from the source index.
Do the same for the destination index.
Find the IDs that exist in the source but not in the destination.
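
On toy data, the three steps above boil down to a set difference. The IDs here are made up and findMissing is only an illustrative name. It is a minimal sketch of the comparison, not of how the IDs are fetched.

```javascript
// Returns the IDs that appear in sourceIds but not in destinationIds.
function findMissing(sourceIds, destinationIds) {
  // A Set gives constant-time lookups, so the filter stays linear overall.
  const destinationIdSet = new Set(destinationIds);
  return sourceIds.filter((id) => !destinationIdSet.has(id));
}

// Made-up IDs standing in for the two indices.
const toySourceIds = ['a', 'b', 'c', 'd'];
const toyDestinationIds = ['a', 'c'];

console.log(findMissing(toySourceIds, toyDestinationIds)); // [ 'b', 'd' ]
```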

Simple, right? Well, not so fast. By default, Elasticsearch won't return more than 10,000 hits from a regular search (the index.max_result_window setting caps from + size). With millions of documents in my indices, that wouldn't work for me.

Enter the Scroll API

This is where Elasticsearch's Scroll API came to my rescue. It's like pagination, but better, allowing us to efficiently retrieve all documents from an index. Here's a snippet of how I used it:

const allIds = [];
const scrollSearch = esClient.helpers.scrollSearch({
  index: 'my-index-00001',
  body: {
    size: 10000,
    _source: false
  },
  scroll: '1m'
});
for await (const result of scrollSearch) {
  allIds.push(...result.body.hits.hits.map(({ _id }) => _id));
}
// allIds now holds every document ID from my-index-00001

It lets us scroll through all documents, grabbing only the IDs (and no other fields). It's fast and efficient, and it works on indices of any size, as long as the full list of IDs fits in memory.
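
To see why the map over hits is enough, it helps to look at what one page looks like when _source is false: each hit still carries its metadata, including _id, but no document body. The page below is hand-written to mimic that shape (the IDs are made up), and extractIds is a hypothetical helper that mirrors the mapping used in the loop.

```javascript
// Hand-written stand-in for one page yielded by the scroll helper when
// `_source: false` is set: hit metadata only, no `_source` field.
const examplePage = {
  body: {
    hits: {
      hits: [
        { _index: 'my-index-00001', _id: 'doc-1', _score: 1 },
        { _index: 'my-index-00001', _id: 'doc-2', _score: 1 },
      ],
    },
  },
};

// Hypothetical helper: pulls the _id out of every hit on a page.
function extractIds(page) {
  return page.body.hits.hits.map(({ _id }) => _id);
}

console.log(extractIds(examplePage)); // [ 'doc-1', 'doc-2' ]
```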

Putting It All Together

Here's the full script I ended up with:


import { Client } from '@elastic/elasticsearch';

async function findMissingIds(sourceIndex, destinationIndex) {
  const esClient = new Client({
    node: process.env.ELASTICSEARCH_ENDPOINT,
    auth: {
      username: process.env.ELASTICSEARCH_USER,
      password: process.env.ELASTICSEARCH_PASSWORD,
    },
  });

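  // Collects the _id of every document in the given index via the scroll helper.
  async function getIds(index) {
    console.log(`Started searching for all ids in ${index} index`);

    const allIds = [];
    const scrollSearch = esClient.helpers.scrollSearch({
      index,
      body: {
        size: 10000, // hits per scroll page
        _source: false // skip document bodies; hit metadata such as _id is still returned
      },
      scroll: '1m' // keep the scroll context alive for one minute between pages
    });
    // Each iteration yields one page of hits.
    for await (const result of scrollSearch) {
      allIds.push(...result.body.hits.hits.map(({ _id }) => _id));
    }

    console.log(`Acquired all ids from ${index} index`);
    return allIds;
  }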

  const sourceIds = await getIds(sourceIndex);
  const destinationIds = await getIds(destinationIndex);

  await esClient.close();

  const destinationIdSet = new Set(destinationIds);
  return sourceIds.filter(id => !destinationIdSet.has(id));
}

console.time('Total time');

const sourceIndex = 'your-source-index';
const destinationIndex = 'your-destination-index';

const missingIds = await findMissingIds(sourceIndex, destinationIndex);

console.timeEnd('Total time');
console.log('Missing ids:', missingIds);

We set up an Elasticsearch client (make sure you've got your credentials in a .env file).
The findMissingIds function does the heavy lifting, using the Scroll API to fetch IDs from both indices.
We use a Set to efficiently compare the IDs and find the missing ones.

To run the script:

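Save it as findMissingIds.mjs (or keep the .js extension and set "type": "module" in package.json) so that Node.js treats it as an ES module; the script uses import and top-level await
Create a .env file that defines ELASTICSEARCH_ENDPOINT, ELASTICSEARCH_USER, and ELASTICSEARCH_PASSWORD
Execute node --env-file=.env ./findMissingIds.mjs (the --env-file flag needs Node.js 20.6 or later)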

You'll get a list of IDs that didn't make it to the destination index.

About the author: A sleep-deprived data engineer who's learned to love (or at least tolerate) Elasticsearch's quirks.

Aditya Singh
