Elasticsearch Synchronization: Practical Approaches That Work in Real Systems

Ahmet Onur

Elasticsearch synchronization is one of those topics almost everyone postpones at the beginning. Everything works fine at first. Queries are fast, data looks correct, life is good.

Then data grows.
Updates increase.
And suddenly, Elasticsearch starts to feel… fragile.

If you’ve ever asked yourself:

“How do I reliably keep Elasticsearch in sync with my database?”

you’re in the right place.

In this article, we’ll walk through real-world Elasticsearch synchronization strategies—without buzzwords, and without unnecessary complexity.


Why Elasticsearch Synchronization Is Harder Than It Looks

Let’s clear up a common misconception first:

Elasticsearch is not a primary data store.

In most systems:

  • PostgreSQL, MySQL, or another database is the source of truth
  • Elasticsearch holds a search-optimized copy of that data

The challenge is keeping these two in sync—especially when data changes frequently.


Why Are Updates Expensive in Elasticsearch?

Elasticsearch is built on top of Lucene, and Lucene has a core design principle:
segments are immutable.

Here’s a simple analogy:

Imagine printing a book. If you want to change a single word, you don’t edit the page—you print a new one and mark the old one as obsolete.

That’s exactly what happens in Elasticsearch:

  • The old document is marked as deleted
  • The new document is indexed again
  • All analyzers run from scratch

This leads to:

  • Higher CPU usage
  • More disk I/O
  • Increased segment count

In short:
👉 Frequent updates are expensive in Elasticsearch.

Source:
https://www.elastic.co/blog/found-keeping-elasticsearch-in-sync


Bulk API: The Foundation of Elasticsearch Synchronization

The first rule of Elasticsearch writes is simple:

Never index documents one by one. Always use batches.

The Bulk API:

  • Reduces network overhead
  • Minimizes segment churn
  • Improves overall throughput

But using Bulk API alone isn’t enough.
The real question is when and how you send data to Elasticsearch.
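The Bulk API expects a newline-delimited JSON (NDJSON) body in which each document contributes an action line followed by a source line. Here is a minimal, standard-library-only sketch of building such a payload; the function name, index name, and document fields are illustrative, and in practice you would POST this body to the `_bulk` endpoint or use an official client helper:

```python
import json

def build_bulk_body(docs, index):
    """Build an NDJSON body for the Elasticsearch Bulk API.

    Each document becomes two lines: an action line telling
    Elasticsearch what to do, and the document source itself.
    """
    lines = []
    for doc in docs:
        # Use the database primary key as _id so re-sending the same
        # batch overwrites documents instead of duplicating them.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    # The Bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"

body = build_bulk_body([{"id": 1, "title": "red kettle"}], "products")
```

Reusing the database ID as `_id` is also the first step toward idempotency, which we'll return to later.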


Approach 1: Queue-Based Elasticsearch Synchronization

One of the most common and flexible approaches is queue-based synchronization.

How Queue-Based Sync Works

At a high level:

  1. A record changes in the primary database
  2. A queue entry is created (document ID, index name, etc.)
  3. If the same record changes again shortly after:
    • No duplicate queue entry is created (deduplication)
  4. Workers periodically:
    • Dequeue, for example, 1000 entries
    • Fetch fresh data from the database
    • Index everything using the Bulk API
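The steps above can be sketched with an in-memory deduplicating queue. This is a toy model, not a production queue (a real system would use a durable store and remove entries only after successful indexing), but it shows why re-enqueuing the same record is a no-op:

```python
from collections import OrderedDict

class DedupQueue:
    """In-memory sketch of a deduplicating sync queue.

    Enqueuing the same (index, doc_id) pair twice keeps a single
    entry, so a record changed many times between worker runs is
    still fetched and indexed only once per batch.
    """
    def __init__(self):
        self._entries = OrderedDict()

    def enqueue(self, index, doc_id):
        self._entries[(index, doc_id)] = True  # overwrite = dedup

    def dequeue_batch(self, size=1000):
        batch = list(self._entries)[:size]
        for key in batch:
            del self._entries[key]
        return batch

q = DedupQueue()
q.enqueue("products", 42)
q.enqueue("products", 42)   # duplicate change: deduplicated
q.enqueue("products", 7)
batch = q.dequeue_batch()   # -> [("products", 42), ("products", 7)]
```

The worker would then fetch the fresh rows for those IDs from the database and send them in one Bulk API call, so Elasticsearch always receives the latest state rather than a replay of every intermediate change.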

Why This Approach Works Well

  • Elasticsearch load is isolated from the main application
  • User-facing operations remain fast
  • Write throughput can be controlled via worker count
  • Temporary Elasticsearch outages don’t break the system

This approach is especially effective for domains with frequent updates.

The Trade-off

Queues perform poorly for full reindex operations.
Pushing millions of records through a queue is slow, expensive, and complex.


Approach 2: Range-Based (Time Window) Batch Synchronization

If your data is:

  • Immutable
  • Append-only
  • Log, event, or audit based

you can use a much simpler approach—without queues.

How Range-Based Synchronization Works

The logic is straightforward:

  • A worker runs every X seconds
  • It remembers the last processed timestamp
  • It calculates a new range:
    • last_run_time → now() - 1 second
  • Fetches records in that range
  • Indexes them using the Bulk API
  • Updates last_run_time
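The window calculation is the only subtle part, so here is a small sketch of it (function and variable names are illustrative). The end of the range is held back by a safety margin, for the reasons explained below:

```python
from datetime import datetime, timedelta, timezone

def next_window(last_run_time, now=None, safety=timedelta(seconds=1)):
    """Compute the [start, end) time range for the next sync pass.

    The end is held back by `safety` so that in-flight database
    transactions have a moment to commit before we read.
    """
    now = now or datetime.now(timezone.utc)
    end = now - safety
    if end <= last_run_time:
        return None  # nothing to do yet
    return (last_run_time, end)

last = datetime(2025, 1, 1, tzinfo=timezone.utc)
window = next_window(last, now=last + timedelta(seconds=30))
# window -> (last, last + 29 seconds)
```

After a successful bulk index of the rows in that range, the worker persists the window's end as the new `last_run_time`; persisting it only on success is what makes a crashed run safely re-runnable.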

Why the 1-Second Delay?

Because:

  • Elasticsearch makes documents searchable after ~1 second by default
  • Slow database transactions need a small buffer to complete

That tiny delay significantly reduces the risk of missing data.


Time-Based Indexing: Small Change, Big Win

If you’re using range-based batching, time-based indices are a natural fit.

For example:

  • logs-2025-01
  • logs-2025-02

Benefits:

  • Queries target only relevant indices
  • Old data can be deleted with a single operation
  • Better search performance overall

This small design decision pays off quickly as data grows.
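Routing documents to monthly indices is a one-liner; the helper name and `logs` prefix here are just examples:

```python
from datetime import datetime

def index_for(timestamp, prefix="logs"):
    """Route a document to its monthly index, e.g. logs-2025-01."""
    return f"{prefix}-{timestamp:%Y-%m}"

name = index_for(datetime(2025, 1, 15))  # -> "logs-2025-01"
```

Dropping January's data then becomes a single `DELETE logs-2025-01` request instead of a delete-by-query crawling millions of documents.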


Error Handling and Idempotency (Often Overlooked)

This is where many systems quietly fail.

All Elasticsearch synchronization processes should be idempotent.

That means:

  • Running the same batch twice should not break anything
  • Queue entries should be removed only after successful indexing
  • Bulk API partial failures must be handled explicitly

When designed this way:

  • Retry logic becomes simple
  • Recovery scenarios are predictable
  • The system remains stable under failure
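Partial failures deserve special attention: the Bulk API returns HTTP 200 even when individual items fail, so each item's status must be inspected explicitly. A sketch of extracting the failed IDs from a bulk response (the response shown is a hand-written example of the real response shape):

```python
def failed_ids(bulk_response):
    """Collect the document IDs that failed in a Bulk API response.

    Each item is a single-key dict keyed by the action type
    ("index", "update", ...) whose value carries a per-item status.
    """
    failed = []
    for item in bulk_response.get("items", []):
        (_, result), = item.items()
        if result.get("status", 500) >= 300:
            failed.append(result.get("_id"))
    return failed

response = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 200}},
        {"index": {"_id": "2", "status": 429}},  # rejected: retry later
    ],
}
retry = failed_ids(response)  # -> ["2"]
```

Because documents are indexed under their database IDs, re-indexing the failed ones (or the entire batch) is safe: a repeat is an overwrite, not a duplicate.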

Conclusion: Choosing the Right Elasticsearch Synchronization Strategy

There is no single “correct” solution—but there is a correct choice for each use case.

  • Frequent updates, complex domains → Queue-based synchronization
  • Immutable or append-only data → Range-based batch synchronization
  • Always → Bulk API + idempotent design

When done right, Elasticsearch synchronization becomes boring—and that’s a good thing.

Thanks for reading 🙌
If you’ve implemented Elasticsearch synchronization in a different way, or if something here raised questions, feel free to share your thoughts.
Elasticsearch synchronization may look tricky at first, but with the right approach, it’s surprisingly manageable.
