Kafka: How to handle duplicate messages



Kafka is "at least once" by default.



What is "At Least Once" in Kafka? (Simple Explanation)

"At least once" means:

  • Kafka guarantees that every message will be delivered,

  • But sometimes the same message might be delivered more than once (duplicate).


🔁 Why it happens:

  • If a consumer crashes or times out, the message gets redelivered on the next poll.

           (By "redelivered" I mean that Kafka allows the consumer to pull the same message again. If you know Kafka's architecture, Kafka does not push/send messages to consumers; the consumer is the one that pulls them, starting from the last committed offset.)
  • If the system didn't record that it already processed the message, it may process it again.


🛠 Example:

  • You (Consumer) receive a payment event.

  • You process it and update the database.

  • But before you commit the offset (telling Kafka "I have successfully processed this message"), the consumer crashes.

  • The offset was never committed, so Kafka still considers the message unconsumed, and the restarted consumer pulls it again.

  • You process the same payment again → duplicate!
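This failure sequence can be sketched as a toy Python simulation (not real Kafka client code; the message shape and names here are made up for illustration):

```python
# Toy simulation of at-least-once delivery: the consumer crashes after
# processing but before committing its offset, so on restart it re-reads
# (and re-processes) the same message.

messages = [{"offset": 0, "payment_id": "p-1", "amount": 100}]
committed_offset = 0          # the offset Kafka will hand out next
processed = []                # stands in for "update the database"

def poll():
    """Return the message at the last committed offset (Kafka-style pull)."""
    return messages[committed_offset]

# First attempt: process, then crash BEFORE committing the offset.
msg = poll()
processed.append(msg["payment_id"])   # database updated...
# (crash here -- committed_offset is never advanced)

# After restart, the consumer polls again from the committed offset:
msg = poll()
processed.append(msg["payment_id"])   # same payment processed twice!
committed_offset += 1                 # offset finally committed

print(processed)  # ['p-1', 'p-1'] -> duplicate processing
```

The duplicate appears purely because the side effect (the database write) happened before the offset commit, which is exactly the window the rest of this post is about closing.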


💡 To Handle It:

  • Make your processing idempotent (safe to repeat).

  • Or track processed messages (deduplication).

    Below are some ways to guarantee idempotent processing with Kafka:

  • Use message keys

    • Set a unique key (like order_id) so Kafka routes messages with the same key to the same partition.

🔍 What It Does:

  • In Kafka, a message key determines the partition it goes to.

  • Kafka uses a hash of the key to decide the partition.

  • So:
    → Messages with the same key always go to the same partition.
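The routing can be illustrated with a simplified hash. (Kafka's default partitioner actually uses murmur2; deterministic CRC32 is just a stand-in here to show the idea.)

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    # Kafka hashes the message key and takes it modulo the partition
    # count, so the same key always lands on the same partition.
    # (Real Kafka uses murmur2; crc32 is a deterministic stand-in.)
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

print(partition_for("order-123") == partition_for("order-123"))  # True
```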

💡 Why That Helps:

  • Kafka guarantees ordering within a partition, not across partitions.

  • If all messages about a specific entity (e.g., an order_id) go to one partition, then:

    • You can safely deduplicate or replace older messages with newer ones.

    • Consumers can track state per key, making it easier to detect and skip duplicates.

🧾 Example:

Imagine you get two messages for the same order:

  • order_id = 123, status = "paid"

  • order_id = 123, status = "shipped"

If these messages go to the same partition, your consumer can:

  • Keep a record of the last status.

  • Detect if a duplicate "paid" status appears again and skip it.
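A consumer reading that partition could track the last seen status per key, along these lines (a hypothetical sketch; the order_id/status fields come from the example above):

```python
# Per-key state tracking: because all messages for order 123 arrive on
# the same partition, one consumer sees them in order and can skip a
# duplicate by remembering the last status per order_id.

last_status = {}   # order_id -> last processed status
applied = []       # stands in for the real side effect

def handle(msg):
    key, status = msg["order_id"], msg["status"]
    if last_status.get(key) == status:
        return  # duplicate of the message we just processed -- skip it
    last_status[key] = status
    applied.append((key, status))

for msg in [
    {"order_id": 123, "status": "paid"},
    {"order_id": 123, "status": "paid"},     # duplicate delivery
    {"order_id": 123, "status": "shipped"},
]:
    handle(msg)

print(applied)  # [(123, 'paid'), (123, 'shipped')]
```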


  • Enable idempotence in producers

Kafka producer setting:
enable.idempotence = true

Ensures retries don't create duplicates.

🔍 What It Does:

  • This makes the Kafka producer smart enough to avoid writing the same message twice, even if a retry happens.

🧠 Why retries happen:

  • A producer sends a message.

  • Kafka receives it, but the acknowledgment gets lost (e.g., network issue).

  • Producer thinks it failed and retries sending the same message.

Without idempotence:

  • Kafka writes the same message again → duplicate.

With idempotence enabled:

  • Kafka knows it's the same message, and only writes it once.


🔍 How it works under the hood:

When enable.idempotence=true, Kafka uses:

  • Producer ID (PID) → identifies the producer.

  • Sequence numbers per topic-partition.

Kafka tracks:

“Hey, producer X already sent message #7 to this partition — don’t store it again.”

This makes retries safe and duplicate-free.
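The PID + sequence-number bookkeeping can be modeled as a toy broker (heavily simplified: real Kafka also enforces that sequences arrive in order and handles producer restarts):

```python
# Toy model of broker-side idempotence: the broker remembers the highest
# sequence number it has stored per (producer_id, partition) and drops
# any retry carrying a sequence number it has already seen.

log = []          # the partition's message log
last_seq = {}     # (producer_id, partition) -> highest stored sequence

def append(producer_id, partition, seq, value):
    key = (producer_id, partition)
    if seq <= last_seq.get(key, -1):
        return "duplicate-ignored"   # retry of an already-stored message
    last_seq[key] = seq
    log.append(value)
    return "stored"

append(42, 0, seq=0, value="payment-1")
# The ack is lost, so the producer retries the SAME message with the
# same sequence number -- the broker drops it instead of storing twice:
result = append(42, 0, seq=0, value="payment-1")

print(log, result)  # ['payment-1'] duplicate-ignored
```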


⚠️ Important Notes:

  • Idempotence only prevents duplicates caused by producer retries — it guarantees no duplicates in the Kafka log for the same produce attempt.

  • If your application sends the same logical message twice (two separate send() calls), each send gets its own sequence number, so Kafka won't treat them as duplicates.


  • Use idempotent consumers / processing

    • Design your system so repeated processing doesn't cause problems (e.g., check if a record already exists before inserting).
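A common shape for an idempotent consumer is an insert that is a no-op when the row already exists. A sketch with SQLite (the payments table and columns are made up for illustration; Postgres's ON CONFLICT DO NOTHING gives the same effect):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (payment_id TEXT PRIMARY KEY, amount INTEGER)"
)

def process(event):
    # INSERT OR IGNORE makes re-processing the same event harmless:
    # the second insert for payment_id 'p-1' simply does nothing.
    conn.execute(
        "INSERT OR IGNORE INTO payments (payment_id, amount) VALUES (?, ?)",
        (event["payment_id"], event["amount"]),
    )
    conn.commit()

process({"payment_id": "p-1", "amount": 100})
process({"payment_id": "p-1", "amount": 100})  # redelivered duplicate

count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(count)  # 1 -- processed "twice", stored once
```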

  • Use deduplication logic

    • Store processed message IDs (in DB or cache like Redis) and skip if already processed.
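With an external store, the same idea becomes "set if not exists". A sketch using a plain Python set in place of Redis (in real Redis you would use SET with the NX flag and a TTL so the check-and-add is atomic; the message_id field is a made-up example):

```python
# Deduplication by tracking processed message IDs. A set stands in for
# Redis here; in a real system the check-and-add must be atomic
# (Redis "SET <id> 1 NX EX <ttl>" gives you exactly that).

seen_ids = set()
handled = []

def handle_once(msg):
    if msg["message_id"] in seen_ids:
        return False            # already processed -- skip
    seen_ids.add(msg["message_id"])
    handled.append(msg["message_id"])
    return True

handle_once({"message_id": "evt-1"})
handle_once({"message_id": "evt-1"})   # duplicate delivery -> skipped
handle_once({"message_id": "evt-2"})

print(handled)  # ['evt-1', 'evt-2']
```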
