Kafka: How to handle duplicate messages
Kafka is "at least once" by default.
What is "At Least Once" in Kafka? (Simple Explanation)
"At least once" means:
- Kafka guarantees that every message will be delivered,
- but the same message might sometimes be delivered more than once (a duplicate).
🔁 Why it happens:
- If a consumer crashes or times out before committing its offset, Kafka delivers the message again on the next poll.
- If the system didn't record that it already processed the message, it may process it again.
🧾 Example:
- You (the consumer) receive a payment event.
- You process it and update the database.
- Before you commit the offset (telling Kafka "I have successfully processed this message"), the consumer crashes.
- Kafka never received the commit, so it treats the message as unprocessed and delivers it again.
- You process the same payment again → duplicate!
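The crash-before-commit sequence above can be reproduced with a tiny simulation (plain Python, no Kafka client; `FakeBroker` and the message text are made up for illustration):

```python
# Minimal simulation of at-least-once delivery (no real Kafka involved;
# FakeBroker is a made-up stand-in, not a Kafka API).
# The broker hands out messages starting from the last COMMITTED offset,
# so anything processed-but-not-committed is delivered again.

class FakeBroker:
    def __init__(self, messages):
        self.messages = messages
        self.committed = 0  # offset of the next message to deliver

    def poll(self):
        return list(enumerate(self.messages))[self.committed:]

    def commit(self, offset):
        self.committed = offset + 1

broker = FakeBroker(["payment:order-123"])
processed = []

# First attempt: process the event, then "crash" before committing.
for offset, msg in broker.poll():
    processed.append(msg)  # database updated
    break                  # consumer dies here; the offset is never committed

# After restart the same message is delivered again -> duplicate processing.
for offset, msg in broker.poll():
    processed.append(msg)
    broker.commit(offset)

print(processed)  # ['payment:order-123', 'payment:order-123']
```

The duplicate appears because the commit is what tells Kafka "done"; the processing itself leaves no trace the broker can see.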
💡 To handle it:
- Make your processing idempotent (safe to repeat).
- Or track processed messages (deduplication).
Below are the main ways to guarantee idempotent handling with Kafka.
Use message keys
- Set a unique key (like order_id) so Kafka sends messages with the same key to the same partition.
🔍 What it does:
- In Kafka, a message's key determines the partition it goes to.
- Kafka uses a hash of the key to decide the partition.
- So: messages with the same key always go to the same partition.
💡 Why that helps:
- Kafka guarantees ordering within a partition, not across partitions.
- If all messages about a specific entity (e.g., an order_id) go to one partition, then:
  - You can safely deduplicate or replace older messages with newer ones.
  - Consumers can track state per key, making it easier to detect and skip duplicates.
🧾 Example:
Imagine you get two messages for the same order:
- order_id = 123, status = "paid"
- order_id = 123, status = "shipped"
If these messages go to the same partition, your consumer can:
- Keep a record of the last status.
- Detect if a duplicate "paid" status appears again and skip it.
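Both halves of this idea can be sketched in a few lines. Here zlib.crc32 is only a stand-in for Kafka's real partitioner hash (murmur2), and the per-key state is an in-memory set:

```python
# Key-based partitioning sketch. zlib.crc32 stands in for Kafka's
# murmur2 hash; the dedup loop is the consumer-side part.
import zlib

NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    # Same key -> same hash -> same partition, every time.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Both events for order 123 land on the same partition...
assert partition_for("123") == partition_for("123")

# ...so one consumer sees them all and can skip repeats.
seen = set()
applied = []
for key, status in [("123", "paid"), ("123", "paid"), ("123", "shipped")]:
    if (key, status) in seen:
        continue  # duplicate event: skip it
    seen.add((key, status))
    applied.append((key, status))

print(applied)  # [('123', 'paid'), ('123', 'shipped')]
```

The key point is that this per-key state only works because every event for order 123 is guaranteed to arrive at the same consumer, in order.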
Enable idempotence in producers
- Set enable.idempotence = true in the producer configuration.
🔍 What it does:
This makes the Kafka producer avoid writing the same message twice, even if a retry happens.
🧠 Why retries happen:
- A producer sends a message.
- Kafka receives it, but the acknowledgment gets lost (e.g., a network issue).
- The producer thinks the send failed and retries the same message.
Without idempotence: Kafka writes the same message again → duplicate.
With idempotence enabled: Kafka recognizes it as the same message and writes it only once.
⚙️ How it works under the hood:
When enable.idempotence = true, Kafka uses:
- A Producer ID (PID) → identifies the producer.
- Sequence numbers per topic-partition.
Kafka tracks: "Hey, producer X already sent message #7 to this partition; don't store it again."
This makes retries safe and duplicate-free.
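The PID-plus-sequence check can be sketched as a small broker-side function (names are invented; the real broker also handles sequence gaps and producer epochs):

```python
# Broker-side dedup sketch for enable.idempotence=true:
# each (producer id, partition) pair remembers the highest sequence
# number it has stored, and a retry with an old sequence is dropped.

log = []       # the partition's message log
last_seq = {}  # (producer_id, partition) -> highest sequence stored

def append(producer_id, partition, seq, msg):
    key = (producer_id, partition)
    if seq <= last_seq.get(key, -1):
        return "duplicate-ignored"  # retry of something already written
    last_seq[key] = seq
    log.append(msg)
    return "written"

append("pid-1", 0, 0, "order-123 paid")
# The ack was lost, so the producer retries the SAME message and sequence:
status = append("pid-1", 0, 0, "order-123 paid")
append("pid-1", 0, 1, "order-123 shipped")

print(status)  # duplicate-ignored
print(log)     # ['order-123 paid', 'order-123 shipped']
```

Because the retry carries the same sequence number, the broker can tell it apart from a genuinely new message, which always carries the next sequence.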
⚠️ Important notes:
- Idempotence only guarantees no duplicates in the Kafka log for retries of the same send.
- If a producer sends the same logical message twice as two separate sends (each gets its own sequence number), Kafka won't treat them as duplicates.
Use idempotent consumers / processing
- Design your system so repeated processing doesn't cause problems (e.g., check if a record already exists before inserting).
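One way to get this "check before insert" cheaply is a unique constraint on the event id. A minimal sketch with SQLite (the table and column names are invented for the example):

```python
# Idempotent write: the event id is a PRIMARY KEY, so replaying the
# same event is a no-op instead of a double insert.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (event_id TEXT PRIMARY KEY, amount REAL)")

def handle_payment(event_id: str, amount: float) -> None:
    # INSERT OR IGNORE makes a redelivered event harmless.
    db.execute(
        "INSERT OR IGNORE INTO payments (event_id, amount) VALUES (?, ?)",
        (event_id, amount),
    )
    db.commit()

handle_payment("evt-1", 49.99)
handle_payment("evt-1", 49.99)  # duplicate delivery: silently ignored

count = db.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(count)  # 1
```

Letting the database enforce uniqueness avoids a race between "check if it exists" and "insert it" that a separate SELECT-then-INSERT would have.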
Use deduplication logic
- Store processed message IDs (in a DB or a cache like Redis) and skip a message if it has already been processed.
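A sketch of that dedup store, with an in-memory set standing in for Redis (with real Redis you would use a set-if-absent write plus a TTL so the store does not grow forever):

```python
# Deduplication sketch: remember processed message ids, skip repeats.
# seen_ids stands in for Redis; in production you would also expire keys.

seen_ids = set()
handled = []

def process_once(message_id: str, payload: str) -> bool:
    if message_id in seen_ids:  # already processed: skip the work
        return False
    seen_ids.add(message_id)
    handled.append(payload)     # the real side effect happens here
    return True

process_once("msg-1", "charge card")
process_once("msg-1", "charge card")  # duplicate delivery: skipped
process_once("msg-2", "send email")

print(handled)  # ['charge card', 'send email']
```

Unlike the database approach above, a cache-based store trades durability for speed, so it suits deduplication windows of minutes or hours rather than forever.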