What Happens When an Event Fires Twice: Idempotency in Event-Driven Systems

2026-05-03 | Tags: [event-driven, architecture, idempotency, bugs, distributed-systems]

Last post I wrote about trusting events over queries — specifically, not using a database query to confirm what an event just told you. That fix closed one class of race condition. This post is about a different failure mode: what happens when the event fires twice?

In a well-behaved system, events fire exactly once. In the real world, they don't.

Why Events Fire More Than Once

Network retries. At-least-once delivery guarantees. Crashes at exactly the wrong moment. A subscriber ACKs a message, but the broker doesn't receive the ACK before timing out, so it redelivers. A process restarts mid-handling and the message broker sees no ACK, so it sends the message again.

Redis Streams (which the hermesorg pipeline uses) has at-least-once delivery semantics when read through consumer groups. If a consumer group member crashes without ACKing a message, the message stays in the group's pending entries list until it is acknowledged. From there it gets processed again: the same consumer can re-read its own pending entries when it restarts, or another consumer can take the message over with XAUTOCLAIM or XCLAIM. This is the right behavior for fault tolerance. It means you cannot assume any given event is handled exactly once.
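
The crash-before-ACK sequence is easy to reproduce in a toy model. The sketch below is an in-memory stand-in, not Redis: `ToyBroker` and its methods are made-up names that only mimic the "delivered but not yet ACKed" bookkeeping described above.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for an at-least-once broker (not Redis itself)."""

    def __init__(self):
        self._queue = deque()  # messages waiting for delivery
        self._pending = {}     # delivered but not yet ACKed: id -> payload

    def publish(self, msg_id, payload):
        self._queue.append((msg_id, payload))

    def deliver(self):
        """Hand out the next message and remember it as pending."""
        if not self._queue:
            return None
        msg_id, payload = self._queue.popleft()
        self._pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        self._pending.pop(msg_id, None)

    def requeue_pending(self):
        """Recovery step: anything never ACKed goes back on the queue."""
        for msg_id, payload in self._pending.items():
            self._queue.append((msg_id, payload))
        self._pending.clear()


def run_scenario():
    broker = ToyBroker()
    broker.publish("1-0", "ARTIFACT_APPROVED")
    handled = []

    # First delivery: the handler finishes its work, then the process
    # dies before the ACK is sent, so broker.ack() never runs.
    msg_id, payload = broker.deliver()
    handled.append(payload)

    # After the restart, the un-ACKed message is recovered and handled again.
    broker.requeue_pending()
    msg_id, payload = broker.deliver()
    handled.append(payload)
    broker.ack(msg_id)
    return handled
```

`run_scenario()` returns the same payload twice: the handler did nothing wrong, and the event was still processed two times.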

The question is whether your handlers care.

The Two Kinds of Handlers

Idempotent handlers: Running the same handler twice with the same event produces the same outcome as running it once. The second run is a no-op.

Non-idempotent handlers: Running twice produces a different outcome than running once. The second run corrupts state.

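For operations that keep no state you depend on, such as writing a log line or reporting a gauge metric, duplicates are harmless. Log the same line twice: you have a duplicate log entry, nothing breaks. (A counter metric is the exception: it counts the event twice.)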

For stateful operations — transitioning a project phase, creating a database record, sending an email — idempotency requires deliberate design.
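
A small illustration of the difference, using plain dictionaries in place of a database. The event shape and both function names are invented for this example.

```python
def apply_increment(state, event):
    """Non-idempotent: every delivery adds another approval."""
    state["approved_count"] = state.get("approved_count", 0) + 1
    return state


def apply_set_membership(state, event):
    """Idempotent: adding the same artifact ID twice changes nothing."""
    state.setdefault("approved_ids", set()).add(event["artifact_id"])
    return state


event = {"type": "ARTIFACT_APPROVED", "artifact_id": "a-17"}

counter_state = {}
apply_increment(counter_state, event)
apply_increment(counter_state, event)   # duplicate delivery

set_state = {}
apply_set_membership(set_state, event)
apply_set_membership(set_state, event)  # duplicate delivery

print(counter_state["approved_count"])  # 2: the duplicate was counted
print(len(set_state["approved_ids"]))   # 1: the duplicate was absorbed
```

The two handlers record the same fact. Only the second one describes the end state ("this artifact is approved") instead of a change to apply ("one more approval"), and that is what makes it safe to repeat.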

The Phase Transition Problem

In hermesorg, when all intake artifacts are approved, the project transitions from INTAKE to PLANNING. This transition:

1. Updates the project's phase in the database
2. Triggers the PLANNING task plan generation
3. Publishes a PHASE_CHANGED event

If the ARTIFACT_APPROVED event that triggered this fires twice, _check_intake_exit runs twice. On the first run, all artifacts are approved, the transition fires. On the second run — what happens?

Without idempotency protection: _check_intake_exit sees all artifacts approved again and tries to transition again. The project is already in PLANNING, so the second transition silently fails, throws an exception, or (worst case) corrupts the phase state.

With idempotency protection: _check_intake_exit checks whether the project is already past INTAKE before doing anything. If yes, it's a no-op. Second run does nothing.

The fix is a single guard at the top of the handler:

async def _check_intake_exit(self, project_id: str, triggering_event_type: str | None = None):
    project = await self._get_project(project_id)
    if project.phase != Phase.INTAKE:
        # Already transitioned — this is a duplicate event, ignore
        return
    # ... rest of the logic

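One check, and a duplicate that arrives after the first run has finished becomes a no-op. If two deliveries can be handled at the same time, both can read INTAKE before either writes, so the check and the write also need to be atomic.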
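
The guard reads the phase in one step and the transition writes it in a later one. One way to close that gap is to put the guard into the write itself, as a conditional UPDATE. This is a minimal sketch, not hermesorg's code: it uses an in-memory SQLite database so it runs on its own, and the `projects` table, its columns, and both function names are assumptions.

```python
import sqlite3


def transition_intake_to_planning(conn, project_id):
    """Move a project from INTAKE to PLANNING at most once.

    Returns True only for the call that actually performed the transition.
    """
    cur = conn.execute(
        "UPDATE projects SET phase = 'PLANNING' "
        "WHERE id = ? AND phase = 'INTAKE'",
        (project_id,),
    )
    conn.commit()
    # rowcount is 1 if this call changed the row, 0 if the project had
    # already left INTAKE (or does not exist).
    return cur.rowcount == 1


def make_demo_db():
    """Build a throwaway database with one project still in INTAKE."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE projects (id TEXT PRIMARY KEY, phase TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO projects VALUES ('p1', 'INTAKE')")
    conn.commit()
    return conn


conn = make_demo_db()
first = transition_intake_to_planning(conn, "p1")   # performs the transition
second = transition_intake_to_planning(conn, "p1")  # duplicate: no-op
print(first, second)  # True False
```

Only the call that gets True back should go on to generate the PLANNING task plan and publish PHASE_CHANGED. The duplicate gets False and stops.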

The Email Problem

The harder case: side effects that can't be reversed.

If an event triggers an email — "Your project has been approved" — running the handler twice sends the email twice. You can't unsend an email. The database guard doesn't help; the email is gone.

The solution is to track sent notifications as a separate, durable fact:

async def _send_approval_notification(self, project_id: str):
    # Check if we already recorded this notification
    if await self._notification_sent(project_id, "approval"):
        return
    # Record first, so a crash after this point cannot cause a duplicate send
    await self._record_notification(project_id, "approval")
    # Then send the email
    await self._send_email(...)

The notification record becomes the idempotency key. The write-before-send ordering matters: if you record after sending and the process crashes between the two, you send twice. Record first, then send — if you crash after recording but before sending, the notification doesn't go out, which is usually better than a duplicate.
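
The check and the record can also be collapsed into one atomic "claim", so that two overlapping deliveries cannot both pass the check. The sketch below assumes a `sent_notifications` table with a primary key on (project_id, kind). The table, the function names, and the `send_email` callable are illustrative, and SQLite stands in for the real database.

```python
import sqlite3


def make_notification_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sent_notifications ("
        " project_id TEXT NOT NULL,"
        " kind TEXT NOT NULL,"
        " PRIMARY KEY (project_id, kind))"
    )
    conn.commit()
    return conn


def claim_notification(conn, project_id, kind):
    """Record the notification; return True only for the first claim."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO sent_notifications (project_id, kind) "
        "VALUES (?, ?)",
        (project_id, kind),
    )
    conn.commit()
    # An ignored insert changes no rows, so rowcount is 0 for duplicates.
    return cur.rowcount == 1


def send_approval_notification(conn, project_id, send_email):
    """Record first, then send; duplicates stop at the claim."""
    if not claim_notification(conn, project_id, "approval"):
        return False  # already claimed by an earlier delivery
    send_email(project_id)
    return True


conn = make_notification_db()
outbox = []  # stands in for a real mail client
send_approval_notification(conn, "p1", outbox.append)
send_approval_notification(conn, "p1", outbox.append)  # duplicate event
print(outbox)  # ['p1']
```

The tradeoff is the one described above: a crash between the claim and the send loses the email instead of duplicating it.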

The "Exactly Once" Illusion

Distributed systems literature is full of warnings about "exactly once" semantics. You can get close to exactly-once processing with careful protocol design (two-phase commit, or a transactional outbox combined with deduplication on the consumer side), but it's expensive. Most systems choose at-least-once delivery and require idempotent consumers instead.

This is the right tradeoff for hermesorg. The overhead of exactly-once delivery — distributed transactions, two-phase commit across Redis and PostgreSQL — would make the system significantly more complex for a guarantee that idempotent handlers already cover.

The rule: design all event handlers as if the event will fire twice. Not "might fire twice in unusual circumstances" but "will fire twice, eventually." Under this assumption, the guard conditions aren't defensive programming — they're load-bearing logic.

What to Check

For each event handler, ask:

1. What database state does this handler modify?
2. What is the idempotency condition, the check that makes a second run a no-op?
3. Does this handler have side effects that can't be reverted (email, payment, external API call)?
4. For irreversible side effects, what's the idempotency key?

For hermesorg specifically, all phase transition handlers now include a phase guard. Email notifications record before sending. The artifact approval tracking uses upserts rather than inserts, so double-processing an approval produces one approved artifact, not two.
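
For the upsert case, a minimal sketch of why double-processing leaves a single row. The `artifact_approvals` table and the function names are assumptions, not hermesorg's schema. The `ON CONFLICT ... DO UPDATE` clause shown here needs SQLite 3.24 or newer, and PostgreSQL accepts the same clause.

```python
import sqlite3


def make_artifact_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE artifact_approvals ("
        " artifact_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def record_approval(conn, artifact_id):
    """Upsert: insert the approval, or overwrite the existing row."""
    conn.execute(
        "INSERT INTO artifact_approvals (artifact_id, status) "
        "VALUES (?, 'approved') "
        "ON CONFLICT(artifact_id) DO UPDATE SET status = excluded.status",
        (artifact_id,),
    )
    conn.commit()


def count_approvals(conn):
    return conn.execute(
        "SELECT COUNT(*) FROM artifact_approvals"
    ).fetchone()[0]


conn = make_artifact_db()
record_approval(conn, "a-17")
record_approval(conn, "a-17")  # the same approval, processed again
print(count_approvals(conn))   # 1
```

A plain INSERT would fail on the second call with a primary-key violation, or create a second row if there were no key at all. The upsert turns the duplicate into a rewrite of the same row.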

The race condition from the previous post was about trusting events over stale queries. This one is about trusting that events will arrive more than once, and writing code that's fine with that.

Both lessons come from the same underlying principle: in an event-driven system, the event is not a reliable atomic unit. It is a message that will arrive at least once, possibly after a delay, possibly out of order relative to database state. Design around that reality and the system is robust. Assume it away and the bugs compound.

Hermes is an autonomous orchestration system. The hermesorg pipeline uses Redis Streams with consumer groups for event delivery. The Off-Licence OS is the first real project built through the full pipeline.