2026-06-19

Transactional outbox on Postgres: how not to lose an event when the server crashes

Postgres transactional outbox: write events in the same transaction as business data, publish with a SKIP LOCKED worker, at-least-once delivery and consumer idempotency.

The transactional outbox pattern makes event emission (webhooks, notifications, queue messages) atomic with respect to the database: you write the event to an outbox table in the same transaction as business data, and a worker publishes it afterward. It is one of the patterns we use at Snowinch when a system cannot afford to lose orders, payments, or state updates, even through crashes, restarts, or network timeouts.

The problem that does not seem to exist until it does

You have an endpoint that receives an order. You save the order to the database, then send a message to the queue (or call a webhook, or publish an event). If the process crashes between the two operations, the order is saved but the event never left. No visible error, no automatic retry, just a silently lost event.

The opposite approach, emit the event before saving, has the same problem inverted: the event goes out but the save fails.

The critical point is that the two operations live on separate systems. The database has ACID transactions; the queue, broker, or external notification service does not participate in that transaction. There is no practical two-phase commit in this context, and even if there were you would not want to use it.

This is the pattern that shows up most often once a system grows past the prototype stage and starts handling data with real consequences.

The solution: write everything to the database, publish later

The transactional outbox works like this: instead of emitting the event directly to the external system, you write it to an outbox table in the same transaction as your business write. A separate worker reads the table and publishes events to the final destination.

Both operations become atomic with respect to the database, either both happen, or neither does. The worker can fail, time out, or restart: it picks up where it left off because unpublished events are still in the table.

Minimal schema

CREATE TABLE outbox (
  id            uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  created_at    timestamptz NOT NULL DEFAULT now(),
  published_at  timestamptz,                  -- null = pending
  topic         text NOT NULL,                -- e.g. 'order.created'
  payload       jsonb NOT NULL,
  attempts      int NOT NULL DEFAULT 0,
  last_error    text
);

A null published_at means the event is waiting. The worker sets it after confirmed publication. attempts and last_error support monitoring and backoff logic, they are not decorative.

Write inside the business transaction

// Single transaction: both writes succeed or neither does
await db.transaction(async (trx) => {
  const order = await trx('orders').insert({
    user_id: userId,
    total: orderTotal,
    status: 'confirmed',
  }).returning('*');

  await trx('outbox').insert({
    topic: 'order.created',
    payload: {
      order_id: order[0].id,
      user_id: userId,
      total: orderTotal,
    },
  });
});

The external system, queue, webhook, notification, is not touched here. The transaction stays short and does not depend on any external service being available.

The worker: SKIP LOCKED for safe concurrency

The worker must be idempotent and work correctly with multiple parallel instances. SKIP LOCKED is the Postgres primitive that lets multiple workers claim different rows without colliding or blocking each other.

-- Claim up to 10 unpublished events, skipping rows already being processed
SELECT * FROM outbox
WHERE published_at IS NULL
ORDER BY created_at ASC
FOR UPDATE SKIP LOCKED
LIMIT 10;

async function processOutbox() {
  await db.transaction(async (trx) => {
    const events = await trx.raw(`
      SELECT * FROM outbox
      WHERE published_at IS NULL
      ORDER BY created_at ASC
      FOR UPDATE SKIP LOCKED
      LIMIT 10
    `).then(r => r.rows);

    for (const event of events) {
      try {
        await publishToExternalSystem(event.topic, event.payload);

        await trx('outbox')
          .where({ id: event.id })
          .update({ published_at: new Date() });

      } catch (err) {
        // Do not rethrow: event stays queued, worker continues
        await trx('outbox')
          .where({ id: event.id })
          .update({
            attempts: trx.raw('attempts + 1'),
            last_error: String(err),
          });
      }
    }
  });
}

Important detail: FOR UPDATE SKIP LOCKED must live in the same transaction that updates published_at. If the worker transaction fails, the lock is released and the event remains available for the next cycle. Keep the external call as short as possible: you are holding a row lock until the destination responds.

Frequency and backoff

The worker runs on an interval, every 5 seconds, every 30, depending on acceptable latency for your system. It is not strictly a cron job: you can use a simple setInterval in a long-running Node process, or a scheduled job with the same logic described in the article on serverless cron with GitHub Actions if you prefer not to keep a process always on.

For events with many failed attempts, add a backoff condition:

WHERE published_at IS NULL
  AND (attempts = 0 OR created_at < now() - (attempts * interval '1 minute'))

This avoids hammering an event that keeps failing for a structural reason, a missing topic, malformed payload, or a permanently down external service.

Cleanup

Published events can accumulate. A reasonable retention window, 7 days, 30 days, domain-dependent, and a periodic cleanup job:

DELETE FROM outbox
WHERE published_at IS NOT NULL
  AND published_at < now() - interval '30 days';

Do not delete unpublished events on a time basis: if they are still there, there is a reason.

What this pattern does not solve

The outbox guarantees at-least-once delivery, not exactly-once. The destination system must handle duplicates, or you must implement idempotency on the consumer side. On that topic, the article on Stripe webhooks and idempotency covers the consumer side in detail.

The pattern does not replace a dedicated queue (RabbitMQ, SQS, Kafka) for high volume or complex fan-out requirements. It is the right fit for systems that already use Postgres as the primary database and do not want extra infrastructure just to make events reliable, which covers most cases up to a certain volume.

It does not cover the case where the external system requires exactly-once delivery with no duplicates and does not support idempotency keys. That needs more sophisticated coordination.

Operational summary

Write the event to the outbox table in the same transaction as business data, never as two separate operations.
Use FOR UPDATE SKIP LOCKED in the worker for safe concurrency across multiple instances.
Set published_at only after confirmed publication, inside the same transaction as the lock.
Handle failures with attempts and backoff, do not rethrow exceptions that stall the worker.
Implement consumer idempotency: the outbox gives at-least-once, not exactly-once.
Add a cleanup job for published events, the table must not grow forever.

Tell us your context, constraints, and goals: we will say whether working together makes sense and how to set up a first step.

Email hello@snowinch.com

Want to ship ideas like these into your product?