2026-06-19
Transactional outbox on Postgres: how not to lose an event when the server crashes
Postgres transactional outbox: write events in the same transaction as business data, publish with a SKIP LOCKED worker, at-least-once delivery and consumer idempotency.
The transactional outbox pattern makes event emission (webhooks, notifications, queue messages) atomic with respect to the database: you write the event to an outbox table in the same transaction as business data, and a worker publishes it afterward. It is one of the patterns we use at Snowinch when a system cannot afford to lose orders, payments, or state updates, even through crashes, restarts, or network timeouts.
The problem that does not seem to exist until it does
You have an endpoint that receives an order. You save the order to the database, then send a message to the queue (or call a webhook, or publish an event). If the process crashes between the two operations, the order is saved but the event never left. No visible error, no automatic retry, just a silently lost event.
The opposite approach, emit the event before saving, has the same problem inverted: the event goes out but the save fails.
The critical point is that the two operations live on separate systems. The database has ACID transactions; the queue, broker, or external notification service does not participate in that transaction. There is no practical two-phase commit in this context, and even if there were you would not want to use it.
This is the pattern that shows up most often once a system grows past the prototype stage and starts handling data with real consequences.
The solution: write everything to the database, publish later
The transactional outbox works like this: instead of emitting the event directly to the external system, you write it to an outbox table in the same transaction as your business write. A separate worker reads the table and publishes events to the final destination.
Both operations become atomic with respect to the database, either both happen, or neither does. The worker can fail, time out, or restart: it picks up where it left off because unpublished events are still in the table.
Minimal schema
CREATE TABLE outbox (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
created_at timestamptz NOT NULL DEFAULT now(),
published_at timestamptz, -- null = pending
topic text NOT NULL, -- e.g. 'order.created'
payload jsonb NOT NULL,
attempts int NOT NULL DEFAULT 0,
last_error text
);
A null published_at means the event is waiting. The worker sets it after confirmed publication. attempts and last_error support monitoring and backoff logic, they are not decorative.
Write inside the business transaction
// Single transaction: both writes succeed or neither does
await db.transaction(async (trx) => {
const order = await trx('orders').insert({
user_id: userId,
total: orderTotal,
status: 'confirmed',
}).returning('*');
await trx('outbox').insert({
topic: 'order.created',
payload: {
order_id: order[0].id,
user_id: userId,
total: orderTotal,
},
});
});
The external system, queue, webhook, notification, is not touched here. The transaction stays short and does not depend on any external service being available.
The worker: SKIP LOCKED for safe concurrency
The worker must be idempotent and work correctly with multiple parallel instances. SKIP LOCKED is the Postgres primitive that lets multiple workers claim different rows without colliding or blocking each other.
-- Claim up to 10 unpublished events, skipping rows already being processed
SELECT * FROM outbox
WHERE published_at IS NULL
ORDER BY created_at ASC
FOR UPDATE SKIP LOCKED
LIMIT 10;
async function processOutbox() {
await db.transaction(async (trx) => {
const events = await trx.raw(`
SELECT * FROM outbox
WHERE published_at IS NULL
ORDER BY created_at ASC
FOR UPDATE SKIP LOCKED
LIMIT 10
`).then(r => r.rows);
for (const event of events) {
try {
await publishToExternalSystem(event.topic, event.payload);
await trx('outbox')
.where({ id: event.id })
.update({ published_at: new Date() });
} catch (err) {
// Do not rethrow: event stays queued, worker continues
await trx('outbox')
.where({ id: event.id })
.update({
attempts: trx.raw('attempts + 1'),
last_error: String(err),
});
}
}
});
}
Important detail: FOR UPDATE SKIP LOCKED must live in the same transaction that updates published_at. If the worker transaction fails, the lock is released and the event remains available for the next cycle. Keep the external call as short as possible: you are holding a row lock until the destination responds.
Frequency and backoff
The worker runs on an interval, every 5 seconds, every 30, depending on acceptable latency for your system. It is not strictly a cron job: you can use a simple setInterval in a long-running Node process, or a scheduled job with the same logic described in the article on serverless cron with GitHub Actions if you prefer not to keep a process always on.
For events with many failed attempts, add a backoff condition:
WHERE published_at IS NULL
AND (attempts = 0 OR created_at < now() - (attempts * interval '1 minute'))
This avoids hammering an event that keeps failing for a structural reason, a missing topic, malformed payload, or a permanently down external service.
Cleanup
Published events can accumulate. A reasonable retention window, 7 days, 30 days, domain-dependent, and a periodic cleanup job:
DELETE FROM outbox
WHERE published_at IS NOT NULL
AND published_at < now() - interval '30 days';
Do not delete unpublished events on a time basis: if they are still there, there is a reason.
What this pattern does not solve
The outbox guarantees at-least-once delivery, not exactly-once. The destination system must handle duplicates, or you must implement idempotency on the consumer side. On that topic, the article on Stripe webhooks and idempotency covers the consumer side in detail.
The pattern does not replace a dedicated queue (RabbitMQ, SQS, Kafka) for high volume or complex fan-out requirements. It is the right fit for systems that already use Postgres as the primary database and do not want extra infrastructure just to make events reliable, which covers most cases up to a certain volume.
It does not cover the case where the external system requires exactly-once delivery with no duplicates and does not support idempotency keys. That needs more sophisticated coordination.
Operational summary
- Write the event to the
outboxtable in the same transaction as business data, never as two separate operations. - Use
FOR UPDATE SKIP LOCKEDin the worker for safe concurrency across multiple instances. - Set
published_atonly after confirmed publication, inside the same transaction as the lock. - Handle failures with
attemptsand backoff, do not rethrow exceptions that stall the worker. - Implement consumer idempotency: the outbox gives at-least-once, not exactly-once.
- Add a cleanup job for published events, the table must not grow forever.
Tell us your context, constraints, and goals: we will say whether working together makes sense and how to set up a first step.