The 200 OK that silently ate your events.
Your SDK sends an event. The ingest endpoint returns 200 OK. The dashboard is empty. This is not a hypothetical. It is one of the most common failure modes in error tracking infrastructure, and it is the one operators discover last, because every layer in the chain reports success.
20 seconds. A 200 OK from an error-tracking ingest endpoint means the server received your payload. It does not mean the event was written to storage. Five distinct failure modes can produce a 200 with no stored event: rate limits that return 200 instead of 429, malformed payloads that pass schema validation but get dropped downstream, SQLite WAL frames that lose a race against a checkpoint, in-memory queues that shed load under backpressure, and reverse proxy timeouts that kill large payloads before they reach the ingest process.
60 seconds. Each failure mode has a different cause, a different detection method, and a different fix. Sentry’s rate-limit headers are not surfaced by default SDK logging. The SQLite WAL race is a consequence of OS write buffering. In-memory queues drop events silently when the high-water mark is crossed. The nginx 1 MB body limit kills breadcrumb-heavy browser events. Individually, each is fixable. Together, they explain why the absence of an event in your dashboard is a question, not an answer. This guide walks through all five, shows you how to detect each one, and describes the patterns that make event loss harder to hit regardless of which tracker you run.
urgentry changes the default posture on several of these: it writes synchronously before returning 200, returns 429 when rate-limited rather than 200, and runs an fsyncing WAL checkpoint after each write. But any error tracker, including urgentry, can still lose events to the failure modes that live outside the tracker itself, namely the reverse proxy, the client SDK, and the network. This guide covers both sides.
The first time you notice
It usually starts with a known crash. Someone on your team ships a regression, you catch it in staging, and you check the error tracker to see how many users hit it in production. Nothing. The issue does not exist. No events, no count, no breadcrumbs.
Your first instinct is the SDK. You check the DSN. It is correct. You check the project. It is the right one. You add a debug log to the SDK and watch it report a successful send. The response: 200 OK.
The dashboard is still empty.
You add a test event manually. You send it with curl using the same DSN key. The response is 200 OK. You wait. You reload. You wait again. Nothing.
This is the moment where most operators go looking in the wrong direction. The 200 told you the server accepted the payload. But acceptance is not storage, and the gap between those two things is where events go to die.
Why 200 OK does not mean stored
The ingest path for an error event is longer than it looks from the outside. The SDK sends an HTTP POST to the ingest endpoint. The endpoint receives the payload, parses it, validates it to some degree, and returns a response. So far so fast. What happens next depends entirely on the architecture behind that endpoint.
In most production-grade error tracking stacks, the endpoint returns its response before the event reaches permanent storage. The payload goes into a queue, a channel, or a ring buffer. A worker downstream pulls from that buffer and performs the actual write. The response was already sent. The write is pending.
This design is rational. It decouples ingest throughput from write latency. It allows the receiver to handle bursts without blocking on disk IO. It lets the system absorb a slow database write without making the SDK wait.
The cost of that design is that the 200 is not a write receipt. It is an acceptance receipt. The two are different things, and the gap between them is where all five failure modes below live.
Even a synchronous-write architecture, where the event is written to storage before the response goes out, has downstream risks: the reverse proxy can time out before the write completes, the payload can fail schema validation after transport, or the write can commit but not reach the right table due to a migration state mismatch. The 200 is evidence that the transport succeeded. It is not evidence that the event is in your database.
Failure mode 1: rate limits returning 200
Sentry’s ingest protocol has a rate-limiting mechanism that is genuinely confusing. When your project exceeds its ingest quota, Sentry does not return 429. It returns 200 OK with a response header that signals the rate limit: X-Sentry-Rate-Limits.
The header looks like this:
X-Sentry-Rate-Limits: 60:error:organization, 2700:transaction:organization
That tells the SDK: you are rate-limited on the error category for 60 seconds, and on the transaction category for 2700 seconds. The event was not stored. The response was still 200.
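If you capture raw ingest responses, a small parser makes the header easier to alert on. The sketch below assumes the colon-separated retry_after:categories:scope layout shown above. It also assumes that several categories in one entry are separated by semicolons and that an empty category field means every category, which is how the Sentry rate-limiting specification describes the header; verify both against the version you run. The names RateLimit, parse_rate_limits, and is_dropping are illustrative, not part of any SDK.

```python
from typing import NamedTuple


class RateLimit(NamedTuple):
    retry_after_s: int
    categories: tuple[str, ...]  # empty tuple means "all categories" (assumption)
    scope: str


def parse_rate_limits(header: str) -> list[RateLimit]:
    """Parse an X-Sentry-Rate-Limits value into structured entries."""
    limits = []
    for part in header.split(","):
        fields = part.strip().split(":")
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip entries this sketch does not understand
        categories = tuple(c for c in fields[1].split(";") if c)
        limits.append(RateLimit(int(fields[0]), categories, fields[2]))
    return limits


def is_dropping(header: str, category: str) -> bool:
    """True if the header says events in `category` are currently limited."""
    return any(
        not limit.categories or category in limit.categories
        for limit in parse_rate_limits(header)
    )


limits = parse_rate_limits("60:error:organization, 2700:transaction:organization")
for limit in limits:
    print(limit)
```

Feeding is_dropping(header, "error") into your own metrics turns the quiet header into a signal you can alert on.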
The SDK understands this header and backs off. But the default SDK log level does not surface it as a dropped event. It surfaces as a successful send followed by a quiet backoff period. If you are not monitoring SDK transport metrics or inspecting raw HTTP responses, you do not know the rate limit is active. You see a healthy-looking SDK and an empty dashboard.
This failure mode is common on self-hosted Sentry deployments where the rate limit configuration was set conservatively during setup and never revisited as event volume grew. It is also common on SaaS plans where overage behavior was set to “drop” rather than “bill.” The response is 200. The event is gone.
urgentry returns 429 with a Retry-After header when rate-limited. The 429 surfaces as an error in SDK logs, in monitoring dashboards, and in any network inspection tool you are running. The difference between a quiet 200 and a loud 429 is the difference between a silent failure and a visible one.
Failure mode 2: malformed payloads that pass schema
The Sentry SDK serializes events in a best-effort mode. When a field value is not serializable, the SDK typically omits the field rather than aborting the send. The payload arrives at the ingest endpoint, passes the top-level schema check (it is valid JSON, it has the required envelope headers, the event type is recognized), and gets accepted with a 200.
The problem surfaces downstream. The event processor reads the payload, encounters a field type it did not expect, and either drops the field or drops the entire event depending on the strictness of the downstream validation. The 200 was already sent. The processing failure is not reported back to the SDK.
One concrete version of this: encoding mismatches in stack frame data. The SDK on some runtimes serializes line numbers as strings rather than integers when the source is a transpiled bundle with a corrupt source map. The ingest envelope looks fine. The stack frame processor rejects the frame-level data, silently. The event appears to arrive, and then either surfaces as a frameless error or gets bucketed incorrectly and is invisible in the issue list you were expecting it to join.
Another version: the SDK sends a user field whose id value is not a string ({"id": {"nested": "object"}} rather than {"id": "abc123"}). The envelope passes. The user context gets dropped. The event stores, but the PII scrubbing pipeline, which runs after ingest, may also behave unexpectedly on that shape of data, depending on the scrubber version.
You cannot prevent all encoding edge cases at the SDK layer. You can detect this class of failure by comparing the count of events the SDK reports sending against the count your tracker reports receiving. The gap is the field-level drop rate.
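You can also look for the two shapes described above before an event leaves the process. This is a minimal sketch, not a full schema validator: find_shape_problems is a hypothetical helper, and it assumes the usual Sentry event layout where stack frames live under exception.values[].stacktrace.frames[]. It only reports problems; what to do with them (log, count, repair) is up to you.

```python
def find_shape_problems(event: dict) -> list[str]:
    """Return human-readable notes on fields whose types look wrong."""
    problems = []

    user_id = (event.get("user") or {}).get("id")
    if user_id is not None and not isinstance(user_id, (str, int)):
        problems.append(f"user.id is {type(user_id).__name__}, expected a scalar")

    for exc in (event.get("exception") or {}).get("values", []):
        frames = (exc.get("stacktrace") or {}).get("frames", [])
        for index, frame in enumerate(frames):
            lineno = frame.get("lineno")
            # bool is a subclass of int in Python, so rule it out explicitly
            if lineno is not None and (
                isinstance(lineno, bool) or not isinstance(lineno, int)
            ):
                problems.append(
                    f"frame {index}: lineno is {type(lineno).__name__}, expected int"
                )
    return problems


bad_event = {
    "user": {"id": {"nested": "object"}},
    "exception": {
        "values": [{"stacktrace": {"frames": [{"lineno": "42"}, {"lineno": 7}]}}]
    },
}
problems = find_shape_problems(bad_event)
for problem in problems:
    print(problem)
```

A check like this fits naturally in a before-send hook, where a non-empty result can increment a counter without blocking the send.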
Failure mode 3: write-after-WAL-truncate
This one is specific to SQLite deployments. It is also the most likely to permanently destroy data without producing any visible error.
SQLite in WAL mode does not write directly to the main database file. Each new write goes to the write-ahead log (the .db-wal file). A checkpoint process periodically moves committed WAL frames into the main database file and truncates the WAL. Under normal operation, this is invisible. Under specific timing, it produces event loss.
The race: the tracker writes an event to the WAL, returns 200 OK, and the event is now in the WAL but not yet in the main database file. A checkpoint begins. The OS write buffer has not yet fsynced the WAL frame. The host goes down hard (a forced VPS reboot, a kernel panic, a spot instance termination) before the fsync completes. The WAL frame was in a kernel buffer, not on disk. It is lost. A kill of the tracker process alone, including an OOM kill, does not discard data the kernel has already buffered; the loss requires the machine itself to stop.
What you see: the tracker returned 200. The SDK logged a successful send. The event is not in the database after the restart. No error message. No crash log that points to the event. The tracker comes back up healthy. The event is simply absent.
The conditions that increase this risk: OS-level write buffering that is not tuned to flush frequently, a WAL checkpoint that runs infrequently (so the WAL is long and unflushed data accumulates), and anything that takes the host down without a clean shutdown. A $5 VPS on a shared host with no UPS protection, rebooted by the provider without warning, is a perfect environment for this failure mode.
urgentry uses PRAGMA synchronous = FULL and triggers an fsync at the WAL checkpoint boundary after each write commit. This does not eliminate all possible loss from a hardware failure mid-write, but it closes the OS-buffer race that accounts for most real instances of this failure.
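For readers who want to see what those settings look like from application code, here is a minimal sketch using Python's standard sqlite3 module. It is not urgentry's implementation; the events table, the open_durable helper, and the temporary path are illustrative assumptions. The point it shows is that journal_mode=WAL persists in the database file, while synchronous has to be set on every connection.

```python
import os
import sqlite3
import tempfile


def open_durable(path: str) -> sqlite3.Connection:
    """Open a SQLite connection in WAL mode with synchronous=FULL."""
    conn = sqlite3.connect(path)
    # journal_mode=WAL is recorded in the database file and persists.
    conn.execute("PRAGMA journal_mode=WAL")
    # synchronous is per-connection: set it on every new connection.
    conn.execute("PRAGMA synchronous=FULL")
    return conn


db_path = os.path.join(tempfile.mkdtemp(), "events.db")
conn = open_durable(db_path)
conn.execute(
    "CREATE TABLE IF NOT EXISTS events (event_id TEXT PRIMARY KEY, payload TEXT)"
)
with conn:  # commits on success; with FULL, the WAL is synced at commit
    conn.execute("INSERT INTO events VALUES (?, ?)", ("abc123", "{}"))

journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
synchronous = conn.execute("PRAGMA synchronous").fetchone()[0]
stored = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(journal_mode, synchronous, stored)
conn.close()
```

A handler built this way should send its 200 only after the with block exits, so the response follows the synced commit rather than preceding it.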
Failure mode 4: in-memory queue drops on the receiver
Any system that uses an in-memory buffer between ingest and storage has a high-water mark. When events arrive faster than the downstream writer can process them, the buffer fills. When the buffer is full, new arrivals have two choices: block the sender (which destroys ingest throughput) or drop the event (which preserves throughput at the cost of data). Almost every production ingest system chooses to drop.
The 200 goes out before the event enters the queue. The drop happens after. The SDK sees a successful send. The queue shed the event at the high-water mark.
This failure mode is most visible during error storms. Your service throws an exception in a tight loop. A thousand events arrive in a second. The ingest endpoint returns 200 for all thousand, but the buffer only has room for the first hundred. The next nine hundred overflow the buffer and are dropped. Your dashboard shows a hundred events. Your actual exception count is a thousand. The tracker appears to be working; it just stopped counting.
The metric that should alert on this, but often does not, is the queue depth gauge on the ingest process. When that gauge sits at or near the high-water mark, drops are happening. But the queue depth gauge is often not exposed to the monitoring system that watches the error tracker, because operators tend not to instrument the error tracker itself. You are monitoring your application with the tracker. You are not monitoring the tracker.
urgentry has no in-memory accept queue between the HTTP handler and the storage write. The write is synchronous on the request path. This means urgentry can backpressure the SDK directly (and return 503 under severe overload) rather than accepting events and silently dropping them.
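The difference between the two designs is easy to simulate. The sketch below is a toy model, not any tracker's real code: a bounded queue.Queue stands in for the ingest buffer, a plain list with a capacity stands in for a writer that cannot keep up, and the function names are hypothetical. It replays the thousand-event storm against both designs.

```python
import queue


def accept_then_enqueue(events, buffer):
    """Async-style ingest: reply 200 first, then try to enqueue."""
    statuses, dropped = [], 0
    for event in events:
        statuses.append(200)  # response goes out before the enqueue attempt
        try:
            buffer.put_nowait(event)
        except queue.Full:
            dropped += 1  # shed at the high-water mark; the sender never hears
    return statuses, dropped


def write_then_respond(events, store, capacity):
    """Sync-style ingest: reply only after the write succeeds or is refused."""
    statuses = []
    for event in events:
        if len(store) < capacity:
            store.append(event)
            statuses.append(200)
        else:
            statuses.append(503)  # overload is visible to the sender
    return statuses


storm = [f"event-{i}" for i in range(1000)]

buffer = queue.Queue(maxsize=100)
async_statuses, dropped = accept_then_enqueue(storm, buffer)
print(async_statuses.count(200), buffer.qsize(), dropped)  # 1000 100 900

store = []
sync_statuses = write_then_respond(storm, store, capacity=100)
print(sync_statuses.count(200), sync_statuses.count(503))  # 100 900
```

Both designs keep a hundred events. Only the second one tells the sender about the other nine hundred.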
Failure mode 5: the reverse proxy timeout
nginx defaults proxy_read_timeout to 60 seconds and client_max_body_size to 1 MB. Both are relevant to error event ingest, and both bite operators who do not configure them explicitly.
The 1 MB body limit: a Sentry event from a browser SDK with a full breadcrumb trail (100 entries of network requests, console logs, and click events), large request body capture, and user context can exceed 1 MB easily. The SDK sends the payload. nginx measures the incoming body, hits the limit, and returns 413 Request Entity Too Large. The SDK receives 413, logs a failed send, and either retries or discards the event.
The 413 is at least visible: the SDK logs an error. The silent version involves the proxy timeouts. The SDK sends a large payload over a slow connection. nginx timeouts apply between successive reads or writes rather than to the whole transfer, so the connection is at risk whenever the upload stalls for more than 60 seconds (client_body_timeout, which governs the client upload, also defaults to 60 seconds) or the upstream takes more than 60 seconds to respond. nginx closes the connection. The SDK may log a connection reset, but it may also retry the send and exhaust its retry budget before the app session ends, with no error surfaced to the application layer.
The configuration change:
# /etc/nginx/sites-available/urgentry
# (ssl_certificate and ssl_certificate_key directives omitted for brevity)
server {
    listen 443 ssl;
    server_name errors.example.com;

    # Increase body limit for event payloads with large breadcrumb sets
    client_max_body_size 20M;

    # Give slow clients more time between successive reads of the request body
    client_body_timeout 120s;

    # Extend upstream timeouts (each applies between successive reads or writes)
    proxy_read_timeout 120s;
    proxy_send_timeout 120s;

    location / {
        proxy_pass http://127.0.0.1:9000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
This is a silent failure mode because, from the error tracker’s perspective, the event never arrived. The tracker logs show no request. The tracker metrics show no ingest. The problem lives entirely in the proxy layer, and it appears only in the nginx logs, which you are probably not watching.
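To find out whether your own events are anywhere near the body limit, measure them before they leave the process. The sketch below is an estimate under stated assumptions: it measures the uncompressed JSON size, so an SDK that compresses request bodies will send less, and the breadcrumb contents are made up for illustration. NGINX_DEFAULT_BODY_LIMIT reflects the 1 MB default discussed above.

```python
import json

NGINX_DEFAULT_BODY_LIMIT = 1024 * 1024  # nginx client_max_body_size default (1m)


def body_size(event: dict) -> int:
    """Size in bytes of the event once serialized as UTF-8 JSON."""
    return len(json.dumps(event).encode("utf-8"))


def exceeds_limit(event: dict, limit: int = NGINX_DEFAULT_BODY_LIMIT) -> bool:
    """True if the serialized event would not fit under `limit` bytes."""
    return body_size(event) > limit


# Hypothetical event: 100 breadcrumbs, each carrying about 12 KB of captured data.
breadcrumb = {
    "category": "xhr",
    "message": "GET /api/items",
    "data": {"body": "x" * 12_000},
}
event = {"message": "checkout failed", "breadcrumbs": [breadcrumb] * 100}

print(body_size(event), exceeds_limit(event))
```

Recording body_size as a histogram in your own metrics shows how close real traffic gets to whatever limit the proxy enforces.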
How to detect each one
Detection requires instrumentation at multiple layers. Here are the four queries and checks you should run weekly.
Query 1: compare SDK send count to tracker event count
Most Sentry SDKs expose transport metrics. The Python SDK exposes its transport through sentry_sdk.Hub.current.client.transport (sentry_sdk.get_client().transport in SDK 2.x, where Hub is deprecated); the JavaScript SDK sends client reports, controlled by the sendClientReports option. Enable client reports and emit your own send counts to a secondary destination (a simple logging endpoint or your application’s own metrics system):
// Enable Sentry client reports (JavaScript SDK)
Sentry.init({
  dsn: "https://your-dsn@errors.example.com/1",
  sendClientReports: true, // default true in recent SDK versions
  beforeSend(event) {
    // Count sends in your own metrics system for comparison
    myMetrics.increment("sentry.event.sent", { type: event.type || "error" });
    return event;
  },
});
Then query your tracker for the count of events in the same window:
-- urgentry: count events received in the last 24 hours
SELECT
    DATE(received_at) AS day,
    COUNT(*) AS event_count
FROM events
WHERE received_at >= datetime('now', '-24 hours')
GROUP BY DATE(received_at)
ORDER BY day DESC;
A persistent gap between what the SDK reports sending and what the tracker reports receiving is your signal. Any gap above 1% over a 24-hour window warrants investigation.
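The comparison itself is a one-line ratio, but it is worth writing down so the threshold lives in one place. A minimal sketch, with hypothetical function names and the 1% threshold from the paragraph above as the default:

```python
def loss_ratio(sent: int, received: int) -> float:
    """Fraction of sent events the tracker never reported storing."""
    if sent <= 0:
        return 0.0
    return max(sent - received, 0) / sent


def needs_investigation(sent: int, received: int, threshold: float = 0.01) -> bool:
    """True when the gap over the window exceeds the threshold."""
    return loss_ratio(sent, received) > threshold


print(needs_investigation(sent=10_000, received=9_950))  # False (0.5% gap)
print(needs_investigation(sent=10_000, received=9_800))  # True (2% gap)
```

Compare counts over the same 24-hour window on both sides, and expect a small gap from events still in flight at the window boundary.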
Query 2: check for the rate-limit header in ingest responses
Run a test send and inspect the response headers directly:
# Send a test event and capture response headers
curl -si \
  -X POST "https://errors.example.com/api/1/envelope/" \
  -H "X-Sentry-Auth: Sentry sentry_version=7,sentry_key=your-dsn-key" \
  -H "Content-Type: application/x-sentry-envelope" \
  --data-binary @test-envelope.txt \
  | grep -iE "x-sentry-rate-limits|x-sentry-error|retry-after|HTTP/"
On urgentry, a rate-limited response looks like:
HTTP/2 429
Retry-After: 60
Content-Type: application/json
On Sentry (self-hosted or SaaS), when rate-limited, the response looks like:
HTTP/2 200
X-Sentry-Rate-Limits: 60:error:organization
If you see the X-Sentry-Rate-Limits header with a non-empty value, your events are being dropped. Raise the rate limit in the Sentry admin panel or reduce your ingest volume.
Query 3: check the WAL checkpoint state on SQLite deployments
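-- Check WAL mode and checkpoint state on urgentry's SQLite database
-- Run via: sqlite3 /var/lib/urgentry/urgentry.db
PRAGMA journal_mode;
-- Expected: wal
PRAGMA wal_checkpoint(PASSIVE);
-- Returns one row: busy, log, checkpointed
--   busy: 0, or 1 if the checkpoint was blocked from completing
--   log: frames currently in the WAL; checkpointed: frames moved to the main file
-- If log stays well above checkpointed, checkpoints are not keeping up with writes
PRAGMA synchronous;
-- 2 means FULL; 0 (OFF) or 1 (NORMAL) puts recent WAL frames at risk on power loss
-- Caveat: synchronous is a per-connection setting, so this shows the value for
-- this sqlite3 session, not for the running urgentry process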
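If you would rather collect the checkpoint numbers from a script than from the sqlite3 shell, the same pragma works through any SQLite binding. A small sketch using Python's standard sqlite3 module follows; checkpoint_status and the backlog field are illustrative names, and the demo runs against a throwaway database rather than a real tracker file.

```python
import os
import sqlite3
import tempfile


def checkpoint_status(conn: sqlite3.Connection) -> dict:
    """Run a passive checkpoint and label the three result columns."""
    busy, wal_frames, checkpointed = conn.execute(
        "PRAGMA wal_checkpoint(PASSIVE)"
    ).fetchone()
    return {
        "blocked": bool(busy),
        "wal_frames": wal_frames,
        "checkpointed_frames": checkpointed,
        "backlog": wal_frames - checkpointed,
    }


# Demo against a throwaway database (the path is illustrative).
conn = sqlite3.connect(os.path.join(tempfile.mkdtemp(), "demo.db"))
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE events (event_id TEXT)")
with conn:
    conn.execute("INSERT INTO events VALUES ('abc123')")

status = checkpoint_status(conn)
print(status)
conn.close()
```

A backlog that keeps growing between runs is the number to graph; a single non-zero reading on a busy database is normal.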
Query 4: watch the nginx access log for 413s and timeouts
# Count 413 responses in the current access log (assumes the default "combined" format)
awk '$9 == "413"' /var/log/nginx/access.log | wc -l
# Find request sizes close to the limit. In the "combined" format, field 10 is the
# response size ($body_bytes_sent), not the request size, so this assumes a custom
# log_format with $request_length appended as the last field.
awk '$NF+0 > 900000' /var/log/nginx/access.log | tail -20
# Look for upstream timed-out connections in the error log
grep "upstream timed out" /var/log/nginx/error.log | tail -20
The canary event pattern
The most reliable detection method is a canary event sent on a fixed schedule. A canary is a synthetic event you send yourself, through the same ingest endpoint and the same network path as your real events, with a known unique identifier. After sending it, you query the tracker’s API to confirm it arrived.
#!/bin/bash
# canary-check.sh: send a synthetic event and verify it arrived
# Run from cron every 15 minutes

CANARY_ID=$(openssl rand -hex 16)
DSN_ENDPOINT="https://errors.example.com/api/1/store/"
DSN_KEY="your-dsn-public-key"

# Send the canary event
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  -X POST "$DSN_ENDPOINT" \
  -H "X-Sentry-Auth: Sentry sentry_version=7,sentry_key=$DSN_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"event_id\": \"$CANARY_ID\",
    \"platform\": \"other\",
    \"level\": \"info\",
    \"message\": \"canary-$CANARY_ID\",
    \"tags\": {\"canary\": \"true\"}
  }")
echo "Canary send: HTTP $HTTP_STATUS, event_id $CANARY_ID"

# Wait 30 seconds for ingest and storage to complete
sleep 30

# Query the tracker API to confirm arrival
FOUND=$(curl -s \
  -H "Authorization: Bearer your-auth-token" \
  "https://errors.example.com/api/0/projects/your-org/your-project/events/$CANARY_ID/" \
  | jq -r '.eventID // empty')

if [ -z "$FOUND" ]; then
  echo "ALERT: canary event $CANARY_ID not found after 30s"
  # Page your on-call rotation here
  exit 1  # non-zero exit so cron wrappers and monitors can see the failure
else
  echo "OK: canary event $CANARY_ID confirmed in tracker"
fi
A canary that goes missing after a 200 OK catches every failure mode in this guide. Set it up. Alert on it. Run it in the same environment as your real SDKs so it traverses the same reverse proxy, the same network path, and the same authentication layer.
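A fixed sleep works, but polling with a deadline confirms fast ingests sooner and tolerates slow ones. The sketch below shows only the polling logic and is an illustration, not a drop-in replacement for the script: wait_for_event is a hypothetical helper, and fetch stands in for whatever call retrieves the event by ID from your tracker's API. The demo uses a fake fetch so it runs without a network.

```python
import time
from typing import Callable, Optional


def wait_for_event(
    fetch: Callable[[str], Optional[dict]],
    event_id: str,
    timeout_s: float = 30.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `fetch(event_id)` until it returns an event or the timeout passes."""
    waited = 0.0
    while True:
        if fetch(event_id) is not None:
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Stand-in for the tracker API call: the event "appears" on the third lookup.
calls = {"n": 0}


def fake_fetch(event_id: str) -> Optional[dict]:
    calls["n"] += 1
    return {"eventID": event_id} if calls["n"] >= 3 else None


found = wait_for_event(fake_fetch, "abc123", sleep=lambda s: None)
print(found, calls["n"])  # True 3
```

Passing sleep as a parameter keeps the loop testable; in production the default time.sleep applies.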
What urgentry does to make this harder to hit
Not all of these failure modes are things any single tracker can prevent. But several of them reflect deliberate choices about where to put the write in the request lifecycle.
Synchronous write before 200. urgentry writes the event to SQLite or Postgres before it sends the 200 OK response. If the write fails, the response is an error. The 200 means the row exists. This eliminates the gap between acceptance and storage that the in-memory queue design produces.
429 on rate limit, not 200. When urgentry rate-limits a project or organization, it returns 429 with a Retry-After header. The event was not stored. The response says so. The SDK and any network monitoring tool will surface this as a visible failure rather than a quiet header that most operators never read.
WAL checkpoint with fsync. urgentry runs SQLite with PRAGMA synchronous = FULL. Each WAL write goes to disk before the write is considered committed. The OS write buffer race that causes write-after-WAL-truncate loss is closed. The tradeoff is write latency: synchronous writes are slower than buffered writes, but the measured overhead at 400 events/sec on a $5 VPS is within the performance envelope for every team this architecture targets.
No in-memory accept queue. The HTTP handler writes synchronously. Under severe overload, urgentry returns 503 rather than silently dropping events into a full buffer. The 503 is visible: the SDK logs it, your monitoring system sees it, and you know the tracker is under pressure. Compare this to a quiet high-water-mark drop that looks identical to successful ingest from the outside.
What urgentry cannot prevent: the reverse proxy timeout, the SDK encoding edge case, and the network-level loss. Those live outside the tracker and require the proxy configuration and the canary pattern described above.
The patterns that work regardless of tool
These practices reduce error tracker silent failures independent of whether you run urgentry, self-hosted Sentry, or the SaaS product.
Canary events on a fixed schedule. Covered above. This is the one pattern that catches every failure mode in this guide. Build it, run it in the same environment as your real SDKs, and alert on it.
End-to-end synthetic tests. Distinct from canary events: these are full SDK-initiated sends from a staging or synthetic environment. They test the SDK initialization, DSN lookup, event serialization, and transport layer together. A canary tests the ingest endpoint directly. A synthetic test tests the whole chain.
Log the SDK’s queue depth and send count. Every major SDK exposes internal metrics or hooks. Use beforeSend to count sends. Use the SDK’s transport client report or similar mechanism to count drops. Feed both numbers into your application’s own metrics system so they are visible in the same dashboards as your application health.
Alert on drop_to_zero. The specific alert that catches the worst class of silent failure: when your event tracker has recorded events for a project continuously for more than 7 days, and then records zero events for any 24-hour window during business hours, alert. The absence of events when you expect events is a signal. It can mean your application is perfectly healthy (unlikely during most active development periods), or it can mean your ingest pipeline is silent-failing. Either way, it deserves a look.
-- drop_to_zero alert query: projects that had events yesterday but none today
-- Run daily as a cron job and alert if any rows return
SELECT
    p.slug AS project,
    yesterday.event_count AS yesterday_count,
    COALESCE(today.event_count, 0) AS today_count
FROM projects p
JOIN (
    SELECT project_id, COUNT(*) AS event_count
    FROM events
    WHERE received_at >= datetime('now', '-48 hours')
      AND received_at < datetime('now', '-24 hours')
    GROUP BY project_id
) yesterday ON p.id = yesterday.project_id
LEFT JOIN (
    SELECT project_id, COUNT(*) AS event_count
    FROM events
    WHERE received_at >= datetime('now', '-24 hours')
    GROUP BY project_id
) today ON p.id = today.project_id
WHERE COALESCE(today.event_count, 0) = 0
  AND yesterday.event_count > 10; -- ignore projects with trivial volume
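The query above compares only two days. If you already export per-day counts, the fuller rule (steady traffic for a week, then silence) fits in a few lines. This is a sketch under assumptions: should_alert is a hypothetical name, daily_counts is ordered oldest first, and the volume floor of 10 mirrors the query's filter.

```python
def should_alert(
    daily_counts: list[int], history_days: int = 7, min_volume: int = 10
) -> bool:
    """daily_counts is oldest-first; the last entry is the most recent day."""
    if len(daily_counts) < history_days + 1:
        return False  # not enough history to call the silence unusual
    *history, latest = daily_counts[-(history_days + 1):]
    had_steady_traffic = all(count > min_volume for count in history)
    return had_steady_traffic and latest == 0


print(should_alert([120, 98, 143, 110, 87, 131, 105, 0]))   # True
print(should_alert([120, 98, 0, 110, 87, 131, 105, 0]))     # False: history has a gap
print(should_alert([120, 98, 143, 110, 87, 131, 105, 42]))  # False: still reporting
```

Requiring a full week of history keeps new and intermittent projects from paging anyone.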
Configure your reverse proxy explicitly. Do not rely on nginx defaults. Set client_max_body_size to at least 20 MB for ingest endpoints. Set proxy_read_timeout to at least 120 seconds. Review the nginx error log weekly. These take five minutes to configure and they close the proxy failure modes permanently.
The lesson
Your error tracker can be wrong. Not wrong in the sense of misclassifying errors or grouping them incorrectly, but wrong in the sense of simply not having the event at all. The 200 OK is not a write receipt. The empty dashboard is not evidence that no error occurred. Absence of data in your error tracker is a question: did this really not happen, or did something in the ingest path eat the event?
The operators who get burned by silent ingest failures are almost always the ones who skipped the last 10% of the setup: configuring the proxy correctly, enabling SDK transport metrics, running a canary. The tracker is running. The SDK is initialized. It looks like it works. The assumption becomes that it does.
The assumption is wrong often enough to matter. You ship a regression. The tracker is silent. You spend three hours debugging a production issue with no error context, because the error context was silently dropped six hours ago on a 200 OK.
Build the canary. Run the proxy configuration. Check the rate limit headers once. After that, you can trust the silence. Before that, you are trusting a 200 OK.
Frequently asked questions
Can a 200 OK from my error tracker mean the event was dropped?
Yes, in several distinct ways. Rate limits that return 200 with a header instead of 429, in-memory queues that drop events after returning 200, and payload validation failures that happen after the response was sent can all produce this pattern. urgentry closes the most common of these by writing synchronously before returning 200 and returning 429 when rate-limited.
How do I catch Sentry’s rate-limit silent drops?
Inspect the X-Sentry-Rate-Limits response header on ingest responses. When this header is present with a non-empty value, Sentry accepted the payload but did not store it. The default SDK does not surface this header in its own logs. Use curl to inspect raw responses, or enable SDK client reports and emit them to a secondary metrics system.
What is the SQLite WAL checkpoint race and how does it cause data loss?
SQLite in WAL mode writes to a write-ahead log file before checkpointing frames into the main database. If the host loses power or is hard-reset between a WAL write and the fsync that makes it durable, the frame may be lost. urgentry runs with PRAGMA synchronous = FULL to force an fsync on each WAL commit, which closes this race at the cost of slightly higher write latency.
What nginx settings should I change for an error tracker ingest endpoint?
Set client_max_body_size to at least 20 MB (default is 1 MB), and set both proxy_read_timeout and proxy_send_timeout to at least 120 seconds (default is 60 seconds). The 1 MB default kills large breadcrumb payloads. The 60-second timeout kills large uploads on slow connections. Neither failure produces a useful error on the tracker side, because the request never arrives.
How do I set up a canary event check?
Send a synthetic event through the real ingest path every 15 minutes, with a known unique identifier (a UUID you generate yourself). After a fixed delay (30 seconds is enough for most deployments), query the tracker’s API for that event ID. If it is missing, alert. The canary traverses the full ingest path, including your reverse proxy, your DSN key, and your storage layer, and catches every failure mode described in this guide.
Sources and further reading
- Sentry SDK client reports specification — the SDK-side mechanism for tracking dropped events by category, including rate-limit-based drops.
- Sentry rate-limiting headers specification — semantics of X-Sentry-Rate-Limits and Retry-After, including the 200 OK vs 429 behavior difference between Sentry and the protocol spec.
- SQLite WAL mode documentation — the write-ahead log design, checkpoint mechanics, and the synchronous pragma interaction with WAL frame durability.
- nginx client_max_body_size directive — default value (1 MB), interaction with 413 response, and configuration scope (server vs location block).
- urgentry compatibility matrix — the 218/218 Sentry API operation coverage, ingest protocol conformance, and SDK compatibility details referenced in this guide.
- FSL-1.1-Apache-2.0 license — the Functional Source License under which urgentry is distributed; converts to Apache 2.0 after two years from each release date.
Want an error tracker that returns 429 when it means it?
urgentry writes events synchronously before returning 200, returns 429 on rate limits rather than a silent header, and runs with WAL fsync enabled by default. The full ingest behavior is documented in the compatibility matrix. A single binary, 52 MB resident at 400 events/sec, runs on a $5 VPS with no additional services.