Reading urgentry benchmarks honestly.
Urgentry publishes a number: 400 events per second at 52 MB resident memory on a cheap VPS. That number was measured by the urgentry team, on hardware we chose, with a workload we designed. This guide explains exactly what the benchmark covers, what it leaves out, how to reproduce it yourself, and how to apply the same scrutiny to every other number in this space — including ours.
20 seconds. The 400 ev/s figure is a sustained-load benchmark on a 1-vCPU / 1 GB VPS running urgentry with SQLite. The load generator is open. The test ran for 30 minutes. It measures ingest throughput and RSS; it does not measure concurrent UI queries, the symbolicator path, or burst behavior. Every benchmark a vendor publishes was designed by that vendor. Treat all of them, including this one, as a starting point, not a conclusion.
60 seconds. There are five things you need before you trust a benchmark number: reproducibility, payload realism, measurement window length, p99 latency instead of p50, and a documented failure mode at 1.5x load. The urgentry benchmark satisfies the first three and partially satisfies the fourth. It does not yet document failure at 600 ev/s on the Tiny lane. This guide gives you the steps to reproduce the benchmark on your own hardware and the vocabulary to ask the same questions of any tool you evaluate.
If you need a number to make a purchasing or architecture decision, reproduce the benchmark yourself with a payload that looks like your production data. The result will be more useful than any number on any vendor’s website, including this one.
Why vendor benchmarks are always suspect
The publisher of a benchmark controls three things: the workload, the hardware, and the measurement window. Each choice creates space to look faster than a fair comparison would suggest, without any explicit dishonesty.
The workload is the most significant lever. A benchmark payload that is smaller than a typical SDK event will produce a higher events-per-second number. A payload that omits stack trace data, breadcrumbs, and user context is faster to parse and write than one that includes them. A benchmark that sends identical events benefits from in-memory deduplication paths that a production workload with diverse fingerprints does not hit.
The hardware choice is the second lever. A benchmark run on a local NVMe SSD will show better SQLite write throughput than one run on network-attached block storage, which is what most VPS providers actually provision. A benchmark on a dedicated bare-metal machine with no noisy neighbors will show lower p99 latency than one on a shared cloud instance. Neither choice is dishonest; both affect the number significantly.
The measurement window is the third lever. A one-minute benchmark misses the behavior that appears after the SQLite WAL grows past its checkpoint threshold, after the in-memory event cache evicts cold entries, or after the ingest queue depth starts affecting write latency. A thirty-minute benchmark at a sustained rate is harder to game, but it still does not cover week-long drift.
The incentive is to look fast. That incentive exists whether the publisher is urgentry, Sentry, GlitchTip, or anyone else. The only honest response is reproducibility plus explicit disclosure of what the benchmark does not measure. That is what this guide attempts.
The numbers urgentry publishes
The urgentry benchmark page publishes two primary numbers: 400 events per second sustained throughput and 52 MB resident set size at that load. Both come from the same test run, and both require the specific conditions listed below to be meaningful.
The hardware: a single-vCPU, 1 GB RAM VPS from a major cloud provider in a shared-tenant environment. The CPU is not pinned; the disk is network-attached SSD. This is the “Tiny lane” in the benchmark documentation. It is the cheapest reproducible hardware, chosen to produce a conservative lower bound on a class of machine that a small team might actually use.
The database: SQLite with WAL mode enabled and a synchronous setting of NORMAL. No Postgres. No external queue. The ingest path writes directly to SQLite through urgentry’s internal event pipeline.
The SDK: the urgentry load generator, which sends Sentry-envelope-format payloads over HTTP to the ingest endpoint. The payload shape is a Python-SDK-style exception event with a 15-frame stack trace, three breadcrumbs, and a user context block. The total payload size per event is approximately 3.5 KB compressed.
The measurement window: 30 minutes of sustained load at the target rate, preceded by a 2-minute warm-up ramp. The number published is the mean throughput over the sustained window. RSS is sampled at 15-second intervals; the 52 MB figure is the median of those samples.
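The two summary statistics described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the load generator's own code: the function names are hypothetical and the sample values are made up. The only details taken from this guide are the 30-minute window, the 15-second sampling interval, and the use of a mean for throughput and a median for RSS.

```python
from statistics import median

def mean_throughput(events_accepted: int, window_seconds: float) -> float:
    """Mean events per second over the sustained window (warm-up excluded)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return events_accepted / window_seconds

def median_rss_mib(rss_kib_samples: list[int]) -> float:
    """Median of RSS samples, converted from KiB (the unit ps reports) to MiB."""
    if not rss_kib_samples:
        raise ValueError("need at least one RSS sample")
    return median(rss_kib_samples) / 1024

# Illustrative inputs only: 720,000 events in 30 minutes is exactly 400 ev/s.
print(mean_throughput(720_000, 30 * 60))
# Five made-up RSS samples in KiB; a real 30-minute run yields about 120.
print(median_rss_mib([51_200, 53_248, 52_224, 54_272, 53_248]))
```

A median is less sensitive than a mean to a few brief RSS spikes, which is worth keeping in mind when comparing the 52 MB figure against a peak value from your own run.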
Bigger hardware lanes (Postgres, multi-core VPS, larger RAM) sustain higher throughput. The 400 ev/s figure is the conservative published number from the smallest reproducible configuration.
What “400 events/sec” actually means
Sustained throughput is not the same as burst capacity. The 400 ev/s figure is the rate the system maintained without falling behind over 30 minutes. A burst to 600 ev/s for 10 seconds is a different measurement, and the benchmark does not cover it.
The ingest path at 400 ev/s has no backlog under the test conditions. Each batch of events arrives, processes through the pipeline, and commits to SQLite before the next batch arrives. If a production deployment accumulates an ingest backlog — because of a traffic spike, a slow disk, or a checkpoint stall — the behavior changes. The benchmark does not characterize that behavior.
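One rough way to reason about backlog is a simple queue model: events that arrive faster than the pipeline commits them accumulate, and any spare headroom drains them. The sketch below is an illustration under loud assumptions, not a model of urgentry's pipeline. The fixed `capacity_per_sec` ceiling is hypothetical, the benchmark does not establish one, and a real system under pressure is more likely to slow down than to hold a constant rate.

```python
def simulate_backlog(arrivals_per_sec: list[int], capacity_per_sec: int) -> list[int]:
    """Return queue depth after each second for a fixed-capacity consumer."""
    depth = 0
    depths = []
    for arrived in arrivals_per_sec:
        # Events beyond what the consumer can commit this second stay queued.
        depth = max(0, depth + arrived - capacity_per_sec)
        depths.append(depth)
    return depths

# Hypothetical scenario: a 10 s burst at 600 ev/s, then back to 400 ev/s,
# against an assumed commit ceiling of 450 ev/s.
arrivals = [600] * 10 + [400] * 40
depths = simulate_backlog(arrivals, capacity_per_sec=450)
print(max(depths))          # peak queue depth
print(depths.index(0, 10))  # index of the second at which the backlog has drained
```

Under these assumed numbers the burst leaves 1,500 events queued and the 50 ev/s of headroom needs 30 seconds to clear them. With no headroom at all, the backlog never drains.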
The p99 latency at 400 ev/s on the Tiny lane is under 40 ms end-to-end for the HTTP ingest path. That is the time from the HTTP request arriving at urgentry to the event being committed to SQLite. It does not include the time to process a symbolication request, apply a source map, or evaluate alert rules. Those paths run asynchronously and their latency is not captured in the ingest benchmark.
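For readers comparing p50 and p99 figures, here is a minimal nearest-rank percentile sketch. The latency values are invented for illustration, and the load generator may compute its percentiles differently (for example with interpolation or a histogram), so treat this as a way to build intuition and not as a description of its method.

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Invented latencies in ms: 98 fast requests and 2 slow ones.
latencies_ms = [5.0] * 98 + [38.0, 120.0]
print(percentile(latencies_ms, 50))   # median request
print(percentile(latencies_ms, 99))   # 99 of 100 requests were at or below this
print(percentile(latencies_ms, 100))  # the single worst request
```

In this invented sample the median is 5 ms and the p99 is 38 ms, while the worst request took 120 ms. Even a p99 figure says nothing about how bad the slowest 1% were.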
What happens at 500 ev/s on the Tiny lane: the system continues to accept events, but SQLite write latency increases as the WAL checkpoint mechanism runs more frequently to keep WAL file size bounded. At the test hardware’s I/O budget, 500 ev/s is sustainable but produces noticeably higher p99 latency than 400 ev/s. The exact numbers from that test will appear in the benchmark documentation as the measurement set grows.
What the benchmark does not measure
This is the section that matters most.
The benchmark runs ingest in isolation. No UI queries happen during the test. A production urgentry deployment has engineers opening issues, running searches, and loading event detail pages while ingest is ongoing. Those read queries compete for SQLite’s WAL reader slots and the Go runtime’s goroutine scheduler. The benchmark does not characterize how concurrent read load affects ingest throughput or vice versa.
The benchmark does not cover the symbolicator path. When an event arrives with a minified JavaScript stack or a native binary crash, urgentry processes symbolication as a background step after the event is ingested. That step involves reading source map files from disk, performing position lookups, and rewriting the stack trace. Symbolicator peak memory and latency are not included in the 52 MB or 40 ms figures.
The benchmark does not cover projection rebuilds after a schema migration or a reindex operation. When urgentry adds a new index or modifies its event schema between versions, the upgrade path may require a table scan or an index rebuild on the events database. On a database with 90 days of events at 400 ev/s, that operation takes minutes and affects both ingest latency and query latency while it runs. The benchmark reflects a running system in steady state, not a system mid-upgrade.
The benchmark does not measure how replay payload size affects throughput. Session replay events are significantly larger than exception events — typically 30–200 KB per chunk rather than 3.5 KB. The 400 ev/s figure is for exception-shaped payloads. A workload that includes replay traffic at the same event rate will put more pressure on ingest I/O.
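The size difference is easier to judge as bytes per second. The arithmetic below uses only the figures quoted above (400 ev/s, 3.5 KB, and the 30–200 KB replay range). It is a back-of-the-envelope sketch: it counts inbound payload bytes with decimal units and ignores decompression, indexing, and write amplification, and the function name is hypothetical.

```python
def ingest_kb_per_sec(events_per_sec: int, payload_kb: float) -> float:
    """Inbound payload volume in KB/s for a given event rate and payload size."""
    return events_per_sec * payload_kb

RATE = 400  # ev/s, the headline figure

for label, size_kb in [
    ("exception, 3.5 KB", 3.5),
    ("replay chunk, 30 KB", 30),
    ("replay chunk, 200 KB", 200),
]:
    kb_s = ingest_kb_per_sec(RATE, size_kb)
    print(f"{label}: {kb_s / 1000:.1f} MB/s")
```

At the same event rate, replay chunks at the low end of the range carry roughly 8.6 times the bytes of the benchmark payload, and at the high end roughly 57 times.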
None of these gaps are hidden. They are listed here because the benchmark number is more useful when you know its boundaries.
How to reproduce the benchmark yourself
Reproducibility is the only property that makes a vendor benchmark worth anything. Here are the exact steps.
First, stand up urgentry on a fresh VPS. The benchmark uses a 1 vCPU / 1 GB instance, but any Linux host works. Install urgentry with the install script:
curl -fsSL https://urgentry.io/install.sh | sh
urgentry serve --role=all --db=/var/lib/urgentry/events.db
Second, create a project in the urgentry UI and copy the DSN. You need the project ID and the public key to configure the load generator.
Third, clone the urgentry load generator and configure it with your DSN:
git clone https://github.com/urgentry/bench
cd bench
go run ./cmd/loadgen \
  --dsn="https://<key>@<host>/<project_id>" \
  --rate=400 \
  --duration=30m \
  --payload=python-exception \
  --warmup=2m
The --payload=python-exception flag selects the same payload shape used in the published benchmark. You can substitute --payload=node-exception or --payload=go-exception for other SDK shapes. You can also point --payload at a local JSON file containing a real event from your SDK if you want to benchmark against your actual production payload shape, which is the most useful test.
Fourth, capture metrics during the run. In a separate terminal:
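# Sample RSS and the ingest counter every 15 seconds.
# One loop is used because two back-to-back "while true" loops
# would never reach the second one.
pid=$(pgrep -o urgentry)   # oldest matching process, in case several match
while true; do
  date +%T
  ps -o pid=,rss=,vsz= -p "$pid"   # rss and vsz are reported in KiB
  curl -s http://localhost:8000/metrics | grep urgentry_events_ingested_total
  sleep 15
done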
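Lines captured from the metrics endpoint can be turned into per-interval rates, which show a dip that a single mean would hide. The sketch below assumes the endpoint emits Prometheus-style text lines of the form `name value`, with no trailing timestamp and a single series for the counter. That format is an assumption, not something this guide confirms. The helper names and the sample counter values are hypothetical.

```python
def counter_values(
    metric_lines: list[str],
    name: str = "urgentry_events_ingested_total",
) -> list[float]:
    """Extract counter values from captured text lines, skipping comments."""
    values = []
    for line in metric_lines:
        line = line.strip()
        if not line.startswith(name):
            continue  # skips "# HELP" / "# TYPE" lines and unrelated metrics
        # Assumes the value is the last whitespace-separated field.
        values.append(float(line.rsplit(" ", 1)[1]))
    return values

def interval_rates(values: list[float], interval_seconds: float = 15.0) -> list[float]:
    """Events per second between consecutive counter samples."""
    return [(b - a) / interval_seconds for a, b in zip(values, values[1:])]

# Made-up capture: three samples taken 15 seconds apart.
captured = [
    "# TYPE urgentry_events_ingested_total counter",
    "urgentry_events_ingested_total 120000",
    "urgentry_events_ingested_total 126000",
    "urgentry_events_ingested_total 131400",
]
print(interval_rates(counter_values(captured)))
```

In this made-up capture the first interval runs at 400 ev/s and the second at 360 ev/s. Shortfalls like that are worth lining up against the latency figures in the load generator's summary.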
Fifth, after the 30-minute run completes, collect the load generator’s output summary. It prints mean throughput, p50, p95, and p99 end-to-end latency for the ingest path, and the number of events that received a non-2xx response (which should be zero on a healthy run).
A reproduction counts as successful when the load generator sustains the target rate for the full 30 minutes with zero dropped events, and the RSS stays below 100 MB. If you get a different number on the same hardware class, that is a meaningful data point; open a GitHub issue with your findings.
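The success criteria above can be written down as a small check so that every reproduction is judged the same way. This is a sketch with hypothetical names. The `rate_tolerance` allowance is an assumption added here, because the guide does not say how close to the target rate counts as "sustained". The other thresholds (30 minutes, zero failed events, RSS below 100 MB) come from the paragraph above.

```python
from dataclasses import dataclass

@dataclass
class RunSummary:
    target_rate: float       # ev/s requested from the load generator
    mean_rate: float         # ev/s actually sustained
    duration_minutes: float  # length of the sustained window
    non_2xx_responses: int   # events that were rejected or dropped
    peak_rss_mb: float       # highest RSS sample during the run

def reproduction_succeeded(run: RunSummary, rate_tolerance: float = 0.01) -> bool:
    """Apply the success criteria; rate_tolerance is an assumed allowance."""
    sustained = run.mean_rate >= run.target_rate * (1 - rate_tolerance)
    return (
        sustained
        and run.duration_minutes >= 30
        and run.non_2xx_responses == 0
        and run.peak_rss_mb < 100
    )

# Made-up runs: one clean, one with three failed events.
good = RunSummary(400, 399.2, 30, 0, 58.0)
leaky = RunSummary(400, 400.0, 30, 3, 58.0)
print(reproduction_succeeded(good), reproduction_succeeded(leaky))
```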
Reading Sentry’s published numbers
Sentry does not publish a sustained-load throughput benchmark for the self-hosted distribution. What they publish is a hardware minimum: 16 GB RAM, as stated in the getsentry/self-hosted README.
The operator community on the getsentry/self-hosted issue tracker provides something closer to real performance data. The consistent signal across threads from 2022 through 2025: operators who run the stack under meaningful ingest load report needing 24–32 GB of RAM to avoid OOM instability. The 16 GB number is accurate for an idle or lightly loaded deployment; it is not the floor for production use. The RAM guide in this cluster covers that gap in detail.
The gap between the documented minimum and the operator-reported practical floor exists because the Sentry self-hosted architecture runs roughly twenty containers. ClickHouse alone claims 8–12 GB under real ingest. Kafka and ZooKeeper add 2–4 GB at baseline. Together those services account for roughly 10–16 GB, which leaves little or nothing of a 16 GB host for the remaining containers, and total memory grows non-linearly as ingest rate increases.
What this means for comparison: urgentry publishes a throughput number and no separate hardware minimum, because the minimum is implicit in the benchmark hardware (1 vCPU / 1 GB). Sentry publishes a hardware minimum and no sustained throughput number for the self-hosted case. The two numbers do not measure the same thing, and comparing them directly produces a misleading picture. The honest framing is that urgentry is a single-binary tool with a measured RAM footprint, and Sentry self-hosted is a multi-service architecture with a documented RAM floor. Both statements are true. They answer different questions.
Reading other tools’ benchmarks
GlitchTip, Bugsink, and SigNoz are the tools most often mentioned alongside urgentry in comparison searches. Each publishes something, and each leaves something out.
GlitchTip does not publish a throughput benchmark. Their documentation focuses on deployment simplicity and cost. Operator reports on their issue tracker describe performance that varies with the underlying Postgres configuration. There is no sustained-load number to evaluate.
Bugsink publishes informal throughput commentary in their documentation, noting that a single-process Django deployment handles a few hundred events per minute, not per second, before needing worker scaling. The framing is honest about the tool’s scope; it is not positioned as a high-throughput system.
SigNoz publishes benchmark data for their OTLP-native ingest pipeline, focused on traces-per-second rather than error events. The numbers are higher than urgentry’s on comparable hardware, but the measurement covers a different payload type and a different processing model. SigNoz’s Sentry SDK compatibility surface is narrower than urgentry’s; the tools answer different questions.
The general rule for reading any tool’s benchmark: ask what the payload was, how long the test ran, whether UI queries happened concurrently, and what the failure behavior was at 1.5x the headline rate. If the documentation does not answer those questions, the number is not useful for making a decision.
The honest disclosure list
This article is published on urgentry.io. The benchmarks it references are urgentry’s own. The load generator that produced the numbers is open source. Here is what would invalidate the 400 ev/s claim:
- A reproduction on the same hardware class produces throughput below 350 ev/s under the same conditions. Hardware variation across shared VPS instances is real; a shortfall of up to about 12% from the headline (roughly 350 ev/s) is within expected bounds. A shortfall larger than that is a signal worth investigating.
- A reproduction using a payload larger than 10 KB per event produces a materially lower number. Larger payloads stress the I/O path in ways the benchmark payload does not. This is expected behavior, not a flaw in the benchmark, but the headline number does not cover it.
- A reproduction with concurrent UI queries from five or more users produces a measurable drop in ingest throughput. That test has not been run at the volume needed to characterize the degradation curve.
- A reproduction on a host with slower disk I/O than network-attached SSD produces a lower number. The benchmark hardware uses a specific I/O tier; hosts with spinning disk or highly contended block storage will show lower throughput.
These are not hypothetical edge cases. They are the conditions most likely to produce a different result than the published number. The responsible thing is to name them.
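To place a reproduction against the first condition in the list, it helps to express the result as a percentage shortfall from the published rate. The sketch below uses the 400 ev/s headline and the 350 ev/s threshold from the list above. The function names and the sample rates are hypothetical.

```python
PUBLISHED_RATE = 400.0     # ev/s, the headline figure
INVESTIGATE_BELOW = 350.0  # ev/s, the threshold named in the list above

def shortfall_pct(measured_rate: float, published_rate: float = PUBLISHED_RATE) -> float:
    """Percentage by which a reproduction falls short of the published rate."""
    return (published_rate - measured_rate) * 100 / published_rate

def worth_investigating(measured_rate: float) -> bool:
    """True when the result is below the threshold the guide names."""
    return measured_rate < INVESTIGATE_BELOW

# Made-up reproduction results.
for rate in (380.0, 350.0, 320.0):
    print(rate, f"{shortfall_pct(rate):.1f}%", worth_investigating(rate))
```

The 350 ev/s threshold corresponds to a 12.5% shortfall, so a result of exactly 350 ev/s sits at the edge of the expected hardware variance.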
What you need before you trust a number
Five questions. Apply them to urgentry’s numbers. Apply them to every other benchmark you read.
1. Can I reproduce it? If the methodology is not published, the number is a marketing claim, not a measurement. Reproducibility requires a public load generator, the exact command used, and the hardware specification. If any of those are absent, ask for them. If the vendor cannot or will not provide them, treat the number as an upper bound on what you will see in practice: it is the best result they were willing to show.
2. Does the payload look like mine? A 3.5 KB exception event is not the same as a 200 KB replay chunk. A benchmark with uniform payloads is not the same as a benchmark with the mix of event types your production SDK sends. The payload shape affects I/O, parse time, and write amplification. Ask to see the payload schema the benchmark used.
3. How long did the test run? A one-minute benchmark does not capture WAL checkpoint behavior, cache eviction, or GC pressure accumulation. Thirty minutes is a reasonable minimum for a sustained-load claim. An hour is better. A week-long test would reveal drift that a thirty-minute test hides, but few vendors run one.
4. Is the number p99 or p50? A p50 latency number tells you what the median request experienced. A p99 number tells you the latency that 99 in 100 requests stayed at or under, so the slowest 1 in 100 were at least that slow. For an error tracker, p99 matters more: the event you care most about is the one that arrives during a production incident, when the system is under load. Ask for p99. If the benchmark does not report it, the number is incomplete.
5. What happened at 1.5x load? A system that handles 400 ev/s cleanly might handle 600 ev/s by dropping events, crashing, or degrading to 50 ev/s. The failure mode at excess load is as important as the headline throughput number. Ask what happens when the rate exceeds the benchmark target. If the answer is “we didn’t test that,” you have a data gap you need to fill yourself before relying on the number.
FAQ
Who ran the urgentry benchmark?
The urgentry team ran it, on hardware we control, with a load generator we wrote. That is the honest disclosure. The number reproduces on the listed hardware if you follow the reproduction steps in this guide; that reproducibility is the only claim we make.
Does 400 events/sec mean urgentry can handle my traffic?
It means urgentry sustained 400 events per second on a specific VPS with a specific SQLite WAL configuration for 30 minutes without falling behind. Whether that covers your traffic depends on your event size, your payload shape, and whether your load is bursty or flat. Run the reproduction against a payload that looks like yours.
Why does urgentry publish benchmarks if vendor benchmarks are always suspect?
Because no benchmark at all is less useful than a reproducible one with disclosed methodology. The benchmark is suspect in the way all self-published numbers are suspect. The reproduction steps exist so you can invalidate the claim. If you reproduce it and get a different number, we want to know.
What is the p99 latency at 400 events/sec?
Under 40 ms end-to-end on the Tiny lane benchmark. That covers the HTTP ingest path through SQLite write. The symbolicator path and query path are not included in that number; both can add latency that the benchmark does not capture.
How does the urgentry benchmark compare to Sentry’s published numbers?
Sentry does not publish a sustained-load benchmark for the self-hosted distribution. Their documentation lists hardware requirements; the community issue tracker provides operator-observed performance data. Direct comparison is not possible because the measurement methodology differs. Interpret both sets of numbers against their own disclosed conditions.
Sources
- urgentry/bench load generator — the open load generator used to produce the published numbers, with hardware specification documented in the README.
- getsentry/self-hosted — the official Sentry self-hosted distribution and its current README hardware minimum of 16 GB.
- getsentry/self-hosted#1521 — the long-running operator hardware thread, with field reports from 2022 through 2025 documenting the gap between the 16 GB minimum and the 24–32 GB practical floor.
- getsentry/self-hosted#3566 — operator-reported memory exhaustion under non-trivial ingest on documented-minimum hardware.
- urgentry FSL-1.1-Apache-2.0 license — the source-available license under which urgentry is published.
Run the benchmark yourself.
The load generator is open source. The hardware specification is published. Clone the repo, point it at your own urgentry instance, and see what number you get on your hardware with your payload.