Production urgentry on Kubernetes (Helm + HA).
urgentry runs as a single Go binary on a $5 VPS. That is the right shape for most teams. This guide is for the teams that already operate Kubernetes and want to add urgentry to the same cluster rather than maintain a separate host. It covers the full HA topology: multiple replicas, Postgres backing store, Helm chart structure, cert-manager TLS, OTLP endpoint exposure, Prometheus monitoring, and a safe rolling upgrade path.
20 seconds. The HA shape is: three urgentry replicas behind a Kubernetes Service, Postgres as the backing store (not SQLite), a PersistentVolumeClaim for event blobs, a ConfigMap for environment variables, and a Secret for the Postgres DSN. The Helm chart wraps all of this in a single helm install.
60 seconds. The primary decision point is Postgres. SQLite requires a single writer process; multiple urgentry replicas with separate SQLite files produce split issue indexes and broken deduplication. Switch to Postgres before you scale beyond one replica. A managed Postgres instance (RDS, Cloud SQL, Neon, Supabase) is the lowest-ops choice. Run PgBouncer as a sidecar or a dedicated Deployment to cap the connection count.
The rest of the setup follows Kubernetes conventions. urgentry’s container image is a scratch-based static binary. It starts in under two seconds, which makes rolling updates fast and liveness probe failures rare. The OTLP endpoints (port 4318 for HTTP, port 4317 for gRPC) live in the same binary as the Sentry ingest path. No additional sidecar is needed for trace and log ingestion.
When you need Kubernetes for an error tracker (you don’t, mostly)
The default deployment shape for urgentry is a single binary on a single host. That shape handles 400 events per second at 52 MB resident memory. For a side project, a small SaaS, an internal tool, or a bootstrapped startup, the binary-on-VPS path is the right answer. The $5 VPS guide covers it end to end.
Kubernetes becomes the right answer for urgentry in one specific scenario: you already run a Kubernetes cluster for your application workloads, and you want your error tracker in the same operational environment rather than managing a separate host with a separate patching schedule, a separate backup scheme, and a separate on-call context. In that case, the cost of running urgentry on Kubernetes is not the Kubernetes complexity itself — your team already carries that. The cost is the incremental Helm chart and the Postgres requirement for HA.
There is no throughput argument for Kubernetes. urgentry as a single binary scales vertically to a much higher event rate than most teams produce. If your only reason for considering Kubernetes is throughput, size up the VPS first. It is cheaper and simpler.
The two reasons that actually justify the Kubernetes path:
- Operational consistency. You already use kubectl, Helm, cert-manager, and Prometheus. Adding urgentry as a Helm release means it fits into your existing runbooks, deployment pipelines, and alerting structure.
- Redundancy requirements. Some compliance contexts (SOC 2 availability controls, internal SLAs) require multi-instance redundancy for every production dependency. A single-host urgentry deployment does not satisfy those requirements; a three-replica Deployment behind a Service does.
If neither of those applies, close this guide and open the $5 VPS guide.
The HA shape
The high-availability topology for urgentry on Kubernetes has five components:
- Deployment. Three urgentry replicas. The Deployment controller replaces failed Pods automatically. The rolling update strategy (maxSurge: 1, maxUnavailable: 0) keeps all three replicas serving traffic during upgrades.
- Service. A ClusterIP Service load-balancing across all three replicas. The Ingress terminates TLS and forwards to this Service. OTLP ports (4317, 4318) are exposed on the same Service alongside the main HTTP port (8000).
- Postgres. A managed Postgres instance outside the cluster (RDS, Cloud SQL, Neon, Supabase) or a StatefulSet inside it. The connection string lives in a Kubernetes Secret. All urgentry replicas connect to the same Postgres database, which provides the shared event store that makes deduplication work across replicas.
- PersistentVolumeClaim. Event blob attachments (source maps, minidumps) are written to disk. In a multi-replica setup, this PVC must use a ReadWriteMany access mode (NFS, EFS, Filestore) so all replicas can write to the same volume. Alternatively, configure urgentry to store blobs in S3-compatible object storage, which eliminates the RWX requirement.
- ConfigMap and Secret. Non-sensitive environment variables (base URL, replica count, log level) live in a ConfigMap. The Postgres DSN, object storage credentials, and any signing keys live in a Secret.
The Helm chart manages all five components as a unit. A single values.yaml file controls the shape, and helm upgrade applies changes with a rolling restart.
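The Secret holding the Postgres DSN might look like the minimal sketch below. The Secret name (urgentry-postgres) and key (postgres-dsn) match the existingSecret defaults used in this guide; the DSN shape, the password placeholder, and the sslmode parameter are assumptions to replace with your own values.

```yaml
# urgentry-postgres-secret.yaml (illustrative; every value is a placeholder)
apiVersion: v1
kind: Secret
metadata:
  name: urgentry-postgres
  namespace: monitoring
type: Opaque
stringData:
  # Assumed DSN shape. Substitute your host, credentials, and TLS mode.
  postgres-dsn: "postgres://urgentry:CHANGE_ME@urgentry.cluster-abc123.us-east-1.rds.amazonaws.com:5432/urgentry?sslmode=require"
```

Avoid committing a file like this with a real password. Teams that already run a secrets operator (Sealed Secrets, External Secrets) would normally generate the Secret through that tool instead.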
The Helm chart structure
The urgentry Helm chart follows standard conventions. The key fields in values.yaml:
# values.yaml - top-level fields and their defaults
image:
  repository: urgentry/urgentry
  tag: ""  # Defaults to the chart appVersion. Pin explicitly in production.
  pullPolicy: IfNotPresent
replicaCount: 1  # Override to 3 for HA deployments.
postgres:
  host: ""  # Required. Hostname of your managed Postgres instance.
  port: 5432
  database: urgentry
  user: urgentry
  # Password comes from the existingSecret field, not here.
  existingSecret: urgentry-postgres
  existingSecretKey: postgres-dsn
ingress:
  enabled: false
  className: nginx  # or traefik, or whatever your cluster uses
  annotations: {}
  host: errors.example.com
  tls:
    enabled: true
    secretName: urgentry-tls
resources:
  requests:
    memory: 128Mi
    cpu: 100m
  limits:
    memory: 512Mi
    cpu: 500m
persistence:
  enabled: true
  storageClass: ""  # Leave empty to use cluster default.
  accessMode: ReadWriteOnce  # Use ReadWriteMany for multi-replica blob storage.
  size: 20Gi
serviceMonitor:
  enabled: false  # Set true if Prometheus Operator is installed.
  interval: 30s
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5
Each environment gets its own values file. The base values.yaml contains safe defaults. Environment-specific files (values-prod.yaml, values-staging.yaml) override only what differs. Install with:
helm install urgentry ./charts/urgentry \
  -f values.yaml \
  -f values-prod.yaml \
  --namespace monitoring \
  --create-namespace
Upgrades use the same pattern:
helm upgrade urgentry ./charts/urgentry \
  -f values.yaml \
  -f values-prod.yaml \
  --namespace monitoring
A worked example: values-prod.yaml
The production values file below runs three replicas against a managed Postgres instance. Memory limits are set conservatively for a cluster with moderate headroom. The ingress uses cert-manager with Let’s Encrypt.
# values-prod.yaml
# Production overrides for urgentry Helm chart.
# Apply with: helm upgrade urgentry ./charts/urgentry -f values.yaml -f values-prod.yaml
image:
  tag: "0.2.11"  # Pin the image tag. Never use 'latest' in production.
replicaCount: 3
postgres:
  host: urgentry.cluster-abc123.us-east-1.rds.amazonaws.com
  port: 5432
  database: urgentry
  user: urgentry
  existingSecret: urgentry-postgres
  existingSecretKey: postgres-dsn
ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-body-size: "20m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
  host: errors.example.com
  tls:
    enabled: true
    secretName: urgentry-tls
resources:
  requests:
    memory: 1Gi
    cpu: 250m
  limits:
    memory: 2Gi
    cpu: "1"
persistence:
  enabled: true
  storageClass: efs-sc  # EFS StorageClass for ReadWriteMany support on AWS.
  accessMode: ReadWriteMany
  size: 50Gi
serviceMonitor:
  enabled: true
  interval: 30s
podDisruptionBudget:
  enabled: true
  minAvailable: 2  # Always keep at least 2 replicas during node drain.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: urgentry
A few decisions in this file deserve explanation. The podAntiAffinity rule spreads replicas across different nodes. Without it, the scheduler can place all three replicas on one node; a node failure takes the whole deployment down, which defeats the purpose of three replicas. The preferredDuring form is a soft preference, not a hard constraint, which prevents stuck scheduling if the cluster is small.
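For clusters with more schedulable nodes than replicas, the hard form of the same rule is an option. The sketch below is an assumed values override, not a chart default. With it, a replica stays Pending rather than sharing a node with another urgentry Pod.

```yaml
# values-prod.yaml alternative (assumption: more schedulable nodes than replicas)
affinity:
  podAntiAffinity:
    # Hard constraint: never co-locate two urgentry Pods on one node.
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: urgentry
```

One consequence to weigh: a surge Pod created during a rolling update carries the same label, so it also needs a node with no urgentry replica on it. On a three-node cluster with three replicas the rollout would stall, which is one more reason the soft form is the safer default.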
The podDisruptionBudget with minAvailable: 2 tells Kubernetes that at most one replica can be voluntarily disrupted at a time. This keeps urgentry available during node drains and cluster upgrades. A PodDisruptionBudget only limits evictions; Helm rolling updates are governed by the Deployment’s update strategy instead.
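For reference, a PodDisruptionBudget equivalent to those values might look like the sketch below. The resource name and the label selector are assumptions based on the labels used elsewhere in this guide; the object the chart actually renders may differ.

```yaml
# Illustrative PodDisruptionBudget; rendered name and labels may differ in the chart.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: urgentry
  namespace: monitoring
spec:
  minAvailable: 2  # With 3 replicas, at most one Pod may be evicted at a time.
  selector:
    matchLabels:
      app.kubernetes.io/name: urgentry
```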
The resource settings (1 Gi request, 2 Gi limit) are generous relative to urgentry’s actual footprint at moderate event rates. The binary uses roughly 52 MB at 400 events per second. The headroom accommodates burst ingest and query workloads without OOM-killing the Pod. Adjust downward if your cluster is resource-constrained.
Postgres setup notes
Postgres is the only hard dependency that changes when you move from a single VPS to Kubernetes. The decisions that matter:
Managed vs in-cluster
A managed Postgres service (RDS, Cloud SQL, Neon, Supabase, CockroachDB Serverless) is the lowest-ops path. Backups, failover, storage scaling, and minor-version patching are handled by the provider. The tradeoff is network latency between the cluster and the database; for most clusters in the same region as the managed instance, this is under 1 ms and invisible.
An in-cluster Postgres StatefulSet (Zalando Postgres Operator, CloudNativePG) gives you lower latency and no egress cost, at the cost of running a database operator with its own upgrade and failure surface. For teams that already run CloudNativePG for other services, the incremental cost is low. For teams new to it, start with managed.
Minimum instance size
A 2-vCPU, 4 GB RAM managed Postgres instance handles urgentry comfortably for most teams. urgentry’s query patterns are straightforward: insert-heavy (event ingest), short range scans (issue list), and occasional full-table reads (bulk export). It does not run complex analytical queries inside Postgres.
At sustained ingest above 200 events per second with multiple concurrent users, move to 4 vCPU and 8 GB RAM. Monitor pg_stat_activity wait events to identify the bottleneck before resizing.
PgBouncer connection pooling
Each urgentry replica holds a connection pool to Postgres. With three replicas and a pool size of 10 per replica, you reach 30 connections. Postgres handles this comfortably. If you scale to more replicas or add other services hitting the same Postgres instance, add PgBouncer between urgentry and Postgres.
# pgbouncer-deployment.yaml (abbreviated)
# Run PgBouncer as a Deployment in the same namespace as urgentry.
# urgentry connects to the PgBouncer Service, not Postgres directly.
# Credentials (POSTGRESQL_USERNAME, POSTGRESQL_PASSWORD) are omitted here; source them from a Secret.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pgbouncer
  template:
    metadata:
      labels:
        app: pgbouncer
    spec:
      containers:
        - name: pgbouncer
          image: bitnami/pgbouncer:1.23.1
          env:
            - name: POSTGRESQL_HOST
              value: urgentry.cluster-abc123.us-east-1.rds.amazonaws.com
            - name: POSTGRESQL_DATABASE
              value: urgentry
            # The Bitnami image listens on 6432 unless PGBOUNCER_PORT says otherwise.
            - name: PGBOUNCER_PORT
              value: "5432"
            - name: PGBOUNCER_POOL_MODE
              value: transaction
            - name: PGBOUNCER_MAX_CLIENT_CONN
              value: "200"
            - name: PGBOUNCER_DEFAULT_POOL_SIZE
              value: "20"
          ports:
            - containerPort: 5432
Use transaction-mode pooling with urgentry. urgentry does not hold advisory locks or use session-level state that would break with transaction pooling.
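The Deployment above needs a Service in front of it for urgentry to connect to. A minimal sketch follows; the Service name and port name are assumptions, and targetPort must match whatever port PgBouncer actually listens on inside the container.

```yaml
# pgbouncer-service.yaml (illustrative sketch)
apiVersion: v1
kind: Service
metadata:
  name: pgbouncer
  namespace: monitoring
spec:
  type: ClusterIP
  selector:
    app: pgbouncer  # Matches the Pod label in the PgBouncer Deployment.
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432  # Must equal the port PgBouncer listens on in the container.
      protocol: TCP
```

With this in place, the DSN in the urgentry-postgres Secret would point at pgbouncer.monitoring.svc.cluster.local:5432 rather than at the Postgres host directly.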
Backup schedule and WAL retention
Configure daily base backups and WAL archiving with at least a 7-day retention window. This gives you point-in-time recovery to any second in that window. Most managed Postgres providers offer this out of the box; enable it explicitly rather than assuming it is on by default.
WAL retention and urgentry schema migrations interact in one specific way: a migration that rewrites a large table will produce a WAL spike. Schedule migrations during low-ingest windows and confirm that your managed Postgres instance has enough storage headroom to absorb the WAL volume during the migration.
Ingress and TLS
The Ingress resource handles TLS termination and routes external HTTPS traffic to the urgentry Service. cert-manager with a Let’s Encrypt ClusterIssuer automates certificate provisioning and renewal.
# urgentry-ingress.yaml
# Managed by the Helm chart when ingress.enabled=true.
# Shown here for reference and for clusters managing Ingress resources directly.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: urgentry
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # Body size for Sentry envelopes with source maps or minidumps.
    nginx.ingress.kubernetes.io/proxy-body-size: "20m"
    # Timeout for server-sent event connections from the urgentry UI.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    # Do not buffer SSE responses; let them stream to the browser.
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - errors.example.com
      secretName: urgentry-tls
  rules:
    - host: errors.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: urgentry
                port:
                  number: 8000
Two annotations matter for urgentry specifically. The proxy-body-size: 20m annotation raises the nginx-ingress default body limit from 1 MB to 20 MB. Sentry envelopes that include source maps or session attachments can exceed the default; the ingress controller will return 413 silently and you will see missing events with no obvious error on the urgentry side.
The proxy-read-timeout: 120 annotation covers the server-sent event connection that urgentry uses for the live issues feed in its UI. SSE connections stay open, and nginx closes a proxied connection once the upstream has sent nothing for the read timeout, which defaults to 60 seconds. A quiet feed therefore gets cut off before the next update arrives. Set the timeout to at least 120 seconds, or higher if your UI users report disconnections.
cert-manager ClusterIssuer
# clusterissuer-letsencrypt-prod.yaml
# Apply once per cluster. Referenced by the cert-manager annotation on the Ingress.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
cert-manager provisions the TLS Secret (urgentry-tls) automatically after the Ingress is created and DNS resolves. Renewal happens before expiry without operator intervention.
The OTLP endpoints
urgentry receives OpenTelemetry traces and logs on the same binary that handles Sentry SDK ingest. No sidecar or additional Deployment is needed. Expose the OTLP ports alongside the main HTTP port in the Kubernetes Service:
# urgentry-service.yaml
# Managed by the Helm chart. Shown here for reference.
apiVersion: v1
kind: Service
metadata:
  name: urgentry
  namespace: monitoring
  labels:
    app.kubernetes.io/name: urgentry
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: urgentry
  ports:
    - name: http
      port: 8000
      targetPort: 8000
      protocol: TCP
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
      protocol: TCP
    - name: otlp-http
      port: 4318
      targetPort: 4318
      protocol: TCP
OpenTelemetry collectors inside the cluster can send traces and logs to urgentry.monitoring.svc.cluster.local:4318 (HTTP) or urgentry.monitoring.svc.cluster.local:4317 (gRPC) without going through the Ingress. This keeps OTLP traffic on the cluster internal network and out of the TLS termination path.
If you want external OTLP ingestion (from services outside the cluster), add a second Ingress or a LoadBalancer Service for the OTLP ports. OTLP/HTTP on port 4318 is the easier path to expose through an nginx Ingress; OTLP/gRPC on port 4317 requires an Ingress controller that handles HTTP/2 and gRPC routing correctly, or a direct LoadBalancer Service.
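A second Ingress for external OTLP/HTTP might look like the sketch below. The hostname otlp.example.com, the resource name, and the TLS Secret name are hypothetical; the issuer and Service names are the ones used elsewhere in this guide.

```yaml
# urgentry-otlp-ingress.yaml (illustrative; otlp.example.com is a hypothetical hostname)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: urgentry-otlp
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - otlp.example.com
      secretName: urgentry-otlp-tls
  rules:
    - host: otlp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: urgentry
                port:
                  number: 4318  # OTLP/HTTP port on the urgentry Service.
```

A separate hostname keeps OTLP traffic apart from the UI and Sentry ingest host, so body-size and timeout annotations can be tuned independently for each.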
A sample OpenTelemetry Collector configuration pointing at the urgentry Service:
# otel-collector-values.yaml snippet
# Configure the OTel Collector to export to urgentry inside the cluster.
exporters:
  otlphttp:
    endpoint: http://urgentry.monitoring.svc.cluster.local:4318
    headers:
      # urgentry uses the Sentry DSN auth header for OTLP ingest.
      # Replace with the DSN for the project you want traces attributed to.
      x-sentry-dsn: "https://your-key@errors.example.com/your-project-id"
service:
  pipelines:
    traces:
      exporters: [otlphttp]
    logs:
      exporters: [otlphttp]
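The snippet above shows only the exporter side. A Collector pipeline also needs at least one receiver, so a fuller sketch might look like the following. The otlp receiver and batch processor are standard Collector components; using them here, and the listen addresses, are assumptions about your setup rather than urgentry requirements.

```yaml
# otel-collector config sketch (receivers and processors are assumptions)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch: {}
exporters:
  otlphttp:
    endpoint: http://urgentry.monitoring.svc.cluster.local:4318
    headers:
      x-sentry-dsn: "https://your-key@errors.example.com/your-project-id"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```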
Monitoring the monitor
An error tracker with no monitoring of itself is a recurring blind spot. When urgentry is down, you lose the alerts that would tell you urgentry is down. The setup below closes that loop.
Prometheus ServiceMonitor
urgentry exposes a Prometheus metrics endpoint at /metrics. If you run the Prometheus Operator, enable the ServiceMonitor in the Helm values:
# In values-prod.yaml
serviceMonitor:
  enabled: true
  interval: 30s
  path: /metrics
  port: http  # Matches the Service port named 'http' (8000)
The corresponding ServiceMonitor resource the chart generates:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: urgentry
  namespace: monitoring
  labels:
    release: prometheus  # Must match your Prometheus Operator labelSelector.
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: urgentry
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
Readiness and liveness probes
The probes in values.yaml above target /health (liveness) and /ready (readiness). The distinction matters for urgentry during startup and during Postgres connection failures. The liveness probe confirms the process is alive. The readiness probe confirms urgentry can reach Postgres and is ready to accept traffic. During a Postgres outage, the readiness probe fails and Kubernetes removes the Pods from the Service endpoints, preventing the Ingress from routing traffic to replicas that cannot serve requests.
Recommended alerts
Configure these alerts on the urgentry Pods in Prometheus or your alerting system of choice:
- Pod restart rate. Alert when any urgentry Pod restarts more than twice in 10 minutes. Repeated restarts indicate a crash loop, OOM kill, or liveness probe failure.
- Replica count below desired. Alert when the number of ready Pods drops below replicaCount - 1. One replica down is a degraded state; two down approaches a service outage depending on your load.
- Ingest error rate. Alert when urgentry’s ingest_errors_total metric exceeds a threshold relative to ingest_events_total. A spike in ingest errors is the first signal of a Postgres write problem or a disk space issue on the PVC.
- PVC utilization. Alert when the PVC exceeds 80% capacity. Source maps and attachments grow monotonically without a retention policy. A full PVC causes urgentry to refuse blob writes silently.
Run these alerts in a separate Prometheus instance or alerting tool from urgentry itself. If urgentry goes down and your alerts depend on urgentry, you will not get the alert.
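The four alerts above can be sketched as a PrometheusRule for the Prometheus Operator. The kube_* metrics come from kube-state-metrics and the kubelet_volume_stats_* metrics from the kubelet, so both need to be scraped. The thresholds, durations, severities, Deployment name, and PVC name are all assumptions to adjust, and some network volume types report capacity in ways that make the PVC ratio unreliable.

```yaml
# urgentry-alerts.yaml (illustrative; thresholds and object names are assumptions)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: urgentry-alerts
  namespace: monitoring
  labels:
    release: prometheus  # Must match your Prometheus Operator ruleSelector.
spec:
  groups:
    - name: urgentry
      rules:
        - alert: UrgentryPodRestarting
          # More than two restarts of any urgentry container in 10 minutes.
          expr: increase(kube_pod_container_status_restarts_total{namespace="monitoring", pod=~"urgentry-.*"}[10m]) > 2
          labels:
            severity: warning
        - alert: UrgentryReplicasLow
          # Assumes replicaCount: 3, so "below replicaCount - 1" means fewer than 2 ready.
          expr: kube_deployment_status_replicas_ready{namespace="monitoring", deployment="urgentry"} < 2
          for: 2m
          labels:
            severity: critical
        - alert: UrgentryIngestErrors
          # 5% is a placeholder ratio; tune it to your own baseline.
          expr: sum(rate(ingest_errors_total[5m])) / sum(rate(ingest_events_total[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
        - alert: UrgentryPvcNearlyFull
          # Assumes the PVC is named "urgentry".
          expr: |
            kubelet_volume_stats_used_bytes{namespace="monitoring", persistentvolumeclaim="urgentry"}
              / kubelet_volume_stats_capacity_bytes{namespace="monitoring", persistentvolumeclaim="urgentry"} > 0.80
          for: 10m
          labels:
            severity: warning
```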
Upgrade procedure
urgentry upgrades on Kubernetes follow the Helm rolling update pattern. The Deployment strategy controls how Kubernetes replaces Pods during an upgrade.
# In the Helm chart Deployment template (or override in values)
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1  # Kubernetes can create one extra Pod above replicaCount.
    maxUnavailable: 0  # Never terminate a Pod before its replacement is Ready.
With maxUnavailable: 0, the rollout proceeds by creating one new Pod, waiting for it to pass the readiness probe, then terminating one old Pod. This keeps at least three Pods handling traffic throughout the update. The rollout completes when all Pods run the new image.
Postgres migration handling
Some urgentry releases include Postgres schema migrations. The migration runs as a Kubernetes Job before the Deployment update, using a Helm hook:
# migrate-job.yaml (generated by the Helm chart when migrations are present)
apiVersion: batch/v1
kind: Job
metadata:
  name: urgentry-migrate-{{ .Release.Revision }}
  annotations:
    helm.sh/hook: pre-upgrade
    helm.sh/hook-weight: "-5"
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          # Fall back to the chart appVersion when image.tag is left empty.
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          args: ["migrate"]
          env:
            - name: URGENTRY_DATABASE_DSN
              valueFrom:
                secretKeyRef:
                  name: urgentry-postgres
                  key: postgres-dsn
The pre-upgrade hook ensures migrations complete before any new application Pods start. If the migration Job fails, the Helm upgrade fails and the existing Pods continue running the previous version.
Check the release notes for each urgentry version before upgrading. Releases that include a migration will say so. For migrations that add columns or indexes (additive), the old Pods and the new Pods can run simultaneously during the rollout without issues. For migrations that remove columns or change types (destructive), wait until all old Pods are terminated before starting ingest against the new schema.
Rolling back
If an upgrade produces errors, roll back with Helm:
helm rollback urgentry --namespace monitoring
This restores the previous Helm release values and triggers a rolling replacement back to the prior image. Check the release notes for whether the version you are rolling back from ran a destructive migration. If it did, the rollback will require a reverse migration SQL applied manually before the rollback, or a database restore from a backup taken before the upgrade.
The safest upgrade discipline: take a Postgres snapshot before every urgentry upgrade. Most managed Postgres providers offer a one-click pre-upgrade snapshot. The few seconds it takes is cheap insurance.
When Kubernetes is overkill
The honest conclusion to this guide: for most teams, this is too much.
A single urgentry binary on a $5 VPS handles 400 events per second at 52 MB resident memory with a single systemd unit and a SQLite file. That covers the ingest volume of a 50-person engineering team running a busy SaaS. The operational overhead is a Caddyfile, a Litestream config, and an occasional binary upgrade.
The setup described in this guide adds: a Kubernetes Deployment, a Service, an Ingress, a ClusterIssuer, a ServiceMonitor, a PodDisruptionBudget, a PersistentVolumeClaim, a Job, a ConfigMap, a Secret, and a PgBouncer Deployment. Each of those is a Kubernetes object that can fail, drift, or require version-specific handling during cluster upgrades. The total configuration surface is far larger than a Caddyfile and a systemd unit.
That surface is worth carrying if you already carry it for your application workloads and urgentry lives in the same cluster as a peer service. It is not worth introducing for urgentry alone. The binary on a VPS is the default for a reason.
The decision rule: if your cluster runs at least five other production services and your team includes at least one engineer comfortable with Helm and Kubernetes RBAC, add urgentry to the cluster. Otherwise, spin up a VPS and skip this guide.
Frequently asked questions
Does urgentry need Kubernetes to run in production?
No. A single urgentry binary on a $5 VPS handles 400 events per second at 52 MB resident memory. Kubernetes is the right choice only when you already operate a cluster and want urgentry in the same operational environment as your other production services.
Can urgentry use SQLite on Kubernetes?
Not for multi-replica deployments. SQLite requires a single writer process. Multiple replicas with separate SQLite files produce split issue indexes and broken deduplication. Use Postgres when you run more than one replica. For a single-replica Kubernetes deployment with a ReadWriteOnce PVC, SQLite works but offers no advantage over a VPS.
What is the minimum Postgres instance size for urgentry?
A 2-vCPU, 4 GB RAM managed instance handles most teams. At sustained ingest above 200 events per second with multiple concurrent users, move to 4 vCPU and 8 GB. Add PgBouncer when you run more than five urgentry replicas or share the Postgres instance with other services.
Does urgentry support OTLP on Kubernetes?
Yes. urgentry exposes OTLP/HTTP on port 4318 and OTLP/gRPC on port 4317 from the same binary. Add both ports to the Kubernetes Service. OpenTelemetry collectors inside the cluster can send directly to the Service DNS name without going through the Ingress.
What is the rollback path after a bad urgentry upgrade?
Run helm rollback urgentry --namespace monitoring. Kubernetes performs a rolling replacement back to the prior image. If the upgrade ran a destructive Postgres migration, apply the reverse migration SQL before rolling back, or restore from a Postgres snapshot taken before the upgrade.
Sources and further reading
- Kubernetes Deployment documentation: rolling update strategy, maxSurge, maxUnavailable, and Pod lifecycle.
- Kubernetes PodDisruptionBudget: minAvailable and maxUnavailable semantics during voluntary disruptions.
- Helm chart best practices: values structure, hook weights, and upgrade lifecycle annotations.
- Helm rollback command reference: revision targeting and namespace scoping.
- cert-manager ACME configuration: ClusterIssuer setup, HTTP-01 and DNS-01 challenge solvers, and certificate lifecycle.
- PgBouncer configuration reference: pool modes (transaction, session, statement), max_client_conn, and default_pool_size.
- Prometheus Operator ServiceMonitor API: endpoints, interval, and selector fields.
- FSL-1.1-Apache-2.0 license text: the source-available license under which urgentry is distributed.
- urgentry compatibility matrix: the 218/218 Sentry API operation coverage referenced in this guide.
Ready to deploy urgentry on Kubernetes?
urgentry ships a container image built from a scratch base with the static Go binary. It starts in under two seconds, runs at 52 MB resident under load, and drops into any Kubernetes cluster that can pull from a registry. The Helm chart handles the full HA topology described in this guide.