How Umbrella thinks about traffic.

The whole product fits in three nouns — pools, backends, routes — and four ideas: snapshot, strategy, health, analytics. This is everything you need to know.


Alpha
Documentation describes the current state. Umbrella is actively developed; some surfaces below may change before v1.0. Where a behavior is likely to evolve, the section says so explicitly.

Concepts

Three primitives:

  • Pool — a group of backends that share a balancing strategy and health-check policy.
  • Backend — a single upstream URL, member of exactly one pool. Has a weight and an enabled flag.
  • Route — a prioritized rule that maps an incoming request to a pool.

Configuration lives in SQL (SQLite or Postgres). Audit logs record every write. The proxy hot path never reads from SQL — see snapshots.

Atomic snapshots

Every request the proxy serves reads from an immutable in-memory RouterSnapshot. On any config write Umbrella:

  • Persists the change to SQL inside a transaction.
  • Bumps a monotonically-increasing config_version counter.
  • Rebuilds a fresh RouterSnapshot with pre-compiled regex, weight tables, and pool indices.
  • Reassigns the snapshot reference atomically — readers grab a local pointer so no locks are needed.

Effect: configuration changes apply to the very next request. No restart. No request loss. The version number shown in the dashboard’s snapshot indicator (for example, Snapshot v6) is exactly this counter.
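
A minimal sketch of the swap described above, not Umbrella's actual code. The SnapshotHolder class, its method names, and the shape of the routes field are assumptions for illustration; the SQL transaction is left out, and the sketch relies on CPython treating a plain attribute rebind as atomic.

```python
import re
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterSnapshot:
    """Immutable view of the routing config (field names are illustrative)."""
    version: int
    routes: tuple  # (compiled_regex, pool_name) pairs, compiled once per rebuild


class SnapshotHolder:
    """Hypothetical holder: writers rebuild and swap, readers never lock."""

    def __init__(self) -> None:
        self._write_lock = threading.Lock()  # serializes writers only
        self._current = RouterSnapshot(version=0, routes=())

    def get(self) -> RouterSnapshot:
        # A reader keeps this local reference for the whole request; the
        # object cannot change underneath it because it is frozen.
        return self._current

    def publish(self, raw_routes: list) -> RouterSnapshot:
        with self._write_lock:
            compiled = tuple((re.compile(p), pool) for p, pool in raw_routes)
            fresh = RouterSnapshot(self._current.version + 1, compiled)
            self._current = fresh  # one reference assignment swaps the config
            return fresh


holder = SnapshotHolder()
before = holder.get()
holder.publish([(r"^/api/", "api-pool")])
after = holder.get()
```

A request that grabbed `before` keeps routing against version 0 until it finishes; every later call to `get()` sees version 1.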

Balancing strategies

Pick per pool. Switchable live; the next request uses the new strategy.

Strategy     Behavior                             When to use
round_robin  cycles backends equally              default; works for stateless services
weighted     random pick proportional to weight   mixed-capacity backends; canary deploys
least_conn   fewest in-flight wins                long-lived requests of varying duration
ip_hash      sticky-by-client-IP                  session affinity without a sticky cookie
random       uniform random pick                  chaos testing; small-N pools
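
To make the behaviors concrete, here is a rough sketch of four of the strategies. The Backend dataclass and all function names are hypothetical, and hashing the client IP with SHA-256 is an assumption; the docs do not say which hash ip_hash uses.

```python
import hashlib
import itertools
import random
from dataclasses import dataclass


@dataclass
class Backend:
    url: str
    weight: int = 1
    in_flight: int = 0  # requests currently being served


def make_round_robin(backends):
    """Return a picker that cycles through backends in order."""
    cycle = itertools.cycle(backends)
    return lambda: next(cycle)


def pick_weighted(backends, rng=random):
    """Random pick with probability proportional to each weight."""
    return rng.choices(backends, weights=[b.weight for b in backends], k=1)[0]


def pick_least_conn(backends):
    """Backend with the fewest in-flight requests wins."""
    return min(backends, key=lambda b: b.in_flight)


def pick_ip_hash(backends, client_ip: str):
    """Same client IP maps to the same backend while the pool is unchanged."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


pool = [Backend("http://a"), Backend("http://b", weight=3, in_flight=2)]
next_backend = make_round_robin(pool)
```

Note that this simple modulo hash reshuffles most clients when a backend is added or removed.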

Routing rules

Routes are evaluated in priority order, lowest number first. The first route whose path, host, methods, headers, and query all match captures the request.
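
First-match evaluation can be sketched as follows. The Route dataclass and resolve function are illustrative names, and a single callable stands in for the combined path / host / methods / headers / query checks.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Route:
    name: str
    priority: int
    pool: str
    matches: Callable[[str], bool]  # stand-in for all matcher checks


def resolve(routes: list, path: str) -> Optional[str]:
    """Return the pool of the first matching route, lowest priority first."""
    for route in sorted(routes, key=lambda r: r.priority):
        if route.matches(path):
            return route.pool
    return None  # no route captured the request


routes = [
    Route("catch-all", 100, "web", lambda p: True),
    Route("api", 10, "api", lambda p: p.startswith("/api/")),
]
```

Here `/api/users` lands in the api pool even though the catch-all also matches, because priority 10 is checked before priority 100.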

Path matchers

  • exact — full-path equality (/login)
  • prefix — path prefix (/api/)
  • glob — fnmatch-style wildcards (/users/*/posts)
  • regex — Python re.search on the full path
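
The four matcher kinds map onto the Python standard library roughly like this. The path_matches function is a hypothetical helper, and using fnmatchcase (case-sensitive) for glob is an assumption.

```python
import fnmatch
import re


def path_matches(kind: str, pattern: str, path: str) -> bool:
    """Evaluate one path matcher against a request path."""
    if kind == "exact":
        return path == pattern
    if kind == "prefix":
        return path.startswith(pattern)
    if kind == "glob":
        # fnmatch's * also matches "/", so it can span several segments.
        return fnmatch.fnmatchcase(path, pattern)
    if kind == "regex":
        # re.search is unanchored; use ^ and $ in the pattern to pin it.
        return re.search(pattern, path) is not None
    raise ValueError(f"unknown matcher kind: {kind}")
```

Two consequences worth knowing: with fnmatch semantics `/users/*/posts` also matches `/users/42/x/posts`, and an unanchored regex such as `posts` matches anywhere in the path.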

Host matchers

Match by Host header. Supports wildcards: api.example.com, *.example.com, *-staging.example.com. Leave blank to match any host.

Header & query matchers

One name=value per line. The value can be a literal, a glob using * and ?, or regex:<pattern>.

header_matchers
X-Tenant=acme
Authorization=Bearer *
X-Version=regex:^v[2-9]$
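
A sketch of how such a block could be parsed and applied. All function names are hypothetical, and two details are assumptions: header names are compared case-insensitively, and globs use fnmatch semantics (so a literal value containing [ would be read as a character class).

```python
import fnmatch
import re


def parse_matchers(text: str) -> dict:
    """Turn 'name=value' lines into a {lowercased name: rule} mapping."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        name, _, value = line.partition("=")
        rules[name.strip().lower()] = value
    return rules


def value_matches(rule: str, actual: str) -> bool:
    if rule.startswith("regex:"):
        return re.search(rule[len("regex:"):], actual) is not None
    return fnmatch.fnmatchcase(actual, rule)  # plain literals match themselves


def headers_match(rules: dict, headers: dict) -> bool:
    """Every rule must match; a missing header fails its rule."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return all(
        name in lowered and value_matches(rule, lowered[name])
        for name, rule in rules.items()
    )


RULES = parse_matchers(
    "X-Tenant=acme\nAuthorization=Bearer *\nX-Version=regex:^v[2-9]$"
)
```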

Forwarding tweaks

  • Strip prefix — drop the matched path prefix before forwarding upstream.
  • Rewrite host — replace incoming Host with the upstream’s host (default).
  • Preserve host — keep the original Host header (use for vhost-routed upstreams).
  • Per-route timeout — override the global httpx timeout for slow endpoints.
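
The first three tweaks amount to a small transformation of the outgoing request. In this sketch the function name and parameters are hypothetical, and any path component on the upstream URL is ignored for simplicity.

```python
from urllib.parse import urlsplit


def build_upstream_request(path, host_header, upstream_url, *,
                           prefix="", strip_prefix=False, preserve_host=False):
    """Return (target_url, headers) for the forwarded request."""
    if strip_prefix and prefix and path.startswith(prefix):
        # Drop the matched prefix but keep a leading slash.
        path = "/" + path[len(prefix):].lstrip("/")
    upstream = urlsplit(upstream_url)
    # Default: rewrite Host to the upstream; optionally keep the original.
    host = host_header if preserve_host else upstream.netloc
    return f"{upstream.scheme}://{upstream.netloc}{path}", {"Host": host}


stripped = build_upstream_request(
    "/api/users", "example.com", "http://10.0.0.5:8080",
    prefix="/api/", strip_prefix=True,
)
preserved = build_upstream_request(
    "/api/users", "example.com", "http://10.0.0.5:8080", preserve_host=True,
)
```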

Health checks

Two layers, working together.

Active probes

One asyncio task per pool sends configurable probes against every backend. Backends transition through a state machine:

UNKNOWN → HEALTHY ↔ UNHEALTHY → HALF_OPEN → HEALTHY

  • Interval / timeout — how often, how long to wait.
  • Path / method / expected status — e.g. GET /healthz expecting 200-299.
  • Healthy threshold — consecutive successes before flipping to HEALTHY.
  • Unhealthy threshold — consecutive failures before flipping to UNHEALTHY.
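
The two thresholds behave like consecutive-result counters. The sketch below covers only the UNKNOWN, HEALTHY, and UNHEALTHY states driven by probes; HALF_OPEN is left out, and the ProbeTracker class with its default thresholds is an assumption for illustration.

```python
from enum import Enum


class State(Enum):
    UNKNOWN = "unknown"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


class ProbeTracker:
    """Flip state only after N consecutive probe results of the same kind."""

    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.state = State.UNKNOWN
        self._successes = 0
        self._failures = 0

    def record(self, ok: bool) -> State:
        if ok:
            self._successes += 1
            self._failures = 0  # a success breaks any failure streak
            if self._successes >= self.healthy_threshold:
                self.state = State.HEALTHY
        else:
            self._failures += 1
            self._successes = 0  # a failure breaks any success streak
            if self._failures >= self.unhealthy_threshold:
                self.state = State.UNHEALTHY
        return self.state


tracker = ProbeTracker(healthy_threshold=2, unhealthy_threshold=3)
history = [tracker.record(ok) for ok in (True, True, False, False, False, True, True)]
```

A single failed probe never ejects a healthy backend, and a single lucky success never readmits an unhealthy one.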

Passive circuit breaker

The forwarder tracks real-traffic outcomes. If a backend returns N consecutive 5xx, its circuit opens for a configurable cool-off, then half-opens for the next active probe to decide.
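
A minimal sketch of that breaker, assuming the state names closed / open / half_open and an injectable clock for testing. The CircuitBreaker class, its method names, and the defaults are hypothetical.

```python
import time


class CircuitBreaker:
    def __init__(self, max_failures=5, cooloff_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooloff_s = cooloff_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at < self.cooloff_s:
            return "open"
        return "half_open"  # cool-off elapsed; waiting for a probe verdict

    def record_response(self, status_code: int) -> None:
        """Feed in the outcome of a real proxied request."""
        if 500 <= status_code <= 599:
            self._failures += 1
            if self._failures >= self.max_failures and self._opened_at is None:
                self._opened_at = self._clock()
        else:
            self._failures = 0  # streak must be consecutive

    def probe_result(self, ok: bool) -> None:
        """Feed in the next active-probe outcome while half-open."""
        if self.state != "half_open":
            return
        if ok:
            self._failures = 0
            self._opened_at = None
        else:
            self._opened_at = self._clock()  # restart the cool-off


now = [0.0]  # fake clock so the example needs no sleeping
breaker = CircuitBreaker(max_failures=3, cooloff_s=10.0, clock=lambda: now[0])
```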

Result

Backends that pass active probes but fail on real traffic still get pulled out. Pools with no healthy backends return 503 with Retry-After: 5 instead of queuing.

Analytics & ClickHouse

The Analytics page in the dashboard shows per-minute traffic, p50/p95 latency, top routes, and top error paths. By default this is backed by the SQL database; for high-volume production, opt into ClickHouse.

SQL (default)

Every request is logged into the request_logs table at a configurable sample rate. Aggregations run in Python over SQL rows. Fine up to a few million rows; sluggish past that.

ClickHouse (opt-in)

env
UMBRELLA_CLICKHOUSE_URL=http://default:password@clickhouse:8123/umbrella
UMBRELLA_CLICKHOUSE_TTL_DAYS=30
UMBRELLA_CLICKHOUSE_FLUSH_INTERVAL_S=2.0
UMBRELLA_CLICKHOUSE_FLUSH_MAX_ROWS=1000

Schema is auto-created. Inserts are buffered and batched over HTTP. Aggregate queries (KPIs, time-series, top-N) run directly on ClickHouse with quantile() functions, so the dashboard stays fast even over hundreds of millions of rows.
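
The two flush settings suggest a buffer that empties when either limit is hit. The sketch below illustrates that idea only: BatchBuffer is a hypothetical class, the send callable stands in for the HTTP insert, and the interval is checked on each add, whereas a real implementation would more likely use a background timer.

```python
import time


class BatchBuffer:
    """Collect rows and hand them to `send` in batches."""

    def __init__(self, send, max_rows=1000, flush_interval_s=2.0,
                 clock=time.monotonic):
        self._send = send
        self._max_rows = max_rows
        self._interval = flush_interval_s
        self._clock = clock
        self._rows = []
        self._last_flush = clock()

    def add(self, row: dict) -> None:
        self._rows.append(row)
        full = len(self._rows) >= self._max_rows
        stale = self._clock() - self._last_flush >= self._interval
        if full or stale:
            self.flush()

    def flush(self) -> None:
        if self._rows:
            # Swap the list out first so `send` sees a complete batch.
            batch, self._rows = self._rows, []
            self._send(batch)
        self._last_flush = self._clock()


now = [0.0]  # fake clock for the example
sent = []    # stands in for ClickHouse
buffer = BatchBuffer(sent.append, max_rows=3, flush_interval_s=2.0,
                     clock=lambda: now[0])
```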

Schema

SQL · ClickHouse
CREATE TABLE umbrella.request_logs (
  ts          DateTime64(3, 'UTC'),
  route_id    Nullable(UInt32),
  pool_id     Nullable(UInt32),
  backend_id  Nullable(UInt32),
  method      LowCardinality(String),
  path        String,
  status_code UInt16,
  duration_ms Float32,
  client_ip   Nullable(String),
  error       Nullable(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY (ts, status_code, route_id)
TTL toDateTime(ts) + INTERVAL 30 DAY;

Metrics & logs

Prometheus

Exposed at /metrics:

  • umbrella_proxy_requests_total
  • umbrella_proxy_responses_total{status, route}
  • umbrella_proxy_request_duration_seconds{route, pool} (histogram; derive p50/p95/p99 from its buckets with histogram_quantile())
  • umbrella_backend_up{pool, backend} (gauge)

Add Umbrella as a Prometheus scrape target and use Grafana for time-series dashboards.

Structured logs

JSON to stdout via structlog. Every request, every health-state transition, every audit event. Ship to Loki, Vector, Datadog, CloudWatch — whatever you have.

Audit log

Every config write (pools / backends / routes / users) records the user, IP, action, target, and full payload diff into audit_logs. Available via the dashboard’s admin pages.

JSON API

The dashboard is just a thin HTML view on top of a complete REST API. Everything you can click, you can curl.

  • POST /dashboard/api/v1/auth/login — get a JWT cookie
  • GET|POST|PATCH|DELETE /dashboard/api/v1/pools
  • GET|POST|PATCH|DELETE /dashboard/api/v1/backends
  • GET|POST|PATCH|DELETE /dashboard/api/v1/routes
  • GET /dashboard/api/v1/topology — full route → pool → backend graph
  • GET /dashboard/api/v1/analytics/{summary,top-routes,top-errors}

Interactive Swagger UI at /dashboard/api/docs.
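
A sketch of calling the API from Python with only the standard library. The endpoint paths come from the list above; everything else is an assumption: the base URL, the username / password field names in the login body, and the pool payload fields. Check the Swagger UI for the real request schemas.

```python
import json
import urllib.request
from typing import Optional

BASE = "http://localhost:8080"  # assumption: where Umbrella listens


def api_request(method: str, path: str,
                payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a request against the v1 API."""
    data = json.dumps(payload).encode() if payload is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(
        f"{BASE}/dashboard/api/v1{path}",
        data=data, headers=headers, method=method,
    )


def make_session() -> urllib.request.OpenerDirector:
    """Opener that stores cookies, so the JWT cookie from login is reused."""
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor())


# Field names below are hypothetical placeholders.
login = api_request("POST", "/auth/login",
                    {"username": "admin", "password": "change-me"})
create_pool = api_request("POST", "/pools",
                          {"name": "api", "strategy": "round_robin"})
topology = api_request("GET", "/topology")
```

To actually send them, open each request through one session, for example `session = make_session()` followed by `session.open(login)` and then `session.open(topology)`.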

Non-goals

Things Umbrella explicitly does not do, by design:

  • TLS termination — put Caddy / nginx / Traefik / Cloudflare in front.
  • Multi-tenancy — single-tenant only.
  • Service discovery — give it URLs, not Consul integration.
  • Distributed control plane — single-instance v1; HA via Postgres + LISTEN/NOTIFY is on the roadmap.

If you need any of the above, reach for Envoy / HAProxy / Traefik. If you want a small, sharp, self-hosted thing, use Umbrella.