
SGLang Prompt Caching

What It Is

SGLang keeps a copy of recently-used prompts in GPU memory (VRAM) so it doesn’t have to reprocess them from scratch. This is called the KV cache or radix cache.

When your prompt’s prefix matches something already in cache, the model skips re-computing those tokens — saving time and money. Synthetic applies an 80% discount on cache-hit tokens for subscribers. PAYG users do not receive cache discounts — you pay full price for all tokens regardless of cache hits.
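In pseudocode terms, a cache hit is just a longest-shared-prefix match on token IDs. A minimal sketch of the idea (not SGLang's actual implementation; the token IDs below are invented):

```python
def shared_prefix_len(cached, prompt):
    """Count how many leading tokens the new prompt shares with a cached one."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

# Hypothetical token IDs: a long shared system prompt, then a new user turn.
cached_tokens = [101, 7, 42, 42, 9, 311]
new_prompt    = [101, 7, 42, 42, 9, 500, 12]

hit = shared_prefix_len(cached_tokens, new_prompt)
print(hit)                    # 5 tokens served from cache
print(len(new_prompt) - hit)  # 2 tokens that still need real prefill
```

Only the non-matching suffix has to be prefilled; the longer the shared prefix, the bigger the saving.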

How Long Does It Stay?

No fixed timer. Cached prompts stay in VRAM until the GPU needs room for new requests. If nobody else is using the GPU, your cache can sit there indefinitely. Under load, it gets evicted (removed) to make space.

There is no time-based expiry — only memory pressure causes eviction.

When Eviction Happens

Cache entries get pushed out under memory pressure — which happens when lots of users are sending requests at the same time. In other words: during peak hours.

More concurrent users = more KV cache competition = your cached prompts get evicted faster. During quiet hours, your cache persists much longer.

Why Capacity Is Tight

This section is based on public statements from Synthetic staff and community observations. Numbers may change as Synthetic adds hardware.

Synthetic runs on B200 and H200 GPUs. These are expensive: a single B200 node costs roughly $400k. GLM 5/5.1 instances run only on B200s (NVFP4 quant). Kimi K2.5 runs on either B200s (NVFP4) or H200s (INT4), and is now load-balanced between them (see kimi-k25).

The economics are tight: breakeven on a B200 pair is roughly 400 subscribers at $30/mo (fewer if subscribers are on higher tiers). Synthetic can’t just throw more B200s at capacity problems — they have to be paid for by subscriber revenue. This is why peak hours cause real cache pressure: the user-to-GPU ratio is constrained by hardware costs, not by choice.

The NVFP4 factor: B200s were actually slower than H200s at FP16 in practice — the kernels aren’t fully optimized yet. NVFP4 is supposed to be significantly faster than FP8 on Blackwell hardware specifically, but real-world perf is still TBD since NVFP4 is new and the kernel optimization isn’t mature yet. If NVFP4 hits its theoretical lift, capacity per B200 improves and cache pressure eases. Until then, B200s are running somewhat below their potential.

The Peak-Hours Pricing Effect

Since Synthetic gives an 80% cache discount to subscribers (PAYG gets no discount), and cache hits drop during peak (because eviction is higher), subscribers’ effective price per token naturally rises during peak hours. It’s not a surcharge — the cache savings simply disappear when everyone’s fighting for the same VRAM. PAYG users are unaffected since they pay full price either way.

- Quiet hours → more cache → cheaper effective per-token cost
- Peak hours → less cache → closer to full price

Tip: Batch heavy work during off-peak for best cache hit rates and lowest effective cost.
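The effect can be worked through with toy numbers. Only the 80% subscriber discount comes from this page; the base price and hit rates below are invented for illustration:

```python
def effective_price(base, hit_rate, discount=0.80):
    """Blended per-token price when cache-hit tokens cost (1 - discount) * base."""
    return base * (hit_rate * (1 - discount) + (1 - hit_rate))

BASE = 1.00  # arbitrary base price per token unit (made-up number)

quiet = effective_price(BASE, hit_rate=0.9)  # good cache hits off-peak
peak  = effective_price(BASE, hit_rate=0.2)  # heavy eviction at peak
print(quiet)  # ~0.28: most tokens get the 80% discount
print(peak)   # ~0.84: close to full price
```

Same workload, same nominal pricing, but the blended cost roughly triples once eviction kills the hit rate.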

How Eviction Decides What Goes

Default is LRU (Least Recently Used) — the oldest-unused prompt gets evicted first.

Alternative: LFU (Least Frequently Used), configurable server-side.
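As a rough illustration of the LRU rule (a toy sketch built on Python's OrderedDict, not SGLang's radix-cache code):

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU store: touching an entry makes it 'recent'; eviction drops the oldest."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def touch(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
        else:
            self.entries[key] = True
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used

cache = LRUCache(capacity=2)
cache.touch("prompt-A")
cache.touch("prompt-B")
cache.touch("prompt-A")     # A is now more recent than B
cache.touch("prompt-C")     # over capacity -> B goes first
print(list(cache.entries))  # ['prompt-A', 'prompt-C']
```

This is why re-sending the same prefix regularly keeps it alive: each hit refreshes its position in the eviction order.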

Common prompt prefixes are automatically shared across requests via the radix tree structure. Entries are reference-counted — when a node’s ref count drops to zero and memory is needed, it’s evicted.
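The reference-counting rule can be sketched like this (a toy model; the node and field names are invented, not SGLang's internals):

```python
class Node:
    """Toy radix-tree node: ref_count tracks in-flight requests using this prefix."""
    def __init__(self):
        self.children = {}
        self.ref_count = 0

def evictable(node):
    # A node may only be freed once no running request still references it.
    return node.ref_count == 0

root = Node()
shared = root.children.setdefault("system-prompt", Node())
shared.ref_count += 1     # request 1 pins the shared prefix
shared.ref_count += 1     # request 2 reuses the same prefix
print(evictable(shared))  # False: still in use, safe from eviction
shared.ref_count -= 2     # both requests finish
print(evictable(shared))  # True: eligible for eviction under memory pressure
```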

If the cache fills up mid-request, SGLang pauses that request (“retracts” it), frees space, then reschedules and resumes it.

Configurable Parameters

These are Synthetic-side settings — not user-facing, but documented here for transparency.

Parameter | Default | What It Does
--radix-eviction-policy | lru | Eviction policy: lru or lfu
--mem-fraction-static | ~0.9 | Fraction of GPU memory allocated to model weights + KV cache pool
--schedule-policy | fcfs | Scheduling policy. lpm (longest prefix match) encourages cache hits
--schedule-conservativeness | 1.0 | Higher = more conservative scheduling. Increase if retraction is frequent
--chunked-prefill-size | auto | Max tokens per chunked-prefill chunk. Reduce if OOM during prefill
--kv-cache-dtype | auto | KV cache data type. fp8_e4m3 / fp8_e5m2 for memory savings
--disable-radix-cache | False | Disable prefix caching entirely
--max-total-tokens | auto | Max tokens in the memory pool. Overrides auto-calculation
--max-running-requests | auto | Max concurrent running requests
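For illustration only, a launch line combining some of these flags might look like the following. The flag names are the ones listed above; the model path is a placeholder, and this is not Synthetic's actual configuration:

```shell
python -m sglang.launch_server \
  --model-path <model> \
  --schedule-policy lpm \
  --radix-eviction-policy lru \
  --mem-fraction-static 0.9 \
  --kv-cache-dtype fp8_e4m3
```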

Hierarchical Cache (HiCache)

SGLang supports a 3-tier cache hierarchy, which can reduce the impact of GPU eviction:

  1. L1: GPU VRAM — always active (RadixAttention)
  2. L2: Host/CPU RAM — enabled with --enable-hierarchical-cache
  3. L3: Remote storage — file, mooncake, hf3fs, nixl, aibrix backends

When L1 fills up, KV cache pages offload to L2 (host RAM). When L2 fills, pages go to L3. The eviction cascades: GPU → Host RAM → Storage.

This means evicted GPU entries don’t fully disappear if HiCache is enabled — they can be pulled back from RAM or disk instead of re-computed from scratch.
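The cascade can be modeled as a tiered lookup (a toy sketch reusing the tier names above; the promote-back-to-L1 behavior shown here is a simplification of what HiCache actually does with KV pages):

```python
def lookup(key, tiers):
    """Search L1 -> L2 -> L3; on a hit in a lower tier, promote the entry to L1."""
    for name in ("L1_vram", "L2_ram", "L3_storage"):
        if key in tiers[name]:
            value = tiers[name][key]
            if name != "L1_vram":
                tiers[name].pop(key)
                tiers["L1_vram"][key] = value  # pull back up instead of recomputing
            return name, value
    return None, None  # true miss: must recompute from scratch

# The system prompt's KV pages were evicted from VRAM but survived in host RAM.
tiers = {"L1_vram": {}, "L2_ram": {"sys-prompt": "kv-pages"}, "L3_storage": {}}
hit_tier, _ = lookup("sys-prompt", tiers)
print(hit_tier)                          # L2_ram: found in host RAM
print("sys-prompt" in tiers["L1_vram"])  # True: promoted back to VRAM
```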

HiCache Parameter | Default | What It Does
--enable-hierarchical-cache | False | Enable HiCache L2+L3 tiers
--hicache-ratio | 2.0 | Host mem pool size as ratio of device pool
--hicache-size | 0 | Host mem pool size in GB (overrides ratio)
--hicache-write-policy | write_through | Write policy: write_through, write_back, write_through_selective
--hicache-io-backend | kernel | IO backend for CPU↔GPU transfer: direct, kernel
--hicache-storage-backend | None | L3 storage backend: file, mooncake, hf3fs, nixl, aibrix, dynamic, eic
--hicache-storage-prefetch-policy | best_effort | Prefetch policy: best_effort, wait_complete, timeout
--enable-cache-report | False | Return cached token count in API responses

See Also