


SGLang Prompt Caching

What It Is

SGLang keeps a copy of recently-used prompts in GPU memory (VRAM) so it doesn’t have to reprocess them from scratch. This store is the KV cache; SGLang’s prefix-sharing implementation of it is called the radix cache.

When your prompt’s prefix matches something already in cache, the model skips re-computing those tokens — saving time and money. Synthetic applies an 80% discount on cache-hit tokens, so cache hits are a real cost saver.
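As a rough illustration (the function and token values here are ours, not SGLang’s API), prefix reuse means the tokens shared with a cached prompt are skipped and only the new suffix is computed:

```python
def split_cached_prefix(cached_tokens, new_tokens):
    """Return (reused, to_compute): the shared prefix is served from cache."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return new_tokens[:n], new_tokens[n:]

# A long shared system prompt followed by a different user question:
cached = [101, 7, 7, 9, 42]        # tokens of an earlier request
new = [101, 7, 7, 9, 55, 60]       # same prefix, new question
reused, computed = split_cached_prefix(cached, new)
print(len(reused), len(computed))  # 4 tokens reused, only 2 computed
```

The reused tokens are the ones that earn the cache-hit discount.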

How Long Does It Stay?

No fixed timer. Cached prompts stay in VRAM until the GPU needs room for new requests. If nobody else is using the GPU, your cache can sit there indefinitely. Under load, it gets evicted (removed) to make space.

There is no time-based expiry — only memory pressure causes eviction.

When Eviction Happens

Cache entries get pushed out under memory pressure — which happens when lots of users are sending requests at the same time. In other words: during peak hours.

More concurrent users = more KV cache competition = your cached prompts get evicted faster. During quiet hours, your cache persists much longer.

The Peak-Hours Pricing Effect

Since Synthetic gives an 80% discount on cache-hit tokens, and cache hits drop during peak (because eviction is higher), your effective price per token naturally rises during peak hours. It’s not a surcharge — the cache savings simply disappear when everyone’s fighting for the same VRAM.

- Quiet hours → more cache → cheaper effective per-token cost
- Peak hours → less cache → closer to full price

Tip: Batch heavy work during off-peak for best cache hit rates and lowest effective cost.
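The effect is easy to quantify. With an 80% discount on cache-hit tokens, the blended input price is base price × (1 − 0.8 × hit rate). A sketch (the prices and hit rates below are illustrative, not Synthetic’s actual numbers):

```python
def effective_price(base_price, cache_hit_rate, discount=0.80):
    """Blended per-token input price given the fraction of cache-hit tokens."""
    return base_price * (1 - discount * cache_hit_rate)

base = 1.00  # illustrative price per 1M input tokens
print(effective_price(base, 0.9))  # quiet hours, 90% hits: ~0.28
print(effective_price(base, 0.2))  # peak hours, 20% hits: ~0.84
```

Going from a 90% to a 20% hit rate triples the effective input cost, with no change in list price.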

How Eviction Decides What Goes

Default is LRU (Least Recently Used) — the entry that has gone unused the longest is evicted first.

Alternative: LFU (Least Frequently Used), configurable server-side.
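A minimal sketch of LRU eviction (our simplification — SGLang’s real cache evicts radix-tree nodes holding KV pages, not whole prompts):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("prompt_a", "kv_a")
cache.put("prompt_b", "kv_b")
cache.get("prompt_a")          # touching A makes B the oldest entry
cache.put("prompt_c", "kv_c")  # over capacity: B is evicted
print(list(cache.entries))     # ['prompt_a', 'prompt_c']
```

LFU would instead track a hit counter per entry and evict the entry with the fewest hits.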

Common prompt prefixes are automatically shared across requests via the radix tree structure. Entries are reference-counted — when a node’s ref count drops to zero and memory is needed, it’s evicted.
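The reference-counting idea can be sketched like this (a heavily simplified toy, not SGLang’s actual data structure — real nodes hold token spans and KV pages, not single tokens):

```python
class RadixNode:
    """Toy per-token trie node standing in for a radix-tree node."""
    def __init__(self):
        self.children = {}
        self.ref_count = 0  # number of active requests pinning this node

def insert(root, tokens):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, RadixNode())
        node.ref_count += 1  # pinned while the request runs
    return node

def release(root, tokens):
    node = root
    for t in tokens:
        node = node.children[t]
        node.ref_count -= 1  # at zero, the node becomes evictable

root = RadixNode()
insert(root, [1, 2, 3])            # request A
insert(root, [1, 2, 9])            # request B shares the [1, 2] prefix
print(root.children[1].ref_count)  # 2: prefix pinned by both requests
release(root, [1, 2, 3])           # request A finishes
print(root.children[1].ref_count)  # 1: still pinned by request B
```

Once a node’s count reaches zero it stays cached for future hits, but is fair game for eviction under memory pressure.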

If the cache fills up mid-request, SGLang pauses that request (“retracts” it), frees space, then reschedules and resumes it.

Configurable Parameters

These are Synthetic-side settings — not user-facing, but documented here for transparency.

| Parameter | Default | What It Does |
| --- | --- | --- |
| --radix-eviction-policy | lru | Eviction policy: lru or lfu |
| --mem-fraction-static | ~0.9 | Fraction of GPU memory allocated to model weights + KV cache pool |
| --schedule-policy | fcfs | Scheduling policy; lpm (longest prefix match) encourages cache hits |
| --schedule-conservativeness | 1.0 | Higher = more conservative scheduling; increase if retractions are frequent |
| --chunked-prefill-size | auto | Max tokens per chunked-prefill chunk; reduce if OOM during prefill |
| --kv-cache-dtype | auto | KV cache data type; fp8_e4m3 / fp8_e5m2 for memory savings |
| --disable-radix-cache | False | Disable prefix caching entirely |
| --max-total-tokens | auto | Max tokens in the KV memory pool; overrides auto-calculation |
| --max-running-requests | auto | Max concurrent running requests |

Hierarchical Cache (HiCache)

SGLang supports a 3-tier cache hierarchy, which can reduce the impact of GPU eviction:

  1. L1: GPU VRAM — always active (RadixAttention)
  2. L2: Host/CPU RAM — enabled with --enable-hierarchical-cache
  3. L3: Remote storage — file, mooncake, hf3fs, nixl, aibrix backends

When L1 fills up, KV cache pages offload to L2 (host RAM). When L2 fills, pages go to L3. The eviction cascades: GPU → Host RAM → Storage.

This means evicted GPU entries don’t fully disappear if HiCache is enabled — they can be pulled back from RAM or disk instead of re-computed from scratch.

| HiCache Parameter | Default | What It Does |
| --- | --- | --- |
| --enable-hierarchical-cache | False | Enable HiCache L2+L3 tiers |
| --hicache-ratio | 2.0 | Host memory pool size as a ratio of the device pool |
| --hicache-size | 0 | Host memory pool size in GB (overrides ratio) |
| --hicache-write-policy | write_through | Write policy: write_through, write_back, write_through_selective |
| --hicache-io-backend | kernel | IO backend for CPU↔GPU transfer: direct, kernel |
| --hicache-storage-backend | None | L3 storage backend: file, mooncake, hf3fs, nixl, aibrix, dynamic, eic |
| --hicache-storage-prefetch-policy | best_effort | Prefetch policy: best_effort, wait_complete, timeout |
| --enable-cache-report | False | Return cached token count in API responses |
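With cache reporting enabled, the cache-hit token count appears in the response usage, so clients can measure their own hit rate. A sketch of reading it from an already-parsed OpenAI-style response dict — the exact field names (`prompt_tokens_details`, `cached_tokens`) are our assumption about the response shape, so verify them against a live response:

```python
def cache_hit_rate(response: dict) -> float:
    """Fraction of prompt tokens served from cache (field names assumed)."""
    usage = response["usage"]
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / usage["prompt_tokens"] if usage["prompt_tokens"] else 0.0

# Illustrative response fragment:
resp = {"usage": {"prompt_tokens": 1000,
                  "prompt_tokens_details": {"cached_tokens": 800}}}
print(cache_hit_rate(resp))  # 0.8
```

Logging this per request is the simplest way to see the quiet-hours vs. peak-hours effect described above in your own traffic.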

See Also