


SGLang Prompt Caching

What It Is

SGLang keeps a copy of recently-used prompts in GPU memory (VRAM) so it doesn’t have to reprocess them from scratch. This store is the KV cache; SGLang’s prefix-sharing implementation of it is called the radix cache.

When your prompt’s prefix matches something already in cache, the model skips re-computing those tokens — saving time and money. Synthetic applies an 80% discount on cache-hit tokens, so cache hits are a real cost saver.
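As a rough illustration (the function and token values here are ours, not SGLang’s API), prefix reuse means the tokens shared with a cached prompt are skipped and only the new suffix is computed:

```python
def split_cached_prefix(cached_tokens, new_tokens):
    """Return (reused, to_compute): the shared prefix is served from cache."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return new_tokens[:n], new_tokens[n:]

# A long shared system prompt followed by a different user question:
cached = [101, 7, 7, 9, 42]        # tokens of an earlier request
new = [101, 7, 7, 9, 55, 60]       # same prefix, new question
reused, computed = split_cached_prefix(cached, new)
print(len(reused), len(computed))  # 4 tokens reused, only 2 computed
```

The reused tokens are the ones that earn the cache-hit discount.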

How Long Does It Stay?

No fixed timer. Cached prompts stay in VRAM until the GPU needs room for new requests. If nobody else is using the GPU, your cache can sit there indefinitely. Under load, it gets evicted (removed) to make space.

There is no time-based expiry — only memory pressure causes eviction.

When Eviction Happens

Cache entries get pushed out under memory pressure — which happens when lots of users are sending requests at the same time. In other words: during peak hours.

More concurrent users = more KV cache competition = your cached prompts get evicted faster. During quiet hours, your cache persists much longer.

The Peak-Hours Pricing Effect

Since Synthetic gives an 80% discount on cache-hit tokens, and cache hits drop during peak (because eviction is higher), your effective price per token naturally rises during peak hours. It’s not a surcharge — the cache savings simply disappear when everyone’s fighting for the same VRAM.

- Quiet hours → more cache → cheaper effective per-token cost
- Peak hours → less cache → closer to full price

Tip: Batch heavy work during off-peak for best cache hit rates and lowest effective cost.
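The effect is easy to quantify. With an 80% discount on cache-hit tokens, the blended input price is base price × (1 − 0.8 × hit rate). A sketch (the prices and hit rates below are illustrative, not Synthetic’s actual numbers):

```python
def effective_price(base_price, cache_hit_rate, discount=0.80):
    """Blended per-token input price given the fraction of cache-hit tokens."""
    return base_price * (1 - discount * cache_hit_rate)

base = 1.00  # illustrative price per 1M input tokens
print(effective_price(base, 0.9))  # quiet hours, 90% hits: ~0.28
print(effective_price(base, 0.2))  # peak hours, 20% hits: ~0.84
```

Going from a 90% to a 20% hit rate triples the effective input cost, with no change in list price.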

How Eviction Decides What Goes

Default is LRU (Least Recently Used) — the entry that has gone unused the longest is evicted first.

Alternative: LFU (Least Frequently Used), configurable server-side.
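A minimal sketch of LRU eviction (our simplification — SGLang’s real cache evicts radix-tree nodes holding KV pages, not whole prompts):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("prompt_a", "kv_a")
cache.put("prompt_b", "kv_b")
cache.get("prompt_a")          # touching A makes B the oldest entry
cache.put("prompt_c", "kv_c")  # over capacity: B is evicted
print(list(cache.entries))     # ['prompt_a', 'prompt_c']
```

LFU would instead track a hit counter per entry and evict the entry with the fewest hits.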

Common prompt prefixes are automatically shared across requests via the radix tree structure. Entries are reference-counted — when a node’s ref count drops to zero and memory is needed, it’s evicted.
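The reference-counting idea can be sketched like this (a heavily simplified toy, not SGLang’s actual data structure — real nodes hold token spans and KV pages, not single tokens):

```python
class RadixNode:
    """Toy per-token trie node standing in for a radix-tree node."""
    def __init__(self):
        self.children = {}
        self.ref_count = 0  # number of active requests pinning this node

def insert(root, tokens):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, RadixNode())
        node.ref_count += 1  # pinned while the request runs
    return node

def release(root, tokens):
    node = root
    for t in tokens:
        node = node.children[t]
        node.ref_count -= 1  # at zero, the node becomes evictable

root = RadixNode()
insert(root, [1, 2, 3])            # request A
insert(root, [1, 2, 9])            # request B shares the [1, 2] prefix
print(root.children[1].ref_count)  # 2: prefix pinned by both requests
release(root, [1, 2, 3])           # request A finishes
print(root.children[1].ref_count)  # 1: still pinned by request B
```

Once a node’s count reaches zero it stays cached for future hits, but is fair game for eviction under memory pressure.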

If the cache fills up mid-request, SGLang pauses that request (“retracts” it), frees space, then reschedules and resumes it.

Configurable Parameters

These are Synthetic-side settings — not user-facing, but documented here for transparency.

| Parameter | Default | What It Does |
| --- | --- | --- |
| --radix-eviction-policy | lru | Eviction policy: lru or lfu |
| --mem-fraction-static | ~0.9 | Fraction of GPU memory allocated to model weights + KV cache pool |
| --schedule-policy | fcfs | Scheduling policy; lpm (longest prefix match) encourages cache hits |
| --schedule-conservativeness | 1.0 | Higher = more conservative scheduling; increase if retractions are frequent |
| --chunked-prefill-size | auto | Max tokens per chunked-prefill chunk; reduce if OOM during prefill |
| --kv-cache-dtype | auto | KV cache data type; fp8_e4m3 / fp8_e5m2 for memory savings |
| --disable-radix-cache | False | Disable prefix caching entirely |
| --max-total-tokens | auto | Max tokens in the KV memory pool; overrides auto-calculation |
| --max-running-requests | auto | Max concurrent running requests |

Hierarchical Cache (HiCache)

SGLang supports a 3-tier cache hierarchy, which can reduce the impact of GPU eviction:

  1. L1: GPU VRAM — always active (RadixAttention)
  2. L2: Host/CPU RAM — enabled with --enable-hierarchical-cache
  3. L3: Remote storage — file, mooncake, hf3fs, nixl, aibrix backends

When L1 fills up, KV cache pages offload to L2 (host RAM). When L2 fills, pages go to L3. The eviction cascades: GPU → Host RAM → Storage.

This means evicted GPU entries don’t fully disappear if HiCache is enabled — they can be pulled back from RAM or disk instead of re-computed from scratch.

| HiCache Parameter | Default | What It Does |
| --- | --- | --- |
| --enable-hierarchical-cache | False | Enable HiCache L2+L3 tiers |
| --hicache-ratio | 2.0 | Host memory pool size as a ratio of the device pool |
| --hicache-size | 0 | Host memory pool size in GB (overrides ratio) |
| --hicache-write-policy | write_through | Write policy: write_through, write_back, write_through_selective |
| --hicache-io-backend | kernel | IO backend for CPU↔GPU transfer: direct, kernel |
| --hicache-storage-backend | None | L3 storage backend: file, mooncake, hf3fs, nixl, aibrix, dynamic, eic |
| --hicache-storage-prefetch-policy | best_effort | Prefetch policy: best_effort, wait_complete, timeout |
| --enable-cache-report | False | Return cached token count in API responses |
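With cache reporting enabled, the cache-hit token count appears in the response usage, so clients can measure their own hit rate. A sketch of reading it from an already-parsed OpenAI-style response dict — the exact field names (`prompt_tokens_details`, `cached_tokens`) are our assumption about the response shape, so verify them against a live response:

```python
def cache_hit_rate(response: dict) -> float:
    """Fraction of prompt tokens served from cache (field names assumed)."""
    usage = response["usage"]
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / usage["prompt_tokens"] if usage["prompt_tokens"] else 0.0

# Illustrative response fragment:
resp = {"usage": {"prompt_tokens": 1000,
                  "prompt_tokens_details": {"cached_tokens": 800}}}
print(cache_hit_rate(resp))  # 0.8
```

Logging this per request is the simplest way to see the quiet-hours vs. peak-hours effect described above in your own traffic.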

See Also