====== SGLang Prompt Caching ======
===== What It Is =====
SGLang keeps the computed attention state (key/value tensors) of recently used prompts in GPU memory (VRAM) so it doesn't have to reprocess those tokens from scratch. This store is called the **KV cache** or **radix cache**.
When your prompt's prefix matches something already in cache, the model skips re-computing those tokens — saving time and money. Synthetic applies an **80% discount on cache-hit tokens for subscribers**. **PAYG users do not receive cache discounts** — you pay full price for all tokens regardless of cache hits.
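To make the discount concrete, here is a minimal sketch of a subscriber's effective input cost. The 80% figure is from this page; the base price and token counts are hypothetical:
<code python>
# Effective cost of a request under the subscriber cache discount.
# The 80% discount is from this page; price and token counts are
# hypothetical, for illustration only.

def effective_input_cost(prompt_tokens: int, cached_tokens: int,
                         price_per_token: float, discount: float = 0.80) -> float:
    """Cache-hit tokens are billed at (1 - discount) of the base price."""
    uncached = prompt_tokens - cached_tokens
    return uncached * price_per_token + cached_tokens * price_per_token * (1 - discount)

base = 3.0 / 1_000_000                             # hypothetical $3 per million input tokens
cold = effective_input_cost(50_000, 0, base)       # nothing cached
warm = effective_input_cost(50_000, 45_000, base)  # 90% of the prompt hits cache
print(f"cold: ${cold:.4f}  warm: ${warm:.4f}")     # cold: $0.1500  warm: $0.0420
</code>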
===== How Long Does It Stay? =====
No fixed timer. Cached prompts stay in VRAM **until the GPU needs room for new requests**. If nobody else is using the GPU, your cache can sit there indefinitely. Under load, it gets evicted (removed) to make space.
There is **no time-based expiry** — only memory pressure causes eviction.
===== When Eviction Happens =====
Cache entries get pushed out under **memory pressure** — which happens when lots of users are sending requests at the same time. In other words: during peak hours.
More concurrent users = more KV cache competition = your cached prompts get evicted faster. During quiet hours, your cache persists much longer.
===== Why Capacity Is Tight =====
This section is based on public statements from Synthetic staff and community observations. Numbers may change as Synthetic adds hardware.
Synthetic runs on B200 and H200 GPUs. These are expensive — a single B200 node costs roughly **$400k**. GLM 5/5.1 requires **2× B200s** per instance (NVFP4 quant). Kimi K2.5 runs on either B200s (NVFP4) or H200s (INT4), and is now load-balanced between them (see [[:models:kimi-k25]]).
The economics are tight: breakeven on a B200 pair is roughly **400 subscribers at $30/mo** (fewer if subscribers are on higher tiers). Synthetic can't just throw more B200s at capacity problems — they have to be paid for by subscriber revenue. This is why peak hours cause real cache pressure: the user-to-GPU ratio is constrained by hardware costs, not by choice.
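As a rough sanity check on that breakeven figure, here is the arithmetic with the hardware cost amortized over a hypothetical 36-month window (the window is an assumption, not a Synthetic number; the $400k and $30/mo figures are from this page):
<code python>
node_cost = 400_000          # USD per B200 node (from this page)
amortization_months = 36     # assumption, not a Synthetic number
sub_price = 30               # USD/month per subscriber (from this page)

monthly_hw_cost = node_cost / amortization_months  # ~ $11,111/month
subscribers_needed = monthly_hw_cost / sub_price   # ~ 370
print(f"~{subscribers_needed:.0f} subscribers to cover the hardware")
</code>
That lands in the neighborhood of the quoted figure; real costs (power, hosting, margin) would push it higher.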
**The NVFP4 factor:** In practice, B200s have actually been slower than H200s at FP16, since the Blackwell kernels aren't fully optimized yet. NVFP4 is supposed to be significantly faster than FP8 on Blackwell hardware specifically, but real-world performance is still TBD because NVFP4 is new and kernel optimization is immature. If NVFP4 hits its theoretical lift, capacity per B200 improves and cache pressure eases. Until then, B200s are running somewhat below their potential.
===== The Peak-Hours Pricing Effect =====
Since Synthetic gives an 80% cache discount to subscribers (PAYG gets no discount), and cache hits drop during peak (because eviction is higher), subscribers' **effective price per token naturally rises during peak hours**. It's not a surcharge — the cache savings simply disappear when everyone's fighting for the same VRAM. PAYG users are unaffected since they pay full price either way.
- **Quiet hours** → more cache → cheaper effective per-token cost
- **Peak hours** → less cache → closer to full price
**Tip:** Batch heavy work during off-peak for best cache hit rates and lowest effective cost.
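The effect reduces to a single multiplier: subscribers pay ''1 - 0.8 × hit_rate'' of full price. The hit rates below are illustrative, not measured values:
<code python>
# Effective per-token price multiplier for subscribers (80% discount on
# cache-hit tokens, per this page). PAYG is always 1.0: no discount.
for hit_rate in (0.9, 0.5, 0.1):          # quiet hours -> peak (illustrative)
    multiplier = 1 - 0.80 * hit_rate      # fraction of full price paid
    print(f"hit rate {hit_rate:.0%}: pay {multiplier:.0%} of full price")
# hit rate 90%: pay 28% of full price
# hit rate 50%: pay 60% of full price
# hit rate 10%: pay 92% of full price
</code>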
===== How Eviction Decides What Goes =====
Default is **LRU** (Least Recently Used) — the oldest-unused prompt gets evicted first.
Alternative: **LFU** (Least Frequently Used), configurable server-side.
Common prompt prefixes are **automatically shared** across requests via the radix tree structure. Entries are reference-counted — when a node's ref count drops to zero and memory is needed, it's evicted.
If the cache fills up mid-request, SGLang pauses that request ("retracts" it), frees space, then reschedules and resumes it.
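For intuition, here is a toy LRU cache in Python. This is not SGLang's implementation (SGLang evicts zero-reference leaves of the radix tree); it only illustrates the eviction policy itself:
<code python>
# Toy illustration of LRU eviction, not SGLang's actual code.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()          # least recently used entry first

    def get(self, key):
        if key not in self.entries:
            return None                       # cache miss
        self.entries.move_to_end(key)         # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("prefix-A", "kv-A"); cache.put("prefix-B", "kv-B")
cache.get("prefix-A")                         # A is now most recent
cache.put("prefix-C", "kv-C")                 # evicts B, the LRU entry
assert cache.get("prefix-B") is None
</code>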
===== Configurable Parameters =====
These are Synthetic-side settings — not user-facing, but documented here for transparency.
^ Parameter ^ Default ^ What It Does ^
| ''--radix-eviction-policy'' | ''lru'' | Eviction policy: ''lru'' or ''lfu'' |
| ''--mem-fraction-static'' | ~0.9 | Fraction of GPU memory allocated to model weights + KV cache pool |
| ''--schedule-policy'' | ''fcfs'' | Scheduling policy. ''lpm'' (longest prefix match) encourages cache hits |
| ''--schedule-conservativeness'' | ''1.0'' | Higher = more conservative scheduling. Increase if frequent retraction |
| ''--chunked-prefill-size'' | auto | Max tokens per chunked prefill chunk. Reduce if OOM during prefill |
| ''--kv-cache-dtype'' | ''auto'' | KV cache data type. ''fp8_e4m3'' / ''fp8_e5m2'' for memory savings |
| ''--disable-radix-cache'' | ''False'' | Disable prefix caching entirely |
| ''--max-total-tokens'' | auto | Max tokens in memory pool. Overrides auto-calculation |
| ''--max-running-requests'' | auto | Max concurrent running requests |
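A sketch of how these flags fit together at launch. The flag names are the server arguments from the table above; the model path and values are illustrative, not Synthetic's actual configuration:
<code python>
# Launch an SGLang server with cache-relevant flags (illustrative values).
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "org/some-model",       # placeholder model
    "--radix-eviction-policy", "lru",       # default eviction policy
    "--schedule-policy", "lpm",             # longest-prefix-match favors cache hits
    "--mem-fraction-static", "0.9",         # weights + KV pool share of VRAM
    "--kv-cache-dtype", "fp8_e4m3",         # shrink KV entries for more headroom
], check=True)
</code>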
===== Hierarchical Cache (HiCache) =====
SGLang supports a 3-tier cache hierarchy, which can reduce the impact of GPU eviction:
- **L1: GPU VRAM** — always active (RadixAttention)
- **L2: Host/CPU RAM** — enabled with ''--enable-hierarchical-cache''
- **L3: Remote storage** — file, mooncake, hf3fs, nixl, aibrix backends
When L1 fills up, KV cache pages offload to L2 (host RAM). When L2 fills, pages go to L3. The eviction cascades: GPU → Host RAM → Storage.
This means evicted GPU entries don't fully disappear if HiCache is enabled — they can be pulled back from RAM or disk instead of re-computed from scratch.
^ HiCache Parameter ^ Default ^ What It Does ^
| ''--enable-hierarchical-cache'' | ''False'' | Enable HiCache L2+L3 tiers |
| ''--hicache-ratio'' | ''2.0'' | Host mem pool size as ratio of device pool |
| ''--hicache-size'' | ''0'' | Host mem pool size in GB (overrides ratio) |
| ''--hicache-write-policy'' | ''write_through'' | Write policy: ''write_through'', ''write_back'', ''write_through_selective'' |
| ''--hicache-io-backend'' | ''kernel'' | IO backend for CPU↔GPU transfer: ''direct'', ''kernel'' |
| ''--hicache-storage-backend'' | ''None'' | L3 storage backend: ''file'', ''mooncake'', ''hf3fs'', ''nixl'', ''aibrix'', ''dynamic'', ''eic'' |
| ''--hicache-storage-prefetch-policy'' | ''best_effort'' | Prefetch policy: ''best_effort'', ''wait_complete'', ''timeout'' |
| ''--enable-cache-report'' | ''False'' | Return cached token count in API responses |
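And a sketch of a HiCache-enabled launch using flags from the table above (again, the model path and sizing values are illustrative):
<code python>
# Launch with hierarchical caching: L2 in host RAM, L3 on local disk.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "org/some-model",        # placeholder model
    "--enable-hierarchical-cache",           # turn on the L2 host-RAM tier
    "--hicache-ratio", "2.0",                # host pool = 2x device pool
    "--hicache-storage-backend", "file",     # add an L3 tier on disk
    "--enable-cache-report",                 # report cached tokens in responses
], check=True)
</code>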
===== Harness Cache Efficiency =====
Cache hit rates vary drastically between harnesses. Based on public statements from Synthetic staff, some harnesses are **>5x worse** at hitting cache than others: same number of tokens sent, but quota usage >5x higher due to cache misses.
This happens because different harnesses structure their API calls differently:
- Harnesses that keep a stable prefix (system prompt, conversation history) hit cache more often
- Harnesses that restructure or reorder messages on each call miss cache more often
- vLLM (used for Kimi) caches at fixed block boundaries, so small requests may not be cached at all
- SGLang (used for GLM) caches at token granularity in the radix tree, so even small requests can hit cache
The Anthropic API endpoint also uses prompt caching, but cache hit metrics are not yet exposed to end users.
If you're seeing unexpectedly high token consumption, your harness's cache hit rate may be the culprit. Try a different harness, or check whether your tool keeps message ordering prefix-consistent, as in the sketch below.
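An illustrative sketch of the difference (this is no specific harness's code; all names are hypothetical):
<code python>
# Why prefix stability matters: the cache matches on an exact token prefix,
# so the request must be byte-identical up to the newly appended content.

SYSTEM = {"role": "system", "content": "You are a helpful assistant."}

def good_request(history: list[dict], question: str) -> list[dict]:
    # Append-only: earlier turns are reused verbatim, so the shared
    # prefix grows each turn and keeps hitting cache.
    return [SYSTEM, *history, {"role": "user", "content": question}]

def bad_request(history: list[dict], question: str) -> list[dict]:
    # Injecting fresh content near the front (e.g. a timestamp) changes
    # the first tokens, so the whole prompt misses cache on every call.
    import time
    stamped = {"role": "system",
               "content": f"Current time: {time.ctime()}. You are helpful."}
    return [stamped, *history, {"role": "user", "content": question}]
</code>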
===== See Also =====
* [[:limits]] — Subscription limits, token budgets, and recharge rates
* [[:models:kimi-k25]] — Kimi K2.5 load-balanced routing between B200/H200
* [[:models]] — Model catalog and pricing
* [[https://docs.sglang.ai|SGLang Official Docs]]
* [[https://github.com/sgl-project/sglang|SGLang GitHub]]