prompt_caching [2026/04/19 03:39] – created – gwyntel
prompt_caching [2026/04/19 22:16] (current) – Added harness cache efficiency section – gwyntel
SGLang keeps a copy of recently-used prompts in GPU memory (VRAM) so it doesn't have to reprocess them from scratch. This is called the **KV cache** or **radix cache**.
  
When your prompt's prefix matches something already in cache, the model skips re-computing those tokens — saving time and money. Synthetic applies an **80% discount on cache-hit tokens for subscribers**. **PAYG users do not receive cache discounts** — you pay full price for all tokens regardless of cache hits.
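
A quick sketch of how that rule plays out (illustrative only: the function name and numbers are mine, not Synthetic's billing code):

<code python>
# Assumed billing model from the rule above: subscribers pay 20% of the
# price for cache-hit tokens (an 80% discount); PAYG pays full price.

def billed_tokens(prompt_tokens: int, cached_prefix: int, subscriber: bool) -> float:
    """Token count you're effectively charged for under the stated rule."""
    hits = min(cached_prefix, prompt_tokens)
    misses = prompt_tokens - hits
    if subscriber:
        return misses + 0.2 * hits  # 80% off the cached portion
    return float(prompt_tokens)     # PAYG: cache hits don't change the bill

# A 10,000-token prompt whose first 8,000 tokens are already cached:
print(billed_tokens(10_000, 8_000, subscriber=True))   # 3600.0
print(billed_tokens(10_000, 8_000, subscriber=False))  # 10000.0
</code>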
  
===== How Long Does It Stay? =====
  
More concurrent users = more KV cache competition = your cached prompts get evicted faster. During quiet hours, your cache persists much longer.
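
That dynamic can be modeled with a toy LRU cache (a simplification — the real radix cache is more sophisticated, and the capacity and user counts here are invented for illustration):

<code python>
from collections import OrderedDict

def survival_rounds(capacity: int, concurrent_users: int) -> int:
    """Rounds until user 0's cached prompt is evicted (toy LRU model)."""
    cache = OrderedDict()
    cache["user0"] = True  # the cached prompt whose lifetime we track
    rounds = 0
    while "user0" in cache:
        rounds += 1
        for u in range(1, concurrent_users + 1):
            cache[f"user{u}-r{rounds}"] = True  # each user inserts a fresh prompt
            if len(cache) > capacity:
                cache.popitem(last=False)       # evict the least-recently-used entry
    return rounds

print(survival_rounds(capacity=100, concurrent_users=5))   # quiet hours: survives 20 rounds
print(survival_rounds(capacity=100, concurrent_users=50))  # peak: evicted after 2 rounds
</code>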

===== Why Capacity Is Tight =====

<WRAP center round info>
This section is based on public statements from Synthetic staff and community observations. Numbers may change as Synthetic adds hardware.
</WRAP>

Synthetic runs on B200 and H200 GPUs. These are expensive — a single B200 node costs roughly **$400k**. GLM 5/5.1 requires **2× B200s** per instance (NVFP4 quant). Kimi K2.5 runs on either B200s (NVFP4) or H200s (INT4), and is now load-balanced between them (see [[:models:kimi-k25]]).

The economics are tight: breakeven on a B200 pair is roughly **400 subscribers at $30/mo** (fewer if subscribers are on higher tiers). Synthetic can't just throw more B200s at capacity problems — they have to be paid for by subscriber revenue. This is why peak hours cause real cache pressure: the user-to-GPU ratio is constrained by hardware costs, not by choice.
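
Back-of-envelope on those numbers (my arithmetic; the page doesn't say which costs "breakeven" covers, so this only shows the revenue side against hardware):

<code python>
node_cost = 400_000         # one B200 node, per the figure above
monthly_revenue = 400 * 30  # 400 subscribers at $30/mo
print(monthly_revenue)              # 12000
print(node_cost / monthly_revenue)  # roughly 33 months to recoup one node
</code>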

<WRAP tip round>
**The NVFP4 factor:** In practice, B200s have actually been slower than H200s at FP16, because the Blackwell kernels aren't fully optimized yet. NVFP4 is supposed to be significantly faster than FP8 on Blackwell hardware specifically, but real-world performance is still TBD while the format and its kernels mature. If NVFP4 hits its theoretical lift, capacity per B200 improves and cache pressure eases. Until then, B200s are running somewhat below their potential.
</WRAP>
  
===== The Peak-Hours Pricing Effect =====
  
<WRAP tip center round>
Since Synthetic gives an 80% cache discount to subscribers (PAYG gets no discount), and cache hits drop during peak (because eviction is higher), subscribers' **effective price per token naturally rises during peak hours**. It's not a surcharge — the cache savings simply disappear when everyone's fighting for the same VRAM. PAYG users are unaffected since they pay full price either way.
</WRAP>
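
Written as a formula, with a normalized base price (my formulation of the effect described above, not actual Synthetic rates):

<code python>
def effective_price(base_price: float, hit_rate: float, subscriber: bool = True) -> float:
    """Per-token price once the 80% discount on the cached fraction is applied."""
    if not subscriber:
        return base_price                     # PAYG: full price either way
    return base_price * (1 - 0.8 * hit_rate)  # discount scales with the hit rate

p = 1.0  # normalized base price per token
print(effective_price(p, hit_rate=0.9))           # off-peak, warm cache: ~0.28
print(effective_price(p, hit_rate=0.2))           # peak, heavy eviction: ~0.84
print(effective_price(p, 0.9, subscriber=False))  # PAYG: 1.0 regardless of cache
</code>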
  
  
  * [[:limits]] — Subscription limits, token budgets, and recharge rates
  * [[:models:kimi-k25]] — Kimi K2.5 load-balanced routing between B200/H200
  * [[:models]] — Model catalog and pricing
  * [[https://docs.sglang.io|SGLang Official Docs]]
  * [[https://github.com/sgl-project/sglang|SGLang GitHub]]

===== Harness Cache Efficiency =====

<WRAP center round info 60%>
Cache hit rates vary drastically between harnesses. Based on public statements from Synthetic staff, some harnesses are **>5x worse** at hitting cache than others: for the same number of tokens sent, quota usage can be more than 5x higher because of cache misses.
</WRAP>

This happens because different harnesses structure their API calls differently:

  - Harnesses that maintain a consistent prefix (system prompt, then unmodified conversation history) hit cache more often
  - Harnesses that restructure or reorder messages on each call miss cache more often
  - vLLM (used for Kimi) only caches prompts aligned to fixed block boundaries, so small requests may not be cached at all
  - SGLang (used for GLM) matches prefixes at finer granularity and hits cache even for small requests
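
The first two points are the main lever. A toy model treats the cache hit as the longest shared token prefix between consecutive requests (the token lists and harness behaviors here are invented for illustration):

<code python>
def common_prefix_len(a: list, b: list) -> int:
    """Length of the shared prefix: the part a prefix cache can reuse."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = ["<sys>", "You", "are", "helpful"]  # stable system prompt

# Harness A appends each turn, so every request extends the previous one:
req1 = system + ["hi"]
req2 = system + ["hi", "more"]
print(common_prefix_len(req1, req2))  # 5: all of req1 hits cache

# Harness B reorders history on each call, breaking the prefix early:
req2_reordered = system + ["more", "hi"]
print(common_prefix_len(req1, req2_reordered))  # 4: only the system prompt hits
</code>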

The Anthropic API endpoint also uses prompt caching, but cache hit metrics are not yet exposed to end users.

<WRAP center round tip>
If you're seeing unexpectedly high token consumption, your harness's cache hit rate may be the culprit. Try a different harness or check if your tool supports prefix-consistent message ordering.
</WRAP>