prompt_caching: created 2026/04/19 03:39 by gwyntel; current revision 2026/04/19 22:16 by gwyntel (added harness cache efficiency section).
SGLang keeps a copy of recently-used prompts in GPU memory (VRAM) so it doesn't have to recompute them from scratch on every request.

When your prompt's prefix matches a cached entry, the matching tokens are served from the cache instead of being recomputed, which is faster and (for subscribers) cheaper.
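The prefix-matching idea can be sketched as follows. This is a minimal illustration, not Synthetic's or SGLang's actual code, and the token IDs are made up:

```python
# Illustrative sketch: prompt caching reuses work for the longest shared
# token prefix between a new request and a previously served one.

def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the common token prefix between two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

cached = [1, 2, 3, 4, 5, 6]          # tokens of a previously served prompt
same_prefix = [1, 2, 3, 4, 9, 9]     # same system prompt, new user turn
edited_early = [7, 2, 3, 4, 5, 6]    # first token changed

print(shared_prefix_len(cached, same_prefix))   # 4 tokens reused
print(shared_prefix_len(cached, edited_early))  # 0: an early edit kills the cache
```

The second case is why editing anything near the top of your prompt (system message, tool definitions) throws away the whole cached prefix.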
===== How Long Does It Stay? =====
More concurrent users = more KV cache competition = your cached prompts get evicted faster. During quiet hours, your cache persists much longer.
| + | |||
===== Why Capacity Is Tight =====

<WRAP center round info>
This section is based on public statements from Synthetic staff and community observations. Numbers may change as Synthetic adds hardware.
</WRAP>
Synthetic runs on B200 and H200 GPUs. These are expensive: a single B200 node costs roughly **$400k**. GLM 5/5.1 requires **2× B200s** per instance (NVFP4 quant). Kimi K2.5 runs on either B200s (NVFP4) or H200s (INT4), and is now load-balanced between them (see [[:
The economics are tight: breakeven on a B200 pair is roughly **400 subscribers at $30/mo** (fewer if subscribers are on higher tiers). Synthetic can't just throw more B200s at capacity problems; the hardware has to be paid for by subscriber revenue. This is why peak hours cause real cache pressure: the user-to-GPU ratio is constrained by hardware costs, not by choice.
| + | |||
<WRAP tip round>
**The NVFP4 factor:** B200s were actually slower than H200s at FP16 in practice; the kernels aren't fully optimized yet. NVFP4 is supposed to be significantly faster than FP8 on Blackwell hardware specifically.
</WRAP>
===== The Peak-Hours Pricing Effect =====

<WRAP tip center round>
Since Synthetic gives an 80% cache discount to subscribers (PAYG gets no discount), and cache hits drop during peak (because eviction is higher), the same workload effectively consumes more quota during peak hours than off-peak.
| </ | </ | ||
  * [[:limits]] — Subscription limits, token budgets, and recharge rates
  * [[:
  * [[:models]] — Model catalog and pricing
  * [[https://
  * [[https://
===== Harness Cache Efficiency =====

<WRAP center round info 60%>
Cache hit rates vary drastically between harnesses. Based on public statements from Synthetic staff, some harnesses are **>5x worse** at hitting cache than others: the same number of tokens sent, but quota usage more than 5x higher due to cache misses.
</WRAP>

This happens because different harnesses structure their API calls differently:
| + | |||
| + | - Harnesses that maintain consistent prefix messages (system prompts, conversation history) hit cache more often | ||
| + | - Harnesses that restructure or reorder messages on each call miss cache more often | ||
| + | - vLLM (used for Kimi) requires alignment to certain boundary sizes, so small requests may not be cached at all | ||
| + | - SGLang (used for GLM) is smarter and hits cache even for small requests | ||
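The first two points can be sketched with a hypothetical cache keyed on chained message hashes: it only hits while the earlier messages are byte-identical. The messages and the cache model are invented for illustration:

```python
import hashlib

# Hypothetical model of prefix caching: one key per message prefix,
# computed as a running hash, so any early edit changes every later key.

def prefix_keys(messages: list[str]) -> list[str]:
    """One cache key per message prefix (a simplification of KV caching)."""
    keys, h = [], hashlib.sha256()
    for m in messages:
        h.update(m.encode())
        keys.append(h.hexdigest())
    return keys

cache: set[str] = set()

def cached_prefix_count(messages: list[str]) -> int:
    """Count how many message prefixes were already cached; then cache all."""
    hits = 0
    for k in prefix_keys(messages):
        if k in cache:
            hits += 1
        cache.add(k)
    return hits

turn1 = ["system: be helpful", "user: hi"]
cached_prefix_count(turn1)                       # warm the cache
stable = turn1 + ["assistant: hello", "user: more"]
print(cached_prefix_count(stable))               # 2: prefix preserved, cache hit
rewritten = ["system: be helpful (v2)", "user: hi"]
print(cached_prefix_count(rewritten))            # 0: edited system prompt, all misses
```

A harness that appends new turns to an unchanged history behaves like `stable`; one that rewrites its system prompt or reorders messages per call behaves like `rewritten`.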
| + | |||
| + | The Anthropic API endpoint also uses prompt caching, but cache hit metrics are not yet exposed to end users. | ||
| + | |||
| + | <WRAP center round tip> | ||
| + | If you're seeing unexpectedly high token consumption, | ||
| + | </ | ||