| + | ===== Harness Cache Efficiency ===== | ||
| + | |||
| + | <WRAP center round info 60%> | ||
| + | Cache hit rates vary drastically between harnesses. Based on public statements from Synthetic staff, some harnesses are **>5x worse** at hitting cache than others — same number of tokens sent, but quota usage >5x higher due to cache misses. | ||
| + | </ | ||
| + | |||
| + | This happens because different harnesses structure their API calls differently: | ||
| + | |||
| + | - Harnesses that maintain consistent prefix messages (system prompts, conversation history) hit cache more often | ||
| + | - Harnesses that restructure or reorder messages on each call miss cache more often | ||
| + | - vLLM (used for Kimi) requires alignment to certain boundary sizes, so small requests may not be cached at all | ||
| + | - SGLang (used for GLM) is smarter and hits cache even for small requests | ||
| + | |||
| + | The Anthropic API endpoint also uses prompt caching, but cache hit metrics are not yet exposed to end users. | ||
| + | |||
| + | <WRAP center round tip> | ||
| + | If you're seeing unexpectedly high token consumption, | ||
| + | </ | ||