prompt_caching [2026/04/19 05:11] gwyntel → prompt_caching [2026/04/19 22:16] (current) – Added harness cache efficiency section (gwyntel)
  * [[https://github.com/sgl-project/sglang|SGLang GitHub]]
  

===== Harness Cache Efficiency =====

<WRAP center round info 60%>
Cache hit rates vary drastically between harnesses. According to public statements from Synthetic staff, some harnesses are **>5x worse** at hitting cache than others: for the same number of tokens sent, quota usage can be more than 5x higher purely because of cache misses.
</WRAP>
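To see how a poor hit rate turns into a >5x quota gap, here is a minimal sketch of the arithmetic. The 10% rate for cached tokens and the 95% vs 5% hit rates are illustrative assumptions, not Synthetic's actual pricing:

```python
# Hypothetical quota model: a cached input token costs 10% of an uncached one.
CACHED_RATE = 0.1  # assumption for illustration, not actual pricing

def billed_tokens(total_tokens: int, hit_rate: float) -> float:
    """Quota consumed when hit_rate of the prompt tokens are served from cache."""
    cached = total_tokens * hit_rate
    uncached = total_tokens * (1 - hit_rate)
    return cached * CACHED_RATE + uncached

tokens = 100_000
good = billed_tokens(tokens, 0.95)  # prefix-stable harness
bad = billed_tokens(tokens, 0.05)   # prefix-churning harness
print(good, bad, bad / good)        # same tokens sent, >5x the quota used
```

Under these assumed numbers the churning harness burns roughly 6.6x more quota for identical traffic, which is the effect described above.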

This happens because different harnesses structure their API calls differently:

  * Harnesses that keep a consistent prefix across calls (stable system prompt, append-only conversation history) hit cache often
  * Harnesses that restructure or reorder messages on each call invalidate the prefix and miss cache often
  * vLLM (used for Kimi) caches at fixed block granularity, so only whole blocks of a shared prefix are reused; small requests that never fill a block may not be cached at all
  * SGLang (used for GLM) matches prefixes token by token via its radix tree, so even small requests can hit cache
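The block-granularity difference can be sketched as follows. The 16-token block size is an illustrative assumption, as is the toy tokenization; the point is only how rounding down to a block boundary shrinks the reusable prefix:

```python
def common_prefix_len(a, b):
    """Length of the shared leading run of token ids between two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def cached_tokens(prev, curr, block_size=1):
    """Tokens of `curr` reusable from cache after serving `prev`.

    Block-aligned caches only reuse whole blocks, so the match length is
    rounded down to a block boundary; token-level (radix-style) caches
    reuse the full shared prefix.
    """
    match = common_prefix_len(prev, curr)
    return (match // block_size) * block_size

prev = list(range(40))               # previous request's token ids
curr = list(range(40)) + [99, 100]   # same prefix plus a new user turn
print(cached_tokens(prev, curr, block_size=16))  # block-aligned: 32 of 40
print(cached_tokens(prev, curr, block_size=1))   # token-level: all 40
```

With a 40-token shared prefix, the block-aligned cache reuses only two full 16-token blocks (32 tokens), while token-level matching reuses all 40.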

The Anthropic API endpoint also uses prompt caching, but cache hit metrics are not yet exposed to end users.
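For reference, Anthropic-style requests mark the cacheable prefix explicitly with a ''cache_control'' breakpoint. The payload shape below follows Anthropic's public Messages API; the model name is hypothetical, and whether a given endpoint honors the field is an assumption:

```python
import json

# Anthropic-style request with an explicit cache breakpoint on the system
# prompt. Everything up to and including the marked block is eligible for
# caching on subsequent calls, provided the prefix is byte-identical.
payload = {
    "model": "claude-sonnet-4",  # hypothetical model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a helpful assistant. <long, stable instructions>",
            "cache_control": {"type": "ephemeral"},  # cache up to here
        }
    ],
    "messages": [{"role": "user", "content": "Summarize the report."}],
}
print(json.dumps(payload, indent=2))
```

A harness that rewrites the system text between calls defeats this even with the breakpoint set, since the cached prefix must match exactly.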

<WRAP center round tip>
If you're seeing unexpectedly high token consumption, your harness's cache hit rate may be the culprit. Try a different harness, or check whether your tool keeps message prefixes consistent between calls.
</WRAP>
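Since cache hit metrics aren't exposed, one rough way to check prefix consistency yourself, assuming you can log the raw prompts your harness sends, is to measure how much of each request is a verbatim prefix of the previous one. The toy prompt strings below are hypothetical:

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading run of characters between two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefix_stability(prompts):
    """Mean fraction of each prompt that repeats the previous prompt verbatim.

    Near 1.0 suggests cache-friendly, append-only prompts; near 0.0
    suggests the harness restructures messages on every call.
    """
    ratios = [common_prefix_len(prev, curr) / len(curr)
              for prev, curr in zip(prompts, prompts[1:])]
    return sum(ratios) / len(ratios)

stable = ["SYS A", "SYS A B", "SYS A B C"]   # append-only harness
churny = ["SYS A", "A SYS B", "B A SYS C"]   # reordering harness
print(prefix_stability(stable), prefix_stability(churny))
```

A stability score near zero on real logs is a strong hint that the harness, not your usage, is driving the extra quota consumption.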