| + | ===== Harness Cache Efficiency ===== | ||
| + | |||
| + | <WRAP center round info 60%> | ||
| + | Cache hit rates vary drastically between harnesses. Based on public statements from Synthetic staff, some harnesses are **>5x worse** at hitting cache than others — same number of tokens sent, but quota usage >5x higher due to cache misses. | ||
| + | </ | ||
| + | |||
| + | This happens because different harnesses structure their API calls differently: | ||
| + | |||
| + | - Harnesses that maintain consistent prefix messages (system prompts, conversation history) hit cache more often | ||
| + | - Harnesses that restructure or reorder messages on each call miss cache more often | ||
| + | - vLLM (used for Kimi) requires alignment to certain boundary sizes, so small requests may not be cached at all | ||
| + | - SGLang (used for GLM) is smarter and hits cache even for small requests | ||
| + | |||
| + | The Anthropic API endpoint also uses prompt caching, but cache hit metrics are not yet exposed to end users. | ||
| + | |||
| + | <WRAP center round tip> | ||
| + | If you're seeing unexpectedly high token consumption, | ||
| + | </ | ||