prompt_caching [2026/04/19 22:16] (current) – Added harness cache efficiency section – gwyntel
SGLang keeps a copy of recently-used prompts in GPU memory (VRAM) so it doesn't have to reprocess them from scratch. This is called the **KV cache** or **radix cache**.
  
When your prompt's prefix matches something already in cache, the model skips re-computing those tokens — saving time and money. Synthetic applies an **80% discount on cache-hit tokens for subscribers**. **PAYG users do not receive cache discounts** — you pay full price for all tokens regardless of cache hits.
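To make the discount concrete, here is a minimal per-prompt cost sketch. The `prompt_cost` helper, the token count, and the $1-per-million price are invented for illustration; only the 80%-off-cache-hits-for-subscribers rule comes from the text above.

```python
def prompt_cost(tokens: int, cache_hit_fraction: float,
                price_per_token: float, subscriber: bool) -> float:
    """Cost of processing a prompt's input tokens under the assumed pricing."""
    hit = tokens * cache_hit_fraction
    miss = tokens - hit
    if subscriber:
        # cache-hit tokens are billed at 20% (80% discount)
        return miss * price_per_token + hit * price_per_token * 0.2
    # PAYG: full price for every token, cache or not
    return tokens * price_per_token

# 10k-token prompt, 90% prefix cache hit, $1 per 1M tokens (made-up rate)
sub_cost = prompt_cost(10_000, 0.9, 1e-6, subscriber=True)    # ~$0.0028
payg_cost = prompt_cost(10_000, 0.9, 1e-6, subscriber=False)  # $0.0100
```

Under these assumed numbers the subscriber pays well under a third of the PAYG price for the same prompt, purely from cache reuse.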
  
===== How Long Does It Stay? =====
  
<WRAP tip center round>
Since Synthetic gives an 80% cache discount to subscribers (PAYG gets no discount), and cache hits drop during peak (because eviction is higher), subscribers' **effective price per token naturally rises during peak hours**. It's not a surcharge — the cache savings simply disappear when everyone's fighting for the same VRAM. PAYG users are unaffected since they pay full price either way.
</WRAP>
  
  * [[https://github.com/sgl-project/sglang|SGLang GitHub]]
  

===== Harness Cache Efficiency =====

<WRAP center round info 60%>
Cache hit rates vary drastically between harnesses. Based on public statements from Synthetic staff, some harnesses are **>5x worse** at hitting cache than others — same number of tokens sent, but quota usage >5x higher due to cache misses.
</WRAP>
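Here is a toy model of how that gap shows up in quota terms. It assumes (this is an assumption, not documented billing logic) that cache-hit tokens draw quota at the discounted 20% rate while misses draw 100%; the hit rates are invented.

```python
def quota_used(tokens: int, hit_rate: float, discount: float = 0.8) -> float:
    """Quota drawn if cache hits cost (1 - discount) of a normal token."""
    hits = tokens * hit_rate
    misses = tokens - hits
    return misses + hits * (1 - discount)

# Same 1M tokens sent by two hypothetical harnesses
prefix_stable = quota_used(1_000_000, 0.95)   # cache-friendly harness
cache_hostile = quota_used(1_000_000, 0.10)   # restructures every call

ratio = cache_hostile / prefix_stable  # ~3.8x under these assumed rates
```

Note that under this simple sketch an 80% discount caps the multiple at 5x (a perfect-hit harness draws 20% quota, a zero-hit harness 100%), so a reported >5x gap implies the real accounting differs from this toy model in some way.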

This happens because different harnesses structure their API calls differently:

  - Harnesses that keep a consistent prefix across calls (system prompt, then unchanged conversation history) hit cache more often
  - Harnesses that restructure or reorder messages on each call miss cache more often
  - vLLM (used for Kimi) caches in fixed-size blocks, so reuse must align to block boundaries and small requests may not be cached at all
  - SGLang (used for GLM) matches prefixes at token granularity via its radix cache, so even small requests can hit cache

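The first two bullets can be sketched with a toy prefix matcher. A radix-style cache reuses only the longest common prefix between the cached request and the new one, so a harness that rewrites its system prompt every call loses all reuse. Messages stand in for tokens here; this is an illustration, not SGLang's actual code.

```python
def common_prefix_len(cached: list[str], new: list[str]) -> int:
    """Count leading messages shared verbatim (messages stand in for tokens)."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

# Previous call, already resident in the cache
history = ["system: be concise", "user: q1", "assistant: a1"]

# Harness A appends the new turn after an unchanged prefix
stable = history + ["user: q2"]

# Harness B rewrites the system prompt on every call (e.g. embeds a counter)
rewritten = ["system: be concise [call 2]"] + history[1:] + ["user: q2"]

hit_stable = common_prefix_len(history, stable)      # all 3 cached messages reused
hit_rewritten = common_prefix_len(history, rewritten)  # prefix breaks at message 0
```

The same total tokens are sent either way; only the prefix stability differs, and that alone decides whether the cache helps at all.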
The Anthropic API endpoint also uses prompt caching, but cache hit metrics are not yet exposed to end users.

<WRAP center round tip>
If you're seeing unexpectedly high token consumption, your harness's cache hit rate may be the culprit. Try a different harness or check if your tool supports prefix-consistent message ordering.
</WRAP>
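The block-alignment point from the list above can also be sketched numerically. Assuming (the block size is invented; real deployments vary) a cache that rounds reuse down to whole 16-token blocks, versus one that matches at token granularity:

```python
BLOCK = 16  # assumed block size in tokens, for illustration only

def block_aligned_reuse(shared_prefix: int, block: int = BLOCK) -> int:
    """Tokens reused when the cache can only reuse whole blocks."""
    return (shared_prefix // block) * block

def token_granular_reuse(shared_prefix: int) -> int:
    """Tokens reused when any matching prefix length counts."""
    return shared_prefix

small, large = 12, 100  # shared-prefix lengths of two hypothetical requests

block_aligned_reuse(small)    # 0  -> the short request misses entirely
token_granular_reuse(small)   # 12 -> token-granular caching still helps
block_aligned_reuse(large)    # 96 -> only the last partial block is lost
```

This is why, in the behavior described above, small requests can fall through a block-aligned cache completely while still benefiting from a token-granular one.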