prompt_caching [2026/04/19 22:16] (current) – Added harness cache efficiency section – gwyntel
SGLang keeps a copy of recently-used prompts in GPU memory (VRAM) so it doesn't have to reprocess them from scratch. This is called the **KV cache** or **radix cache**.
  
When your prompt's prefix matches something already in cache, the model skips re-computing those tokens — saving time and money. Synthetic applies an **80% discount on cache-hit tokens for subscribers**. **PAYG users do not receive cache discounts** — you pay full price for all tokens regardless of cache hits.
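To make the discount concrete, here is a minimal per-prompt cost sketch. The `prompt_cost` helper, the token count, and the $1-per-million price are invented for illustration; only the 80%-off-cache-hits-for-subscribers rule comes from the text above.

```python
def prompt_cost(tokens: int, cache_hit_fraction: float,
                price_per_token: float, subscriber: bool) -> float:
    """Cost of processing a prompt's input tokens under the assumed pricing."""
    hit = tokens * cache_hit_fraction
    miss = tokens - hit
    if subscriber:
        # cache-hit tokens are billed at 20% (80% discount)
        return miss * price_per_token + hit * price_per_token * 0.2
    # PAYG: full price for every token, cache or not
    return tokens * price_per_token

# 10k-token prompt, 90% prefix cache hit, $1 per 1M tokens (made-up rate)
sub_cost = prompt_cost(10_000, 0.9, 1e-6, subscriber=True)    # ~$0.0028
payg_cost = prompt_cost(10_000, 0.9, 1e-6, subscriber=False)  # $0.0100
```

Under these assumed numbers the subscriber pays well under a third of the PAYG price for the same prompt, purely from cache reuse.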
  
===== How Long Does It Stay? =====
  
<WRAP tip center round>
Since Synthetic gives an 80% cache discount to subscribers (PAYG gets no discount), and cache hits drop during peak (because eviction is higher), subscribers' **effective price per token naturally rises during peak hours**. It's not a surcharge — the cache savings simply disappear when everyone's fighting for the same VRAM. PAYG users are unaffected since they pay full price either way.
</WRAP>
  
  * [[https://github.com/sgl-project/sglang|SGLang GitHub]]
  

===== Harness Cache Efficiency =====

<WRAP center round info 60%>
Cache hit rates vary drastically between harnesses. Based on public statements from Synthetic staff, some harnesses are **>5x worse** at hitting cache than others — same number of tokens sent, but quota usage >5x higher due to cache misses.
</WRAP>
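Here is a toy model of how that gap shows up in quota terms. It assumes (this is an assumption, not documented billing logic) that cache-hit tokens draw quota at the discounted 20% rate while misses draw 100%; the hit rates are invented.

```python
def quota_used(tokens: int, hit_rate: float, discount: float = 0.8) -> float:
    """Quota drawn if cache hits cost (1 - discount) of a normal token."""
    hits = tokens * hit_rate
    misses = tokens - hits
    return misses + hits * (1 - discount)

# Same 1M tokens sent by two hypothetical harnesses
prefix_stable = quota_used(1_000_000, 0.95)   # cache-friendly harness
cache_hostile = quota_used(1_000_000, 0.10)   # restructures every call

ratio = cache_hostile / prefix_stable  # ~3.8x under these assumed rates
```

Note that under this simple sketch an 80% discount caps the multiple at 5x (a perfect-hit harness draws 20% quota, a zero-hit harness 100%), so a reported >5x gap implies the real accounting differs from this toy model in some way.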

This happens because different harnesses structure their API calls differently:

  - Harnesses that keep a consistent prefix across calls (system prompt, then unchanged conversation history) hit cache more often
  - Harnesses that restructure or reorder messages on each call miss cache more often
  - vLLM (used for Kimi) caches in fixed-size blocks, so reuse must align to block boundaries and small requests may not be cached at all
  - SGLang (used for GLM) matches prefixes at token granularity via its radix cache, so even small requests can hit cache

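The first two bullets can be sketched with a toy prefix matcher. A radix-style cache reuses only the longest common prefix between the cached request and the new one, so a harness that rewrites its system prompt every call loses all reuse. Messages stand in for tokens here; this is an illustration, not SGLang's actual code.

```python
def common_prefix_len(cached: list[str], new: list[str]) -> int:
    """Count leading messages shared verbatim (messages stand in for tokens)."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

# Previous call, already resident in the cache
history = ["system: be concise", "user: q1", "assistant: a1"]

# Harness A appends the new turn after an unchanged prefix
stable = history + ["user: q2"]

# Harness B rewrites the system prompt on every call (e.g. embeds a counter)
rewritten = ["system: be concise [call 2]"] + history[1:] + ["user: q2"]

hit_stable = common_prefix_len(history, stable)      # all 3 cached messages reused
hit_rewritten = common_prefix_len(history, rewritten)  # prefix breaks at message 0
```

The same total tokens are sent either way; only the prefix stability differs, and that alone decides whether the cache helps at all.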
The Anthropic API endpoint also uses prompt caching, but cache hit metrics are not yet exposed to end users.

<WRAP center round tip>
If you're seeing unexpectedly high token consumption, your harness's cache hit rate may be the culprit. Try a different harness or check if your tool supports prefix-consistent message ordering.
</WRAP>
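The block-alignment point from the list above can also be sketched numerically. Assuming (the block size is invented; real deployments vary) a cache that rounds reuse down to whole 16-token blocks, versus one that matches at token granularity:

```python
BLOCK = 16  # assumed block size in tokens, for illustration only

def block_aligned_reuse(shared_prefix: int, block: int = BLOCK) -> int:
    """Tokens reused when the cache can only reuse whole blocks."""
    return (shared_prefix // block) * block

def token_granular_reuse(shared_prefix: int) -> int:
    """Tokens reused when any matching prefix length counts."""
    return shared_prefix

small, large = 12, 100  # shared-prefix lengths of two hypothetical requests

block_aligned_reuse(small)    # 0  -> the short request misses entirely
token_granular_reuse(small)   # 12 -> token-granular caching still helps
block_aligned_reuse(large)    # 96 -> only the last partial block is lost
```

This is why, in the behavior described above, small requests can fall through a block-aligned cache completely while still benefiting from a token-granular one.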