====== Infrastructure ======

This section is based on public statements from Synthetic staff and community observations. Details may change as Synthetic adds hardware or changes providers.

===== Hardware =====

Synthetic runs on a mix of **B200** (Blackwell) and **H200** (Hopper) GPUs from multiple providers. As of April 2026, Synthetic has onboarded a fourth GPU provider after a series of outages.

- **B200s**: Run NVFP4 quantized models. More efficient per dollar, but Blackwell kernel optimization is still maturing. NVFP4 is supposed to be significantly faster than FP8 on Blackwell, but real-world performance is still catching up.
- **H200s**: Run INT4 quantized models (original lab releases). More mature software stack, but harder to optimize and less cost-efficient. In practice, H200s at FP16 are actually faster than B200s at FP16.

A single B200 node costs roughly **$400k**. GLM-5/5.1 requires **2×B200s per replica** (4 GPUs at tp4). Breakeven on a B200 pair is roughly **400 subscribers at $30/mo**.

===== Inference Engines =====

Synthetic uses both **SGLang** and **vLLM**, depending on the model:

- **GLM series** → SGLang (historically faster for GLM)
- **Kimi K2.5** → vLLM (slightly better throughput under load than SGLang)
- **Kimi K2-Thinking** → SGLang (already had custom patches from the early capacity crunch)

Both engines have bugs that Synthetic regularly patches. The main differentiators for Synthetic are:

- Custom **tool calling parser** patches for both vLLM and SGLang
- Custom **reasoning parsing** patches
- **FlashInfer** bug fixes (e.g., SGLang's default FlashInfer version had bugs with NVFP4)

Synthetic's approach is to run standard NVFP4/FP8 quants on standard inference stacks with targeted patches, rather than the fancier cost-cutting tricks other inference providers rely on.

"Other companies try too-fancy stuff to cut costs." -- matt

=== SGLang vs vLLM Cache Behavior ===

- SGLang is smarter about caching: it hits the cache even for small requests
- vLLM tries to align caches to certain boundary sizes, so small requests may not get cached
- This means GLM-5/5.1 (SGLang) may have better cache hit rates than Kimi K2.5 (vLLM) for short conversations

===== Speculative Decoding =====

Synthetic uses **Eagle3** speculative decoding for Kimi K2.5:

- **H200s**: Speculative decoding deployed and working well, averaging well over **50 TPS**
- **B200s/Blackwell**: Speculative decoding still being worked on. Has issues: the initial deployment hit infinite "!!!!!!" bugs in reasoning output

===== Hosting Model =====

Synthetic uses a mix of self-hosted and proxied models:

- **Self-hosted models**: Run on Synthetic's own reserved GPUs. Synthetic can patch inference engines, fix tool calling bugs, and control reliability. Currently: GLM-5/5.1, Kimi K2.5, Qwen 3.5, MiniMax M2.7, Nemotron 3 Super, GLM-4.7-Flash.
- **Proxied models**: Forwarded to third-party providers (Fireworks, TogetherAI, DeepInfra). Synthetic cannot fix reliability issues. Currently: DeepSeek V3.2, and temporarily GLM-4.7 during GPU outages.

**General rule**: Synthetic self-hosts newer/frontier models and proxies older ones. When a newer model replaces an older one, the older model is typically moved to a proxy. How long a proxy stays around depends on load; people usually switch to the newer model quickly, so load on proxied models is low and they can stick around for a while.

Proxied models pass through the price Synthetic pays the underlying inference provider, which may differ from self-hosted pricing.
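
Conceptually, the self-hosted/proxied split is just a routing decision at the API gateway. The sketch below illustrates that idea only; it is not Synthetic's actual code, and the model identifiers, endpoint URLs, and function names are hypothetical placeholders.

<code python>
# Hypothetical sketch of self-hosted vs. proxied routing.
# Model names come from this page; endpoints and provider URLs are
# placeholders, not Synthetic's real configuration.

SELF_HOSTED = {
    "glm-5", "glm-5.1", "kimi-k2.5", "qwen-3.5",
    "minimax-m2.7", "nemotron-3-super", "glm-4.7-flash",
}

# Proxied models forward to a third-party provider; the price that
# provider charges is passed through to the user.
PROXIED = {
    "deepseek-v3.2": "https://api.example-provider.com/v1",  # placeholder URL
    # "glm-4.7" has also been proxied temporarily during GPU outages.
}

INTERNAL_ENDPOINT = "http://inference.internal:8000/v1"  # placeholder


def route(model: str) -> str:
    """Return the base URL a request for `model` should be forwarded to."""
    if model in SELF_HOSTED:
        # Self-hosted models run on Synthetic's reserved GPUs, where the
        # inference engine (SGLang or vLLM) carries custom patches.
        return INTERNAL_ENDPOINT
    if model in PROXIED:
        # Proxied models pass straight through; reliability and pricing
        # are whatever the upstream provider offers.
        return PROXIED[model]
    raise ValueError(f"Unknown model: {model}")
</code>

The lookup itself is trivial; the operational consequence is the point: only the self-hosted path can be patched when tool calling or reasoning parsing breaks.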

=== GPU Providers ===

As of April 2026, Synthetic uses 4 GPU providers. Previously, all models ran on reserved GPUs from TogetherAI. The provider landscape has been unstable: in mid-April 2026, all 3 original providers experienced simultaneous outages.

===== Sharding and Concurrency =====

Synthetic has experimented with sharding models across multiple nodes:

- **Standard**: Shard model weights across the 8 GPUs of a single node (and replicate across nodes)
- **Experimental**: Shard a single model across 16 GPUs (2 nodes). This increases the KV cache available per node, allowing higher request concurrency for MoE models

This is particularly relevant for fine-grained MoE models where KV cache is the bottleneck.

===== Synbad =====

Synbad is Synthetic's internal testing tool that runs payloads against inference engines to detect bugs. It serves as a repository of test payloads to validate new SGLang/vLLM releases against, since many bugs are payload-specific. Users can also run Synbad Proxy to intercept and capture failing payloads for debugging.
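
Synbad itself is internal and its interface is not publicly documented. The snippet below is only a minimal sketch of the underlying idea, replaying a directory of saved request payloads against an OpenAI-compatible endpoint and flagging suspicious responses; the directory layout, endpoint URL, and failure heuristics are all assumptions.

<code python>
# Minimal payload-replay sketch in the spirit of Synbad (hypothetical).
# Assumes each saved payload is an OpenAI-style chat-completions request
# body stored as a JSON file; the endpoint URL is a placeholder.
import json
import pathlib

import requests

ENDPOINT = "http://localhost:30000/v1/chat/completions"  # placeholder
PAYLOAD_DIR = pathlib.Path("payloads")  # hypothetical directory of saved requests


def replay_all() -> None:
    for path in sorted(PAYLOAD_DIR.glob("*.json")):
        payload = json.loads(path.read_text())
        resp = requests.post(ENDPOINT, json=payload, timeout=120)
        if resp.status_code != 200:
            print(f"{path.name}: HTTP {resp.status_code}")
            continue
        message = resp.json()["choices"][0]["message"]
        text = message.get("content") or ""
        # Very rough failure heuristics: empty output with no tool calls,
        # or a degenerate repeated-token response like the "!!!!!!" bug
        # mentioned in the Speculative Decoding section above.
        if not text and not message.get("tool_calls"):
            print(f"{path.name}: empty response")
        elif text and set(text.strip()) == {"!"}:
            print(f"{path.name}: degenerate repeated-token output")


if __name__ == "__main__":
    replay_all()
</code>

A real harness would compare against expected tool-call structure per payload; the value of the approach is simply that known-bad payloads can be re-run automatically whenever a new SGLang or vLLM release lands.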