====== Infrastructure ======
This section is based on public statements from Synthetic staff and community observations. Details may change as Synthetic adds hardware or changes providers.
===== Hardware =====
Synthetic runs on a mix of **B200** (Blackwell) and **H200** (Hopper) GPUs from multiple providers. As of April 2026, they have onboarded a 4th GPU provider after a series of outages.
- **B200s** — Run NVFP4 quantized models. More efficient per dollar, but Blackwell kernel optimization is still maturing. NVFP4 is supposed to be significantly faster than FP8 on Blackwell, but real-world performance is still catching up.
- **H200s** — Run INT4 quantized models (original lab releases). More mature software stack, but harder to optimize and less cost-efficient. H200s at FP16 are actually faster than B200s at FP16 in practice.
A single B200 node costs roughly **$400k**. GLM-5/5.1 requires **2×B200s per replica** (4 GPUs at tp4). Breakeven on a B200 pair is roughly **400 subscribers at $30/mo**.
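The breakeven figure can be sanity-checked with simple arithmetic. The per-GPU-hour rate below is an assumed illustrative number, not a figure from Synthetic; it is chosen only to show how the ~400-subscriber estimate falls out of a monthly cost divided by the subscription price.

```python
import math

# Assumed inputs (illustrative only, not Synthetic's actual costs):
gpus_per_replica = 4        # tp4 as described above
rate_per_gpu_hour = 4.10    # assumed reserved B200 rate, $/GPU-hour
hours_per_month = 730
price_per_sub = 30          # $/month subscription

monthly_cost = gpus_per_replica * rate_per_gpu_hour * hours_per_month
breakeven_subs = math.ceil(monthly_cost / price_per_sub)

print(f"monthly cost: ${monthly_cost:,.0f}")
print(f"breakeven:    {breakeven_subs} subscribers")
```

Under these assumptions the replica costs roughly $12k/month, which at $30/mo works out to about 400 subscribers.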
===== Inference Engines =====
Synthetic uses both **SGLang** and **vLLM** depending on the model:
- **GLM series** → SGLang (historically faster for GLM)
- **Kimi K2.5** → vLLM (slightly better throughput under load vs SGLang)
- **Kimi K2-Thinking** → SGLang (had custom patches already from early capacity crunch)
Both engines have bugs that Synthetic regularly patches. The main differentiators for Synthetic are:
- Custom **tool calling parser** patches for both vLLM and SGLang
- Custom **reasoning parsing** patches
- **FlashInfer** bug fixes (e.g., SGLang's default FlashInfer version had bugs with NVFP4)
Synthetic's approach is to run standard NVFP4/FP8 quants on standard inference stacks with targeted patches, rather than the aggressive cost-cutting tricks some other inference providers rely on. As matt put it: "Other companies try too-fancy stuff to cut costs."
=== SGLang vs vLLM Cache Behavior ===
- SGLang's cache is more flexible: it reuses cached prefixes even for small requests
- vLLM aligns its cache to fixed block boundaries, so short requests may not produce or hit cached blocks
- As a result, GLM-5/5.1 (SGLang) may see better cache hit rates than Kimi K2.5 (vLLM) on short conversations
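The difference between the two caching strategies can be shown with a toy model. The block size and token counts below are illustrative assumptions; the point is that block-aligned caching rounds the reusable prefix down to a whole block, while exact-prefix caching reuses any shared prefix.

```python
def cached_prefix_block_aligned(shared_tokens: int, block_size: int = 16) -> int:
    """Block-aligned caching (vLLM-style): only whole blocks are reusable."""
    return (shared_tokens // block_size) * block_size

def cached_prefix_exact(shared_tokens: int) -> int:
    """Exact-prefix caching (SGLang radix-tree style): any shared prefix is reusable."""
    return shared_tokens

# A short request sharing a 13-token prefix with an earlier request:
print(cached_prefix_block_aligned(13))  # 0   -> no cache hit at all
print(cached_prefix_exact(13))          # 13  -> full prefix reused

# A longer, 100-token shared prefix:
print(cached_prefix_block_aligned(100)) # 96  -> last partial block recomputed
print(cached_prefix_exact(100))         # 100
```

For long conversations the rounding loss is negligible, which is why the gap mostly shows up on short requests.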
===== Speculative Decoding =====
Synthetic uses **Eagle3** speculative decoding for Kimi K2.5:
- **H200s**: Speculative decoding deployed and working well, averaging well over **50 TPS**
- **B200s/Blackwell**: Speculative decoding still being worked on; the initial deployment hit bugs, including infinite "!!!!!!" runs in reasoning output
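Speculative decoding in general works by having a cheap draft step propose several tokens that the target model then verifies in a single forward pass. The sketch below shows only the generic greedy verification step, not Eagle3 itself (Eagle3 drafts from the target model's hidden states); it is a toy illustration of why accepted drafts multiply tokens per target-model pass.

```python
def verify_draft(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Greedy verification: accept draft tokens while they match what the
    target model would have produced; on the first mismatch, emit the
    target's token instead and discard the rest of the draft."""
    accepted = []
    for drafted, target in zip(draft_tokens, target_tokens):
        if drafted == target:
            accepted.append(drafted)
        else:
            accepted.append(target)  # correction token from the target model
            break
    return accepted

# Draft proposes 5 tokens; the target model agrees with the first 3,
# so one verification pass yields 4 tokens instead of 1.
print(verify_draft([5, 9, 2, 7, 1], [5, 9, 2, 4, 8]))  # [5, 9, 2, 4]
```

The output is identical to running the target model alone; the speedup comes purely from emitting several tokens per target-model pass when the draft is accurate.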
===== Hosting Model =====
Synthetic uses a mix of self-hosted and proxied models:
- **Self-hosted models**: Run on Synthetic's own reserved GPUs. Synthetic can patch inference engines, fix tool calling bugs, and control reliability. Currently: GLM-5/5.1, Kimi K2.5, Qwen 3.5, MiniMax M2.7, Nemotron 3 Super, GLM-4.7-Flash.
- **Proxied models**: Forwarded to third-party providers (Fireworks, TogetherAI, DeepInfra). Synthetic cannot fix reliability issues. Currently: DeepSeek V3.2, and temporarily GLM-4.7 during GPU outages.
**General rule**: Synthetic self-hosts newer/frontier models and proxies older ones. When a newer model replaces an older one, the older model is typically moved to a proxy. How long a proxy stays up depends on load; most users switch quickly, so proxied models see little traffic and can stick around for a while.
Proxied models forward the price Synthetic pays the underlying inference provider, which may differ from self-hosted pricing.
=== GPU Providers ===
As of April 2026, Synthetic uses four GPU providers. Previously, all models ran on reserved GPUs from TogetherAI. The provider landscape has been unstable: in mid-April 2026, all three original providers experienced simultaneous outages.
===== Sharding and Concurrency =====
Synthetic has experimented with sharding models across multiple nodes:
- **Standard**: Shard model weights across 8 GPUs of a single node (replicate across nodes)
- **Experimental**: Shard a single model across 16 GPUs (2 nodes). Increases KV cache available per node, allowing higher request concurrency for MoE models
This is particularly relevant for fine-grained MoE models where KV cache is the bottleneck.
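The KV-cache benefit of the 16-GPU layout can be seen with rough arithmetic. All sizes below (weight footprint, HBM per GPU, per-GPU overhead) are placeholder assumptions for illustration; the mechanism is that weights are paid for once per replica, so a wider replica frees disproportionately more HBM for KV cache.

```python
def free_kv_gib(gpus: int, hbm_per_gpu_gib: float, weights_gib: float,
                overhead_per_gpu_gib: float = 10.0) -> float:
    """HBM left for KV cache in one replica after sharding the weights
    across `gpus` GPUs. Weights are split once per replica; runtime
    overhead is paid per GPU."""
    total_hbm = gpus * hbm_per_gpu_gib
    return total_hbm - weights_gib - gpus * overhead_per_gpu_gib

# Placeholder numbers: 500 GiB of quantized MoE weights, 180 GiB HBM per GPU.
one_node  = free_kv_gib(gpus=8,  hbm_per_gpu_gib=180, weights_gib=500)
two_nodes = free_kv_gib(gpus=16, hbm_per_gpu_gib=180, weights_gib=500)

print(one_node, two_nodes)  # 860.0 vs 2220.0 GiB
```

Under these assumptions, doubling the GPUs per replica roughly 2.6×'s the KV cache available, since the weight footprint is not duplicated; that headroom is what allows higher request concurrency.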
===== Synbad =====
Synbad is Synthetic's internal testing tool: it runs a repository of test payloads against inference engines to detect bugs, and is used to validate new SGLang/vLLM releases, since many bugs are payload-specific. Users can also run Synbad Proxy to intercept and capture failing payloads for debugging.
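The core idea of a payload-regression repository can be sketched as a diff between pass/fail results from two engine versions. Everything below (function name, payload IDs, result format) is hypothetical illustration, not Synbad's actual code.

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Given pass/fail results keyed by payload ID for two engine versions,
    return the payloads that regressed (passed before, fail now)."""
    return sorted(pid for pid, passed in previous.items()
                  if passed and not current.get(pid, False))

# Hypothetical results before and after upgrading the inference engine:
prev = {"tool_call_nested": True, "reasoning_long": True, "fp4_edge": False}
curr = {"tool_call_nested": True, "reasoning_long": False, "fp4_edge": False}
print(find_regressions(prev, curr))  # ['reasoning_long']
```

Payload-keyed results make regressions visible immediately on a new SGLang/vLLM release, rather than waiting for users to hit them in production.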