====== Infrastructure ======

This section is based on public statements from Synthetic staff and community observations. Details may change as Synthetic adds hardware or changes providers.

===== Hardware =====

Synthetic runs on a mix of **B200** (Blackwell) and **H200** (Hopper) GPUs from multiple providers. As of April 2026, Synthetic has onboarded a fourth GPU provider after a series of outages.

- **B200s**: Run NVFP4 quantized models. More efficient per dollar, but Blackwell kernel optimization is still maturing. NVFP4 is supposed to be significantly faster than FP8 on Blackwell, but real-world performance is still catching up.
- **H200s**: Run INT4 quantized models (original lab releases). More mature software stack, but harder to optimize and less cost-efficient. In practice, H200s at FP16 are actually faster than B200s at FP16.

A single B200 node costs roughly **$400k**. GLM-5/5.1 requires **2×B200s per replica** (4 GPUs at tp4). Breakeven on a B200 pair is roughly **400 subscribers at $30/mo**.

===== Inference Engines =====

Synthetic uses both **SGLang** and **vLLM**, depending on the model:

- **GLM series** → SGLang (historically faster for GLM)
- **Kimi K2.5** → vLLM (slightly better throughput under load than SGLang)
- **Kimi K2-Thinking** → SGLang (already had custom patches from the early capacity crunch)

Both engines have bugs that Synthetic regularly patches. The main differentiators for Synthetic are:

- Custom **tool calling parser** patches for both vLLM and SGLang
- Custom **reasoning parsing** patches
- **FlashInfer** bug fixes (e.g., SGLang's default FlashInfer version had bugs with NVFP4)

Synthetic's approach is to run standard NVFP4/FP8 quants on standard inference stacks with targeted patches, rather than the fancier cost-cutting tricks other inference providers rely on.

"Other companies try too-fancy stuff to cut costs." -- matt

=== SGLang vs vLLM Cache Behavior ===

- SGLang is smarter about caching: it hits the cache even for small requests
- vLLM tries to align caches to certain boundary sizes, so small requests may not get cached
- This means GLM-5/5.1 (SGLang) may have better cache hit rates than Kimi K2.5 (vLLM) for short conversations

===== Speculative Decoding =====

Synthetic uses **Eagle3** speculative decoding for Kimi K2.5:

- **H200s**: Speculative decoding deployed and working well, averaging well over **50 TPS**
- **B200s/Blackwell**: Speculative decoding still being worked on. Has issues: the initial deployment hit infinite "!!!!!!" bugs in reasoning output

===== Hosting Model =====

Synthetic uses a mix of self-hosted and proxied models:

- **Self-hosted models**: Run on Synthetic's own reserved GPUs. Synthetic can patch inference engines, fix tool calling bugs, and control reliability. Currently: GLM-5/5.1, Kimi K2.5, Qwen 3.5, MiniMax M2.7, Nemotron 3 Super, GLM-4.7-Flash.
- **Proxied models**: Forwarded to third-party providers (Fireworks, TogetherAI, DeepInfra). Synthetic cannot fix reliability issues. Currently: DeepSeek V3.2, and temporarily GLM-4.7 during GPU outages.

**General rule**: Synthetic self-hosts newer/frontier models and proxies older ones. When a newer model replaces an older one, the older model is typically moved to a proxy. How long a proxy stays around depends on load; people usually switch to the newer model quickly, so load on proxied models is low and they can stick around for a while.

Proxied models pass through the price Synthetic pays the underlying inference provider, which may differ from self-hosted pricing.
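
Conceptually, the self-hosted/proxied split is just a routing decision at the API gateway. The sketch below illustrates that idea only; it is not Synthetic's actual code, and the model identifiers, endpoint URLs, and function names are hypothetical placeholders.

<code python>
# Hypothetical sketch of self-hosted vs. proxied routing.
# Model names come from this page; endpoints and provider URLs are
# placeholders, not Synthetic's real configuration.

SELF_HOSTED = {
    "glm-5", "glm-5.1", "kimi-k2.5", "qwen-3.5",
    "minimax-m2.7", "nemotron-3-super", "glm-4.7-flash",
}

# Proxied models forward to a third-party provider; the price that
# provider charges is passed through to the user.
PROXIED = {
    "deepseek-v3.2": "https://api.example-provider.com/v1",  # placeholder URL
    # "glm-4.7" has also been proxied temporarily during GPU outages.
}

INTERNAL_ENDPOINT = "http://inference.internal:8000/v1"  # placeholder


def route(model: str) -> str:
    """Return the base URL a request for `model` should be forwarded to."""
    if model in SELF_HOSTED:
        # Self-hosted models run on Synthetic's reserved GPUs, where the
        # inference engine (SGLang or vLLM) carries custom patches.
        return INTERNAL_ENDPOINT
    if model in PROXIED:
        # Proxied models pass straight through; reliability and pricing
        # are whatever the upstream provider offers.
        return PROXIED[model]
    raise ValueError(f"Unknown model: {model}")
</code>

The lookup itself is trivial; the operational consequence is the point: only the self-hosted path can be patched when tool calling or reasoning parsing breaks.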

=== GPU Providers ===

As of April 2026, Synthetic uses 4 GPU providers. Previously, all models ran on reserved GPUs from TogetherAI. The provider landscape has been unstable: in mid-April 2026, all 3 original providers experienced simultaneous outages.

===== Sharding and Concurrency =====

Synthetic has experimented with sharding models across multiple nodes:

- **Standard**: Shard model weights across the 8 GPUs of a single node (and replicate across nodes)
- **Experimental**: Shard a single model across 16 GPUs (2 nodes). This increases the KV cache available per node, allowing higher request concurrency for MoE models

This is particularly relevant for fine-grained MoE models where KV cache is the bottleneck.

===== Synbad =====

Synbad is Synthetic's internal testing tool that runs payloads against inference engines to detect bugs. It serves as a repository of test payloads to validate new SGLang/vLLM releases against, since many bugs are payload-specific. Users can also run Synbad Proxy to intercept and capture failing payloads for debugging.
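
Synbad itself is internal and its interface is not publicly documented. The snippet below is only a minimal sketch of the underlying idea, replaying a directory of saved request payloads against an OpenAI-compatible endpoint and flagging suspicious responses; the directory layout, endpoint URL, and failure heuristics are all assumptions.

<code python>
# Minimal payload-replay sketch in the spirit of Synbad (hypothetical).
# Assumes each saved payload is an OpenAI-style chat-completions request
# body stored as a JSON file; the endpoint URL is a placeholder.
import json
import pathlib

import requests

ENDPOINT = "http://localhost:30000/v1/chat/completions"  # placeholder
PAYLOAD_DIR = pathlib.Path("payloads")  # hypothetical directory of saved requests


def replay_all() -> None:
    for path in sorted(PAYLOAD_DIR.glob("*.json")):
        payload = json.loads(path.read_text())
        resp = requests.post(ENDPOINT, json=payload, timeout=120)
        if resp.status_code != 200:
            print(f"{path.name}: HTTP {resp.status_code}")
            continue
        message = resp.json()["choices"][0]["message"]
        text = message.get("content") or ""
        # Very rough failure heuristics: empty output with no tool calls,
        # or a degenerate repeated-token response like the "!!!!!!" bug
        # mentioned in the Speculative Decoding section above.
        if not text and not message.get("tool_calls"):
            print(f"{path.name}: empty response")
        elif text and set(text.strip()) == {"!"}:
            print(f"{path.name}: degenerate repeated-token output")


if __name__ == "__main__":
    replay_all()
</code>

A real harness would compare against expected tool-call structure per payload; the value of the approach is simply that known-bad payloads can be re-run automatically whenever a new SGLang or vLLM release lands.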