Infrastructure

This section is based on public statements from Synthetic staff and community observations. Details may change as Synthetic adds hardware or changes providers.

Hardware

Synthetic runs on a mix of B200 (Blackwell) and H200 (Hopper) GPUs from multiple providers. As of April 2026, Synthetic has onboarded a fourth GPU provider after a series of outages.

  1. B200s: Run NVFP4-quantized models. Better performance per dollar, but Blackwell kernel optimization is still maturing: NVFP4 is supposed to be significantly faster than FP8 on Blackwell, but real-world performance is still catching up.
  2. H200s: Run INT4-quantized models (the labs' original releases). More mature software stack, but harder to optimize and less cost-efficient. In practice, H200s at FP16 are actually faster than B200s at FP16.

A single B200 node costs roughly $400k. GLM-5/5.1 requires 4 B200 GPUs per replica (tp4). Breakeven on a pair of B200s is roughly 400 subscribers at $30/mo.
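As a back-of-envelope check on those figures (a sketch under my own assumptions: 8 GPUs per node and hardware cost only; Synthetic has not published its amortization window):

```python
# Back-of-envelope on the numbers above. Assumptions (mine, not
# Synthetic's): 8 GPUs per node, hardware cost only, no power/colo.
NODE_COST_USD = 400_000
GPUS_PER_NODE = 8
SUB_PRICE_USD = 30
SUBSCRIBERS = 400

cost_per_gpu = NODE_COST_USD / GPUS_PER_NODE   # ~$50k per B200
pair_cost = 2 * cost_per_gpu                   # ~$100k for a B200 pair
monthly_revenue = SUBSCRIBERS * SUB_PRICE_USD  # $12k/mo

print(f"months for 400 subs to cover a pair: {pair_cost / monthly_revenue:.1f}")
# -> ~8.3 months; a full tp4 replica (4 GPUs) takes about twice as long
```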

Inference Engines

Synthetic uses both SGLang and vLLM, depending on the model (a toy sketch of this routing follows the list):

  1. GLM series → SGLang (historically faster for GLM)
  2. Kimi K2.5 → vLLM (slightly better throughput under load than SGLang)
  3. Kimi K2-Thinking → SGLang (Synthetic already had custom SGLang patches for it from an early capacity crunch)
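The routing above amounts to a static model-to-engine map; a minimal sketch (model IDs and the fallback are made up for illustration, not Synthetic's actual config):

```python
# Hypothetical model -> engine routing table; IDs are illustrative.
ENGINE_FOR_MODEL = {
    "glm-5": "sglang",
    "glm-5.1": "sglang",
    "kimi-k2.5": "vllm",
    "kimi-k2-thinking": "sglang",
}

def pick_engine(model_id: str) -> str:
    """Choose the inference engine for a request; the default is assumed."""
    return ENGINE_FOR_MODEL.get(model_id, "sglang")
```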

Both engines have bugs that Synthetic regularly patches. The main differentiators for Synthetic are:

  1. Custom tool-calling parser patches for both vLLM and SGLang (illustrated after this list)
  2. Custom reasoning parsing patches
  3. FlashInfer bug fixes (e.g., SGLang’s default FlashInfer version had bugs with NVFP4)
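To make the tool-calling parser point concrete: engines must recover structured tool calls from raw model text, and small format drift between models breaks naive parsers. A hypothetical, heavily simplified extractor (the tag format and all names here are illustrative, not Synthetic's actual patches):

```python
import json
import re

# Hypothetical: many open models emit tool calls as tagged JSON spans.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Pull JSON tool-call objects out of raw model output.

    Real parser patches deal with the messy cases: trailing commas,
    unescaped newlines inside arguments, truncated spans, etc.
    """
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # a production patch would attempt repair, not skip
    return calls
```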

Synthetic’s approach is to run standard NVFP4/FP8 quants on standard inference stacks with targeted patches, rather than trying fancy cost-cutting tricks that other inference providers use. “Other companies try too-fancy stuff to cut costs.” — matt

SGLang vs vLLM Cache Behavior

  1. SGLang is smarter about caching: it gets prefix-cache hits even on small requests
  2. vLLM aligns its prefix cache to fixed block boundaries, so small requests may not be cached at all (see the toy example after this list)
  3. As a result, GLM-5/5.1 (SGLang) may see better cache hit rates than Kimi K2.5 (vLLM) for short conversations
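A toy illustration of the block-alignment effect (a block size of 16 tokens is a common vLLM default; exact behavior varies by version and config):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (a common vLLM default)

def cached_prefix_tokens(shared_prefix_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Tokens reusable from a block-aligned prefix cache.

    Only whole blocks can be reused, so anything shorter than one
    block gets no cache hit at all.
    """
    return (shared_prefix_len // block_size) * block_size

print(cached_prefix_tokens(12))   # 0  -> short request, no reuse
print(cached_prefix_tokens(100))  # 96 -> 6 full blocks reused, 4 tokens recomputed
```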

Speculative Decoding

Synthetic uses Eagle3 speculative decoding for Kimi K2.5 (a toy sketch of the draft-and-verify loop follows the list):

  1. H200s: Speculative decoding is deployed and working well, averaging well over 50 TPS
  2. B200s/Blackwell: Speculative decoding is still being worked on; the initial deployment hit bugs, including runaway “!!!!!!” repetition in reasoning output
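For background, EAGLE-style speculative decoding has a lightweight draft head propose several tokens, which the target model then verifies in one forward pass; each rejection costs one corrected token. A toy version of the loop (purely conceptual, with no relation to the actual Eagle3 integration):

```python
def speculative_round(draft, verify, prefix, k=4):
    """One draft-then-verify round.

    draft(prefix, k)      -> k cheaply proposed tokens
    verify(prefix, toks)  -> the prefix of toks the target model agrees
                             with, plus one corrected token on mismatch
    """
    proposed = draft(prefix, k)
    accepted = verify(prefix, proposed)
    return prefix + accepted  # between 1 and k+1 new tokens per round

# Toy demo: the "target model" deterministically wants this string.
TARGET = list("hello world")

def toy_draft(prefix, k):
    guess = TARGET[len(prefix):len(prefix) + k]
    if guess:
        guess[-1] = "!"  # deliberately wrong last guess to force a reject
    return guess

def toy_verify(prefix, toks):
    out = []
    for j, tok in enumerate(toks):
        want = TARGET[len(prefix) + j]
        if tok == want:
            out.append(tok)
        else:
            out.append(want)  # correction token from the target model
            break
    return out

seq = []
while len(seq) < len(TARGET):
    seq = speculative_round(toy_draft, toy_verify, seq)
print("".join(seq))  # -> "hello world"
```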

Hosting Model

Synthetic uses a mix of self-hosted and proxied models:

  1. Self-hosted models: Run on Synthetic’s own reserved GPUs. Synthetic can patch inference engines, fix tool calling bugs, and control reliability. Currently: GLM-5/5.1, Kimi K2.5, Qwen 3.5, MiniMax M2.7, Nemotron 3 Super, GLM-4.7-Flash.
  2. Proxied models: Forwarded to third-party providers (Fireworks, TogetherAI, DeepInfra). Synthetic cannot fix reliability issues. Currently: DeepSeek V3.2, and temporarily GLM-4.7 during GPU outages.

General rule: Synthetic self-hosts newer/frontier models and proxies older ones. When a newer model replaces an older one, the older model is typically moved to a proxy. How long a proxy stays up depends on load; users usually switch to the newer model quickly, so proxied load is low and proxies can stick around for a while.
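Put together, hosting is a two-tier router. A hypothetical sketch (model IDs follow the lists above; the upstream URL, default behavior, and error handling are invented):

```python
# Hypothetical two-tier routing: self-hosted models hit Synthetic's own
# replicas; everything else is passed through to a third-party provider.
SELF_HOSTED = {
    "glm-5", "glm-5.1", "kimi-k2.5", "qwen-3.5",
    "minimax-m2.7", "nemotron-3-super", "glm-4.7-flash",
}
PROXIED_UPSTREAM = {
    "deepseek-v3.2": "https://upstream.example.com/v1",  # placeholder URL
}

def route(model_id: str) -> str:
    if model_id in SELF_HOSTED:
        return "internal"  # own reserved GPUs; engine patches apply here
    upstream = PROXIED_UPSTREAM.get(model_id)
    if upstream is None:
        raise ValueError(f"unknown model: {model_id}")
    return upstream  # pass-through: upstream reliability, upstream pricing
```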

Proxied models pass through the price Synthetic pays the underlying inference provider, which may differ from self-hosted pricing.

GPU Providers

As of April 2026, Synthetic uses 4 GPU providers. Previously, all models ran on reserved GPUs from TogetherAI. The provider landscape has been unstable: in mid-April 2026, all 3 original providers experienced simultaneous outages.

Sharding and Concurrency

Synthetic has experimented with sharding models across multiple nodes:

  1. Standard: Shard model weights across the 8 GPUs of a single node, and replicate across nodes
  2. Experimental: Shard a single model across 16 GPUs (2 nodes). Spreading the weights more thinly frees HBM for KV cache on every GPU, allowing higher request concurrency for MoE models

This is particularly relevant for fine-grained MoE models, where KV cache capacity is the bottleneck.
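Rough arithmetic shows why wider sharding helps; all numbers are illustrative (141 GB is H200-like HBM, and 1 TB stands in for a large MoE's served weight footprint):

```python
HBM_PER_GPU_GB = 141.0     # illustrative H200-class capacity
MODEL_WEIGHTS_GB = 1000.0  # illustrative large-MoE footprint when served

def kv_headroom_per_gpu(num_gpus: int) -> float:
    """HBM left for KV cache after weights are sharded over num_gpus."""
    return HBM_PER_GPU_GB - MODEL_WEIGHTS_GB / num_gpus

print(kv_headroom_per_gpu(8))   # 16.0 GB/GPU: tight, concurrency-limited
print(kv_headroom_per_gpu(16))  # 78.5 GB/GPU: much more room for KV cache
```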

Synbad

Synbad is Synthetic’s internal testing tool: it runs payloads against inference engines to detect bugs. It doubles as a repository of test payloads to validate new SGLang/vLLM releases against, since many bugs are payload-specific. Synbad Proxy can also be used by users to intercept and capture failing payloads for debugging.
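A minimal sketch of what a payload-replay harness of this kind can look like (endpoint, file layout, and the pass/fail criterion are my assumptions, not Synbad's actual design):

```python
import glob
import json
import urllib.request

ENDPOINT = "http://localhost:30000/v1/chat/completions"  # assumed local engine

def replay(payload_path: str) -> bool:
    """Send one saved payload at the engine and report basic success."""
    with open(payload_path) as f:
        payload = json.load(f)
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            body = json.load(resp)
            return bool(body.get("choices"))  # crude check: got a completion
    except Exception as exc:
        print(f"{payload_path}: FAILED ({exc})")
        return False

failures = [p for p in sorted(glob.glob("payloads/*.json")) if not replay(p)]
print(f"{len(failures)} failing payloads")
```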