models:glm-5.1 — current revision 2026/04/21 13:07 by xenolandscapes (created 2026/04/19 22:16 by gwyntel)
=== Compute Requirements ===
| |
GLM-5.1 runs on 4 B200 GPUs per replica at NVFP4 quantization (same as GLM-5), and uses SGLang instead of vLLM for better cache hit rates. This compares to 8 B200s for Kimi K2.5, making GLM-5.1 theoretically both faster (less NVLink overhead) and cheaper (less energy, and fewer expensive GPUs to rent). There is no official NVFP4 quant from Nvidia yet, but llmcompressor supports the GLM-5 architecture, which has allowed Synthetic to produce their own quant.
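For readers unfamiliar with llmcompressor, a quantization run of this kind is typically driven by a declarative recipe. The fragment below is a hypothetical sketch, not Synthetic's actual recipe: the stage name, target list, and ignore list are illustrative assumptions, and only the general recipe shape and the NVFP4 scheme name come from llm-compressor itself.

```yaml
# Hypothetical llm-compressor recipe for an NVFP4 quant of a GLM-5-family
# checkpoint. Stage/target/ignore names are illustrative placeholders.
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]       # quantize the linear (matmul) layers
      scheme: "NVFP4"           # Blackwell-native 4-bit floating point
      ignore: ["lm_head"]       # keep the output head at higher precision
```

A recipe like this would be passed to llm-compressor's one-shot entry point along with the source checkpoint, producing weights that SGLang can then serve on B200s.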
| |
Despite the above numbers, Synthetic claims that GLM-5.1 is actually *more* compute-intensive than Kimi K2.5, and says the price may need to increase to better reflect this. The price point they have pointed to as the goal (the "market rate") is the per-token API price on OpenRouter, where GLM-5.1 is slightly more expensive than GLM-5.

However, users have noted that Synthetic's subscription-based pricing differs significantly from the per-token pricing seen on OpenRouter: OpenRouter providers are trying to make a per-token profit, so the higher price there could simply reflect greater demand for GLM-5.1's increased capabilities, whereas under Synthetic's subscription model token pricing is only meant to be a bellwether for compute costs.

These price hikes have been floated as a solution to the recent instability of the GLM-5.1 replicas, covered below.
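The 4-GPU-versus-8-GPU comparison above can be made concrete with back-of-the-envelope arithmetic. All rates and throughputs below are illustrative placeholders, not Synthetic's actual numbers; the point is only the shape of the calculation.

```python
# Rough replica economics: GPU rent per million output tokens.
# Every number here is an assumed placeholder for illustration.

B200_HOURLY_USD = 6.00  # assumed rental rate per B200 GPU-hour

def replica_cost_per_hour(num_gpus: int, gpu_hourly: float = B200_HOURLY_USD) -> float:
    """Raw GPU rental cost for one serving replica."""
    return num_gpus * gpu_hourly

def cost_per_million_tokens(num_gpus: int, tokens_per_second: float) -> float:
    """GPU rent attributable to one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return replica_cost_per_hour(num_gpus) / tokens_per_hour * 1_000_000

# Hypothetical: both replicas sustain the same aggregate throughput.
glm51 = cost_per_million_tokens(num_gpus=4, tokens_per_second=60)
kimi = cost_per_million_tokens(num_gpus=8, tokens_per_second=60)

# At equal throughput, the 4-GPU replica costs half as much per token.
# If GLM-5.1's real throughput per replica is lower, the gap narrows,
# which would be consistent with Synthetic's "more compute-intensive" claim.
```

This is why raw GPU count alone does not settle the pricing question: per-token cost depends on throughput per replica, which is exactly the quantity stressed during peak hours.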
| |
=== Known Issues ===
| |
- **Prefill stalling:** More active parameters mean more expensive prefill. Under load, prefills can block decode, causing perceived stalling and, in some cases, eventual request timeouts. This affects GLM-5/5.1 more than GLM-4.7.
- **Capacity constraints:** GLM-5.1 has been running close to redline during peak hours. Synthetic is working on bringing up more compute.
- **Node instability:** Several entire GPU nodes have crashed in the last two or three weeks, leading to service gaps; official status updates have often lagged behind community reports in the Discord.

It is unclear whether an increase in pricing would resolve any of these issues.
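The prefill-stalling issue above follows from simple scaling arithmetic: a forward pass costs roughly 2 × (active parameters) × (prompt tokens) FLOPs, and a long prompt is processed in one burst that monopolizes the GPU unless the server interleaves it with decode (e.g. chunked prefill). The parameter counts below are hypothetical placeholders, not the models' real active-parameter figures.

```python
# Why more active parameters worsens prefill stalling.
# Rule of thumb: forward-pass FLOPs ~= 2 * N_active * prompt_tokens.
# Parameter counts are illustrative placeholders only.

def prefill_flops(active_params: float, prompt_tokens: int) -> float:
    """Approximate FLOPs to prefill a prompt of the given length."""
    return 2.0 * active_params * prompt_tokens

SMALL_ACTIVE = 32e9  # hypothetical GLM-4.7-class active parameters
LARGE_ACTIVE = 64e9  # hypothetical GLM-5/5.1-class active parameters

prompt = 100_000  # a long agentic-coding context

small = prefill_flops(SMALL_ACTIVE, prompt)
large = prefill_flops(LARGE_ACTIVE, prompt)

# Doubling active parameters doubles the prefill work for the same prompt,
# so each incoming long prompt blocks decode for twice as long under load.
ratio = large / small
```

Under this model, more capacity (more replicas) reduces how often long prefills collide with active decodes, but it does not change the per-prompt prefill cost, which may be why the issues persist near redline.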
| |
See also: [[:models:glm-5|GLM-5]] (predecessor, being retired), [[:models:kimi-k25|Kimi K2.5]] (complementary frontier model with vision)