Differences
This shows you the differences between two versions of the page.
| models:glm-5.1 [2026/04/21 13:02] – xenolandscapes | models:glm-5.1 [2026/04/21 13:07] (current) – xenolandscapes | ||
|---|---|---|---|
| Line 20: | Line 20: | ||
| GLM-5.1 runs on 4 B200 GPUs per replica at NVFP4 quantization (same as GLM-5), and uses SGLang instead of vLLM for better cache hit rates. Kimi K2.5, by comparison, needs 8 B200s per replica, which makes GLM-5.1 theoretically both faster (due to less NVLink overhead) and cheaper (because it needs less energy and fewer expensive rented GPUs). There is no official NVFP4 quant from Nvidia yet, but llmcompressor supports the GLM-5 architecture, | GLM-5.1 runs on 4 B200 GPUs per replica at NVFP4 quantization (same as GLM-5), and uses SGLang instead of vLLM for better cache hit rates. Kimi K2.5, by comparison, needs 8 B200s per replica, which makes GLM-5.1 theoretically both faster (due to less NVLink overhead) and cheaper (because it needs less energy and fewer expensive rented GPUs). There is no official NVFP4 quant from Nvidia yet, but llmcompressor supports the GLM-5 architecture, | ||
| - | Despite the above numbers, Synthetic claims that it is actually *more* compute-intensive than Kimi K2.5, and they say that the price may need to increase to better reflect this. The price-point they have pointed to as the goal (the " | + | Despite the above numbers, Synthetic claims that it is actually *more* compute-intensive than Kimi K2.5, and they say that the price may need to increase to better reflect this; they have also floated the idea that this might resolve GLM-5.1' |
| + | |||
| + | However, | ||
| These price hikes have been floated as a solution to the recent instability of the GLM-5.1 replicas, covered below. | These price hikes have been floated as a solution to the recent instability of the GLM-5.1 replicas, covered below. | ||
| Line 28: | Line 30: | ||
| - **Prefill stalling:** More active parameters mean more expensive prefill. Under load, prefills can block decode, causing perceived stalling and, in some cases, eventual request timeouts. This affects GLM-5/5.1 more than GLM-4.7. | - **Prefill stalling:** More active parameters mean more expensive prefill. Under load, prefills can block decode, causing perceived stalling and, in some cases, eventual request timeouts. This affects GLM-5/5.1 more than GLM-4.7. | ||
| - **Capacity constraints: | - **Capacity constraints: | ||
| - | - **Node instability: | + | - **Node instability: |
| It is unclear if an increase in pricing would resolve any of these issues. | It is unclear if an increase in pricing would resolve any of these issues. | ||
| See also: [[: | See also: [[: | ||
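
The prefill-stalling item above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how SGLang or Synthetic's deployment actually schedules work: it assumes a single replica where each newly arrived request's prefill runs to completion before the next decode step, and every number (step time, prefill times, arrival pattern) is hypothetical. The function name `token_gaps` is invented for this example.

```python
def token_gaps(num_steps, decode_step_ms, prefill_ms, arrivals):
    """Per-token latency seen by one ongoing request on a toy replica.

    arrivals maps a decode-step index to the number of new requests whose
    prefill runs (and blocks decoding) right before that step.
    All values are hypothetical and chosen only for illustration.
    """
    gaps = []
    for step in range(num_steps):
        # Each blocking prefill delays the next decoded token by prefill_ms.
        blocked_ms = arrivals.get(step, 0) * prefill_ms
        gaps.append(blocked_ms + decode_step_ms)
    return gaps


# Same traffic, but one model has a cheaper prefill than the other.
cheaper_prefill = token_gaps(5, decode_step_ms=20.0, prefill_ms=400.0, arrivals={2: 1})
costlier_prefill = token_gaps(5, decode_step_ms=20.0, prefill_ms=1200.0, arrivals={2: 1})

print(max(cheaper_prefill), max(costlier_prefill))
```

In this toy setup, a single incoming request turns one 20 ms token gap into a 420 ms or 1220 ms pause depending on prefill cost, which is the kind of perceived stall the list describes. Real servers can soften this with techniques such as chunked prefill, which the sketch does not model.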