Models

GLM-5.1

hf:zai-org/GLM-5.1

Price: $1.00/mtok in, $3.00/mtok out

GLM-5.1 is currently in beta. It may have higher rate limit impact than stable models. Price may increase to better reflect compute costs (~$1.40/mtok in, ~$4.40/mtok out has been discussed).

GLM-5.1 is currently the smartest and most capable coding/agentic model hosted directly by Synthetic. It is also the most capable open-weight model, period, trading blows with SOTA proprietary models like Opus 4.6 and GPT-5.4.

Consequently, it is also currently the most expensive model. While its output price is actually lower than Kimi K2.5’s, its input price is roughly 2x Kimi K2.5’s, and input tokens usually far outnumber output tokens in coding and agentic use cases, so input price dominates any cost comparison.
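To make that concrete, here is a small worked example using the prices on this page; the token counts are illustrative assumptions, not measurements:

  # Illustrative cost comparison for one large coding-agent request:
  # 100k input tokens, 5k output tokens (assumed, typical of agentic use).
  MTOK = 1_000_000
  glm_in, glm_out = 1.00, 3.00     # GLM-5.1 $/mtok (from this page)
  kimi_in, kimi_out = 0.45, 3.40   # Kimi K2.5 $/mtok (from this page)

  tok_in, tok_out = 100_000, 5_000
  glm = tok_in / MTOK * glm_in + tok_out / MTOK * glm_out
  kimi = tok_in / MTOK * kimi_in + tok_out / MTOK * kimi_out
  print(f"GLM-5.1: ${glm:.3f}, Kimi K2.5: ${kimi:.3f}")
  # GLM-5.1: $0.115, Kimi K2.5: $0.062 -> input pricing drives the ~1.9x gap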

  • Pros: Excels at almost all coding and long-horizon agentic work. Widely considered to be exceptionally good at code review. Even better than GLM-5 which was already excellent.
  • Cons: Quite a bit worse at user interface work than Kimi K2.5. Overkill for basic assistant work, such as for OpenClaw. Worse at lateral thinking than other frontier models (needs more explicit guidance). More compute-intensive than Kimi K2.5, running closer to redline during peak hours.

Compute Requirements

GLM-5.1 runs on 4 B200 GPUs per replica at NVFP4 quantization (same as GLM-5), and uses SGLang instead of vLLM for better cache hit rates. This is compared to 8 B200s for Kimi K2.5, making it theoretically both faster (due to less NVLink overhead) and cheaper (due to less energy required, and needing to rent fewer expensive GPUs). There is no official NVFP4 quant from Nvidia yet, but llmcompressor supports the GLM-5 architecture, which has allowed Synthetic to produce their own quant.
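For reference, producing an NVFP4 quant with llmcompressor looks roughly like the sketch below. This is a minimal sketch, not Synthetic’s actual pipeline: the checkpoint name comes from this page, the calibration dataset and ignore list are assumptions, and a model this size would need multi-GPU or offloaded loading in practice.

  # Minimal NVFP4 quantization sketch with llmcompressor (assumptions noted).
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from llmcompressor import oneshot
  from llmcompressor.modifiers.quantization import QuantizationModifier

  MODEL_ID = "zai-org/GLM-5"  # checkpoint name from this page
  model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
  tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

  # NVFP4 weight+activation scheme; lm_head is conventionally left unquantized.
  recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

  # A small calibration set is needed for activation scales (dataset choice is
  # an assumption; any representative text corpus works).
  oneshot(model=model, recipe=recipe, dataset="open_platypus",
          num_calibration_samples=64, max_seq_length=2048)

  model.save_pretrained("GLM-5-NVFP4", save_compressed=True)
  tokenizer.save_pretrained("GLM-5-NVFP4")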

Despite these favorable GPU numbers, Synthetic claims that GLM-5.1 is actually *more* compute-intensive than Kimi K2.5, says the price may need to increase to better reflect this, and has floated the idea that a price increase might resolve GLM-5.1’s noted instability (see below). The price point they have named as the goal (the “market rate”) is the per-token API pricing on OpenRouter, where GLM-5.1 is slightly more expensive than GLM-5.

However, users have noted that Synthetic’s subscription-based pricing model differs significantly from the per-token pricing seen on OpenRouter. On OpenRouter, providers are trying to make a per-token profit, so the rise in prices could simply reflect greater demand for GLM-5.1 driven by its increased capabilities, whereas under Synthetic’s subscription model, token pricing is only meant to be a bellwether for compute costs.

These price hikes have been floated as a solution to the recent instability of the GLM-5.1 replicas, covered below.

Known Issues

  1. Prefill stalling: More active parameters mean more expensive prefill. Under load, prefill can block decode, causing perceived stalling and, in some cases, request timeouts. This affects GLM-5/5.1 more than GLM-4.7.
  2. Capacity constraints: GLM-5.1 has been running close to redline during peak hours. Synthetic is working on bringing up more compute.
  3. Node instability: There have been several instances of entire GPU nodes crashing in the last two or three weeks, leading to service gaps; official status updates often lag behind community reports in the Discord.

It is unclear if an increase in pricing would resolve any of these issues.
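In the meantime, client-side timeouts with retries paper over most of the stalls. A minimal sketch, assuming an OpenAI-compatible endpoint; the base URL and environment variable name are assumptions, not documented values:

  # Retry on timeouts caused by prefill stalls or load spikes.
  import os, time
  from openai import OpenAI, APITimeoutError

  client = OpenAI(
      base_url="https://api.synthetic.new/v1",  # assumed endpoint
      api_key=os.environ["SYNTHETIC_API_KEY"],  # assumed env var name
      timeout=120.0,                            # generous TTFT budget
  )

  def chat_with_retry(messages, retries=3):
      for attempt in range(retries):
          try:
              return client.chat.completions.create(
                  model="hf:zai-org/GLM-5.1", messages=messages)
          except APITimeoutError:
              time.sleep(2 ** attempt)  # backoff: 1s, 2s, 4s
      raise RuntimeError("GLM-5.1 kept timing out; consider a fallback model")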

See also: GLM-5 (predecessor, being retired), Kimi K2.5 (complementary frontier model with vision)

GLM-5

hf:zai-org/GLM-5

Price: $1.00/mtok in, $3.00/mtok out (beta pricing)

GLM-5 is in beta and is slated to be retired/proxied once GLM-5.1 exits beta. For new projects, prefer GLM-5.1.

GLM-5 was launched in beta on March 30, 2026, after SGLang made progress stabilizing GLM-5’s new architecture. It trades blows with proprietary models like Opus 4.6 and GPT-5.4 for coding and agentic tasks.

  • Pros: Excellent at backend coding and long-horizon agentic work. Fewer total parameters than Kimi K2.5, so it can run at tp4 on B200 NVFP4 (4 GPUs vs K2.5’s required tp8), which is usually faster due to reduced NVLink overhead.
  • Cons: Not as strong at UI/frontend work. Beta pricing is set high (impacts rate limits). Will be replaced by GLM-5.1.

Architecture Notes

GLM-5 has fewer total parameters than Kimi K2.5, making it more efficient to serve on B200 hardware at NVFP4 quantization. However, vLLM and SGLang support for GLM-5’s architecture was initially poor — Synthetic had to wait for upstream fixes before stable hosting was possible.

The model runs on SGLang (which is faster for the GLM series) and uses NVFP4 quantization on B200 GPUs. Each replica requires 4 B200 GPUs (tp4).
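For illustration, a tp4 deployment via SGLang’s offline engine would look roughly like this. A sketch under assumptions: the local checkpoint path is hypothetical, and loading the NVFP4 weights through SGLang’s ModelOpt FP4 quantization path is an assumption, not a confirmed configuration:

  # Sketch of a tp4 SGLang deployment (path and quant arg are assumptions).
  import sglang as sgl

  llm = sgl.Engine(
      model_path="./GLM-5-NVFP4",    # hypothetical local NVFP4 checkpoint
      tp_size=4,                     # 4 B200s per replica (tp4)
      quantization="modelopt_fp4",   # assumed NVFP4 loading path
  )
  out = llm.generate(["Explain NVLink overhead in one sentence."],
                     {"temperature": 0.6, "max_new_tokens": 64})
  print(out[0]["text"])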

Deprecation Path

Based on public statements from Synthetic staff, the plan is:

  1. Take GLM-5 out of beta, stop self-hosting GLM-4.7
  2. Put GLM-5.1 in beta
  3. Once GLM-5.1 is out of beta, retire/proxy GLM-5

Old models are typically proxied to Fireworks or TogetherAI, although proxy duration depends on load since proxies are expensive.

See also: GLM-5.1 (the replacement), GLM-4.7 (the predecessor)

Kimi K2.5

hf:moonshotai/Kimi-K2.5

Price: $0.45/mtok in, $3.40/mtok out

hf:moonshotai/Kimi-K2.5 and hf:nvidia/Kimi-K2.5-NVFP4 are aliased internally and can be used interchangeably.

A powerful agentic model with above-average lateral thinking/debugging and great design skills.

  • Pros: Solid code. Amazing at orchestrating other agents due to special “agent swarm” reinforcement learning (source). Only frontier class model on Synthetic with vision. Best model Synthetic has for UI work (probably because it was trained extensively with vision and to translate between visual input and code).
  • Cons: Prone to outright laziness (keeping code for “backward compatibility”, marking things as “to implement later”) and thinking a bit too laterally. Keep an eye on it during longer tasks. Not quite as good as GLM-5 for backend work.

Although Kimi was trained with agent swarms, getting the same results as MoonshotAI would require their proprietary swarms endpoint, which is not available on Synthetic. However, similar results can be approximated with massively parallel sub-agents, or with an SDK such as Swarms that assigns various roles, as sketched below.
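A minimal sketch of that fan-out pattern, assuming an OpenAI-compatible endpoint (base URL and env var are assumptions); this is plain parallelism, not MoonshotAI’s proprietary swarms endpoint:

  # Fan out independent sub-agent tasks to Kimi K2.5 in parallel.
  import asyncio, os
  from openai import AsyncOpenAI

  client = AsyncOpenAI(base_url="https://api.synthetic.new/v1",  # assumed
                       api_key=os.environ["SYNTHETIC_API_KEY"])  # assumed

  async def sub_agent(task: str) -> str:
      resp = await client.chat.completions.create(
          model="hf:moonshotai/Kimi-K2.5",
          messages=[
              {"role": "system",
               "content": "You are one worker in a swarm. Do only your task."},
              {"role": "user", "content": task},
          ])
      return resp.choices[0].message.content

  async def swarm(tasks: list[str]) -> list[str]:
      return await asyncio.gather(*(sub_agent(t) for t in tasks))

  results = asyncio.run(swarm(["Audit the auth module",
                               "Audit the payment module"]))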

Load-Balanced Routing (NVFP4 + INT4)

Since February 15, 2026, Synthetic routes Kimi K2.5 requests between two hardware backends based on current load. Both model strings (hf:moonshotai/Kimi-K2.5 and hf:nvidia/Kimi-K2.5-NVFP4) may silently hit either backend. You don’t need to do anything differently.

  1. INT4 variant — runs on H200 GPUs (original Moonshot quant)
  2. NVFP4 variant — runs on B200 GPUs using Nvidia’s new near-lossless quant format (this capacity is now U.S.-based)

Why: B200 capacity was sometimes overloaded while H200s sat with excess capacity. Routing between them based on load smooths this out and lets Synthetic scale Kimi via B200s going forward. Some INT4 capacity remains on reserved H200s (NVFP4 provides no perf advantage on H200 hardware).
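In practice the two model strings behave identically from the client’s side; a quick sketch (base URL assumed as elsewhere on this page):

  # Both aliases go through the same load balancer; either request may
  # land on either backend.
  import os
  from openai import OpenAI

  client = OpenAI(base_url="https://api.synthetic.new/v1",  # assumed
                  api_key=os.environ["SYNTHETIC_API_KEY"])

  for model in ("hf:moonshotai/Kimi-K2.5", "hf:nvidia/Kimi-K2.5-NVFP4"):
      r = client.chat.completions.create(
          model=model,
          messages=[{"role": "user", "content": "Reply with 'ok'."}],
          max_tokens=5)
      print(model, "->", r.choices[0].message.content)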

Benchmark comparison (from Synthetic’s internal testing):

Benchmark                INT4   NVFP4   Delta
AIME                     91.0   93.3    NVFP4 +3.3
Aider Polyglot           74.4   71.1    INT4 +3.3
LiveCodeBench (subset)   n/a    n/a     NVFP4 +4.0

All within margin of error. NVFP4 mildly better on 2 of 3 benchmarks. Synthetic considers them equivalent for routing purposes.

Precision Mix Detail

Since April 7, 2026, Synthetic serves Kimi K2.5 from a mix of NVFP4 (B200s) and original INT4 (H200s). The /models API endpoint reports NVFP4 as the quant format, but your request may silently hit either backend based on current load.

When serving a model in multiple precisions, Synthetic reports the “least legitimate” format (i.e., the derived quant rather than the lab-released original) in the /models endpoint. Since Moonshot’s original release was INT4 and Nvidia’s NVFP4 is a derived quant, NVFP4 is what gets reported. However, INT4 capacity remains on reserved H200s, where NVFP4 provides no performance advantage.

This means: even if you explicitly use hf:nvidia/Kimi-K2.5-NVFP4, your request may still be served by the INT4 variant on H200s during times of B200 load.
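To see what the API reports (as opposed to what actually serves a given request), you can list models; a sketch with the same assumed base URL:

  # The /models endpoint reports NVFP4 for Kimi K2.5, but individual
  # requests may still land on INT4/H200 capacity.
  import os
  from openai import OpenAI

  client = OpenAI(base_url="https://api.synthetic.new/v1",  # assumed
                  api_key=os.environ["SYNTHETIC_API_KEY"])

  for m in client.models.list():
      if "Kimi-K2.5" in m.id:
          print(m.id)  # quant metadata, where exposed, will say NVFP4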

MiniMax M2.5

hf:MiniMaxAI/MiniMax-M2.5

Price: $0.40/mtok in, $2.00/mtok out

Currently the most capable middle-tier model on Synthetic for general agentic and coding tasks. Best used as a fast subagent orchestrated by a more powerful model like GLM-5 or Kimi K2.5.

  • Pros: Very fast due to a very low active parameter count (10B). Pretty good at straightforward agentic tool use, agentic terminal use, and writing working, adequate code, as well as thoroughly exploring and writing reports on codebases or document collections.
  • Cons: Will very easily get stuck in loops if it isn’t able to quickly debug an issue with its code — or its tools — in 1-2 turns. Requires detailed and thorough instructions to correctly execute the desired task (otherwise it will misinterpret what you mean, leave crucial things out, or just not understand the assignment).

Kimi K2-Thinking

hf:moonshotai/Kimi-K2-Thinking

Price: $0.60/mtok in, $2.50/mtok out

The most capable model on Synthetic before GLM-5 and Kimi K2.5 came along. Still by far the best writing model, though.

  • Pros: Mostly just very good at writing, especially in a way that doesn’t have noticeable Claude-like LLM writing tells, and picking up on emotions and nuances.
  • Cons: Writing isn’t always great at conveying coherent physical spaces or motions; can have continuity issues sometimes. Shouldn’t really be used for anything but the writing style at this point.

Nemotron 3 Super

hf:nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

Price: $0.30/mtok in, $1.00/mtok out

The most powerful budget model on Synthetic. Definitely worth using for agentic web search and report gathering, basic agentic terminal automation, thread summary and title generation, and other basic housekeeping tasks you don’t need a frontier model for. Should not be allowed anywhere near code.

  • Pros: Very long context for such a cheap/small model (double the context of GPT-OSS 120B, which is the same size). Extremely, almost unnervingly fast, and does not really slow down over long contexts at all, thanks to its hybrid state space model architecture. Most powerful and capable fully open source model (source).
  • Cons: Not very flexible at problem solving. Can lose the plot pretty hard if set loose on a difficult problem for a long time without feedback, although it doesn’t really suffer from context rot and is very tenacious, so it has that going for it. Probably shouldn’t be allowed to write code. Not that smart.

GLM 4.7 Flash

hf:zai-org/GLM-4.7-Flash

Price: $0.10/mtok in, $0.50/mtok out

By far the cheapest model on Synthetic. Capable at basic tasks like summarization, classification, and simple translation of natural-language commands into tool calls or terminal commands.

  • Pros: Cheapest. Very fast.
  • Cons: Only for basic usage.
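For the kind of housekeeping task these budget models are meant for, a title-generation sketch (endpoint assumptions as elsewhere on this page):

  # Cheap thread-title generation with GLM-4.7-Flash.
  import os
  from openai import OpenAI

  client = OpenAI(base_url="https://api.synthetic.new/v1",  # assumed
                  api_key=os.environ["SYNTHETIC_API_KEY"])

  def thread_title(transcript: str) -> str:
      r = client.chat.completions.create(
          model="hf:zai-org/GLM-4.7-Flash",
          messages=[
              {"role": "system",
               "content": "Summarize this thread as a title of at most 8 words."},
              {"role": "user", "content": transcript},
          ],
          temperature=0.0, max_tokens=24)
      return r.choices[0].message.content.strip()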

Qwen 3.5 397B

hf:Qwen/Qwen3.5-397B-A17B

Price: See Synthetic pricing page

Qwen 3.5 is a large MoE model (397B total / 17B active) self-hosted by Synthetic. It was launched in beta on February 20, 2026.

  • Pros: Fast — fast enough that Synthetic didn’t need beta rate limits for it. Decent at code, decent at agentic work. Self-hosted, so more reliable than proxied models.
  • Cons: Not quite at the level of GLM-5 or Kimi K2.5 for frontier coding tasks. Performance “isn’t great tbh” (per Synthetic staff) compared to the flagship models.

Qwen 3.5 is still in beta. Performance may improve as Synthetic optimizes serving.

DeepSeek V3.2

hf:deepseek-ai/DeepSeek-V3.2

Price: Uses Fireworks pricing (proxied model)

DeepSeek V3.2 is proxied to Fireworks — it is not self-hosted by Synthetic. This means Synthetic cannot control reliability or fix tool-calling issues. Uptime is approximately 99.5% (per status.synthetic.new).

DeepSeek V3.2 is a powerful model that was one of the first available on Synthetic, but the experience has been inconsistent due to proxying.

  • Pros: Strong model capabilities. One of the fastest models when working properly (fastest proxied model over 24hrs in early 2026).
  • Cons: Tool calling reliability is poor due to proxying. Occasional timeout issues. Synthetic cannot patch the inference engine for this model.

Synthetic has indicated they will try to self-host DeepSeek 4 when it comes out.

Proxy Implications

Because DeepSeek V3.2 is proxied to Fireworks:

  1. Synthetic forwards the price they pay the underlying inference provider
  2. Tool calling bugs cannot be fixed on Synthetic’s end
  3. TTFT and TPS depend on Fireworks’ infrastructure
  4. If Fireworks stops hosting it, Synthetic will have to stop supporting it too

For better reliability, prefer self-hosted models like GLM-5, Kimi K2.5, or GLM-4.7-Flash.

Embedding Models

Nomic Embed Text 1.5

hf:nomic-ai/nomic-embed-text-v1.5

If you have a Synthetic subscription, using an embedding model is free. Embedding models are commonly used in vector databases to provide effective semantic search over your data. Think of it as an easy way to browse through a hundred documents without giving an agent the entire context.

Nomic only works with text and is no longer state-of-the-art, so expect average performance.
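If you do use it, note that nomic-embed-text-v1.5 expects task prefixes on its inputs (“search_document:” for indexing, “search_query:” for queries). A minimal semantic-search sketch, assuming Synthetic exposes an OpenAI-compatible embeddings endpoint (base URL assumed as elsewhere on this page):

  # Semantic search over a few documents with Nomic embeddings.
  import os
  import numpy as np
  from openai import OpenAI

  client = OpenAI(base_url="https://api.synthetic.new/v1",  # assumed
                  api_key=os.environ["SYNTHETIC_API_KEY"])

  def embed(texts):
      r = client.embeddings.create(
          model="hf:nomic-ai/nomic-embed-text-v1.5", input=texts)
      return np.array([d.embedding for d in r.data])

  docs = ["GLM-5.1 runs on 4 B200s per replica.", "Kimi K2.5 has vision."]
  doc_vecs = embed([f"search_document: {d}" for d in docs])
  query_vec = embed(["search_query: which model has vision?"])[0]

  # Cosine similarity; the highest score is the best match.
  scores = doc_vecs @ query_vec / (
      np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
  print(docs[int(scores.argmax())])  # -> "Kimi K2.5 has vision."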