====== Frequently Asked Questions ======
  
===== When will Synthetic get model X? =====

Before you ask this, make sure the model:

  - Is [[https://opensource.org/ai/open-weights|open weight]] or [[https://opensource.org/ai/open-source-ai-definition|open source]].
  - Has weights available on [[https://huggingface.co/models|HuggingFace]] (these are not always the same thing: sometimes a model is //planned// to be open weight, but has not been released yet).
  - Has a compatible license that allows Synthetic to actually make money hosting it (sometimes model weights are published "openly," but only under modified OSS licenses that, for instance, require royalties above a certain profitability threshold).
  - Has an [[https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference|NVFP4]] [[https://ngrok.com/blog/quantization|quantization]] available so that Synthetic can run it on their GPUs at optimal speed (there are exceptions: if a model is in sufficient demand, they may make their own quant).
  - Has solid support for that model or its general architecture in [[https://github.com/sgl-project/sglang/issues|sglang]], the inference engine Synthetic uses to actually //run// the models.

Factors that can delay Synthetic getting a model:

  * If it is unusually large, it may take time for them to acquire or free up GPU space to host it.
  * If it has a novel or unusual architecture (such as DeepSeek Sparse Attention for GLM 5), it will take time for inference engines like sglang to gain reliable support for it.
  * If the model has not yet been quantized to NVFP4, Synthetic will have to wait for NVIDIA to do that or make a quant themselves, both of which can take some time.

Additionally, it is worth keeping in mind that many models from labs known for creating open-weight models may either be closed source (such as Qwen 3.6-Plus) or available only through the lab's API for user testing and feedback (and to give the lab a profitable head start) but not yet released as open weights. This was the case with GLM 5.1 for a few weeks and remains true as of April 9th, 2026, for MiniMax M2.7.
===== Why is model X from Synthetic not in $preferred_harness's list? =====

Many of these lists are updated by hand, so you might be the first ''$preferred_harness'' user who's noticed the model is missing! You can:

  * Wait until someone else opens a PR to ''$preferred_harness'''s list
  * Find where ''$preferred_harness'' sources its data from and open that PR yourself
    * OpenCode has [[https://github.com/anomalyco/models.dev/|Models.dev]]
    * Crush has [[https://github.com/charmbracelet/catwalk|Catwalk]]
  * Use some kind of provider-specific plugin for ''$preferred_harness'' with a list that updates more frequently
    * Pi has [[https://github.com/ben-vargas/pi-packages/tree/main/packages/pi-synthetic-provider#readme|@benvargas/pi-synthetic-provider]]
  * Maintain your own provider/model list for ''$preferred_harness'' (see the sketch after this list)
    * [[https://opencode.ai/docs/providers#custom-provider|OpenCode]]
    * [[https://github.com/badlogic/pi-mono/blob/main/packages/coding-agent/docs/models.md|Pi]]
    * [[https://github.com/charmbracelet/crush?tab=readme-ov-file#openai-compatible-apis|Crush]]
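As a rough sketch of that last option, here is roughly what a custom provider entry could look like in OpenCode's ''opencode.json''. The base URL and model ID below are placeholders, not Synthetic's real values; check the OpenCode docs linked above for the exact, current schema.

<code json>
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "synthetic": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Synthetic",
      "options": {
        "baseURL": "https://api.example.com/v1"
      },
      "models": {
        "brand-new-model": {
          "name": "Brand New Model"
        }
      }
    }
  }
}
</code>

Once the model ID is in your own config, you don't have to wait on anyone's upstream list to start using it.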
===== Why am I burning through my credits or requests so quickly? =====

==== Step 1: Check Your Tools ====

The software you use with Synthetic has a massive impact on your token "burn rate."

=== 1. Are you using Claude Code? ===

**Recommendation: Use any other agent.** We strongly recommend //against// using Claude Code as a coding harness with Synthetic. Its underlying infrastructure is notoriously inefficient, causing excessive token bloat. Using Claude Code will drain your credits much faster than almost any other option.

=== 2. Are you using OpenCode (with oh-my-opencode/oh-my-openagent)? ===

While OpenCode is significantly better than Claude Code, it is still not optimized for efficiency. If you are using the oh-my-opencode/openagent (OMO) extension, the problem is considerably worse: OMO launches unnecessary, poorly designed subagent workflows that bloat every single prompt with redundant context, leading to a "death by a thousand tokens" scenario.

=== 3. Are you using Zed? ===

Zed is a powerful editor, but its real-time "live-edit" feature comes at a cost. Zed uses a two-step process for edits:

  - **Intent to Edit:** The main chat (the one you're talking to) sends a tool call declaring that it wants to edit a file and defining the goal of the edit.
  - **Execution:** Zed receives that tool call and runs a //separate// request with the same chat history to generate the code in a streaming manner (for the live-diff to work).

Because of this "intent to edit" system, you are essentially using **2x the input tokens** and **2x the requests** for every single edit. If this workflow is necessary for you, you may need more packs to sustain it.
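To see how quickly that doubling adds up, here is a back-of-the-envelope calculation. The token counts are invented for illustration, not measurements of Zed:

<code python>
# Rough cost model for Zed's two-step edit flow.
# All numbers below are assumptions for illustration only.
history_tokens = 30_000    # chat history resent with each request
edits_per_session = 25     # how many live edits you make

# Step 1 (intent) and step 2 (execution) each resend the full history,
# so every edit costs roughly two full-history requests.
input_tokens = edits_per_session * 2 * history_tokens
requests = edits_per_session * 2

print(f"{input_tokens:,} input tokens over {requests} requests")
# -> 1,500,000 input tokens over 50 requests
</code>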
==== Step 2: Optimize Your Workflow ====

If your tools aren't known for being wasteful but your usage remains high, follow these steps (roughly in order of recommendation) to reduce token bloat:

=== 1. Reduce Frequency ===

Reduce the frequency of automated workflow runs. For example, if you use OpenClaw, review your current tasks in OpenClaw's "heartbeat" function and increase the interval between checks.

=== 2. Prompt Efficiency ===

Trim your input tokens: keep your system prompts and AGENTS.md files concise.
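For a sense of scale, a lean AGENTS.md only needs the facts the model can't infer on its own. The project details below are invented for illustration:

<code markdown>
# Project agent notes

- Build: `make build`; test: `make test` (run tests before committing)
- Style: run `make fmt`; keep functions small; wrap errors with context
- Don't touch `vendor/` or anything under `gen/` (generated code)
</code>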
=== 3. Model Efficiency ===

Switch to a cheaper model for simpler tasks; Kimi and GLM don't need to be running for every single prompt.

=== 4. Serial Model Orchestration ===

Let your top-level chat be with a more expensive model like Kimi or GLM, but have it orchestrate subagents in series (not in parallel) using cheaper models like MiniMax or even Nemotron to execute scoped, specified-in-detail tasks such as editing code or learning about the codebase.

This gets you the superior planning, problem solving, prompt and project understanding, and code review capabilities of the better model, while avoiding burning tokens having the bigger model grep around your codebase or iterate on a piece of code to satisfy the compiler/linter/test suite. You still have access to the bigger model for problems the smaller ones can't handle.
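If your harness doesn't support this natively, the pattern itself is simple. Below is a minimal sketch against a generic OpenAI-compatible API; the base URL, environment variable, and model IDs are placeholders, and a real harness would add tool use, retries, and context management on top:

<code python>
# Minimal serial orchestration sketch: an expensive "planner" model
# breaks work into steps, and a cheap "worker" model executes each
# step one at a time. Placeholder names throughout.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",  # placeholder provider endpoint
    api_key=os.environ["PROVIDER_API_KEY"],
)

PLANNER = "expensive-model"  # e.g. a Kimi/GLM-class model
WORKER = "cheap-model"       # e.g. a MiniMax/Nemotron-class model

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# 1. The expensive model produces a short, detailed plan.
plan = ask(PLANNER, "Break this task into three small, self-contained steps: "
                    "add input validation to the /signup endpoint.")

# 2. The cheap model executes the steps in series (not in parallel),
#    so each result can feed into the next step's prompt.
result = ""
for step in plan.splitlines():
    if step.strip():
        result = ask(WORKER, f"Previous result:\n{result}\n\nDo this step:\n{step}")

# 3. The expensive model only comes back at the end, for review.
print(ask(PLANNER, f"Review this work for correctness:\n{result}"))
</code>

Running the workers serially keeps each request's context small and lets the planner's expensive tokens go toward planning and review rather than mechanical edits.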
=== 5. Limit "Thinking" ===

For less complex tasks, reduce the model's budget for thinking. This forces the model to be more direct and prevents it from using tokens on unnecessary internal reasoning.
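How you set this depends on your harness and on Synthetic's API. The ''reasoning_effort'' field below is one common OpenAI-style convention, and whether Synthetic honors it (or uses a different field) is an assumption here; check your harness's model settings and Synthetic's docs for the actual knob:

<code python>
# Sketch: capping "thinking" on a per-request basis.
# "reasoning_effort" is an OpenAI-style convention; treating it as
# supported by your provider is an assumption, so verify in their docs.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",  # placeholder endpoint
    api_key=os.environ["PROVIDER_API_KEY"],
)

resp = client.chat.completions.create(
    model="some-reasoning-model",  # placeholder model ID
    messages=[{"role": "user", "content": "Rename this variable across the file."}],
    extra_body={"reasoning_effort": "low"},  # keep internal reasoning short
)
print(resp.choices[0].message.content)
</code>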
=== 6. Increase Packs ===

If your workflow is already lean but you still hit limits, it may be time to increase your number of packs to match your professional output.