====== Plexus ======

[[https://github.com/mcowger/plexus|Plexus]] is a unified LLM API gateway handling protocol translation, failover, and usage tracking. It is maintained by Matt Cowger (@mcowger on the Synthetic Discord), and the Synthetic Discord has a dedicated [[https://discord.com/channels/1315627714056687706/1446759141921132584|#plexus]] channel thread for Plexus support.

Plexus is currently the most practical way to use multiple AI providers through a single endpoint without rewriting client code for each API format. It routes OpenAI, Anthropic, Gemini, and any OpenAI-compatible provider through one interface, and also supports OAuth-backed providers (GitHub Copilot, Claude, Codex, Gemini CLI) without API key management. At the time of writing it is the only gateway with built-in vision fallthrough for non-vision models.

Pros:

  * Routes between providers automatically with configurable selectors.
  * Converts API formats bidirectionally: send Anthropic-style requests to OpenAI backends and vice versa.
  * Vision fallthrough lets cheap models handle images via automatic text description.
  * Per-request cost tracking with quota enforcement by tokens, requests, or dollars.
  * Web UI for configuration.
  * Best available conversion accuracy; exceeds the translation fidelity of LiteLLM, AxonHub, and other similar tools.

Cons:

  * Adds ~20-50 ms of latency versus direct provider calls.
  * Self-hosted, so you manage uptime.
  * The default SQLite backend won't scale past a single node; PostgreSQL is required for production.

**Vision Fallthrough**

  * Unique capability allowing vision aliases to work with non-vision backends.
  * Intercepts images, sends them to a descriptor model (Gemini Flash, GPT-5.3-Codex) for conversion to text, and passes the descriptions to the target.
  * Enables specialized models to handle image inputs transparently without native vision support. A conceptual sketch of this flow appears after the Protocol Translation section below.

**Routing & Selection**

Model aliases can be backed by multiple providers; a selector decides which target serves each request:

^ Selector ^ Behavior ^
| ''random'' | Distribute across healthy targets (default) |
| ''in_order'' | Failover sequence |
| ''cost'' | Cheapest provider wins |
| ''performance'' | Highest tokens/sec |
| ''latency'' | Lowest time-to-first-token |

''priority: api_match'' enables pass-through for same-format providers, skipping transformation overhead.

**Protocol Translation**

Bidirectional conversion between:

  - OpenAI chat completions (''/v1/chat/completions'')
  - OpenAI responses (''/v1/responses''), with full support for stateful multi-turn conversations via ''previous_response_id'', 7-day TTL storage, and function calling
  - Anthropic messages (''/v1/messages'')
  - Google Gemini native format (''/v1beta'')

Handles streaming SSE and tool-use normalization across all formats.
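For example, a client can speak the Anthropic messages format to Plexus even when the alias is backed by an OpenAI-compatible provider. Below is a minimal sketch: the base URL, API key, and alias name ''my-alias'' are placeholders, and the exact header requirements may differ on your deployment.

<code python>
import requests

PLEXUS_URL = "http://localhost:3000"  # placeholder: your Plexus base URL
PLEXUS_KEY = "sk-..."                 # placeholder: a key issued by Plexus

# An Anthropic-format request. Plexus translates it into whatever format
# the provider backing "my-alias" actually speaks, then translates the
# response back.
resp = requests.post(
    f"{PLEXUS_URL}/v1/messages",
    headers={
        "x-api-key": PLEXUS_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "my-alias",  # placeholder: an alias defined in Plexus
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Say hello."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["content"][0]["text"])  # Anthropic-style response shape
</code>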
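The vision fallthrough flow described earlier reduces to: intercept image parts, describe them with a vision-capable model, and forward plain text to the target. The sketch below is conceptual, not Plexus's actual code; ''describe_image()'' and ''complete()'' are hypothetical stand-ins for the descriptor-model and target-model calls.

<code python>
def describe_image(image_bytes: bytes) -> str:
    """Ask a vision-capable descriptor model for a text description."""
    raise NotImplementedError  # provider call goes here

def complete(model: str, messages: list[dict]) -> str:
    """Send a text-only chat request to the target model."""
    raise NotImplementedError  # provider call goes here

def vision_fallthrough(model: str, messages: list[dict]) -> str:
    """Rewrite image parts as text descriptions so a non-vision
    target model can serve a request that contains images."""
    rewritten = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):  # multimodal content parts
            parts = []
            for part in content:
                if part.get("type") == "image":
                    desc = describe_image(part["data"])
                    parts.append({"type": "text",
                                  "text": f"[Image description: {desc}]"})
                else:
                    parts.append(part)
            msg = {**msg, "content": parts}
        rewritten.append(msg)
    return complete(model, rewritten)
</code>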
**Deep Inspection**

  * Per-request capture of full request/response bodies, transformation steps, routing decisions, and timing breakdowns.
  * Every request gets a UUID; trace the full lifecycle from ingress to provider response.
  * Stores 30 days of debug logs by default (configurable).
  * View the timing waterfall: parse → route → transform → provider roundtrip → transform → serialize.
  * See exactly which provider handled the request and why.

Pros:

  * Complete visibility into routing decisions and transformation errors.
  * Captured request/response bodies aid in debugging client issues without having to reproduce them.
  * Per-provider/model performance metrics (TTFT, TPS, total latency) help identify regressions.
  * Full error context, including stack traces, when transformations fail.

Cons:

  * Capturing full bodies increases storage significantly (roughly 10-50 KB per request, depending on payload size).
  * The default SQLite deployment grows quickly under high load; PostgreSQL is recommended for inspection-heavy use.

**Quota Enforcement**

  * Per-API-key limits by ''tokens'', ''requests'', or ''cost''.
  * Windows: rolling, daily, or weekly.
  * Hard stops when a limit is exceeded. A sketch of a rolling-window quota appears at the end of this page.

**Provider Cooldowns**

  * Automatic exponential backoff on failure: 2 min → 4 min → 8 min → ... → 5 hr cap. A sketch of this schedule appears at the end of this page.
  * Quota and Retry-After headers are parsed automatically to set cooldowns.
  * A successful request resets the failure counter.
  * Can optionally be disabled per provider.

**MCP Proxy**

Model Context Protocol server proxy with per-request session isolation, which prevents tool sprawl across clients. HTTP transport only.

**Encryption at Rest**

  * Optional AES-256-GCM encryption for API keys, OAuth tokens, and MCP headers.
  * Generate a key once with ''openssl rand -hex 32''; plaintext secrets are migrated automatically on the first boot with the key set.
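To make the encryption-at-rest scheme concrete, here is a sketch of AES-256-GCM applied to a stored secret using the Python ''cryptography'' package and a 256-bit hex key like the one ''openssl rand -hex 32'' produces. The environment variable name and storage layout are illustrative, not Plexus's actual scheme.

<code python>
# pip install cryptography
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# 64 hex characters = 32 bytes = a 256-bit key, as produced by
# `openssl rand -hex 32`. The variable name is illustrative.
key_hex = os.environ.get("ENCRYPTION_KEY")
key = bytes.fromhex(key_hex) if key_hex else AESGCM.generate_key(bit_length=256)

aesgcm = AESGCM(key)
nonce = os.urandom(12)  # 96-bit nonce; must be unique per encrypted secret

secret = b"sk-provider-api-key"  # e.g. a provider API key at rest
ciphertext = aesgcm.encrypt(nonce, secret, None)

# The nonce is stored alongside the ciphertext; both are needed to decrypt.
assert aesgcm.decrypt(nonce, ciphertext, None) == secret
</code>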
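The quota behavior described above, a hard stop over a rolling window, can be sketched as follows. This illustrates the concept only; the class and parameter names are invented, and this is not Plexus's implementation.

<code python>
import time
from collections import deque

class RollingQuota:
    """Hard-stop quota over a rolling window. unit_cost can be
    tokens, a request count of 1, or dollars."""

    def __init__(self, limit: float, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()  # (timestamp, cost) pairs inside the window
        self.used = 0.0

    def allow(self, unit_cost: float) -> bool:
        now = time.monotonic()
        # Expire usage that has rolled out of the window.
        while self.events and now - self.events[0][0] > self.window:
            self.used -= self.events.popleft()[1]
        if self.used + unit_cost > self.limit:
            return False  # hard stop: the request is rejected
        self.events.append((now, unit_cost))
        self.used += unit_cost
        return True

# e.g. 100,000 tokens per rolling 24-hour window for one API key
quota = RollingQuota(limit=100_000, window_seconds=86_400)
print(quota.allow(1_200))  # True until the limit is reached
</code>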
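Likewise, the cooldown schedule, exponential backoff from 2 minutes to a 5-hour cap with Retry-After taking precedence, can be sketched like this. The function name and header-parsing details are illustrative, not Plexus's actual code.

<code python>
import email.utils
import time

BASE = 120       # first cooldown: 2 minutes
CAP = 5 * 3600   # never back off longer than 5 hours

def cooldown_seconds(consecutive_failures: int,
                     retry_after: str | None = None) -> int:
    """2 min -> 4 min -> 8 min -> ... capped at 5 h. A provider-supplied
    Retry-After header overrides the computed schedule."""
    if retry_after is not None:
        if retry_after.isdigit():  # delta-seconds form
            return min(int(retry_after), CAP)
        # HTTP-date form
        when = email.utils.parsedate_to_datetime(retry_after)
        return min(max(0, int(when.timestamp() - time.time())), CAP)
    return min(BASE * 2 ** (consecutive_failures - 1), CAP)

assert cooldown_seconds(1) == 120    # 2 min
assert cooldown_seconds(3) == 480    # 8 min
assert cooldown_seconds(20) == CAP   # capped at 5 h
</code>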