Plexus

A unified LLM API gateway handling protocol translation, failover, and usage tracking.

Currently the most practical way to use multiple AI providers through a single endpoint without rewriting client code per API format. Routes OpenAI, Anthropic, Gemini, and any OpenAI-compatible provider through one interface.

Also supports OAuth-backed providers (GitHub Copilot, Claude, Codex, Gemini CLI) without API key management. Currently the only gateway with built-in vision fallthrough for non-vision models.
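Because the gateway accepts standard provider formats, client code stays provider-agnostic. A minimal sketch of an OpenAI-style request payload, assuming a local deployment at `localhost:8080` and a hypothetical model alias `fast-chat` (both are illustrative, not defaults):

```python
import json

# Hypothetical gateway endpoint; adjust for your deployment.
GATEWAY_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(model_alias: str, prompt: str) -> dict:
    """Build a plain OpenAI-style chat completion payload.

    Plexus resolves the alias and translates the format, so the same
    payload can be served by an OpenAI, Anthropic, or Gemini backend.
    """
    return {
        "model": model_alias,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("fast-chat", "Hello!")
print(json.dumps(payload))
```

The client never learns which provider answered; routing happens entirely behind the alias.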

Pros:

  • Routes between providers automatically with configurable selectors.
  • Converts API formats bidirectionally — send Anthropic-style requests to OpenAI backends and vice versa.
  • Vision fallthrough lets cheap models handle images via automatic text description.
  • Per-request cost tracking with quota enforcement by tokens, requests, or dollars. Web UI for configuration.
  • Best available conversion accuracy: exceeds the translation fidelity of LiteLLM, AxonHub, and similar tools.

Cons:

  • Adds ~20-50ms latency vs direct provider calls.
  • Self-hosted means you manage uptime.
  • SQLite default won’t scale past single-node; PostgreSQL required for production.

Vision Fallthrough

  • Unique capability allowing vision aliases to work with non-vision backends.
  • Intercepts images, sends to a descriptor model (Gemini Flash, GPT-5.3-Codex) for text conversion, passes descriptions to target.
  • Enables specialized models to handle image inputs transparently without native vision support.
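The interception step can be sketched as follows; the function names and message shape are illustrative (OpenAI-style multimodal content parts), not Plexus's internal API:

```python
def describe_image(image_url: str) -> str:
    """Stand-in for the descriptor model call (e.g. Gemini Flash)."""
    return f"[Image description of {image_url}]"

def vision_fallthrough(messages: list) -> list:
    """Replace image parts with text descriptions so a non-vision
    target model can consume the conversation unchanged."""
    out = []
    for msg in messages:
        content = msg["content"]
        if isinstance(content, list):  # multimodal content parts
            parts = []
            for part in content:
                if part.get("type") == "image_url":
                    parts.append({
                        "type": "text",
                        "text": describe_image(part["image_url"]["url"]),
                    })
                else:
                    parts.append(part)
            msg = {**msg, "content": parts}
        out.append(msg)
    return out
```

After this pass every content part is text, so the request is valid for any chat-capable backend.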

Routing & Selection

Model aliases backed by multiple providers:

Selector     Behavior
-----------  -------------------------------------------
random       Distribute across healthy targets (default)
in_order     Failover sequence
cost         Cheapest provider wins
performance  Highest tokens/sec
latency      Lowest time-to-first-token

Setting `priority: api_match` enables pass-through for same-format providers, skipping transformation overhead.
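The selector behaviors in the table above can be sketched as pure functions over a target list; the field names (`cost_per_token`, `tokens_per_sec`, `ttft_ms`, `healthy`) are assumptions for illustration:

```python
import random

def select(targets: list, selector: str) -> dict:
    """Pick one backend target according to the configured selector."""
    healthy = [t for t in targets if t.get("healthy", True)]
    if selector == "random":        # default: spread load across healthy targets
        return random.choice(healthy)
    if selector == "in_order":      # failover: first healthy target wins
        return healthy[0]
    if selector == "cost":          # cheapest provider wins
        return min(healthy, key=lambda t: t["cost_per_token"])
    if selector == "performance":   # highest throughput
        return max(healthy, key=lambda t: t["tokens_per_sec"])
    if selector == "latency":       # lowest time-to-first-token
        return min(healthy, key=lambda t: t["ttft_ms"])
    raise ValueError(f"unknown selector: {selector}")
```

Note that every selector filters to healthy targets first, which is how cooldowns (below) interact with routing.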

Protocol Translation

Bidirectional conversion between:

  1. OpenAI chat completions (/v1/chat/completions)
  2. OpenAI responses (/v1/responses), with full support: stateful multi-turn via previous_response_id, 7-day TTL storage, function calling.
  3. Anthropic messages (/v1/messages)
  4. Google Gemini native format (/v1beta)

Handles streaming SSE and tool use normalization across all formats.
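A simplified sketch of one translation direction (Anthropic messages to OpenAI chat completions) shows the shape of the work; the real converter also handles tool use, multimodal parts, and streaming SSE:

```python
def anthropic_to_openai(req: dict) -> dict:
    """Convert an Anthropic /v1/messages request body into an
    OpenAI /v1/chat/completions request body (happy path only)."""
    messages = []
    if "system" in req:  # Anthropic carries the system prompt top-level
        messages.append({"role": "system", "content": req["system"]})
    messages.extend(req["messages"])  # role/content shapes align in the simple case
    return {
        "model": req["model"],
        "messages": messages,
        "max_tokens": req.get("max_tokens"),
    }
```

The reverse direction has to undo this: lift the system message back out of the list into the top-level `system` field.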

Deep Inspection

  • Per-request capture of full request/response bodies, transformation steps, routing decisions, and timing breakdowns.
  • Every request gets a UUID; trace the full lifecycle from ingress to provider response.
  • Stores 30 days of debug logs by default (configurable).
  • View timing waterfall: parse → route → transform → provider roundtrip → transform → serialize.
  • See exactly which provider handled the request and why.
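The bullets above suggest a trace record shaped roughly like the following; the field names and stage keys are assumptions, not Plexus's actual schema:

```python
import uuid

def build_trace(stage_times_ms: dict) -> dict:
    """Summarize per-stage timings into a single trace record,
    keyed by the request's UUID."""
    return {
        "request_id": str(uuid.uuid4()),
        "stages": stage_times_ms,               # waterfall breakdown
        "total_ms": sum(stage_times_ms.values()),
    }

trace = build_trace({
    "parse": 0.4, "route": 0.2, "transform_in": 1.1,
    "provider": 812.0, "transform_out": 0.9, "serialize": 0.3,
})
```

In practice the provider roundtrip dominates the total, which is why the per-stage breakdown matters: it separates gateway overhead from backend latency.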

Pros:

  • Complete visibility into routing decisions and transformation errors.
  • Request/response bodies aid debugging client issues without reproducing.
  • Performance metrics (TTFT, TPS, total latency) per provider/model help identify regressions.
  • Full error context including stack traces when transformations fail.

Cons:

  • Capturing full bodies increases storage significantly — ~10-50KB per request depending on payload size.
  • Default SQLite deployment will grow quickly under high load; PostgreSQL recommended for inspection-heavy use.

Quota Enforcement

  • Per-API-key limits: `tokens`, `requests`, or `cost`.
  • Windows: rolling, daily, weekly.
  • Hard stops when exceeded.
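A minimal rolling-window quota check might look like the following; the class shape is an assumption, but the semantics match the bullets above (count tokens, requests, or cost against a window, hard stop when exceeded):

```python
import time
from typing import Optional

class Quota:
    """Rolling-window limit on an arbitrary amount (tokens, requests, or dollars)."""

    def __init__(self, limit: float, window_sec: float):
        self.limit = limit
        self.window = window_sec
        self.events = []  # list of (timestamp, amount)

    def allow(self, amount: float, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        cutoff = now - self.window
        self.events = [(t, a) for t, a in self.events if t > cutoff]
        used = sum(a for _, a in self.events)
        if used + amount > self.limit:
            return False  # hard stop: request is rejected outright
        self.events.append((now, amount))
        return True
```

For `requests`-kind limits the amount is simply 1 per call; for `cost` it is the computed dollar price of the request.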

Provider Cooldowns

  • Automatic exponential backoff on failure: 2 min → 4 min → 8 min → … → 5 hr cap.
  • Automatic parsing of quota and Retry-After headers to set cooldowns.
  • Successful requests reset counter.
  • Optionally disable per-provider.
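The backoff schedule above (2 min base, doubling per consecutive failure, 5 hr cap) reduces to a one-line formula:

```python
BASE_SEC = 120        # 2 minutes
CAP_SEC = 5 * 3600    # 5 hour cap

def cooldown(failures: int) -> int:
    """Seconds to keep a provider on cooldown after its nth
    consecutive failure (failures >= 1)."""
    return min(BASE_SEC * 2 ** (failures - 1), CAP_SEC)
```

A successful request resets the failure counter, so the next failure starts again at 2 minutes.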

MCP Proxy

Model Context Protocol server proxy with per-request session isolation.

  • HTTP transport only.
  • Session isolation keeps each client's tool set separate, preventing tool sprawl across clients.

Encryption at Rest

  • Optional AES-256-GCM for API keys, OAuth tokens, MCP headers.
  • Generate a key once with `openssl rand -hex 32`; existing plaintext secrets migrate automatically on first boot after the key is set.