
Architecture

VoiceUse is built around a modular pipeline architecture with clear separation between voice input, LLM reasoning, system control, and output.

System Overview

```mermaid
flowchart LR
    subgraph Input[Input Layer]
        Hotkey[Hotkey / Wake Word]
        VAD[Voice Activity Detection]
        STT[Groq Whisper STT]
    end
    subgraph Brain[Brain Layer]
        LLM[LLM Orchestrator]
        Tools[Tool Registry]
        Safety[Safety Guard]
    end
    subgraph Output[Output Layer]
        TTS[edge-tts / pyttsx3]
        Speaker[Speaker]
    end
    subgraph System[System Layer]
        OS[OS Controller]
        Vision[Vision Bridge]
        Desktop[Desktop]
    end
    Hotkey --> VAD
    VAD --> STT
    STT --> LLM
    LLM --> Tools
    Tools --> Safety
    Safety --> OS
    Safety --> Vision
    OS --> Desktop
    Vision --> Desktop
    LLM --> TTS
    TTS --> Speaker
```

Pipeline Flow

  1. Input — User activates via hotkey or wake word. Audio is captured until release/silence.
  2. STT — Groq Whisper transcribes speech to text asynchronously.
  3. Reasoning — LLM plans actions using tool schemas, desktop context, and conversation history.
  4. Safety — Destructive actions trigger spoken confirmation.
  5. Execution — Tools dispatch to OS Controller or Vision Bridge.
  6. Response — TTS speaks the result summary.
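The six steps above can be sketched as a single async pass. This is an illustrative outline, not VoiceUse's actual API: the `ToolCall` shape, the callable parameters, and the `DESTRUCTIVE` set are all hypothetical stand-ins for the real components.

```python
import asyncio
from dataclasses import dataclass

# Hypothetical representation of one planned action.
@dataclass
class ToolCall:
    name: str
    args: dict

# Assumed set of tools that require spoken confirmation (step 4).
DESTRUCTIVE = {"run_system_command"}

async def run_pipeline(audio: bytes, stt, brain, safety, tools, tts) -> list[str]:
    """One pass through the pipeline: STT -> plan -> safety -> execute -> speak."""
    text = await stt(audio)                    # 2. STT (Groq Whisper)
    calls = await brain(text)                  # 3. Reasoning -> planned tool calls
    executed = []
    for call in calls:
        if call.name in DESTRUCTIVE and not await safety(call):
            continue                           # 4. Safety: skip unconfirmed actions
        executed.append(await tools(call))     # 5. Execution via the tool registry
    await tts(f"Done: {', '.join(executed)}")  # 6. Response summary via TTS
    return executed
```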

Core Components

InputManager

Handles all user input activation: hotkey or wake word triggers capture, and audio is recorded until release or silence. Runs audio work off the main async loop to prevent blocking.
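One common way to keep blocking audio capture off the event loop is `run_in_executor`. The sketch below assumes this pattern; `record_blocking` is a hypothetical placeholder for a real microphone read (e.g. via sounddevice or pyaudio), not VoiceUse's actual code.

```python
import asyncio

def record_blocking(seconds: float) -> bytes:
    # Placeholder for a blocking microphone read at 16 kHz mono.
    return b"\x00" * int(16000 * seconds)

async def capture(seconds: float) -> bytes:
    loop = asyncio.get_running_loop()
    # run_in_executor moves the blocking call to a worker thread, so
    # hotkey and wake-word handling on the event loop stay responsive.
    return await loop.run_in_executor(None, record_blocking, seconds)
```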

Brain

The LLM orchestrator. It plans actions from tool schemas, desktop context, and conversation history, and routes the resulting tool calls through the Safety Guard before execution.
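A sketch of how an orchestrator like this might assemble a request from those three inputs, assuming OpenAI-style chat messages and function-calling schemas (which Groq's API accepts). The function name and payload shape are illustrative, not VoiceUse's actual interface.

```python
def build_request(user_text: str, history: list[dict],
                  desktop_context: str, tool_schemas: list[dict]) -> dict:
    """Assemble one chat-completions payload from context, history, and tools."""
    messages = [
        # Desktop context (focused window, open apps, ...) goes in the system prompt.
        {"role": "system",
         "content": f"You control the user's desktop.\nContext: {desktop_context}"},
        *history,                                  # prior conversation turns
        {"role": "user", "content": user_text},    # the new transcribed utterance
    ]
    return {"messages": messages, "tools": tool_schemas, "tool_choice": "auto"}
```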

Tool Registry

Shared tool schemas and dispatch used by both Brain and plugins:

| Tool | Purpose |
| --- | --- |
| `open_app` | Launch or focus applications |
| `focus_window` | Bring window to front |
| `type_text` | Simulate keyboard input |
| `press_key` | Press specific keys |
| `click_element` | Vision-based UI clicking |
| `take_screenshot` | Capture screen/window |
| `run_system_command` | Execute shell commands |
| `open_url` | Open URLs in browser |
| `search_web` | Web search |
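A minimal sketch of what a shared registry could look like: each entry pairs a function-calling schema (for the Brain) with a handler (for dispatch). The class and method names are illustrative, not VoiceUse's actual API.

```python
from typing import Callable

class ToolRegistry:
    """Holds tool schemas plus a dispatch table, shared by Brain and plugins."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[dict, Callable]] = {}

    def register(self, name: str, description: str, handler: Callable) -> None:
        schema = {"type": "function",
                  "function": {"name": name, "description": description}}
        self._tools[name] = (schema, handler)

    def schemas(self) -> list[dict]:
        # Passed to the LLM so it can plan tool calls.
        return [schema for schema, _ in self._tools.values()]

    def dispatch(self, name: str, **kwargs):
        # Called after the Safety Guard approves the action.
        _, handler = self._tools[name]
        return handler(**kwargs)

registry = ToolRegistry()
registry.register("open_app", "Launch or focus applications",
                  lambda app: f"opened {app}")  # placeholder handler
```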

OSController

Cross-platform desktop control facade: a single interface over per-OS backends for operations such as launching apps, focusing windows, and simulating keyboard and mouse input.

VisionBridge

Closed-loop computer vision for UI interaction:

  1. Capture screenshot of target window/monitor
  2. Send to vision provider (Codex CLI or Anthropic)
  3. Receive action JSON (click coordinates, key presses)
  4. Execute action via OSController
  5. Re-capture screenshot and repeat up to 5 steps

This observe-act loop lets the system recover from popups, loading delays, and misclicks by re-checking the screen after every action.
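The five steps above reduce to a bounded observe-act loop. In this sketch, `screenshot`, `ask_vision`, and `execute` are hypothetical callables standing in for the real capture, provider, and OSController calls, and the action JSON shape is assumed.

```python
def vision_loop(goal: str, screenshot, ask_vision, execute, max_steps: int = 5):
    """Observe, act, re-observe: up to max_steps iterations (step 5 above)."""
    actions = []
    for _ in range(max_steps):
        frame = screenshot()               # 1. capture target window/monitor
        action = ask_vision(goal, frame)   # 2-3. provider returns action JSON
        if action.get("type") == "done":   # assumed completion signal
            break
        execute(action)                    # 4. act via OSController
        actions.append(action)             # loop re-captures on next iteration
    return actions
```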

TTSManager

Multi-backend text-to-speech (edge-tts or pyttsx3) with a playback queue, so queued utterances play in order without overlapping.
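A queue-backed speaker can be sketched with `asyncio.Queue`: producers enqueue text, a single worker drains it so playback never overlaps. The class shape is illustrative, and `speak_backend` stands in for an edge-tts or pyttsx3 call.

```python
import asyncio

class TTSManager:
    """Serializes speech: one worker drains the queue, so playback never overlaps."""

    def __init__(self, speak_backend):
        self._queue: asyncio.Queue[str | None] = asyncio.Queue()
        self._backend = speak_backend  # e.g. an edge-tts or pyttsx3 wrapper

    async def say(self, text: str) -> None:
        await self._queue.put(text)    # returns immediately; worker speaks later

    async def close(self) -> None:
        await self._queue.put(None)    # sentinel: stop the worker

    async def run(self) -> None:
        while True:
            text = await self._queue.get()
            if text is None:
                break
            await self._backend(text)  # only one utterance plays at a time
```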

MCP Integration

VoiceUse exposes desktop control tools via MCP (Model Context Protocol):

```bash
# Register with Codex CLI
codex mcp add voiceuse-computer-control -- voiceuse-computer-control-mcp
```

This enables Codex and other MCP-capable agents to control the desktop through VoiceUse's OS layer.

Plugin Architecture

Plugins replace the default pipeline while sharing core services:

```mermaid
flowchart TB
    subgraph Shared[Shared Core]
        ToolR[Tool Registry]
        SafetyG[Safety Guard]
        Audit[Action Audit]
        OSC[OS Controller]
        VisionB[Vision Bridge]
        AudioD[Audio Device]
    end
    subgraph Default[Default Pipeline]
        InputM[InputManager]
        BrainL[Brain]
        TTSM[TTSManager]
    end
    subgraph GrokP[Grok Voice Plugin]
        GrokVP[GrokVoicePlugin]
        XAIClient[XAIRealtimeClient]
        Streamer[GrokAudioStreamer]
    end
    Default --> Shared
    GrokP --> Shared
```

The Grok Voice plugin demonstrates this: it replaces STT→Brain→TTS with a single xAI Realtime WebSocket while using the same tools, safety, and OS layers.
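The plugin seam can be sketched as a small protocol: a plugin receives the shared core services at start-up and provides its own pipeline on top of them. All names below (`CoreServices`, the `Plugin` protocol, the sketch class) are hypothetical, not VoiceUse's actual plugin API.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class CoreServices:
    """The shared layer from the diagram: tools, safety, OS, vision."""
    tools: object
    safety: object
    os_controller: object
    vision: object

class Plugin(Protocol):
    def start(self, core: CoreServices) -> None: ...
    def stop(self) -> None: ...

class GrokVoicePluginSketch:
    """Replaces STT -> Brain -> TTS with one realtime socket, reusing core services."""

    def start(self, core: CoreServices) -> None:
        self.core = core      # same tool registry, safety guard, and OS layer
        self.running = True   # the real plugin would open the xAI Realtime WebSocket here

    def stop(self) -> None:
        self.running = False
```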