I will build and deploy a stateful backend (an orchestrator) for a multimodal AI workspace. It handles document ingestion and retrieval for retrieval-augmented generation (RAG) and routes heavy multimodal requests to specialized services. The system stores conversation, job, feedback, and media state in PostgreSQL, uses pgvector for similarity search, and persists uploaded files and generated assets in MinIO. Long-running tasks run through a job-oriented architecture. Over HTTP, the orchestrator integrates with image generation (SDXL), video generation (WAN 2.2 with ComfyUI workflows and a structured VideoPlan), speech-to-text (Whisper), and text-to-speech with voice cloning (OpenVoice V2), as well as an MCP-compatible tool server for externalized tool execution.
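
To make the job-oriented routing idea concrete, here is a minimal Python sketch under stated assumptions, not the actual implementation. The task names, the service base URLs, the `Job` fields, and the `Orchestrator` class are all hypothetical. An in-memory dictionary stands in for the PostgreSQL job table, and an injected `send` callable stands in for the HTTP client, so the example runs without any services.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, Optional
import uuid


class JobStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Hypothetical task names and base URLs; the real hosts and ports are not
# specified in the description above.
SERVICE_URLS: Dict[str, str] = {
    "image": "http://sdxl:8001",       # SDXL image generation
    "video": "http://wan-video:8002",  # WAN 2.2 / ComfyUI video generation
    "stt": "http://whisper:8003",      # Whisper speech-to-text
    "tts": "http://openvoice:8004",    # OpenVoice V2 text-to-speech / cloning
    "tool": "http://mcp-tools:8005",   # MCP-compatible tool server
}


@dataclass
class Job:
    """One long-running unit of work (assumed shape, for illustration)."""
    task: str
    payload: dict
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: JobStatus = JobStatus.QUEUED
    result: Optional[dict] = None
    error: Optional[str] = None


class Orchestrator:
    """Routes jobs to services and tracks their state in memory."""

    def __init__(self, send: Callable[[str, dict], dict]) -> None:
        self._send = send               # stand-in for an HTTP POST
        self._jobs: Dict[str, Job] = {}  # stand-in for the PostgreSQL job table

    def submit(self, task: str, payload: dict) -> Job:
        # Reject unknown task types before any state is stored.
        if task not in SERVICE_URLS:
            raise ValueError(f"unknown task type: {task!r}")
        job = Job(task=task, payload=payload)
        self._jobs[job.id] = job
        return job

    def run(self, job_id: str) -> Job:
        # In a real deployment a background worker would call this.
        job = self._jobs[job_id]
        job.status = JobStatus.RUNNING
        try:
            job.result = self._send(SERVICE_URLS[job.task], job.payload)
            job.status = JobStatus.SUCCEEDED
        except Exception as exc:
            # Record the failure on the job instead of crashing the worker.
            job.error = str(exc)
            job.status = JobStatus.FAILED
        return job
```

Separating `submit` from `run` is what lets a client receive a job id immediately and poll for status later, which is one common way to keep slow image, video, or audio requests from blocking an HTTP call.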