Most WhatsApp automation only handles text. But real customers send voice notes, images, and mixed messages. A standard chatbot breaks down the moment someone sends a 30-second voice note instead of typing. The goal: build a WhatsApp AI system that handles both text and voice seamlessly.
What I Built
A WhatsApp AI automation workflow in n8n that processes both text and voice messages, with speech-to-text conversion and intelligent AI responses.
Key features:
Voice message processing: automatically converts incoming voice notes to text using speech-to-text
Text message handling with context-aware AI responses
Unified conversation flow regardless of whether the customer types or speaks
Automatic AI reply generation using OpenAI
n8n workflow that ties together WhatsApp, speech processing, and AI response generation
The Tech Stack
n8n for end-to-end workflow automation
OpenAI for language understanding, response generation, and speech-to-text
WhatsApp Business API for receiving and sending messages
The Result
Customers can now communicate however they prefer: typing or talking. Voice notes get transcribed and processed just like text messages, and the AI responds intelligently to both. No more "please type your question instead" limitations.
Like this project
Posted Jun 20, 2026
Built a WhatsApp AI automation in n8n that handles text and voice messages, converts speech to text, generates AI replies, and responds automatically.