A real-time voice-to-voice conversation system that enables natural, fluid interactions with AI. The system transcribes human speech, understands context, and responds with natural-sounding voice, all in real time.
voice_agent_demo.mp4
- Event-Driven: The architecture is built around a stream of events: receiving audio chunks, processing them, and responding in real time.
- Real-Time: Designed to minimize latency so interactions feel conversational.
- Audio Processing: Handles audio streams, transcription, LLM responses, and TTS generation efficiently.
- Redis Queue Integration: Utilizes Redis queues for user-specific message handling, ensuring organized processing of transcriptions and responses. Each user gets their dedicated queue, preventing message mixing across different conversations.
- FIFO Processing: Maintains strict First-In-First-Out order for each user's responses, ensuring conversational coherence and natural dialogue flow.
- Stateful Processing: Redis queues maintain conversation state and message order per user, allowing for context-aware responses and proper sentence sequencing during TTS generation.
- Lightweight: Redis's in-memory nature provides extremely low latency for queue operations while maintaining message persistence.
- User Isolation: Dedicated queues per user ensure that concurrent conversations remain isolated and don't interfere with each other's processing flow.
- Sentence-Level Processing: Smart sentence boundary detection for natural speech synthesis.
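The sentence-level processing above can be sketched as a small collector that buffers streamed LLM chunks and emits a sentence whenever it sees terminal punctuation. The class and method names here are illustrative, not the repository's actual API:

```python
import re

# A sentence ends at ., !, or ? followed by whitespace or end of buffer.
SENTENCE_END = re.compile(r'([.!?])(\s|$)')

class SentenceCollector:
    """Buffers streamed LLM chunks and emits complete sentences."""

    def __init__(self):
        self.buffer = ""

    def feed(self, chunk: str) -> list[str]:
        """Append a streamed chunk and return any complete sentences."""
        self.buffer += chunk
        sentences = []
        while True:
            match = SENTENCE_END.search(self.buffer)
            if not match:
                break
            end = match.end(1)  # include the punctuation mark
            sentences.append(self.buffer[:end].strip())
            self.buffer = self.buffer[end:]
        return sentences

collector = SentenceCollector()
out = []
for chunk in ["Hello the", "re! How can", " I help you today?"]:
    out.extend(collector.feed(chunk))
print(out)  # → ['Hello there!', 'How can I help you today?']
```

Emitting per sentence rather than per token lets TTS start speaking the first sentence while the LLM is still generating the rest of the reply.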
- Python 3.9+
- API keys for:
- OpenAI
- Groq
- Deepgram
- Clone the repository:
git clone https://github.com/spandan114/AI-realtime-voice-agent.git
cd AI-realtime-voice-agent
- Create and activate virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Create a `.env` file:
OPENAI_API_KEY=your_openai_key
GROQ_API_KEY=your_groq_key
DEEPGRAM_API_KEY=your_deepgram_key
REDIS_HOST="localhost"
REDIS_PORT="6379"
- Start the server:
uvicorn main:app --reload
- The API will be available at:
- WebSocket: ws://localhost:8000/ws
- REST API: http://localhost:8000/
- Start frontend:
cd frontend
npm install
npm run dev
- The UI will be available at: http://127.0.0.1:5173/
Or run everything with Docker:
docker compose up
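For reference, a minimal compose file for this stack might look like the sketch below. The repository's own docker-compose.yml is authoritative; the service names, build paths, and ports here are assumptions:

```yaml
# Illustrative only: check the repository's docker-compose.yml for the real setup.
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  backend:
    build: .
    env_file: .env
    environment:
      REDIS_HOST: redis   # inside the compose network, the hostname is the service name
    ports:
      - "8000:8000"
    depends_on:
      - redis
  frontend:
    build: ./frontend
    ports:
      - "5173:5173"
```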
sequenceDiagram
participant Client as Frontend Client
participant WS as WebSocket Server
participant DG as Deepgram API
participant LLM as LLM Service
participant SC as Sentence Collector
participant RQ as Redis Queue
participant Worker as Queue Worker
participant TTS as TTS Generator
note over Client,TTS: Voice Processing Flow
loop Audio Streaming
Client->>+WS: Stream microphone chunks
WS->>+DG: Forward audio chunks
DG->>-WS: Return real-time transcript
alt 1 second pause detected
WS->>+LLM: Send transcript
LLM-->>-SC: Stream response chunks
loop Sentence Formation
SC->>SC: Collect and check for<br/>complete sentence
alt Complete sentence detected
SC->>RQ: Push sentence to queue
end
end
end
end
loop Queue Processing
Worker->>RQ: Check for new sentences
alt Queue not empty
RQ-->>Worker: Return next sentence
alt No ongoing TTS processing
Worker->>+TTS: Generate audio
TTS-->>-Client: Stream audio chunks
else TTS in progress
Worker->>Worker: Wait for current<br/>TTS to complete
end
end
end
note over Client,TTS: FIFO order maintained for sentence processing
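The queue-processing loop in the diagram can be sketched in plain Python. This sketch uses an in-memory dict of deques as a stand-in for per-user Redis lists (the real system would enqueue with RPUSH and dequeue with BLPOP), and `tts_generate` is a placeholder, not the project's actual TTS call:

```python
from collections import deque

# Stand-in for per-user Redis lists: one FIFO queue per user_id.
queues: dict[str, deque] = {}

def push_sentence(user_id: str, sentence: str) -> None:
    """Enqueue a completed sentence (RPUSH equivalent)."""
    queues.setdefault(user_id, deque()).append(sentence)

def tts_generate(sentence: str) -> bytes:
    """Placeholder for the real TTS call; returns fake audio bytes."""
    return f"<audio:{sentence}>".encode()

def drain_queue(user_id: str) -> list[bytes]:
    """Process one user's sentences strictly in FIFO order."""
    audio_chunks = []
    q = queues.get(user_id, deque())
    while q:  # BLPOP equivalent: always take the oldest sentence first
        sentence = q.popleft()
        audio_chunks.append(tts_generate(sentence))  # one TTS job at a time
    return audio_chunks

push_sentence("user-1", "Hello there!")
push_sentence("user-1", "How can I help?")
push_sentence("user-2", "Unrelated conversation.")
print(drain_queue("user-1"))  # user-2's queue is untouched
```

Keeping one queue per user is what gives the isolation described earlier: draining `user-1` never touches `user-2`'s pending sentences.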
- Handles microphone input capture
- Streams audio chunks to WebSocket server
- Plays received audio responses
- Manages WebSocket connections
- Routes audio chunks to Deepgram API
- Handles real-time communication
- Integrates with Deepgram for real-time transcription
- Implements 1-second pause detection
- Processes transcripts through LLM
- Collects and validates complete sentences
- Redis-based FIFO queue
- Ensures ordered processing of responses
- Manages TTS processing states
- Generates audio from text responses
- Streams audio chunks back to frontend
- Maintains sequential processing
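The 1-second pause detection mentioned above can be sketched as follows. `PauseDetector` and its method names are illustrative, not the repository's actual classes; the idea is simply that the transcript is handed to the LLM only after no new words have arrived for the pause threshold:

```python
from typing import Optional

PAUSE_SECONDS = 1.0  # assumed threshold, matching the 1-second pause in the flow

class PauseDetector:
    """Accumulates transcript text and flushes it after a silence threshold."""

    def __init__(self, pause_seconds: float = PAUSE_SECONDS):
        self.pause_seconds = pause_seconds
        self.last_word_at: Optional[float] = None
        self.transcript = ""

    def on_transcript(self, text: str, now: float) -> None:
        """Record incoming transcript text and the time it arrived."""
        self.transcript += text
        self.last_word_at = now

    def utterance_if_paused(self, now: float) -> Optional[str]:
        """Return the finished utterance once the pause threshold elapses."""
        if self.last_word_at is None or now - self.last_word_at < self.pause_seconds:
            return None
        utterance, self.transcript, self.last_word_at = self.transcript, "", None
        return utterance.strip()

detector = PauseDetector()
detector.on_transcript("What's the weather ", now=0.0)
detector.on_transcript("like today?", now=0.4)
print(detector.utterance_if_paused(now=0.9))   # None (still speaking)
print(detector.utterance_if_paused(now=1.5))   # What's the weather like today?
```

Passing `now` explicitly (instead of calling a clock inside the class) keeps the logic deterministic and easy to test; in production the server would pass its own monotonic timestamps.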
- The nested WebSocket connection to Deepgram can cause scalability issues.
- Socket connection rate limits can become a problem at scale.
- Fork the repository
- Create a feature branch:
git checkout -b feature-name
- Commit changes:
git commit -am 'Add feature'
- Push to branch:
git push origin feature-name
- Submit a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Credit to libraries and services used
- Community contributions
LinkedIn - @Spandan Joshi
Project Link: https://github.com/spandan114/AI-realtime-voice-agent