Voice RAG

  • Using the NoSQL API? Please refer to our related guide on creating an AI Assistant powered by Azure OpenAI services integrated with the Azure Cosmos DB for NoSQL API. Learn how to leverage Azure Cosmos DB's flexible schema, powerful querying capabilities, and global distribution while building intelligent conversational experiences.
  • Voice RAG: Real-time intelligent communication between human and AI, utilizing human-like voices for seamless interactions.
  • Real-time AI Assistant: Delivers intelligent responses on your own data instantly by using Azure OpenAI services, seamlessly integrated with Azure Cosmos DB for MongoDB.
  • Microsoft and Cazton: We work closely with OpenAI, Azure OpenAI and many other Microsoft teams. Thanks to Microsoft for providing us with very early access to critical technologies. We are fortunate to have been working on GPT-3 since 2020, a couple of years before ChatGPT was launched.
  • Top clients: At Cazton, we help Fortune 500, large, mid-size and startup companies with Big Data, AI and ML, GenAI, OpenAI and open source AI models, custom software development, deployment (MLOps), consulting, recruiting services and hands-on training services. Our clients include Microsoft, Broadcom, Thomson Reuters, Bank of America, Macquarie, Dell and more.
 

Introduction:

We are excited to announce the open-sourcing of our Real-Time AI Assistant, which leverages the Azure OpenAI GPT-4o Realtime API and Azure Cosmos DB for rapid data queries and intelligent, real-time interactions. Building on Microsoft’s original framework, we’ve adapted the solution to use Azure Cosmos DB for MongoDB instead of Azure AI Search, ensuring optimized data management and retrieval.

Our solution integrates human-like AI voice responses, allowing users to interact naturally using voice commands. This release includes both the source code and a video demo, showcasing how the assistant provides real-time answers to spoken queries with AI-generated speech.

 

Video Demo: AI-Driven Voice Interaction

This embedded video showcases the AI assistant in action, featuring:

  • Voice-based conversations between a user and the AI assistant.
  • AI-generated responses that sound remarkably human.
  • Real-time query execution with results retrieved from Cosmos DB.
 

How It Works

Our solution features multiple key components, all working together to provide efficient, real-time query capabilities. Let’s explore the major files that power the system.

The Foundation of the AI System (app.py)

    • Loads environment variables and connects to Azure services like Cosmos DB and OpenAI.
    • Configures the AI assistant’s behavior for processing incoming queries.
    • Serves both static files and dynamic API endpoints.

The AI assistant uses retrieved knowledge to answer questions in real-time, making it ideal for dynamic business applications.
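
To make that wiring concrete, here is a minimal sketch of the startup path, assuming the environment variable names from the tutorial below; the static directory, route layout, and exact structure of app.py in the repository may differ.

    import os
    from pathlib import Path

    from aiohttp import web            # serves static files and the dynamic API endpoints
    from dotenv import load_dotenv     # pip install python-dotenv
    from pymongo import MongoClient    # pip install pymongo

    load_dotenv()  # read the .env file described in the tutorial below

    # Azure Cosmos DB for MongoDB: the collection that stores the document embeddings.
    mongo_client = MongoClient(os.environ["MONGO_CONNECTION_STRING"])
    collection = mongo_client[os.environ["MONGO_DB_NAME"]][os.environ["MONGO_COLLECTION_NAME"]]

    # Azure OpenAI settings consumed by the realtime voice middle tier.
    openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
    openai_deployment = os.environ["AZURE_OPENAI_DEPLOYMENT"]
    openai_api_key = os.environ["AZURE_OPENAI_API_KEY"]

    # One aiohttp application serves the frontend and the dynamic endpoints.
    static_dir = Path(__file__).parent / "static"   # assumed location of the built frontend
    app = web.Application()
    app.add_routes([web.get("/", lambda _: web.FileResponse(static_dir / "index.html"))])
    app.router.add_static("/", path=static_dir, name="static")

    if __name__ == "__main__":
        web.run_app(app, host="localhost", port=8765)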

Advanced Document Retrieval and Processing (ragtools.py)

    • Chunking and Overlap: Large documents are broken into manageable 1,000-character chunks with 150-character overlaps to preserve context across sections.
    • HNSW Vector Search: Uses HNSW (Hierarchical Navigable Small World) indexing for fast approximate nearest-neighbor searches, keeping queries quick and accurate even across large datasets.
    • Azure OpenAI Embeddings: Each chunk is embedded with an Azure OpenAI embedding model. At query time, the query embedding is compared against the stored document vectors to find the most relevant chunks, so retrieval is driven by semantic meaning rather than keywords alone. Retrieved chunks include their metadata, which contextualizes the results before they are formatted for AI-driven responses.

Tip: If you prefer not to use HNSW, or your Azure Cosmos DB for MongoDB cluster tier does not support it, update the search type in ragtools.py to CosmosDBVectorSearchType.VECTOR_IVF.
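
As a rough illustration of this retrieval pipeline, the sketch below uses the LangChain Azure Cosmos DB integration that the tip's CosmosDBVectorSearchType belongs to. The document loader, embedding deployment name, API version, and index parameters are assumptions for the example, not the exact contents of ragtools.py.

    import os

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import DirectoryLoader, TextLoader
    from langchain_community.vectorstores.azure_cosmos_db import (
        AzureCosmosDBVectorSearch,
        CosmosDBSimilarityType,
        CosmosDBVectorSearchType,
    )
    from langchain_openai import AzureOpenAIEmbeddings
    from pymongo import MongoClient

    # Load raw documents from ./data/ and split them into 1,000-character chunks
    # with a 150-character overlap so context carries across chunk boundaries.
    raw_documents = DirectoryLoader("./data/", glob="**/*.txt", loader_cls=TextLoader).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    chunks = splitter.split_documents(raw_documents)

    # Embed each chunk with an Azure OpenAI embedding model (deployment name and
    # API version are placeholders; the endpoint and key come from your settings).
    embeddings = AzureOpenAIEmbeddings(
        azure_deployment="text-embedding-ada-002",
        openai_api_version="2023-05-15",
    )

    # Store the embedded chunks in the Cosmos DB for MongoDB collection.
    client = MongoClient(os.environ["MONGO_CONNECTION_STRING"])
    collection = client[os.environ["MONGO_DB_NAME"]][os.environ["MONGO_COLLECTION_NAME"]]
    vectorstore = AzureCosmosDBVectorSearch.from_documents(chunks, embeddings, collection=collection)

    # Create the HNSW vector index; switch kind to CosmosDBVectorSearchType.VECTOR_IVF
    # if your cluster tier does not support HNSW (see the tip above).
    vectorstore.create_index(
        dimensions=1536,
        similarity=CosmosDBSimilarityType.COS,
        kind=CosmosDBVectorSearchType.VECTOR_HNSW,
    )

    # At query time, semantic similarity search returns the most relevant chunks,
    # along with their metadata, to ground the assistant's spoken answer.
    results = vectorstore.similarity_search("What does the contract say about renewals?", k=4)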

The Backbone of Real-Time Communication (rtmt.py)

    • WebSocket Management: Uses aiohttp to maintain persistent connections between the client and server, supporting live queries.
    • Tool-Based Interaction: Defines tools (e.g., search, reporting) that can be dynamically invoked during conversations to provide specific insights.
    • Authentication & Token Management: Integrates Azure token management, ensuring secure communication and continuous session management.
    • Real-Time Responses: Messages are processed dynamically, forwarding results to the client or triggering tool-based interactions as needed.
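
The shape of that middle tier can be sketched with aiohttp as follows. The route name, event fields, and tool-registration helper here are illustrative placeholders rather than the exact rtmt.py implementation; the persistent WebSocket loop and dynamic tool dispatch are the pattern described above.

    import json

    from aiohttp import web

    TOOLS = {}  # tool name -> async callable returning a string result

    def tool(name):
        """Register an async function as a tool the assistant can invoke mid-conversation."""
        def register(fn):
            TOOLS[name] = fn
            return fn
        return register

    @tool("search")
    async def search_tool(args: dict) -> str:
        # In the real system this runs the Cosmos DB vector search described above.
        return f"(results for: {args.get('query', '')})"

    async def realtime_handler(request: web.Request) -> web.WebSocketResponse:
        ws = web.WebSocketResponse()
        await ws.prepare(request)          # persistent, bidirectional session with the client
        async for msg in ws:
            event = json.loads(msg.data)
            if event.get("type") == "tool_call":
                # The model asked for a tool; run it and send the result back.
                result = await TOOLS[event["name"]](event.get("arguments", {}))
                await ws.send_json({"type": "tool_result", "result": result})
            else:
                # Everything else would be relayed to the GPT-4o Realtime API (omitted here).
                await ws.send_json({"type": "ack"})
        return ws

    app = web.Application()
    app.router.add_get("/realtime", realtime_handler)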
 

Tutorial: Running the Application

Prerequisites: Before starting, ensure you have access to the following:

  • Azure OpenAI Access: A deployed GPT-4o Realtime model and an embedding model on Azure.
  • Azure Cosmos DB for MongoDB: Used to store document embeddings.

Step 1: Set up the Environment:

Create a .env file in the backend folder and configure the following variables:

    AZURE_OPENAI_ENDPOINT=wss://<your-instance-name>.openai.azure.com
    AZURE_OPENAI_DEPLOYMENT=gpt-4o-realtime-preview
    AZURE_OPENAI_API_KEY=<your-api-key>
    MONGO_CONNECTION_STRING=<your-mongo-connection-string>
    MONGO_DB_NAME=<your-database-name>
    MONGO_COLLECTION_NAME=<your-collection-name>

Step 2: Install Dependencies:

Make sure Node.js and npm are installed:

  • You can download Node.js from https://nodejs.org/en/download/package-manager
  • Windows users: Ensure PowerShell is available for running scripts.

Step 3: Run the Application:

Use the provided startup script to launch the app:

    Windows: .\scripts\start.ps1
    Linux/Mac: ./scripts/start.sh

Place any relevant documents in the ./data/ directory for processing. On the first run, the system will index documents and store the embeddings in Cosmos DB.

Step 4: Access and Interact with the App:

After launching, the app will be available at http://localhost:8765. From here:

  • Speak directly to the AI using the voice interface.
  • Query documents and receive real-time responses.
 

Conclusion:

Our open-source AI assistant offers a powerful blend of Azure OpenAI services and Azure Cosmos DB, providing fast, intelligent responses through real-time voice interactions. With features like document chunking, HNSW vector search, and tool-based interactions, this solution is highly scalable and adaptable to various business needs.

We look forward to your feedback and contributions! Check out the repository, try the demo, and let us know how you’ve customized the solution for your use case.

Visit our GitHub: https://github.com/cazton/CaztonVoiceRag