Retrieval-Augmented Generation (RAG) for voice is more demanding than for text. The retrieved context must be high-quality and concise to keep latency low and responses natural.
1. Chunking Strategy
For voice agents, we recommend smaller chunk sizes (200-400 tokens). Large chunks increase the LLM’s “thinking” time (latency) and may include irrelevant information that makes the agent ramble.
Best Practices:
- Overlap: Use a 10-15% overlap between chunks so that context isn’t lost at the boundaries.
- Semantic Splitting: Where possible, split by paragraph or section rather than at a strict character count (see the sketch below).
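To make these guidelines concrete, here is a simplified chunker sketch (illustrative only, not our production pipeline). It packs paragraphs into ~300-token chunks and carries a ~10% overlap across boundaries. Whitespace-separated words stand in for tokens here; in practice, count with the tokenizer that matches your embedding model.

```python
def chunk_document(text: str, max_tokens: int = 300,
                   overlap_ratio: float = 0.1) -> list[str]:
    """Pack paragraphs into ~max_tokens chunks with overlap.

    Words approximate tokens for simplicity. Assumes individual
    paragraphs are shorter than max_tokens; pre-split any that aren't.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    overlap = int(max_tokens * overlap_ratio)  # ~30 words at the defaults
    chunks: list[str] = []
    current: list[str] = []

    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one so
            # context isn't lost at the boundary (the 10-15% overlap rule).
            current = current[-overlap:]
        current.extend(words)

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on blank lines first keeps paragraphs intact, so a chunk boundary falls between ideas rather than mid-sentence.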
2. Context Injection
In Butter AI, when a document is attached to an agent, relevant chunks are injected into the system prompt at runtime.
Prompting for RAG:
To prevent the agent from sounding like a robot reading a manual, add these instructions to your agent’s system prompt:
“Use the provided context to answer questions, but speak naturally. Do not say ‘According to the document…’ or ‘The context says…’. Just provide the answer directly and concisely.”
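As a simplified illustration of this injection pattern (not our production code), the runtime prompt assembly looks roughly like the sketch below; `build_system_prompt` and the variable names are ours for illustration:

```python
RAG_STYLE_RULES = (
    "Use the provided context to answer questions, but speak naturally. "
    "Do not say 'According to the document...' or 'The context says...'. "
    "Just provide the answer directly and concisely."
)

def build_system_prompt(base_prompt: str, retrieved_chunks: list[str]) -> str:
    """Assemble the runtime system prompt: agent persona first, then the
    style rules, then the retrieved context appended at the end."""
    context = "\n\n".join(retrieved_chunks)
    return f"{base_prompt}\n\n{RAG_STYLE_RULES}\n\nContext:\n{context}"
```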
3. Optimizing for Speed
Every token of retrieved context adds to the LLM’s input processing time.
- Top-K Selection: By default, Butter AI retrieves the top 3 most relevant chunks. For very fast agents, you might reduce this to 1 or 2 (see the sketch below).
- Filtering: Ensure your documents are clean. Remove headers, footers, and page numbers from PDFs before uploading, as these create “noise” in the retrieval results.
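Under the hood, top-K selection is a similarity ranking over chunk embeddings. Here is a minimal sketch, assuming you already have unit-normalized embeddings; the function and argument names are ours for illustration:

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray,
                 chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query.

    chunk_vecs: (n_chunks, dim) matrix of unit-normalized embeddings.
    query_vec: (dim,) unit-normalized query embedding. With normalized
    vectors, the dot product equals cosine similarity.
    """
    scores = chunk_vecs @ query_vec          # one similarity score per chunk
    best = np.argsort(scores)[::-1][:k]      # indices, highest score first
    return [chunks[i] for i in best]
```

Calling this with `k=1` or `k=2` is the latency-saving move described above: fewer retrieved chunks means fewer input tokens for the LLM to process.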
4. Hybrid Search (Upcoming)
We are working on hybrid search (combining keyword and vector search) to improve the retrieval of specific proper nouns, SKU numbers, or technical codes that might be missed by semantic-only search.
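For background while this work is in progress: a common way to merge keyword and vector results is reciprocal rank fusion (RRF). The sketch below shows the general technique only; it is not a preview of our implementation.

```python
def reciprocal_rank_fusion(keyword_ranking: list[str],
                           vector_ranking: list[str],
                           k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs with reciprocal rank fusion.

    Each chunk scores sum(1 / (k + rank)) across the rankings it appears
    in; k=60 is the conventional smoothing constant. A chunk ranked well
    by either keyword or vector search surfaces near the top, which is
    how exact matches like SKUs survive alongside semantic hits.
    """
    scores: dict[str, float] = {}
    for ranking in (keyword_ranking, vector_ranking):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```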