HOW-TO · VOICE AI

How to Build a Voice AI App with the OpenAI Realtime API (Step-by-Step)

A practical, no-fluff guide to building a real voice AI assistant with OpenAI's Realtime API in Python. WebSocket, mic streaming, function calling over voice, interruption handling, and how to hit sub-500ms latency. Written for working engineers who want to ship something, not just read about it.

By the ThinkPythonAI Team · Updated 2026 · Live cohorts on Zoom

Voice AI is the breakout UX of 2026. Apps that used to be chat-only (customer support, language tutors, interview coaches, internal copilots) are quickly becoming voice-first. The good news: with OpenAI's Realtime API, you can build a working voice assistant in a single afternoon. The better news: you don't need to know WebRTC, and you don't need a team.

This guide walks through the actual code patterns we teach in our 5-week VoxCoach: Voice AI Sprint cohort. If you want to build this with live code reviews and a private WhatsApp community, VoxCoach is open, but this article covers the full path on its own.

Step 1: Get Realtime API access and set up your environment

The Realtime API is enabled by default on most paid OpenAI accounts. Confirm yours has access by checking the model list for gpt-4o-realtime-preview in your OpenAI dashboard.

Spin up a Python environment:

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install websockets sounddevice numpy python-dotenv openai

Store your API key in a .env file: OPENAI_API_KEY=sk-.... Never commit it.
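
A missing key otherwise shows up later as a confusing authentication failure on the WebSocket handshake, so it can help to fail fast at startup. The sketch below is one way to do that; require_env is a hypothetical helper name, not part of python-dotenv or the OpenAI SDK, and it takes the environment as a parameter so it can be exercised without a real key.

```python
import os

def require_env(name, env=os.environ):
    """Return the variable's value, or raise with a readable message."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it")
    return value

# Usage with a stand-in mapping so the example runs without a real key
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"}))
```

In the real app you would call load_dotenv() first and then require_env("OPENAI_API_KEY") with the default environment.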

Step 2: Open a WebSocket connection

The Realtime API speaks WebSockets, not REST. Open one to wss://api.openai.com/v1/realtime with your model in the query string and a bearer-token header:

import asyncio, os
import websockets
from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY from .env into the environment

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main():
    # websockets 14+ names this keyword additional_headers;
    # on older releases, pass extra_headers=HEADERS instead.
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # session.created arrives automatically as the first message
        async for msg in ws:
            print(msg)
            break

asyncio.run(main())

If this prints a session.created JSON message, congratulations: your voice AI is alive (just not talking yet).
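
Every message on the socket is a JSON object with a type field, so the rest of the app ends up branching on that field. A minimal sketch of that first parsing step follows; parse_event is a hypothetical helper, and the sample message is a trimmed stand-in rather than a real server payload.

```python
import json

def parse_event(raw):
    """Decode one server message and return (event_type, payload)."""
    msg = json.loads(raw)
    return msg.get("type", "unknown"), msg

# Trimmed, made-up example of the first message on a new connection
sample = '{"type": "session.created", "session": {"id": "sess_example"}}'
event_type, payload = parse_event(sample)
print(event_type)
```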

Step 3: Configure the session: voice, instructions, tools

Right after session.created, send a session.update event to choose your voice, system prompt, audio format, and function-calling tools:

import json

# Send this from inside main(), within the `async with` block from Step 2.
config = {
  "type": "session.update",
  "session": {
    "voice": "alloy",
    "instructions": "You are a friendly interview coach. Be concise.",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "turn_detection": {"type": "server_vad"},
    "tools": [
      {
        "type": "function",
        "name": "lookup_company",
        "description": "Look up info about a company.",
        "parameters": {
          "type": "object",
          "properties": {"name": {"type": "string"}},
          "required": ["name"],
        },
      }
    ],
  }
}
await ws.send(json.dumps(config))

Step 4: Stream microphone audio to the API

The API expects 24kHz PCM16 (16-bit mono) audio. Capture from the microphone with sounddevice or pyaudio, base64-encode each chunk, and send as input_audio_buffer.append:

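import asyncio, base64, json
import sounddevice as sd

# Run this inside main(), after the WebSocket `ws` from Step 2 is open.
loop = asyncio.get_running_loop()

def mic_callback(indata, frames, time_, status):
    # sounddevice calls this on its own audio thread, not on the event loop.
    pcm = (indata.clip(-1.0, 1.0) * 32767).astype("int16").tobytes()
    b64 = base64.b64encode(pcm).decode()
    event = json.dumps({"type": "input_audio_buffer.append", "audio": b64})
    # Hand the send to the event loop; asyncio.create_task would fail here
    # because this thread has no running loop.
    asyncio.run_coroutine_threadsafe(ws.send(event), loop)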

stream = sd.InputStream(
    samplerate=24000, channels=1, dtype="float32",
    callback=mic_callback, blocksize=2400,  # 100ms chunks
)
stream.start()
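
The blocksize above follows from simple arithmetic: frames per chunk is the sample rate times the chunk duration, and each PCM16 mono frame is two bytes. The sketch below works that out, including the size after base64 encoding; chunk_stats is a hypothetical helper name.

```python
import base64

SAMPLE_RATE = 24_000      # Hz, as the API expects
BYTES_PER_SAMPLE = 2      # PCM16, mono

def chunk_stats(chunk_ms):
    """Return (frames, raw_bytes, base64_chars) for one chunk of audio."""
    frames = SAMPLE_RATE * chunk_ms // 1000
    raw_bytes = frames * BYTES_PER_SAMPLE
    b64_chars = len(base64.b64encode(bytes(raw_bytes)))
    return frames, raw_bytes, b64_chars

print(chunk_stats(100))  # (2400, 4800, 6400)
```

So a 100ms chunk is 2400 frames, 4800 bytes of raw audio, and 6400 characters once base64-encoded, before the JSON envelope is added.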

With server_vad turn detection (set in Step 3), the server automatically commits the audio buffer when it detects you've stopped speaking. No manual commit needed.
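
For contrast, here is roughly what a push-to-talk flow might look like if you turn server-side detection off. This is a sketch under the assumption that setting turn_detection to null in session.update disables VAD; manual_turn_events is a hypothetical helper that returns the two events you would then send yourself when the user releases the talk button.

```python
import json

# Assumption: a null turn_detection disables server-side VAD
disable_vad = {"type": "session.update", "session": {"turn_detection": None}}

def manual_turn_events():
    """Events to send when the user has finished speaking and VAD is off."""
    return [
        json.dumps({"type": "input_audio_buffer.commit"}),  # close the user turn
        json.dumps({"type": "response.create"}),            # ask the model to answer
    ]

print(json.dumps(disable_vad))
for event in manual_turn_events():
    print(event)
```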

Step 5: Receive audio responses and play them back

As the AI talks, the server emits response.audio.delta events with base64-encoded PCM16 chunks. Decode and queue them to your speaker:

import base64, json
import numpy as np
import sounddevice as sd

out_stream = sd.OutputStream(samplerate=24000, channels=1, dtype="int16")
out_stream.start()

async for raw in ws:
    msg = json.loads(raw)
    if msg["type"] == "response.audio.delta":
        pcm = base64.b64decode(msg["delta"])
        out_stream.write(np.frombuffer(pcm, dtype=np.int16))
    elif msg["type"] == "response.done":
        # turn finished
        pass

Time-to-first-audio (TTFA) is the most important latency metric. Most users perceive sub-500ms as instant; over 1s starts to feel like a robot pausing.
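
One way to measure TTFA is to start a timer when the server reports that the user stopped speaking and stop it on the first audio delta of the reply. The sketch below does that with event types only; TurnTimer is a hypothetical class, and the clock is injectable so the example runs with made-up timestamps.

```python
import time

class TurnTimer:
    """Tracks time-to-first-audio for the current turn, in milliseconds."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._speech_end = None
        self.ttfa_ms = None

    def on_event(self, event_type):
        if event_type == "input_audio_buffer.speech_stopped":
            # User finished talking: start timing a new turn
            self._speech_end = self._clock()
            self.ttfa_ms = None
        elif (event_type == "response.audio.delta"
              and self._speech_end is not None and self.ttfa_ms is None):
            # First audio chunk of the reply: stop timing
            self.ttfa_ms = (self._clock() - self._speech_end) * 1000
        return self.ttfa_ms

# Demo with fake timestamps instead of a live connection
fake_times = iter([10.0, 10.25])
timer = TurnTimer(clock=lambda: next(fake_times))
timer.on_event("input_audio_buffer.speech_stopped")
print(timer.on_event("response.audio.delta"))
```

In the receive loop you would call timer.on_event(msg["type"]) for every message and log ttfa_ms once per turn.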

Step 6: Function calling over voice

Voice AI without tools is just a fancy chatbot. The Realtime API supports the same function-calling pattern as the text API: the AI decides when to call your function, you execute it, you return the result, and the AI continues talking.

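# Receive: in the Step 5 loop, when msg["type"] is
# "response.function_call_arguments.done", msg looks like this:
# {
#   "type": "response.function_call_arguments.done",
#   "name": "lookup_company",
#   "arguments": "{\"name\": \"Stripe\"}",
#   "call_id": "fc_abc123",
# }

# Execute it (arguments arrive as a JSON string, so decode them first)
args = json.loads(msg["arguments"])
result = lookup_company(**args)

# Send result back, tagged with the call_id from the event
await ws.send(json.dumps({
  "type": "conversation.item.create",
  "item": {
    "type": "function_call_output",
    "call_id": msg["call_id"],
    "output": json.dumps(result),
  },
}))
# Then ask the model to continue speaking with the result in context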
await ws.send(json.dumps({"type": "response.create"}))
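
Once there is more than one tool, a small registry keeps the receive loop tidy. The sketch below is one possible shape, not the API's prescribed pattern: TOOLS and handle_function_call are hypothetical names, and lookup_company is a stand-in implementation. Errors are returned to the model as the tool output so the conversation can continue instead of stalling.

```python
import json

def lookup_company(name):
    # Stand-in implementation; replace with a real lookup
    return {"name": name, "found": False}

TOOLS = {"lookup_company": lookup_company}

def handle_function_call(msg):
    """Turn a function-call event into the list of events to send back."""
    try:
        func = TOOLS.get(msg["name"])
        if func is None:
            raise ValueError(f"unknown tool: {msg['name']}")
        result = func(**json.loads(msg["arguments"]))
    except Exception as exc:
        # Report the failure to the model rather than dropping the call
        result = {"error": str(exc)}
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": msg["call_id"],
                "output": json.dumps(result),
            },
        },
        {"type": "response.create"},
    ]

events = handle_function_call({
    "name": "lookup_company",
    "arguments": '{"name": "Stripe"}',
    "call_id": "fc_abc123",
})
print(events[0]["item"]["output"])
```

In the receive loop, each returned event would be sent with ws.send(json.dumps(event)).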

Step 7: Handle interruptions gracefully

The thing that makes voice AI feel alive, not robotic, is graceful interruption. When the user starts speaking while the AI is talking, you receive input_audio_buffer.speech_started. You should:

  1. Send response.cancel to stop the AI mid-sentence.
  2. Clear any queued PCM16 audio from your speaker buffer.
  3. Let the new user turn flow through normally.

Done right, the user can interrupt at any moment and the AI yields immediately, exactly like a polite human.
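
The three steps above can be sketched as a small handler. PlaybackQueue and on_speech_started are hypothetical names; the sketch assumes decoded audio chunks sit in your own queue before reaching the speaker, and that you track whether a response is currently in progress so response.cancel is only sent when there is something to cancel.

```python
import json
from collections import deque

class PlaybackQueue:
    """Holds decoded PCM16 chunks that have not been played yet."""

    def __init__(self):
        self._chunks = deque()

    def push(self, pcm):
        self._chunks.append(pcm)

    def pop(self):
        return self._chunks.popleft() if self._chunks else None

    def clear(self):
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped

    def __len__(self):
        return len(self._chunks)

def on_speech_started(queue, response_active):
    """Handle a barge-in: drop queued audio, return events to send."""
    queue.clear()                                        # step 2
    if response_active:
        return [json.dumps({"type": "response.cancel"})]  # step 1
    return []                                            # step 3: nothing else to do

q = PlaybackQueue()
q.push(b"\x00\x00" * 2400)
print(on_speech_started(q, response_active=True), len(q))
```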

Step 8: Measure and tune latency

Production voice agents target sub-500ms time-to-first-audio and under 1.2s end-to-end turn latency. Common bottlenecks:

  • Audio chunks larger than 100-200ms (increase frequency, decrease size)
  • Blocking I/O on the speaker path (use a dedicated thread/loop)
  • Slow tool functions blocking the response (parallelize, cache, or pre-fetch)
  • Network jitter (use a wired connection or a regional edge)
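
For the blocking speaker path in particular, one common approach is to hand chunks to a worker thread through a queue so the receive loop never waits on the audio device. The sketch below uses only the standard library; start_player is a hypothetical helper, and write stands in for whatever actually plays a chunk (for example the output stream's write method from Step 5).

```python
import queue
import threading

def start_player(write, maxsize=50):
    """Run write(chunk) on a worker thread; returns (chunk_queue, thread)."""
    chunks = queue.Queue(maxsize=maxsize)

    def worker():
        while True:
            chunk = chunks.get()
            if chunk is None:   # sentinel: shut down
                break
            write(chunk)        # may block; only this thread waits

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return chunks, thread

# Demo: collect chunks in a list instead of playing them
written = []
chunks, thread = start_player(written.append)
for part in (b"a", b"b", b"c"):
    chunks.put_nowait(part)
chunks.put(None)
thread.join(timeout=2)
print(written)  # [b'a', b'b', b'c']
```

In the receive loop, chunks.put_nowait(pcm) replaces the direct write call, and the bounded queue gives you a natural place to drop audio on interruption.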

From this article to a real production system

You now have the spine of a working voice AI. The real engineering starts when you add RAG over your private data, LLM-as-a-Judge evaluation, LangGraph orchestration, and structured observability. Together, these are what separate a demo from a production assistant.

That's exactly what we teach in our 5-week VoxCoach: Voice AI Sprint. You build the same architecture this article sketches, but live, with code reviews, a private WhatsApp community, four bonus voice-AI projects, and a verifiable LinkedIn certificate.

The first two classes are completely free, so you can try before paying. WhatsApp our coordinator Sachin at +1-603-417-0825 to reserve your seat.

Want to build this with live guidance?

ThinkPythonAI runs small live cohorts where you build real Python + AI projects with direct feedback. Most professionals go directly into the 8-Week Python + AI Systems Lab. Kids (Grades 5-12) have their own track.