Whisper Integration: Voice-to-Text for Robots
Introduction
Natural language control starts with hearing. Before your humanoid robot can understand commands, it needs to convert speech into text. In this section, you'll integrate OpenAI Whisper, a state-of-the-art speech recognition model, into a ROS 2 node. By the end, your robot will transcribe voice commands and publish them to a ROS topic for downstream planning.
What is Whisper?
Whisper is OpenAI's automatic speech recognition (ASR) model trained on 680,000 hours of multilingual audio. It's designed for:
- Robustness: Works with noisy audio, accents, and background sounds
- Multilingual: Supports 99 languages (English, Spanish, Chinese, etc.)
- Zero-shot: No fine-tuning required for new domains
Key Features:
- Word Error Rate (WER): ~5-10% on clean audio (human-level)
- Real-time capable: ~500ms-2s latency depending on audio length
- Low cost: The OpenAI API charges $0.006 per minute of audio (no free tier, but cheap for exercises)
Whisper Models
Whisper comes in several sizes:
| Model | Parameters | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| tiny | 39M | ~10x real-time | 80% | Quick prototyping |
| base | 74M | ~7x real-time | 85% | Embedded devices |
| small | 244M | ~4x real-time | 90% | General use |
| medium | 769M | ~2x real-time | 95% | High accuracy |
| large-v3 | 1.5B | ~1x real-time | 97%+ | Production |
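The speed column can be read as a real-time factor: transcription time is roughly audio length divided by that factor. A small illustrative sketch (the factors below mirror the table above and are rough assumptions, not benchmarks):

```python
# Approximate real-time factors from the table above (illustrative only)
SPEED_FACTORS = {"tiny": 10, "base": 7, "small": 4, "medium": 2, "large-v3": 1}

def estimate_transcription_seconds(audio_seconds: float, model: str) -> float:
    """Rough estimate: transcription time = audio length / real-time factor."""
    return audio_seconds / SPEED_FACTORS[model]

print(estimate_transcription_seconds(10.0, "tiny"))      # 1.0  (~1 s for a 10 s clip)
print(estimate_transcription_seconds(10.0, "large-v3"))  # 10.0 (~real time)
```

This back-of-envelope math is what makes `tiny` attractive for prototyping and `large-v3` a production choice only when you have GPU headroom.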
For this chapter: We'll use the OpenAI API's hosted whisper-1 model, which is fast and typically more accurate than the smaller models you could run locally.
Why Whisper for Robotics?
Advantage 1: Noise Robustness
Robots operate in noisy environments:
- Motors humming
- People talking in background
- Wind (outdoors)
- Echoes in large rooms
Whisper handles this because it's trained on diverse, real-world audio (not just clean studio recordings).
Advantage 2: Multilingual Support
Example: International research team
- Engineer (English): "Go to the lab"
- Colleague (Spanish): "Ve al laboratorio"
- Both transcribed correctly without language switching
Advantage 3: No Fine-Tuning
Traditional ASR (e.g., Google Speech API) often struggles with:
- Technical jargon ("Nav2", "cuVSLAM", "ROS 2")
- Domain-specific terms ("gripper", "manipulator")
Whisper handles robotics terminology out-of-the-box due to its massive training corpus.
OpenAI Whisper API vs. Local Models
Option 1: OpenAI API (Recommended)
Advantages:
- Fastest (cloud GPUs)
- Most accurate (hosted large-class model)
- No local setup required
- Always up-to-date
Disadvantages:
- Requires internet
- Costs $0.006/minute of audio (~$0.10 for 15min exercise)
- Privacy: Audio sent to OpenAI servers
Option 2: Local Whisper (whisper.cpp or Faster Whisper)
Advantages:
- No internet required
- Free (after initial setup)
- Privacy: Audio stays local
Disadvantages:
- Slower (unless you have a strong GPU)
- Requires installation and model downloads
- Slightly lower accuracy
For learning: Use OpenAI API. For production/privacy-sensitive deployments: Use local models.
Setting Up OpenAI API
Step 1: Create OpenAI Account
- Go to https://platform.openai.com/signup
- Sign up with email
- Verify email address
Step 2: Get API Key
- Navigate to API Keys section
- Click Create new secret key
- Copy key (starts with sk-...)
- Important: Never commit API keys to GitHub!
Step 3: Install OpenAI Python SDK
# Activate ROS 2 workspace
cd ~/ros2_ws
source /opt/ros/humble/setup.bash
# Install OpenAI SDK
pip3 install openai
Step 4: Test API Access
#!/usr/bin/env python3
from openai import OpenAI

# Pass the key explicitly for this one-off test (replace with yours);
# Step 5 below switches to an environment variable
client = OpenAI(api_key="sk-YOUR_API_KEY_HERE")

# Test connection by listing available models
models = client.models.list()
print("✓ API connection successful!")
print(f"Available models: {len(models.data)}")
Expected output (exact model count will vary):
✓ API connection successful!
Available models: 50+
Step 5: Secure API Key
Never hardcode API keys! Use environment variables:
# Add to ~/.bashrc
export OPENAI_API_KEY="sk-YOUR_API_KEY_HERE"
source ~/.bashrc
In Python:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment by default
Creating a Whisper ROS 2 Node
Now let's build a ROS 2 node that captures audio and publishes transcriptions.
Package Structure
cd ~/ros2_ws/src
ros2 pkg create --build-type ament_python whisper_voice_node --dependencies rclpy std_msgs
cd whisper_voice_node
Directory structure:
whisper_voice_node/
├── whisper_voice_node/
│ ├── __init__.py
│ └── whisper_node.py # Main node
├── package.xml
├── setup.py
└── config/
└── whisper_config.yaml # Configuration
Node Implementation
File: whisper_voice_node/whisper_voice_node/whisper_node.py
#!/usr/bin/env python3
"""
Whisper Voice Node - Captures audio and transcribes to text using OpenAI Whisper API
"""
import os
import tempfile

import rclpy
from rclpy.node import Node
from std_msgs.msg import String
import sounddevice as sd
import scipy.io.wavfile as wav
from openai import OpenAI


class WhisperVoiceNode(Node):
    """ROS 2 node for voice-to-text using Whisper API"""

    def __init__(self):
        super().__init__('whisper_voice_node')

        # Declare parameters (-1 means auto-select microphone; Humble rejects
        # None as the default for a statically typed parameter)
        self.declare_parameter('sample_rate', 16000)
        self.declare_parameter('duration', 5.0)  # Record 5 seconds at a time
        self.declare_parameter('device_index', -1)

        # Get parameters
        self.sample_rate = self.get_parameter('sample_rate').value
        self.duration = self.get_parameter('duration').value
        device_index = self.get_parameter('device_index').value
        self.device_index = None if device_index < 0 else device_index

        # Set up OpenAI client (fail fast if the key is missing)
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            self.get_logger().error("OPENAI_API_KEY environment variable not set!")
            raise ValueError("Missing API key")
        self.client = OpenAI(api_key=api_key)

        # Create publisher for transcribed text
        self.publisher = self.create_publisher(String, '/voice_commands', 10)

        # Create timer to record audio periodically
        self.timer = self.create_timer(self.duration + 0.5, self.record_and_transcribe)

        self.get_logger().info('Whisper Voice Node started')
        self.get_logger().info(f'Recording {self.duration}s audio every {self.duration + 0.5}s')
        self.get_logger().info('Publishing transcriptions to /voice_commands')

    def record_audio(self):
        """Record audio from microphone"""
        self.get_logger().info('🎤 Recording...')
        audio_data = sd.rec(
            int(self.duration * self.sample_rate),
            samplerate=self.sample_rate,
            channels=1,
            dtype='int16',
            device=self.device_index
        )
        sd.wait()  # Wait until recording is finished
        return audio_data

    def transcribe_audio(self, audio_data):
        """Transcribe audio using Whisper API"""
        # Save audio to temporary WAV file (API requires file input)
        with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp_audio:
            wav.write(temp_audio.name, self.sample_rate, audio_data)
            temp_path = temp_audio.name

        try:
            # Call Whisper API
            with open(temp_path, 'rb') as audio_file:
                transcript = self.client.audio.transcriptions.create(
                    model="whisper-1",
                    file=audio_file,
                    language="en"  # Force English (remove for auto-detect)
                )
            return transcript.text.strip()
        except Exception as e:
            self.get_logger().error(f'Transcription failed: {e}')
            return None
        finally:
            # Clean up temporary file
            os.remove(temp_path)

    def record_and_transcribe(self):
        """Main callback: record audio and transcribe"""
        audio_data = self.record_audio()
        text = self.transcribe_audio(audio_data)

        if text:
            # Publish to ROS topic
            msg = String()
            msg.data = text
            self.publisher.publish(msg)
            self.get_logger().info(f'📝 Transcribed: "{text}"')
        else:
            self.get_logger().warn('No transcription (silence or error)')


def main(args=None):
    rclpy.init(args=args)
    node = WhisperVoiceNode()
    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        pass
    finally:
        node.destroy_node()
        rclpy.shutdown()


if __name__ == '__main__':
    main()
Dependencies
Add to package.xml:
<exec_depend>std_msgs</exec_depend>
<exec_depend>python3-sounddevice</exec_depend>
<exec_depend>python3-scipy</exec_depend>
<exec_depend>python3-openai</exec_depend>
Install Python packages:
pip3 install sounddevice scipy openai
Configuration File
File: config/whisper_config.yaml
whisper_voice_node:
  ros__parameters:
    sample_rate: 16000  # 16 kHz (Whisper's native rate)
    duration: 5.0       # Record 5 seconds at a time
    device_index: -1    # Auto-select (-1) or a specific device number
Setup.py Entry Point
File: setup.py
from setuptools import setup
import os
from glob import glob

package_name = 'whisper_voice_node'

setup(
    name=package_name,
    version='0.0.1',
    packages=[package_name],
    data_files=[
        ('share/ament_index/resource_index/packages',
         ['resource/' + package_name]),
        ('share/' + package_name, ['package.xml']),
        (os.path.join('share', package_name, 'config'),
         glob('config/*.yaml')),
    ],
    install_requires=['setuptools'],
    zip_safe=True,
    maintainer='Your Name',
    maintainer_email='you@example.com',
    description='Whisper voice-to-text ROS 2 node',
    license='MIT',
    entry_points={
        'console_scripts': [
            'whisper_node = whisper_voice_node.whisper_node:main',
        ],
    },
)
Building and Running
Build the Package
cd ~/ros2_ws
colcon build --packages-select whisper_voice_node --symlink-install
source install/setup.bash
Run the Node
ros2 run whisper_voice_node whisper_node
Expected output:
[INFO] [whisper_voice_node]: Whisper Voice Node started
[INFO] [whisper_voice_node]: Recording 5.0s audio every 5.5s
[INFO] [whisper_voice_node]: Publishing transcriptions to /voice_commands
[INFO] [whisper_voice_node]: 🎤 Recording...
[INFO] [whisper_voice_node]: 📝 Transcribed: "go to the kitchen"
Speak into your microphone during the 5-second recording window.
Listen to Transcriptions
Terminal 2:
ros2 topic echo /voice_commands
Output:
data: 'go to the kitchen'
---
data: 'pick up the red cup'
---
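Downstream planners usually normalize these transcriptions before matching them against command templates, since Whisper returns mixed capitalization and punctuation. A minimal sketch (the `normalize_command` helper is hypothetical, not part of the node above):

```python
import string

def normalize_command(text: str) -> str:
    """Lowercase, trim, and strip punctuation so 'Go to the Kitchen!' and
    'go to the kitchen' resolve to the same command string."""
    cleaned = text.strip().lower()
    return cleaned.translate(str.maketrans('', '', string.punctuation))

print(normalize_command('  Go to the Kitchen!  '))  # go to the kitchen
```

A subscriber on /voice_commands would apply this before dispatching to the planner.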
Improving Audio Quality
Issue 1: Background Noise
Problem: Robot motors, HVAC, people talking
Solution: Use noise suppression
Install noisereduce:
pip3 install noisereduce
Add to node:
import noisereduce as nr

    def record_audio(self):
        audio_data = sd.rec(
            int(self.duration * self.sample_rate),
            samplerate=self.sample_rate,
            channels=1,
            dtype='int16',
            device=self.device_index
        )
        sd.wait()

        # Apply noise reduction (returns float samples; cast back to int16)
        audio_data_clean = nr.reduce_noise(
            y=audio_data.flatten().astype('float32'),
            sr=self.sample_rate,
            stationary=True  # For constant background noise (motors)
        )
        return audio_data_clean.reshape(-1, 1).astype('int16')
Issue 2: Voice Activity Detection (VAD)
Problem: Transcribing silence wastes API calls
Solution: Only transcribe when voice is detected
Install webrtcvad:
pip3 install webrtcvad
Add VAD check:
import webrtcvad

    def has_voice(self, audio_data):
        """Return True if any frame of the recording contains speech."""
        vad = webrtcvad.Vad(3)  # Aggressiveness 0-3 (3 = most aggressive)
        audio_bytes = audio_data.tobytes()

        # webrtcvad only accepts 10/20/30 ms frames of 16-bit mono PCM,
        # so scan the recording in 30 ms windows
        frame_bytes = int(self.sample_rate * 0.03) * 2  # samples * 2 bytes
        for start in range(0, len(audio_bytes) - frame_bytes + 1, frame_bytes):
            if vad.is_speech(audio_bytes[start:start + frame_bytes], self.sample_rate):
                return True
        return False

    def record_and_transcribe(self):
        audio_data = self.record_audio()

        if not self.has_voice(audio_data):
            self.get_logger().info('No voice detected (silence)')
            return

        text = self.transcribe_audio(audio_data)
        # ... rest of code
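If you'd rather avoid the webrtcvad dependency, a crude energy threshold can serve as a fallback VAD. A dependency-free sketch (the RMS threshold of 500 is an assumption you would tune per microphone and environment):

```python
def has_voice_energy(samples, threshold=500.0):
    """Crude VAD: flag speech when the RMS energy of int16 samples exceeds
    a fixed threshold. Far less accurate than webrtcvad, but dependency-free."""
    if not samples:
        return False
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
    return rms > threshold

print(has_voice_energy([0] * 1600))           # False (silence)
print(has_voice_energy([3000, -3000] * 800))  # True  (loud signal)
```

Energy thresholds misfire on loud non-speech noise (motors ramping up), which is exactly why webrtcvad's model-based detection is the preferred option above.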
Privacy Considerations
Data Handling
What OpenAI receives:
- Audio file (WAV format)
- API key (for billing)
What OpenAI does NOT receive:
- User identity (unless in audio content)
- Location data
- Camera feeds
OpenAI's Policy (as of 2025):
- Audio data used to improve models (opt-out available)
- Not used for advertising
- Deleted after 30 days
Best Practices
- Inform users: Display message "Voice commands are processed by OpenAI Whisper"
- Minimize data: Only send audio when voice detected (VAD)
- Local alternative: Use whisper.cpp for sensitive environments
- Anonymize: Filter out personal information before sending
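The anonymize step can be as simple as masking obvious identifiers in the transcript before it is logged or forwarded to other services. A sketch (the regex patterns and the `redact` helper are illustrative assumptions, not an exhaustive PII filter):

```python
import re

# Illustrative patterns: emails and US-style phone numbers only
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+')
PHONE_RE = re.compile(r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b')

def redact(text: str) -> str:
    """Mask obvious personal identifiers in a transcript."""
    text = EMAIL_RE.sub('[EMAIL]', text)
    return PHONE_RE.sub('[PHONE]', text)

print(redact('email me at jo@example.com or call 555-123-4567'))
# email me at [EMAIL] or call [PHONE]
```

Note this only protects the text after transcription; the raw audio still reaches OpenAI, which is why VAD gating and local models remain the stronger privacy controls.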
Cost Optimization
Whisper API Pricing
Current pricing: $0.006 / minute of audio
Example usage (15-minute exercise):
- 15 minutes audio × $0.006 = $0.09 total
- For 30 students: $2.70
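These estimates are easy to script. A tiny helper (assuming the $0.006/minute rate quoted above):

```python
PRICE_PER_MINUTE = 0.006  # USD, Whisper API rate quoted above

def whisper_cost(minutes: float, students: int = 1) -> float:
    """Estimated Whisper API cost in USD, rounded to cents."""
    return round(minutes * PRICE_PER_MINUTE * students, 2)

print(whisper_cost(15))               # 0.09 (one student)
print(whisper_cost(15, students=30))  # 2.7  (whole class)
```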
Optimization strategies:
1. Use VAD (Voice Activity Detection)
Only transcribe when voice detected → Save 50-70% (no silence transcribed)
2. Adjust Recording Duration
duration: 3.0 # Shorter clips = less audio sent
Trade-off: Users must speak within 3-second windows
3. Batch Processing
Record 30 seconds, split into chunks, transcribe once → Reduce API calls
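The splitting step can be sketched as a fixed-size chunker (assuming 16 kHz mono samples and 5-second chunks; the `chunk_samples` helper is illustrative):

```python
def chunk_samples(samples, sample_rate=16000, chunk_seconds=5.0):
    """Split one long recording into fixed-length chunks, so a 30 s buffer
    becomes six 5 s clips submitted in one batch instead of six sessions."""
    size = int(sample_rate * chunk_seconds)
    return [samples[i:i + size] for i in range(0, len(samples), size)]

thirty_seconds = [0] * (16000 * 30)
chunks = chunk_samples(thirty_seconds)
print(len(chunks))     # 6
print(len(chunks[0]))  # 80000
```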
4. Local Whisper for Testing
Use local whisper.cpp during development, switch to API for demos.
Summary
You've now integrated OpenAI Whisper into a ROS 2 node for voice-to-text:
- Whisper API: State-of-the-art speech recognition (97%+ accuracy)
- ROS 2 Node: whisper_voice_node publishes transcriptions to /voice_commands
- Audio Capture: Uses sounddevice to record from the microphone
- Improvements: Noise reduction, Voice Activity Detection (VAD)
- Privacy: Understanding OpenAI's data handling, local alternatives available
- Cost: ~$0.09 for 15min exercise, optimizations can reduce by 50-70%
In the next section, you'll take transcribed commands and use GPT-4 to decompose them into robot action sequences.
Review Questions
- What is the main advantage of Whisper over traditional speech recognition models?
  Answer: Whisper is highly robust to noise, accents, and background sounds because it's trained on 680,000 hours of diverse, real-world audio (not just clean studio recordings). It also supports 99 languages zero-shot.
- How much does the OpenAI Whisper API cost per minute of audio?
  Answer: $0.006 per minute of audio (as of 2025). For a 15-minute exercise, this costs approximately $0.09 total.
- What is Voice Activity Detection (VAD) and why is it useful?
  Answer: VAD detects whether audio contains speech or just silence. It's useful for saving API costs (don't transcribe silence) and reducing unnecessary processing.
- What ROS 2 topic does the Whisper node publish transcriptions to?
  Answer: /voice_commands (type: std_msgs/String)
- What are two privacy concerns with using the OpenAI Whisper API?
  Answer: 1) Audio data is sent to OpenAI servers (external processing); 2) OpenAI may use the data to improve models (though users can opt out, and data is deleted after 30 days).