Wednesday, March 12, 2025

Building an Offline Speech-to-Text App with Whisper and Gradio


Speech-to-text (STT) technology has become a game-changer for accessibility and productivity. However, many STT solutions require an internet connection, raising concerns about privacy and latency. In this blog post, we’ll build an offline voice transcription app using OpenAI’s Whisper model and Gradio for a user-friendly interface.


Why Whisper?

OpenAI’s Whisper is a powerful ASR (Automatic Speech Recognition) model trained on a large and diverse audio dataset. Its larger variants, however, require significant compute resources. To keep this app lightweight, we’ll use Whisper-Tiny, the smallest variant (~155MB).


Setting Up the Project

1. Install Dependencies

First, install the required libraries:

bash
pip install gradio transformers soundfile librosa numpy

2. Load the Model Locally

Instead of downloading Whisper every time, we’ll load it from a local directory:

python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="/home/ashish/Desktop/ASR/models--openai--whisper-tiny",
)

👉 Tip: If you haven’t downloaded the model yet, use:

python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "openai/whisper-tiny"
save_path = "/home/ashish/Desktop/ASR/models--openai--whisper-tiny"

# Download and save the model weights locally
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
model.save_pretrained(save_path)

# Download and save the processor (tokenizer + feature extractor)
processor = AutoProcessor.from_pretrained(model_id)
processor.save_pretrained(save_path)
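
Since the goal is a fully offline app, you can optionally force the Hugging Face libraries into offline mode before loading the pipeline. This is a minimal sketch using the standard HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE environment variables; it guarantees no network requests are made at load time:

python
import os

# Tell the Hugging Face libraries to use only local files (no network access)
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="/home/ashish/Desktop/ASR/models--openai--whisper-tiny",
)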

3. Process Audio for Transcription

Audio recordings may have multiple channels and varying sample rates. We'll convert the audio to mono and resample it to 16kHz, the sample rate Whisper expects:

python
import soundfile as sf
import librosa
import numpy as np

def transcribe_speech(audio):
    if audio is None:
        return "No audio provided. Please try again."
    try:
        # Read the recording; soundfile returns (samples, channels)
        audio_data, sampling_rate = sf.read(audio)
        # Collapse stereo/multi-channel audio to mono
        if len(audio_data.shape) > 1:
            audio_data = librosa.to_mono(np.transpose(audio_data))
        # Whisper expects 16 kHz input, so resample if needed
        audio_16KHz = librosa.resample(audio_data, orig_sr=sampling_rate, target_sr=16000)
        output = asr(audio_16KHz)
        return output["text"]
    except Exception as e:
        return f"Error processing audio: {str(e)}"
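
Before wiring up the UI, it's worth a quick smoke test from a Python shell. The file name below is a placeholder; point it at any short recording you have:

python
# "sample.wav" is a hypothetical path; replace it with a real recording
print(transcribe_speech("sample.wav"))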

4. Create a User Interface with Gradio

Gradio allows us to quickly build an interactive UI for testing the model. The user can record audio or upload a file, and the app will display the transcribed text.

python
import gradio as gr

demo = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath"),
    outputs=gr.Textbox(label="Transcription", lines=4),
    title="🎙️ Voice-to-Text Transcription",
    description="Record or upload an audio file to generate text transcription.",
)

demo.launch()
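
By default, launch() serves the app only on localhost. If you want to try it from a phone or another machine on your network, Gradio's launch() accepts a host and port. A small sketch; adjust to your setup:

python
# Bind to all interfaces so other devices on the LAN can reach the app;
# 7860 is Gradio's default port
demo.launch(server_name="0.0.0.0", server_port=7860)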

Running the App

Simply run the script:

bash
python app.py

Gradio will open a local web interface in your browser, where you can record or upload audio and get the transcription back within a few seconds.


Conclusion

With OpenAI Whisper-Tiny and Gradio, we’ve built an offline speech-to-text app that is:
✅ Fast and lightweight (~155MB model)
✅ Fully offline, with no internet connection required
✅ Easy to use through a web UI

If you need more accurate transcription, you can explore larger Whisper models or try distil-whisper for efficiency.
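
Swapping the model is a one-line change. As a sketch, here are two alternatives hosted on the Hugging Face Hub: openai/whisper-small (more accurate, heavier) and distil-whisper/distil-small.en (distilled for speed, English-only):

python
from transformers import pipeline

# Larger Whisper variant: more accurate, but slower and a bigger download
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Distil-Whisper: distilled for speed; this variant is English-only
asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-small.en")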

🚀 Ready to build your own speech recognition app? Give it a try!


Tags: Technology, Generative AI, Large Language Models
