Wednesday, March 12, 2025

Building an Offline Speech-to-Text App with Whisper and Gradio


Speech-to-text (STT) technology has become a game-changer for accessibility and productivity. However, many STT solutions require an internet connection, raising concerns about privacy and latency. In this blog post, we’ll build an offline voice transcription app using OpenAI’s Whisper model and Gradio for a user-friendly interface.


Why Whisper?

OpenAI’s Whisper is a powerful ASR (Automatic Speech Recognition) model trained on a large and diverse audio dataset. Its larger variants, however, require significant compute resources. To keep this app lightweight, we’ll use Whisper-Tiny, the smallest variant (~155MB).


Setting Up the Project

1. Install Dependencies

First, install the required libraries:

bash
pip install gradio transformers soundfile librosa numpy

2. Load the Model Locally

Instead of downloading Whisper every time, we’ll load it from a local directory:

python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="/home/ashish/Desktop/ASR/models--openai--whisper-tiny",
)

👉 Tip: If you haven’t downloaded the model yet, use:

python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "openai/whisper-tiny"
save_path = "/home/ashish/Desktop/ASR/models--openai--whisper-tiny"

# Download and save the model weights locally
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
model.save_pretrained(save_path)

# Download and save the processor (tokenizer + feature extractor)
processor = AutoProcessor.from_pretrained(model_id)
processor.save_pretrained(save_path)
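
Since the goal is a fully offline app, you can optionally force the Hugging Face libraries into offline mode before loading the pipeline. This is a minimal sketch using the standard HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE environment variables; it guarantees no network requests are made at load time:

python
import os

# Tell the Hugging Face libraries to use only local files (no network access)
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="/home/ashish/Desktop/ASR/models--openai--whisper-tiny",
)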

3. Process Audio for Transcription

Audio recordings may have multiple channels and varying sample rates. We'll convert the audio to mono and resample it to 16kHz, the sample rate Whisper expects:

python
import soundfile as sf
import librosa
import numpy as np

def transcribe_speech(audio):
    if audio is None:
        return "No audio provided. Please try again."
    try:
        # Read the recording; soundfile returns (samples, channels)
        audio_data, sampling_rate = sf.read(audio)
        # Collapse stereo/multi-channel audio to mono
        if len(audio_data.shape) > 1:
            audio_data = librosa.to_mono(np.transpose(audio_data))
        # Whisper expects 16 kHz input, so resample if needed
        audio_16KHz = librosa.resample(audio_data, orig_sr=sampling_rate, target_sr=16000)
        output = asr(audio_16KHz)
        return output["text"]
    except Exception as e:
        return f"Error processing audio: {str(e)}"
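
Before wiring up the UI, it's worth a quick smoke test from a Python shell. The file name below is a placeholder; point it at any short recording you have:

python
# "sample.wav" is a hypothetical path; replace it with a real recording
print(transcribe_speech("sample.wav"))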

4. Create a User Interface with Gradio

Gradio allows us to quickly build an interactive UI for testing the model. The user can record audio or upload a file, and the app will display the transcribed text.

python
import gradio as gr

demo = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath"),
    outputs=gr.Textbox(label="Transcription", lines=4),
    title="🎙️ Voice-to-Text Transcription",
    description="Record or upload an audio file to generate text transcription.",
)

demo.launch()
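
By default, launch() serves the app only on localhost. If you want to try it from a phone or another machine on your network, Gradio's launch() accepts a host and port. A small sketch; adjust to your setup:

python
# Bind to all interfaces so other devices on the LAN can reach the app;
# 7860 is Gradio's default port
demo.launch(server_name="0.0.0.0", server_port=7860)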

Running the App

Simply run the script:

bash
python app.py

Gradio will open a local web interface in your browser, where you can record or upload audio and get the transcription back within a few seconds.


Conclusion

With OpenAI Whisper-Tiny and Gradio, we’ve built an offline speech-to-text app that is:
✅ Fast and lightweight (~155MB model)
✅ Fully offline, with no internet connection required
✅ Easy to use through a web UI

If you need more accurate transcription, you can explore larger Whisper models or try distil-whisper for efficiency.
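
Swapping the model is a one-line change. As a sketch, here are two alternatives hosted on the Hugging Face Hub: openai/whisper-small (more accurate, heavier) and distil-whisper/distil-small.en (distilled for speed, English-only):

python
from transformers import pipeline

# Larger Whisper variant: more accurate, but slower and a bigger download
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Distil-Whisper: distilled for speed; this variant is English-only
asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-small.en")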

🚀 Ready to build your own speech recognition app? Give it a try!


Tags: Technology, Generative AI, Large Language Models
