How can I run a speech recognition app as a Flask endpoint?
Below is an example of how to expose your Whisper-based speech recognition functionality as a Flask endpoint. In this example, the server accepts an audio file via a POST request, processes it with the Whisper model, and returns the transcription as JSON.
Example: app.py
```python
from flask import Flask, request, jsonify
import whisper
import os
import uuid
import soundfile as sf
from scipy.signal import resample_poly

app = Flask(__name__)

# Load the Whisper model (choose "base", "small", "medium", or "large")
model = whisper.load_model("base")

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # Check if an audio file is included in the request
    if "audio" not in request.files:
        return jsonify({"error": "No audio file provided."}), 400

    audio_file = request.files["audio"]

    # Generate a unique filename to temporarily store the audio file
    temp_filename = f"temp_{uuid.uuid4().hex}.wav"
    audio_file.save(temp_filename)

    try:
        # Optional: ensure the audio is 16 kHz. Whisper resamples internally
        # via ffmpeg, so this step is mainly illustrative. Note that soundfile
        # reads WAV/FLAC/OGG but not WebM; for WebM uploads, convert with
        # ffmpeg first or skip this step.
        audio_data, samplerate = sf.read(temp_filename)
        if samplerate != 16000:
            audio_data = resample_poly(audio_data, 16000, samplerate, axis=0)
            sf.write(temp_filename, audio_data, 16000)
    except Exception as e:
        os.remove(temp_filename)
        return jsonify({"error": "Failed to process audio file", "details": str(e)}), 500

    try:
        # Run transcription using Whisper
        result = model.transcribe(temp_filename)
        transcription = result.get("text", "")
    except Exception as e:
        os.remove(temp_filename)
        return jsonify({"error": "Transcription failed", "details": str(e)}), 500

    # Clean up the temporary file
    os.remove(temp_filename)
    return jsonify({"transcription": transcription})

if __name__ == "__main__":
    # Run the Flask server on 0.0.0.0 so it's accessible on your network
    # (debug=True is for development only)
    app.run(debug=True, host="0.0.0.0", port=5000)
```
How to Run the Flask Endpoint on Linux/Ubuntu
Install Dependencies
Make sure you have the necessary packages installed. Open your terminal and run:
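```sh
pip install flask openai-whisper soundfile scipy   # scipy is used by the optional resampling step
sudo apt install ffmpeg
```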
Note: The ffmpeg package is useful for handling various audio formats.
Save the Code
Save the code above into a file named app.py.
Run the Flask Server
In your terminal, navigate to the folder where app.py is saved and execute:
```sh
python app.py
```
The server will start on port 5000; since it binds to all interfaces, you can reach it at http://localhost:5000/transcribe (or via your machine's IP on the network).
Test the Endpoint
You can test the endpoint using tools like curl or Postman. For example, using curl:
```sh
curl -X POST -F "audio=@path/to/your/audio.wav" http://localhost:5000/transcribe
```
You should receive a JSON response with the transcription.
Additional Considerations
Audio Preprocessing:
In a production system, you might want to include more robust audio processing (e.g., noise reduction, resampling using a dedicated library) before passing the file to the model.
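For example, a small sketch using librosa to resample and trim silence (the filenames and the top_db threshold are illustrative):

```python
import librosa
import soundfile as sf

# Load at 16 kHz mono, then trim leading/trailing silence before transcription
audio, sr = librosa.load("input_audio.wav", sr=16000, mono=True)
trimmed, _ = librosa.effects.trim(audio, top_db=30)
sf.write("preprocessed.wav", trimmed, sr)
```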
Error Handling:
The code includes basic error handling for missing files and processing issues. You can extend these handlers to meet your application's needs.
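For instance, you could cap the upload size and return JSON instead of Flask's default HTML error page (the 25 MB limit here is arbitrary; this would go in app.py):

```python
# Reject request bodies larger than 25 MB
app.config["MAX_CONTENT_LENGTH"] = 25 * 1024 * 1024

@app.errorhandler(413)
def too_large(e):
    # Triggered automatically when a request exceeds MAX_CONTENT_LENGTH
    return jsonify({"error": "Audio file too large (max 25 MB)."}), 413
```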
Deployment:
When deploying in production, consider using a production-grade server like Gunicorn behind a reverse proxy (e.g., Nginx).
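A minimal invocation (assuming the file is still named app.py and the Flask instance is called app):

```sh
gunicorn --workers 2 --bind 127.0.0.1:5000 app:app   # Nginx proxies public traffic to this port
```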
This setup transforms your offline Whisper-based speech recognition functionality into a RESTful Flask endpoint that you can call from any client.
Can you also provide a webpage that works with this Flask endpoint?
But note: I want to record the user's voice and send that.
Below is an example HTML page that uses the browser’s MediaRecorder API to record the user’s voice, then sends the recorded audio as a file to your Flask endpoint. In this example, the endpoint is assumed to be hosted on the same origin at /transcribe. (If it’s hosted elsewhere, simply update the URL.)
Note:
Most browsers will record audio as WebM/Opus. On the Flask side, if needed, you can use ffmpeg (or similar) to convert the file to WAV or another format that works best with Whisper. Alternatively, you can modify the Flask code to accept WebM input.
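A minimal sketch of such a page (the element IDs, labels, and the recording.webm filename are illustrative):

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Voice Transcription</title>
</head>
<body>
  <h1>Record and Transcribe</h1>
  <button id="recordButton">Start Recording</button>
  <p id="status"></p>
  <h2>Transcription</h2>
  <p id="transcription"></p>

  <script>
    let mediaRecorder;
    let recordedChunks = [];
    let isRecording = false;

    const recordButton = document.getElementById("recordButton");
    const statusEl = document.getElementById("status");
    const transcriptionEl = document.getElementById("transcription");

    recordButton.addEventListener("click", async () => {
      if (!isRecording) {
        // Request microphone access and start recording
        const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
        mediaRecorder = new MediaRecorder(stream);
        recordedChunks = [];

        mediaRecorder.ondataavailable = (event) => {
          if (event.data.size > 0) recordedChunks.push(event.data);
        };

        mediaRecorder.onstop = async () => {
          // Assemble the chunks into a Blob and send it to the Flask endpoint
          const blob = new Blob(recordedChunks, { type: "audio/webm" });
          const formData = new FormData();
          formData.append("audio", blob, "recording.webm");

          statusEl.textContent = "Transcribing...";
          const response = await fetch("/transcribe", { method: "POST", body: formData });
          const result = await response.json();
          transcriptionEl.textContent = result.transcription || result.error;
          statusEl.textContent = "";
        };

        mediaRecorder.start();
        isRecording = true;
        recordButton.textContent = "Stop Recording";
      } else {
        // Stop the recorder; the onstop handler uploads and displays the result
        mediaRecorder.stop();
        mediaRecorder.stream.getTracks().forEach((t) => t.stop());
        isRecording = false;
        recordButton.textContent = "Start Recording";
      }
    });
  </script>
</body>
</html>
```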
How It Works
Microphone Access:
The page requests microphone access via navigator.mediaDevices.getUserMedia({ audio: true }).
Recording:
A MediaRecorder instance records the audio stream.
Recorded chunks are stored in an array.
Stopping & Sending Audio:
When the user clicks the button again (now labeled Stop Recording), the recorder stops.
The recorded chunks are assembled into a Blob (with MIME type audio/webm).
The Blob is appended to a FormData object and sent via a fetch() POST request to the /transcribe endpoint.
Display Results:
The Flask endpoint processes the audio and returns a JSON object with the transcription.
The transcription is then displayed on the webpage.
This simple webpage allows you to record audio in the browser and send it to your Flask endpoint where Whisper can process it. Adjust the MIME type or perform client-side conversion if your Flask endpoint requires a specific audio format.
Part 2
Dated: Dec 2025
I have this code for automatic speech recognition. Is it still valid? Or are there any updates or changes needed?
Short answer: yes, it still works, as long as you're using the original OpenAI Whisper (openai/whisper Python library).
But — depending on your goals — there are some important updates, improvements, and optimizations you should consider. I’ll break it into sections so you can decide what applies to you.
✅ Your Code is Still Valid
If:
You have the original package installed (pip install openai-whisper, or pip install git+https://github.com/openai/whisper)
You are running Python 3.8+
You have FFmpeg installed (apt install ffmpeg or brew install ffmpeg)
…then your code will run and transcribe audio correctly.
Your API structure is fine, the 16kHz conversion is fine, and model.transcribe() is still the correct method.
⚠️ But: Whisper Python Is No Longer Updated Actively
The classic Whisper Python library is stable but not evolving much.
Latest improvements in speech recognition are happening in:
⭐ OpenAI Whisper large-v3 / large-v3-turbo, and the newer API-only transcription models (gpt-4o-transcribe / gpt-4o-mini-transcribe)
Faster inference
Better multilingual accuracy
Better VAD and segmentation support
Streaming transcription (not in the local Python library)
If you want the latest quality and speed improvements, you should switch to the API:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",  # Whisper successor; "whisper-1" also works
        file=f,
    )

print(result.text)
```
This is much faster and cleaner, but requires an API key.
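The client reads the key from the OPENAI_API_KEY environment variable:

```sh
export OPENAI_API_KEY="your-api-key"   # placeholder; substitute your real key
```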
⭐ Local Alternative: faster-whisper
If you want to stay local and free, faster-whisper (a CTranslate2 reimplementation) is the de facto replacement:

```python
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu")
segments, info = model.transcribe("audio.wav")
for seg in segments:
    print(seg.text)
```
Why faster-whisper is recommended:
Most production Whisper deployments have already moved to CTranslate2
Lower RAM usage
Segment-by-segment streaming output
Faster-than-real-time transcription even on CPU (better real-time factor, RTF)
🧹 Your Current Script Improvements
1️⃣ Avoid saving WAV if you just want transcription
Whisper works with direct file paths or NumPy arrays, so you can skip the convert-and-rewrite step; see the sketch below.
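For example (a sketch assuming the input file is already mono 16 kHz, since that is what Whisper expects for raw arrays; the filename is illustrative):

```python
import soundfile as sf
import whisper

model = whisper.load_model("base")

# Whisper accepts a mono float32 NumPy array sampled at 16 kHz directly
audio, sr = sf.read("speech_16k.wav", dtype="float32")
result = model.transcribe(audio)
print(result["text"])
```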
2️⃣ Use streaming file handling
Right now the endpoint reads the entire audio file into memory. For long recordings, consider processing in chunks, as sketched below.
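A naive sketch that transcribes fixed 30-second blocks (it can cut words at block boundaries; it assumes a mono 16 kHz WAV, and transcribe_in_chunks is a hypothetical helper):

```python
import soundfile as sf
import whisper

CHUNK_SECONDS = 30

def transcribe_in_chunks(path, model):
    # Read and transcribe the file in 30-second blocks to bound memory use
    texts = []
    with sf.SoundFile(path) as f:
        frames_per_chunk = CHUNK_SECONDS * f.samplerate
        while True:
            block = f.read(frames_per_chunk, dtype="float32")
            if len(block) == 0:
                break
            texts.append(model.transcribe(block)["text"])
    return " ".join(texts)

model = whisper.load_model("base")
print(transcribe_in_chunks("long_recording.wav", model))
```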
3️⃣ Consider GPU device selection:
```python
model = whisper.load_model("base", device="cuda")
```
4️⃣ Consider language hints
```python
result = model.transcribe(temp_filename, language="hi")
```
⭐ Major Architecture Consideration for Production
If the transcription API is for mobile apps (Android), then:
Avoid running Python Whisper on the server if load is high
Consider faster-whisper + GPU OR
Just use OpenAI API transcription
If you're building local (on-device) Android ASR, then Python Whisper via Kivy/Chaquopy is heavy; better approaches:
Whisper.cpp on-device
Vosk on-device
or Mozilla DeepSpeech on-device (note: DeepSpeech is discontinued, so prefer the first two)
🆕 If You Want Real-Time Streaming (ASR as live speech)
Classic Whisper Python cannot do real streaming.
But you can get streaming via:
faster-whisper's segment-by-segment generator output (see the sketch below)
whisper.cpp WebRTC streaming
OpenAI Whisper API streaming mode (BEST)
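For the first option, faster-whisper returns a lazy generator, so you can emit partial results while decoding continues:

```python
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

# transcribe() returns segments lazily; each one is decoded on demand,
# so results can be streamed to the client as they become available
segments, info = model.transcribe("audio.wav", vad_filter=True)
for seg in segments:
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")
```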
⭐ Recommendation Summary
| Use Case | Best Choice |
| --- | --- |
| Highest accuracy with the simplest code | OpenAI Whisper API |
| Local ASR without an API | faster-whisper |
| Streaming ASR | Whisper API streaming |
| Android / on-device | whisper.cpp or Vosk |
🔁 Minimal Update to Your Existing Code (Drop-in)
Just replace:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe(temp_filename)
transcription = result.get("text", "")
```

with:

```python
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe(temp_filename)
transcription = " ".join(seg.text.strip() for seg in segments)
```

Everything else stays the same.
This makes your Flask API:
up to 4× faster
lower memory
same accuracy
📌 TL;DR
✔ Your existing Flask + Whisper code still works
⚠ Whisper Python is older and slower now
⭐ Better to switch to:
OpenAI Whisper API (best accuracy, fastest)
or faster-whisper (best local performance)
If you want, I can rewrite your Flask ASR server using any of the options above.