Saturday, March 8, 2025

Whisper-based speech recognition functionality as a Flask endpoint (Research)


I want to run a speech recognition app as a Flask endpoint.

Below is an example of how to expose your Whisper-based speech recognition functionality as a Flask endpoint. In this example, the server accepts an audio file via a POST request, processes it with the Whisper model, and returns the transcription as JSON.


Example: app.py

python
from flask import Flask, request, jsonify
import whisper
import os
import uuid

app = Flask(__name__)

# Load the Whisper model once at startup (choose "base", "small", "medium", or "large")
model = whisper.load_model("base")

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # Check if an audio file is included in the request
    if "audio" not in request.files:
        return jsonify({"error": "No audio file provided."}), 400

    audio_file = request.files["audio"]

    # Generate a unique filename to temporarily store the audio file.
    # Whisper hands the file to ffmpeg, which detects the actual format and
    # resamples to 16 kHz itself, so no manual resampling is needed here.
    temp_filename = f"temp_{uuid.uuid4().hex}"
    audio_file.save(temp_filename)

    try:
        # Run transcription using Whisper
        result = model.transcribe(temp_filename)
        transcription = result.get("text", "")
    except Exception as e:
        return jsonify({"error": "Transcription failed", "details": str(e)}), 500
    finally:
        # Clean up the temporary file in all cases
        os.remove(temp_filename)

    return jsonify({"transcription": transcription})

if __name__ == "__main__":
    # Run the Flask server on 0.0.0.0 so it's accessible on your network
    app.run(debug=True, host="0.0.0.0", port=5000)

How to Run the Flask Endpoint on Linux/Ubuntu

  1. Install Dependencies
    Make sure you have the necessary packages installed. Open your terminal and run:

    sh
    sudo apt update && sudo apt install -y ffmpeg
    pip install flask openai-whisper soundfile

    Note: Whisper relies on ffmpeg to decode audio, so it must be installed. (The PyPI package is openai-whisper; the unrelated whisper package will not work.)

  2. Save the Code
    Save the code above into a file named app.py.

  3. Run the Flask Server
    In your terminal, navigate to the folder where app.py is saved and execute:

    sh
    python app.py

    The server will start on port 5000 and be accessible at http://localhost:5000/transcribe on the same machine, or via your machine's IP address from elsewhere on the network.

  4. Test the Endpoint
    You can test the endpoint using tools like curl or Postman. For example, using curl:

    sh
    curl -X POST -F "audio=@path/to/your/audio.wav" http://localhost:5000/transcribe

    You should receive a JSON response with the transcription.
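If you prefer testing from Python instead of curl, here is a minimal client sketch (assuming the requests library is installed; the helper name transcribe_file is just illustrative):

```python
import requests

def transcribe_file(path, url="http://localhost:5000/transcribe"):
    # POST the audio file under the "audio" field that the endpoint expects
    with open(path, "rb") as f:
        response = requests.post(url, files={"audio": f})
    response.raise_for_status()
    return response.json()["transcription"]

# Example (requires the server to be running):
# print(transcribe_file("path/to/your/audio.wav"))
```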


Additional Considerations

  • Audio Preprocessing:
    In a production system, you might want to include more robust audio processing (e.g., noise reduction, resampling using a dedicated library) before passing the file to the model.

  • Error Handling:
    The code includes basic error handling for missing files and processing issues. You can extend these handlers to meet your application's needs.

  • Deployment:
    When deploying in production, consider using a production-grade server like Gunicorn behind a reverse proxy (e.g., Nginx).
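For the resampling mentioned under Audio Preprocessing, here is a minimal pure-NumPy sketch using linear interpolation (a dedicated library such as librosa or scipy will give better quality; resample_linear is an illustrative name, not a standard function):

```python
import numpy as np

def resample_linear(audio, sr_in, sr_out):
    """Resample a 1-D audio signal via linear interpolation."""
    n_out = int(round(len(audio) * sr_out / sr_in))
    # Positions of old and new samples on a common [0, 1) time axis
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio)

# One second of 44.1 kHz audio becomes 16000 samples at 16 kHz
resampled = resample_linear(np.zeros(44100), 44100, 16000)
```

Note that simply rewriting a file with a new sample-rate header does not resample it; the sample data itself must be interpolated as above.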

This setup transforms your offline Whisper-based speech recognition functionality into a RESTful Flask endpoint that you can call from any client.

Can you also provide a webpage that works with this Flask endpoint?
Note: I want to record the user's voice and send that.

Below is an example HTML page that uses the browser’s MediaRecorder API to record the user’s voice, then sends the recorded audio as a file to your Flask endpoint. In this example, the endpoint is assumed to be hosted on the same origin at /transcribe. (If it’s hosted elsewhere, simply update the URL.)

Note:
Most browsers will record audio as WebM/Opus. On the Flask side, if needed, you can use ffmpeg (or similar) to convert the file to WAV or another format that works best with Whisper. Alternatively, you can modify the Flask code to accept WebM input.
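As a sketch of that server-side conversion (assuming ffmpeg is on the PATH; convert_to_wav is an illustrative helper, not part of the Flask code above):

```python
import subprocess

def ffmpeg_cmd(src, dst, sample_rate=16000):
    # Build the ffmpeg command: mono, 16 kHz WAV, overwriting dst if present
    return ["ffmpeg", "-y", "-i", src, "-ar", str(sample_rate), "-ac", "1", dst]

def convert_to_wav(src, dst="converted.wav"):
    subprocess.run(ffmpeg_cmd(src, dst), check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return dst
```

You could call convert_to_wav on the uploaded file before passing it to the model, although Whisper's own ffmpeg-based loader already handles WebM input in practice.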


index.html

html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Speech Recognition Demo</title>
  <style>
    body { font-family: Arial, sans-serif; margin: 2em; }
    #recordButton { padding: 1em; font-size: 1.2em; }
    #status { margin-top: 1em; }
    #transcription { margin-top: 1em; font-size: 1.1em; color: #333; }
  </style>
</head>
<body>
  <h1>Speech Recognition Demo</h1>
  <button id="recordButton">Start Recording</button>
  <p id="status"></p>
  <h2>Transcription:</h2>
  <p id="transcription"></p>

  <script>
    let mediaRecorder;
    let audioChunks = [];
    const recordButton = document.getElementById('recordButton');
    const statusDisplay = document.getElementById('status');
    const transcriptionDisplay = document.getElementById('transcription');
    let isRecording = false;

    recordButton.addEventListener('click', async () => {
      if (!isRecording) {
        // Request microphone access and start recording.
        try {
          const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
          mediaRecorder = new MediaRecorder(stream);
          audioChunks = [];

          mediaRecorder.ondataavailable = event => {
            if (event.data && event.data.size > 0) {
              audioChunks.push(event.data);
            }
          };

          mediaRecorder.onstop = async () => {
            const audioBlob = new Blob(audioChunks, { type: 'audio/webm' });

            // Prepare the audio file for upload.
            const formData = new FormData();
            formData.append('audio', audioBlob, 'recording.webm');

            statusDisplay.textContent = 'Sending audio for transcription...';
            try {
              const response = await fetch('/transcribe', {
                method: 'POST',
                body: formData
              });
              const data = await response.json();
              transcriptionDisplay.textContent = data.transcription || 'No transcription available.';
            } catch (error) {
              transcriptionDisplay.textContent = 'Error: ' + error;
            }
            statusDisplay.textContent = '';
          };

          mediaRecorder.start();
          recordButton.textContent = 'Stop Recording';
          isRecording = true;
          statusDisplay.textContent = 'Recording...';
        } catch (error) {
          statusDisplay.textContent = 'Error accessing microphone: ' + error;
        }
      } else {
        // Stop the recording.
        mediaRecorder.stop();
        recordButton.textContent = 'Start Recording';
        isRecording = false;
        statusDisplay.textContent = 'Processing audio...';
      }
    });
  </script>
</body>
</html>

How It Works

  1. User Interaction:

    • The user clicks the Start Recording button.
    • The page requests microphone access via navigator.mediaDevices.getUserMedia({ audio: true }).
  2. Recording:

    • A MediaRecorder instance records the audio stream.
    • Recorded chunks are stored in an array.
  3. Stopping & Sending Audio:

    • When the user clicks the button again (now labeled Stop Recording), the recorder stops.
    • The recorded chunks are assembled into a Blob (with MIME type audio/webm).
    • The Blob is appended to a FormData object and sent via a fetch() POST request to the /transcribe endpoint.
  4. Display Results:

    • The Flask endpoint processes the audio and returns a JSON object with the transcription.
    • The transcription is then displayed on the webpage.

This simple webpage allows you to record audio in the browser and send it to your Flask endpoint where Whisper can process it. Adjust the MIME type or perform client-side conversion if your Flask endpoint requires a specific audio format.

Tags: Technology, Generative AI, Large Language Models
