Saturday, March 8, 2025

Using OpenAI Whisper-Tiny via Hugging Face for an Automatic Speech Recognition App (Research)


Can you show how to create a Flask-based endpoint that transcribes audio recorded from the browser using the above model? Also give the associated frontend code.

Below is a complete example that shows how to build a Flask endpoint using the Hugging Face model "openai/whisper-tiny" to transcribe audio files uploaded from a browser. The Flask app accepts an audio file (recorded in the browser), processes it with the Transformers WhisperProcessor and WhisperForConditionalGeneration model, and returns the transcription as JSON. A simple HTML/JavaScript front end is provided that records audio using the MediaRecorder API and sends it to the endpoint.


Flask Endpoint (app.py)

python
from flask import Flask, request, jsonify
import os
import uuid
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

app = Flask(__name__)

# Load the processor and model from Hugging Face
model_name = "openai/whisper-tiny"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

@app.route("/transcribe", methods=["POST"])
def transcribe():
    if "audio" not in request.files:
        return jsonify({"error": "No audio file provided."}), 400

    audio_file = request.files["audio"]

    # Save the incoming audio file temporarily
    temp_filename = f"temp_{uuid.uuid4().hex}.wav"
    audio_file.save(temp_filename)

    try:
        # Load the audio using librosa and resample to 16kHz
        audio, sr = librosa.load(temp_filename, sr=16000)

        # Process audio into input features expected by the model
        input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features

        # Generate transcription (default generation parameters; adjust if needed)
        predicted_ids = model.generate(input_features)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    except Exception as e:
        transcription = f"Error processing audio: {str(e)}"
    finally:
        os.remove(temp_filename)

    return jsonify({"transcription": transcription})

if __name__ == "__main__":
    # Run the server on port 5000 and listen on all interfaces.
    app.run(host="0.0.0.0", port=5000, debug=True)

Notes:

  • Make sure you have installed the required packages:
    bash
    pip install flask transformers torch librosa
  • The audio file is saved temporarily and then removed after transcription.
  • In most browsers the MediaRecorder API actually produces WebM/Opus (or Ogg) data even when the Blob is labeled audio/wav; librosa can usually still decode it, provided a suitable decoding backend such as ffmpeg is available on the server.
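
Once the server is running, you can sanity-check the endpoint without the browser front end. A minimal sketch, assuming the requests package is installed and sample.wav is a placeholder for any short recording you have on disk:

python
import requests

# Post a local audio file to the endpoint the same way the browser form does
with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:5000/transcribe",
        files={"audio": ("sample.wav", f, "audio/wav")},
    )
print(resp.json())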

Frontend (index.html)

This HTML page uses the MediaRecorder API to capture audio from the user's microphone, creates a Blob from the recorded data, and sends it via a POST request to the /transcribe endpoint.

html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Speech Transcription Demo</title>
  <style>
    body { font-family: Arial, sans-serif; margin: 2em; }
    #recordButton { padding: 1em; font-size: 1.2em; }
    #status { margin-top: 1em; }
    #transcription { margin-top: 1em; font-size: 1.1em; color: #333; }
  </style>
</head>
<body>
  <h1>Speech Transcription Demo</h1>
  <button id="recordButton">Start Recording</button>
  <p id="status"></p>
  <h2>Transcription:</h2>
  <p id="transcription"></p>

  <script>
    let mediaRecorder;
    let audioChunks = [];
    const recordButton = document.getElementById('recordButton');
    const statusDisplay = document.getElementById('status');
    const transcriptionDisplay = document.getElementById('transcription');
    let isRecording = false;

    recordButton.addEventListener('click', async () => {
      if (!isRecording) {
        try {
          // Request access to the microphone
          const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
          mediaRecorder = new MediaRecorder(stream);
          audioChunks = [];

          mediaRecorder.ondataavailable = event => {
            if (event.data && event.data.size > 0) {
              audioChunks.push(event.data);
            }
          };

          mediaRecorder.onstop = async () => {
            // Create a Blob from the recorded audio chunks
            const audioBlob = new Blob(audioChunks, { type: 'audio/wav' });
            const formData = new FormData();
            formData.append('audio', audioBlob, 'recording.wav');

            statusDisplay.textContent = 'Sending audio for transcription...';
            try {
              const response = await fetch('/transcribe', { method: 'POST', body: formData });
              const data = await response.json();
              transcriptionDisplay.textContent = data.transcription || 'No transcription available.';
            } catch (error) {
              transcriptionDisplay.textContent = 'Error: ' + error;
            }
            statusDisplay.textContent = '';
          };

          mediaRecorder.start();
          recordButton.textContent = 'Stop Recording';
          isRecording = true;
          statusDisplay.textContent = 'Recording...';
        } catch (error) {
          statusDisplay.textContent = 'Error accessing microphone: ' + error;
        }
      } else {
        // Stop recording and trigger transcription
        mediaRecorder.stop();
        recordButton.textContent = 'Start Recording';
        isRecording = false;
        statusDisplay.textContent = 'Processing audio...';
      }
    });
  </script>
</body>
</html>

Usage Instructions:

  • Place this index.html file in a directory served by your Flask app (or use a static file server); a minimal route for serving it directly from app.py is sketched after this list.
  • When you load the page, click Start Recording to record your voice. Click the button again to stop recording and send the audio to the server.
  • The transcription returned from the server will be displayed on the page.
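
If you prefer to serve the page from the same Flask app (so the fetch('/transcribe') call stays on the same origin), one possible route is sketched below; it assumes index.html sits next to app.py:

python
from flask import send_from_directory

@app.route("/")
def index():
    # Serve the recording page from the application directory
    return send_from_directory(".", "index.html")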

This setup gives you an end-to-end solution where your Flask endpoint leverages the Hugging Face "openai/whisper-tiny" model to transcribe audio recorded from the browser. Feel free to adjust generation parameters, error handling, and UI details as needed for your application.
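
For example, Whisper's generation can be nudged toward a specific language and task, and the output length can be capped. A hedged sketch (the parameter values are illustrative, not tuned; depending on your Transformers version, generate may also accept language/task arguments directly):

python
# Force English transcription and limit the number of generated tokens
forced_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
predicted_ids = model.generate(
    input_features,
    forced_decoder_ids=forced_ids,
    max_new_tokens=128,
)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]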




Can you rewrite the Python code to use TensorFlow as the backend for Transformers?

Below is an example of a Flask endpoint that uses the Hugging Face Transformers library with TensorFlow as the backend to load the "openai/whisper-tiny" model for transcription. In this example, we use the TensorFlow‑compatible model loader (via TFAutoModelForConditionalGeneration) and tell the processor to return TensorFlow tensors. The endpoint accepts an uploaded audio file, processes it with the model, and returns the transcription as JSON. The accompanying HTML page uses the MediaRecorder API to record audio in the browser and send it to the endpoint.


Flask Endpoint (app_tf.py)

python
from flask import Flask, request, jsonify
import os
import uuid
import librosa
from transformers import WhisperProcessor, TFAutoModelForConditionalGeneration

app = Flask(__name__)

# Use the TensorFlow version of the model
model_name = "openai/whisper-tiny"
processor = WhisperProcessor.from_pretrained(model_name)
model = TFAutoModelForConditionalGeneration.from_pretrained(model_name)

@app.route("/transcribe", methods=["POST"])
def transcribe():
    if "audio" not in request.files:
        return jsonify({"error": "No audio file provided."}), 400

    audio_file = request.files["audio"]

    # Save the audio file temporarily
    temp_filename = f"temp_{uuid.uuid4().hex}.wav"
    audio_file.save(temp_filename)

    try:
        # Load the audio with librosa and ensure a 16kHz sampling rate
        audio, sr = librosa.load(temp_filename, sr=16000)

        # Process the audio and get input features as a TensorFlow tensor
        inputs = processor(audio, sampling_rate=sr, return_tensors="tf")
        input_features = inputs.input_features

        # Generate transcription using the TF model (generate returns a TF tensor)
        predicted_ids = model.generate(input_features)

        # Convert predicted_ids tensor to numpy before decoding
        transcription = processor.batch_decode(predicted_ids.numpy(), skip_special_tokens=True)[0]
    except Exception as e:
        transcription = f"Error processing audio: {str(e)}"
    finally:
        os.remove(temp_filename)

    return jsonify({"transcription": transcription})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, debug=True)

Dependencies:
Make sure you install the necessary packages:

bash
pip install flask transformers tensorflow librosa

Frontend (index.html)

The front end is unchanged: the same index.html from the previous example (which records audio with the MediaRecorder API and posts it to the /transcribe endpoint) works with this endpoint as-is.

Summary

  • The Flask endpoint (app_tf.py) uses TensorFlow as the backend by loading the model with TFAutoModelForConditionalGeneration and returning TensorFlow tensors (with return_tensors="tf").
  • The Frontend (index.html) records audio using the browser’s MediaRecorder API, sends the recorded audio as a file to the Flask endpoint, and displays the transcription result.

This setup provides an end-to-end solution where a browser-based client records audio and sends it to a TensorFlow‑backed transcription endpoint using the Hugging Face "openai/whisper-tiny" model.




from transformers import WhisperProcessor, TFAutoModelForConditionalGeneration
ImportError: cannot import name 'TFAutoModelForConditionalGeneration' from 'transformers' (/home/ashish/anaconda3/envs/hf_202412/lib/python3.12/site-packages/transformers/__init__.py)

This error indicates that the class you're trying to import—TFAutoModelForConditionalGeneration—is not available in your current Transformers installation. Here are a few points to consider and possible solutions:


1. Version or Availability Issue

  • Transformers Version:
    Make sure you’re using a recent version of the Transformers library. You can upgrade with:

    bash
    pip install --upgrade transformers

    However, even with the latest version, not all model architectures have TensorFlow implementations.

  • TensorFlow Support for Whisper:
    The import fails because TFAutoModelForConditionalGeneration is not a class that Transformers exports at all, so upgrading alone will not fix it. Whisper models (like "openai/whisper-tiny") are implemented first and foremost in PyTorch; where your installed version ships TensorFlow support, it is exposed through the dedicated class TFWhisperForConditionalGeneration rather than a generic "ConditionalGeneration" auto class. A quick way to check what your installation actually exposes is sketched below.
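
A small check, run in the same environment as the Flask app, to see which Whisper classes can be imported before committing to a backend (exact behaviour can vary by Transformers version, so treat this as a sketch):

python
import transformers

print("transformers version:", transformers.__version__)

try:
    # PyTorch Whisper class: the officially supported path
    from transformers import WhisperForConditionalGeneration
    print("WhisperForConditionalGeneration (PyTorch) is importable")
except ImportError:
    print("WhisperForConditionalGeneration (PyTorch) is NOT importable")

try:
    # TensorFlow Whisper class: present only in versions that ship TF support
    from transformers import TFWhisperForConditionalGeneration
    print("TFWhisperForConditionalGeneration (TensorFlow) is importable")
except ImportError:
    print("TFWhisperForConditionalGeneration (TensorFlow) is NOT importable")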


2. Alternatives and Workarounds

  • Use the PyTorch Version:
    Since the Whisper model is officially available as a PyTorch model, you might consider using the PyTorch backend with:

    python
    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

    This approach is currently the recommended path for using Whisper.

  • Check for TF Support in Your Installed Version:
    If you specifically need a TensorFlow implementation, check the Transformers release notes or documentation for your installed version. Recent releases expose a dedicated TensorFlow Whisper class, TFWhisperForConditionalGeneration, although the TensorFlow path remains less mature and less widely used than the PyTorch one. A minimal sketch using it is shown after this list.

  • Don't Expect TFAutoModelForSeq2SeqLM to Help:
    TFAutoModelForSeq2SeqLM is the TensorFlow auto class for text-to-text sequence-to-sequence models. Whisper is a speech-to-text model and is not covered by that mapping, so loading the Whisper checkpoint through it will not work either.
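
For completeness, a minimal sketch of the TensorFlow path, assuming your installed Transformers version exposes TFWhisperForConditionalGeneration and that, as in the Flask examples above, the audio is loaded as a 16 kHz mono array ("recording.wav" is just a placeholder file name):

python
import librosa
from transformers import WhisperProcessor, TFWhisperForConditionalGeneration

model_name = "openai/whisper-tiny"
processor = WhisperProcessor.from_pretrained(model_name)
# Pass from_pt=True if no TensorFlow weights are found for the checkpoint
model = TFWhisperForConditionalGeneration.from_pretrained(model_name)

# Load and resample the audio to the 16 kHz the model expects
audio, sr = librosa.load("recording.wav", sr=16000)

inputs = processor(audio, sampling_rate=sr, return_tensors="tf")
predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids.numpy(), skip_special_tokens=True)[0]
print(transcription)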


Summary

  • Error Cause: The error is raised because TFAutoModelForConditionalGeneration is not a class exported by the Transformers library, regardless of version.
  • Recommendation: For now, use the PyTorch implementation (via WhisperForConditionalGeneration) when working with the "openai/whisper-tiny" model.
  • TensorFlow Option: If you must stay on TensorFlow, check whether your Transformers version provides TFWhisperForConditionalGeneration and use that class instead.

If you need further guidance or have additional questions, feel free to ask!

Tags: Technology, Large Language Models
