Saturday, March 8, 2025

Hour 2 - First steps in ASR development

To See All Articles About Technology: Index of Lessons in Technology
Note: We are using this model: 
https://huggingface.co/openai/whisper-tiny/tree/main
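For context, the models--openai--whisper-tiny directory inspected below is the Hugging Face Hub cache layout (blobs/, refs/ and snapshots/ subfolders). A minimal sketch of how such a folder gets created, assuming you download with cache_dir pointed at the current directory:

python
from transformers import WhisperProcessor

# Downloading with cache_dir set produces the models--openai--whisper-tiny
# cache layout (blobs/, refs/, snapshots/) inside the given folder.
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", cache_dir=".")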

$ pwd
/home/ashish/Desktop/Using OpenAI-Whisper-Tiny via HuggingFace for Automatic Speech Recognition app (Research)/models--openai--whisper-tiny

$ ls -lR
...
...
...

./snapshots/169d4a4341b33bc18d8881c4b69c2e104e1cc0af:
total 4
lrwxrwxrwx 1 ashish ashish 52 Mar  8 22:46 added_tokens.json -> ../../blobs/e3d256c988462aa153dcabe2aa38b8e9b436c06f
lrwxrwxrwx 1 ashish ashish 52 Mar  8 22:46 config.json -> ../../blobs/417aa9de49a132dd3eb6a56d3be2718b15f08917
lrwxrwxrwx 1 ashish ashish 52 Mar  8 22:46 generation_config.json -> ../../blobs/4b26dd66b8f7bca37d851d259fdc118315cacc62
lrwxrwxrwx 1 ashish ashish 52 Mar  8 22:46 merges.txt -> ../../blobs/6038932a2a1f09a66991b1c2adae0d14066fa29e
lrwxrwxrwx 1 ashish ashish 76 Mar  8 22:46 model.safetensors -> ../../blobs/7ebd0e69e78190ffe1438491fa05cc1f5c1aa3a4c4db3bc1723adbb551ea2395
lrwxrwxrwx 1 ashish ashish 52 Mar  8 22:46 normalizer.json -> ../../blobs/dd6ae819ad738ac1a546e9f9282ef325c33b9ea0
lrwxrwxrwx 1 ashish ashish 52 Mar  8 22:46 preprocessor_config.json -> ../../blobs/c2048dfa9fd94a052e62e908d2c4dfb18534b4d2
lrwxrwxrwx 1 ashish ashish 52 Mar  8 22:46 special_tokens_map.json -> ../../blobs/bf69932dca4b3719b59fdd8f6cc1978109509f6c
lrwxrwxrwx 1 ashish ashish 52 Mar  8 22:46 tokenizer_config.json -> ../../blobs/d13b786c04765fb1a06492b53587752cd67665ea
lrwxrwxrwx 1 ashish ashish 52 Mar  8 22:46 tokenizer.json -> ../../blobs/1e95340ff836fad1b5932e800fb7b8c5e6d78a74
lrwxrwxrwx 1 ashish ashish 52 Mar  8 22:46 vocab.json -> ../../blobs/d7016e21da8776c8a9d577d0f559600f09a240eb


I am getting this error with my local setup:
OSError: ./models--openai--whisper-tiny does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co/./models--openai--whisper-tiny/tree/main' for available files.

But the file is present in my model directory, as "ls -lR" shows:
lrwxrwxrwx 1 ashish ashish 52 Mar  8 22:46 preprocessor_config.json -> ../../blobs/c2048dfa9fd94a052e62e908d2c4dfb18534b4d2

This error usually happens when the model directory contains symbolic links (symlinks) such as preprocessor_config.json that point to blob files outside the snapshot folder, and the Transformers library can't resolve them. Even though your ls output shows the symlink, if the target blob file isn't actually available at the expected relative location, you'll get this error.
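A quick way to tell a broken symlink apart from a genuinely missing file is to compare os.path.lexists (which does not follow links) with os.path.exists (which does). A minimal sketch; the snapshot path below is copied from the ls output above, so adjust it for your machine:

python
import os

# Path copied from the ls -lR output above; adjust for your setup.
f = ("./models--openai--whisper-tiny/snapshots/"
     "169d4a4341b33bc18d8881c4b69c2e104e1cc0af/preprocessor_config.json")

print(os.path.lexists(f))   # True: the symlink entry itself exists
print(os.path.exists(f))    # False if the target is missing (broken link)
print(os.path.realpath(f))  # Where the symlink actually points after resolution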

Here are some steps to resolve it:

  1. Check the Symlink Target:
    Run:

    bash
    readlink preprocessor_config.json

    Verify that the target file (e.g. ../../blobs/c2048dfa9fd94a052e62e908d2c4dfb18534b4d2) exists relative to the symlink's own location (the snapshots/<hash> folder). If it doesn't, the symlink is broken. (The Python sketch after this list checks every file in the snapshot folder at once.)

  2. Properly Download the Repository:
    The recommended way to obtain the model files is to clone the repository using Git LFS. For example:

    bash
    git lfs install
    git clone https://huggingface.co/openai/whisper-tiny

    This ensures that all the blob files are downloaded and the symlinks point to real files.

  3. Manually Fix the Symlink:
    If you already downloaded the files manually and have the blob file somewhere else, copy that blob file into the expected relative path so the symlink resolves. Alternatively, you can replace the symlink with the actual file:

    bash
    cp ../../blobs/c2048dfa9fd94a052e62e908d2c4dfb18534b4d2 preprocessor_config.json

    (Run this from inside the snapshots/<hash> folder so the relative source path resolves.)

  4. Use the Hugging Face Hub API:
    Instead of manually handling files, you can let the Transformers library handle the download and caching by specifying the model repository name. For example:

    python
    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    model_dir = "openai/whisper-tiny"  # This downloads the files and caches them properly
    processor = WhisperProcessor.from_pretrained(model_dir)
    model = WhisperForConditionalGeneration.from_pretrained(model_dir)

    This avoids the need to manage symlinks manually (see also the snapshot_download sketch below).
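
To automate steps 1 and 3, here is a small Python sketch that walks a snapshot folder and flags any broken symlinks; the snapshot path is an assumption copied from the ls output earlier:

python
import os

# Snapshot path copied from the ls -lR output earlier; adjust for your machine.
snapshot_dir = ("./models--openai--whisper-tiny/snapshots/"
                "169d4a4341b33bc18d8881c4b69c2e104e1cc0af")

for name in sorted(os.listdir(snapshot_dir)):
    path = os.path.join(snapshot_dir, name)
    if os.path.islink(path) and not os.path.exists(path):
        # The symlink entry exists but its blob target does not resolve.
        print(f"BROKEN  {name} -> {os.readlink(path)}")
    else:
        print(f"OK      {name}")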

By ensuring that the symlinks are valid (either by using Git LFS to clone the repository or by copying the actual files in place), the Transformers library should be able to load the preprocessor configuration without error.
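
If you would rather not deal with symlinks at all, huggingface_hub's snapshot_download can materialize the whole repository into a plain folder. A sketch; the destination folder name is arbitrary, and on older huggingface_hub versions you may additionally need local_dir_use_symlinks=False to force real files:

python
from huggingface_hub import snapshot_download

# Download every file in the repo into a plain local folder (real files).
# "./whisper-tiny" is an arbitrary destination of your choosing.
local_dir = snapshot_download(repo_id="openai/whisper-tiny",
                              local_dir="./whisper-tiny")
print(local_dir)  # Pass this folder to from_pretrained()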

Let me know if you need more help!

If you don't fix the symlinks manually, you will get the below error:

04:04 ~ $ python app_pt.py 
2025-03-09 04:06:00.012059: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.10/site-packages/cv2/../../lib64:
2025-03-09 04:06:00.012096: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "/home/keshavpawar137/app_pt.py", line 13, in [module]
    processor = WhisperProcessor.from_pretrained(model_dir)
  File "/home/keshavpawar137/.local/lib/python3.10/site-packages/transformers/processing_utils.py", line 1070, in from_pretrained
    args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/keshavpawar137/.local/lib/python3.10/site-packages/transformers/processing_utils.py", line 1116, in _get_arguments_from_pretrained
    args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
  File "/home/keshavpawar137/.local/lib/python3.10/site-packages/transformers/feature_extraction_utils.py", line 385, in from_pretrained
    feature_extractor_dict, kwargs = cls.get_feature_extractor_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/keshavpawar137/.local/lib/python3.10/site-packages/transformers/feature_extraction_utils.py", line 511, in get_feature_extractor_dict
    resolved_feature_extractor_file = cached_file(
  File "/home/keshavpawar137/.local/lib/python3.10/site-packages/transformers/utils/hub.py", line 313, in cached_file
    raise EnvironmentError(
OSError: ./models--openai--whisper-tiny does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co/./models--openai--whisper-tiny/tree/main' for available files.

Manually Fixing the Symlinks

$ pwd
/home/ashish/Desktop/Using OpenAI-Whisper-Tiny via HuggingFace for Automatic Speech Recognition app (Research)/models--openai--whisper-tiny

$ cp ./blobs/e3d256c988462aa153dcabe2aa38b8e9b436c06f added_tokens.json
$ cp ./blobs/417aa9de49a132dd3eb6a56d3be2718b15f08917 config.json
$ cp ./blobs/4b26dd66b8f7bca37d851d259fdc118315cacc62 generation_config.json
$ cp ./blobs/6038932a2a1f09a66991b1c2adae0d14066fa29e merges.txt
$ cp ./blobs/7ebd0e69e78190ffe1438491fa05cc1f5c1aa3a4c4db3bc1723adbb551ea2395 model.safetensors
$ cp ./blobs/dd6ae819ad738ac1a546e9f9282ef325c33b9ea0 normalizer.json
$ cp ./blobs/c2048dfa9fd94a052e62e908d2c4dfb18534b4d2 preprocessor_config.json
$ cp ./blobs/bf69932dca4b3719b59fdd8f6cc1978109509f6c special_tokens_map.json
$ cp ./blobs/d13b786c04765fb1a06492b53587752cd67665ea tokenizer_config.json
$ cp ./blobs/1e95340ff836fad1b5932e800fb7b8c5e6d78a74 tokenizer.json
$ cp ./blobs/d7016e21da8776c8a9d577d0f559600f09a240eb vocab.json

~~~

05:34 ~/mysite $ ls
__pycache__  flask_app.py  models--openai--whisper-tiny

Error log: https://www.pythonanywhere.com/user/keshavpawar137/files/var/log/keshavpawar137.pythonanywhere.com.error.log

2025-03-09 05:31:23,560: OSError: Incorrect path_or_model_id: './models--openai--whisper-tiny'. Please provide either the path to a local folder or the repo_id of a model on the Hub.
2025-03-09 05:31:23,560:   File "/var/www/keshavpawar137_pythonanywhere_com_wsgi.py", line 16, in <module>
2025-03-09 05:31:23,560:     from flask_app import app as application  # noqa
2025-03-09 05:31:23,560:
2025-03-09 05:31:23,560:   File "/home/keshavpawar137/mysite/flask_app.py", line 23, in <module>
2025-03-09 05:31:23,561:     processor = WhisperProcessor.from_pretrained(model_dir)
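
The "Incorrect path_or_model_id" error above is typically raised when the relative path './models--openai--whisper-tiny' does not exist from the process's current working directory; on PythonAnywhere, the WSGI process does not start in ~/mysite. A sketch that sidesteps this by building an absolute path from the location of flask_app.py itself (assuming the copied files sit in models--openai--whisper-tiny next to flask_app.py):

python
import os
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Resolve the model folder relative to this file, not the process's cwd,
# so it works no matter where the WSGI server starts.
model_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                         "models--openai--whisper-tiny")

processor = WhisperProcessor.from_pretrained(model_dir)
model = WhisperForConditionalGeneration.from_pretrained(model_dir)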
Tags: Large Language Models, Technology
