In the rapidly evolving landscape of artificial intelligence, speed and accuracy are paramount. A new implementation called "insanely-fast-whisper" is taking the AI community by storm, promising to transcribe 2.5 hours of audio in just 98 seconds, all while running locally on your Mac or Nvidia GPU. This groundbreaking tool leverages OpenAI's Whisper model and the Pyannote library to deliver unprecedented transcription speeds and speaker segmentation capabilities. Let's delve into how you can harness this powerful tool.
What is Insanely Fast Whisper?
Insanely Fast Whisper is an opinionated command-line interface (CLI) designed to transcribe audio files quickly and efficiently using Whisper's large v3 model from OpenAI, enhanced with the Pyannote library for speaker diarization. The tool is optimized to run on both Mac (using Apple Silicon's mps backend) and Nvidia GPUs, ensuring broad accessibility and performance.
Key Features
- Blazing Fast Transcriptions: Transcribe 150 minutes (2.5 hours) of audio in less than 98 seconds.
- Local Processing: Runs entirely on-device, eliminating the need for internet connectivity and enhancing privacy.
- Advanced Optimizations: Utilizes Flash Attention 2 and other optimization techniques for rapid processing.
Installation and Setup
To get started with Insanely Fast Whisper, you'll need to install the tool via pip or pipx. Here's a step-by-step guide:
- Install Insanely Fast Whisper:

```bash
pip install insanely-fast-whisper
```

Alternatively, you can use pipx for a more isolated environment:

```bash
pipx install insanely-fast-whisper
```
- Run the CLI (this example targets a Mac with Apple Silicon via the mps backend and passes a Hugging Face token for speaker diarization):

```bash
insanely-fast-whisper --file-name <FILE NAME or URL> --batch-size 2 --device-id mps --hf-token <HF TOKEN>
```
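After installing, you can sanity-check the install from Python before running anything heavy. This is a small sketch using the standard library's importlib.metadata; it assumes the pip distribution name matches the one used in the install command above.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string for a pip distribution, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# After a successful "pip install insanely-fast-whisper" this prints
# the installed version; before installation it prints None.
print(installed_version("insanely-fast-whisper"))
```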
Using Insanely Fast Whisper
Once installed, you can transcribe audio files using a simple command. Here’s how:
- Basic transcription:

```bash
insanely-fast-whisper --file-name <filename or URL>
```

- Using Flash Attention 2 for even faster processing:

```bash
insanely-fast-whisper --file-name <filename or URL> --flash True
```

- Running on Mac (Apple Silicon):

```bash
insanely-fast-whisper --file-name <filename or URL> --device-id mps
```

- Using the Distil-Whisper model:

```bash
insanely-fast-whisper --model-name distil-whisper/large-v2 --file-name <filename or URL>
```
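The CLI writes its result to a JSON file (output.json by default). A small Python helper can turn it into a readable timestamped listing. This sketch assumes the common layout of a top-level "text" field plus per-segment "chunks" entries with "timestamp" ranges; field names may differ between versions, so adjust it to match your own output.json.

```python
import json


def chunks_to_lines(transcript: dict) -> list[str]:
    """Format transcript chunks as '[start-end] text' lines.

    Assumed layout: {"text": ..., "chunks": [{"timestamp": [start, end],
    "text": ...}, ...]} -- verify against your actual output.json.
    """
    lines = []
    for chunk in transcript.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{start:.1f}s-{end:.1f}s] {chunk['text'].strip()}")
    return lines


# In practice you would load the real file:
#   with open("output.json") as f:
#       transcript = json.load(f)
sample = {
    "text": "Hello world. This is a test.",
    "chunks": [
        {"timestamp": [0.0, 2.5], "text": " Hello world."},
        {"timestamp": [2.5, 5.0], "text": " This is a test."},
    ],
}
for line in chunks_to_lines(sample):
    print(line)
```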
Benchmark Results
The speed and efficiency of Insanely Fast Whisper have been benchmarked extensively. Here are some results using an Nvidia A100 - 80GB GPU:
| Optimization Type | Time to Transcribe (150 mins of Audio) |
|---|---|
| large-v3 (Transformers) (fp32) | ~31 minutes |
| large-v3 (Transformers) (fp16 + batching [24] + bettertransformer) | ~5 minutes |
| large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2) | ~2 minutes |
| distil-large-v2 (Transformers) (fp16 + batching [24] + bettertransformer) | ~3 minutes |
| distil-large-v2 (Transformers) (fp16 + batching [24] + Flash Attention 2) | ~1 minute |
| large-v2 (Faster Whisper) (fp16 + beam_size [1]) | ~9 minutes |
| large-v2 (Faster Whisper) (8-bit + beam_size [1]) | ~8 minutes |
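To put the headline number in context: 150 minutes of audio in 98 seconds is a real-time factor of roughly 92x. A quick calculation reproduces that figure and the approximate speedup implied by the table (using the rounded times above, so the ratios are approximate too):

```python
def real_time_factor(audio_minutes: float, transcribe_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    return (audio_minutes * 60) / transcribe_seconds


# Headline claim: 150 minutes of audio transcribed in 98 seconds.
print(round(real_time_factor(150, 98), 1))  # 91.8 (about 92x real time)

# Approximate speedup of fp16 + Flash Attention 2 over the fp32
# baseline, from the rounded table times (~31 min vs ~2 min).
print(round(31 / 2, 1))  # 15.5
```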
Advanced CLI Options
Insanely Fast Whisper provides a range of options to customize your transcription experience:
```text
-h, --help                  Show help message and exit
--file-name FILE_NAME       Path or URL to the audio file to be transcribed
--device-id DEVICE_ID       Device ID for your GPU. Use "mps" for Macs with Apple Silicon
--transcript-path TRANSCRIPT_PATH
                            Path to save the transcription output (default: output.json)
--model-name MODEL_NAME     Name of the pretrained model/checkpoint (default: openai/whisper-large-v3)
--task {transcribe,translate}
                            Task to perform (default: transcribe)
--language LANGUAGE         Language of the input audio (default: auto-detect)
--batch-size BATCH_SIZE     Number of parallel batches to compute (default: 24)
--flash FLASH               Use Flash Attention 2 (default: False)
--timestamp {chunk,word}    Timestamp granularity (default: chunk)
--hf-token HF_TOKEN         Hugging Face token for Pyannote audio diarization
--diarization_model DIARIZATION_MODEL
                            Model for speaker diarization (default: pyannote/speaker-diarization)
--num-speakers NUM_SPEAKERS Exact number of speakers (default: None)
--min-speakers MIN_SPEAKERS Minimum number of speakers (default: None)
--max-speakers MAX_SPEAKERS Maximum number of speakers (default: None)
```
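These flags are easy to script. Below is a minimal Python sketch that assembles a CLI invocation from a few of the options listed above; the flag names come straight from that list, but the wrapper itself (its function name and which options it exposes) is just an illustration.

```python
import subprocess


def build_command(file_name: str, device_id: str = "0",
                  batch_size: int = 24, flash: bool = False,
                  model_name: str = "openai/whisper-large-v3") -> list[str]:
    """Assemble an insanely-fast-whisper command line from common flags."""
    cmd = [
        "insanely-fast-whisper",
        "--file-name", file_name,
        "--device-id", device_id,
        "--batch-size", str(batch_size),
        "--model-name", model_name,
    ]
    if flash:
        cmd += ["--flash", "True"]
    return cmd


cmd = build_command("interview.mp3", device_id="mps", batch_size=4)
print(" ".join(cmd))
# To actually run it once the CLI is installed:
#   subprocess.run(cmd, check=True)
```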
Conclusion
Insanely Fast Whisper is a game-changer in the field of audio transcription, offering unparalleled speed and accuracy. Its ability to run locally on Mac and Nvidia GPUs makes it a versatile tool for developers and researchers alike. Whether you're transcribing lengthy interviews or need quick speaker segmentation, Insanely Fast Whisper delivers performance that sets a new standard.
For more information and to get started, visit the Insanely Fast Whisper GitHub repository.