Open-source subtitle generation for seamless content translation.
Key Features:
I made this project for fun, but I think it could also be useful for other people.
First, you need to install FFmpeg. Here's how you can do it:
# On Linux (Debian/Ubuntu)
sudo apt install ffmpeg
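To confirm the installation worked before running the script, a quick check along these lines may help. This is a minimal sketch, not part of the project: the helper name `ffmpeg_available` is hypothetical, and it only assumes the executable is called `ffmpeg` and sits on your PATH.

```python
import shutil


def ffmpeg_available(binary: str = "ffmpeg") -> bool:
    """Return True if the given executable can be found on PATH."""
    # shutil.which returns the full path to the executable, or None if absent.
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("FFmpeg found")
    else:
        print("FFmpeg not found; install it before running subtitle.py")
```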
You can run the script from the command line using the following command:
python subtitle.py <filepath | video_url> [--model <modelname>]
Replace <filepath | video_url> with the path to your video file or a video URL. The --model argument is optional; if it is not provided, the script uses 'base' as the default model.
For example:
python subtitle.py /path/to/your/video.mp4 --model base
This runs the script on the video at /path/to/your/video.mp4 using the base model. Replace /path/to/your/video.mp4 with the actual path to your video file.
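The command-line shape described above (one positional source plus an optional --model that defaults to 'base') could be parsed roughly as follows. This is an illustrative sketch using Python's standard argparse module; it is an assumption about how such an interface might be written, not the actual code in subtitle.py, and the names `build_parser` and `source` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring: subtitle.py <filepath | video_url> [--model <modelname>]."""
    parser = argparse.ArgumentParser(
        prog="subtitle.py",
        description="Generate subtitles for a video file or video URL.",
    )
    # Positional argument: either a local path or a URL (hypothetical name).
    parser.add_argument("source", help="path to a video file, or a video URL")
    # Optional argument: falls back to 'base' when omitted, as the text states.
    parser.add_argument("--model", default="base", help="model name (default: base)")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["/path/to/your/video.mp4"])
print(args.source, args.model)
```

Passing `["/path/to/your/video.mp4", "--model", "tiny"]` instead would set `args.model` to the supplied name.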
Here are the models you can use:
Note: Use a .en model only when the video is in English.
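The rule about .en models can be expressed as a small guard. This is a sketch under assumptions: it treats any model name ending in `.en` as English-only (matching file names such as `ggml-base.en.bin`), uses `en` as the language code as in the `-l` option, and the function names `is_english_only` and `check_model` are hypothetical rather than part of the project.

```python
def is_english_only(model: str) -> bool:
    """Return True if the model name carries the '.en' English-only suffix."""
    return model.endswith(".en")


def check_model(model: str, language: str) -> None:
    """Raise ValueError when an English-only model is paired with another language."""
    if is_english_only(model) and language != "en":
        raise ValueError(
            f"Model '{model}' is English-only but the language is '{language}'; "
            "choose a model without the .en suffix."
        )


# An English video with an English-only model passes silently.
check_model("base.en", "en")
# A non-English video with a multilingual model also passes.
check_model("base", "fr")
```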
You can modify the behaviour by passing options to the whisper binary as follows:
./whisper [options] file0.wav file1.wav ...
Here are the options you can use with the whisper binary:
Option | Default | Description |
---|---|---|
-h, --help | | Show help message and exit |
-t N, --threads N | 4 | Number of threads to use during computation |
-p N, --processors N | 1 | Number of processors to use during computation |
-ot N, --offset-t N | 0 | Time offset in milliseconds |
-on N, --offset-n N | 0 | Segment index offset |
-d N, --duration N | 0 | Duration of audio to process in milliseconds |
-mc N, --max-context N | -1 | Maximum number of text context tokens to store |
-ml N, --max-len N | 0 | Maximum segment length in characters |
-sow, --split-on-word | false | Split on word rather than on token |
-bo N, --best-of N | 2 | Number of best candidates to keep |
-bs N, --beam-size N | -1 | Beam size for beam search |
-wt N, --word-thold N | 0.01 | Word timestamp probability threshold |
-et N, --entropy-thold N | 2.40 | Entropy threshold for decoder fail |
-lpt N, --logprob-thold N | -1.00 | Log probability threshold for decoder fail |
-debug, --debug-mode | false | Enable debug mode (e.g. dump log_mel) |
-tr, --translate | false | Translate from source language to English |
-di, --diarize | false | Stereo audio diarization |
-tdrz, --tinydiarize | false | Enable tinydiarize (requires a tdrz model) |
-nf, --no-fallback | false | Do not use temperature fallback while decoding |
-otxt, --output-txt | true | Output result in a text file |
-ovtt, --output-vtt | false | Output result in a vtt file |
-osrt, --output-srt | false | Output result in an srt file |
-olrc, --output-lrc | false | Output result in an lrc file |
-owts, --output-words | false | Output script for generating karaoke video |
-fp, --font-path | /System/Library/Fonts/Supplemental/Courier New Bold.ttf | Path to a monospace font for karaoke video |
-ocsv, --output-csv | false | Output result in a CSV file |
-oj, --output-json | false | Output result in a JSON file |
-ojf, --output-json-full | false | Include more information in the JSON file |
-of FNAME, --output-file FNAME | | Output file path (without file extension) |
-ps, --print-special | false | Print special tokens |
-pc, --print-colors | false | Print colors |
-pp, --print-progress | false | Print progress |
-nt, --no-timestamps | false | Do not print timestamps |
-l LANG, --language LANG | en | Spoken language ('auto' for auto-detect) |
-dl, --detect-language | false | Exit after automatically detecting language |
--prompt PROMPT | | Initial prompt |
-m FNAME, --model FNAME | models/ggml-base.en.bin | Model path |
-f FNAME, --file FNAME | | Input WAV file path |
-oved D, --ov-e-device DNAME | CPU | The OpenVINO device used for encode inference |
-ls, --log-score | false | Log best decoder scores of tokens |
-ng, --no-gpu | false | Disable GPU |
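Because the -f option takes a WAV file, audio usually has to be extracted from the video first, which is where FFmpeg comes in. The sketch below builds such a command in Python. It assumes the binary wants 16 kHz mono 16-bit PCM input, as whisper.cpp builds typically do; the helper name `build_ffmpeg_command` and the file names are hypothetical.

```python
import subprocess
from typing import List


def build_ffmpeg_command(video_path: str, wav_path: str) -> List[str]:
    """Build an FFmpeg command that extracts 16 kHz mono 16-bit PCM audio."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it already exists
        "-i", video_path,     # input video (or audio) file
        "-ar", "16000",       # resample to 16 kHz (assumed requirement)
        "-ac", "1",           # downmix to a single channel
        "-c:a", "pcm_s16le",  # encode as 16-bit little-endian PCM
        wav_path,
    ]


cmd = build_ffmpeg_command("video.mp4", "out.wav")
print(" ".join(cmd))
# To actually run the conversion (requires FFmpeg on PATH):
# subprocess.run(cmd, check=True)
```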
Here's an example of how to use the whisper binary:
./whisper -m models/ggml-tiny.en.bin -f Rev.mp3 out.wav -nt --output-vtt
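If you drive the binary from Python rather than a shell, the same kind of invocation can be assembled as an argument list. This is a hedged sketch: the function `build_whisper_command` and its parameters are hypothetical, the binary is assumed to live at `./whisper`, and only flags listed in the options table above are used.

```python
import subprocess
from typing import List


def build_whisper_command(
    wav_path: str,
    model_path: str = "models/ggml-base.en.bin",
    language: str = "en",
    output_vtt: bool = True,
    no_timestamps: bool = False,
    binary: str = "./whisper",
) -> List[str]:
    """Assemble a whisper invocation from options in the table above."""
    cmd = [binary, "-m", model_path, "-f", wav_path, "-l", language]
    if output_vtt:
        cmd.append("--output-vtt")  # write the result to a vtt file
    if no_timestamps:
        cmd.append("-nt")           # do not print timestamps
    return cmd


cmd = build_whisper_command(
    "out.wav", model_path="models/ggml-tiny.en.bin", no_timestamps=True
)
print(" ".join(cmd))
# To run it (requires the whisper binary and the model file to exist):
# subprocess.run(cmd, check=True)
```

Building the command as a list and passing it to `subprocess.run` without a shell avoids quoting problems with paths that contain spaces.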
Just trying to be a Developer!
For support, email vedgupta@protonmail.com.
Author: innovatorved
Source Code: https://github.com/innovatorved/subtitle
License: MIT license