This Python script processes .vtt subtitle files downloaded with yt-dlp from YouTube or similar platforms. It merges overlapping subtitle segments and cleans the text by removing excess whitespace. The processed subtitles are written to a new text file with a timestamped filename.
- Subtitle Merging: Combines multiple subtitle entries into a single entry, considering overlaps.
- Text Cleaning: Cleans subtitle text by replacing newline characters and reducing multiple spaces to a single space.
- Output: Generates a cleaned and merged text file for each .vtt file in the specified directory.
- Python 3
- webvtt-py library for parsing .vtt files
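
The cleaning and merging behaviour described above can be sketched in plain Python. This is a minimal illustration, not the script's actual code: the function names `clean_text` and `merge_overlapping` are hypothetical, and the word-level overlap rule is an assumption about how repeated text in consecutive captions (common in auto-generated subtitles) might be collapsed. In the real script the caption strings would come from webvtt-py rather than a hard-coded list.

```python
import re


def clean_text(text: str) -> str:
    """Replace newlines with spaces and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.replace("\n", " ")).strip()


def merge_overlapping(texts):
    """Join caption texts, dropping words repeated at each boundary."""
    merged = []
    for text in texts:
        words = clean_text(text).split()
        # Find the longest suffix of `merged` that is also a prefix of `words`.
        overlap = 0
        for size in range(min(len(merged), len(words)), 0, -1):
            if merged[-size:] == words[:size]:
                overlap = size
                break
        merged.extend(words[overlap:])
    return " ".join(merged)


# Hypothetical caption texts, as rolling auto-generated subtitles often look.
captions = ["hello   world", "hello world\nthis is", "this is a test"]
result = merge_overlapping(captions)
print(result)
```

Comparing whole words rather than characters keeps the sketch simple, but it would miss overlaps that differ in punctuation or capitalization.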
- Obtain Subtitles: First, download the subtitles using yt-dlp with one of the following commands:
  - For automatic subtitles (machine-generated):
      yt-dlp --write-auto-sub --sub-lang ru --skip-download YOUR_URL
  - For manual subtitles (provided by the uploader):
      yt-dlp --write-subs --sub-lang ru --skip-download YOUR_URL
- Script Execution: Place the script in a directory one level above the directory containing the .vtt files (or modify the script's path variable as needed). Run the script.
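
The directory layout from the execution step might translate into code roughly as follows. This is a sketch under assumptions: the folder name `subtitles` and the helper `find_vtt_files` are hypothetical, and the script's real path variable may be defined differently.

```python
from pathlib import Path


def find_vtt_files(path):
    """Return the .vtt files directly inside `path`, sorted by name."""
    directory = Path(path)
    if not directory.is_dir():
        return []
    return sorted(directory.glob("*.vtt"))


# "subtitles" is a placeholder for the folder that holds the downloaded files.
path = Path("subtitles")
for vtt_file in find_vtt_files(path):
    print(vtt_file.name)
```

Returning an empty list for a missing folder lets the loop simply do nothing instead of failing when the path is wrong.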
The script outputs processed subtitles in a new file named with the pattern content_YYYY-MM-DD-HH-MM-SS.txt, where the timestamp corresponds to the script execution time.
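
A filename following this pattern can be built with the standard library's `datetime.strftime`. The helper name `output_filename` is hypothetical and only illustrates the naming scheme.

```python
from datetime import datetime


def output_filename(now=None):
    """Build a name like content_YYYY-MM-DD-HH-MM-SS.txt for the given time."""
    if now is None:
        now = datetime.now()  # default to the moment the script runs
    return now.strftime("content_%Y-%m-%d-%H-%M-%S.txt")


print(output_filename())
```

Accepting the time as an optional argument makes the function easy to check with a fixed value, for example `output_filename(datetime(2024, 1, 2, 3, 4, 5))`.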
Errors during the processing of .vtt files (such as malformed files) are logged to the console.
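
One way such per-file error reporting could work is sketched below. The function `process_all` and its `parse` argument are hypothetical stand-ins for whatever the script uses to read a .vtt file, and catching the broad `Exception` class is an assumption made to keep the example short.

```python
def process_all(files, parse):
    """Parse each file, printing errors instead of stopping on the first one."""
    results = {}
    for name in files:
        try:
            results[name] = parse(name)
        except Exception as exc:  # assumption: any parse failure is reported
            print(f"Error processing {name}: {exc}")
    return results


def fake_parse(name):
    """Stand-in parser that rejects one file to simulate a malformed input."""
    if name == "bad.vtt":
        raise ValueError("malformed file")
    return "ok"


parsed = process_all(["good.vtt", "bad.vtt"], fake_parse)
print(parsed)
```

With this structure a single malformed file is reported on the console while the remaining files are still processed.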