Without further ado, let’s get to it.

Goal:

Summarize a YouTube video.

Installation

Install Python 3.10. I strongly recommend using pyenv.

Install ffmpeg and yt-dlp. The commands below are for macOS.

# ffmpeg
brew install ffmpeg
# yt-dlp
curl -L -o yt-dlp_macos https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp_macos
chmod +x yt-dlp_macos
sudo mv yt-dlp_macos /usr/local/bin/yt-dlp

Install Whisper, preferably inside a virtualenv:

pip install -U openai-whisper

Usage

Download the video's audio track. The format selector ba is short for bestaudio.

yt-dlp -f "ba" -o "to_transcribe.%(ext)s" "https://youtube.com/watch?v=6VhD5QyDEh4"

If yt-dlp downloaded a webm file, convert it to aac:

ffmpeg -y -i "to_transcribe.webm" "to_transcribe.aac"
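
The "only if it is webm" condition can be scripted. Here is a minimal sketch in Python, assuming ffmpeg is on your PATH. The helper name conversion_command is made up for illustration; it only builds the argument list and leaves the actual call to subprocess.run.

```python
from pathlib import Path
from typing import List, Optional


def conversion_command(audio_path: str) -> Optional[List[str]]:
    """Return the ffmpeg argv that converts a .webm file to .aac, or None.

    Hypothetical helper: it mirrors the ffmpeg command above and skips
    files that are not webm.
    """
    source = Path(audio_path)
    if source.suffix.lower() != ".webm":
        return None  # nothing to convert
    target = source.with_suffix(".aac")
    return ["ffmpeg", "-y", "-i", str(source), str(target)]


# Usage (this would run ffmpeg, so it is left commented out):
# import subprocess
# cmd = conversion_command("to_transcribe.webm")
# if cmd is not None:
#     subprocess.run(cmd, check=True)
```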

Run whisper:

python -m whisper --model medium to_transcribe.aac
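
The same step can also be done from Python rather than the command line. The sketch below uses the load_model and transcribe calls from the openai-whisper package; the function names transcribe_to_text and transcript_path are my own. Unlike the CLI, it writes the transcript as one plain string and produces no other output files.

```python
from pathlib import Path


def transcript_path(audio_path: str) -> Path:
    """Return the .txt path that sits next to the given audio file."""
    return Path(audio_path).with_suffix(".txt")


def transcribe_to_text(audio_path: str, model_name: str = "medium") -> Path:
    """Transcribe audio_path with Whisper and save the text as a .txt file."""
    import whisper  # lazy import: provided by the openai-whisper package

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    output_path = transcript_path(audio_path)
    output_path.write_text(result["text"], encoding="utf-8")
    return output_path


# Usage (downloads the model on first run and takes a while):
# transcribe_to_text("to_transcribe.aac")
```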

If you have more than 10 GB of RAM or GPU memory, you can change the model from medium to large, but the transcription will take about twice as long.
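
That rule of thumb is easy to encode if you script the pipeline. This is a rough sketch: the 10 GB threshold comes from the paragraph above, and the function names are hypothetical. You pass in the amount of memory yourself, since detecting it depends on your platform.

```python
from typing import List


def pick_model(memory_gb: float, threshold_gb: float = 10.0) -> str:
    """Return "large" when more than threshold_gb is available, else "medium"."""
    return "large" if memory_gb > threshold_gb else "medium"


def whisper_command(audio_path: str, memory_gb: float) -> List[str]:
    """Build the whisper CLI argv with the model chosen by pick_model."""
    return ["python", "-m", "whisper", "--model", pick_model(memory_gb), audio_path]


# Usage (this would start a transcription, so it is left commented out):
# import subprocess
# subprocess.run(whisper_command("to_transcribe.aac", memory_gb=8), check=True)
```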

Enjoy the result. By the way, bat is a nice drop-in replacement for cat here:

cat to_transcribe.txt

On an i3-10105 (4 cores, 8 threads, 3.7 GHz) with 8 GB of RAM, it took 16 minutes to transcribe the video.

Do what you want with the text: summarize it, or whatever else you need.
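
If you hand the transcript to a summarizer with an input limit, you may need to split it first. A minimal sketch, assuming the transcript has one segment per line; the function name and the 4000-character default are arbitrary choices of mine, not something Whisper or any summarizer prescribes.

```python
from typing import List


def split_into_chunks(text: str, max_chars: int = 4000) -> List[str]:
    """Group whole lines into chunks of at most max_chars characters.

    A single line longer than max_chars becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # +1 accounts for the newline that joins lines inside a chunk
        added = len(line) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append("\n".join(current))
            current, current_len = [], 0
            added = len(line)
        current.append(line)
        current_len += added
    if current:
        chunks.append("\n".join(current))
    return chunks


# Usage:
# with open("to_transcribe.txt", encoding="utf-8") as f:
#     parts = split_into_chunks(f.read())
```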