YouTube Transcript Analyzer

Scan transcripts from a YouTube channel, load them into MDDB, then ask questions about the channel's content using Claude CLI.

YouTube → yt-dlp (transcripts) → MDDB → Claude CLI (MCP) → answers

Requirements

  • yt-dlp – download transcripts
  • MDDB – document database
  • Claude CLI – question interface
  • Python 3 (optional, for format conversion)

Step 1: Install yt-dlp

brew install yt-dlp      # macOS (Homebrew)
pip install yt-dlp       # any platform with Python
winget install yt-dlp    # Windows

Verify the installation:

yt-dlp --version

Step 2: Start MDDB

The simplest way is Docker:

docker run -d \
  --name mddb \
  -p 11023:11023 \
  -v mddb-data:/data \
  tradik/mddb:latest

Check that it's running:

curl http://localhost:11023/v1/health

Expected response: {"status":"ok"}


Step 3: Download transcripts from a YouTube channel

VPN recommended. YouTube may throttle or block your IP when you download many transcripts at once; a VPN helps avoid rate limiting and IP bans.

3a. Get the list of videos

Replace CHANNEL_URL with the channel address, e.g. https://www.youtube.com/@lexfridman.

yt-dlp --flat-playlist --print "%(id)s %(title)s" "CHANNEL_URL" > video_list.txt

Check how many videos were found:

wc -l video_list.txt
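The read loop in step 3b relies on each line of video_list.txt having the form `<video_id> <title>`, where the title may itself contain spaces. If you prefer Python (listed above as an optional tool), the same first-space split looks like this sketch:

```python
def parse_video_list(lines):
    """Parse lines of "<video_id> <title>" into (id, title) pairs.

    Splits on the FIRST space only, since titles contain spaces;
    this mirrors `IFS=' ' read -r video_id title` in the shell loop.
    """
    videos = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        video_id, _, title = line.partition(" ")
        videos.append((video_id, title))
    return videos

# Usage:
#   with open("video_list.txt", encoding="utf-8") as f:
#       videos = parse_video_list(f)
```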

3b. Download transcripts

Create a folder for transcripts:

mkdir -p transcripts

Download subtitles from each video:

while IFS=' ' read -r video_id title; do
  echo "Downloading: $title"
  yt-dlp \
    --write-subs \
    --write-auto-subs \
    --sub-lang "en" \
    --sub-format "vtt" \
    --skip-download \
    --output "transcripts/%(id)s" \
    "https://www.youtube.com/watch?v=$video_id" 2>/dev/null
done < video_list.txt

Options:

  • --sub-lang "en" – download English subtitles (change to "en,pl" for multiple languages)
  • --write-auto-subs – use auto-generated subs when manual ones are unavailable
  • --skip-download – don't download the video itself, only subtitles

3c. Convert VTT to plain text

VTT transcripts contain timestamps. Convert them to clean Markdown:

mkdir -p transcripts_md

for vtt_file in transcripts/*.vtt; do
  base=$(basename "$vtt_file" .vtt)
  # Remove VTT headers, timestamps, and HTML tags
  sed '/^WEBVTT/d; /^$/d; /^[0-9][0-9]:[0-9][0-9]/d; /-->/d; s/<[^>]*>//g' \
    "$vtt_file" | \
    awk '!seen[$0]++' > "transcripts_md/${base}.md"
  echo "Converted: $base"
done

Verify the result:

ls transcripts_md/ | head -5
cat transcripts_md/$(ls transcripts_md/ | head -1) | head -20
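If you'd rather not rely on sed/awk, the same cleanup can be done with Python (the optional tool from the requirements). This sketch mirrors the pipeline above: it drops the WEBVTT header, timestamp lines, inline cue tags, and duplicate lines that auto-generated subtitles repeat:

```python
import re
from pathlib import Path

TIMESTAMP = re.compile(r"^\d{2}:\d{2}")   # lines starting with a timestamp
TAG = re.compile(r"<[^>]*>")              # inline cue tags like <c> or <00:00:01.000>

def vtt_to_text(vtt: str) -> str:
    """Strip VTT headers, timestamps, tags, and repeated cue lines."""
    out, seen = [], set()
    for line in vtt.splitlines():
        line = TAG.sub("", line).strip()
        if (not line or line.startswith("WEBVTT")
                or "-->" in line or TIMESTAMP.match(line)):
            continue
        if line in seen:   # auto-subs repeat each cue; keep the first copy
            continue
        seen.add(line)
        out.append(line)
    return "\n".join(out)

def convert_folder(src="transcripts", dst="transcripts_md"):
    """Convert every .vtt in src to a .md file in dst."""
    Path(dst).mkdir(exist_ok=True)
    for vtt_file in Path(src).glob("*.vtt"):
        text = vtt_to_text(vtt_file.read_text(encoding="utf-8"))
        (Path(dst) / f"{vtt_file.stem}.md").write_text(text, encoding="utf-8")
```

Like the awk `!seen[$0]++` filter, deduplication here is global per file, which suits auto-generated subtitles that repeat each cue as it scrolls.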

Step 4: Load transcripts into MDDB

Option A: Use the load-md-folder.sh script

curl -O https://raw.githubusercontent.com/tradik/mddb/main/scripts/load-md-folder.sh
chmod +x load-md-folder.sh
./load-md-folder.sh transcripts_md/ youtube --lang en_US --verbose

Option B: Use mddb-cli

go install github.com/tradik/mddb/services/mddb-cli@latest

Load files one by one:

for md_file in transcripts_md/*.md; do
  key=$(basename "$md_file" .md)
  mddb-cli add youtube "$key" en -f "$md_file" \
    -m "source=youtube,type=transcript"
  echo "Loaded: $key"
done

Option C: Use the HTTP API directly

for md_file in transcripts_md/*.md; do
  curl -s -X POST http://localhost:11023/v1/upload \
    -F "file=@$md_file" \
    -F "collection=youtube" \
    -F "lang=en"
  echo " -> $(basename "$md_file")"
done
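The same upload can be scripted in pure Python with the standard library. This is a sketch under the assumption that /v1/upload accepts exactly the multipart fields the curl call sends (file, collection, lang); the multipart encoder is hand-rolled because urllib has no built-in one:

```python
import urllib.request
import uuid
from pathlib import Path

def build_multipart(fields, file_field, filename, file_bytes):
    """Encode form fields plus one file part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "", value]
    lines += [f"--{boundary}",
              f'Content-Disposition: form-data; name="{file_field}"; '
              f'filename="{filename}"',
              "Content-Type: text/markdown", ""]
    body = ("\r\n".join(lines).encode() + b"\r\n" + file_bytes
            + f"\r\n--{boundary}--\r\n".encode())
    return body, f"multipart/form-data; boundary={boundary}"

def upload_folder(folder="transcripts_md",
                  url="http://localhost:11023/v1/upload"):
    """POST every .md file in the folder, like the curl loop above."""
    for md_file in sorted(Path(folder).glob("*.md")):
        body, ctype = build_multipart(
            {"collection": "youtube", "lang": "en"},
            "file", md_file.name, md_file.read_bytes())
        req = urllib.request.Request(url, data=body,
                                     headers={"Content-Type": ctype})
        with urllib.request.urlopen(req) as resp:
            print(f" -> {md_file.name}: {resp.status}")
```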

Check how many documents were loaded:

curl -s http://localhost:11023/v1/stats | python3 -m json.tool

Step 5: Configure Claude CLI (MCP)

Claude CLI connects to MDDB via the MCP protocol. Create the config file:

mkdir -p ~/Library/Application\ Support/Claude
cat > ~/Library/Application\ Support/Claude/claude_desktop_config.json << 'EOF'
{
  "mcpServers": {
    "mddb": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "--network", "host",
        "-e", "MDDB_MCP_STDIO=true",
        "-e", "MDDB_SERVER=http://localhost:11023",
        "tradik/mddb:latest"
      ]
    }
  }
}
EOF
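If heredoc quoting is error-prone on your setup, the same config can be generated from Python. This mirrors the JSON above (the macOS path is Claude's default config location; adjust on other platforms):

```python
import json

# Same MCP server definition as the heredoc above.
config = {
    "mcpServers": {
        "mddb": {
            "command": "docker",
            "args": [
                "run", "-i", "--rm",
                "--network", "host",
                "-e", "MDDB_MCP_STDIO=true",
                "-e", "MDDB_SERVER=http://localhost:11023",
                "tradik/mddb:latest",
            ],
        }
    }
}

# Print, or write to the config path, e.g.:
#   path = Path.home() / "Library/Application Support/Claude/claude_desktop_config.json"
#   path.write_text(json.dumps(config, indent=2))
print(json.dumps(config, indent=2))
```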

If you have the mddbd binary locally:

cat > ~/Library/Application\ Support/Claude/claude_desktop_config.json << 'EOF'
{
  "mcpServers": {
    "mddb": {
      "command": "/usr/local/bin/mddbd",
      "env": {
        "MDDB_MCP_STDIO": "true",
        "MDDB_PATH": "/path/to/mddb.db"
      }
    }
  }
}
EOF

Step 6: Ask questions

Launch Claude CLI:

claude

Now you can ask about the channel's content:

> What topics were most frequently discussed on this channel?
> Summarize the 5 most interesting episodes.
> Was there an episode about artificial intelligence? What was said?
> Find episodes where the guest talked about quantum physics.
> Compare the opinions of different guests on the future of AI.

Claude will automatically use MCP tools (semantic_search, full_text_search, hybrid_search) to search through transcripts and provide answers based on the actual content.


Step 7 (optional): Enable vector search

Semantic search requires embeddings. Run Ollama locally:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull nomic-embed-text

Configure MDDB to use Ollama (if running via Docker):

docker stop mddb
docker rm mddb
docker run -d \
  --name mddb \
  -p 11023:11023 \
  -v mddb-data:/data \
  -e MDDB_EMBEDDING_PROVIDER=ollama \
  -e MDDB_EMBEDDING_API_URL=http://host.docker.internal:11434 \
  -e MDDB_EMBEDDING_MODEL=nomic-embed-text \
  -e MDDB_EMBEDDING_DIMENSIONS=768 \
  --add-host=host.docker.internal:host-gateway \
  tradik/mddb:latest

Reindex the collection:

curl -X POST "http://localhost:11023/v1/vector-reindex?collection=youtube"

Now semantic_search in Claude CLI will return semantically similar transcript fragments.


Summary

Step  Command                        What it does
1     brew install yt-dlp            Install download tool
2     docker run tradik/mddb         Start the database
3     yt-dlp --write-subs            Download transcripts
4     load-md-folder.sh              Load into MDDB
5     claude_desktop_config.json     Configure MCP
6     claude                         Ask questions
7     ollama pull nomic-embed-text   Add semantic search