YouTube Transcript Analyzer
Scan transcripts from a YouTube channel, load them into MDDB, then ask questions about the channel's content using Claude CLI.
YouTube → yt-dlp (transcripts) → MDDB → Claude CLI (MCP) → answers
Requirements
- yt-dlp – download transcripts
- MDDB – document database
- Claude CLI – question interface
- Python 3 (optional, for format conversion)
Step 1: Install yt-dlp
```shell
# macOS (Homebrew)
brew install yt-dlp

# any platform with Python
pip install yt-dlp

# Windows
winget install yt-dlp
```
Verify the installation:
```shell
yt-dlp --version
```
Step 2: Start MDDB
The simplest way is Docker:
```shell
docker run -d \
  --name mddb \
  -p 11023:11023 \
  -v mddb-data:/data \
  tradik/mddb:latest
```
Check that it's running:
```shell
curl http://localhost:11023/v1/health
```
Expected response: `{"status":"ok"}`
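In a script that starts the container and immediately loads data, the server may not be up yet when the first request arrives. A minimal readiness poll looks like this (`wait_for_health` is a helper name of my own, not part of MDDB):

```shell
# Poll a health URL until it answers, up to a timeout in seconds.
wait_for_health() {
  url=$1
  tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS "$url" > /dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "timed out waiting for $url" >&2
  return 1
}

# Usage:
# wait_for_health http://localhost:11023/v1/health
```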
Step 3: Download transcripts from a YouTube channel
VPN recommended: YouTube may throttle or block your IP when you download many transcripts in a row, so a VPN helps you avoid rate limiting and potential IP bans.
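Beyond a VPN, you can simply slow down and retry: yt-dlp has its own throttling flags (see `yt-dlp --help`, e.g. `--sleep-requests`), and a generic retry-with-backoff wrapper can absorb transient failures. The helper below and its delays are illustrative, not part of yt-dlp:

```shell
# Run a command; on failure, wait, double the delay, and retry
# until it succeeds or max attempts is reached.
retry_backoff() {
  max=$1; delay=$2; shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))      # exponential backoff
    attempt=$((attempt + 1))
  done
}

# Demo: a command that fails on the first call, succeeds on the second.
rm -f /tmp/retry_demo_marker
flaky() {
  if [ -f /tmp/retry_demo_marker ]; then
    echo "success"
  else
    touch /tmp/retry_demo_marker
    return 1
  fi
}
retry_backoff 3 1 flaky
```

In the download loop below you would wrap the `yt-dlp` invocation the same way: `retry_backoff 3 5 yt-dlp ...`.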
3a. Get the list of videos
Replace `CHANNEL_URL` with the channel address, e.g. `https://www.youtube.com/@lexfridman`.

```shell
yt-dlp --flat-playlist --print "%(id)s %(title)s" "CHANNEL_URL" > video_list.txt
```
Check how many videos were found:
```shell
wc -l video_list.txt
```
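Each line of `video_list.txt` has the form `<video_id> <title...>`. The loop in the next step relies on the fact that `read -r video_id title` splits only on the first space, so multi-word titles survive intact. A self-contained sketch (the IDs and titles are just sample data):

```shell
# Simulate two lines of video_list.txt and split each into id + title.
printf '%s\n' \
  "dQw4w9WgXcQ Never Gonna Give You Up" \
  "jNQXAC9IVRw Me at the zoo" |
while read -r video_id title; do
  echo "id=$video_id title=$title"
done
```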
3b. Download transcripts
Create a folder for transcripts:
```shell
mkdir -p transcripts
```
Download subtitles from each video:
```shell
while IFS=' ' read -r video_id title; do
  echo "Downloading: $title"
  yt-dlp \
    --write-subs \
    --write-auto-subs \
    --sub-lang "en" \
    --sub-format "vtt" \
    --skip-download \
    --output "transcripts/%(id)s" \
    "https://www.youtube.com/watch?v=$video_id" 2>/dev/null
done < video_list.txt
```
Options:
- `--sub-lang "en"` – download English subtitles (change to `"en,pl"` for multiple languages)
- `--write-auto-subs` – use auto-generated subs when manual ones are unavailable
- `--skip-download` – don't download the video itself, only the subtitles
3c. Convert VTT to plain text
VTT transcripts contain timestamps. Convert them to clean Markdown:
```shell
mkdir -p transcripts_md
for vtt_file in transcripts/*.vtt; do
  base=$(basename "$vtt_file" .vtt)
  # Remove VTT headers, timestamps, and HTML tags
  sed '/^WEBVTT/d; /^$/d; /^[0-9][0-9]:[0-9][0-9]/d; /-->/d; s/<[^>]*>//g' \
    "$vtt_file" | \
    awk '!seen[$0]++' > "transcripts_md/${base}.md"
  echo "Converted: $base"
done
```
Verify the result:
```shell
ls transcripts_md/ | head -5
cat transcripts_md/$(ls transcripts_md/ | head -1) | head -20
```
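To see what the cleanup pipeline actually does, you can exercise it on a small inline VTT sample; auto-generated captions often repeat lines, which is what the `awk '!seen[$0]++'` deduplication removes:

```shell
# A minimal VTT file with a header, timestamps, HTML tags, and a
# duplicated caption line.
cat > /tmp/sample.vtt << 'EOF'
WEBVTT

00:00:01.000 --> 00:00:03.000
Hello <b>world</b>

00:00:03.000 --> 00:00:05.000
Hello <b>world</b>

00:00:05.000 --> 00:00:07.000
Second line
EOF

# Same pipeline as above: strip header, blanks, timestamps, tags,
# then drop repeated lines.
sed '/^WEBVTT/d; /^$/d; /^[0-9][0-9]:[0-9][0-9]/d; /-->/d; s/<[^>]*>//g' \
  /tmp/sample.vtt | awk '!seen[$0]++'
# Prints:
# Hello world
# Second line
```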
Step 4: Load transcripts into MDDB
Option A: Use the load-md-folder.sh script
```shell
curl -O https://raw.githubusercontent.com/tradik/mddb/main/scripts/load-md-folder.sh
chmod +x load-md-folder.sh
./load-md-folder.sh transcripts_md/ youtube --lang en_US --verbose
```
Option B: Use mddb-cli
```shell
go install github.com/tradik/mddb/services/mddb-cli@latest
```
Load files one by one:
```shell
for md_file in transcripts_md/*.md; do
  key=$(basename "$md_file" .md)
  mddb-cli add youtube "$key" en -f "$md_file" \
    -m "source=youtube,type=transcript"
  echo "Loaded: $key"
done
```
Option C: Use the HTTP API directly
```shell
for md_file in transcripts_md/*.md; do
  curl -s -X POST http://localhost:11023/v1/upload \
    -F "file=@$md_file" \
    -F "collection=youtube" \
    -F "lang=en"
  echo " -> $(basename "$md_file")"
done
```
Check how many documents were loaded:
```shell
curl -s http://localhost:11023/v1/stats | python3 -m json.tool
```
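If you only want a single number out of the stats response rather than the whole pretty-printed document, a one-line Python filter works. The JSON shape below is a made-up example; adjust the keys to whatever your MDDB version actually returns:

```shell
# Extract one field from a JSON response instead of dumping it all.
# In practice you would pipe `curl -s .../v1/stats` into python3.
echo '{"collections":{"youtube":{"documents":42}}}' |
python3 -c 'import json,sys; d=json.load(sys.stdin); print(d["collections"]["youtube"]["documents"])'
# Prints: 42
```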
Step 5: Configure Claude CLI (MCP)
Claude CLI connects to MDDB via the MCP protocol. Create the config file:
```shell
mkdir -p ~/Library/Application\ Support/Claude

cat > ~/Library/Application\ Support/Claude/claude_desktop_config.json << 'EOF'
{
  "mcpServers": {
    "mddb": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "--network", "host",
        "-e", "MDDB_MCP_STDIO=true",
        "-e", "MDDB_SERVER=http://localhost:11023",
        "tradik/mddb:latest"
      ]
    }
  }
}
EOF
```
If you have the mddbd binary locally:
```shell
cat > ~/Library/Application\ Support/Claude/claude_desktop_config.json << 'EOF'
{
  "mcpServers": {
    "mddb": {
      "command": "/usr/local/bin/mddbd",
      "env": {
        "MDDB_MCP_STDIO": "true",
        "MDDB_PATH": "/path/to/mddb.db"
      }
    }
  }
}
EOF
```
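A typo in the JSON can stop Claude from loading the MCP server at all, so it is worth validating the file before restarting. A minimal sketch (`check_json` is a helper name of my own):

```shell
# Report whether a file parses as JSON, using only the stdlib tool.
check_json() {
  if python3 -m json.tool "$1" > /dev/null 2>&1; then
    echo "valid"
  else
    echo "invalid"
  fi
}

# Demo on a throwaway file:
echo '{"mcpServers": {}}' > /tmp/cfg_demo.json
check_json /tmp/cfg_demo.json
# Prints: valid

# Against the real config:
# check_json ~/Library/Application\ Support/Claude/claude_desktop_config.json
```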
Step 6: Ask questions
Launch Claude CLI:
```shell
claude
```
Now you can ask about the channel's content:
```
> What topics were most frequently discussed on this channel?
> Summarize the 5 most interesting episodes.
> Was there an episode about artificial intelligence? What was said?
> Find episodes where the guest talked about quantum physics.
> Compare the opinions of different guests on the future of AI.
```
Claude will automatically use MCP tools (semantic_search, full_text_search, hybrid_search) to search through transcripts and provide answers based on the actual content.
Step 7 (optional): Enable vector search
Semantic search requires embeddings. Run Ollama locally:
```shell
curl -fsSL https://ollama.com/install.sh | sh
ollama pull nomic-embed-text
```
Configure MDDB to use Ollama (if running via Docker):
```shell
docker stop mddb
docker rm mddb

docker run -d \
  --name mddb \
  -p 11023:11023 \
  -v mddb-data:/data \
  -e MDDB_EMBEDDING_PROVIDER=ollama \
  -e MDDB_EMBEDDING_API_URL=http://host.docker.internal:11434 \
  -e MDDB_EMBEDDING_MODEL=nomic-embed-text \
  -e MDDB_EMBEDDING_DIMENSIONS=768 \
  --add-host=host.docker.internal:host-gateway \
  tradik/mddb:latest
```
Reindex the collection:
```shell
curl -X POST "http://localhost:11023/v1/vector-reindex?collection=youtube"
```
Now semantic_search in Claude CLI will return semantically similar transcript fragments.
Summary
| Step | Command | What it does |
|---|---|---|
| 1 | `brew install yt-dlp` | Install download tool |
| 2 | `docker run tradik/mddb` | Start the database |
| 3 | `yt-dlp --write-subs` | Download transcripts |
| 4 | `load-md-folder.sh` | Load into MDDB |
| 5 | `claude_desktop_config.json` | Configure MCP |
| 6 | `claude` | Ask questions |
| 7 | `ollama pull nomic-embed-text` | Add semantic search |