YouTube Transcript Analyzer

Scan transcripts from a YouTube channel, load them into MDDB, then ask questions about the channel's content using Claude CLI.

YouTube → yt-dlp (transcripts) → MDDB → Claude CLI (MCP) → answers

Requirements

  • yt-dlp – download transcripts
  • MDDB – document database
  • Claude CLI – question interface
  • Python 3 (optional, for format conversion)

Step 1: Install yt-dlp

brew install yt-dlp      # macOS (Homebrew)
pip install yt-dlp       # any platform with Python
winget install yt-dlp    # Windows

Verify the installation:

yt-dlp --version

Step 2: Start MDDB

The simplest way is Docker:

docker run -d \
  --name mddb \
  -p 11023:11023 \
  -v mddb-data:/data \
  tradik/mddb:latest

Check that it's running:

curl http://localhost:11023/v1/health

Expected response: {"status":"ok"}


Step 3: Download transcripts from a YouTube channel

VPN recommended. YouTube may throttle or block your IP when you download many transcripts at once; a VPN helps avoid rate limiting and IP bans.

3a. Get the list of videos

Replace CHANNEL_URL with the channel address, e.g. https://www.youtube.com/@lexfridman.

yt-dlp --flat-playlist --print "%(id)s %(title)s" "CHANNEL_URL" > video_list.txt

Check how many videos were found:

wc -l video_list.txt
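The read loop in step 3b relies on each line of video_list.txt having the form `<video_id> <title>`, where the title may itself contain spaces. If you prefer Python (listed above as an optional tool), the same first-space split looks like this sketch:

```python
def parse_video_list(lines):
    """Parse lines of "<video_id> <title>" into (id, title) pairs.

    Splits on the FIRST space only, since titles contain spaces;
    this mirrors `IFS=' ' read -r video_id title` in the shell loop.
    """
    videos = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        video_id, _, title = line.partition(" ")
        videos.append((video_id, title))
    return videos

# Usage:
#   with open("video_list.txt", encoding="utf-8") as f:
#       videos = parse_video_list(f)
```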

3b. Download transcripts

Create a folder for transcripts:

mkdir -p transcripts

Download subtitles from each video:

while IFS=' ' read -r video_id title; do
  echo "Downloading: $title"
  yt-dlp \
    --write-subs \
    --write-auto-subs \
    --sub-lang "en" \
    --sub-format "vtt" \
    --skip-download \
    --output "transcripts/%(id)s" \
    "https://www.youtube.com/watch?v=$video_id" 2>/dev/null
done < video_list.txt

Options:

  • --sub-lang "en" – download English subtitles (change to "en,pl" for multiple languages)
  • --write-auto-subs – use auto-generated subs when manual ones are unavailable
  • --skip-download – don't download the video itself, only subtitles

3c. Convert VTT to plain text

VTT transcripts contain timestamps. Convert them to clean Markdown:

mkdir -p transcripts_md

for vtt_file in transcripts/*.vtt; do
  base=$(basename "$vtt_file" .vtt)
  # Remove VTT headers, timestamps, and HTML tags
  sed '/^WEBVTT/d; /^$/d; /^[0-9][0-9]:[0-9][0-9]/d; /-->/d; s/<[^>]*>//g' \
    "$vtt_file" | \
    awk '!seen[$0]++' > "transcripts_md/${base}.md"
  echo "Converted: $base"
done

Verify the result:

ls transcripts_md/ | head -5
cat transcripts_md/$(ls transcripts_md/ | head -1) | head -20
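If you'd rather not rely on sed/awk, the same cleanup can be done with Python (the optional tool from the requirements). This sketch mirrors the pipeline above: it drops the WEBVTT header, timestamp lines, inline cue tags, and duplicate lines that auto-generated subtitles repeat:

```python
import re
from pathlib import Path

TIMESTAMP = re.compile(r"^\d{2}:\d{2}")   # lines starting with a timestamp
TAG = re.compile(r"<[^>]*>")              # inline cue tags like <c> or <00:00:01.000>

def vtt_to_text(vtt: str) -> str:
    """Strip VTT headers, timestamps, tags, and repeated cue lines."""
    out, seen = [], set()
    for line in vtt.splitlines():
        line = TAG.sub("", line).strip()
        if (not line or line.startswith("WEBVTT")
                or "-->" in line or TIMESTAMP.match(line)):
            continue
        if line in seen:   # auto-subs repeat each cue; keep the first copy
            continue
        seen.add(line)
        out.append(line)
    return "\n".join(out)

def convert_folder(src="transcripts", dst="transcripts_md"):
    """Convert every .vtt in src to a .md file in dst."""
    Path(dst).mkdir(exist_ok=True)
    for vtt_file in Path(src).glob("*.vtt"):
        text = vtt_to_text(vtt_file.read_text(encoding="utf-8"))
        (Path(dst) / f"{vtt_file.stem}.md").write_text(text, encoding="utf-8")
```

Like the awk `!seen[$0]++` filter, deduplication here is global per file, which suits auto-generated subtitles that repeat each cue as it scrolls.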

Step 4: Load transcripts into MDDB

Option A: Use the load-md-folder.sh script

curl -O https://raw.githubusercontent.com/tradik/mddb/main/scripts/load-md-folder.sh
chmod +x load-md-folder.sh
./load-md-folder.sh transcripts_md/ youtube --lang en_US --verbose

Option B: Use mddb-cli

go install github.com/tradik/mddb/services/mddb-cli@latest

Load files one by one:

for md_file in transcripts_md/*.md; do
  key=$(basename "$md_file" .md)
  mddb-cli add youtube "$key" en -f "$md_file" \
    -m "source=youtube,type=transcript"
  echo "Loaded: $key"
done

Option C: Use the HTTP API directly

for md_file in transcripts_md/*.md; do
  curl -s -X POST http://localhost:11023/v1/upload \
    -F "file=@$md_file" \
    -F "collection=youtube" \
    -F "lang=en"
  echo " -> $(basename "$md_file")"
done
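The same upload can be scripted in pure Python with the standard library. This is a sketch under the assumption that /v1/upload accepts exactly the multipart fields the curl call sends (file, collection, lang); the multipart encoder is hand-rolled because urllib has no built-in one:

```python
import urllib.request
import uuid
from pathlib import Path

def build_multipart(fields, file_field, filename, file_bytes):
    """Encode form fields plus one file part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "", value]
    lines += [f"--{boundary}",
              f'Content-Disposition: form-data; name="{file_field}"; '
              f'filename="{filename}"',
              "Content-Type: text/markdown", ""]
    body = ("\r\n".join(lines).encode() + b"\r\n" + file_bytes
            + f"\r\n--{boundary}--\r\n".encode())
    return body, f"multipart/form-data; boundary={boundary}"

def upload_folder(folder="transcripts_md",
                  url="http://localhost:11023/v1/upload"):
    """POST every .md file in the folder, like the curl loop above."""
    for md_file in sorted(Path(folder).glob("*.md")):
        body, ctype = build_multipart(
            {"collection": "youtube", "lang": "en"},
            "file", md_file.name, md_file.read_bytes())
        req = urllib.request.Request(url, data=body,
                                     headers={"Content-Type": ctype})
        with urllib.request.urlopen(req) as resp:
            print(f" -> {md_file.name}: {resp.status}")
```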

Check how many documents were loaded:

curl -s http://localhost:11023/v1/stats | python3 -m json.tool

Step 5: Configure Claude CLI (MCP)

Claude CLI connects to MDDB via the MCP protocol. Create the config file:

mkdir -p ~/Library/Application\ Support/Claude
cat > ~/Library/Application\ Support/Claude/claude_desktop_config.json << 'EOF'
{
  "mcpServers": {
    "mddb": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "--network", "host",
        "-e", "MDDB_MCP_STDIO=true",
        "-e", "MDDB_SERVER=http://localhost:11023",
        "tradik/mddb:latest"
      ]
    }
  }
}
EOF
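If heredoc quoting is error-prone on your setup, the same config can be generated from Python. This mirrors the JSON above (the macOS path is Claude's default config location; adjust on other platforms):

```python
import json

# Same MCP server definition as the heredoc above.
config = {
    "mcpServers": {
        "mddb": {
            "command": "docker",
            "args": [
                "run", "-i", "--rm",
                "--network", "host",
                "-e", "MDDB_MCP_STDIO=true",
                "-e", "MDDB_SERVER=http://localhost:11023",
                "tradik/mddb:latest",
            ],
        }
    }
}

# Print, or write to the config path, e.g.:
#   path = Path.home() / "Library/Application Support/Claude/claude_desktop_config.json"
#   path.write_text(json.dumps(config, indent=2))
print(json.dumps(config, indent=2))
```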

If you have the mddbd binary locally:

cat > ~/Library/Application\ Support/Claude/claude_desktop_config.json << 'EOF'
{
  "mcpServers": {
    "mddb": {
      "command": "/usr/local/bin/mddbd",
      "env": {
        "MDDB_MCP_STDIO": "true",
        "MDDB_PATH": "/path/to/mddb.db"
      }
    }
  }
}
EOF

Step 6: Ask questions

Launch Claude CLI:

claude

Now you can ask about the channel's content:

> What topics were most frequently discussed on this channel?
> Summarize the 5 most interesting episodes.
> Was there an episode about artificial intelligence? What was said?
> Find episodes where the guest talked about quantum physics.
> Compare the opinions of different guests on the future of AI.

Claude will automatically use MCP tools (semantic_search, full_text_search, hybrid_search) to search through transcripts and provide answers based on the actual content.


Step 7 (optional): Enable vector search

Semantic search requires embeddings. Run Ollama locally:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull nomic-embed-text

Configure MDDB to use Ollama (if running via Docker):

docker stop mddb
docker rm mddb
docker run -d \
  --name mddb \
  -p 11023:11023 \
  -v mddb-data:/data \
  -e MDDB_EMBEDDING_PROVIDER=ollama \
  -e MDDB_EMBEDDING_API_URL=http://host.docker.internal:11434 \
  -e MDDB_EMBEDDING_MODEL=nomic-embed-text \
  -e MDDB_EMBEDDING_DIMENSIONS=768 \
  --add-host=host.docker.internal:host-gateway \
  tradik/mddb:latest

Reindex the collection:

curl -X POST "http://localhost:11023/v1/vector-reindex?collection=youtube"

Now semantic_search in Claude CLI will return semantically similar transcript fragments.


Summary

Step  Command                        What it does
1     brew install yt-dlp            Install download tool
2     docker run tradik/mddb         Start the database
3     yt-dlp --write-subs            Download transcripts
4     load-md-folder.sh              Load into MDDB
5     claude_desktop_config.json     Configure MCP
6     claude                         Ask questions
7     ollama pull nomic-embed-text   Add semantic search