# Episode Transcription Script

This script transcribes video episodes with speaker diarization and infers speaker names using AI.

## Features

- ✅ Transcribes all `.mp4`, `.mkv`, `.avi`, `.mov`, `.webm` files in the `episodes/` folder
- ✅ Speaker diarization (identifies who spoke when)
- ✅ AI-powered speaker naming based on context
- ✅ Smart merging of non-word utterances (sounds, modal particles)
- ✅ Progress tracking - resume from where you left off
- ✅ Output format: `[mm:ss](SpeakerName) line content`

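The "smart merging" feature above can be pictured with a sketch like the following. The helper names and the particle list are illustrative assumptions, not the script's actual code: utterances containing no real words are folded into the previous utterance instead of standing alone.

```python
import re

# Hypothetical particle list - the real script may use a different heuristic.
NON_WORDS = {"hmm", "uh", "um", "ah", "oh", "mm", "haha"}

def is_non_word(text: str) -> bool:
    """True if the utterance consists only of sounds/modal particles."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return bool(tokens) and all(t in NON_WORDS for t in tokens)

def merge_non_word_utterances(utterances):
    """Fold non-word utterances into the preceding utterance.

    Each utterance is a dict like {'speaker': str, 'text': str, 'start_ms': int}.
    """
    merged = []
    for utt in utterances:
        if merged and is_non_word(utt["text"]):
            merged[-1]["text"] += " " + utt["text"]
        else:
            merged.append(dict(utt))
    return merged
```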
## Setup

### 1. Install uv (if not already installed)

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

### 2. Set API Keys

```bash
# Required: AssemblyAI API key (free tier: 100 hours/month)
export ASSEMBLYAI_API_KEY="your-assemblyai-key"

# Required: OpenAI/Kimi API key
export OPENAI_API_KEY="your-kimi-key"

# Optional: if using Kimi (already set as the default in the script)
export OPENAI_BASE_URL="https://api.moonshot.cn/v1"
```

Get your API keys:

- AssemblyAI: https://www.assemblyai.com/ (free tier available)
- Kimi: https://platform.moonshot.cn/

## Usage

### Run with uv (recommended)

```bash
# This will automatically install dependencies and run the script
uv run transcribe_episodes.py
```

### Or sync dependencies first, then run

```bash
# Install dependencies (creates .venv automatically)
uv sync

# Run the script
uv run python transcribe_episodes.py
```

### Check progress

```bash
uv run transcribe_episodes.py status
```

### Reset and re-process

```bash
# Reset everything (re-processes all files)
uv run transcribe_episodes.py reset

# Reset a specific file only
uv run transcribe_episodes.py reset S02E02.mp4
```

## Output

Transcripts are saved to the `transcripts/` folder as `.txt` files:

```
transcripts/
├── S02E01.txt
└── S02E02.txt
```

Example content:

```
[00:12](Malabar) Hello everyone, welcome back!
[00:15](Sun) Nice to see you all again.
[00:18](Jupiter) Yeah, let's get started.
```

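The `[mm:ss](SpeakerName)` lines above can be produced from millisecond timestamps with a small formatter. This is an illustrative helper, not necessarily the script's own code:

```python
def format_line(start_ms: int, speaker: str, text: str) -> str:
    """Render one transcript line as `[mm:ss](SpeakerName) text`."""
    total_seconds = start_ms // 1000
    minutes, seconds = divmod(total_seconds, 60)
    return f"[{minutes:02d}:{seconds:02d}]({speaker}) {text}"
```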
## Progress Tracking

The script creates `.transcription_progress.json` to track the state of each file:

- `completed` - successfully processed
- `error` - failed (check the error message)
- `transcribing` - in progress (transcription)
- `naming` - in progress (speaker naming)

If interrupted, simply re-run the script - it will skip completed files.

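The resume logic can be sketched as follows. The JSON shape shown in the comment is an assumption for illustration; the real script may store additional fields:

```python
import json
from pathlib import Path

# Assumed shape of .transcription_progress.json (illustrative):
#   {"S02E01.mp4": {"status": "completed"},
#    "S02E02.mp4": {"status": "error"}}

def files_to_process(progress_path: Path, episodes: list[str]) -> list[str]:
    """Return the episodes not yet marked completed in the progress file."""
    progress = {}
    if progress_path.exists():
        progress = json.loads(progress_path.read_text())
    return [f for f in episodes if progress.get(f, {}).get("status") != "completed"]
```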
## How Speaker Naming Works

1. Transcribe with AssemblyAI to get speaker labels (A, B, C...)
2. Sample utterances from each speaker
3. Send the samples to an LLM (Kimi) with context about the characters: Malabar, Sun, Jupiter, Kangarro, Mole
4. The LLM infers which speaker is which character based on speaking style and content
5. Apply the inferred names to the output

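Step 5 above can be sketched as a simple label substitution. The mapping shown in the test is made up; in the script it would come from the LLM's answer:

```python
def apply_speaker_names(utterances, name_map):
    """Replace diarization labels (A, B, C...) with inferred character names.

    Labels the LLM could not identify fall back to the raw label,
    so no line is ever dropped.
    """
    return [
        {**utt, "speaker": name_map.get(utt["speaker"], utt["speaker"])}
        for utt in utterances
    ]
```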
## Troubleshooting

**AssemblyAI upload fails:**

- Check your API key
- Check your internet connection
- Video files might be too large for the free tier

**Speaker naming is wrong:**

- The LLM makes educated guesses based on context
- You can manually edit the output files if needed
- Consider providing more context about each character's personality

**Progress lost:**

- Don't delete `.transcription_progress.json`
- It tracks which files are done to avoid re-processing

## Development

```bash
# Add new dependencies
uv add <package-name>

# Add dev dependencies
uv add --dev <package-name>

# Update the lock file
uv lock
```