I've found the YT transcripts to be severely lacking sometimes, in accuracy and ...

paulirish · 2025-07-22T15:54:54 1753199694

An approach I've been using recently is to rely on pyannote/tinydiarize only for the speaker_turn timestamps, but prefer the larger model (or in this case YT's autotranscript) for the actual text.
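A minimal sketch of that merge step, assuming the diarizer output has already been reduced to a sorted list of speaker-turn timestamps in seconds and the better transcript to `(start_seconds, text)` pairs. The function name `label_segments` and both input shapes are hypothetical, not the output format of pyannote, tinydiarize, or YouTube:

```python
from bisect import bisect_right

def label_segments(turn_starts, segments):
    """Group transcript segments by speaker turn.

    turn_starts: sorted turn-boundary timestamps in seconds (assumed input shape).
    segments: list of (start_seconds, text) from the preferred transcript.
    Returns a list of (turn_index, joined_text); turn_index 0 means the
    segment starts before the first boundary.
    """
    merged = []
    for start, text in segments:
        # Count how many turn boundaries fall at or before this segment.
        turn = bisect_right(turn_starts, start)
        if merged and merged[-1][0] == turn:
            # Same turn as the previous segment: append the text.
            merged[-1] = (turn, merged[-1][1] + " " + text)
        else:
            merged.append((turn, text))
    return merged

turns = [0.0, 5.0, 9.5]
segments = [(0.2, "hi there"), (2.1, "how are you"), (5.3, "fine"),
            (9.5, "good"), (11.0, "great")]
result = label_segments(turns, segments)
```

This only marks where the speaker changes. It does not say who is speaking, which is all the turn timestamps alone can give you.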

ldenoue · 2025-07-22T20:13:44 1753215224

Check out https://ldenoue.github.io/readabletranscripts/ and the website https://www.appblit.com/scribe, which use Gemini to post-correct the raw transcripts.

meerab · 2025-07-23T23:21:14 1753312874

YT transcripts definitely lack speaker ID. LLMs can infer speakers from context but miss nuance without proper speaker recognition.

I have been tackling this while building VideoToBe.com. My current pipeline is: Download Video -> Whisper Transcription with diarization -> Replace speaker tags with AI-generated speaker ID + human fallback.
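A rough sketch of the last step, assuming the diarized transcript uses tags of the form `SPEAKER_00` and that the name mapping comes from somewhere else (an LLM guess or a person). The tag format, the function name `replace_speaker_tags`, and the mapping are illustrative assumptions, not VideoToBe's actual implementation:

```python
import re

# Assumed tag format for diarized output, e.g. "SPEAKER_00".
TAG = re.compile(r"SPEAKER_\d+")

def replace_speaker_tags(transcript, names):
    """Swap generic speaker tags for names where a name is known.

    names: dict mapping a tag to a display name (hypothetical source:
    an LLM guess or a manual entry).
    Returns (new_transcript, unresolved_tags); unresolved tags are left
    in place so a human can fill them in later.
    """
    unresolved = set()

    def swap(match):
        tag = match.group(0)
        if tag in names:
            return names[tag]
        unresolved.add(tag)
        return tag

    return TAG.sub(swap, transcript), sorted(unresolved)

text = "SPEAKER_00: Welcome.\nSPEAKER_01: Thanks.\nSPEAKER_00: Let's start."
labeled, todo = replace_speaker_tags(text, {"SPEAKER_00": "Host"})
```

Returning the unresolved tags separately is one way to drive the human fallback: anything still in `todo` gets queued for manual labeling.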

Reliable ML speaker identification is still surprisingly hard. For podcast summarization, speaker ID is a game-changer vs basic YT transcripts.

stanleykm · 2025-07-22T15:39:34 1753198774

I’ve had some success with running them through another LLM to have it clean up the transcription errors based on the context. But this obviously does nothing for speaker identification.