I've found the YT transcripts to be severely lacking sometimes, in accuracy and features. Especially speaker identification is really useful if you want to e.g. summarize podcasts or interviews, so if this project here delivers on that then it's definitely better than the YT transcripts.
An approach I've been using recently is to rely on pyannote/tinydiarize only for the speaker_turn timestamps, but prefer the larger model (or in this case YT's autotranscript) for the actual text.
YT transcripts definitely lack speaker ID. LLMs can infer speakers from context but miss nuance without proper speaker recognition.
I have been tackling this while building VideoToBe.com.
My current pipeline is Download Video -> Whisper Transcription with diarization -> Replace speaker tags with AI generated speaker ID + human fallback.
Reliable ML speaker identification is still surprisingly hard.
For podcast summarization, speaker ID is a game-changer vs basic YT transcripts.
I’ve had some success with running them through another LLM to have it clean up the transcription errors based on the context. But this obviously does nothing for speaker identitication.