Papers · 1 month ago

ByteDance SwanVoice: 1-4 speaker dialogue TTS that improves dialogue coherence while preserving monologue quality

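ByteDance has introduced SwanVoice, a zero-shot TTS system that synthesizes dialogue for 1 to 4 speakers with a single model. It adds speaker-turn conditioning to a 25Hz VAE and a flow-matching DiT, trains sequentially from monologue to dialogue, and is then aligned with DiffusionNFT. On SwanBench-Speech, it scored higher on richness and hierarchy than open-source systems for both monologue and dialogue. Content accuracy, however, remains a limitation.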


Key takeaways

Task: zero-shot dialogue TTS for 1 to 4 speakers, with monologue and dialogue both handled by a single model.
Benchmark: on SwanBench-Speech, richness and hierarchy scores are ahead of open-source systems; content accuracy is the weak point.

Method

Data: SwanData-Speech builds monologue and dialogue corpora from in-the-wild audio, with pause-aware alignment from Swan Forced Aligner and RobustMegaTTS3 used to handle difficult pronunciations.
Model: 25Hz VAE + raw-text conditioning (pause symbols, pinyin substitution) + flow-matching DiT, with speaker-turn conditioning added.
Training: proceeds sequentially through monologue → mixed → real dialogue, followed by DiffusionNFT alignment with phone-level and speaker-similarity rewards.
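To make the model description above more concrete, here is a minimal sketch of two of its ingredients: expanding speaker turns into per-frame labels at the 25 Hz latent rate, and building a generic linear-path flow-matching training pair. Everything beyond the 25 Hz rate is an assumption for illustration: the `(speaker_id, start_sec, end_sec)` turn format, the function names, and the linear interpolation path are hypothetical and are not confirmed to match how SwanVoice actually encodes speaker turns or parameterizes its flow.

```python
import math
import random
from typing import List, Tuple

FRAME_RATE_HZ = 25  # latent frame rate stated in the summary

# Hypothetical turn format (not from the paper): (speaker_id, start_sec, end_sec)
Turn = Tuple[int, float, float]


def turns_to_frame_speakers(
    turns: List[Turn], total_sec: float, silence_id: int = 0
) -> List[int]:
    """Expand turn-level speaker labels into one speaker id per latent frame."""
    n_frames = math.ceil(total_sec * FRAME_RATE_HZ)
    frames = [silence_id] * n_frames
    for speaker_id, start, end in turns:
        first = max(0, int(round(start * FRAME_RATE_HZ)))
        last = min(n_frames, int(round(end * FRAME_RATE_HZ)))
        for i in range(first, last):
            frames[i] = speaker_id
    return frames


def flow_matching_pair(
    x1: List[float], t: float, rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Generic linear-path flow matching (assumed, not the paper's exact form).

    Samples noise x0, returns the interpolated point
    x_t = (1 - t) * x0 + t * x1 and the regression target x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity
```

In a setup like this, the per-frame speaker ids would be embedded and fed to the DiT alongside the text conditioning, while the network is trained to predict `velocity` from `x_t` and `t`. Whether SwanVoice conditions at frame level or at token level is not stated in the summary.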

Limitations and conditions

Accuracy: content accuracy is the main bottleneck; pronunciation errors or dropped words can occur.
Release: a demo is public, but whether the code and dataset will be released is undecided.

Editor's one-liner

Preserving monologue quality in a dialogue TTS model is a highly practical approach, but improving content accuracy looks like the follow-up work.

#tts
#dialogue
#zero-shot
#bytedance
#flow-matching

ByteDance

