Few-Shot Accent Synthesis for ASR
with LLM-Guided Phoneme Editing

Yurii Halychanskyi1,2, Nimet Beyza Bozdag1, Mark Hasegawa-Johnson1, Dilek Hakkani-Tür1, Volodymyr Kindratenko1,2

1 University of Illinois Urbana-Champaign, Urbana, IL, USA 2 National Center for Supercomputing Applications, Urbana, IL, USA

yuriih2@illinois.edu

Abstract

Automatic speech recognition (ASR) performance often degrades on accented speech due to the limited availability of accented training data. Prior work has explored accent modeling in low-resource settings, but existing approaches typically require minutes to hours of labeled speech, which may still be impractical for truly scarce accents. We propose a pipeline that adapts a text-to-speech (TTS) decoder to a target-accent speaker using fewer than ten reference utterances, and employs large language model (LLM)-based phoneme editing to generate accent-conditioned pronunciations. The resulting synthetic speech is used to fine-tune a self-supervised ASR model. Experiments demonstrate consistent word error rate (WER) reductions on real accented speech, including cross-speaker evaluation and ultra-low data regimes. A matched-rate random phoneme baseline shows that phoneme-space perturbation is itself a strong form of augmentation, while LLM-guided edits provide additional gains through accent-conditioned structure.

System Conditions

The audio samples below correspond to the conditions reported in Table 1 of the paper.

American TTS

Backbone TTS with a Standard American speaker — no adaptation, no phoneme editing.

Adapt-only

TTS decoder adapted to the target-accent speaker from fewer than ten reference utterances; phoneme sequence unchanged.

Adapt + LLM

Adapted decoder with phonemes edited by an LLM toward the target accent — the proposed full system.

Adapt + Random

Adapted decoder with random, matched-rate phoneme perturbations — baseline isolating the augmentation effect.

Adapt + GT (oracle)

Adapted decoder with phonemes taken from the ground-truth perceived phoneme labels in L2-ARCTIC.

Real accent

Original recordings from the target-accent speaker (TNI for Indian English, HKK for Korean English).
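To make the Adapt + Random baseline concrete, the sketch below substitutes a fixed fraction of phonemes with random alternatives, matching the per-utterance edit rate of the LLM-guided condition while carrying no accent-conditioned structure. The phoneme inventory, function name, and edit rate are illustrative assumptions, not the paper's exact implementation.

```python
import random

# Illustrative ARPAbet-style phoneme subset (assumption, not the
# paper's actual inventory).
PHONEME_INVENTORY = [
    "AA", "AE", "AH", "AO", "EH", "ER", "IH", "IY", "UH", "UW",
    "B", "D", "G", "K", "P", "T", "S", "Z", "SH", "CH", "JH",
    "F", "V", "TH", "DH", "M", "N", "NG", "L", "R", "W", "Y", "HH",
]

def random_matched_rate_edit(phonemes, edit_rate, rng=None):
    """Substitute a fraction `edit_rate` of phonemes with random
    alternatives, matching the edit rate of the LLM-guided condition
    without any accent-conditioned structure."""
    rng = rng or random.Random()
    edited = list(phonemes)
    n_edits = round(edit_rate * len(edited))
    for i in rng.sample(range(len(edited)), n_edits):
        # Pick any inventory phoneme other than the original one.
        choices = [p for p in PHONEME_INVENTORY if p != edited[i]]
        edited[i] = rng.choice(choices)
    return edited

seq = ["DH", "AH", "K", "AE", "T"]
out = random_matched_rate_edit(seq, edit_rate=0.4, rng=random.Random(0))
```

Because the edit count is derived from the same rate as the LLM condition, any WER difference between the two conditions can be attributed to the structure of the edits rather than their quantity.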

Audio Samples

All conditions use the same six L2-ARCTIC prompts. Synthetic audio is produced by the same pipeline used to generate the training corpus reported in the paper.

Indian English

Columns: Transcript · American TTS · Adapt-only · Adapt + LLM · Adapt + Random · Adapt + GT · Real accent (audio players omitted in this text version)

Korean English

Columns: Transcript · American TTS · Adapt-only · Adapt + LLM · Adapt + Random · Adapt + GT · Real accent (audio players omitted in this text version)

LLM Prompt

Below is the full prompt used for the Indian English condition, including the in-context American → Indian English example pairs that are rendered into the prompt at inference time. The Korean condition uses the same template with the accent name swapped and a Korean-specific set of demonstrations.
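As a rough sketch of how such a template might be rendered with in-context demonstrations, the snippet below assembles a prompt from American → target-accent phoneme pairs. The instruction wording, example pairs, and function names are hypothetical placeholders, not the actual prompt text used in the paper.

```python
# Illustrative American -> target-accent phoneme example pairs
# (invented for demonstration; not the paper's demonstrations).
EXAMPLES = [
    ("DH IH S IH Z AH T EH S T", "D IH S IH S AH T EH S T"),
    ("V EH R IY G UH D", "W EH R IY G UH D"),
]

def render_prompt(accent, phonemes, examples=EXAMPLES):
    """Fill a simple few-shot template: instruction, demonstration
    pairs, then the query sequence awaiting the edited output."""
    lines = [
        f"Rewrite the American English phoneme sequence so it reflects "
        f"a typical {accent} pronunciation. Output phonemes only.",
        "",
    ]
    for src, tgt in examples:
        lines.append(f"American: {src}")
        lines.append(f"{accent}: {tgt}")
        lines.append("")
    lines.append(f"American: {phonemes}")
    lines.append(f"{accent}:")  # the LLM completes this line
    return "\n".join(lines)

prompt = render_prompt("Indian English", "K AE T")
```

Swapping the accent name and the demonstration set yields the Korean English variant, mirroring how the page describes the two conditions sharing one template.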
