¹ University of Illinois Urbana-Champaign, Urbana, IL, USA
² National Center for Supercomputing Applications, Urbana, IL, USA
Accented automatic speech recognition (ASR) often degrades due to the limited availability of accented training data. Prior work has explored accent modeling in low-resource settings, but existing approaches typically require minutes to hours of labeled speech, which may still be impractical for truly scarce accent scenarios. We propose a pipeline that adapts a text-to-speech (TTS) decoder to a target-accent speaker using fewer than ten reference utterances, and employs large language model (LLM)-based phoneme editing to generate accent-conditioned pronunciations. The resulting synthetic speech is used to fine-tune a self-supervised ASR model. Experiments demonstrate consistent word error rate (WER) reductions on real accented speech, including cross-speaker evaluation and ultra-low data regimes. A matched-rate random phoneme baseline shows that phoneme-space perturbation itself is a strong form of augmentation, while LLM-guided edits provide additional gains through accent-conditioned structure.
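For readers less familiar with the metric, WER is the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below is a minimal illustration under simple assumptions (lowercasing and whitespace tokenization); the function name `word_error_rate` is hypothetical, and the paper's scoring setup may apply different text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    # Assumption: lowercase + whitespace split is the only normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted against the reference length, WER can exceed 1.0 for very poor hypotheses.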
The audio samples below correspond to the conditions reported in Table 1 of the paper.
American TTS: backbone TTS with a Standard American speaker; no adaptation and no phoneme editing.
Adapt-only: TTS decoder adapted to the target-accent speaker from fewer than ten reference utterances; phoneme sequence unchanged.
Adapt + LLM: adapted decoder with phonemes edited by an LLM toward the target accent (the proposed full system).
Adapt + Random: adapted decoder with random, matched-rate phoneme perturbations (a baseline isolating the augmentation effect).
Adapt + GT: adapted decoder with phonemes taken from the L2-ARCTIC ground-truth perceived canonical labels.
Real accent: original recordings from the target-accent speaker (TNI for Indian English, HKK for Korean English).
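To make the matched-rate random baseline more concrete, the sketch below perturbs a phoneme sequence at a given edit rate, which could be set to the rate measured on the LLM-edited sequences. It is an illustration under stated assumptions, not the paper's implementation: it assumes an ARPAbet inventory without stress markers, substitution-only edits, and per-utterance rate matching. The names `random_perturb` and `substitution_rate` are hypothetical.

```python
import random
from typing import List, Sequence

# ARPAbet inventory (39 phones, stress markers omitted). Assumed phone set.
ARPABET = [
    "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH",
    "EH", "ER", "EY", "F", "G", "HH", "IH", "IY", "JH", "K",
    "L", "M", "N", "NG", "OW", "OY", "P", "R", "S", "SH",
    "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH",
]


def substitution_rate(original: Sequence[str], edited: Sequence[str]) -> float:
    """Fraction of positions that differ (assumes equal-length sequences)."""
    if len(original) != len(edited) or not original:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a != b for a, b in zip(original, edited)) / len(original)


def random_perturb(phonemes: Sequence[str], edit_rate: float,
                   rng: random.Random) -> List[str]:
    """Substitute round(edit_rate * len) randomly chosen positions with random phones."""
    if not 0.0 <= edit_rate <= 1.0:
        raise ValueError("edit_rate must be in [0, 1]")
    n_edits = round(edit_rate * len(phonemes))
    out = list(phonemes)
    for pos in rng.sample(range(len(out)), n_edits):
        # Exclude the current phone so every selected position really changes.
        out[pos] = rng.choice([p for p in ARPABET if p != out[pos]])
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    source = "HH AH L OW W ER L D".split()
    perturbed = random_perturb(source, edit_rate=0.25, rng=rng)
    print(" ".join(perturbed), substitution_rate(source, perturbed))
```

Keeping the edit rate equal across the random and LLM conditions is what lets the comparison separate the effect of perturbation itself from the effect of accent-conditioned structure.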
All conditions use the same six L2-ARCTIC prompts. Synthetic audio is produced by the same pipeline used to generate the training corpus reported in the paper.
| Transcript | American TTS | Adapt-only | Adapt + LLM | Adapt + Random | Adapt + GT | Real accent |
|---|---|---|---|---|---|---|

| Transcript | American TTS | Adapt-only | Adapt + LLM | Adapt + Random | Adapt + GT | Real accent |
|---|---|---|---|---|---|---|
Below is the full prompt used for the Indian English condition, including the in-context American → Indian English example pairs that are rendered into the prompt at inference time. The Korean condition uses the same template with the accent name swapped and a Korean-specific set of demonstrations.
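As a rough illustration of how a single template with a swappable accent name and demonstration set can be rendered at inference time, the sketch below assembles a few-shot prompt from (American, target-accent) phoneme pairs. Everything in it is hypothetical: the instruction wording, the `render_prompt` helper, and the example pair are placeholders for illustration and are not the prompt or demonstrations used in the paper.

```python
from typing import List, Tuple


def render_prompt(accent: str, examples: List[Tuple[str, str]], target: str) -> str:
    """Build a few-shot phoneme-editing prompt (hypothetical wording)."""
    lines = [
        f"Rewrite the American English phoneme sequence as a {accent} "
        "speaker would pronounce it.",
        "Return only the edited phoneme sequence.",
        "",
    ]
    # In-context demonstrations: one source/target pair per block.
    for source, edited in examples:
        lines.append(f"American: {source}")
        lines.append(f"{accent}: {edited}")
        lines.append("")
    # The query sequence; the trailing label is left open for the model to complete.
    lines.append(f"American: {target}")
    lines.append(f"{accent}:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder demonstration pair, invented for this sketch only.
    demos = [("TH IH NG K", "T IH NG K")]
    print(render_prompt("Indian English", demos, "TH AE NG K"))
```

Under this design, switching to the Korean condition only requires passing a different accent name and a Korean-specific list of demonstrations to the same function.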