interspeech25_speechtherapy

Towards Temporally Explainable Speech Clarity Assessment for Dysarthric Speech

Authors

Seohyun Park*, Chitralekha Gupta*, Michelle Kah Yian Kwan, Xinhui Fung, Alexander Wenjun Yip, Suranga Nanayakkara
*Equal contributors

Accepted at Interspeech 2025

GitHub Repo | PDF

Abstract

Dysarthria, a motor speech disorder, affects intelligibility and requires targeted interventions for effective communication.
In this work, we investigate automated mispronunciation feedback by collecting a dysarthric speech dataset from six speakers reading two passages, annotated by a speech therapist with temporal markers and mispronunciation descriptions. We design a three-stage framework for explainable mispronunciation evaluation: (1) overall clarity scoring, (2) mispronunciation localization, and (3) mispronunciation type classification. We systematically analyze pretrained Automatic Speech Recognition (ASR) models in each stage, assessing their effectiveness in dysarthric speech evaluation. Our findings offer clinically relevant insights for automating actionable feedback for pronunciation assessment, which could enable independent practice for patients and help therapists deliver more effective interventions.

Framework for Speech Clarity Assessment

Supplementary Material for the paper

Table of Contents

  1. Examples from our dataset
  2. Mispronunciation Classification
  3. Examples for Word/Phoneme Substitution, Deletion, Insertion from ASR
  4. Prompting Strategy to Convert Therapist’s Descriptive Error Labels to Mispronunciation Classes

1. Examples from our dataset

Each example consists of a screenshot of the therapist’s annotation file for a particular mispronunciation label in an audio file, together with the corresponding audio snippet. The error as localized and classified by an ASR model is also shown.

Example 1 Example 2

2. (Section 4.5) Mispronunciation Classification

2.1 ASR-wise Class Confusion Matrices

Word-level Confusion Matrix

Phone-level Confusion Matrix

2.2 Exact Error Matches

Error Match 1 Error Match 2

3. (Section 3) Examples for Word/Phoneme Substitution, Deletion, Insertion from ASR

Rules

Examples

1. Substitution-Word (Absolute Distance > 3)

2. Substitution-Word (Relative Distance > 0.6)

3. Substitution-Phoneme

4. Insertion-Phoneme

5. Deletion-Phoneme
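
The rule figure is not reproduced in this text, so the following is only a minimal sketch of how the five example categories above might be derived from a reference word and the ASR-recognized word. It rests on several assumptions that are not confirmed by the source: that "distance" is the character-level Levenshtein distance, that relative distance is this value divided by the length of the longer word, that exceeding either threshold (3 absolute, 0.6 relative) marks a word-level substitution, and that smaller edits are labelled at the phoneme level according to whether the recognized word is longer, shorter, or the same length. The function names `levenshtein` and `classify_word_pair` are hypothetical, and the paper's actual rules may operate on phoneme sequences rather than characters.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def classify_word_pair(ref: str, hyp: str,
                       abs_threshold: int = 3,
                       rel_threshold: float = 0.6) -> str:
    """Label the difference between a reference word and an ASR hypothesis.

    Thresholds mirror the headings above; how they are applied is an assumption.
    """
    if ref == hyp:
        return "Match"
    dist = levenshtein(ref, hyp)
    rel = dist / max(len(ref), len(hyp))
    # Assumption: a large absolute OR relative distance means a different word.
    if dist > abs_threshold or rel > rel_threshold:
        return "Substitution-Word"
    # Assumption: small edits are phoneme-level, typed by the length change.
    if len(hyp) > len(ref):
        return "Insertion-Phoneme"
    if len(hyp) < len(ref):
        return "Deletion-Phoneme"
    return "Substitution-Phoneme"


if __name__ == "__main__":
    # Word pairs taken from the therapist-label examples in Section 4 below.
    for ref, hyp in [("look", "took"), ("raindrops", "raindops"),
                     ("prism", "prisms"), ("horizon", "gold")]:
        print(ref, hyp, classify_word_pair(ref, hyp))
```

Under these assumptions, "look"/"took" is labelled Substitution-Phoneme, "raindrops"/"raindops" Deletion-Phoneme, "prism"/"prisms" Insertion-Phoneme, and "horizon"/"gold" Substitution-Word. A phoneme-level variant would apply the same logic to phoneme sequences instead of character strings.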


4. (Section 2.1) Prompting Strategy to Convert Therapist’s Descriptive Error Labels to Mispronunciation Classes

prompt = f"""
You are a world-class Speech-Language Pathologist specializing in speech disorder classification and ASR (Automatic Speech Recognition) error analysis. Your task is to categorize the given speech error label into one of the predefined categories with high accuracy.

### **Strict Categorization Guidelines**
- **Follow the categorization rules precisely. No assumptions or reinterpretations.**
- **If a label contains multiple error types, select the most specific category based on hierarchy.**
- **Do not include explanations, reasoning, or any additional text. Return only the category name.**

---

### **Categorization Rules (Hierarchy-Based)**
1. **Substitution Errors**  
   - A phoneme or word is replaced with another phoneme or word.  
   - MUST be used if "phonemic sub", "word sub", or "transposition" appears in the label.  
   - **Examples:**  
     - "look, took, phonemic sub 'l' to 'k' or word sub" → "Substitution Errors"  
     - "horizon, phonemic sub/distortion of vowel" → "Substitution Errors"  

2. **Deletion Errors**  
   - A phoneme or word is missing.  
   - MUST be used if "phonemic del" or "deletion" appears in the label.  
   - **Examples:**  
     - "raindrops, raindops, phonemic del" → "Deletion Errors"  
     - "act, ac, phonemic del (SCE), strained voice" → "Deletion Errors"  

3. **Insertion Errors**  
   - An extra phoneme or word is added.  
   - MUST be used if "addition" appears in the label.  
   - **Examples:**  
     - "word addition 'big'" → "Insertion Errors"  
     - "phonemic addition its -> biz" → "Insertion Errors"  

4. **Repetition Errors**  
   - A phoneme or word is repeated.  
   - MUST be used if "repetition" appears in the label.  
   - **Examples:**  
     - "his, his his, repetition" → "Repetition Errors"  
     - "prism, prism.prisms, repetition+ phonemic addition" → "Repetition Errors"  
     - "word repetition 'look'" → "Repetition Errors"  

5. **Prosodic Errors**  
   - Irregular pauses, unnatural intonation, or breaks between words.  
   - MUST be used if the label includes: "pause", "intonation", "break", "timing issue".  
   - **Examples:**  
     - "gold at one end, gold...at one end, irregular break between words" → "Prosodic Errors"  
     - "strained voice" → "Prosodic Errors"  

---

### **Input**
"{label}"

### **Output Format**
- **Return only one category name:**  
  `Substitution Errors`, `Deletion Errors`, `Insertion Errors`, `Repetition Errors`, or `Prosodic Errors`.  
- **Do NOT include explanations, reasoning, or additional text.**
"""