Abstract: Racial bias in automatic speech recognition (ASR) is an emerging area of concern in fields associated with human-computer interaction. Research to date suggests that sociolinguistic variation, namely systematic sources of sociophonetic variation, has yet to be extensively exploited in acoustic model architectures. This talk reports on an ongoing project that evaluates the performance of one ASR system on a multi-ethnic sample of speakers from the American Pacific Northwest, comprising four groups: Yakama (Native American), African American, European American, and ChicanX. Taking a sociophonetic approach to characterizing vocalic and consonantal variation allows us to ask which phonetic dialect variants appear to be most challenging for our ASR system. We also ask whether certain phonetic error types are observed most frequently in any of the four ethnic dialects sampled, which would signify higher error rates for particular groups. Recordings of both conversational and read speech were coded for a common set of 17 sociophonetic variables with distinct acoustic profiles. Automatic transcription was performed using CLOx, a custom-built ASR system created in the University of Washington Sociolinguistics Laboratory. Normalized error frequency rates (Nf) are compared across the ethnic group samples to evaluate CLOx performance. These Nf error rates demonstrate clear differential performance, pointing to racial bias in system output. Specific predictions are made regarding steps that might be taken to leverage sociophonetic knowledge to improve sociolect-recognition accuracy in ASR systems.
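The cross-group comparison described above can be sketched in code. Note that the abstract does not specify how Nf is computed, so the normalization below (errors per 1,000 transcribed words) is an assumption, and the function name and parameters are illustrative rather than taken from the talk.

```python
# Hedged sketch: the abstract does not define Nf precisely, so this assumes
# a common variationist normalization, errors per 1,000 transcribed words.
# The function name and parameters are illustrative, not from the talk.

def normalized_error_frequency(error_count: int, word_count: int, per: int = 1000) -> float:
    """Return the number of transcription errors per `per` words."""
    return per * error_count / word_count

# Comparing group samples then reduces to comparing Nf values on a common
# scale regardless of how much speech each group contributed, e.g.:
# normalized_error_frequency(50, 10000) -> 5.0 errors per 1,000 words
```

The point of the normalization is that raw error counts are not comparable when the groups contribute different amounts of speech; dividing by word count puts all four group samples on the same scale.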
Zoom Link: https://yale.zoom.us/j/93677552383