Automatic speech recognition technologies have improved greatly over the last decade. But these advancements often leave people with speech impairments behind.
The Speech Accessibility Project, an initiative run by researchers at the University of Illinois Urbana-Champaign, aims to transform accessibility in speech recognition software—known generally as automatic speech recognition (ASR) systems—for individuals with speech disorders.
“Success for me is that a person with Down syndrome [or other conditions that affect speech] can start using a smartphone or smart speaker, and it will just work,” said Mark Hasegawa-Johnson, a U of I professor of electrical and computer engineering and the principal investigator of the project. “They’ll be able to use it exactly the way their peers would.”
The project aims to build a dataset of dysarthric speech, which is speech that is difficult to understand because a motor speech disorder affects the muscles used for speaking.
People with atypical speech are not well-represented in the datasets tech companies use to train ASR systems, Hasegawa-Johnson said.
“The biggest obstacle we’ve had until now is that the speech that’s available to train the ASR comes from people who read audiobooks,” he said. “As a result, when you get people who have speech that’s atypical in some way, they don’t sound exactly like any of the speakers in the training set, and therefore the speech recognizer doesn’t know how to deal with it.”
Artificial intelligence powers speech recognition tools that respond to spoken commands, such as voice assistants, voice-to-text and translation apps. These systems rely on machine learning; without diverse and representative training data, they cannot accurately understand different types of speech.
The project targets five main diagnoses: Parkinson’s disease, cerebral palsy, Down syndrome, amyotrophic lateral sclerosis (ALS) and speech disabilities due to stroke.
As of 2023, public data collections for dysarthric speech were more than 32 times smaller than those for typical speech, with only about 1,000 hours of data publicly available, according to Xiuwen Zheng, a third-year U of I graduate student who works on the project. The explosion of ASR technology in the 2010s led to rapid improvements in ASR accuracy for typical speech, reducing word error rates to as low as 1.4% in 2023, she said. But these advancements did not extend to dysarthric speech, where word error rates remained around 18%.
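Word error rate, the metric Zheng cites, is the share of words a recognizer gets wrong compared with a human transcript. The following is a minimal, illustrative Python sketch of that calculation; the function name and example sentences are invented for illustration and are not taken from the Speech Accessibility Project's code.

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + deletions + insertions) / words in the reference,
        computed here with a word-level Levenshtein distance."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + cost) # substitution or match
        return dp[len(ref)][len(hyp)] / len(ref)

    # One wrong word out of six gives a word error rate of about 17%.
    print(word_error_rate("turn up the volume to maximum",
                          "turn up the volume to max"))  # 1/6 ≈ 0.167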
As of the end of June 2024, the project had shared 235,000 speech samples with the companies that fund it: Amazon, Apple, Google, Meta and Microsoft. According to the project’s website, all five have longstanding commitments to making their products, services and experiences accessible to people from diverse backgrounds.
The Speech Accessibility Project recently kicked off a 90-day challenge aimed at uncovering new ideas for improving ASR systems, Zheng said.
The challenge invites developers and researchers beyond the U of I to help tackle ASR limitations for dysarthric speech by using the Speech Accessibility Project’s dataset to create new ASR models.
“The goal of this challenge is to try to advance the state-of-the-art ASR speech recognition,” Zheng said. “We’re actually hoping there will be some novel ideas or some advanced speech recognizers that some teams will come up with.”
The Speech Accessibility Project has worked with more than 1,200 individuals with the five target diagnoses, said Clarion Mendes, a speech-language pathologist and investigator on the project.
“I think something that surprised me was how much people have been enjoying the process and have given us feedback about how they’ve found it a really rewarding experience to share their voice with the project,” Mendes said.
Individuals who participate in the project can record in the comfort of their own homes and at their own pace, Mendes said.
Participants are given core prompts, like “Turn up the volume to maximum,” a common command for voice-controlled devices. They are also given more individualized prompts with proper nouns, such as “Play a song by Taylor Swift.” Another section includes a series of phonetically diverse sentences from novels that capture the different sounds of the English language, as well as open-ended prompts like, “Tell me about a pet, or a pet you wish you had.”
The project invites anyone with one of the five target conditions to contribute voice samples. Participants are compensated for their time, and all data is private and de-identified, according to the project’s website.
Looking ahead, the project aims to expand to include individuals with additional diagnoses, as well as non-English speakers.
“I’m just excited for the opportunity for the technology to work for these participants, so that the world becomes a lot more open,” Mendes said.