Building a Custom Speech Recognition Engine for Kids’ Voices

As an app developer, I’m on a mission to build a more accurate speech recognition engine that can better understand kids’ voices. I’m currently using a speech framework in my iOS app, but I want to create a custom solution that can adapt to the unique characteristics of children’s speech.

To achieve this, I’ve started collecting sample audio data from my app, keeping privacy concerns in mind. I transcribe these audio files with Whisper Large V2 and use the transcripts as pseudo-labels to fine-tune Whisper Tiny. But I have some questions about the effectiveness of this approach.
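
For concreteness, the labeling step looks roughly like the sketch below. It assumes the open-source openai-whisper package, and the recordings folder and manifest file names are placeholders for my own setup, not real paths.

```python
import csv
from pathlib import Path

import whisper

# Whisper Large V2 acts as the "teacher" that produces the transcripts.
teacher = whisper.load_model("large-v2")

audio_dir = Path("collected_audio")  # placeholder: folder of exported app recordings
rows = []
for wav in sorted(audio_dir.glob("*.wav")):
    result = teacher.transcribe(str(wav), language="en")
    rows.append({"audio_path": str(wav), "text": result["text"].strip()})

# Save (audio, pseudo-label) pairs as a manifest for the fine-tuning step.
with open("pseudo_labels.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["audio_path", "text"])
    writer.writeheader()
    writer.writerows(rows)
```

The idea is essentially knowledge distillation by pseudo-labeling: the large model supplies the targets, and the tiny model is trained to reproduce them.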

Firstly, is this a valid strategy, or will Whisper Tiny’s limited parameter count make it a futile exercise no matter how much I train it? Secondly, most of my data is not clean: the kids’ speech is interspersed with background noise and other disturbances, yet it’s crucial that my app stays accurate in exactly these environments.
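
To make the first question concrete: by “train” I mean ordinary fine-tuning of Whisper Tiny on the pseudo-labeled pairs, roughly along the lines of the sketch below, which uses Hugging Face Transformers and Datasets. The paths and hyperparameters are placeholders, not values I’ve validated.

```python
from dataclasses import dataclass

import torch
from datasets import Audio, load_dataset
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-tiny", language="English", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Load the (audio, pseudo-label) manifest from the step above; decode audio at 16 kHz.
ds = load_dataset("csv", data_files="pseudo_labels.csv", split="train")
ds = ds.rename_column("audio_path", "audio").cast_column("audio", Audio(sampling_rate=16_000))

def prepare(example):
    audio = example["audio"]
    example["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    example["labels"] = processor.tokenizer(example["text"]).input_ids
    return example

ds = ds.map(prepare, remove_columns=ds.column_names)

@dataclass
class Collator:
    processor: WhisperProcessor
    decoder_start_token_id: int

    def __call__(self, features):
        # Pad log-mel features and token labels separately; mask label padding with -100.
        batch = self.processor.feature_extractor.pad(
            [{"input_features": f["input_features"]} for f in features], return_tensors="pt"
        )
        labels_batch = self.processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
        )
        labels = labels_batch["input_ids"].masked_fill(labels_batch["attention_mask"].ne(1), -100)
        # The model prepends the start token itself, so drop it from the labels if present.
        if (labels[:, 0] == self.decoder_start_token_id).all():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-tiny-kids",  # placeholder output directory
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    warmup_steps=100,
    max_steps=2000,
    fp16=torch.cuda.is_available(),
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=Collator(processor, model.config.decoder_start_token_id),
)
trainer.train()
```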

I’m also wondering how many hours of audio data I need to train the model to achieve reasonable accuracy, given the quality of my audio samples. And finally, are there better solutions or approaches that I could explore?

If you’ve worked on similar projects or have insights to share, I’d love to hear from you. Let’s explore the possibilities of building a custom speech recognition engine that can make a real difference for kids.
