Speech in Noise part 1: Why is Latency Such a Problem for AI?

Posted by Dr Andrew Simpson on Nov 14, 2020 6:46:56 PM

AI is all about information. Information in sound accumulates over time.

In the case of AI for speech enhancement, the input is a mixture of speech and noise and the output is a clean voice signal. The more accumulated information you can give the AI at the input, the better you can expect the output to be.

However, if we’re talking about real-time AI for hearing aid type devices, the time you spend waiting for information to accumulate results in proportionate latency. This is bad for listeners.

So you end up with something of a catch-22: the more information you accumulate, the better your AI will work, but the longer the listener has to wait. If you want it fast, you end up sacrificing information, and the AI doesn’t work so well. So, you can have it good, or you can have it fast, but you can’t have both. This is the uncertainty principle in action.

This means that a very impressive AI for speech enhancement that runs on a large server and uses large accumulation buffers to process telecoms signals with high latency does not necessarily translate into something that will perform similarly without the latency.

The acceptability threshold for hearing device latency is 6ms [REF]. At a sample rate of 44.1kHz that’s only about 265 samples - not long enough for a person to say ‘boo’.
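To make that budget concrete, here is a minimal sketch of the arithmetic: converting a latency budget in milliseconds into a sample count at a given sample rate. The 6ms threshold is the figure quoted above; the helper name and the extra sample rates are illustrative, not part of any real device spec.

```python
# Convert a latency budget in milliseconds into a sample count.
# 6 ms is the acceptability threshold quoted in the post; the
# function name and the list of sample rates are illustrative.

def latency_to_samples(latency_ms: float, sample_rate_hz: int) -> float:
    """Number of audio samples that fit inside the latency budget."""
    return latency_ms / 1000.0 * sample_rate_hz

for rate_hz in (16_000, 44_100, 48_000):
    samples = latency_to_samples(6, rate_hz)
    print(f"{rate_hz} Hz: {samples:.0f} samples")
```

At 16kHz (a common rate for speech models) the budget is even tighter - under a hundred samples - which is part of why server-side telecoms models with large accumulation buffers don’t translate directly to hearing devices.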

At Chatable, we are developing the world’s first real-time zero-latency AI for speech in noise.

Keep an eye out for updates.