I think speech detection is not that big of a challenge. Speech has distinctive characteristics that are easy to pick out, and modern HAs from the big 6 can tell what is speech and what is non-speech very easily. But speech detection is not the challenge. Speech and noise separation IS the challenge.
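Just to illustrate what I mean by speech having easily detectable characteristics, here is a toy speech/non-speech check based on short-term energy and zero-crossing rate. This is purely my own illustration with made-up thresholds, not what any of the big 6 actually do:

```python
import numpy as np

def is_speech(frame, energy_threshold=1e-4, zcr_threshold=0.25):
    """Crude speech/non-speech decision on one audio frame (NumPy float samples).

    Voiced speech tends to combine noticeable energy with a moderate
    zero-crossing rate; hiss-like noise has a high ZCR, and silence has
    low energy. The thresholds here are arbitrary illustration values.
    """
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # fraction of sign flips per sample
    return energy > energy_threshold and zcr < zcr_threshold
```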
In the bold part I highlighted above, Dr. Schum is right that separating the noise from the speech has been very hard to crack (almost the holy grail). That’s why beamforming to the front is so prevalent: if you can’t separate the noise, the next best thing you can do is find the speech’s (front) direction and zoom in on it to suppress the surrounding sounds. Even if that doesn’t separate the noise that’s already mixed in with the speech in front, it’s still much better because the surrounding noise is eliminated. But we all know the trade-off here: beamforming causes the blinder effect.
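For anyone curious what “zoom in on the front” means mechanically, here is a minimal two-mic delay-and-sum sketch. It is my own illustration only; real HA beamformers are adaptive and far more sophisticated:

```python
import numpy as np

def delay_and_sum(front_mic, rear_mic, delay_samples):
    """Two-microphone delay-and-sum beamformer aimed at the front.

    Sound from straight ahead reaches the front mic `delay_samples` before
    the rear mic (set by mic spacing, the speed of sound, and the sample
    rate). Delaying the front signal by that amount and averaging makes
    frontal sound add constructively, while sound from the sides and back
    is partially cancelled.
    """
    if delay_samples > 0:
        delayed_front = np.concatenate(
            [np.zeros(delay_samples), front_mic[:-delay_samples]]
        )
    else:
        delayed_front = front_mic
    return 0.5 * (delayed_front + rear_mic)
```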
Oticon, with the OPN, came up with a new trick borrowed from the noise-cancelling principle used in headphones. Thanks to the much faster processing power made possible by the Velox platform, they can build a noise model from the sounds on the sides and back and subtract it from the noise-diffused speech in front in real time.
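My (layman’s) mental model of that subtraction is something like spectral subtraction: estimate the noise spectrum from the sides/back, then subtract it from the front signal’s spectrum frame by frame. This is just the general principle, not Oticon’s actual algorithm:

```python
import numpy as np

def subtract_noise_model(noisy_fft, noise_model_mag, floor=0.05):
    """Subtract an estimated noise magnitude spectrum (e.g. modelled from the
    sides/back) from the spectrum of the noisy front signal, keeping the
    noisy phase. The floor keeps magnitudes from going negative.
    """
    noisy_mag = np.abs(noisy_fft)
    phase = np.angle(noisy_fft)
    clean_mag = np.maximum(noisy_mag - noise_model_mag, floor * noisy_mag)
    return clean_mag * np.exp(1j * phase)
```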
But I think AI takes a different approach to speech and noise separation. There’s no more noise blocking via beamforming, or noise cancelling (a la OPN style) per se. Those are all signal-processing tricks, so to speak. Well, beamforming is still always there and you can still choose to use it, I guess, but that part isn’t done by the AI.
The AI approach adopted by Whisper, as described in their whitepaper, is to build a library of thousands of hours’ worth of unique sounds that can be looked up and compared against. The classification of those sounds by their importance to the listener is already built in, and the sounds can be broken down further into characteristics that reveal more about their sources and environments. The AI probably does 4 things to the incoming signal, all on the fly: identify (via lookup and comparison against the thousands of hours of stored sounds), classify, separate, and process. So it needs a lot of brain processing power to support this.
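To make the “lookup and compare” step concrete, here is a toy nearest-neighbour lookup against a pre-built library of sound fingerprints. The feature and the matching method are entirely my own guesswork for illustration, not anything from Whisper’s whitepaper:

```python
import numpy as np

def fingerprint(clip, n_bands=32):
    """Crude fixed-length feature: average magnitude spectrum in coarse bands
    (assumes clips of equal length)."""
    spectrum = np.abs(np.fft.rfft(clip))
    bands = np.array_split(spectrum, n_bands)
    return np.array([band.mean() for band in bands])

def identify(clip, library):
    """Nearest-neighbour lookup of an incoming clip against a library of
    labelled fingerprints built offline, e.g.
    library = [("speech", fp1), ("traffic", fp2), ...].
    """
    query = fingerprint(clip)
    scored = [(label, np.linalg.norm(query - fp)) for label, fp in library]
    return min(scored, key=lambda item: item[1])[0]
```

The classify/separate/process steps would then act on whatever label comes back, which is where the real complexity (and processing power) comes in.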
So I think the training approach that Whisper uses for its AI (the Sound Separation Engine) and the training approach that Oticon uses are vastly different. That’s why Whisper never mentioned how many sound scenes they’ve captured and trained their DNN on. Instead, they only mentioned capturing "thousands of hours of unique sound sources in environments and learn to distinguish … portions… that are of high importance… from those that are less relevant." Whisper seems to train by building a library of sounds, along with their classifications and other rules and characteristics, storing it in the brain, and then doing on-the-fly access, lookup, and comparison to identify, then classify, separate, and process.
Oticon seems to build their DNN differently. They also focus on the different characteristics of speech and noise, but they designed a neural network that specializes in dealing with dynamic signals, meaning not just sounds at a single moment but sounds as they vary over time, because they recognize that a key feature of sound is that there is a degree of continuity to it. Based on this requirement, Oticon settled on a DNN built around a Gated Recurrent Unit (GRU), a variant of the well-established LSTM neural network used for many other things. This allows them to have “an algorithm that not only recognizes different features that sounds have in a single moment, but also how those sound features vary over time. The ability to incorporate information over time is precisely the unique attribute that they need to analyze a dynamic signal such as sounds.” The GRU/LSTM links the same neural network across time steps and has the network pass information to itself over and over to recreate this degree of continuity.
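For anyone who wants to see what “the network passes information to itself over time” looks like, here is a bare-bones GRU step in NumPy, following the standard GRU equations (not Oticon’s actual network, obviously):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU time step. `p` holds the weight matrices (Wz, Uz, Wr, Ur, Wh, Uh)
    and biases (bz, br, bh). The hidden state h carries information from
    previous frames forward, which is how the network models continuity."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])            # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])            # reset gate
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate state
    return (1 - z) * h_prev + z * h_tilde                               # blend old and new

def run_gru(frames, h0, p):
    """Process a sequence of feature frames, passing the hidden state along."""
    h = h0
    for x_t in frames:
        h = gru_step(x_t, h, p)
    return h
```

The hidden state `h` is the “memory” that gets passed from frame to frame, which is the continuity the Oticon material is talking about.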
Hence Oticon captured entire sound scenes with a 360-degree globe of mics (not just unique sounds like Whisper did, but whole scenes of multiple sounds), each lasting a certain amount of time. Then they fed these sound scenes into their GRU neural network to do their “things”, adjusted for the best possible output, then rinsed and repeated 12 million times. After it’s all trained up, they have something that “just knows what to do according to the trained ‘formula’ and just does it by quickly cranking through the formula”, so to speak.
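That “adjust for the best possible output, then repeat 12 million times” part is basically a training loop. Here is a deliberately tiny sketch of that loop, where the “model” is just a set of per-band gains instead of a real DNN, so treat it as my illustration of the principle only:

```python
import numpy as np

def train_denoiser(noisy_scenes, clean_targets, iterations=1000, lr=1e-2):
    """Generic train/adjust/repeat loop: run the model on a scene, compare the
    output to the desired target, nudge the weights toward a better output,
    and repeat. `noisy_scenes` and `clean_targets` are 2-D arrays of band
    magnitudes, one row per scene."""
    rng = np.random.default_rng(0)
    n_bands = noisy_scenes.shape[1]
    weights = np.ones(n_bands)                  # per-band gains: the trainable "formula"
    for step in range(iterations):
        i = rng.integers(len(noisy_scenes))     # pick a training scene
        x, target = noisy_scenes[i], clean_targets[i]
        prediction = weights * x                # apply the current formula
        error = prediction - target
        gradient = error * x                    # gradient of the squared error w.r.t. the gains
        weights -= lr * gradient                # adjust toward a better output
    return weights
```

Once trained, the weights (or in Oticon’s case, the whole network) just get applied to incoming sound; no more adjusting happens in the hearing aid itself.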
I guess maybe an analogy, not sure if it’s a good one or not, is that Whisper brings the whole kitchen to the show and does all the cooking on the fly for a predetermined meal given a set of ingredients. Oticon builds a cooking machine offline, presumably one small enough to just put on the kitchen table, and simply feeds the same ingredients through the machine to crank out the same meal. Not necessarily any faster; it’s just that the machine fits on the table and is not the whole kitchen. Of course, the quality of the 2 meals is for the users to judge.