I’m currently taking Microsoft’s DEV287x on Speech Recognition Systems on edX.org. We are supposedly going to build a speech recognition system from scratch in Python over the next several months. The course makes frequent references to how human speech and hearing work, and discusses the need for computers to process the speech signal much the way humans hear what we say to each other. For instance, applying a mel filterbank that favors lower frequencies and then taking the log of the result, because human hearing emphasizes the important frequencies in a similar way. But working on module 02 of the course, audio signal processing, I came across some fascinating references on Columbia University web pages and beyond.
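For anyone curious what that mel-filterbank-plus-log step looks like in code, here is my own minimal NumPy sketch (not the course’s code; the filter count, FFT size, and sample rate are arbitrary illustrative choices). Each filter is a triangle on the frequency axis, and because the triangles are spaced evenly on the mel scale, they are narrow at low frequencies and wide at high ones:

```python
import numpy as np

def hz_to_mel(hz):
    # A common mel-scale formula
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters with centers evenly spaced on the mel scale
    low_mel, high_mel = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = np.linspace(low_mel, high_mel, n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising slope of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# Apply the filterbank to a power spectrum and take the log,
# as the course describes (random noise stands in for real speech here)
sample_rate, n_fft = 16000, 512
fbank = mel_filterbank(26, n_fft, sample_rate)
power_spectrum = np.abs(np.fft.rfft(np.random.randn(n_fft))) ** 2
log_mel = np.log(fbank @ power_spectrum + 1e-10)
print(log_mel.shape)  # one log-energy value per mel filter
```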
One site is a free online version of an old book on Music and Computers that apparently didn’t sell well enough to continue in a print edition. The purpose of the book is both to teach the basics of sound and music and to teach you to analyze music with a computer.
The book was written as a collaboration between the Columbia University Music Department and the Mathematics and Music Departments of Dartmouth College. In a foreword, a Dartmouth College math professor says a point of the book was to teach, using sound, that math can be not only very powerful but also a lot of fun, instead of teaching math through very dry topics. In sidebars, the book has downloadable samples of music, applets to illustrate and play with music that require Java and an ActiveX control (not too secure for Windows unless you run a virtual machine you can trash afterwards), and also links to further information within the book and on the Internet (although because of the age of the book and the aging of the Internet, some of the links generated “404 Page Not Found” errors).
The second Columbia University website is an active graduate-level course on Topics in Spoken Language Processing.
It covers a wide range of topics, from how we make speech and how computers can process the sound in speech, to analyzing human emotions and behaviors such as deception and trust, humor and sarcasm, and likability and charisma. A list of downloadable course materials is here (downloading anything over the Internet, especially via http:// rather than https://, can be insecure):
Of particular interest to me, since we have a grandchild with a speech impediment, is the book chapter on phonetics and a link to a manual for a free, very powerful open-source program that provides graphic displays for analyzing phoneme pronunciation (I haven’t tried the software itself yet to see if it still works; it appears to be an old program).
Edit/Update: The direct link does not seem to work from hearingtracker.com, but the reference is the following link in the index of course materials:
2nd Update: The Praat manual source is hosted at UCSD: http://wstyler.ucsd.edu/praat/UsingPraatforLinguisticResearchLatest.pdf
Apart from these two Columbia University sites, there is the highly cited book Speech and Language Processing by Daniel Jurafsky and James H. Martin; the 3rd edition is available for free in draft form from Stanford University.
Also highly cited on the Internet as reference material on speech recognition is the University of Edinburgh (the Harvard of Scotland) course on Automatic Speech Recognition, which cites Jurafsky’s book as the essential text for the course:
So although these references may seem “off-topic” for a hearing website, consider that a number of folks here extol the benefits of Google’s Live Transcribe, and the above science is the foundation on which Live Transcribe is built. A really good transcription system for the hard of hearing would also want to capture emotion: sarcasm and humor, joy and sadness, deception and trust, etc., and textually inform the user of shades of meaning the user has trouble hearing naturally…
BTW, if anyone is familiar with Python and wants to give audio processing a try in a very simple way, the following is an excellent short tutorial in Python code on how to generate an artificial sound signal by combining two sine waves, window the signal, and then do a Fast Fourier Transform (FFT) to reveal the frequencies that compose your artificially made-up signal. The nice thing about the example is that if you want to convince yourself it really works, you can add an additional sine wave or two and see that it appears in the graph of the frequency analysis exactly where expected. For example, adding “+ np.sin(100.0 * 2.0*np.pi*x)” to the definition of the variable “y” creates a third frequency at 100, equal in magnitude to the first frequency at 50 and twice the magnitude of the second frequency at 80. BTW, the example works perfectly in Anaconda Python 3.7 using an interactive window in VS Code, even though Python 2.7 is cited as the language used.
(comment out the line “y = np.array(Bx)” and uncomment the preceding line beginning “y = np.sin(…”)
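In case the tutorial link goes stale, here is my own rough sketch along the same lines (not the tutorial’s code; I picked the sample rate and point count so the test frequencies land exactly on whole-Hz FFT bins):

```python
import numpy as np

# 1 second of signal sampled at 800 Hz, so FFT bins fall on whole-Hz values
N = 800
T = 1.0 / 800.0
x = np.linspace(0.0, N * T, N, endpoint=False)

# Two sines at 50 Hz and 80 Hz (the 80 Hz one at half amplitude),
# plus the extra 100 Hz tone described above
y = (np.sin(50.0 * 2.0 * np.pi * x)
     + 0.5 * np.sin(80.0 * 2.0 * np.pi * x)
     + np.sin(100.0 * 2.0 * np.pi * x))

window = np.hanning(N)            # Hann window to reduce spectral leakage
yf = np.fft.rfft(y * window)
freqs = np.fft.rfftfreq(N, T)     # 0, 1, 2, ... Hz, one bin per Hz here
magnitude = 2.0 / N * np.abs(yf)

# The spectrum should show peaks at 50, 80, and 100 Hz, with the
# 80 Hz peak about half the height of the other two
for f in (50, 80, 100):
    print(f, "Hz peak magnitude:", magnitude[f])
```

Plotting `freqs` against `magnitude` with matplotlib gives the same kind of frequency graph the tutorial shows.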