How Speech Recognition works

Understanding Speech Recognition

How it works

Hidden Markov models form the basis of modern speech recognition engines and have done so for decades; the approach dates back to the late 1960s. It has lasted this long because no clearly better methodology has yet been found. There is plenty of literature on Markov modelling, and a detailed explanation falls outside the scope of this article (search for the classic papers by Baum on the topic), but I will briefly highlight the core properties that make Hidden Markov models so persistent in the field of speech recognition.

Speech audio is a signal with some particular properties: it is continuous, non-stationary and often not very pure (i.e. affected by noise as defined in information theory). Speech is essentially stochastic and therefore needs to be analysed using statistical methods. Hidden Markov models provide such a statistical description of the speech signal. This analysis renders speech into a series of phonemes, which are combined using n-gram models. N-grams are essentially probabilistic models defining the most likely sequences of phonemes and words.
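
To make the idea of a hidden Markov model more concrete, here is a minimal sketch in Python of Viterbi decoding, a standard way to find the most likely sequence of hidden states behind a series of observations. Everything in it is an assumption made up for illustration: the two phoneme states ("s" and "iy"), the three coarse acoustic symbols ("hiss", "tone", "quiet") and all the probabilities. A real engine works on continuous acoustic features with far larger models.

```python
import math

# Hypothetical toy model: two phoneme states and three coarse acoustic symbols.
# All probabilities are invented for illustration only.
states = ["s", "iy"]
start_p = {"s": 0.6, "iy": 0.4}
trans_p = {
    "s": {"s": 0.7, "iy": 0.3},
    "iy": {"s": 0.4, "iy": 0.6},
}
emit_p = {
    "s": {"hiss": 0.8, "tone": 0.1, "quiet": 0.1},
    "iy": {"hiss": 0.1, "tone": 0.7, "quiet": 0.2},
}


def viterbi(observations):
    """Return (log probability, most likely hidden state path)."""
    # best[s] holds the best-scoring path that ends in state s so far.
    best = {
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor state that gives the highest score into s.
            score, path = max(
                (best[prev][0] + math.log(trans_p[prev][s]), best[prev][1])
                for prev in states
            )
            new_best[s] = (score + math.log(emit_p[s][obs]), path + [s])
        best = new_best
    return max(best.values())


log_p, path = viterbi(["hiss", "hiss", "tone", "tone"])
print(path, math.exp(log_p))
```

For the observation sequence above, the decoder picks the path s, s, iy, iy. Note that it does so purely by multiplying probabilities; nothing in the calculation refers to what the sounds mean.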

It should be clear from this description that, as already stated, a speech recognition engine does not really understand speech. The informational content, the meaning, of the speech is unknown to the computer. The probability models cannot resolve questions that rely on comprehension or, worse still, on contextual knowledge needed for correct interpretation. For example, making the distinction between "I scream" and "ice cream" requires true understanding and appreciation of context, something computers simply cannot do, as they do not possess true intelligence in the human sense.
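
The "I scream" versus "ice cream" problem can be illustrated with a small bigram model, sketched here in Python. The four-sentence training corpus is hypothetical, and add-one smoothing is simply one common choice, not necessarily what any particular engine uses. The model scores a candidate word sequence by how often its word pairs occurred in the training text.

```python
import math
from collections import Counter

# Hypothetical miniature training corpus; a real engine would use far more text.
corpus = [
    "i like ice cream",
    "we eat ice cream",
    "ice cream is cold",
    "i scream loudly",
]

contexts = Counter()  # how often each word appears as the first word of a pair
bigrams = Counter()   # how often each word pair appears
vocab = set()
for sentence in corpus:
    words = ["<s>"] + sentence.split()  # "<s>" marks the start of a sentence
    vocab.update(words[1:])
    contexts.update(words[:-1])
    bigrams.update(zip(words, words[1:]))


def log_prob(sentence):
    """Log probability of a sentence under an add-one smoothed bigram model."""
    words = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        count = bigrams[(prev, word)] + 1
        total += math.log(count / (contexts[prev] + len(vocab)))
    return total


for candidate in ["we like ice cream", "we like i scream"]:
    print(candidate, log_prob(candidate))
```

With this corpus the model prefers "we like ice cream", only because "ice cream" occurred more often in the training text than "i scream". It would make the same choice even if the speaker actually meant the other phrase, which is the limitation described above.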

As such, speech recognition will remain severely constrained and no match for the human brain in its ability to understand spoken language. We have seen improved recognition rates over the last decades, not least because modern computers have more processing power and memory at their disposal (a result of Moore's Law), allowing more in-depth, finer-grained analysis and more cleverly designed text-building algorithms. Extra capacity and performance mean more opportunity for secondary inspection of the (partially) recognised speech. Fundamentally, however, speech recognition remains constrained by the immovable barriers that a lack of true understanding entails. The best hope for a real breakthrough is thus closely tied to developments in Artificial Intelligence, another field where science and technology have made very little true progress. We are still a long way from building computer systems that really understand information in the proper sense. In the meantime, it is vital that users of speech recognition systems, and the designers and engineers implementing products and services based on speech recognition, are fully aware of the constraints and limitations of the technology.

Next: Command and Control versus Large Vocabulary Systems
