
Understanding Speech Recognition

Getting good results

To achieve high recognition accuracy, especially for large vocabulary systems, a number of conditions must be fulfilled. These requirements relate to the factors that influence the statistical analysis most, of which the nature and quality of the input signal is key. It is essential to use a clean, wideband signal. Consequently, it is very important to use a good quality microphone, positioned so that breathing and other noises do not interfere with the speech. On laptops especially, where many electronic components sit very close to one another, the analogue audio circuitry can be prone to interference that degrades the signal. A USB microphone (particularly one designed for speech recognition) is often better than the analogue mic-in line on a soundcard (notably a low-end one), as it produces a cleaner signal.

A wideband signal is also very important, as it retains more features of the original signal, allowing for better accuracy and higher precision during the analysis. That is why recognising speech from narrowband audio sources (such as a traditional telephone signal) produces far worse recognition accuracy than direct, wideband, high quality microphone input. If a narrowband signal must be used, it is important to use statistical models derived from a representative narrowband corpus rather than models built from wideband acoustic corpora.
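As a rough illustration of the bandwidth point, the Python sketch below computes the highest frequency a sampled signal can represent (half the sampling rate) and picks a matching model family. The 8000 Hz threshold and the names `usable_bandwidth_hz` and `pick_acoustic_model` are assumptions made for this example; they are not taken from any particular recognition engine.

```python
# Illustrative only: the threshold and function names are hypothetical.
NARROWBAND_MAX_RATE_HZ = 8000  # assumed cut-off; telephone audio is commonly sampled at 8 kHz


def usable_bandwidth_hz(sample_rate_hz: int) -> float:
    """Return the highest frequency representable at this sampling rate."""
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    return sample_rate_hz / 2


def pick_acoustic_model(sample_rate_hz: int) -> str:
    """Choose the model family that matches the bandwidth of the input."""
    if sample_rate_hz <= NARROWBAND_MAX_RATE_HZ:
        return "narrowband"
    return "wideband"


if __name__ == "__main__":
    for rate in (8000, 16000):
        print(rate, usable_bandwidth_hz(rate), pick_acoustic_model(rate))
```

An 8 kHz telephone-style signal can carry nothing above 4 kHz, whereas a 16 kHz microphone recording reaches 8 kHz, which is why the two call for different models.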

Even if the input is of high quality, too much background noise (music, other people chatting, traffic noise, etc.) will also heavily reduce recognition accuracy, so the acoustic environment should be as free of background noise as possible. Many systems can be trained on specific user voices. If the system is to be used in an environment that is noisy by nature, system training can also help in dealing with background noise, provided the noise is fairly stable (i.e. the background noise during training is representative of the usage conditions).
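One common way to put a number on "too much background noise" is the signal-to-noise ratio (SNR) in decibels. The sketch below is a minimal illustration that assumes separate sample lists for a stretch of speech and a stretch of background noise are available; the function names are hypothetical, and what counts as an acceptable SNR depends on the recogniser.

```python
import math


def mean_power(samples: list[float]) -> float:
    """Average squared amplitude of a block of audio samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech: list[float], noise: list[float]) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    noise_power = mean_power(noise)
    if noise_power == 0:
        return math.inf  # no measurable noise at all
    return 10 * math.log10(mean_power(speech) / noise_power)


if __name__ == "__main__":
    speech = [1.0, -1.0] * 100  # toy "speech" at full amplitude
    noise = [0.1, -0.1] * 100   # toy background at a tenth of that amplitude
    print(round(snr_db(speech, noise), 2))
```

A higher value means the speech stands further above the background; measuring the noise on its own (for example during a pause) is also a simple check on whether it is stable enough for training to help.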

If the application is a simple command and control interface that needs to recognise a few dozen fairly distinct utterances (as in the case of the Linguatronic system I mentioned earlier), even a limited bandwidth input signal of only moderate quality might still be sufficient, but for larger vocabulary systems the audio input must be as clean and wideband as possible.

As already stated, speaker training of the speech recognition engine can also significantly enhance accuracy. This is because training produces a statistical model of the user's voice and background audio, which can be taken into account when analysing the speech. Other factors with a notable impact on accuracy are the speaker's pace, intonation and articulation (including accent), and whether phrases are correctly structured.
