tag:blogger.com,1999:blog-167596382009-04-26T08:36:41.102-07:00what I know about Speech Recognitionajclnoreply@blogger.comBlogger7125tag:blogger.com,1999:blog-16759638.post-1133710840288067242005-12-04T07:35:00.000-08:002005-12-04T07:40:40.340-08:00Formants Analysis<strong>Three Formants Analysis<br /></strong><br />To view how more than two formants change over time, the formants of all frames can be plotted in a single 2-D graph. The results for two different speakers are shown in the following figures.<br /><br /><br /><p><a href="http://photos1.blogger.com/blogger/7013/1599/1600/fig2.0.png"><img style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: hand; TEXT-ALIGN: center" alt="" src="http://photos1.blogger.com/blogger/7013/1599/320/fig2.0.jpg" border="0" /></a>Formants of “five” spoken by a male speaker </p><p><br /><img style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: hand; TEXT-ALIGN: center" alt="" src="http://photos1.blogger.com/blogger/7013/1599/320/fig1.0.jpg" border="0" />Formants of “five” spoken by a female speaker<br /><br />The circles on each graph mark the voiced region for /ai/ in the digit “five”. Even though the two speech signals come from speakers of different gender, the graphs share a few common properties. Firstly, the voiced regions have relatively high amplitude compared to the unvoiced regions. The normalized amplitudes are shown by the intensity of the star markers (*): darker markers indicate higher amplitudes. 
Secondly, the formants follow a visible trend as the diphthong progresses through the voiced region, and they can be classified into certain ranges.<br /></p>ajclnoreply@blogger.com0tag:blogger.com,1999:blog-16759638.post-1132229449735945672005-11-17T04:02:00.000-08:002005-11-17T04:14:02.336-08:00Speech Analysis using LPCWriting a speech analysis tool is not difficult if you choose software you are familiar with. Some useful software can be found in the sidebar of this blog. I have written some speech signal analysis tools in MATLAB for my own speech signal studies. They can be found on MATLAB Central:<br /><br /><a href="http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=8779&objectType=FILE">http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=8779&amp;objectType=FILE</a><br /><br />This GUI is designed to extract and visualize the FFT and LPC spectra of a specific frame, or window. The following figure shows the LPC spectrum of a frame of 256 samples. The indicator on the upper subplot shows the location of that frame.<br /><br /><br /><img style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: hand; TEXT-ALIGN: center" alt="" src="http://photos1.blogger.com/blogger/7013/1599/320/sampleimage.jpg" border="0" />ajclnoreply@blogger.com9tag:blogger.com,1999:blog-16759638.post-1131208146259744632005-11-05T08:20:00.000-08:002005-11-05T08:29:06.260-08:00Frames/Blocks Processing<p>The LPC spectral analysis discussed in the previous post can be used to analyse the spectrum of each frame. 
The figure below illustrates the LPC spectra of a speech signal from frame 48 to frame 63. </p><p></p><p><a href="http://photos1.blogger.com/blogger/7013/1599/1600/fig1.0.png"><img style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: hand; TEXT-ALIGN: center" alt="" src="http://photos1.blogger.com/blogger/7013/1599/320/fig1.0.png" border="0" /></a><br />This segment of speech, from a male speaker, is the voiced region that corresponds to the diphthong “ay” in the digit “five”. A few characteristics can be seen in the graph. Firstly, the locations of the three formants are almost the same in all frames. Secondly, the amplitude of the first formant is relatively high in all frames (compare with the unvoiced frames shown below).<br /><img style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: hand; TEXT-ALIGN: center" alt="" src="http://photos1.blogger.com/blogger/7013/1599/320/fig2.png" border="0" /></p>ajclnoreply@blogger.com0tag:blogger.com,1999:blog-16759638.post-1131207581139298652005-11-05T08:11:00.000-08:002005-11-05T08:19:41.153-08:00Frames Representation of Speech SignalFrame-based data is a common format in digital computers. Data acquisition hardware often operates by accumulating a large number of signal samples at a high rate and passing these samples to the digital computer as a block of data.<br /><br />There are several reasons for processing in frames. Firstly, some time-domain properties of the signal are easier to see in frames. For example, the energy level of a speech signal over a period of time is analyzed in frames of a few tens of milliseconds each. Secondly, most frequency-domain analyses, for example the short-time Fourier transform, need the data to be in blocks, or windows. Another advantage of frame analysis is its use in real-time systems. 
Frame processing maximizes the efficiency of the system by distributing the fixed process overhead across many samples; the fast data acquisition is suspended by slow interrupt processes only after each frame is acquired, rather than after each individual sample. Typical parameter values for frame processing of an 8 kHz signal are a frame size of 256 samples with an overlap of 64 samples. The figure below illustrates a segment of a speech signal split into frames of 256 samples with an overlap of 192 samples.<br /><br /><img style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: hand; TEXT-ALIGN: center" alt="" src="http://photos1.blogger.com/blogger/7013/1599/320/fig1.png" border="0" />ajclnoreply@blogger.com1tag:blogger.com,1999:blog-16759638.post-1129876890453885902005-10-08T23:37:00.000-07:002005-10-20T23:45:48.610-07:00Speech Recognition: End-Point Detection<p></p><p></p><p></p><p>End-point detection is applied to extract the region of interest from the raw speech signal. In other words, it removes the silent regions from the speech signal. The basic technique is to compute the energy level of the signal. The energy is calculated in frames, where each frame consists of N samples. Adjacent frames usually overlap to produce a smooth energy curve. Fig 1 shows the energy plot of “One”.</p><p><img style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: hand; TEXT-ALIGN: center" alt="" src="http://photos1.blogger.com/blogger/7013/1599/320/fig14.jpg" border="0" /><br />Fig 1: (a) Amplitude vs time plot of “One” (b) energy level of the signal<br /><br />Accurate end-point detection is important for reducing the processing load and increasing the accuracy of a speech recognition system. 
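The frame-energy computation described above can be sketched in Python as follows. This is only a minimal sketch under assumptions that are not from the original post: the function names `frame_energies` and `find_endpoints` are hypothetical, the frame shift of 64 samples and the threshold of 10% of the maximum frame energy are arbitrary choices, and the test signal is a synthetic 200 Hz tone between two stretches of silence rather than a recorded word.

```python
import math


def frame_energies(signal, frame_size=256, hop=64):
    """Return the energy (sum of squared samples) of each overlapping frame."""
    energies = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        energies.append(sum(x * x for x in frame))
    return energies


def find_endpoints(signal, frame_size=256, hop=64, ratio=0.1):
    """Return (first_sample, end_sample) spanning the frames whose energy
    exceeds ratio * (maximum frame energy), or None for an all-silent signal."""
    energies = frame_energies(signal, frame_size, hop)
    if not energies or max(energies) == 0:
        return None
    threshold = ratio * max(energies)  # assumed rule, not the author's threshold
    active = [i for i, e in enumerate(energies) if e > threshold]
    return active[0] * hop, active[-1] * hop + frame_size


# Synthetic stand-in for an utterance: silence, a 200 Hz tone, silence.
fs = 8000
tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(3650)]
signal = [0.0] * 5000 + tone + [0.0] * 2000

start, end = find_endpoints(signal)
cropped = signal[start:end]  # region of interest kept for further processing
print(start, end, len(cropped) / len(signal))
```

Because the decision is made per frame, the detected boundaries are only accurate to within one frame length; a smaller frame shift gives a smoother energy curve at the cost of more frames to evaluate.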
There are two well-known end-point detection algorithms. The first uses signal features based on energy levels, and the second uses signal features based on the zero-crossing rate. Combining the two gives good results, but it also increases the complexity of the program and the processing time.<br /><br />Fig 2 shows the signal of “one” sampled at 8000 Hz for 10650 samples, or 1.33 seconds. Before the speech begins, the waveform is silent for about 5000 samples. After the utterance, the signal is silent again for about 2000 samples. By discarding the unwanted silent regions, the processing time can be reduced to 3650/10650 * 100 = 34.3% of the original, assuming every frame in the region of interest is processed. The energy level of the signal is inspected and a threshold value is determined from the energy plot. Fig 3 shows the cropped signal, where the silent regions have been eliminated and the remaining region of interest is used for further processing.<br /><img style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: hand; TEXT-ALIGN: center" alt="" src="http://photos1.blogger.com/blogger/7013/1599/320/fig24.jpg" border="0" /><br />Fig 2: (a) Original signal, (b) End-point detection by using the energy level of the speech signal</p><p><img style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: hand; TEXT-ALIGN: center" alt="" src="http://photos1.blogger.com/blogger/7013/1599/320/fig33.jpg" border="0" /><br />Fig 3: (a) Detected end point, (b) Cropped signal/region of interest<br /></p>ajclnoreply@blogger.com3tag:blogger.com,1999:blog-16759638.post-1129870219461664962005-10-01T21:45:00.000-07:002005-10-20T23:23:49.816-07:00Speech Recognition: Formants for Vowels Classification (II)In the previous post, the fundamentals of formants have been 
explained. In this article, we look at vowel classification based on the formants extracted from a speech signal.<br /><br />Fig 1 shows the formant locations for the six vowels and three diphthongs that occur in the English digits. The data are extracted from four male speakers, and the first formants are plotted against the second formants in a two-dimensional graph.<br /><div align="justify"></div><div align="justify"><img style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: hand; TEXT-ALIGN: center" alt="" src="http://photos1.blogger.com/blogger/7013/1599/320/fig1.jpg" border="0" /><br />Fig 1</div><div align="justify"><br />The formant frequencies of different vowels/diphthongs from different speakers overlap. Each circle marks the location of one vowel/diphthong and is centred at the mean value of that group of data.<br /><br />The graph shows that vowels and diphthongs with similar pronunciations are grouped closely together. For instance, the circles for the vowels /i/, /I/ and the diphthongs /ie/, /ei/ overlap more than the others.<br /><br />It is also possible to visualize the first three formants of vowels and diphthongs in a three-dimensional graph. Fig 2 illustrates the same formants for the vowels and diphthongs of the English digits with the addition of the third formant. </div><div align="justify"><br /><img style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: hand; TEXT-ALIGN: center" alt="" src="http://photos1.blogger.com/blogger/7013/1599/320/fig2.jpg" border="0" /></div><div align="justify"></div><div align="justify">Fig 2</div><div align="justify"><br /><br />Each sphere marks the location of one vowel/diphthong and is centred at the mean value of that group of data. With the addition of the third formant, the overlap between the vowels/diphthongs is reduced. 
An obvious example is the pair /u/ and /ie/.<br /><br />However, because the formant features still overlap even in three-dimensional space, and because formants are difficult to define for unvoiced regions, formants are used more for visualization in speech analysis than for implementation in speech recognition. </div>ajclnoreply@blogger.com2tag:blogger.com,1999:blog-16759638.post-1127886201191105682005-09-27T22:43:00.000-07:002005-10-20T23:26:16.203-07:00Speech Recognition: Formants for Vowels Classification (I)<p><strong>1. How to </strong>classify <strong>vowels </strong>for speech recognition?</p>Different approaches are used for this purpose; however, the most basic one might be to use “formants” for classification.<br /><br />You might use a search engine, or the Google search bar in the sidebar of this page, to search for “<strong>speech classification</strong>”<br /><br /><br /><p><strong>2. What is </strong>“<strong>formant”</strong>?</p>Formants are the natural frequencies, or resonances, produced by the vocal tract when someone speaks. The following figure shows a typical FFT spectrum and the spectrum from the LPC autocorrelation method for a segment of speech spoken by a male speaker. In the LPC spectrum, three significant resonances can be seen; they are named F1, F2 and F3 respectively.<br /><br /><img style="DISPLAY: block; MARGIN: 0px auto 10px; CURSOR: hand; TEXT-ALIGN: center" alt="" src="http://photos1.blogger.com/blogger/7013/1599/320/img13.JPG" border="0" /><br /><br /><p><strong>3. How to </strong>find the formant values from the LPC coefficients?</p><strong>Visually </strong>they can be obtained easily, but not accurately. To obtain the formants numerically or mathematically, firstly, you can perform a <strong>search on the data </strong>itself, that is, look for the peaks of the LPC spectrum. 
Secondly, by taking the <strong>angle of the roots of the LPC polynomial, </strong>you obtain the formants directly: a root at angle theta (in radians) corresponds to a frequency of theta * fs / (2 * pi), where fs is the sampling rate.<br /><br /><p><strong>4. How to </strong>perform classification for the vowels using formants?</p>In theory, once you have obtained the formants, the classification task can be performed with simple classification methods as well as neural network (NN) methods.<br /><br />You might use a search engine, or the Google search bar in the sidebar of this page, to search for “<strong>speech recognition</strong>”ajclnoreply@blogger.com1
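The root-angle method from question 3 above can be illustrated with a short Python sketch. It is not the author's MATLAB code, and everything in it is an assumption made for illustration: the helper names are hypothetical, the LPC polynomial is built from made-up resonances at 700, 1200 and 2500 Hz instead of being estimated from speech, and the roots are found with a plain Durand-Kerner iteration so that only the standard library is needed (a numerical package's root finder would normally be used instead).

```python
import cmath
import math


def poly_from_pole_pairs(freqs_hz, radius, fs):
    """Build A(z) = 1 + a1*z^-1 + ... with one conjugate pole pair per frequency."""
    a = [1.0]
    for f in freqs_hz:
        theta = 2 * math.pi * f / fs
        quad = [1.0, -2 * radius * math.cos(theta), radius ** 2]
        # Multiply the running polynomial by the quadratic factor.
        a = [sum(a[i] * quad[k - i] for i in range(len(a)) if 0 <= k - i < 3)
             for k in range(len(a) + 2)]
    return a


def poly_roots(coeffs, iterations=500, tol=1e-12):
    """Roots of a polynomial (highest power first) by Durand-Kerner iteration."""
    n = len(coeffs) - 1
    monic = [c / coeffs[0] for c in coeffs]
    roots = [(0.4 + 0.9j) ** k for k in range(n)]  # conventional starting points
    for _ in range(iterations):
        max_change = 0.0
        for i in range(n):
            value = 0
            for c in monic:  # Horner evaluation at the current estimate
                value = value * roots[i] + c
            denom = 1
            for j in range(n):
                if j != i:
                    denom *= roots[i] - roots[j]
            delta = value / denom
            roots[i] -= delta
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return roots


def formants_from_lpc(a, fs):
    """Convert the angles of the upper-half-plane roots of A(z) to frequencies in Hz."""
    roots = poly_roots(a)
    # Keep one root of each conjugate pair; real roots are not resonances.
    return sorted(cmath.phase(r) * fs / (2 * math.pi) for r in roots if r.imag > 1e-6)


fs = 8000
a = poly_from_pole_pairs([700, 1200, 2500], radius=0.97, fs=fs)
formants = formants_from_lpc(a, fs)
print([round(f, 1) for f in formants])
```

With coefficients estimated from real speech, some roots lie far inside the unit circle and represent broad spectral tilt rather than formants, so an additional check on the root magnitude (bandwidth) is usually needed before labelling the results F1, F2 and F3.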