\rhead{Chris Taylor and Varun Madhok} \lhead{Lab 4: Speech Compression via Linear Predictive Coding -- Sample lab report} \lfoot{December 12, 1996} \cfoot{EE-649 -- Speech Processing}

\ctsec{Introduction}

For this project we were required to design a method for representing 16 kHz speech waveforms at a rate of 1800 parameters per second. A number of possible methods were considered. An obvious, simple solution would be to lowpass filter the speech signal to meet the 1800-parameters-per-second requirement. This would remove the high-frequency content of the speech but would retain the frequencies below 900 Hz, which are enough to provide intelligible speech. While this would provide a solution, it seems to be a cheap way out. As a result, we also considered a number of other possibilities. These included adaptive predictive coding, adaptive transform coding, sub-band coding using adaptive bit allocation, sub-band adaptive predictive coding, and vector quantization. It was at this point that we realized that we needed to set a design objective in conjunction with picking a compression approach. Motivated by the generally warm, fuzzy feeling from \emphc{Linear Predictive Coding} (LPC) in the third project, we set the following design goal:
\begin{verse}
{\sc Develop a speech compression technique that produces reasonably intelligible male speech with as few parameters per second as possible.}\footnote{We limited ourselves to male speech since all of our training/testing speech was spoken by male speakers.}
\end{verse}

\ctsec{Design Process}

Throughout this section we use the ``sun'' sound bite from the first project to help illustrate our motivation for various design decisions. We resampled the speech signal at 16 kHz in order to ensure an optimal match with the LPC codebook, which we assume was trained on 16 kHz speech data. Figure 1 shows the original ``sun'' signal.
\begin{center}
\includegraphics[width=.6\textwidth]{CCTorg}

Figure 1: Original speech waveform for ``sun''
\end{center}

\ctssec{Vocal Tract}

Our first design decision (other than choosing our design goal) found early and unanimous agreement. We settled on using LPC to model the vocal tract. Furthermore, we restricted our LPC model to a twenty-pole filter characterizing 30 msec speech frames. This restriction allowed us to take advantage of the previously trained \emphc{Vector Quantization} (VQ) codebooks that we used in the third project. At this point the vocal tract model was fixed as VQ on LPC coefficients of non-overlapping, Hamming-windowed, 30 msec speech frames. As in the third project, we used the Euclidean distance metric on the cepstral coefficients to select the appropriate codeword from the ``all\_males'' VQ codebook.
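For concreteness, the following is a minimal sketch of this codeword-selection step; the complete routine used by our program appears in \textttc{code\_select.c} at the end of this report. The function name \textttc{nearest\_codeword}, the fixed array sizes, and the use of squared distances are our own illustration rather than the exact interface of the attached code.
\begin{lstlisting}{}
/* Sketch: pick the vocal-tract codeword for one frame by minimizing the
   Euclidean distance between the frame's cepstral coefficients and each
   entry of the cepstral codebook; the LPC coefficients stored at the
   winning index are then used for synthesis.  Names and fixed sizes are
   illustrative only. */
#include <float.h>

#define FILTER_ORDER 20     /* twenty-pole LPC model  */
#define NUM_CODES    1024   /* e.g. a 10 bit codebook */

int nearest_codeword(const float frame_cep[FILTER_ORDER],
                     float code_cep[NUM_CODES][FILTER_ORDER])
{
    int    i, j, best = 0;
    double d, best_d = DBL_MAX;

    for (i = 0; i < NUM_CODES; i++) {
        d = 0.0;
        for (j = 0; j < FILTER_ORDER; j++) {
            double diff = frame_cep[j] - code_cep[i][j];
            d += diff * diff;   /* squared distance; same minimizer */
        }
        if (d < best_d) {
            best_d = d;
            best   = i;
        }
    }
    return best;   /* row index into the LPC codebook */
}
\end{lstlisting}
The returned index is simply used to look up the corresponding row of the LPC codebook when the frame is resynthesized.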
The remainder of the design process involved modeling the error signal.

\ctssec{Excitation}

We model the error signal generated by the LPC vocal tract analysis as the excitation component of the speech waveform. We will use ``excitation signal'' and ``error signal'' interchangeably. A wide variety of excitation models exist in the literature. In this section we will describe a number of approaches that we considered. We will also describe some of the results for the ones we actually implemented.

On the extreme ends lie two options. One option is to ignore the excitation and just use the vocal tract information to reconstruct the signal. We call this approach \emphc{complete ignorance}. This approach is appealing in that it allows our compression scheme to achieve a parameter rate of just over 33 parameters per second. While the compression rate is extremely good, the quality of the speech (as perceived by a human) is rather low. In fact, the output signal is identically zero. This occurs because the LPC synthesis filter is driven by an excitation that is identically zero, so the codebook coefficients have nothing to shape. At the other extreme is a method that spends nearly all of the available 1800 parameters per second on the excitation. This could be done in a way similar to what was described above, where the compression operation involved only lowpass filtering. Here we model the excitation signal by lowpass filtering the error signal from the LPC modeling down to a rate that requires $1800 - 34 = 1766$ parameters per second. This corresponds to sampling the excitation just under 1800 times per second, which preserves frequency content up to just under 900 Hz. While much of the frequency content is lost, the key component (the pitch frequency) is retained. Although this approach holds promise for producing high-quality speech, we did not implement it because it would not meet our design goal.

Since the \emphc{complete ignorance} approach aligned more closely with our design goal, we return to it and try to salvage it by introducing some modifications. From this return come a number of methods: methods that we call \emphc{serious ignorance}, \emphc{moderate ignorance}, and a family of methods labeled \emphc{mild ignorance}. \emphc{Serious ignorance} involves one slight modification to the \emphc{complete ignorance} method. Instead of completely ignoring the excitation signal, in this approach we calculate the standard deviation of the excitation signal over the entire speech segment. This increases the parameter rate only slightly. Assuming a speech segment of two seconds results in a parameter rate under 34 parameters per second. When reconstructing the signal, we generate white noise with the calculated standard deviation and use it as the excitation signal. The \emphc{moderate ignorance} approach is very similar to this except that we now calculate the standard deviation over each frame. This results in a parameter rate of 67 parameters per second. Both of these approaches are founded on the premise that the LPC modeling is a whitening process and the resultant error signal (which we assume to be our excitation signal) is white noise. While this works well for unvoiced speech, it does not perform well for voiced speech. Even so, it is interesting to note that the resultant speech remains largely intelligible. This makes sense: whispered speech is quite intelligible yet contains no voiced speech. In fact, the reconstructed speech using the \emphc{serious ignorance} method (see Figure 2) and the \emphc{moderate ignorance} method (see Figure 3) does sound much like whispered speech.
\begin{center}
\includegraphics[width=.6\textwidth]{CCTserious}

Figure 2: Output for ``sun'' using \emphc{serious ignorance}
\end{center}
\begin{center}
\includegraphics[width=.6\textwidth]{CCTmoderate}

Figure 3: Output for ``sun'' using \emphc{moderate ignorance}
\end{center}
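To make the frame-level bookkeeping of these two methods concrete, the following is a minimal sketch of the only excitation analysis and synthesis they require: measure the standard deviation of the LPC error over a frame, and later regenerate that frame as white noise with the same standard deviation. The helper names and the Box--Muller noise generator are our own illustration; the attached sources instead use the \textttc{normal} routine described in \textttc{hw4.h}.
\begin{lstlisting}{}
/* Sketch: per-frame excitation handling for the moderate ignorance model.
   The only excitation parameter kept per frame is the standard deviation
   of the LPC error; reconstruction replaces the error with white noise of
   that standard deviation.  Helper names are illustrative. */
#include <math.h>
#include <stdlib.h>

/* standard deviation of one frame of the LPC error signal */
float frame_stdev(const float err[], int len)
{
    int   i;
    float mean = 0.0f, var = 0.0f;

    for (i = 0; i < len; i++) mean += err[i];
    mean /= (float)len;
    for (i = 0; i < len; i++) var += (err[i] - mean) * (err[i] - mean);
    var /= (float)len;
    return (float)sqrt(var);
}

/* fill one frame with zero-mean white Gaussian noise (Box-Muller) */
void white_noise_frame(float frame[], int len, float stdev)
{
    int i;
    for (i = 0; i < len; i++) {
        double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);   /* uniform in (0,1) */
        double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
        frame[i] = stdev * (float)(sqrt(-2.0 * log(u1)) *
                                   cos(2.0 * 3.14159265 * u2));
    }
}
\end{lstlisting}
\emphc{Serious ignorance} is the same sketch with a single standard deviation computed over the entire utterance rather than one per frame.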
In both the \emphc{serious ignorance} and \emphc{moderate ignorance} approaches we assume that the entire speech segment is unvoiced. In nearly every case of speech, this assumption is invalid. In order to improve the quality of the reconstructed speech, we describe a family of speech compression techniques that do not assume the entire speech segment to be unvoiced. In order to remove this assumption we need to perform two tasks -- classify each frame as voiced or unvoiced and estimate the pitch period for voiced frames. A plethora of techniques have been developed for performing these tasks, and many variations can be had on each technique. We initially drew our ideas from Rabiner et al.\ (1976). Among our pitch detection alternatives were cepstral analysis; autocorrelation methods (center clipping prior to autocorrelation calculation (CLIP) and autocorrelation performed on the LPC error signal (SIFT)); a slightly modified autocorrelation method called the Average Magnitude Difference Function (AMDF), which subtracts instead of multiplying in the autocorrelation summation; and a parallel processing method based on an elaborate voting scheme. We immediately dismissed the parallel processing method due to its complexity and little promise of significantly superior performance. Based on our design objective we proposed to use the pitch detection algorithm that produced the most perceptually pleasing results. McGonegal (1977) reported that, of these methods, AMDF offered the best results. At this point it is necessary for us to write a ``weaselly'' sentence or two to explain why we didn't actually do this. The bottom line is that a different group did this, and we listened to their results and found that they weren't much different from ours using the cepstral analysis method. While it is true that a number of methods exist for performing pitch detection, we chose to limit our implementation efforts to cepstral techniques. We did so because of their ease of implementation and intuitive appeal.

We implemented the cepstral analysis as outlined in our second project. The cepstral coefficients are then used to determine whether the frame contains voiced or unvoiced speech. If the speech is determined to be voiced, an estimate of the pitch period is also obtained. By default our algorithm focuses on the cepstral coefficients representing the frequency range from 100 to 270 Hz.\footnote{Due to the speaker-dependent nature of the cepstral approach to pitch detection, we have included an input parameter to adjust this as needed.} Our algorithm calculates the mean value of the nonnegative coefficients in this range. If the peak value is greater than 1.5 times this mean, the frame is classified as voiced; the pitch period is set from the location of the maximum-valued coefficient and is stored as the first excitation modeling parameter. If the peak value is less than 1.5 times the mean, the frame is classified as unvoiced, and the first excitation modeling parameter is set to zero. In either case, the standard deviation of the excitation signal is calculated and stored as the second excitation modeling parameter. This processing results in two model parameters for each frame. While it would be possible to choose the frame size for the excitation modeling arbitrarily, for simplicity we chose to remain consistent with the frame length used in the vocal tract modeling, i.e., 30 msec. As a result, we have three parameters for every 30 msec frame, or just under 100 parameters per second.
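The voiced/unvoiced decision itself amounts to a small amount of arithmetic on the cepstrum. The following is a minimal sketch of that decision for one frame, assuming the (liftered) cepstral coefficients are already available and that \textttc{lo} and \textttc{hi} bound the quefrency range corresponding to 100--270 Hz. The helper name is hypothetical, but the 1.5 threshold and the convention of returning $-1$ for an unvoiced frame match the attached \textttc{hw4.c}.
\begin{lstlisting}{}
/* Sketch: voiced/unvoiced decision and pitch estimate from one frame's
   cepstrum.  cep[lo..hi] covers the expected pitch range (100--270 Hz by
   default).  Returns the pitch period in samples for a voiced frame, or
   -1 for an unvoiced frame. */
static int pitch_decision(const double cep[], int lo, int hi)
{
    int    j, max_index = lo, nonneg_count = 0;
    double peak = cep[lo], nonneg_sum = 0.0;

    for (j = lo; j <= hi; j++) {
        if (cep[j] > peak) {            /* track the cepstral peak     */
            peak      = cep[j];
            max_index = j;
        }
        if (cep[j] >= 0.0) {            /* mean of nonnegative samples */
            nonneg_sum += cep[j];
            nonneg_count++;
        }
    }
    if (nonneg_count > 0 && peak > 1.5 * (nonneg_sum / nonneg_count))
        return max_index;               /* voiced: peak location gives the period */
    return -1;                          /* unvoiced */
}
\end{lstlisting}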
We reconstruct the excitation signal as follows. For an unvoiced frame the excitation signal is white noise with standard deviation equal to the second excitation parameter. For a voiced frame we generate a periodic signal using the function
\[ e_{n} = r_{n} + \frac{\alpha m}{1 + \alpha m^{2}}, \qquad m = n \bmod \gamma, \]
where $r_{n}$ is a white noise sequence with the same standard deviation as the excitation signal, $\alpha$ determines the steepness of the slope, and $\gamma$ is the pitch period. This function provides a periodic excitation signal that retains a white noise component approximating that of the excitation signal. The vocal tract and excitation information are combined via
\[ s_{n} = e_{n} - \sum_{k=1}^{20} b_{k}s_{n-k}, \]
where $e_{n}$ is the excitation signal and $b_{k}$ are the LPC codebook coefficients.

We performed cepstral analysis on the original signal (henceforth referred to as \emphc{{\sc scep} mild ignorance}) and on the excitation signal (henceforth referred to as \emphc{{\sc ecep} mild ignorance}). The \emphc{{\sc scep} mild ignorance} method provided useful results; however, the \emphc{{\sc ecep} mild ignorance} method is unable to detect voiced frames. Unfortunately, we did not have time to fully explore why this happens. In any case, the analysis is the same for both methods. The only difference is the signal analyzed. Figure 4 presents the sound bite ``sun'' after processing by the cepstral analysis on the original signal.
\begin{center}
\includegraphics[width=.6\textwidth]{CCTmild}

Figure 4: Output for ``sun'' using \emphc{{\sc scep} mild ignorance}
\end{center}
While the plots thus far are instructive, plots of the excitation signal alone provide a clearer view of the excitation modeling. These plots are included in Figures 5 -- 7 for the original excitation signal, the excitation modeled by \emphc{moderate ignorance}, and \emphc{{\sc scep} mild ignorance} respectively. It should be obvious that the \emphc{{\sc scep} mild ignorance} approach provides a much better model for the excitation.
\begin{center}
\includegraphics[width=.6\textwidth]{CCTe_org}

Figure 5: Original excitation for ``sun''
\end{center}
\begin{center}
\includegraphics[width=.6\textwidth]{CCTe_moderate}

Figure 6: Excitation for ``sun'' using \emphc{moderate ignorance}
\end{center}
\begin{center}
\includegraphics[width=.6\textwidth]{CCTe_mild}

Figure 7: Excitation for ``sun'' using \emphc{{\sc scep} mild ignorance}
\end{center}
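Before turning to the evaluation, the following is a minimal sketch of the reconstruction defined by the two equations above: a voiced excitation frame built from the pulse function plus white noise, fed through the all-pole synthesis filter. It reuses the \textttc{white\_noise\_frame} helper sketched earlier; the value of $\alpha$, the scaling of the pulse train, and the buffer layout are illustrative choices and do not reproduce the exact constants used in the attached \textttc{voiced\_error\_gen.c}.
\begin{lstlisting}{}
/* Sketch: voiced excitation plus LPC synthesis for one frame.
   gamma is the pitch period in samples, b[1..20] holds the LPC codebook
   coefficients for the frame (b[0] = 1.0 is unused here), and s[] holds
   the running output signal so that past samples are available as filter
   memory.  ALPHA is an illustrative steepness constant. */
#define P     20      /* filter order                */
#define ALPHA 0.25    /* slope of the pulse function */

void white_noise_frame(float frame[], int len, float stdev);  /* as sketched earlier */

void voiced_excitation(float e[], int len, float stdev, int gamma)
{
    int i;
    white_noise_frame(e, len, stdev);            /* r_n component  */
    for (i = 0; i < len; i++) {
        int m = i % gamma;                       /* periodic index */
        e[i] += stdev * (float)(ALPHA * m / (1.0 + ALPHA * m * m));
    }
}

/* s_n = e_n - sum_{k=1..P} b_k * s_{n-k}, applied across one frame that
   begins at sample n0 of the output buffer s[] */
void synthesize_frame(float s[], const float e[], const float b[],
                      int n0, int len)
{
    int n, k;
    for (n = 0; n < len; n++) {
        float acc = e[n];
        for (k = 1; k <= P; k++)
            if (n0 + n - k >= 0)
                acc -= b[k] * s[n0 + n - k];
        s[n0 + n] = acc;
    }
}
\end{lstlisting}
For an unvoiced frame the same \textttc{synthesize\_frame} call is driven by the output of \textttc{white\_noise\_frame} alone.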
\ctsec{Discussion}

There exist a large number of reasonable approaches for reaching our design goal. We have considered a number of them and have actually implemented a subset of that number. Since our design goal was founded on intelligibility, we concluded that a quantitative evaluation would be of little use in assessing our ability to achieve our objective. Instead we relied on subjective assessments. Our assessments are rather imprecise and are aimed at providing a feel for our experiences as opposed to a definitive argument for a particular approach. Table 1 contains our estimates of the percentage of intelligible speech present in each speech signal for the two methods included in our final program.

There are five approaches that we evaluated --- \emphc{complete ignorance}, \emphc{serious ignorance}, \emphc{moderate ignorance}, \emphc{{\sc ecep} mild ignorance}, and \emphc{{\sc scep} mild ignorance}. As its name suggests, \emphc{complete ignorance} did not perform very well. The resulting speech waveform was unintelligible. Although the standard deviation varied significantly from frame to frame, the difference in intelligibility between \emphc{serious ignorance} and \emphc{moderate ignorance} was not as pronounced as we had expected. Both approaches resulted in reasonably intelligible speech. One implication of these approaches is the lack of any voiced speech. This resulted in the impression that the processed speech sounded as if it were being whispered. While this was a significant deviation from the original speech, it did not reduce the intelligibility significantly. It would seem that at this point we had met our design criteria. These approaches allow us to achieve compression rates of 34 and 67 parameters per second respectively while still maintaining reasonably intelligible speech.

The two \emphc{mild ignorance} methods attempted to reduce the ``whisper effect'' by including voiced speech frames. These methods increased our parameter burden to 100 parameters per second (still well below the 1800 parameters per second that we were given to work with). The \emphc{{\sc ecep} mild ignorance} method failed to identify voiced speech. As a result, the output was the same as that of the \emphc{moderate ignorance} approach. While the \emphc{{\sc scep} mild ignorance} approach was moderately successful in reducing the whisper quality of the speech, there were a few shortcomings. One significant disadvantage was that the threshold was somewhat speaker dependent. This shortcoming is most likely due to our choice of pitch detector. The cepstral pitch detection method is known for its thresholding ambiguity, and it may be that we could alleviate this problem by selecting a different pitch detection method such as the AMDF. This could be done with a simple modification, and the general compression framework would remain the same. Another disadvantage is that the transitions between voiced and unvoiced frames occasionally produce an audible artifact. It may be possible to incorporate some sort of transition smoothing to eliminate this; however, we did not explore this option.
\begin{center}
\begin{tabular}{|c|r|r|r|r|r|r|}
\cline{2-7}
\multicolumn{1}{c|}{} & \multicolumn{3}{|c|}{\emphc{{\sc scep} mild ignorance}} & \multicolumn{3}{|c|}{\emphc{Moderate ignorance}} \\ \hline
\multicolumn{1}{|c|}{Sentence} & \multicolumn{3}{|c|}{Speaker number} & \multicolumn{3}{|c|}{Speaker number} \\
\multicolumn{1}{|c|}{number} & \multicolumn{1}{|c}{1} & \multicolumn{1}{c}{2} & \multicolumn{1}{c|}{3} & \multicolumn{1}{|c}{1} & \multicolumn{1}{c}{2} & \multicolumn{1}{c|}{3} \\ \hline
1 & 80\% & 60\% & 50\% & 70\% & 20\% & 20\% \\ \hline
2 & 60\% & 70\% & 70\% & 30\% & 50\% & 30\% \\ \hline
3 & 70\% & 40\% & 100\% & 20\% & 20\% & 30\% \\ \hline
4 & 70\% & 60\% & 90\% & 40\% & 20\% & 20\% \\ \hline
5 & 80\% & 80\% & 90\% & 40\% & 10\% & 20\% \\ \hline
\end{tabular}

Table 1: Percentage of intelligible speech
\end{center}
Our project guidelines made it clear that we were not to concern ourselves with the number of bits required to represent the speech; however, it may be of interest to note that our approach can easily be modified to squeeze as much information out of each bit as possible. We chose to use a 10 bit codebook for the LPC coefficients, but we certainly could have reduced this without much loss of intelligibility. A 6 bit codebook should suffice. As we saw in the comparison between the \emphc{serious ignorance} and \emphc{moderate ignorance} approaches, the standard deviation estimate is not very sensitive.
For the sake of discussion we will assume that we can quantize this estimate to 4 bits. The remaining parameter contains information on the pitch period. We also use this parameter to indicate whether the speech frame contains voiced or unvoiced data. This is done by setting the pitch period equal to zero if the frame contains an unvoiced speech segment. This approach allows us to reserve one quantization level of the pitch period parameter as a flag for unvoiced speech. Because of the narrow range of possible pitch periods, we hypothesize that we can quantize this parameter to 4 bits. Table 2 indicates the parameter and bit rates using these quantization levels for the various approaches that we implemented.
\begin{center}
\begin{tabular}{|l|r|r|}
\hline
\multicolumn{1}{|c|}{Compression technique} & \multicolumn{1}{|c|}{Parameters per second} & \multicolumn{1}{|c|}{Bits per second} \\ \hline
\emphc{complete ignorance} & 33.3 & 200 \\ \hline
\emphc{serious ignorance} & $33.3 + 1$ & $200 + 4$ \\ \hline
\emphc{moderate ignorance} & 66.6 & 667 \\ \hline
\emphc{{\sc ecep} mild ignorance} & 99.9 & 1400 \\ \hline
\emphc{{\sc scep} mild ignorance} & 99.9 & 1400 \\ \hline
\end{tabular}

Table 2: Compression rates
\end{center}
All of these bit rates could be reduced further by additional coding techniques. For example, the \emphc{mild ignorance} techniques could make good use of Huffman coding. It should be evident from Figure 7 that the voiced/unvoiced decision remains consistent for a few frames at a time. As a result, all neighboring unvoiced frames will share the same value for their pitch period parameter. If we store the LPC codebook parameter for all the frames first, then the pitch period parameter for all of the frames next, and then the standard deviation parameter last, the sequence of pitch period parameters should compress significantly whenever a sequence of unvoiced frames appears consecutively.

\ctsec{Additional Notes}

The entire project was programmed in `C' and the source code is attached at the end of this report. Also, the last page of the report (after the source code) is the ``Project 4S Information Sheet.'' Our executable code allows two modes of operation. The default mode processes the speech data using the \emphc{{\sc scep} mild ignorance} method. Using the \textttc{+N} flag will cause the program to process the speech data using the \emphc{moderate ignorance} method instead. Please refer to the manpage included just prior to the source code, refer to the README file, or run the program with the \textttc{-help} option for more information on the command syntax. All of the files for our project can be found in \textttc{/home/offset/a/taylor/SpeechStuff}. Some files exist in each directory and the others are symbolically linked. Our program generates ASCII speech files. In order to listen to the output, we converted it to binary speech files, used a package called ``sox'' to convert each file to a Sun AU file, and then used ``audioplay'' on the Suns and ``send\_sound'' on the HPs to play the result.

\newpage

\ctsec{Bibliography}

\begin{blist}
\item L.R.\ Rabiner, M.J.\ Cheng, A.E.\ Rosenberg, and C.A.\ McGonegal, ``A Comparative Performance Study of Several Pitch Detection Algorithms,'' \emphc{IEEE Transactions on Acoustics, Speech, and Signal Processing}, vol.\ ASSP-24, no.\ 5, pp.\ 399--418, 1976.
\item C.A.\ McGonegal, ``A Subjective Evaluation of Pitch Detection Methods Using LPC Synthesized Speech,'' \emphc{IEEE Transactions on Acoustics, Speech, and Signal Processing}, vol.\ ASSP-25, no.\ 6, 1977.
\end{blist}

\ctsec{Source Files}

\ctssec{hw4.h}

\begin{lstlisting}{}
/*********************************************************************
 Authors: Varun Madhok and Chris Taylor
 Date:    December 6, 1996
 File:    hw4.h
 Purpose: This header file contains the function prototypes for the
          speech compression application that was part of our fourth
          homework assignment for EE649 -- Speech Processing
 Notes:   The following subroutines have been copied (mostly) from the
          text 'Numerical Recipes in C' by Press, Teukolsky, Flannery
          and Vetterling.  The source code however has not been
          submitted.
          (float *)vector     : allocates memory for a floating point array;
          (double *)dvector   : allocates memory for an array with double
                                elements;
          (double *)c_dvector : allocates memory for an array with double
                                elements with initialization to zero;
          (int *)ivector      : allocates memory for an array with integer
                                elements;
          void free_vector    : frees memory allocated for a floating point
                                array;
          void free_ivector   : frees memory allocated for an integer array;
          void free_dvector   : frees memory allocated for a double array;
          void dfour1         : carries out FFT on input array.  Original
                                array is replaced by the FFT thereof.  To
                                work with complex data, the convention used
                                is to assign all real values to the even
                                indices and the imaginary components to the
                                odd indices of the array (assuming first
                                index is zero);
          void normal         : white noise generation subroutine with mean
                                0 and variance 1.
***********************************************************************/

/* Definitions for constants in our simple program.  If this were more
   than an experimental application, these constants should be parameters
   whose values could be selected at runtime. */
#define DEF_DAT         7680
#define SEGMENT_LENGTH  480
#define IN_DEF_FILE     "sun.ascii.Z"
#define OUT_DEF_FILE    "out.temp"
#define CODE_DEF_DIR    "male"
#define DEF_CODEBK_SIZE 2

#if defined(__STDC__) || defined(ANSI) || defined(NRANSI)

/* fftmag: Calculates the n point FFT of s and stores the magnitude of
   the result in mag.
   Notes: n must be a power of two with n <= 1024
          mag stores the magnitude, not the log magnitude */
int fftmag(double s[], double mag[], int n);

/* hamm: Calculates the Hamming windowed version of an n sample signal s
   and stores the result in hs (uses float precision) */
void hamm(float s[], float hs[], int n);

/* dhamm: Calculates the Hamming windowed version of an n sample signal s
   and stores the result in hs (uses double precision) */
void dhamm(double s[], double hs[], int n);

/* lpc: Calculates p Linear Predictive Coding coefficients b[1], ..., b[p];
   (b[0] = 1.0)
   The LPC coefficients approximate the signal x[].
   Convention used: signs of the b[k]'s are such that the denominator of
   the transfer function is of the form
       1 + (sum from k=1 to p of b[k]*z**(-k))
   This is the normal convention for the inverse filtering formulation.
   errn = normalized minimum error
   rmse = root mean square energy of the x[i]'s
   n    = number of data points in frame
   p    = number of coefficients
        = degree of inverse filter polynomial, p <= 40 */
int lpc(float x[], int n, int p, float b[], float *rmse, float *errn);

/* voiced_error_gen: Generates a seg_len length voiced error signal,
   segment, which is a sequence of pulses (with a period of
   pitch_period/2) corresponding to the excitation signal for voiced
   speech, generated using the function f(x) = ax/(1+a*x*x).  A constant
   multiplicative factor based on the standard deviation measured over
   the actual error signal is used to modulate the signal to the
   appropriate amplitude.  White gaussian noise with a standard deviation
   of err_stdev is added. */
void voiced_error_gen(float *segment, int seg_len, float err_stdev,
                      int pitch_period);

/* unvoiced_error_gen: Generates a seg_len length unvoiced error signal,
   segment, which is just white noise with a standard deviation of
   err_stdev */
void unvoiced_error_gen(float *segment, int seg_len, float err_stdev);

/* code_select: Selects the appropriate codeword for each frame.
   **real_cep: This is the array of cepstral coefficients generated frame
               by frame over the entire speech signal.
   **code_cep: This contains the codebook for the cepstral coefficients.
   **code_lpc: This contains the codebook for the LPC coefficients.
   **codeword: Once the best match between the input word and that from
               the codebook (cepstral) is found, the corresponding word
               from the LPC codebook is transferred to 'codeword' as the
               output to be used in speech generation.
 */
void code_select(float **real_cep, float **code_cep, float **code_lpc,
                 float **codeword, int seg_num, int num_codes,
                 int filter_order);

/* wr_error: If n is zero it prints an error and exits; otherwise, it
   prints an okay message and continues */
void wr_error(int n);

/* print_directions: Displays usage instructions */
void print_directions();

#else
void hamm();
void dhamm();
int fftmag();
int lpc();
void voiced_error_gen();
void unvoiced_error_gen();
void code_select();
void wr_error(int n);
void print_directions();
#endif
\end{lstlisting}

\ctssec{hw4.c}

\begin{lstlisting}{} /****************************************************************************** Authors: Varun Madhok and Chris Taylor Date: December 6, 1996 File: hw4.c Purpose: This file contains the main application for the speech compression application that was part of our fourth homework assignment for EE649 -- Speech Processing ******************************************************************************/ #include #include #include "/home/offset/a/taylor/Src/Recipes/recipes/nrutil.h" #include "/home/offset/a/taylor/Src/Recipes/recipes/nr.h" #include "/home/offset/a/taylor/Src/Recipes/Vrecipes/randlib.h" #include "hw4.h" #define MOD_FACTOR 1.5 #define OTHER 0 #define MALE 1 #define FEMALE 2 #define CHILD 3 int main(int argc, char *argv[]) { int i; int j; int k; int N_flag; int pole; int itemp; int num; int seg_len; int seg_num; int filter_order; int* data; int pad_location; int ID; int sampling_rate; int lifter_from_this_sample; int lifter_till_this_sample; float ftemp; float rmse; float errn; float* filter_coeffs; float* ceps_coeffs; float e; float* gen_e; float err_stdev; float err_mean; float* segment; float* windowed_segment; int non_zero_count; int max_index; int pitch_period; int num_codes; int category_is; /* long_segment is of length 1024 samples.
It comprises the windowed segment in the centre padded left and right by an appropriate number*/ double* long_segment; double* fft_segment; double non_zero_sum; double max_samp; FILE* infile; FILE* errfile; FILE* gen_errfile; FILE* cepsfile; FILE* lpcfile; float* gen_err; float** real_cep; float** code_cep; float** code_lpc; float** codeword; float* error_signal; float* output_signal; char fname[55]; char out_fname[55]; char temp_str[90]; char num_codes_string[8]; char code_fname[15]; char group_name[5]; char CODEBOOKS_EXIST; if (( argc > 1 ) && ( !strcmp (argv [1], "-help" ))) { print_directions(); } /*the default values are assigned here*/ strcpy(fname, IN_DEF_FILE); strcpy(out_fname, OUT_DEF_FILE); strcpy(code_fname, CODE_DEF_DIR); N_flag=1; pole=0; num_codes=DEF_CODEBK_SIZE; strcpy(num_codes_string, "2"); num=DEF_DAT; filter_order= 20; seg_len=SEGMENT_LENGTH; category_is=OTHER; ID=0; CODEBOOKS_EXIST=1; sampling_rate=16000; /*The for loop below works in the command line arguments into the program */ for(i=1;i=0; j--) { if(((k-1)*seg_len+j-pad_location)>=0) { long_segment[2*j]=0.0/*(double) data[(k-1)*seg_len+j-pad_location]*/; } else { long_segment[2*j]=0.0; } long_segment[2*j+1]=0.0; } /* Right pad*/ for(j=(pad_location+seg_len+1); j<1024; j++) { if(((k-1)*seg_len+j)lifter_till_this_sample)||(jmax_samp) { max_samp=long_segment[2*j]; max_index=j; } if((long_segment[2*j]>=0.0)&&(j<=lifter_till_this_sample)&& (j>=lifter_from_this_sample)) { non_zero_count++; non_zero_sum+=fabs(long_segment[2*j]); } } non_zero_sum/=non_zero_count; /* Pitch detection is done here : If the max value is greater than the average non-negative signal over the liftered signal, we claim a pitch to have been detected*/ if((max_samp>(MOD_FACTOR*non_zero_sum))&&(N_flag!=0)) { pitch_period=max_index; } else { pitch_period=-1; } lpc(windowed_segment, seg_len, filter_order, filter_coeffs, &rmse, &errn); /* Calculate error--->Initialization*/ for(j=0;j=0) { e+=filter_coeffs[i]*segment[j-i]; } } else { e+=filter_coeffs[i]*(float)data[(k-1)*seg_len+j-i]; } } if(!CODEBOOKS_EXIST) { fprintf(errfile, "%f\n", e); } err_mean+=e; err_stdev+=e*e; } err_mean/=(float)(seg_len); err_stdev/=(float)(seg_len); err_stdev-=(err_mean*err_mean); if(err_stdev>0.0) { err_stdev=sqrt(err_stdev); } else { err_stdev=0.0; } /* At this stage... use the voiced unvoiced decision plus standard deviation of the error signal to generate an 'error' signal. To recap - Parameters used are : a. (optional) Voiced/unvoiced flag : 0 if unvoiced, 1 if otherwise; b. standard deviation of the error for the frame; c. 
pitch period : -1 if unvoiced, something +ve if voiced; */ /* An excitation signal is generated as and how we have classified the frame */ if(pitch_period>0) { voiced_error_gen(gen_e, seg_len, err_stdev, pitch_period); } else { unvoiced_error_gen(gen_e, seg_len, err_stdev); } for(j=0; j new segment begins */ if(CODEBOOKS_EXIST) { codeword=(float **)matrix(1, seg_num, 1, filter_order); code_lpc=(float **)matrix(1, num_codes, 1, filter_order); /*read codebook LPC*/ code_cep=(float **)matrix(1, num_codes, 1, filter_order); /*read codebook CEPS*/ } /* Freeing memory */ free_ivector(data, 0, num-1); free_vector(gen_e, 0, seg_len-1); free_vector(windowed_segment, 0, seg_len-1); free_dvector(long_segment, 0, (2*1024)-1); free_dvector(fft_segment, 0, 1024-1); free_vector(segment, 0, seg_len-1); free_vector(filter_coeffs, 0, filter_order); free_vector(ceps_coeffs, 1, filter_order); if(CODEBOOKS_EXIST) { for(i=1; i<=num_codes; i++) { for(j=1; j<=filter_order; j++) { fscanf(cepsfile,"%f", &ftemp); code_cep[i][j]=ftemp; fscanf(lpcfile,"%f", &ftemp); code_lpc[i][j]=ftemp; } } /* At this stage... have frame by frame data on cepstral coefficients have codebooks on lpc and cepstral coeffs. Proceed with the association Output is stored in codeword */ code_select(real_cep, code_cep, code_lpc, codeword, seg_num, num_codes, filter_order); free_matrix(code_cep, 1, num_codes, 1, filter_order); free_matrix(code_lpc, 1, num_codes, 1, filter_order); /* Incorporate inverse filtering process */ output_signal=(float *)vector(1, num); for(k=1;k<=seg_num;k++) { for(i=1;i<=seg_len;i++) { output_signal[(k-1)*seg_len+i] = error_signal[(k-1)*seg_len+i]; for(j=1;j<=filter_order;j++) { /* Generating output using excitation signal and LPC coefficients from the codebook */ if(((k-1)*seg_len+i-j)>=1) { output_signal[(k-1)*seg_len+i] -= codeword[k][j]*output_signal[(k-1)*seg_len+i-j]; } } printf("%d\n", (int)output_signal[(k-1)*seg_len+i]); } } free_vector(output_signal, 1, num); free_matrix(codeword, 1, seg_num, 1, filter_order); fclose(lpcfile); fclose(cepsfile); } free_matrix(real_cep, 1, seg_num, 1, filter_order); free_vector(error_signal, 1, num); if(CODEBOOKS_EXIST==0) { fclose(errfile); } if(CODEBOOKS_EXIST==0) { fclose(gen_errfile); } writeseed(); return 0; } \end{lstlisting} \ctssec{code\_select.c} \begin{lstlisting}{} /***************************************************************************** Authors: Varun Madhok and Chris Taylor Date: December 6, 1996 File: code_select.c Purpose: This file contains the code_select function which selects the appropriate codebook for the speech being processed by the speech compression application that was part of our fourth homework assignment for EE649 -- Speech Processing *****************************************************************************/ #include void code_select(float **real_cep, float **code_cep, float **code_lpc, float **codeword, int seg_num, int num_codes, int filter_order) { int i; int k; int j; float err; float emin; for(k=1;k<=seg_num;k++) { emin = 9999999.9; for(i=1;i<=num_codes;i++) { err = 0.0; /* Measuring difference between the generated codeword and one from the cepstral codebook*/ for(j=1;j<=filter_order;j++) { err += (double)fabs((float)real_cep[k][j] - (float)code_cep[i][j]); } if(err male (default)\n"); printf(" female\n"); printf(" all_males\n"); printf(" all_females\n"); printf(" -segl n segment length\n"); printf(" -group *char group name to decide cepstrum liftering.\n"); printf(" Valid options are -> O or o (default);\n"); printf(" M or m 
male;\n"); printf(" F or f female;\n"); printf(" J or j child.\n"); printf(" +P use popen\n"); printf(" +N dont classify voiced/unvoiced\n"); printf("\nDESCRIPTION\n"); printf("Default input file : %s\n", IN_DEF_FILE); printf("Default codebook dir : %s\n", CODE_DEF_DIR); printf("Default codebook size : %d\n", DEF_CODEBK_SIZE); printf("Default number of records : %d\n", DEF_DAT); printf("Default segment length : %d\n", SEGMENT_LENGTH); printf("Default sampling rate : 16000 Hz\n"); printf("Default filter order : 20\n"); exit(0); } \end{lstlisting} \ctssec{unvoiced\_error\_gen.c} \begin{lstlisting}{} /***************************************************************************** Authors: Varun Madhok and Chris Taylor Date: December 6, 1996 File: unvoiced_error_gen.c Purpose: This file contains the unvoiced_error_gen function which generates the voiced error signal for the speech compression application that was part of our fourth homework assignment for EE649 -- Speech Processing *****************************************************************************/ #include #include #include "hw4.h" #include "/home/offset/a/taylor/Src/Recipes/Vrecipes/randlib.h" void unvoiced_error_gen(float *segment, int seg_len, float err_stdev) { int i; /* The unvoiced excitation signal is just white noise with the desired variance */ for (i=0; i #include #include "hw4.h" #include "/home/offset/a/taylor/Src/Recipes/Vrecipes/randlib.h" void voiced_error_gen(float *segment, int seg_len, float err_stdev, int pitch_period) { float var; float mult_factor; float ftemp; float const_factor; int i; int j; int num_peaks; var=err_stdev*err_stdev*(float)seg_len; num_peaks=(int)((float)seg_len/(float)pitch_period); mult_factor = 0.95*sqrt(var/(float) num_peaks); const_factor=10.0; j=0; for(i=0; i #define PI 3.14159265 void hamm(float s[], float hs[], int n) { double omega; double w; int k; omega=2*PI/(n-1); for(k=0; k #define PI 3.14159265 void dhamm(double s[], double hs[], int n) { double omega; double w; int k; omega=2*PI/(n-1); for(k=0; k #include #define PI 3.14159265 #define c_mag(c1) sqrt((c1.r)*(c1.r) + (c1.i)*(c1.i)) /* A structure to hold a complex number */ typedef struct { double r; double i; } COMPLEX; /* Authors: Varun Madhok and Chris Taylor Date: December 6, 1996 Purpose: Returns the product of two complex numbers c1 and c2 */ COMPLEX c_mult(COMPLEX c1, COMPLEX c2) { COMPLEX c3; c3.r=c1.r*c2.r - c1.i*c2.i; c3.i=c1.i*c2.r + c1.r*c2.i; return c3; } /* Authors: Varun Madhok and Chris Taylor Date: December 6, 1996 Purpose: Returns the sum of two complex numbers c1 and c2 */ COMPLEX c_add(COMPLEX c1, COMPLEX c2) { COMPLEX c3; c3.r=c1.r + c2.r; c3.i=c1.i + c2.i; return c3; } /* Authors: Varun Madhok and Chris Taylor Date: December 6, 1996 Purpose: Returns the difference of two complex numbers c1 and c2 */ COMPLEX c_sub(COMPLEX c1, COMPLEX c2) { COMPLEX c3; c3.r=c1.r - c2.r; c3.i=c1.i - c2.i; return c3; } /* Authors: Varun Madhok and Chris Taylor Date: December 6, 1996 Reference: Steiglitz, Introduction to Discrete Systems */ int fftmag(double s[], double mag[], int n) { int i; int j; int m; int l; int length; int loc1; int loc2; double arg; double w; COMPLEX c; COMPLEX z; COMPLEX f[1024]; for(i=0; i= m) j += n/(m+m); } f[i].r=s[j]; f[i].i=0; } for(length=2; length <= n; length += length) { w = -2.0*PI/(double)length; for(j=0; j #include #define MAX_LPC_ORDER 40 #define EVEN(x) !(x%2) int lpc(float x[], int n, int p, float b[], float* rmse, float* errn) { int i; int k; float reflect_coef[MAX_LPC_ORDER+1]; float 
auto_coef[MAX_LPC_ORDER+1]; float sum; float temp1,temp2; float current_reflect_coef; float pred_error; for(i=0; i<=p; i++) { sum = 0.0; for(k=0; k< n-i; k++) { sum += (x[k] * x[k+i]); } auto_coef[i] = sum; } *rmse = auto_coef[0]; if(*rmse == 0.0) { return 1; /* Zero power. */ } pred_error = auto_coef[0]; b[0] = 1.0; for (k=1; k<=p; k++) { sum = 0.0; for(i=0; i