
Richard V. Cox. "Speech Coding." 2000 CRC Press LLC. <http://www.engnetbase.com>.

Speech Coding
Richard V. Cox, AT&T Labs - Research

45.1 Introduction: Examples of Applications • Speech Coder Attributes
45.2 Useful Models for Speech and Hearing: The LPC Speech Production Model • Models of Human Perception for Speech Coding
45.3 Types of Speech Coders: Model-Based Speech Coders • Time Domain Waveform-Following Speech Coders • Frequency Domain Waveform-Following Speech Coders
45.4 Current Standards: Current ITU Waveform Signal Coders • ITU Linear Prediction Analysis-by-Synthesis Speech Coders • Digital Cellular Speech Coding Standards • Secure Voice Standards • Performance
References

45.1 Introduction

Digital speech coding is used in a wide variety of everyday applications that the ordinary person takes for granted, such as network telephony or telephone answering machines. By speech coding we mean a method for reducing the amount of information needed to represent a speech signal for transmission or storage applications. For most applications this means using a lossy compression algorithm because a small amount of perceptible degradation is acceptable. This section reviews some of the applications, the basic attributes of speech coders, methods currently used for coding, and some of the most important speech coding standards.

45.1.1 Examples of Applications

Digital speech transmission is used in network telephony. The speech coding used is just sample-by-sample quantization. The transmission rate for most calls is fixed at 64 kilobits per second (kb/s). The speech is sampled at 8000 Hz (8 kHz) and a logarithmic 8-bit quantizer is used to represent each sample as one of 256 possible output values. International calls over transoceanic cables or satellites are often reduced in bit rate to 32 kb/s in order to boost the capacity of this relatively expensive equipment. Digital wireless transmission has already begun. In North America, Europe, and Japan there are digital cellular phone systems already in operation with bit rates ranging from 6.7 to 13 kb/s for the speech coders. Secure telephony has existed since World War II, based on the first vocoder.
(Vocoder is a contraction of the words voice coder.) Secure telephony involves first converting the speech to a digital form, then digitally encrypting it and then transmitting it. At the receiver, it is decrypted, decoded, and reconverted back to analog.

© 1999 by CRC Press LLC

Current video telephony is accomplished through digital transmission of both the speech and the video signals. An emerging use of speech coders is for simultaneous voice and data. In these applications, users exchange data (text, images, FAX, or any other form of digital information) while carrying on a conversation. All of the above examples involve real-time conversations.

Today we use speech coders for many storage applications that make our lives easier. For example, voice mail systems and telephone answering machines allow us to leave messages for others. The called party can retrieve the message when they wish, even from halfway around the world. The same storage technology can be used to broadcast announcements to many different individuals. Another emerging use of speech coding is multimedia. Most forms of multimedia involve only one-way communications, so we include them with storage applications. Multimedia documents on computers can have snippets of speech as an integral part. Capabilities currently exist to allow users to make voice annotations onto documents stored on a personal computer (PC) or workstation.

45.1.2 Speech Coder Attributes

Speech coders have attributes that can be placed in four groups: bit rate, quality, complexity, and delay. For a given application, some of these attributes are predetermined while tradeoffs can be made among the others. For example, the communications channel may set a limit on bit rate, or cost considerations may limit complexity. Quality can usually be improved by increasing bit rate or complexity, and sometimes by increasing delay. In the following sections, we discuss these attributes. Primarily we will be discussing telephone bandwidth speech. This is a slightly nebulous term.
In the telephone network, speech is first bandpass filtered from roughly 200 to 3200 Hz. This is often referred to as 3 kHz speech. Speech is sampled at 8 kHz in the telephone network. The usual telephone bandwidth filter rolls off to about 35 dB by 4 kHz in order to eliminate the aliasing artifacts caused by sampling.

There is a second bandwidth of interest, referred to as wideband speech. The sampling rate is doubled to 16 kHz. The lowpass filter is assumed to begin rolling off at 7 kHz. At the low end, the speech is assumed to be uncontaminated by line noise and only the DC component needs to be filtered out. Thus, the highpass filter cutoff frequency is 50 Hz. When we refer to wideband speech, we mean speech with a bandwidth of 50 to 7000 Hz and a sampling rate of 16 kHz. This is also referred to as 7 kHz speech.

Bit Rate

Bit rate tells us the degree of compression that the coder achieves. Telephone bandwidth speech is sampled at 8 kHz and digitized with an 8-bit logarithmic quantizer, resulting in a bit rate of 64 kb/s. For telephone bandwidth speech coders, we measure the degree of compression by how much the bit rate is lowered from 64 kb/s. International telephone network standards currently exist for coders operating from 64 kb/s down to 5.3 kb/s. The speech coders for regional cellular standards span the range from 13 to 3.45 kb/s, and those for secure telephony span the range from 16 kb/s to 800 b/s. Finally, there are proprietary speech coders in common use which span the entire range.

Speech coders need not have a constant bit rate. Considerable compression can be gained by not transmitting speech during the silence intervals of a conversation. Nor is it necessary to keep the bit rate fixed during the talkspurts of a conversation.

Delay

The communication delay of the coder is more important for transmission than for storage applications. In real-time conversations, a large communication delay can impose an awkward protocol on talkers.
Large communication delays of 300 ms or greater are particularly objectionable to users even if there are no echoes.

Most low bit rate speech coders are block coders: they encode a block of speech, also known as a frame, at a time. Speech coding delay can be allocated as follows. First, there is algorithmic delay. Some coders have an amount of look-ahead or other inherent delays in addition to their frame size. The sum of frame size and other inherent delays constitutes algorithmic delay. The coder also requires computation; the amount of time required for this is called processing delay and is dependent on the speed of the processor used. Other delays in a complete system are the multiplexing delay and the transmission delay.

Complexity

The degree of complexity is a determining factor in both the cost and power consumption of a speech coder. Cost is almost always a factor in the selection of a speech coder for a given application. With the advent of wireless and portable communications, power consumption has also become an important factor. Simple scalar quantizers, such as linear or logarithmic PCM, are necessary in any coding system and have the lowest possible complexity. More complex speech coders are first simulated on host processors, then implemented on DSP chips, and may later be implemented on special-purpose VLSI devices. Speed and random access memory (RAM) are the two most important contributing factors of complexity: the faster the chip or the greater the chip size, the greater the cost. Generally 1 word of RAM takes up as much on-chip area as 4 to 6 words of read-only memory (ROM). Most speech coders are implemented on fixed point DSP chips, so one way to compare the complexity of coders is to measure their speed and memory requirements when efficiently implemented on commercially available fixed point DSP chips. DSP chips are available in both 16-bit fixed point and 32-bit floating point.
16-bit DSP chips are generally preferred for dedicated speech coder implementations because the chips are usually less expensive and consume less power than implementations based on floating point DSPs. A disadvantage of fixed-point DSP chips is that the speech coding algorithm must be implemented using 16-bit arithmetic. As part of the implementation process, a representation must be selected for each and every variable. Some can be represented in a fixed format, some in block floating point, and still others may require double precision. As VLSI technology has advanced, fixed point DSP chips contain a richer set of instructions to handle the data manipulations required to implement representations such as block floating point. The advantage of floating point DSP chips is that implementing speech coders is much quicker. Their arithmetic precision is about the same as that of a high-level language simulation, so the steps of determining the representation of each and every variable and how these representations affect performance can be omitted.

Quality

The attribute of quality has many dimensions. Ultimately quality is determined by how the speech sounds to a listener. Some of the factors that affect the performance of a coder are whether the input speech is clean or noisy, whether the bit stream has been corrupted by errors, and whether multiple encodings have taken place. Speech coder quality ratings are determined by means of subjective listening tests. The listening is done in a quiet booth and may use specified telephone handsets, headphones, or loudspeakers. The speech material is presented to the listeners at specified levels and is originally prepared to have particular frequency characteristics. The most often used test is the absolute category rating (ACR) test. Subjects hear pairs of sentences and are asked to give one of the following ratings: excellent, good, fair, poor, or bad.
A typical test contains a variety of different talkers and a number of different coders or reference conditions. The data resulting from this test can be analyzed in many ways. The simplest way is to assign a numerical ranking to each response, giving a 5 to the best possible rating, 4 to the next best, down to a 1 for the worst rating, then computing the mean rating for each of the conditions under test. This is referred to as a mean opinion score (MOS) and the ACR test is often referred to as a MOS test.

There are many other dimensions to quality besides those pertaining to noiseless channels. Bit error sensitivity is another aspect of quality. For some low bit rate applications such as secure telephones over 2.4 or 4.8 kb/s modems, it might be reasonable to expect the distribution of bit errors to be random, and coders should be made robust for low random bit error rates up to 1 to 2%. For radio channels, such as in digital cellular telephony, provision is made for additional bits to be used for channel coding to protect the information-bearing bits. Errors are more likely to occur in bursts, and the speech coder requires a mechanism to recover from an entire lost frame. This is referred to as frame erasure concealment, another aspect of quality for cellular speech coders.

For the purposes of conserving bandwidth, voice activity detectors are sometimes used with speech coders. During non-speech intervals, the speech coder bit stream is discontinued. At the receiver "comfort noise" is injected to simulate the background acoustic noise at the encoder. This method is used for some cellular systems and also in digital speech interpolation (DSI) systems to increase the effective number of channels or circuits. Most international phone calls carried on undersea cables or satellites use DSI systems. There is some impact on quality when these techniques are used. Subjective testing can determine the degree of degradation.
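The 64 kb/s log-PCM representation discussed under Bit Rate (8 kHz sampling, 8-bit logarithmic quantizer, 256 output values) can be sketched with the continuous μ-law companding curve. This is an illustrative approximation, not the exact segmented quantizer specified by G.711:

```python
import math

MU = 255  # mu-law constant used in North American 8-bit PCM

def mulaw_encode(x, bits=8):
    """Map a sample in [-1, 1] to a signed integer code (2**bits levels)."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round(y * (2 ** (bits - 1) - 1)))   # code in [-127, 127]

def mulaw_decode(code, bits=8):
    """Invert the companding curve back to a sample in [-1, 1]."""
    y = code / (2 ** (bits - 1) - 1)
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# 8000 samples/s x 8 bits/sample gives the 64 kb/s rate quoted in the text.
# Quiet samples keep a small *relative* error, which a uniform 8-bit
# quantizer would not achieve:
x = 0.01
x_hat = mulaw_decode(mulaw_encode(x))
```

The logarithmic curve spends its codes where speech spends its time, at low amplitudes, which is why the relative quantization error stays roughly level across a wide input range.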
45.2 Useful Models for Speech and Hearing

45.2.1 The LPC Speech Production Model

Human speech is produced in the vocal tract by a combination of the vocal cords in the glottis interacting with the articulators of the vocal tract. The vocal tract can be approximated as a tube of varying diameter. The shape of the tube gives rise to resonant frequencies called formants. Over the years, the most successful speech coding techniques have been based on linear prediction coding (LPC). The LPC model is derived from a mathematical approximation to the vocal tract representation as a variable diameter tube. The essential element of LPC is the linear prediction filter. This is an all-pole filter which predicts the value of the next sample based on a linear combination of previous samples. Let x_n be the speech sample value at sampling instant n. The object is to find a set of prediction coefficients {a_i} such that the prediction error for a frame of size M is minimized:

    \varepsilon = \sum_{m=0}^{M-1} \Big( \sum_{i=1}^{I} a_i x_{n+m-i} + x_{n+m} \Big)^2    (45.1)

where I is the order of the linear prediction model. The predicted value for x_n is given by

    \tilde{x}_n = - \sum_{i=1}^{I} a_i x_{n-i}    (45.2)

The prediction error signal {e_n}, with e_n = x_n - \tilde{x}_n, is also referred to as the residual signal. In z-transform notation we can write

    A(z) = 1 + \sum_{i=1}^{I} a_i z^{-i}    (45.3)

1/A(z) is referred to as the LPC synthesis filter and (ironically) A(z) is referred to as the LPC inverse filter.

LPC analysis is carried out as a block process on a frame of speech. The most often used techniques are referred to as the autocorrelation and the autocovariance methods [1]–[3]. Both methods involve inverting matrices containing correlation statistics of the speech signal. If the poles of the LPC filter are close to the unit circle, then these matrices become more ill-conditioned, which means that the techniques used for inversion are more sensitive to errors caused by finite numerical precision.
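As an illustrative sketch of the autocorrelation method (function names are mine, and in practice a data window is applied to the frame first), the normal equations can be solved with the Levinson-Durbin recursion, which also yields the reflection coefficients discussed in the next subsection:

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    return [sum(x[t] * x[t - k] for t in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for A(z) = 1 + sum a_i z^-i.

    Returns (a, ks, err): predictor coefficients a[0..order-1] (= a_1..a_I),
    reflection coefficients ks, and the final prediction error power err.
    """
    a = [1.0] + [0.0] * order
    ks = []
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # i-th reflection coefficient
        ks.append(k)
        a_new = a[:]
        for j in range(1, i):               # Levinson order-update
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        err *= 1.0 - k * k                  # error power shrinks each order
    return a[1:], ks, err

# For an ideal AR(1) source with correlation 0.9, r = [1, 0.9, 0.81]:
a, ks, err = levinson_durbin([1.0, 0.9, 0.81], 2)
# a = [-0.9, 0.0]: the predictor is xhat_n = 0.9 * x_{n-1}, and every
# |k| < 1, so the synthesis filter 1/A(z) is stable.
```

The sign conventions match Eqs. (45.1)–(45.3): A(z) has a leading 1 and the residual is x_n plus the weighted history.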
Various techniques for dealing with this aspect of LPC analysis include windows for the data [1, 2], windows for the correlation statistics [4], and bandwidth expansion of the LPC coefficients.

For forward adaptive coders, the LPC information must also be quantized and transmitted or stored. Direct quantization of LPC coefficients is not efficient. A small quantization error in a single coefficient can render the entire LPC filter unstable. Even if the filter is stable, sufficient precision is required and too many bits will be needed. Instead, it is better to transform the LPC coefficients to another domain in which stability is more easily determined and fewer bits are required for representing the quantization levels.

The first such domain to be considered is the reflection coefficient [5]. Reflection coefficients are computed as a byproduct of LPC analysis. One of their properties is that all reflection coefficients must have magnitudes less than 1, making stability easily verified. Direct quantization of reflection coefficients is still not efficient because the sensitivity of the LPC filter to errors is much greater when reflection coefficients are near 1 or −1. More efficient quantizers have been designed by transforming the individual reflection coefficients with a nonlinearity that makes the error sensitivity more uniform. Two such nonlinear functions are the inverse sine function, arcsin(k_i), and the logarithm of the area ratio, log[(1 + k_i)/(1 − k_i)].

A second domain that has attracted even greater interest recently is the line spectral frequency (LSF) domain [6]. The transformation is given as follows. We first use A(z) to define two polynomials:

    P(z) = A(z) + z^{-(I+1)} A(z^{-1})    (45.4a)
    Q(z) = A(z) - z^{-(I+1)} A(z^{-1})    (45.4b)

These polynomials can be shown to have two useful properties: all zeroes of P(z) and Q(z) lie on the unit circle, and they are interlaced with each other.
Thus, stability is easily checked by assuring both the interlaced property and that no two zeroes are too close together. A second property is that the frequencies tend to be clustered near the formant frequencies; the closer together two LSFs are, the sharper the formant. LSFs have attracted more interest recently because they typically result in quantizers having either better representations or using fewer bits than reflection coefficient quantizers.

The simplest quantizers are scalar quantizers [8]. Each of the values (in whatever domain is being used to represent the LPC coefficients) is represented by one of the possible quantizer levels. The individual values are quantized independently of each other. There may also be additional redundancy between successive frames, especially during stationary speech. In such cases, values may be quantized differentially between frames.

A more efficient, but also more complex, method of quantization is called vector quantization [9]. In this technique, the complete set of values is quantized jointly. The actual set of values is compared against all sets in the codebook using a distance metric, and the set that is nearest is selected. In practice, an exhaustive codebook search is too complex. For example, a 10-bit codebook has 1024 entries. This seems like a practical limit for most codebooks, but does not give sufficient performance for typical 10th order LPC. A 20-bit codebook would give increased performance, but would contain over 1 million vectors. This is both too much storage and too much computational complexity to be practical. Instead of using large codebooks, product codes are used. In one technique, an initial codebook is used, then the remaining error vector is quantized by a second stage codebook. In the second technique, the vector is subdivided and each sub-vector is quantized using its own codebook.
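A minimal sketch of the full search and both product-code variants follows. The codebooks here are tiny and made up purely for illustration; real LSF codebooks are trained on large speech databases:

```python
def nearest(codebook, v):
    """Full-search VQ: index of the codeword with minimum squared distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], v)))

def split_vq(v, cb_low, cb_high):
    """Split VQ: each half of the vector is searched in its own codebook."""
    h = len(v) // 2
    return nearest(cb_low, v[:h]), nearest(cb_high, v[h:])

def two_stage_vq(v, cb1, cb2):
    """Two-stage (multistage) VQ: quantize, then quantize the residual."""
    i = nearest(cb1, v)
    residual = [x - c for x, c in zip(v, cb1[i])]
    return i, nearest(cb2, residual)

# Tiny 1-bit-per-stage codebooks, purely for illustration:
cb = [[0.0, 0.0], [1.0, 1.0]]
idx = nearest(cb, [0.9, 1.2])   # -> 1, the closer codeword
```

Two 10-bit stages (or two 10-bit splits) cost 2 × 1024 distance computations instead of the 2^20 a single 20-bit codebook would require, which is the complexity argument made above.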
Both of these techniques lose efficiency compared to a full-search vector quantizer, but represent a good means of reducing computational complexity and codebook size for a given bit rate or quality.

45.2.2 Models of Human Perception for Speech Coding

Our ears have a limited dynamic range that depends on both the level and the frequency content of the input signal. The typical bandpass telephone filter has a stopband of only about 35 dB. Also, the logarithmic quantizer characteristics specified by CCITT Rec. G.711 result in a signal-to-quantization noise ratio of about 35 dB. Is this a coincidence? Of course not! If a signal maintains an SNR of about 35 dB or greater for telephone bandwidth, then most humans will perceive little or no noise.

Conceptually, the masking property tells us that we can permit greater amounts of noise in and near the formant regions and that noise will be most audible in the spectral valleys. If we use a coder that produces a white noise characteristic, then the noise spectrum is flat. The white noise would probably be audible in all but the formant regions. In modern speech coders, an additional linear filter is added to weight the difference between the original speech signal and the synthesized signal. The object is to minimize the error in a space whose metric is like that of the human auditory system. If the LPC filter information is available, it constitutes the best available estimate of the speech spectrum. It can be used to form the basis for this "perceptual weighting filter" [10]. The perceptual weighting filter is given by

    W(z) = A(z/\gamma_1) / A(z/\gamma_2),    0 < \gamma_2 < \gamma_1 < 1    (45.5)

The perceptual weighting filter de-emphasizes the importance of noise in the formant region and emphasizes its importance in spectral valleys. The quantization noise will have a spectral shape that is similar to that of the LPC spectral estimate, making it easier to mask.
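Because A(z/γ) simply scales each coefficient a_i by γ^i, Eq. (45.5) is easy to realize directly. A minimal direct-form sketch follows; the γ values are illustrative defaults, not prescribed by the text:

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): each a_i is scaled by gamma**i."""
    return [ai * gamma ** (i + 1) for i, ai in enumerate(a)]

def perceptual_weighting(x, a, g1=0.9, g2=0.5):
    """Apply W(z) = A(z/g1) / A(z/g2) in direct form, sample by sample."""
    num = bandwidth_expand(a, g1)        # feed-forward taps
    den = bandwidth_expand(a, g2)        # feedback taps
    y, xs, ys = [], [0.0] * len(a), [0.0] * len(a)
    for xn in x:
        yn = (xn + sum(n * xv for n, xv in zip(num, xs))
                 - sum(d * yv for d, yv in zip(den, ys)))
        y.append(yn)
        xs = [xn] + xs[:-1]              # shift input history
        ys = [yn] + ys[:-1]              # shift output history
    return y

# With g1 == g2 the filter reduces to the identity, a quick sanity check:
y = perceptual_weighting([1.0, 0.0, 0.0], [-0.9], g1=0.7, g2=0.7)
```

Shrinking γ pulls the poles and zeroes of A(z) toward the origin, flattening its spectrum; choosing γ1 > γ2 leaves W(z) with dips at the formants, which is exactly the de-emphasis described above.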
The adaptive postfilter is an additional linear filter that is combined with the synthesis filter to reduce noise in the spectral valleys [11]. Once again the LPC synthesis filter is available as the estimate of the speech spectrum. As in the perceptual weighting filter, the synthesis filter is modified. This idea was later extended to include a long-term (pitch) filter. A tilt-compensation filter was added to correct for the lowpass characteristic that causes a muffled sound. A gain control strategy helped prevent any segments from being either too loud or too soft. Adaptive postfilters are now included as a part of many standards.

45.3 Types of Speech Coders

This part of the section describes a variety of speech coders that are widely used. They are divided into two categories: waveform-following coders and model-based coders. Waveform-following coders have the property that if there were no quantization error, the original speech signal would be exactly reproduced. Model-based coders are based on parametric models of speech production. Only the values of the parameters are quantized. If there were no quantization error, the reproduced signal would not be the original speech.

45.3.1 Model-Based Speech Coders

LPC Vocoders

A block diagram of the LPC vocoder is shown in Fig. 45.1. LPC analysis is performed on a frame of speech and the LPC information is quantized and transmitted. A voiced/unvoiced determination is made. The decision may be based on either the original speech or the LPC residual signal, but it will always be based on the degree of periodicity of the signal. If the frame is classified as unvoiced, the excitation signal is white noise. If the frame is voiced, the pitch period is transmitted and the excitation signal is a periodic pulse train. In either case, the amplitude of the output signal is selected such that its power matches that of the original speech. For more information on the LPC vocoder, the reader is referred to [12].

FIGURE 45.1: Block diagram of LPC vocoder.
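The decoder half of Fig. 45.1 can be sketched as follows. This is a toy illustration (not the LPC-10 algorithm of [12]): either a periodic pulse train or white noise drives the synthesis filter 1/A(z):

```python
import random

def lpc_vocoder_frame(a, frame_len, voiced, pitch_period=60, gain=1.0,
                      mem=None):
    """Synthesize one decoded frame: excitation through 1/A(z)."""
    mem = mem if mem is not None else [0.0] * len(a)
    out = []
    for n in range(frame_len):
        if voiced:
            e = gain if n % pitch_period == 0 else 0.0   # periodic pulses
        else:
            e = gain * random.uniform(-1.0, 1.0)         # white noise
        # synthesis filter 1/A(z): x_n = e_n - sum_i a_i x_{n-i}
        xn = e - sum(ai * m for ai, m in zip(a, mem))
        out.append(xn)
        mem = [xn] + mem[:-1]
    return out, mem   # return filter memory so frames join smoothly
```

In a real vocoder the gain would be chosen per frame so the output power matches the input power, and the pitch period and voicing flag would come from the transmitted bit stream.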
Multiband Excitation (MBE) Coders

Figure 45.2 is a block diagram of a multiband sinusoidal excitation coder. The basic premise of these coders is that the speech waveform can be modeled as a combination of harmonically related sinusoidal waveforms and narrowband noise. Within a given bandwidth, the speech is classified as periodic or aperiodic. Harmonically related sinusoids are used to generate the periodic components and white noise is used to generate the aperiodic components. Rather than transmitting a single voiced/unvoiced decision, a frame consists of a number of voiced/unvoiced decisions corresponding to the different bands. In addition, the spectral shape and gain must be transmitted to the receiver. LPC may or may not be used to quantize the spectral shape. Most often the analysis at the encoder is performed via fast Fourier transform (FFT). Synthesis at the decoder is usually performed by a number of parallel sinusoid and white noise generators. MBE coders are model-based because they do not transmit the phase of the sinusoids, nor do they attempt to capture anything more than the energy of the aperiodic components. For more information the reader is referred to [13]–[16].

FIGURE 45.2: Block diagram of multiband excitation coder.

Waveform Interpolation Coders

Figure 45.3 is a block diagram of a waveform interpolation coder. In this coder, the speech is assumed to be composed of a slowly evolving periodic waveform (SEW) and a rapidly evolving noise-like waveform (REW). A frame is analyzed first to extract a "characteristic waveform". The evolution of these waveforms is filtered to separate the REW from the SEW. REW updates are made several times more often than SEW updates. The LPC, the pitch, the spectra of the SEW and REW, and the overall energy are all transmitted independently.
At the receiver a parametric representation of the SEW and REW information is constructed, summed, and passed through the LPC synthesis filter to produce output speech. For more information the reader is referred to [17, 18].

FIGURE 45.3: Block diagram of waveform interpolation coder.

45.3.2 Time Domain Waveform-Following Speech Coders

All of the time domain waveform coders described in this section include a prediction filter. We begin with the simplest.

Adaptive Differential Pulse Code Modulation (ADPCM)

Adaptive differential pulse code modulation (ADPCM) [19] is based on sample-by-sample quantization of the prediction error. A simple block diagram is shown in Fig. 45.4. Two parts of the coder may be adaptive: the quantizer step-size and/or the prediction filter. ITU Recommendations G.726 and G.727 adapt both. The adaptation may be either forward or backward adaptive. In a backward adaptive system, the adaptation is based only on the previously quantized sample values and the quantizer codewords. At the receiver, the backward adaptive parameter values must be recomputed. An important feature of such adaptation schemes is that they must use predictors that include a leakage factor that allows the effects of erroneous values caused by channel errors to die out over time. In a forward adaptive system, the adapted values are quantized and transmitted. This additional "side information" uses bit rate, but can improve quality. Additionally, it does not require recomputation at the decoder.

Delta Modulation Coders

In delta modulation coders [20], the quantizer is just the sign bit. The quantization step size is adaptive. Not all the adaptation schemes used for ADPCM will work for delta modulation because the quantization is so coarse. The quality of delta modulation coders tends to be proportional to their sampling clock: the greater the sampling clock, the greater the correlation between successive samples, and the finer the quantization step size that can be used.
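A one-bit coder with adaptive step size can be sketched as follows; the grow/shrink constants are illustrative choices, not taken from any standard:

```python
def adm_encode_decode(x, step=0.1, grow=1.5, shrink=0.75):
    """Adaptive delta modulation: one bit per sample, adaptive step size."""
    bits, recon = [], []
    xhat, prev = 0.0, None
    for xn in x:
        b = 1 if xn >= xhat else 0       # transmit only the sign bit
        if prev is not None:
            # repeated bits mean the tracker is lagging (slope overload),
            # so grow the step; alternation means granular noise, so shrink
            step *= grow if b == prev else shrink
        xhat += step if b else -step     # decoder integrates the bits
        bits.append(b)
        recon.append(xhat)
        prev = b
    return bits, recon

# Tracking a slow ramp: the reconstruction stays close to the input.
bits, recon = adm_encode_decode([0.05 * n for n in range(20)])
```

The step-size rule is the whole design problem here: too little growth and the coder cannot follow steep waveforms, too much and quiet passages become noisy, which is why a high sampling clock helps, as noted above.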
The block diagram for delta modulation is the same as that of ADPCM.

FIGURE 45.4: ADPCM encoder and decoder block diagrams.

Adaptive Predictive Coding

The better the performance of the prediction filter, the lower the bit rate needed to encode a speech signal. This is the basis of the adaptive predictive coder [21] shown in Fig. 45.5. A forward adaptive higher order linear prediction filter is used. The speech is quantized on a frame-by-frame basis. In this way the bit rate for the excitation can be reduced compared to an equivalent quality ADPCM coder.

FIGURE 45.5: Adaptive predictive coding encoder and decoder.

Linear Prediction Analysis-by-Synthesis Speech Coders

Figure 45.6 shows a typical linear prediction analysis-by-synthesis speech coder [22]. Like APC, these are frame-by-frame coders. They begin with an LPC analysis. Typically the LPC information is forward adaptive, but there are exceptions. LPAS coders borrow from ADPCM the concept of a locally available decoder. The difference between the quantized output signal and the original signal is passed through a perceptual weighting filter. Possible excitation signals are considered and the best (minimum mean square error in the perceptual domain) is selected. The long-term prediction filter removes long-term correlation (the pitch structure) in the signal. If pitch structure is present in the coder, the parameters for the long-term predictor are determined first. The most commonly used system is the adaptive codebook, where samples from previous excitation sequences are stored. The pitch period and gain that result in the greatest reduction of perceptual error are selected, quantized, and transmitted. The fixed codebook excitation is next considered and, again, the excitation vector [...]

[...] transform coding with LPC and time-domain pitch analysis [25]. The residual signal is coded using ATC.

45.4 Current Standards

This part of the section is divided into descriptions of current speech coder standards and activities. The subsections contain information on speech coders that have been or will soon be standardized. We begin first by briefly describing the standards organizations who formulate speech coding [...] Telecommunications Standards Bureau (ITU-B) is the bureaucracy handling all of the paperwork. Speech coding standards are handled jointly by Study Groups 16 and 12 within the ITU-T. Other Study Groups may originate requests for speech coders for specific applications. The speech coding experts are found in SG16. The experts on speech performance are found in SG12. When a new standard is being formulated, SG16 draws [...]

45.4.1 Current ITU Waveform Signal Coders

Table 45.1 describes current ITU speech coding recommendations that are based on sample-by-sample scalar quantization. Three of these coders operate in the time domain on the original sampled signal while the fourth is based on a two-band sub-band coder for wideband speech.

TABLE 45.1: ITU Waveform Speech Coders
Standard body | Number | [...]

[...] split the signal.

45.4.2 ITU Linear Prediction Analysis-by-Synthesis Speech Coders

Table 45.2 describes three current analysis-by-synthesis speech coder recommendations of the ITU. All three are block coders based on extensions of the original multipulse LPC speech coder.

TABLE 45.2: ITU Linear Prediction Analysis-By-Synthesis Speech Coders
Standard body | Number  | Year
ITU           | G.728   | 1992
ITU           | G.729   | [...]
ITU           | G.723.1 | [...]

[...] rate used ACELP [...] G.723.1 and G.729 are the first ITU coders to be specified by a bit-exact fixed point ANSI C code simulation of the encoder and decoder.

45.4.3 Digital Cellular Speech Coding Standards

Table 45.3 describes the first and second generation of speech coders to be standardized for digital cellular telephony. The first generation coders provided adequate quality. Two of the second generation coders [...]

[...] channel coding, so the speech coder itself must be designed to be robust for the channel conditions. The noisy background conditions have proven to be difficult for vocoders making voiced/unvoiced classification decisions, whether the decisions are made for all bands or for individual bands.

45.4.5 Performance

Figure 45.8 is included to give an impression of the relative performance for clean speech of [...] coders. Figure 45.8 is based on the relative performances of these coders across a number of tests that have been reported. In the case of coders that are not yet standards, their performance is projected and shown as a circle. The vertical axis of Fig. 45.8 gives the approximate single encoding quality for clean input speech. The horizontal axis is a logarithmic scale of bit rate. Figure 45.8 only includes telephone bandwidth speech coders; the 7-kHz speech coders have been omitted. Figure 45.9 compares the complexity as measured in MIPS and RAM for a fixed point DSP implementation for most [...]

FIGURE 45.8: Approximate speech quality of speech coding standards.
FIGURE 45.9: Approximate complexity of speech coding standards.

References

[...] X., CCITT standardizing activities in speech coding, Proc. ICASSP '86, 817–820, 1986.
[30] Chen, J.-H., Cox, R.V., Lin, Y.-C., Jayant, N., and Melchner, M.J., A low-delay CELP coder for the CCITT 16 kb/s speech coding standard, IEEE JSAC, 10, 830–849, 1992.
[31] Johansen, F.T., A non bit-exact approach for implementation verification of the CCITT LD-CELP speech coder, Speech Commun., 12, 103–112, 1993.
[32] [...]
[...] J.L., Optimizing digital speech coders by exploiting masking properties of the human ear, J. Acoustical Soc. Am., 66, 1647–1652, Dec. 1979.
[11] Chen, J.-H. and Gersho, A., Adaptive postfiltering for quality enhancement of coded speech, IEEE Trans. on Speech and Audio Processing, 3, 59–71, 1995.
[12] Tremain, T., The Government Standard Linear Predictive Coding Algorithm: LPC-10, Speech Technol., 40–49, Apr. [...]
