Quality of Telephone-Based Spoken Dialogue Systems, Part 6

Quality of Synthesized Speech over the Phone

Figure 5.6. Effect of narrow-band circuit noise Nc. Normalized and E-model prediction for individual voices. Nfor = –100 dBmp.

Figure 5.7. Effect of narrow-band circuit noise Nc. Normalized PESQ and TOSQA model predictions for synthetic vs. natural voices. Nfor = –100 dBmp.

... the voice and a grouping in synthetic and natural voices. The overall quality judgments are largely comparable to the estimations given by the E-model. However, in contrast to the model, a marked MOS degradation can already be observed at very low noise levels (Nc between –100 and –60 dBm0p). This degradation is statistically significant only for natural voice 1; for all other voices, the overall quality starts to degrade significantly only at narrow-band noise levels higher than –60 dBm0p. The listening-effort and the intelligibility (INT) ratings are similar to those obtained for wide-band circuit noise conditions.

Figure 5.8. Effect of signal-correlated noise with signal-to-quantizing-noise ratio Q. Normalized and E-model prediction for individual voices.

When the results for narrow-band circuit noise, Nc, are compared with the predictions of signal-based comparative measures, the picture is similar to the one found for wide-band noise Nfor, see Figure 5.7. The PESQ predictions for naturally produced and synthesized speech are close to each other, whereas the TOSQA model predicts a larger quality decrease for the naturally produced speech, an estimation which is supported by the auditory tests. As for Nfor, the TOSQA model predicts a very steep decrease of the MOS values with increasing noise levels, whereas the shape of the curve predicted by PESQ is closer to the one found in the auditory test. As can be expected, the scatter of the auditory test results for medium noise levels (Nc ≈ –70 to –60 dBm0p) is not reflected in the signal-based model predictions.
This scatter presumably originates in the subjective ratings, and not in the speech stimuli presented to the test subjects.

5.4.2.3 Impact of Signal-Correlated Noise

Signal-correlated noise is perceptually different from continuous circuit noise in that it affects only the speech signal, and not the pauses. Its effects on the overall quality ratings are shown in Figure 5.8. Although slight individual differences between the voices are found (not statistically significant in the ANOVA), the overall behavior of synthetic and natural voices is very similar. This can be seen when the mean values for synthetic and natural voices are compared, see the dotted lines in Figure 5.9. The degradations are, in principle, well predicted by the E-model. However, for low levels of signal-correlated noise (high Q), there is still a significant degradation which is not predicted by the model. This effect is similar to the one observed for narrow-band circuit noise, Nc; no explanation can be given for it so far.

Figure 5.9. Effect of signal-correlated noise with signal-to-quantizing-noise ratio Q. Normalized PESQ and TOSQA model predictions for synthetic vs. natural voices.

The predictions of the signal-based comparative measures PESQ and TOSQA do not agree very well with the auditory test results. Whereas the PESQ model estimations are close to the auditory judgments up to SNR values of [...], the TOSQA model estimates the impact of signal-correlated noise slightly more pessimistically. This model, however, predicts a slightly lower degradation of the naturally produced speech samples, which is congruent with the auditory test. Neither PESQ nor TOSQA predicts the relatively low quality level for the highest SNR value in the test (Q = 30 dB); both give more optimistic estimations for these speech samples.
In other words, the models reach saturation (which is inevitable on the limited MOS scale) at higher SNR values than those included in the test conditions. As a general finding, both models are in line with the auditory test in that they do not predict strong differences between the naturally produced and the synthesized speech samples. The listening-effort and the INT values are similar in the natural and the synthetic case, with slightly higher values for the natural voices. These results have not been plotted for space reasons.

5.4.2.4 Impact of Ambient Noise

Degradations due to ambient room noise are shown in Figure 5.10. The behavior differs slightly between the individual voices. In particular, the synthetic voices seem to be a little less prone to ambient noise impairments than the natural voices. Once again, this might be due to a higher ‘distinctness’ of the synthetic voices, which makes them more remarkable in the presence of noise. The same behavior is found for the intelligibility judgments, see Figure 5.11. For all judgments, the data point for synthetic voice 1 and Pr = 35 dB(A) seems to be an outlier, as it is rated particularly negatively. Informal listening reveals very inappropriate phone durations in two positions of the speech file, which make this specific sample sound particularly bad. Here, the lack of optimization of the speech material discussed in Section 5.4.1.1 becomes noticeable.

Figure 5.10. Effect of Hoth-type ambient noise Pr. Normalized and E-model prediction for individual voices.

Figure 5.11. Effect of Hoth-type ambient noise Pr. Normalized intelligibility score for individual voices.

5.4.2.5 Impact of Low Bit-Rate Coding

Figure 5.12. Effect of low bit-rate codecs. Normalized and E-model prediction for synthetic vs. natural voices.

The low bit-rate codecs investigated here cover a wide range of perceptually different types of degradations. In particular, the G.726 (ADPCM) and the
G.728 (LD-CELP) codecs produce an impression of noisiness, whereas G.729 and IS-54 are characterized by an artificial, unnatural sound quality (informal expert judgments). Figures 5.12 to 5.14 show a fundamental difference in the quality judgments for natural and synthesized speech when transmitted over channels including these codecs (mean values over the natural and over the synthetic voices are reproduced here for clarity). Except for two cases (the G.726 and G.728 codecs, which are rated too negatively in comparison to the prediction model), the decrease in overall quality predicted by the E-model is well reflected in the auditory judgments for natural speech. The judgments for the synthesized speech, on the other hand, do not follow this line. Instead, the overall quality of synthesized speech is much more strongly affected by the ‘noisy’ codecs (G.726, G.728 and G.726*G.726) and less by the ‘artificially sounding’ codecs. Listening-effort and intelligibility ratings for synthesized speech are far less affected by all of these codecs (they scatter around a relatively constant value), whereas they show the same rank order for the naturally produced speech (once again, with the exception of the G.726 and G.728 codecs).

The differences in behavior between the synthetic and the natural voices are also observed for the codec cascades (G.726*G.726 and IS-54*IS-54) compared to the single codecs: whereas for the G.726 tandem it is mainly the synthetic voices that suffer from the cascading, the effect is more dominant for the natural voices with the IS-54 cascade.

Figure 5.13. Effect of low bit-rate codecs. Normalized listening-effort for synthetic vs. natural voices.

Figure 5.14. Effect of low bit-rate codecs. Normalized intelligibility score for synthetic vs. natural voices.

The observed differences may be due to differences in the quality dimensions perceived as degradations by the test subjects. Whereas the ‘artificiality’ dimension introduced by the G.729 and IS-54 codecs is an additional degradation for the naturally produced speech, this is not the case for synthesized speech, which already carries a certain degree of artificiality. It is not yet clear why the G.726 and G.728 transmission circuits result in particularly low quality, an effect which does not correspond to the prediction of the E-model. Other investigations carried out by the author in a working group of the ITU-T (Möller, 2000) also suggest that the model predictions are too optimistic for these codecs when considered in isolation, i.e. without tandeming.

Figure 5.15. Effect of low bit-rate codecs. Normalized PESQ and TOSQA model predictions for natural voices.

Figure 5.16. Effect of low bit-rate codecs. Normalized PESQ and TOSQA model predictions for synthetic voices.

Signal-based comparative measures like PESQ and TOSQA have been developed in particular for predicting the effects of low bit-rate codecs. A comparison to the normalized auditory values is shown in Figure 5.15 for the natural voices. Whereas for the IS-54 codec and its combinations the predicted quality is in good agreement with both models’ predictions, the differences are bigger for the G.726, G.728 and G.729 codecs. As was found for the E-model, the G.726 and G.728 codecs are rated significantly worse in the auditory test than the models predict. On the other hand, the G.729 codec is rated better than the predictions of both PESQ and TOSQA suggest. In all cases, both models err in the same direction, predicting the codec degradations either too optimistically or too pessimistically. Thus, nothing is gained by calculating the mean of the PESQ and TOSQA model predictions.

The picture is different for the synthesized voices, see Figure 5.16. The quality rank order predicted by the E-model (i.e.
the bars ordered with respect to decreasing MOS values) is also found for the PESQ and TOSQA predictions, but it is not well reflected in the auditory judgments. Overall, the differences between the auditory test results and the signal-based model predictions are larger for the synthesized than for the naturally produced voices. For the three ‘noisy’ codec conditions G.726, G.728 and G.726*G.726, both PESQ and TOSQA predict the quality more optimistically than it was judged in the test. For the other codecs, the predictions are mainly more pessimistic. This supports the assumption that the overall quality of synthesized speech is much more strongly affected by ‘noisy’ codecs and less by ‘artificially sounding’ codecs.

5.4.2.6 Impact of Combined Impairments

For combinations of circuit noise and low bit-rate distortions, synthetic and natural voices behave similarly. This can be seen in Figure 5.17, which shows the combination of the IS-54 cellular codec with narrow-band circuit noise (mean values for synthetic vs. natural voices are depicted). Again, the quality for low noise does not reach the optimum value (the value predicted by the E-model). This observation has already been made for the other circuit noise conditions. In high-noise-level conditions, the synthetic voices are slightly less affected by the noise than the natural voices. This finding is similar to the one described in Section 5.4.2.2.

With the help of the normalization to the transmission rating scale, the additivity of different types of impairments postulated by the E-model can be tested. Figure 5.18 shows the results after applying this transformation. It can be seen that the slope of the curve for higher noise levels is well in line with the results for the natural voices. The synthesized voices seem to be more robust under these conditions, although the individual results scatter significantly. For low noise levels, the predictions of the E-model are once again too optimistic.
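
The additivity assumption discussed here can be illustrated with a minimal Python sketch. It assumes the general E-model form in which impairment factors are subtracted from a base rating on the transmission rating scale, and it uses the rating-to-MOS conversion given in ITU-T Rec. G.107, which saturates at both ends of the scale. The function names and the numeric base rating and impairment values are hypothetical placeholders chosen for illustration; they are not the values used in the experiments reported in this chapter.

```python
from typing import Iterable


def transmission_rating(base_rating: float,
                        impairments: Iterable[float],
                        advantage: float = 0.0) -> float:
    """Additive combination: subtract every impairment from the base rating.

    The additivity postulate means that the drop caused by two impairments
    together equals the sum of the drops each one causes on its own.
    """
    return base_rating - sum(impairments) + advantage


def rating_to_mos(r: float) -> float:
    """Map a transmission rating R to an estimated MOS (ITU-T G.107 form).

    Ratings at or below 0 map to MOS 1.0 and ratings at or above 100
    map to MOS 4.5, so the conversion saturates at the scale limits.
    """
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6


if __name__ == "__main__":
    base = 93.2    # assumed base rating (illustrative)
    codec = 20.0   # hypothetical codec impairment
    noise = 10.0   # hypothetical noise impairment

    for label, factors in [("codec only", [codec]),
                           ("noise only", [noise]),
                           ("codec + noise", [codec, noise])]:
        r = transmission_rating(base, factors)
        print(f"{label}: R = {r:.1f}, MOS = {rating_to_mos(r):.2f}")
```

Testing additivity in an auditory experiment then amounts to checking whether the normalized ratings for the combined condition fall where the sum of the individual impairments places them.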
The optimistic predictions at low noise levels are probably due to the unrealistically low theoretical noise floor level (Nfor = –100 dBmp) of this connection, for which the E-model predictions even exceed 100, the limit of the scale under normal (default) circuit conditions. The optimistic model prediction can also be observed for the judgment of the codec alone, depicted in Figure 5.12. In principle, however, the flat model curve for the lower noise levels agrees well with the results for both synthetic and natural voices. Thus, no specific doubts arise as to the validity of adding different impairment factors to obtain an overall transmission rating. Of course, the limited findings do not validate the additivity property as a whole. Other combinations of impairments will have to be tested, and more experiments have to be carried out in order to reduce the obvious scatter in the results.

Figure 5.17. Effect of narrow-band circuit noise Nc and the IS-54 codec. Normalized and E-model prediction for synthetic vs. natural voices.

Figure 5.18. Effect of narrow-band circuit noise Nc and the IS-54 codec. Normalized and E-model transmission rating prediction for individual voices.

5.4.2.7 Acceptability Ratings

Figure 5.19. Effect of narrow-band circuit noise Nc. Perceived acceptability ratings for individual voices.

Figure 5.20. Effect of low bit-rate codecs. Perceived acceptability ratings for individual voices.

The ratings on the ‘perceived acceptability’ question in part 5.1 of the test have to be interpreted with care, because acceptability can ultimately be assessed only with a fully working system (for a definition of this term see Möller, 2000). Nevertheless, acceptability judgments are interesting for developers, because they show whether a synthetic voice is acceptable in a specific application context. As an example, Figure 5.19 shows the overall (not normalized) level of perceived acceptability for noisy transmission channels.
Synthetic voice 2 mostly lies between the natural voices, whereas synthetic voice 1 is rated considerably worse. Interestingly, the highest perceived acceptability level for the three better voices seems to be reached at a moderate noise floor of [...] dBm0p, and not at the lowest noise levels (except for natural voice 1 at Nc = –100 dBm0p). Thus, under realistic transmission [...]... produced speech will also enable an adequate transmission of synthesized speech.

Chapter 6 QUALITY OF SPOKEN DIALOGUE SYSTEMS

Investigations on the performance of speech recognition and on the quality of synthesized speech in telephone environments like the ones reported in the previous two chapters provide useful information on the influence of environmental factors on the system’s speech input and...

... Möller and Skowronek (2003a,b). Experiments 6.1 and 6.3 follow the steps mentioned in the introduction, allowing for a comparison between interaction parameters and subjective judgments. Experiment 6.2 is limited to the collection of subjective judgments, making use of guided interviews in order to optimally design the questionnaire.

6.1.1 The BoRIS Restaurant Information...

... about the town and thus about the potential location of restaurants. Criteria which were mentioned for choosing restaurants were the price (27 nominations), the type of food (33), the quality of food (33), and the friendliness of the staff (34). Location (14) and opening hours (7) seem to be less important criteria. 13 of the 40 test subjects said that they had prior...

... been published in Pellegrini (2003).
Experiment 6.2: Questionnaire design and investigation of relevant quality aspects. This experiment is described in Niculescu (2002).
Experiment 6.3: Analysis and validation of the QoS taxonomy; analysis of the influence of different system configurations on quality aspects; analysis and definition of existing and new quality prediction models. The experiment is described...

... set-up of spoken dialogue systems, the resources for this evaluation were limited. In particular, only a laboratory test with a limited group of subjects could be carried out, and no field test or survey with a realistic group of potential future users of the system. In spite of these limitations, interesting results with respect to the influence of the environmental factors were obtained. The type of degradation...

... element of the confusion matrix is not zero, otherwise two boundary conditions have the same value of zero: [...] As this last element describes how often an empty word is recognized as an empty word, an arbitrary number can be chosen. On the basis of the scaled confusion matrix, the individual rows are used to build a randomized word sequence in which the frequency of the...

... experiment 6.2), and at the end of a test session. The questionnaires have been developed during the course of the experiments, so the individual questions which have been asked in experiments 6.1, 6.2 and 6.3 are not identical. They are partly based on the literature survey (see Section 3.8.6), and partly on the author's own considerations of the individual quality aspects which should be covered. Details on the design of...

... the interactions
Collection of user quality ratings after each interaction, and after a complete test session
Transcription of the dialogues
Annotation of dialogue transcriptions by a human expert
Automatic calculation of interaction parameters
Data analysis and quality modelling approaches
The first steps serve the purpose of collecting interaction parameters and related quality judgments for specific...

... Skowronek (2002). After a brief pre-analysis of the collected data, Sections 6.2.2 and 6.2.3 investigate the underlying dimensionality of quality judgments and of interaction parameters. Such an analysis may be interpreted in terms of addressed quality aspects. For the quality judgments, it will thus provide information about the dimensions underlying the user’s quality percept. For the interaction parameters,...

... deeper analysis of the correlations between interaction parameters and quality judgments is carried out in Section 6.2.4. The aim is to give empirical evidence for the taxonomy of QoS aspects, and to describe the relationships in a quantitative way. The subsequent section analyzes the impact of the system configuration, namely of the speech recognizer, of the speech output, and of dialogue management...

... interactions with the fully working system will be addressed, making use of the mentioned speech output components, and replacing the ASR module by a wizard simulation
