Quality of Telephone-Based Spoken Dialogue Systems (Part 3)

The E-model has later been modified to better predict the effects of ambient noise, quantizing distortion, and time-variant impairments like lost frames or packets. The current model version is described in detail in ITU-T Rec. G.107 (2003).

The idea underlying the E-model is to transform the effects of individual impairments (e.g. those caused by noise, echo, delay, etc.) first to an intermediate 'transmission rating scale'. During this transformation, instrumentally measurable parameters of the transmission path are transformed into the respective amount of degradation they provoke, called 'impairment factors'. Three types of impairment factors, reflecting three types of degradations, are calculated:

- All types of degradations which occur simultaneously to the speech signal, e.g. a too loud connection, quantizing noise, or a non-optimum sidetone, are expressed by the simultaneous impairment factor Is.

- All degradations occurring delayed with respect to the speech signal, e.g. the effects of pure delay (in a conversation) or of listener and talker echo, are expressed by the delayed impairment factor Id.

- All degradations resulting from low bit-rate codecs, partly also under transmission error conditions, are expressed by the effective equipment impairment factor Ie,eff. Ie,eff takes the equipment impairment factor for the error-free case, Ie, into account.

These types of degradations do not necessarily reflect the quality dimensions which can be obtained in a multidimensional auditory scaling experiment. In fact, such dimensions have been identified as "intelligibility" or "overall clarity", "naturalness" or "fidelity", loudness, color of sound, or the distinction between background and signal distortions (McGee, 1964; McDermott, 1969; Bappert and Blauert, 1994). Instead, the impairment factors of the E-model have been chosen for practical reasons, to distinguish between parameters which can easily be measured and handled in the network planning process.

The different impairment factors are subtracted from the highest possible transmission rating level Ro, which is determined by the overall signal-to-noise ratio of the connection. This ratio is calculated assuming a standard active speech level of -26 dBov, i.e. 26 dB below the overload point of the digital system (cf. the definition of the active speech level in ITU-T Rec. P.56, 1993), and taking the SLR and RLR loudness ratings, the circuit noise Nc, the noise floor Nfor, as well as the ambient room noise into account. An allowance for the transmission rating level is made to reflect the differences in user expectation towards networks differing from the standard wireline one (e.g. cordless or mobile phones), expressed by a so-called 'advantage of access' factor A. For a discussion of this factor see Möller (2000). As a result, the overall transmission rating factor R of the connection can be calculated as

    R = Ro - Is - Id - Ie,eff + A

This transmission rating factor is the principal output of the E-model. It reflects the overall quality level of the connection which is described by the input parameters discussed in the last section. For normal parameter settings, R can be transformed into an estimate of a mean user judgment on a 5-point ACR quality scale defined in ITU-T Rec. P.800 (1996), using the fixed S-shaped relationship

    MOS = 1                                                for R < 0
    MOS = 1 + 0.035 R + R (R - 60)(100 - R) * 7e-6         for 0 <= R <= 100
    MOS = 4.5                                              for R > 100

Both the transmission rating factor R and the estimated mean opinion score MOS give an indication of the overall quality of the connection. They can be related to the network planning quality classes defined in ITU-T Rec. G.109 (1999), see Table 2.5.
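The R-to-MOS mapping lends itself to direct implementation. The following Python sketch implements only this published G.107 relationship, not the computation of Ro, Is, Id and Ie,eff from the underlying planning parameters:

```python
def r_to_mos(r: float) -> float:
    """Estimate a mean opinion score from the E-model transmission
    rating factor R, using the S-shaped mapping of ITU-T Rec. G.107."""
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6

# With all G.107 default planning values, R evaluates to 93.2:
print(round(r_to_mos(93.2), 2))  # -> 4.41
```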
For the network planner, not only the overall R value is important, but also the single contributions (Ro, Is, Id and Ie,eff), because they provide an indication of the sources of the quality degradations and of potential countermeasures (e.g. the introduction of an echo canceller). Other formulae exist for relating R to the percentage of users rating a connection good or better (%GoB) or poor or worse (%PoW). The exact formulae for calculating Ro, Is, Id, and Ie,eff are given in ITU-T Rec. G.107 (2003). For Ie and A, fixed values are defined in ITU-T Appendix I to Rec. G.113 (2002) and ITU-T Rec. G.107 (2003). Another example of a network planning model is the SUBMOD model developed by British Telecom (ITU-T Suppl. 3 to P-Series Rec., 1993), which is based on ideas from Richards (1973).

If the network has already been set up, it is possible to obtain realistic measurements of major parts of the network equipment. The measurements can be performed either off-line (intrusively, when the equipment is put out of network operation), or on-line in operating networks (non-intrusive measurement). In operating networks, however, it might be difficult to access the user interfaces; therefore, standard values are taken for this part of the transmission chain. The measured input parameters or signals can be used as input to the signal-based or network planning models (so-called monitoring models). In this way, it becomes possible to monitor quality for the specific network under consideration. Different models and model combinations can be envisaged, and details can be found in the literature (Möller and Raake, 2002; ITU-T Rec. P.562, 2004; Ludwig, 2003).

From the principles used by the models, it becomes obvious which quality aspects may be predicted. Current signal-based measures predict only one-way voice transmission quality, for the specific parts of the transmission channel they have been optimized for. These predictions usually reach a high accuracy because adequate input parameters are available. In contrast, network planning models like the E-model base their predictions on simplified and perhaps imprecisely estimated planning values. In addition to one-way voice transmission quality, they cover conversational aspects and, to a certain extent, the effects caused by the service and its context of use. All models described in this section address HHI over the phone. Investigations on how they may be used in HMI for predicting ASR performance are described in Chapter 4, and for synthesized speech in Chapter 5.

2.4.2 SDS Specification

The specification phase of an SDS may be of crucial importance for the success of a service. An appropriate specification gives an indication of the scale of the whole task, increases the modularity of a system, allows early problem spotting, and is particularly suited to checking the functionality of the system to be set up. The specification should be initialized by a survey of user requirements: Who are the potential users, and where, why and how will they use the service? Before starting with an exact specification of a service and the underlying system, the target functionality has to be clarified. Several authors point out that system functionality may be a very critical issue for the success of a service.
For example, Lamel et al. (1998b) reported that the prototype users of their French ARISE system for train information did not differentiate between the service functionality (operative functions) and the system responses, which may be critically determined by the technical functions. If the system informs the user about its limitations, the system response may be appropriate under the given constraints, but completely dissatisfying for the user. Thus, systems which are well-designed from a technological and from an interaction point of view may be unusable because of a restricted functionality.

In order to design systems and services which are usable, human factors issues should be taken into account early in the specification phase (Dybkjær and Bernsen, 2000). The specification should cover all aspects which potentially influence the system's usability, including its ease of use, its capability to perform a natural, flexible and robust dialogue with the user, a sufficient task domain coverage, and contextual factors in the deployment of the SDS (e.g. service improvement or economic benefit). The following information needs to be specified:

- Application domain and task. Although developers are seeking application-independent systems, there are a number of principal design decisions which depend on the specific application under consideration. Within a domain, different tasks may require completely different solutions; e.g. an information task may be insensitive to security requirements, whereas the corresponding reservation task may require the communication of a credit card number and may thus be inappropriate for the speech modality. The application will also determine the linguistic aspects of the interaction (vocabulary, syntax, etc.).

- User and task requirements. They may be determined from recordings of human services if the corresponding situation exists, or via interviews in the case of new tasks which have no prior history in HHI.

- Intended user group.

- Contextual factors. They may be amongst the most important factors influencing users' satisfaction with SDSs, and include service improvement (longer opening hours, introduction of new functionalities, avoidance of queues, etc.) and economic benefits (e.g. users pay less for an SDS service than for a human one), see Dybkjær and Bernsen (2000).

- Common knowledge which will have to be shared between the human user and the SDS. This knowledge will arise from the application domain and task, and will have to be specified in terms of an initial vocabulary and language model, the required speech understanding capability, and the speech output capability.

- Common knowledge which will have to be shared between the SDS and the underlying application, and the corresponding interface (e.g. SQL).

- Knowledge to be included in the user model, cf. the discussion of user models in Section 2.1.3.4.

- Principal dialogue strategies to be used in the interaction, and potential description solutions (e.g. finite state machines, dialogue grammar, flowcharts); a minimal finite-state sketch is given at the end of this section.

- Hardware and software platform, i.e. the computing environment including communication protocols, application system interfaces, etc.

These general specification topics partly overlap with the characterization of individual system components for system analysis and evaluation. They form a prerequisite for the system design and implementation phase. The evaluation specification will be discussed in Section 3.1, together with the assessment and evaluation methods.
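As an illustration of the first description option named in the list above, the following minimal Python sketch encodes a strictly system-initiative train-information dialogue as a finite state machine. All state names, prompts and semantic categories are invented for this example and are not taken from any of the cited systems:

```python
# Minimal finite-state dialogue description (illustrative only).
# Each state maps to a system prompt and to transitions keyed by
# the semantic category of the recognized user input.
DIALOGUE = {
    "greeting": {"prompt": "Welcome to train information. Where do you want to go?",
                 "next": {"destination": "ask_date"}},
    "ask_date": {"prompt": "On which date do you want to travel?",
                 "next": {"date": "confirm"}},
    "confirm":  {"prompt": "Shall I look up this connection?",
                 "next": {"yes": "lookup", "no": "greeting"}},
    "lookup":   {"prompt": "Please hold while I retrieve the timetable.",
                 "next": {}},  # terminal state: hand over to the application
}

def step(state: str, interpretation: str) -> str:
    """Return the follow-up state for a recognized semantic category;
    unrecognized input re-enters the same state (re-prompt)."""
    return DIALOGUE[state]["next"].get(interpretation, state)

state = "greeting"
state = step(state, "destination")  # -> "ask_date"
state = step(state, "mumble")       # -> "ask_date" again (re-prompt)
```

Such a flat description suffices for strictly system-initiative dialogues; mixed-initiative strategies typically require the richer dialogue grammars or plan-based approaches mentioned above.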
2.4.3 SDS Design

On the basis of the specification, system designers have the task of describing how to build the service. This description has to be sufficiently detailed to permit system implementation. System designers may consult end users as well as domain or industrial experts for support (Atwell et al., 2000). Such a consultation may be established in a principled way, as was done in the European REWARD (REal World Application of Robust Dialogue) project, see e.g. Failenschmid (1998). This project aimed to give domain specialists a more active role in the design process of SDSs. Graphical dialogue design tools and SDS engines were provided to domain experts who had little or no knowledge of speech technology, and only technical assistance was given to them by speech technologists. The domain experts took design decisions in a way which addressed the users' expectations as directly as possible, while the technical experts concentrated on achieving a function or task in a technically sophisticated way.

From the system designer's point of view, three design approaches and two combinations can be distinguished (Fraser, 1997, pp. 571-594):

- Design by intuition: Starting from the specification, the task is analyzed in detail in order to establish parameters and routes for task accomplishment. The routes are specified in linguistic terms by introspection, and are based on expert intuition. Such a methodology is mostly suited for system-initiative dialogues and structured tasks, with a limited use of vocabulary and language. Because of the large space of possibilities, intuitions about user performance are generally unreliable, and intuitions on HMI are sparse anyway. Design by intuition can be facilitated by structured task analysis and design representations, as well as by usability criteria checklists, as will be described below.

- Design by observation of HHI: This methodology avoids the limitations of intuition by providing data evidence. It helps to build domain and task understanding, and to create initial vocabularies, language models, and dialogue descriptions. It gives information about the user goals, the items needed to satisfy the goals, and the strategies and information used during negotiation (San-Segundo et al., 2001a,b). The main problem of design by observation is that an extrapolation is performed from HHI to HMI. Such an extrapolation may be critical even for narrow tasks, because of the described differences between HHI and HMI, see Section 2.2. In particular, some aspects which are important in HMI cannot be observed in HHI, e.g. the initial setting of user expectations by the greeting, input confirmation and re-prompting, or the connection to a human operator in case of system failure.

- Design by simulation: The most popular method is the Wizard-of-Oz (WoZ) technique. The name is based on Baum's novel, in which the "great and terrible" wizard turns out to be no more than a mechanical device operated by a man hiding behind a curtain (Baum, 1900). The technique is sometimes also called PNAMBIC (Pay No Attention to the Man Behind the Curtain). In a WoZ simulation, a human wizard plays the role of the computer. The wizard takes spoken input, processes it in some principled way, and generates spoken system responses. The degree to which components are simulated can vary, and commonly so-called 'bionic wizards' (half human, half machine) are used.
WoZ simulations can be largely facilitated by the use of rapid prototyping tools, see below. The use of WoZ simulations in the system evaluation phase is addressed in Section 3.8.

- Iterative WoZ methodology: This methodology makes use of WoZ simulations in a principled way. In the pre-experimental phase, the application domain is analyzed in order to define the domain knowledge (database), subject scenarios, and a first experimental set-up for the simulation (location, hardware/software, subjects). In the first experimental phase, a WoZ simulation is performed in which very few constraints are put on the wizard, e.g. only some limitations of what the wizard is allowed to say. The data collected in this simulation and in the pre-experimental phase are used to develop initial linguistic resources (vocabulary, grammar, language model) and a dialogue model. In subsequent phases, the WoZ simulation is repeated, with more and more restrictions on what the wizard is allowed to understand and to say, and how to behave. Potentially, a bionic wizard is used in later simulation steps. This procedure is repeated until a fully automated system is available. The methodology is expected to provide a stable set-up after three to four iterations (Fraser and Gilbert, 1991b; Bernsen et al., 1998).

- System-in-the-loop: The idea of this methodology is to collect data with an existing system, in order to enhance the vocabulary, the language models, etc. The use of a real system generally provides good and realistic data, but only for the domain captured by the current system, and perhaps for small steps beyond it. A main difficulty is that the methodology requires a fully working system.

Usually, a combination of approaches is used when a new system is set up. Designers start from the specification and their intuition, which should be described in a formalized way in order to be useful in the system design phase. On the basis of the intuitions and of observations from HHI, a cycle of WoZ simulations is carried out. During the WoZ cycles, more and more components of the final system are used, until a fully working system is obtained. This system is then enhanced in a system-in-the-loop paradigm.

[Figure 2.11. Example of a design decision addressed with the QOC (Questions-Options-Criteria) method, see de Ruyter and Hoonhout (2002). Criteria are met positively (black solid lines) or negatively (gray dashed lines) by choosing one of the options.]

Design based on intuition can largely be facilitated by presenting the space of design decisions in a systematized way, because the quality elements of an SDS are less well-defined than those of a transmission channel. A systematized representation illustrates the interdependence of design constraints, and helps to identify contradicting goals and requirements. An example of such a representation is the Design Space Development and Design Rationale (DSD/DR), see Bernsen et al. (1998). In this approach, the requirements are represented in a frame which also captures the designer commitments at a certain point in the decision process. A DR frame represents the reasoning about a certain design problem, capturing the options, trade-offs, and reasons why a particular solution was chosen.

An alternative is the so-called Questions-Options-Criteria (QOC) rationale (MacLean et al., 1991; Bellotti et al., 1991).
In this rationale, the design space is characterized by questions identifying key design issues, options providing possible answers to the questions, and criteria for assessing and comparing the options. All possible options (answers) to a question are assessed positively or negatively (or via a +/- scaling) against a number of criteria. An example is given in Figure 2.11, taken from the European IST project INSPIRE (INfotainment management with SPeech Interaction via REmote microphones and telephone interfaces), see de Ruyter and Hoonhout (2002). Questions have to be posed in a way that provides an adequate context and structure to the design space (Bellotti et al., 1991). The methodology assists with early design reasoning as well as with the later comprehension and propagation of the resulting design decisions.
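Such a QOC frame can also be represented programmatically, e.g. for documentation or tool support. The sketch below is a minimal illustration; the question, options, criteria and assessments are invented for this example and are not taken from the INSPIRE project:

```python
# Illustrative QOC (Questions-Options-Criteria) frame: each option is
# assessed positively (+1) or negatively (-1) against each criterion.
question = "How should the system confirm understood slot values?"
criteria = ["dialogue brevity", "error robustness", "perceived naturalness"]

# assessment[option][criterion] in {+1, -1}; values are invented examples.
assessment = {
    "explicit confirmation": {"dialogue brevity": -1, "error robustness": +1,
                              "perceived naturalness": -1},
    "implicit confirmation": {"dialogue brevity": +1, "error robustness": -1,
                              "perceived naturalness": +1},
}

for option, marks in assessment.items():
    score = sum(marks[c] for c in criteria)
    print(f"{option}: {score:+d}")

# Summing unweighted marks is a simplification: the QOC rationale itself
# deliberately leaves the weighting of criteria to the design discussion.
```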
Apart from formalized representations of design decisions, general design guidelines and "checklists" are a commonly agreed basis for usability engineering, see e.g. the guidelines proposed by ETSI for telephone user interfaces (ETSI Technical Report ETR 051, 1992; ETSI Technical Report ETR 147, 1994). For SDS design, Dybkjær and Bernsen (2000) defined a number of "best practice" guidelines, including the following:

- Good speech recognition capability: The user should be confident that the system successfully receives what he/she says.

- Good speech understanding capability: Speaking to an SDS should be as easy and natural as possible.

- Good output voice quality: The system's voice should be clear and intelligible, not be distorted or noisy, show a natural intonation and prosody and an appropriate speaking rate, be pleasant to listen to, and require no extra listening effort.

- Adequate output phrasing: The system should have a cooperative way of expression and provide correct and relevant speech output with sufficient information content. The output should be clear and unambiguous, in a familiar language.

- Adequate feedback about processes and about information: The user should notice what the system is doing, what information has been understood by the system, and which actions have been taken. The amount and style of feedback should be adapted to the user and the dialogue situation, and depends on the risk and costs involved with the task.

- Adequate initiative control, domain coverage and reasoning capabilities: The system should make the user understand which tasks it is able to carry out, and how they are structured, addressed, and accessed.

- Sufficient interaction guidance: Clear cues for turn-taking and barge-in should be supported, help mechanisms should be provided, and a distinction between system experts/novices and task experts/novices should be made.

- Adequate error handling: Errors can be handled via meta-communication for repair or clarification, initiated either by the system or by the user.

Different (but partly overlapping) guidelines have been set up by Suhm (2003), on the basis of a taxonomy of speech interface limitations.

Additional guidelines specifically address the system's output speech. System prompts are critical because people often judge a system mainly by the quality of the speech output, and not by its recognition capability (Souvignier et al., 2000). Fraser (1997), p. 592, collects the following prompt design guidelines:

- Be as brief and simple as possible.

- Use a consistent linguistic style.

- Finish each prompt with an explicit question.

- Allow barge-in.

- Use a single speaker for each function.

- Use a prompt voice which gives a friendly personality to the system.

- Remember that instructions presented at the beginning of the dialogue are not always remembered by the user.

- In case of re-prompting, provide additional information and guidance.

- Do not pose as a human as long as the system cannot understand as well as a human (Basson et al., 1996).

Even when prompts are designed according to these guidelines, the system may still be rather boring in the eyes (or ears) of its users. Aspects like the metaphor, i.e. the transfer of meaning due to similarities in external form or function, and the impression and feeling which are created have to be supported by the speech output. Speech output can be complemented by other audio output, e.g. auditory signs ("earcons") or landmarks, in order to reach this goal.

System prompts have an important effect on the user's behavior, and may stimulate users to model the system's language (Zoltan-Ford, 1991; Basson et al., 1996). In order to prevent dialogues from adopting too rigid a style due to specific system prompts, adaptive systems may be able to "zoom in" to more specific questions (alternative questions, yes/no questions) or to "zoom out" to more general ones (open questions), depending on the success or failure of system questions (Veldhuijzen van Zanten, 1999). The selection of the right system prompts also depends on the intended user group: whereas naive users often prefer directed prompts, open prompts may be a better solution for users who are familiar with the system (Williams et al., 2003; Witt and Williams, 2003).

Respecting design guidelines will help to minimize the risks which are inherent in intuitive design approaches. However, guidelines do not guarantee that all relevant design issues are adequately addressed. In particular, they do not provide any help in the event of conflicting guidelines, because no weighting of the individual items can be given.

Design by simulation is a very useful way to close the gaps which intuition may leave. A discussion of important factors of WoZ experiments will be given in conjunction with the assessment and evaluation methods in Section 3.8. Results obtained in a WoZ simulation are often very useful and justify the effort required to set up the simulation environment. They are, however, limited to a simulated system, which should not be confounded with a working system in a real application situation. The step between a WoZ simulation and a working system manifests itself in all the environmental, agent, task, and contextual factors, and it should not be underestimated. Polifroni et al. (1998) observed for their JUPITER weather information service that the ASR error rates for the first system working in a real-world environment tripled in comparison to the performance in a WoZ simulation. Within a year, both word and sentence error rates could be reduced again by a factor of three.

During the installation of new systems, it thus has to be carefully considered how to treat ASR errors in early system development stages. Apart from leaving the system unchanged, it is possible to try to detect and ignore these errors by using a different rejection threshold than that of the optimized system (Rosset et al., 1999), or by using a different confirmation strategy.
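The rejection-threshold idea can be sketched in a few lines. The confidence thresholds below are invented for illustration; Rosset et al. (1999) describe the strategy, not these numbers:

```python
# Confidence-based rejection of ASR hypotheses (illustrative thresholds).
# Early deployments may reject more aggressively and re-prompt, trading
# longer dialogues against fewer uncaught recognition errors.
EARLY_DEPLOYMENT_THRESHOLD = 0.70  # conservative: reject more, re-prompt more
OPTIMIZED_THRESHOLD = 0.45         # tuned later, once in-domain data exists

def accept_hypothesis(confidence: float, early_deployment: bool) -> bool:
    """Accept an ASR hypothesis, or reject it and trigger a re-prompt or
    explicit confirmation, depending on the deployment stage."""
    threshold = EARLY_DEPLOYMENT_THRESHOLD if early_deployment else OPTIMIZED_THRESHOLD
    return confidence >= threshold
```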
Design decision-taking and testing can be largely facilitated by rapid prototyping tools. A number of such tools are described and compared in DISC Deliverable D2.7a (1999) and by McTear (2002). They include tools which enable the description of the dialogue management component, and others integrating different system components (ASR, TTS, etc.) into a running prototype. The most well-known examples are: [...]

[...] evaluation of spoken dialogue systems. On the basis of the listed criteria, assessment and evaluation methods can be chosen or have to be designed. An overview of such methods will be given in Chapter 3.

2.5 Summary

Spoken dialogue systems enabling task-orientated human-machine interaction over the phone offer a relatively new type of service to their users. Because of the inexperience of most users, [...]

[...] to their users. The quality of the interaction with a spoken dialogue system will depend on the characteristics of the system itself, as well as on the characteristics of the transmission channel and the environment the user is situated in. The physical and algorithmic characteristics of these quality elements have been addressed in Section 2.1. They can be classified with the help of an interactive speech [...]

[...] context of use. These factors are in a complex relationship to different notions of quality (performance, effectiveness, efficiency, usability, user satisfaction, utility and acceptability), as is described by a new taxonomy for the quality of SDS-based services which is given in Section 2.3.1. The taxonomy can be helpful for system developers in three different ways: (1) quality elements of [...]

[...] description of system components alone may be misleading for capturing the quality of the overall system. A full description of the quality aspects of an SDS can only be obtained by using a combination of assessment and evaluation methods. On the one hand, these methods should be able to collect information about the performance of individual system components, and about the performance of the whole [...]

[...] types of performance assessment are the notion of reference (Section 3.2) and the collection of data (Section 3.3), which will be addressed in separate sections. Then, assessment methods for individual components of SDSs will be discussed, namely for ASR (Section 3.4), for speech and natural language understanding (Section 3.5), for speaker recognition (Section 3.6), and for speech output (Section 3.7). The final Section 3.8 deals with the assessment and evaluation of entire spoken dialogue systems, including the dialogue management component.

3.1 Characterization

Following the taxonomy of QoS aspects given in Section 2.3.1, five types of factors characterize the interaction situations addressed in this book: agent factors, task [...]

[...] number of alternatives available at a given level.

- Contextual analysis: number and complexity of rules.

- Interaction with ASR and dialogue management modules: type and amount of input and output information (single hypotheses, ranked lists, etc.), dependency of syntactic-semantic and contextual interpretation on the dialogue state.

3.1.1.4 Dialogue Manager Characterization

The approach taken for dialogue management [...] technical point of view, e.g. a dialogue grammar, a plan-based approach, or a collaborative approach (see Section 2.1.3.4). The most important characteristics of the dialogue manager are the type and amount of knowledge implemented in the manager, the distribution of initiative between the system and the user, and the system's meta-communication strategies:

- Dialogue manager knowledge: dialogue history [...]

[...] composition of the unit to be judged.

3.3 Data Collection

In order to design SDSs which are able to cope with a variety of user utterances and dialogue situations, large databases have to be collected. The collection often takes place at different sites, in order to distribute the burden of recording and annotation, and to provide a more diverse pool of data than could be obtained at one single site. Examples of [...]

[...] concentrate on the weak points of the system, which are mainly found in the evaluation phases. The mentioned types of corpora differ in several characteristics. From a language point of view, the corpus size, the number of dialogues, utterances and words, the total duration, the average length of a dialogue, the average length of an utterance, or the average number of words per utterance or dialogue are important [...]
