
Proceedings of ACL-08: HLT, pages 622-629, Columbus, Ohio, USA, June 2008. © 2008 Association for Computational Linguistics

Assessing Dialog System User Simulation Evaluation Measures Using Human Judges

Hua Ai, University of Pittsburgh, Pittsburgh, PA 15260, USA, hua@cs.pitt.edu
Diane J. Litman, University of Pittsburgh, Pittsburgh, PA 15260, USA, litman@cs.pitt.edu

Abstract

Previous studies evaluate simulated dialog corpora using evaluation measures which can be automatically extracted from the dialog systems' logs. However, the validity of these automatic measures has not been fully proven. In this study, we first recruit human judges to assess the quality of three simulated dialog corpora and then use human judgments as the gold standard to validate the conclusions drawn from the automatic measures. We observe that it is hard for the human judges to reach good agreement when asked to rate the quality of the dialogs from given perspectives. However, the human ratings give consistent rankings of the quality of simulated corpora generated by different simulation models. When building prediction models of human judgments using previously proposed automatic measures, we find that we cannot reliably predict human ratings using a regression model, but we can predict human rankings with a ranking model.

1 Introduction

User simulation has been widely used in different phases of spoken dialog system development. In the system development phase, user simulation is used to train different system components. For example, (Levin et al., 2000) and (Scheffler, 2002) exploit user simulations to generate large corpora for using Reinforcement Learning to develop dialog strategies, while (Chung, 2004) implements user simulation to train the speech recognizer and understanding components.

While user simulation is considered to be more low-cost and time-efficient than experiments with human subjects, one major concern is how well state-of-the-art user simulations can mimic human user behaviors and how well they can substitute for human users in a variety of tasks. (Schatzmann et al., 2005) propose a set of evaluation measures to assess the quality of simulated corpora. They find that these evaluation measures are sufficient to discern simulated from real dialogs. Since this multiple-measure approach does not offer an easily reportable statistic indicating the quality of a user simulation, (Williams, 2007) proposes a single measure for evaluating and rank-ordering user simulations based on the divergence between the simulated and real users' performance. This new approach also offers a lookup table that helps to judge whether an observed ordering of two user simulations is statistically significant.

In this study, we also strive to develop a prediction model of the rankings of the simulated users' performance. However, our approach uses human judgments as the gold standard. Although to date there are few studies that use human judges to directly assess the quality of user simulation, we believe that this is a reliable approach to assess the simulated corpora as well as an important step towards developing a comprehensive set of user simulation evaluation measures. First, we can estimate the difficulty of the task of distinguishing real from simulated corpora by knowing how hard it is for human judges to reach an agreement. Second, human judgments can be used as the gold standard for the automatic evaluation measures. Third, we can validate the automatic measures by correlating the conclusions drawn from the automatic measures with the human judgments.
In this study, we recruit human judges to assess the quality of three user simulation models. Judges are asked to read the transcripts of the dialogs between a computer tutoring system and the simulation models and to rate the dialogs on a 5-point scale from different perspectives. Judges are also given the transcripts between human users and the computer tutor. We first assess human judges' abilities in distinguishing real from simulated users. We find that it is hard for human judges to reach good agreement on the ratings. However, these ratings give consistent rankings of the quality of the real and the simulated user models. Similarly, when we use previously proposed automatic measures to predict human judgments, we cannot reliably predict human ratings using a regression model, but we can consistently mimic human judges' rankings using a ranking model. We suggest that this ranking model can be used to quickly assess the quality of a new simulation model without manual effort, by ranking the new model against the old models.

2 Related Work

A lot of research has been done on evaluating different components of spoken dialog systems as well as overall system performance. Different evaluation approaches are proposed for different tasks. Some studies (e.g., (Walker et al., 1997)) build regression models to predict user satisfaction scores from the system log as well as the user survey. There are also studies that evaluate different systems or system components by ranking the quality of their outputs. For example, (Walker et al., 2001) train a ranking model that ranks the outputs of different language generation strategies based on human judges' rankings. In this study, we build both a regression model and a ranking model to evaluate user simulation.

(Schatzmann et al., 2005) summarize some broadly used automatic evaluation measures for user simulation and integrate several new automatic measures to form a comprehensive set of statistical evaluation measures. The first group of measures investigates how much information is transmitted in the dialog and how active the dialog participants are. The second group of measures analyzes the style of the dialog, and the last group of measures examines the efficiency of the dialogs. While these automatic measures are handy to use, they have not been validated by humans.

There are well-known practices which validate automatic measures using human judgments. For example, in machine translation, the BLEU score (Papineni et al., 2002) is developed to assess the quality of machine-translated sentences. Statistical analysis is used to validate this score by showing that the BLEU score is highly correlated with human judgments. In this study, we validate a subset of the automatic measures proposed by (Schatzmann et al., 2005) by correlating the measures with human judgments. We follow the design of (Linguistic Data Consortium, 2005) in obtaining human judgments. We call our study an assessment study.

3 System and User Simulation Models

In this section, we describe our dialog system (ITSPOKE) and the user simulation models which we use in the assessment study. ITSPOKE is a speech-enabled Intelligent Tutoring System that helps students understand qualitative physics questions. In the system, the computer tutor first presents a physics question and the student types an essay as the answer. Then, the tutor analyzes the essay and initiates a tutoring dialog to correct misconceptions and to elicit further explanations. A corpus of 100 tutoring dialogs was collected between 20 college students (solving 5 physics problems each) and the computer tutor, yielding 1388 student turns. The correctness of student answers is automatically judged by the system and kept in the system's logs. Our previous study manually clustered tutor questions into 20 clusters based on the knowledge (e.g., acceleration, Newton's 3rd Law) that is required to answer each question (Ai and Litman, 2007).
We train three simulation models from the real corpus: the random model, the correctness model, and the cluster model. All simulation models generate student utterances at the word level by picking out the recognized student answers (with potential speech recognition errors) from the human subject experiments, using different policies. The random model (ran) is a simple unigram model which randomly picks a student's utterance from the real corpus as the answer to a tutor's question, regardless of which question it is. The correctness model (cor) is designed to give a correct/incorrect answer with the same probability as the average of real students. For each tutor question, we automatically compute the average correctness rate of real student answers from the system logs. Then, a correct/incorrect answer is randomly chosen from the correct/incorrect answer sets for this question. The cluster model (clu) tries to model student learning by assuming that a student has a higher chance of giving a correct answer to a question from a cluster in which he/she has mostly answered correctly before. It computes the conditional probability that a student answer is correct/incorrect given the content of the tutor's question and the correctness of the student's answer to the previous question that belongs to the same question cluster. We also refer to the real students as the real student model (real) in this paper. We hypothesize that the ranking of the four student models (from the most realistic to the least) is: real, clu, cor, and ran.
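The three sampling policies can be illustrated with a short sketch. This is a minimal illustration under assumed data structures (per-turn records with the recognized answer, its automatically judged correctness, the tutor question, and its knowledge cluster), not the authors' implementation; the cluster model is shown only in its simplest conditional form.

```python
import random
from collections import defaultdict

# Assumed corpus format: one record per real student turn, e.g.
# {"question": "q17", "cluster": 3, "answer": "gravity", "correct": True,
#  "prev_correct_in_cluster": False}
def build_answer_pools(corpus):
    # Group the recognized real-student answers by question and correctness.
    pools = defaultdict(lambda: {True: [], False: []})
    for turn in corpus:
        pools[turn["question"]][turn["correct"]].append(turn["answer"])
    return pools

def random_model(corpus):
    # ran: pick any recognized student utterance, ignoring the question.
    return random.choice(corpus)["answer"]

def correctness_model(corpus, pools, question):
    # cor: match the average correctness rate of real students on this question.
    turns = [t for t in corpus if t["question"] == question]
    p_correct = sum(t["correct"] for t in turns) / len(turns)
    correct = random.random() < p_correct
    pool = pools[question][correct] or pools[question][not correct]
    return random.choice(pool)

def cluster_model(corpus, pools, question, cluster, prev_correct):
    # clu: condition correctness on how the student answered the previous
    # question from the same knowledge cluster.
    turns = [t for t in corpus if t["cluster"] == cluster
             and t["prev_correct_in_cluster"] == prev_correct]
    turns = turns or [t for t in corpus if t["cluster"] == cluster]
    p_correct = sum(t["correct"] for t in turns) / len(turns)
    correct = random.random() < p_correct
    pool = pools[question][correct] or pools[question][not correct]
    return random.choice(pool)
```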
4 Assessment Study Design

4.1 Data

We decided to conduct a middle-scale assessment study that involved 30 human judges. We conducted a small pilot study to estimate how long it took a judge to answer all survey questions (described in Section 4.2) for one dialog, because we wanted to control the length of the study so that judges would not have too much cognitive load and would be consistent and accurate in their answers. Based on the pilot study, we decided to assign each judge 12 dialogs, which took about an hour to complete. Each dialog was assigned to two judges. We used three out of the five physics problems from the original real corpus to ensure the variety of dialog contents while keeping the corpus size small. Therefore, the evaluation corpus consisted of 180 dialogs, in which 15 dialogs were generated by each of the 4 student models on each of the 3 problems.

4.2 Survey Design

4.2.1 Survey questions

We designed a web survey to collect human judgments on a 5-point scale at both the utterance and dialog levels. Each dialog is separated into pairs of a tutor question and the corresponding student answer. Figure 1 shows the three questions which are asked for each tutor-student utterance pair.

[Figure 1: Utterance level questions.]

The three questions assess the quality of the student answers from three aspects of Grice's Maxims (Grice, 1975): Maxim of Quantity (u_QNT), Maxim of Relevance (u_RLV), and Maxim of Manner (u_MNR). We do not include the Maxim of Quality because in our task domain the correctness of the student answers depends largely on the students' physics knowledge, which is not a factor we would like to consider when evaluating the realness of the students' dialog behaviors.

In Figure 2, we show the three dialog level questions which are asked at the end of each dialog. The first question (d_TUR) is a Turing-test type of question which aims to obtain an impression of the student's overall performance. The second question (d_QLT) assesses the dialog quality from a tutoring perspective. The third question (d_PAT) sets a higher standard on the student's performance. Unlike the first two questions, which ask whether the student "looks" good, this question further asks whether the judges would like to partner with the particular student.

[Figure 2: Dialog level questions.]

4.2.2 Survey Website

We display one tutor-student utterance pair and the three utterance level questions on each web page. After the judge answers the three questions, he/she is led to the next page, which displays the next pair of tutor-student utterances in the dialog with the same three utterance level questions. The judge reads through the dialog in this manner and answers all utterance level questions. At the end of the dialog, the three dialog level questions are displayed on one webpage. We provide a textbox under each dialog level question for the judge to type in a brief explanation of his/her answer. After the judge completes the three dialog level questions, he/she is led to a new dialog. This procedure repeats until the judge completes all of the 12 assigned dialogs.

4.3 Assessment Study

30 college students are recruited as human judges via flyers. Judges are required to be native speakers of American English to make correct judgments on the language use and fluency of the dialogs. They are also required to have taken at least one course on Newtonian physics to ensure that they can understand the physics tutoring dialogs and make judgments about the content of the dialogs. We follow the same task assignment procedure that is used in (Linguistic Data Consortium, 2005) to ensure a uniform distribution of judges across student models and dialogs while maintaining a random choice of judges, models, and dialogs. Judges are instructed to work as quickly as comfortably possible. They are encouraged to provide their intuitive reactions and not to ponder their decisions.

5 Assessment Study Results

In the initial analysis, we observe that it is a difficult task for human judges to rate on the 5-point scale, and the agreements among the judges are fairly low. Table 1 shows, for each question, the percentage of pairs of judges who gave the same ratings on the 5-point scale.

Table 1: Percent agreements on the 5-point scale
d_TUR   d_QLT   d_PAT   u_QNT   u_RLV   u_MNR
22.8%   27.8%   35.6%   39.2%   38.4%   38.7%

For the rest of the paper, we collapse the "definitely" types of answers with their adjacent "probably" types of answers (more specifically, answer 1 with 2, and 4 with 5). We substitute scores 1 and 2 with a score of 1.5, and scores 4 and 5 with a score of 4.5. A score of 3 remains the same.
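A one-line sketch of this collapsing step (the mapping is exactly the one described above; how the ratings are stored is an assumption):

```python
# Collapse the 5-point ratings onto the 3-point scale used from here on.
COLLAPSE = {1: 1.5, 2: 1.5, 3: 3, 4: 4.5, 5: 4.5}

print([COLLAPSE[r] for r in [1, 2, 3, 4, 5]])  # [1.5, 1.5, 3, 4.5, 4.5]
```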
5.1 Inter-annotator agreement

Table 2 shows the inter-annotator agreements on the collapsed 3-point scale. The first column presents the question types. In the first row, "diff" stands for the difference between the human judges' ratings. The column "diff=0" shows the percent agreements on the 3-point scale. We can see the improvements from the original 5-point scale when comparing with Table 1. The column "diff=1" shows the percentages of pairs of judges who agree with each other on a weaker basis in that one of the judges chooses "cannot tell". The column "diff=2" shows the percentages of pairs of judges who disagree with each other. The column "Kappa" shows the unweighted kappa agreements and the column "Kappa*" shows the linear weighted kappa.

Table 2: Agreements on the 3-point scale
Q       diff=0   diff=1   diff=2   Kappa   Kappa*
d_TUR   35.0%    45.6%    19.4%    0.022   0.079
d_QLT   46.1%    28.9%    25.0%    0.115   0.162
d_PAT   47.2%    30.6%    22.2%    0.155   0.207
u_QNT   66.8%    13.9%    19.3%    0.377   0.430
u_RLV   66.6%    17.2%    16.2%    0.369   0.433
u_MNR   67.5%    15.4%    17.1%    0.405   0.470

We construct the confusion matrix for each question to compute the kappa agreements. Table 3 shows the confusion matrix for d_TUR. The first three rows of the first three columns show the counts of judges' ratings on the 3-point scale. For example, the first cell shows that there are 20 cases where both judges give 1.5 to the same dialog. When calculating the linear weighted kappa, we define the distance between adjacent categories to be one.[1]

Table 3: Confusion matrix on d_TUR
            score=1.5   score=3   score=4.5   sum
score=1.5   20          26        20          66
score=3     17          11        19          47
score=4.5   15          20        32          67
sum         52          57        71          180

Note that we randomly picked two judges to rate each dialog, so that different dialogs are rated by different pairs of judges and one pair of judges only worked on one dialog together. Thus, the kappa agreements here do not reflect the agreement of one pair of judges. Instead, the kappa agreements show the overall observed agreement among every pair of judges, controlling for chance agreement.

[1] We also calculated the quadratic weighted kappa, in which the distances are squared, and the kappa results are similar to the linear weighted ones. For calculating the two weighted kappas, see http://faculty.vassar.edu/lowry/kappa.html for details.

We observe that human judges have low agreement on all types of questions, although the agreements on the utterance level questions are better than on the dialog level questions. This observation indicates that assessing the overall quality of simulated/real dialogs on the dialog level is a difficult task. The lowest agreement appears on d_TUR. We investigate the low agreements by looking into judges' explanations on the dialog level questions. 21% of the judges find it hard to rate a particular dialog because that dialog is too short or the student utterances mostly consist of one or two words. There are also some common false beliefs among the judges. For example, 16% of the judges think that humans will say longer utterances, while 9% of the judges think that only humans will admit the ignorance of an answer.
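To make the Kappa* column concrete, the following sketch computes linear weighted kappa from a confusion matrix such as Table 3, with unit distance between adjacent categories as described above. It is an illustration, not the authors' script; the matrix is simply the d_TUR counts from Table 3.

```python
# Linear weighted kappa for the d_TUR confusion matrix in Table 3.
# Weight w[i][j] = 1 - |i - j| / (k - 1): adjacent categories get partial credit.
confusion = [
    [20, 26, 20],   # judge A: 1.5
    [17, 11, 19],   # judge A: 3
    [15, 20, 32],   # judge A: 4.5
]

def linear_weighted_kappa(matrix):
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    # Observed vs. chance-expected weighted agreement
    p_obs = sum(w[i][j] * matrix[i][j] for i in range(k) for j in range(k)) / n
    p_exp = sum(w[i][j] * row_tot[i] * col_tot[j]
                for i in range(k) for j in range(k)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

print(round(linear_weighted_kappa(confusion), 3))  # 0.079, the Kappa* value for d_TUR
```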
5.2 Rankings of the models

In Table 4, the first column shows the name of the question; the second column shows the name of the model; the third to fifth columns present the percentages of judges who choose answers 1 and 2, "can't tell", and answers 4 and 5.

Table 4: Rankings on dialog level questions
Question   Model   1 and 2   can't tell   4 and 5
d_TUR      real    22.2%     28.9%        48.9%
d_TUR      clu     25.6%     31.1%        43.3%
d_TUR      cor     32.2%     26.7%        41.1%
d_TUR      ran     51.1%     28.9%        20.0%
d_QLT      real    20.0%     10.0%        70.0%
d_QLT      clu     21.1%     20.0%        58.9%
d_QLT      cor     24.4%     15.6%        60.0%
d_QLT      ran     60.0%     18.9%        21.1%
d_PAT      real    28.9%     21.1%        50.0%
d_PAT      clu     41.1%     17.8%        41.1%
d_PAT      cor     43.3%     18.9%        37.8%
d_PAT      ran     82.2%     14.4%        3.4%

For example, when looking at the column "1 and 2" for d_TUR, we see that 22.2% of the judges think a dialog by a real student is generated probably or definitely by a computer; more judges (25.6%) think a dialog by the cluster model is generated by a computer; even more judges (32.2%) think a dialog by the correctness model is generated by a computer; and still more judges (51.1%) think a dialog by the random model is generated by a computer. When looking at the column "4 and 5" for d_TUR, we find that the most judges think a dialog by the real student is generated by a human, while the fewest judges think a dialog by the random model is generated by a human. Given that more human-like is better, both rankings support our hypothesis that the quality of the models from the best to the worst is: real, clu, cor, and ran. In other words, although it is hard to obtain well-agreed ratings among judges, we can combine the judges' ratings to produce a ranking of the models. We see consistent ranking orders on d_QLT and d_PAT as well, except for a swap of the cluster and correctness models on d_QLT (cor is rated slightly higher than clu in the "4 and 5" column).

When comparing two models, we can tell which model is better from the above rankings. Nevertheless, we also want to know how significant the difference is. We use t-tests to examine the significance of the differences between every two models. We average the two human judges' ratings to get an averaged score for each dialog. For each pair of models, we compare the two groups of averaged scores for the dialogs generated by the two models using 2-tailed t-tests at the significance level of p < 0.05. In Table 5, the first row presents the names of the models in each pair of comparison; "sig" means that the t-test is significant after Bonferroni correction; a question mark (?) means that the t-test is significant before the correction but not afterwards, which we treat as a trend; "not" means that the t-test is not significant at all.

Table 5: T-test results
        real-ran   real-cor   real-clu   ran-cor   ran-clu   cor-clu
d_TUR   sig        not        not        sig       sig       not
d_QLT   sig        not        not        sig       sig       not
d_PAT   sig        ?          ?          sig       sig       not
u_QNT   sig        not        not        sig       sig       not
u_RLV   sig        not        not        sig       sig       not
u_MNR   sig        not        not        sig       sig       not

The table shows that only the random model is significantly different from all other models. The correctness model and the cluster model are not significantly different from the real student given the human judges' ratings, nor are the two models significantly different from each other.
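A sketch of the pairwise testing behind Table 5: two-tailed t-tests on the per-dialog averaged ratings, with a Bonferroni correction over the six model pairs. The data layout is an assumption, and scipy is used for the test itself.

```python
from itertools import combinations
from scipy import stats

def pairwise_ttests(scores, alpha=0.05):
    # scores[model] = per-dialog ratings (each the average of the two judges).
    pairs = list(combinations(sorted(scores), 2))
    corrected = alpha / len(pairs)          # Bonferroni correction over the pairs
    results = {}
    for a, b in pairs:
        _, p = stats.ttest_ind(scores[a], scores[b])   # two-tailed by default
        if p < corrected:
            results[(a, b)] = "sig"
        elif p < alpha:
            results[(a, b)] = "?"           # trend: significant only before correction
        else:
            results[(a, b)] = "not"
    return results

# Hypothetical usage:
# pairwise_ttests({"real": [...], "clu": [...], "cor": [...], "ran": [...]})
```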
5.3 Human judgment accuracy on d_TUR

We look further into d_TUR in Table 4 because it is the only question for which we know the ground truth. We compute the accuracy of human judgment as (number of ratings of 4 and 5 on real dialogs + number of ratings of 1 and 2 on simulated dialogs) / (2 * total number of dialogs). The accuracy is 39.44%, which serves as further evidence that it is difficult to discern human from simulated users directly. A weaker accuracy of 68.35% is obtained when we also treat "cannot tell" as a correct answer.

6 Validating Automatic Measures

Since it is expensive to use human judges to rate simulated dialogs, we are interested in building prediction models of human judgments using automatic measures. If such a prediction model can reliably mimic human judgments, it can be used to rate new simulation models without collecting human ratings. In this section, we use a subset of the automatic measures proposed in (Schatzmann et al., 2005) that are applicable to our data to predict human judgments. Here, the human judgment on each dialog is calculated as the average of the two judges' ratings. We focus on predicting human judgments at the dialog level because these ratings represent the overall performance of the student models. We use six high-level dialog feature measures: the number of student turns (Sturn), the number of tutor turns (Tturn), the number of words per student turn (Swordrate), the number of words per tutor turn (Twordrate), the ratio of system/user words per dialog (WordRatio), and the percentage of correct answers (cRate).

6.1 The Regression Model

We use stepwise multiple linear regression to model the human judgments using the set of automatic measures listed above. The stepwise procedure automatically selects the measures to be included in the model. For example, d_TUR is predicted as 3.65 - 0.08 * WordRatio - 3.21 * Swordrate, with an R-square of 0.12. The prediction models for d_QLT and d_PAT have similarly low R-square values of 0.08 and 0.17, respectively. This result is not surprising because we only include surface-level automatic measures here. Also, these measures are designed for comparison between models rather than for prediction. Thus, in Section 6.2, we build a ranking model to utilize the measures in their comparative manner.
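Restating the selected d_TUR equation above as a tiny function, just for concreteness; the feature values in the example call are made up.

```python
def predict_d_tur(word_ratio, s_word_rate):
    # Stepwise selection kept only WordRatio and Swordrate (R-square = 0.12).
    return 3.65 - 0.08 * word_ratio - 3.21 * s_word_rate

print(predict_d_tur(word_ratio=1.2, s_word_rate=0.5))  # made-up inputs
```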
6.2 The Ranking Model

We train three ranking models to mimic human judges' rankings of the real and the simulated student models on the three dialog level questions, using RankBoost, a boosting algorithm for ranking ((Freund et al., 2003), (Mairesse et al., 2007)). We briefly explain the algorithm using the same terminology and equations as (Mairesse et al., 2007), taking the ranking model for d_TUR as an example.

In the training phase, the algorithm takes as input a group of dialogs that are represented by the values of the automatic measures and the human judges' ratings on d_TUR. The RankBoost algorithm treats the group of dialogs as ordered pairs:

  T = {(x, y) | x, y are two dialog samples and x has a higher human-rated score than y}

Each dialog x is represented by a set of m indicator functions h_s(x) (1 <= s <= m). For example:

  h_s(x) = 1 if WordRatio(x) >= 0.47, and 0 otherwise

Here, the threshold of 0.47 is calculated by RankBoost. A parameter \alpha_s is associated with each indicator function. For each dialog, a ranking score is calculated as:

  F(x) = \sum_s \alpha_s h_s(x)    (1)

In the training phase, the human ratings are used to set the \alpha values by minimizing the loss function:

  LOSS = (1/|T|) \sum_{(x,y) \in T} eval(F(x) \leq F(y))    (2)

The eval function returns 0 if the pair (x, y) is ranked correctly, and 1 otherwise. In other words, the LOSS score is the percentage of misordered pairs, i.e., pairs where the order of the predicted scores disagrees with the order indicated by the human judges. In the testing phase, the ranking score for every dialog is calculated by Equation 1. A baseline model which ranks dialog pairs randomly produces a LOSS of 0.5 (lower is better).

While LOSS indicates how many pairs of dialogs are ranked correctly, our main focus here is to rank the performance of the four student models rather than individual dialogs. Therefore, we propose another score, the Averaged Model Ranking (AMR). The AMR of a model is computed as the sum of the ratings of all the dialogs generated by that model, divided by the number of those dialogs. The four student models are then ranked based on their AMR scores. The chance of getting the right ranking order of the four student models by random guessing is 1/(4!).

Table 6 shows a made-up example to illustrate the two measures. real_1 and real_2 are two dialogs generated by the real student model; ran_1 and ran_2 are two dialogs by the random model. The second and third columns show the human-rated score used as the gold standard and the machine-predicted score in the testing phase, respectively. The LOSS in this example is 1/6, because only the pair of real_2 and ran_1 is misordered out of all 6 possible pair combinations. We then compute the AMR of the two models. According to the human-rated scores, the real model is scored 0.75 (= (0.9 + 0.6)/2) while the random model is scored 0.3. When looking at the predicted scores, the real model is scored 0.65, which is also higher than the random model's score of 0.4. We thus conclude that the ranking model ranks the two student models correctly according to the overall rating measure. We use both LOSS and AMR to evaluate the ranking models.

Table 6: A made-up example of the ranking model
Dialog   Human-rated Score   Predicted Score
real_1   0.9                 0.9
real_2   0.6                 0.4
ran_1    0.4                 0.6
ran_2    0.2                 0.2
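The sketch below recomputes the two measures on the made-up numbers in Table 6. It covers only the evaluation side (the LOSS of Equation 2 plus AMR), not RankBoost training itself; the data layout is an assumption.

```python
from itertools import combinations

# Table 6: (model, human-rated score, predicted score)
dialogs = [
    ("real", 0.9, 0.9),   # real_1
    ("real", 0.6, 0.4),   # real_2
    ("ran",  0.4, 0.6),   # ran_1
    ("ran",  0.2, 0.2),   # ran_2
]

def loss(dialogs):
    # Fraction of pairs whose predicted order disagrees with the human order.
    pairs = [(x, y) for x, y in combinations(dialogs, 2) if x[1] != y[1]]
    misordered = sum(1 for x, y in pairs if (x[1] > y[1]) != (x[2] > y[2]))
    return misordered / len(pairs)

def amr(dialogs, use_predicted=False):
    # Averaged Model Ranking: mean score of the dialogs generated by each model.
    per_model = {}
    for name, human, pred in dialogs:
        per_model.setdefault(name, []).append(pred if use_predicted else human)
    return {m: sum(v) / len(v) for m, v in per_model.items()}

print(loss(dialogs))                      # 1/6, as in the text
print(amr(dialogs))                       # human-rated AMR: real 0.75 vs ran 0.3
print(amr(dialogs, use_predicted=True))   # predicted AMR: real 0.65 vs ran 0.4
```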
First, we use regular 4-fold cross validation, where we randomly hold out 25% of the data for testing and train on the remaining 75% of the data, for 4 rounds. Both the training and the testing data consist of dialogs equally distributed among the four student models. However, since the practical usage of the ranking model is to rank a new model against several old models without collecting additional human ratings, we further test the algorithm by repeating the 4 rounds of testing while taking turns holding out the dialogs from one model in the training data, assuming that model is the new model for which we have no human ratings to train on. The testing corpus still consists of dialogs from all four models. We call this approach the minus-one-model cross validation.

Table 7 shows the LOSS scores for both cross validations. Using 2-tailed t-tests, we observe that the ranking models significantly outperform the random baseline in all cases after Bonferroni correction (p < 0.05). When comparing the two cross validation results for the same question, we see more LOSS in the more difficult minus-one-model case.

Table 7: LOSS scores for regular and minus-one-model (during training) cross validations
Cross Validation    d_TUR   d_QLT   d_PAT
Regular             0.176   0.155   0.151
Minus-one-model     0.224   0.180   0.178

However, the LOSS scores do not offer a direct conclusion on whether the ranking model ranks the four student models correctly or not. To address this question, we use AMR scores to re-evaluate all cross validation results. Table 8 shows the human-rated and predicted AMR scores averaged over the four rounds of testing on the regular cross validation results. We see that the ranking model gives the same rankings of the student models as the human judges on all questions. When applying AMR to the minus-one-model cross validation results, we see similar results: the ranking model reproduces the human judges' rankings. Therefore, we suggest that the ranking model can be used to evaluate a new simulation model by ranking it against several old models. Since our testing corpus is relatively small, we would like to confirm this result on a larger corpus and on other dialog systems in the future.

Table 8: AMR scores for regular cross validation
         real              clu               cor               ran
         human  predicted  human  predicted  human  predicted  human  predicted
d_TUR    0.68   0.62       0.65   0.59       0.63   0.52       0.51   0.49
d_QLT    0.75   0.71       0.71   0.63       0.69   0.61       0.48   0.50
d_PAT    0.66   0.65       0.60   0.60       0.58   0.57       0.31   0.32

7 Conclusion and Future Work

Automatic evaluation measures are used in evaluating simulated dialog corpora. In this study, we investigate a set of previously proposed automatic measures by comparing the conclusions drawn from these measures with human judgments. These measures are considered valid if the conclusions drawn from them agree with human judgments. We use a tutoring dialog corpus with real students and three simulated dialog corpora generated by three different simulation models trained from the real corpus. Human judges are recruited to read the dialog transcripts and rate the dialogs by answering different utterance and dialog level questions. We observe low agreement among the human judges' ratings. However, the overall human ratings give consistent rankings of the quality of the real and simulated user models. Therefore, we build a ranking model which successfully mimics human judgments using previously proposed automatic measures. We suggest that the ranking model can be used to rank new simulation models against the old models in order to assess the quality of a new model.

In the future, we would like to test the ranking model on larger dialog corpora generated by more simulation models. We would also like to include more automatic measures that may be available in richer corpora to improve the ranking and regression models.

Acknowledgments

This work is supported by NSF 0325054. We thank J. Tetreault, M. Rotaru, K. Forbes-Riley and the anonymous reviewers for their insightful suggestions, F. Mairesse for helping with RankBoost, and S. Silliman for his help with the survey experiment.

References

H. Ai and D. Litman. 2007. Knowledge Consistent User Simulations for Dialog Systems. In Proc. of Interspeech 2007.
G. Chung. 2004. Developing a Flexible Spoken Dialog System Using Simulation. In Proc. of ACL 04.
Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. 2003. An Efficient Boosting Algorithm for Combining Preferences. Journal of Machine Learning Research.
H. P. Grice. 1975. Logic and Conversation. Syntax and Semantics III: Speech Acts, 41-58.
E. Levin, R. Pieraccini, and W. Eckert. 2000. A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies. IEEE Trans. on Speech and Audio Processing, 8(1):11-23.
Linguistic Data Consortium. 2005. Linguistic Data Annotation Specification: Assessment of Fluency and Adequacy in Translations.
F. Mairesse, M. Walker, M. Mehl, and R. Moore. 2007. Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text. Journal of Artificial Intelligence Research, Vol 30, pp 457-501.
K. A. Papineni, S. Roukos, R. T. Ward, and W-J. Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proc. of 40th ACL.
J. Schatzmann, K. Georgila, and S. Young. 2005. Quantitative Evaluation of User Simulation Techniques for Spoken Dialog Systems. In Proc. of 6th SIGdial.
K. Scheffler. 2002. Automatic Design of Spoken Dialog Systems. Ph.D. diss., Cambridge University.
J. D. Williams. 2007. A Method for Evaluating and Comparing User Simulations: The Cramer-von Mises Divergence. In Proc. of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
M. Walker, D. Litman, C. Kamm, and A. Abella. 1997. PARADISE: A Framework for Evaluating Spoken Dialog Agents. In Proc. of ACL 97.
M. Walker, O. Rambow, and M. Rogati. 2001. SPoT: A Trainable Sentence Planner. In Proc. of NAACL 01.