RESEARCH AND APPLY EVOLUTIONARY COMPUTATION TECHNIQUES ON AUTOMATIC TEXT SUMMARIZATION


VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

DO THUY DUONG

RESEARCH AND APPLY EVOLUTIONARY COMPUTATION TECHNIQUES ON AUTOMATIC TEXT SUMMARIZATION

Field: Information Technology
Major: Software Engineering
Code: 60480103

MASTER THESIS IN INFORMATION TECHNOLOGY

Supervisor: Assoc. Prof. Nguyen Xuan Hoai

HANOI - 2015

Declaration of authorship

I, Do Thuy Duong, declare that this thesis, 'Research and apply evolutionary computation techniques on automatic text summarization', and the work presented in it are my own. I confirm that:
- This work was done wholly or mainly while in candidature for a research degree at this University;
- Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated;
- Where I have consulted the published work of others, this is always clearly attributed;
- I have acknowledged all main sources of help;
- Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed: ……………………………………………………………………………………
Date: ……………………………………………………………………………………

Acknowledgements

I am heartily thankful to my supervisor, Prof. Nguyen Xuan Hoai, whose encouragement, guidance and support from the initial to the final stage have enabled me to develop an understanding of the topic. I would also like to show my gratitude to the teachers of the University of Engineering and Technology, Vietnam National University, Hanoi, for helping me to gain a large body of knowledge during my two years of study. Lastly, I offer my regards and blessings to my friends and my family, who have always encouraged me so that I could finish this challenging research.

Contents

Declaration of authorship
Acknowledgements
Contents
List of figures
List of tables
Chapter 1. Introduction
  1.1 Motivation
  1.2 Research Objectives
  1.3 Thesis overview
Chapter 2. Background knowledge
  2.1 Automatic text summarization
    2.1.1 Definition
    2.1.2 Types of text summarization
    2.1.3 Methodologies for automatic text summarization
  2.2 Evolutionary computation
  2.3 Differential evolution (DE)
  2.4 Conclusion
Chapter 3. Automatic text summarization using differential evolution algorithm
  3.1 Automatic text summarization using differential evolution (DE)
    3.1.1 Document collection representation
    3.1.2 Objective/fitness function
    3.1.3 Main steps of differential evolution
    3.1.4 Experiment, result and discussion
  3.2 Improvement
    3.2.1 Method
    3.2.2 Experiment, result and discussion
  3.3 Conclusion
Chapter 4. Conclusion and future work
  4.1 Contributions
  4.2 Future work
References

List of figures

Figure 2.1 A typical summarization system
Figure 2.2 A summarizer highlights all sentences included in an extractive summary
Figure 2.3 An example of an abstract summary
Figure 2.4 Multi-document summarization
Figure 2.5 The general scheme of an evolutionary algorithm in pseudo-code
Figure 2.6 General scheme of evolutionary algorithms
Figure 2.7 Correlation between the number of generations and the best fitness in the population
Figure 2.8 Steps of the differential evolution algorithm
Figure 2.9 Steps to get the next X1 (generation 1)
Figure 3.1 Illustration of the mutation operation
Figure 3.2 Illustration of the crossover operation
Figure 3.3 Changes in summary length with the [DE] method on DUC2004
Figure 3.4 Changes in summary length with the [DE] method on DUC2007
Figure 3.5 Summary length with the [MultiDE] method on DUC2004
Figure 3.6 Summary length with the [MultiDE] method on DUC2007
Figure 3.7 Comparison between F-values of [DE] and [MultiDE] on DUC2004
Figure 3.8 Comparison between F-values of [DE] and [MultiDE] on DUC2007

List of tables

Table 2.1 The basics of evolutionary computation, linking natural evolution to problem solving
Table 2.2 Fitness of six individuals at a generation
Table 2.3 Creation of mutant vector V1
Table 2.4 Creation of trial vector Z1
Table 2.5 Values of X1 in a generation
Table 3.1 Description of the datasets used in the experiment
Table 3.2 Parameter settings of the first experiment
Table 3.3 Summary lengths of some document collections in DUC2004 using the [DE] method
Table 3.4 Summary lengths of some document collections in DUC2007 using the [DE] method
Table 3.5 F-values of three evaluation measures of the [DE] method on DUC2004 and DUC2007
Table 3.6 Parameter settings of the second experiment
Table 3.7 Summary lengths of some document collections in DUC2004 using the [MultiDE] method
Table 3.8 Summary lengths of some document collections in DUC2007 using the [MultiDE] method
Table 3.9 F-values of three evaluation measures of the [MultiDE] method on DUC2004 and DUC2007

Chapter 1. Introduction

Automatic text summarization means detecting the important, condensed content of one or more documents. This is a very challenging problem, related to many scientific areas such as artificial intelligence, statistics and linguistics. A great deal of research has been conducted worldwide since the 1950s and has produced systems such as SUMMARIST, SweSUM, MEAD and SUMMONS. However, this research area remains challenging and attracts more and more attention. In this thesis, we study some evolutionary computation techniques and then apply the differential evolution algorithm to a practical problem: automatic text summarization, in particular multi-document summarization. Moreover, we also attempt to deal with the constraint on the summary length, which has not been handled effectively in these stochastic population-based methods.

1.1 Motivation

Evolutionary computation techniques use different algorithms to evolve a population of individuals over a number of generations. Operators such as mutation, crossover and selection are applied to the population to reproduce new offspring, which then compete with each other and with the previous generation for survival, based on some evaluation function. The process ends when a stopping criterion is reached, and the best individual found is taken as the solution to the real-world problem. Evolutionary algorithms have been applied to numerous problems in various fields, one of which is automatic text summarization. However, unlike other sentence-ranking methods, we have found that they have a weakness in handling the summary length. Therefore, this research attempts to improve this aspect of these algorithms.
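To make the loop just described concrete, the following is a minimal, generic sketch of an evolutionary algorithm in Python. It is an illustration only, not the algorithm used later in this thesis: the real-valued representation, the initialisation range and the `fitness`, `mutate` and `crossover` callables are placeholder assumptions.

```python
import random

def evolve(fitness, mutate, crossover, pop_size=50, genome_len=10, max_generations=1000):
    """Generic evolutionary loop: variation (crossover, mutation), then survivor selection."""
    # Initialise a random population of real-valued candidate solutions.
    population = [[random.uniform(-5.0, 5.0) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    for _ in range(max_generations):                 # stopping criterion: generation budget
        offspring = []
        for parent in population:
            mate = random.choice(population)         # pick a second parent
            child = mutate(crossover(parent, mate))  # reproduce a new offspring
            offspring.append(child)
        # Offspring compete with the previous generation; the fittest individuals survive.
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
    return max(population, key=fitness)              # best individual found so far
```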
1.2 Research Objectives

This thesis aims to study evolutionary computation techniques, especially the differential evolution algorithm, and their application to the problem of automatic text summarization. We identify the limitations of the ways other researchers have handled the summary length with this algorithm, and then propose a new method to manage this length constraint so that it satisfies users' demands while still keeping the quality of the summary.

1.3 Thesis overview

The rest of this thesis is organized as follows. Chapter 2 reviews the background knowledge of text summarization and its classification, and introduces the main principles of evolutionary computation; in particular, the differential evolution algorithm is discussed. Chapter 3 explains in detail how this algorithm is applied to automatic text summarization, in our case on multi-document collections. An experiment is then performed to test the original differential evolution algorithm. We also improve on the result of that experiment by dealing with the summary length so that the document collection is compressed quickly and effectively. Chapter 4 recapitulates the thesis, presents our contributions and states some future research directions in this field.

Evaluation measures

We use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) package and take the average F-value to evaluate and compare our summaries [10, 11]. Several terms are related to summary evaluation: Precision, Recall and F-value.

Precision = correct / (correct + wrong)   (14)
Recall = correct / (correct + missed)   (15)

where correct is the number of text units extracted by both the system and the human, wrong is the number of text units extracted by the system but not by the human, and missed is the number of text units extracted by the human but not by the system. Precision therefore reflects the percentage of the system's extracted sentences that were good, and Recall reflects the percentage of the good sentences that the system managed to extract. In simpler terms, a high recall means you have not missed anything, but you may have a lot of useless results to sift through (which would imply low precision); high precision means that everything returned was relevant, but you might not have found all the relevant items (which would imply low recall).

The F-value is the harmonic mean of Precision and Recall, best at 1 and worst at 0:

F = 2 x Precision x Recall / (Precision + Recall)   (16)

The F-value is always a number between the values of recall and precision, and is higher when recall and precision are closer together. In this work we use three ROUGE measures: ROUGE-N, where N is the length of the n-gram (ROUGE-1: unigrams/single words; ROUGE-2: bigrams/word pairs), and ROUGE-L (longest common subsequence), and we take the F-value from the ROUGE output to compare summaries.
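As a concrete reading of equations (14)-(16), the helper below computes Precision, Recall and the F-value from a system extract and a human reference extract. It is a simplified sketch over sets of text units, not the ROUGE package itself; the function name and the set-based interface are our own assumptions.

```python
def precision_recall_f(system_units, reference_units):
    """Precision, Recall and F-value over extracted text units, per equations (14)-(16)."""
    system, reference = set(system_units), set(reference_units)
    correct = len(system & reference)    # extracted by both the system and the human
    wrong = len(system - reference)      # extracted by the system but not by the human
    missed = len(reference - system)     # extracted by the human but not by the system
    precision = correct / (correct + wrong) if (correct + wrong) else 0.0
    recall = correct / (correct + missed) if (correct + missed) else 0.0
    f_value = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f_value

# Example: 2 of the 3 system sentences also appear in the reference extract.
print(precision_recall_f({"s1", "s2", "s3"}, {"s1", "s2", "s4"}))  # ~(0.667, 0.667, 0.667)
```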
Experimental settings

Parameter                                    DUC2004   DUC2007
Population size P                            50        50
Number of generations tmax                   1000      1000
umin                                         -5        -5
umax                                         5         5
F                                            0.6       0.6
CR                                           0.7       0.7
Number of runs                               20        20
Goal: number of sentences in the summary     6         12

Table 3.2 Parameter settings of the first experiment

Table 3.2 lists all the parameters that need to be assigned values. Because this is a stochastic population-based algorithm, we run the program 20 times and take the mean value as the final result. These parameters all follow the settings of the experiments in [5].
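To show what the parameters in Table 3.2 control, here is a minimal sketch of one generation of the textbook DE/rand/1/bin scheme: mutation with factor F, binomial crossover with rate CR, greedy selection, and components clamped to [umin, umax]. It is a generic illustration under the parameter values above, not necessarily the exact DE variant used in [5] or in this thesis.

```python
import random

def de_generation(population, fitness, F=0.6, CR=0.7, umin=-5.0, umax=5.0):
    """One generation of the classic DE/rand/1/bin scheme (generic sketch)."""
    dim = len(population[0])
    next_population = []
    for i, target in enumerate(population):
        # Mutation: combine three distinct individuals (none equal to the target)
        # with the mutation factor F, then clamp each component to [umin, umax].
        a, b, c = random.sample([x for j, x in enumerate(population) if j != i], 3)
        mutant = [min(max(a[d] + F * (b[d] - c[d]), umin), umax) for d in range(dim)]
        # Binomial crossover: take each component from the mutant with probability CR,
        # forcing at least one component (j_rand) to come from the mutant.
        j_rand = random.randrange(dim)
        trial = [mutant[d] if (random.random() < CR or d == j_rand) else target[d]
                 for d in range(dim)]
        # Selection: the trial vector replaces the target only if it is at least as fit.
        next_population.append(trial if fitness(trial) >= fitness(target) else target)
    return next_population
```

Repeating this step tmax times over a population of P vectors corresponds to the generation budget in the table.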
Result and discussion

After summarizing all the collections and obtaining the ROUGE output, we take the average of the F-values as well as the summary length over the generations. To show how the summary length changes during the process, and how long the algorithm needs to produce a summary, we choose a typical document collection of 212 sentences from DUC2004 and a 507-sentence collection from DUC2007.

Figure 3.3 Changes in summary length with the [DE] method on DUC2004

Figure 3.3 shows how the summary length changes over 1000 generations on DUC2004. The algorithm needs 135 minutes to compress a collection of 212 sentences to a summary of 25 sentences over 1000 generations. Moreover, the length decreases more and more slowly: the summary shrinks from 92 sentences to 37 sentences by generation 500, yet the resulting length at generation 1000 is still considerably large at 25 sentences.

Document collection   Original length   Summary length
d30001t               212               25
d30006t               408               74
d30011t               250               34
d30033t               642               131

Table 3.3 Summary lengths of some document collections in DUC2004 using the [DE] method

Table 3.3 presents the summary lengths of some randomly chosen document collections in DUC2004. As we can see, none of the summary lengths satisfies the target summary length.

Figure 3.4 Changes in summary length with the [DE] method on DUC2007

Figure 3.4 illustrates the run of the differential evolution algorithm on DUC2007. It takes 204 minutes to finish 1000 generations, and the length decreases from 230 sentences at the start to 119 sentences at the end; that is, the algorithm compresses the document collection of 507 sentences to a summary of 119 sentences over 1000 iterations. Again, the length decreases more slowly at the end of the run than at the beginning: the summary shrinks from 230 to 139 sentences over the first 500 generations, but only from 139 to 119 sentences over the next 500 generations. Clearly, this method is not effective at reducing the summary length.

Document collection   Original length   Summary length
D0704                 255               39
D0705                 330               58
D0706                 462               103
D0711                 507               119

Table 3.4 Summary lengths of some document collections in DUC2007 using the [DE] method

Table 3.4 shows the summary lengths of some randomly chosen document collections in DUC2007, confirming that the summaries are not shortened sufficiently, since the objective is 12-sentence summaries. The next thing to consider is the summary quality. Table 3.5 lists the F-values corresponding to the three ROUGE measures, ROUGE-1, ROUGE-2 and ROUGE-L, on DUC2004 and DUC2007.

Measure    DUC2004   DUC2007
ROUGE-1    0.204     0.138
ROUGE-2    0.051     0.057
ROUGE-L    0.157     0.120

Table 3.5 F-values of three evaluation measures of the [DE] method on DUC2004 and DUC2007

3.2 Improvement

3.2.1 Method

This section describes our suggestion for improving the [DE] method of section 3.1.4. As we saw, the summary length decreases very slowly: summarizing a collection of 507 sentences into a summary of about 120 sentences took 204 minutes, while our goal is a summary of 12 sentences. Moreover, the F-scores are not very high; this is due to the summary length, which means the summary contains a large number of unimportant sentences.

In other ranking methods for sentence extraction, each sentence can be evaluated separately and given its own score. The compression rate of the summary is then not a big problem, because we can simply take sentences from the top of the ranking downwards. In the present stochastic population-based method, on the other hand, solutions are generated by applying operators, so we cannot control the length as efficiently as the above-mentioned methods can. All of these disadvantages encourage us to propose a new method that controls the summary length better. The disadvantages are:
- it takes a very long time to summarize a document collection containing a large number of sentences;
- the summary length is reduced more and more slowly during the summarization process;
- the F-values are low when our summaries are compared with the experts' summaries.

Our idea is to use multi-step summarization: we summarize the previously returned summary again until we obtain a satisfying summary length. The reason is that the summary length always drops dramatically in the early generations, so if we summarize the summary returned from the first round again, users can obtain a satisfying summary length very quickly. Concretely, we reduce the number of generations from 1000 to 150 on DUC2004 and to 100 on DUC2007, while all other parameters of the first experiment remain unchanged. After the first run (100-150 generations) finishes, the resulting summary is summarized a second time; in other words, the summary goes through another 100-150 generations, again and again, until it satisfies the length constraint. The process ends whenever the resulting summary has a satisfying length. This method makes the search space smaller, which is why the search becomes much faster; thus the summarization time is shorter and we can control the length easily. We call this method [MultiDE] for short.
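A minimal sketch of the multi-step idea follows, assuming a `summarize_once` function that runs one short DE round (100-150 generations) and returns a reduced list of sentences. The function name, the `max_rounds` safeguard and the list-of-sentences interface are illustrative assumptions, not the thesis's actual code.

```python
def multi_step_summarize(sentences, target_length, summarize_once, max_rounds=20):
    """Re-summarize the previous round's summary until it meets the length goal."""
    summary = list(sentences)
    for _ in range(max_rounds):              # safeguard against a round that stops shrinking
        if len(summary) <= target_length:    # length constraint satisfied: stop
            break
        summary = summarize_once(summary)    # one short DE run (e.g. 100-150 generations)
    return summary
```

On DUC2004, `summarize_once` would be the [DE] summarizer run for 150 generations with `target_length = 6`; on DUC2007, 100 generations with `target_length = 12`.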
3.2.2 Experiment, result and discussion

Datasets

The datasets are the same as those used for the previous method, [DE].

Evaluation measures

The ROUGE package is again used to evaluate our results.

Experimental settings

Parameter                                    DUC2004   DUC2007
Population size P                            50        50
Number of generations tmax                   150       100
umin                                         -5        -5
umax                                         5         5
F                                            0.6       0.6
CR                                           0.7       0.7
Number of runs                               20        20
Goal: number of sentences in the summary     6         12

Table 3.6 Parameter settings of the second experiment

We run the program with the settings listed in Table 3.6 to obtain a summary, then continue summarizing the returned summary until we get a satisfying summary length.

Result and discussion

The results of our experiment are as follows.

Figure 3.5 Summary length with the [MultiDE] method on DUC2004
Figure 3.6 Summary length with the [MultiDE] method on DUC2007

Figures 3.5 and 3.6 demonstrate the application of multi-step summarization to differential evolution. The result is promising: 12 minutes are needed to get a 6-sentence summary on DUC2004, and 114 minutes to get a 12-sentence summary on DUC2007.

Document collection   Original length   Summary length
d30001t               212
d30006t               408
d30011t               250
d30033t               642

Table 3.7 Summary lengths of some document collections in DUC2004 using the [MultiDE] method

Document collection   Original length   Summary length
D0704                 255
D0705                 330
D0706                 462               12
D0711                 507               12

Table 3.8 Summary lengths of some document collections in DUC2007 using the [MultiDE] method

Tables 3.7 and 3.8 show the summary lengths of four randomly chosen document collections in DUC2004 and DUC2007 respectively, confirming that the summaries are shortened sufficiently. Table 3.9 presents the summary quality obtained with the differential evolution algorithm combined with the multi-step summarization method.

Measure    DUC2004   DUC2007
ROUGE-1    0.300     0.388
ROUGE-2    0.054     0.063
ROUGE-L    0.233     0.309

Table 3.9 F-values of three evaluation measures of the [MultiDE] method on DUC2004 and DUC2007

Overall, comparing the summary quality of the two methods, [DE] and [MultiDE], it is clear that when multi-step summarization is used, the quality of our summaries is closer to the experts' summaries. This advantage is shown in Figure 3.7 and Figure 3.8.

Figure 3.7 Comparison between F-values of [DE] and [MultiDE] on DUC2004
Figure 3.8 Comparison between F-values of [DE] and [MultiDE] on DUC2007

3.3 Conclusion

This chapter has presented the DE algorithm applied to automatic text summarization; two experiments were then carried out and compared to show the improvement in controlling the summary length. It is apparent that with our method the summary length satisfies the user's requirement quickly while the summary quality improves.

Chapter 4. Conclusion and future work

This chapter summarizes the contributions of this thesis and gives some future extensions.

4.1 Contributions

In this thesis, we have studied an evolutionary algorithm, differential evolution, and applied DE to a practical problem, automatic text summarization. A new method of handling the summary length has been proposed. In particular, 45 collections of 25 documents each from DUC2007 and 50 collections of 10 documents each from DUC2004 have been summarized with both the original and the improved DE. The summaries were then evaluated and compared with the experts' summaries. The results show that our proposed method works more effectively than the methods suggested earlier by other researchers.

4.2 Future work

We are going to study more evolutionary algorithms, such as genetic algorithms (GA) and genetic programming (GP), applying them to both single- and multi-document text summarization, as well as testing more methods of handling constraints, especially the summary length.
References

[1] Wikipedia, Evolutionary computation, http://en.wikipedia.org/wiki/Evolutionary_computation
[2] Talib S. Hussain, An Introduction to Evolutionary Computation, Department of Computing and Information Science, Queen's University, Kingston, Ont. K7L 3N6.
[3] A. E. Eiben, J. E. Smith, Introduction to Evolutionary Computing, Chapter.
[4] Rasim M. Alguliev, Ramiz M. Aliguliyev, Makrufa S. Hajirahimova, Chingiz A. Mehdiyev, MCMR: Maximum coverage and minimum redundant text summarization model, Expert Systems with Applications 38 (2011) 14514-14522.
[5] Rasim M. Alguliev, Ramiz M. Aliguliyev, Nijat R. Isazade, Multiple documents summarization based on evolutionary optimization algorithm, Expert Systems with Applications 40 (2013) 1675-1689.
[6] Differential Evolution Optimization, 2011, http://beyondtheblueeventhorizon.blogspot.com/2011/04/differentialevolution-optimization.html
[7] Vasan Arunachalam, Optimization using differential evolution, Department of Civil and Environmental Engineering, The University of Western Ontario, London, Ontario, Canada, July 2008.
[8] Differential Evolution (DE) for continuous function optimization, http://www1.icsi.berkeley.edu/~storn/code.html
[9] B. G. W. Craenen, A. E. Eiben, E. Marchiori, How to Handle Constraints with Evolutionary Algorithms.
[10] Chin-Yew Lin, ROUGE: A Package for Automatic Evaluation of Summaries, in Proceedings of the Workshop on Text Summarization Branches Out, post-conference workshop of ACL 2004, Barcelona, Spain.
[11] Josef Steinberger, Karel Jezek, Evaluation measures for text summarization, Computing and Informatics, Vol. 28, 2009, 1001-1026.
[12] Zbigniew Michalewicz, A survey of constraint handling techniques in evolutionary computation methods.
[13] Jim Smith, Introduction to evolutionary algorithms, University of the West of England, UK, June 2012.
[14] Wikipedia, Chebyshev polynomials, http://en.wikipedia.org/wiki/Chebyshev_polynomials
[15] Brian Hegerty, Chih-Cheng Hung, Kristen Kasprak, A Comparative Study on Differential Evolution and Genetic Algorithm for Some Combinatorial Problems, Southern Polytechnic State University, Marietta, GA 30060, USA.
[16] Ani Nenkova, Kathleen McKeown, Automatic Summarization, Foundations and Trends in Information Retrieval, Vol. 5, Nos. 2-3 (2011) 103-233.
[17] Huang, L., He, Y., Wei, F., & Li, W. (2010), Modeling document summarization as multi-objective optimization, in Proceedings of the Third International Symposium on Intelligent Information Technology and Security Informatics, Jinggangshan, China, pp. 382-386.
[18] Radev, D., Jing, H., Stys, M., & Tam, D. (2004), Centroid-based summarization of multiple documents, Information Processing & Management, 40(6), 919-938.
[19] Das, S., & Suganthan, P. N. (2011), Differential evolution: A survey of the state-of-the-art, IEEE Transactions on Evolutionary Computation, 15(1), 4-31.
[20] Yang, C. C., & Wang, F. L. (2008), Hierarchical summarization of large documents, Journal of the American Society for Information Science and Technology, 59(6), 887-902.
[21] Karel Jezek, Josef Steinberger, Automatic text summarization, Katedra informatiky a výpočetní techniky, FAV, ZČU - Západočeská Univerzita v Plzni, Univerzitní 22, 306 14 Plzeň.
[22] Ching-Wei Chien, Zhan-Rong Hsu, Wei-Ping Lee, Improving the performance of differential evolution algorithm with modified mutation factor, 2009 International Conference on Machine Learning and Computing, IPCSIT vol. 3 (2011), IACSIT Press, Singapore.
