Solving the Motif Finding Problem with an Ant Colony Algorithm

Document information

University of Engineering and Technology, Vietnam National University, Hanoi
ENTRY FOR THE "STUDENT SCIENTIFIC RESEARCH" AWARD, 2012
Title: Solving the Motif Finding Problem with an Ant Colony Algorithm
Students: Nguyễn Mạnh Hà (male), Nguyễn Hải Linh (male)
Class: K53CA
Faculty: Computer Science (KHMT)
Supervisors: Assoc. Prof. Dr. Hoàng Xuân Huấn, MSc. Đỗ Đức Đông

Abstract

A challenging problem in molecular biology is the identification of the specific binding sites of transcription factors in the promoter regions of genes, referred to as motifs. This report presents an Ant Colony Optimization approach that provides promising solutions to the motif finding problem. It searches both the space of starting positions and the space of motif patterns, which gives it more chances to discover potential motifs. The approach has been implemented and tested on several datasets, and its performance was compared with MEME, a very popular motif finding algorithm. Experimental results show that our approach achieves comparable or better motif accuracy within a reasonable computational time.

1 Introduction

Finding the location, in each sequence, of a common motif shared by a set of DNA sequences has become a fundamental problem in bioinformatics, with important applications in locating regulatory sites and in drug target identification [14]. The motif finding problem has been formally considered a difficult pattern recognition problem. Most motif finding algorithms use approximate or heuristic techniques to obtain near-optimal solutions at relatively low computational cost. Some of them search the space of possible starting positions, whereas others search the space of all possible motifs under a given model. Recent research covering most of the relevant techniques and approaches for motif finding, as well as several of the benchmark algorithms included in this work, can be found in [3,12].
Moreover, bio-inspired algorithms and other metaheuristics have also been proposed. Examples include genetic algorithms [6,8], genetic programming [15], and simulated annealing [9]. Although these methods have been shown to generate acceptable results in terms of the quality of the solutions found, the motif finding problem is still unsolved.

M. Dorigo et al. (Eds.): ANTS 2010, LNCS 6234, pp. 464–471, 2010. © Springer-Verlag Berlin Heidelberg 2010

One solution technique that responds to this challenge is a swarm-based approach built on a natural metaphor, called Ant Colony Optimization (ACO) [5,4]. ACO is a population-based stochastic search method inspired by the foraging behavior of ant colonies. This metaheuristic has been used successfully to compute the best-known solutions for a wide range of combinatorial optimization problems (see [5] for more details). In [11], an ACO algorithm was developed to find a better set of initial positions for the Gibbs sampler (GS) [10] in order to improve its efficiency in terms of computing time and score. However, it does not incorporate any form of heuristic information. Moreover, a specific ant colony system was used for predicting the MHC class II binders [7].

In this study, we present a new motif finding approach based on ant colony optimization, called SMMAS. It is derived from the MAX-MIN Ant System [13], one of the best performing variants of the ACO metaheuristic, and is combined with other techniques, such as local search, to improve the result. Unlike some other motif finding techniques, SMMAS searches both the space of starting positions and the space of motif patterns. Due to this feature, it has more chances to find potential motif patterns. Although our approach is also valid for protein sequences, we apply it only to DNA sequences.

The rest of the paper is organized as follows. In Section 2, the motif finding problem is formally introduced.
Section 3 reviews the general background on Ant Colony Optimization. Section 4 presents our new ACO algorithm for the motif finding problem. Section 5 describes the experiments conducted. Finally, concluding remarks are given in Section 6.

2 The Motif Finding Problem

2.1 Motif concept

A DNA motif is defined as a nucleic acid sequence pattern that has some biological significance, such as being a DNA binding site for a regulatory protein, i.e., a transcription factor. Normally, the pattern is fairly short (5 to 20 base pairs (bp) long) and is known to recur in different genes or several times within a gene. DNA motifs are often associated with structural motifs found in proteins. Motifs can occur on both strands of DNA, since transcription factors bind directly to the double-stranded DNA. Sequences can have zero, one, or multiple copies of a motif.

In addition to the common forms of DNA motifs, two special types are recognized: palindromic motifs and spaced dyad (gapped) motifs. A palindromic motif is a subsequence that is exactly the same as its own reverse complement, e.g., CACGTG. A spaced dyad motif consists of two smaller conserved sites separated by a spacer (gap). The spacer occurs in the middle of the motif because the transcription factor binds as a dimer. This means that the transcription factor is made of two subunits that have two separate contact points with the DNA sequence. The parts where the transcription factor binds to the DNA are conserved but are typically rather small (3–5 bp). These two contact points are separated by a non-conserved spacer, which is mostly of fixed length but may vary slightly.

2.2 Finding Motif Problem

Given a set of DNA sequences (promoter regions), the motif finding problem is the task of detecting overrepresented motifs, as well as conserved motifs from orthologous sequences, that are good candidates for being transcription factor binding sites.
A large number of algorithms for finding DNA motifs have been developed. Most of these algorithms are designed to deduce motifs by considering the regulatory region (promoter) of several coregulated genes from a single genome. It is assumed that coexpression of genes arises mainly from transcriptional coregulation. As coregulated genes are known to share some similarities in their regulatory mechanism, possibly at the transcriptional level, their promoter regions might contain common motifs that are binding sites for transcription factors. A sensible approach to detecting these regulatory elements is to search for statistically overrepresented motifs in the promoter regions of such a set of coexpressed genes. A statistically overrepresented motif is one that occurs more often than would be expected by chance. These algorithms therefore search for overrepresented motifs in this collection of promoter sequences.

However, most of these motif finding algorithms have been shown to work successfully in yeast and other lower organisms, but they perform significantly worse in higher organisms. To overcome this difficulty, recent motif finding algorithms take advantage of cross-species genome comparison, or phylogenetic footprinting. The simple premise underlying phylogenetic footprinting is that selective pressure causes functional elements to evolve at a slower rate than non-functional sequences. This means that sites that are well conserved among a set of orthologous promoter regions are usually excellent candidates for functional regulatory elements, or motifs. Several motif finding algorithms have been developed based on phylogenetic footprinting. Most recently, algorithms that integrate DNA sequence data from coregulated genes with phylogenetic footprinting have significantly improved motif finding from genomic sequences.

Figure 1.1: Example of Motif
Efforts have also focused on developing algorithms that incorporate parameters useful for motif finding in higher organisms. Stormo presented an excellent history of the development and application of computer algorithms for DNA motif finding. Since then, development has been remarkably rapid, and a large number of DNA motif finding algorithms have been developed and published.

In bioinformatics, this problem can be described formally as follows. Given a set of DNA sequences S = {S_1, S_2, ..., S_N} of common length W, find the promising motif pattern X = x_1 x_2 ... x_i ... x_l of length l, with x_i ∈ {A, T, C, G}, and the starting locations of its occurrences in all sequences in S. The selection of a particular motif pattern is based on a defined score function that measures the similarity between the motif pattern and its occurrences. There are several methods for scoring a motif pattern. Our proposed approach uses the consensus score and the information content as score functions.

To illustrate how to compute these score functions, consider a candidate motif pattern generated by choosing a random position in each sequence. The patterns starting at these positions are aligned to form an N × l alignment matrix. The candidate motif pattern can then be represented by a count-based profile C, where C(i, j) is the count of nucleotide i in column j of the alignment matrix, and its corresponding consensus score (CSc) is defined as:

\[ CSc(C) = \sum_{j=1}^{l} \max_{i \in \{A, T, C, G\}} C(i, j) \]

The information content (IC) score function can be computed as follows:

\[ IC(Q) = \sum_{j=1}^{l} \sum_{i \in \{A, T, C, G\}} Q(i, j) \log \frac{Q(i, j)}{B_0(i)} \]

where each element Q(i, j) indicates the frequency of nucleotide i in position j of the motif pattern and B_0(i) denotes its background frequency, i.e., the observed frequency of nucleotide i over all sequences in the dataset.
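The two score functions can be illustrated with a minimal Python sketch. It assumes the usual forms of these scores: the consensus score as the sum of the largest count in each column, and the information content as the sum of Q(i, j) log(Q(i, j) / B_0(i)) over all cells. The function names, the base-2 logarithm, and the convention that cells with zero frequency contribute nothing are assumptions made for illustration and are not taken from the report.

```python
import math

NUCLEOTIDES = "ACGT"


def count_profile(occurrences):
    """Build the count profile C: profile[i][j] is the number of
    occurrences that have nucleotide i in column j."""
    length = len(occurrences[0])
    profile = {n: [0] * length for n in NUCLEOTIDES}
    for occurrence in occurrences:
        for j, nucleotide in enumerate(occurrence):
            profile[nucleotide][j] += 1
    return profile


def consensus_score(profile):
    """Sum, over all columns, of the largest nucleotide count."""
    length = len(profile["A"])
    return sum(max(profile[n][j] for n in NUCLEOTIDES) for j in range(length))


def information_content(profile, background):
    """Sum of Q(i, j) * log2(Q(i, j) / B0(i)); base 2 is an assumption."""
    n_occurrences = sum(profile[n][0] for n in NUCLEOTIDES)
    length = len(profile["A"])
    total = 0.0
    for j in range(length):
        for n in NUCLEOTIDES:
            q = profile[n][j] / n_occurrences
            if q > 0:  # zero-frequency cells are skipped (treated as 0)
                total += q * math.log2(q / background[n])
    return total


# Three aligned occurrences of length 4 and a uniform background (toy data).
example = count_profile(["ACGT", "ACGA", "ACCT"])
print(consensus_score(example))
print(information_content(example, {n: 0.25 for n in NUCLEOTIDES}))
```

In this toy alignment the first two columns are fully conserved and the last two each have a majority count of 2, so the consensus score is 3 + 3 + 2 + 2 = 10.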
3 Ant Colony Optimization Algorithm

3.1 ACO algorithms

ACO algorithms make use of simple agents called ants, which iteratively construct candidate solutions to a combinatorial optimization problem. The ants' solution construction is guided by (artificial) pheromone trails and problem-dependent heuristic information. In principle, ACO algorithms can be applied to any combinatorial optimization problem by defining solution components which the ants use to iteratively construct candidate solutions and on which they may deposit pheromone (see [10,11] for more details). An individual ant constructs a candidate solution by starting with an empty solution and then iteratively adding solution components until a complete candidate solution is generated. We call each point at which an ant has to decide which solution component to add to its current partial solution a choice point.

After solution construction is completed, the ants give feedback on the solutions they have constructed by depositing pheromone on the solution components they have used. Typically, solution components that are part of better solutions or are used by many ants receive a higher amount of pheromone and, hence, are more likely to be used by the ants in future iterations of the algorithm. To keep the search from getting stuck, all pheromone trails are typically decreased by a factor ρ before they are reinforced.

The ants' solutions are not guaranteed to be optimal with respect to local changes and hence may be further improved using local search methods. Based on this observation, the best performing ACO algorithms for many NP-hard static combinatorial problems are in fact hybrid algorithms that combine probabilistic solution construction by a colony of ants with local search algorithms.
In such hybrid algorithms, the ants can be seen as guiding the local search by constructing promising initial solutions, because ants preferably use solution components that, earlier in the search, have been contained in good locally optimal solutions.

In general, all ACO algorithms for static combinatorial problems follow the algorithmic scheme outlined in Figure 3.1. After the initialization of the pheromone trails and some parameters, a main loop is repeated until a termination condition is met, which may be a certain number of solution constructions or a given CPU-time limit. In the main loop, the ants first construct feasible solutions, then the generated solutions are possibly improved by applying local search, and finally the pheromone trails are updated. It should be noted that the ACO metaheuristic is more general than the algorithmic scheme given here.

procedure ACO algorithm for static combinatorial problems
    Set parameters, initialize pheromone trails
    while (termination condition not met) do
        ConstructSolutions
        ApplyLocalSearch    % optional
        UpdateTrails
    end
end

Figure 3.1: Algorithmic skeleton for ACO algorithms applied to static combinatorial problems.

Update of Pheromone Trails

After all the ants have constructed their tours, the pheromone trails are updated. This is done by first lowering the pheromone value on all arcs by a constant factor, and then adding pheromone on the arcs the ants have crossed in their tours. Pheromone evaporation is implemented by

\[ \tau_{ij} \leftarrow (1 - \rho)\,\tau_{ij} \quad \text{for all arcs } (i, j) \]

where 0 < ρ ≤ 1 is the pheromone evaporation rate. The parameter ρ is used to avoid unlimited accumulation of the pheromone trails, and it enables the algorithm to "forget" bad decisions previously taken. In fact, if an arc is not chosen by the ants, its associated pheromone value decreases exponentially in the number of iterations.
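The exponential decay mentioned above can be seen in a small Python sketch. It assumes the standard evaporation rule τ ← (1 − ρ) τ applied to every arc; the dictionary representation of the trails and the function name `evaporate` are hypothetical choices made for illustration.

```python
def evaporate(pheromone, rho):
    """Return new trail values after one evaporation step,
    tau <- (1 - rho) * tau, applied to every arc."""
    if not 0 < rho <= 1:
        raise ValueError("the evaporation rate must satisfy 0 < rho <= 1")
    return {arc: (1 - rho) * tau for arc, tau in pheromone.items()}


# An arc that is never reinforced loses a constant fraction of its
# pheromone in every iteration, i.e. it decays as (1 - rho) ** t.
trails = {("a", "b"): 1.0}
for iteration in range(1, 4):
    trails = evaporate(trails, rho=0.5)
    print(iteration, trails[("a", "b")])
```

With ρ = 0.5 the value is halved in each step, so after t iterations without reinforcement the trail holds (1 − ρ)^t of its initial pheromone.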
After evaporation, all ants deposit pheromone on the arcs they have crossed in their tour:

\[ \tau_{ij} \leftarrow \tau_{ij} + \sum_{k=1}^{m} \Delta\tau_{ij}^{k} \quad \text{for all arcs } (i, j) \]

where m is the number of ants and Δτ_ij^k is the amount of pheromone ant k deposits on the arcs it has visited. It is defined as follows:

\[ \Delta\tau_{ij}^{k} = \begin{cases} 1 / C^{k} & \text{if arc } (i, j) \text{ belongs to } T^{k} \\ 0 & \text{otherwise} \end{cases} \]

where C^k, the length of the tour T^k built by the k-th ant, is computed as the sum of the lengths of the arcs belonging to T^k. By this definition, the better an ant's tour is, the more pheromone the arcs belonging to this tour receive. In general, arcs that are used by many ants and that are part of short tours receive more pheromone and are therefore more likely to be chosen by ants in future iterations of the algorithm.

As noted, the relative performance of Ant System (AS) compared to other metaheuristics tends to decrease dramatically as the size of the test instance increases. Therefore, a substantial amount of research on ACO has focused on how to improve AS.

3.2 MAX–MIN Ant System

MAX–MIN Ant System (MMAS) (Stützle & Hoos, 1997, 2000; Stützle, 1999) introduces four main modifications with respect to AS. First, it strongly exploits the best tours found: only the iteration-best ant, that is, the ant that produced the best tour in the current iteration, or the best-so-far ant is allowed to deposit pheromone. Unfortunately, such a strategy may lead to a stagnation situation in which all the ants follow the same tour, because of the excessive growth of pheromone trails on the arcs of a good, although suboptimal, tour. To counteract this effect, a second modification introduced by MMAS is that it limits the possible range of pheromone trail values to the interval [τ_min, τ_max]. Third, the pheromone trails are initialized to the upper pheromone trail limit, which, together with a small pheromone evaporation rate, increases the exploration of tours at the start of the search.
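The first two MMAS modifications (deposit by a single best ant, and trail limits) can be sketched in Python as follows. This is a minimal illustration, not the report's implementation: the function name `mmas_update`, the dictionary representation of the trails, and the deposit amount 1 / (length of the best tour), taken from the standard MMAS formulation, are assumptions.

```python
def mmas_update(pheromone, best_tour_arcs, best_tour_length,
                rho, tau_min, tau_max):
    """One MMAS-style trail update: evaporate every arc, let only the
    chosen best ant deposit, then clamp to [tau_min, tau_max]."""
    deposit = 1.0 / best_tour_length  # assumed deposit amount
    updated = {}
    for arc, tau in pheromone.items():
        tau = (1 - rho) * tau          # evaporation, as in AS
        if arc in best_tour_arcs:      # only the best ant reinforces
            tau += deposit
        updated[arc] = min(tau_max, max(tau_min, tau))
    return updated


# Toy example: two arcs on the best tour, one arc that is never used.
trails = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 0.1}
trails = mmas_update(trails, {("a", "b"), ("b", "c")}, best_tour_length=4.0,
                     rho=0.5, tau_min=0.1, tau_max=1.0)
print(trails)
```

The lower limit keeps the unused arc ("a", "c") selectable, and the upper limit stops the arcs of a good but suboptimal tour from dominating, which is the stagnation effect the limits are meant to counteract.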
Finally, in MMAS, pheromone trails are reinitialized each time the system approaches stagnation or when no improved tour has been generated for a certain number of consecutive iterations.

Update of Pheromone Trails

After all ants have constructed a tour, pheromones are updated by applying evaporation as in AS, followed by the deposit of new pheromone as follows:

\[ \tau_{ij} \leftarrow \tau_{ij} + \Delta\tau_{ij}^{best} \]

where Δτ_ij^best = 1/C^best. The ant that is allowed to add pheromone may be either the best-so-far ant, in which case Δτ_ij^best = 1/C^bs, or the iteration-best ant, in which case Δτ_ij^best = 1/C^ib, where C^bs and C^ib are the lengths of the best-so-far tour and the iteration-best tour [...]

... motif Search 10 Motif Best result Runtime(s) Score Runtime(s) 119 4.54 125 22.4 124 125 39.86 Average 119.4 22.4 124.1
Table 5.1: Motif length is 7

MEME SMMAS Length of Motif = 10 Score Search 1 motif Search 10 Motif Best result Runtime(s) Score Runtime(s) 158 4.61 160 83.15 158 160 39.42 Average 153.5 83.15 158.4
Table 5.2: Motif length is 10

MEME SMMAS Length of Motif = 13 Score Search 1 motif ...

NoA = 0 NoA = 1 NoA = 2 NoA = 3 Length of Motif = 7 Score Search 1 motif Search 10 Motif Best result Time Score Time Score Time Score Time 118 15.8 121 18.3 125 22.4 125 28.9 119 123 15.8 Average 118.7 125 125 18.3 122.1 22.4 124.1 28.9 124.2
Table 5.4: Motif length is 7

NoA = 0 NoA = 1 NoA = 2 NoA = 3 Length of Motif = 10 Score Search 1 motif Search 10 Motif Best result Time Score Time Score Time ...

... running time, the MEME algorithm has a shorter running time when finding one motif. However, when finding more than one motif, its running time increases dramatically (runtime = runtime for finding one motif × number of different motifs). With the SMMAS algorithm, finding one motif or many different motifs takes the same running time. Moreover, the balance of the average score of the SMMAS algorithm is very high. Tables ...
149.6 160 68.4 154.2 160 83.15 158.4 99.3 158.4
Table 5.5: Motif length is 10

NoA = 0 NoA = 1 NoA = 2 NoA = 3 Length of Motif = 13 Score Search 1 motif Search 10 Motif Best result Time Score Time Score Time Score Time 183 104.13 187 128.42 193 147.53 193 172.62 183 188 104.13 Average 182.8 193 128.42 187.9 193 147.53 190.3 172.62 190.5
Table 5.6: Motif length is 13

6 Conclusions

In this paper, ...

... Search 10 Motif Best result Runtime(s) Score Runtime(s) 191 4.56 193 147.53 191 193 37.91 Average 180.2 147.53 190.3
Table 5.3: Motif length is 13

It can be seen that in all cases our SMMAS algorithm gives a better score than the MEME algorithm does, especially when finding 10 different motifs in one input. As for the running time, the MEME algorithm has a shorter running time when finding one motif. However, ...

... particular, we change each nucleotide of the current best motif into the three other nucleotides and recalculate its score. If the new solution has a better score, we use that motif to update the pheromone trails.

Figure 4.3: Example of local search

Local search is very effective when combined with ACO because, given a good solution, all subsequent ants tend to find their motifs in a better way. Although local search makes ...

... at motif positions j and j + 1, respectively. Initially, the pheromone values are all set to a constant value.

Figure 4.2: Each ant builds its solution by traversing the graph

4.2 Solution construction

Motif finding using ACO consists of a number of iterations of solution construction. Each ant incrementally builds its solution by traversing the graph to complete a tour representing one candidate motif. ...
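The local search step described in this part of the report (replacing each nucleotide of the current best motif by each of the three other nucleotides and rescoring) can be sketched in Python as below. This is a simplified single pass under stated assumptions: the function name `local_search`, the caller-supplied `score` function, and the toy scoring function are hypothetical, and the report's actual scoring and acceptance details are not shown in the surviving text.

```python
NUCLEOTIDES = "ACGT"


def local_search(motif, score):
    """Try every single-nucleotide substitution of `motif` (3 alternatives
    per position) and return the best-scoring motif with its score.
    `score` is any function mapping a motif string to a number."""
    best, best_score = motif, score(motif)
    for j, current in enumerate(motif):
        for nucleotide in NUCLEOTIDES:
            if nucleotide == current:
                continue
            candidate = motif[:j] + nucleotide + motif[j + 1:]
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score


def toy_score(motif):
    """Hypothetical score: number of positions matching 'ACGT'."""
    return sum(a == b for a, b in zip(motif, "ACGT"))


print(local_search("ACGA", toy_score))
```

A motif of length l has 3l single-substitution neighbors, so one pass costs 3l score evaluations. In the report, an improved motif found this way is then the one used to update the pheromone trails.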
... Optimization. We also proposed a promising new method for the motif finding problem and presented and discussed some experimental results. We compared our approach, SMMAS for the motif finding problem, with one of the most popular algorithms, MEME. The results show that our algorithm always computes better solutions, especially when finding many different motifs in one input. Our future work will focus on two things: ...

... decision to add solution component (nucleotide) i in position j of the motif pattern depends on the n previously visited nucleotides during the solution construction. Accordingly, the nth-order background model B_n is based on counting the frequency of all nucleotide subsequences of length (n + 1) in the dataset. Let x_1 x_2 ... x_{j−1} x_j be the partial motif pattern constructed by an ant that is at node (u, j), i.e., x_j = u. If ...

... position of the motif. In addition, an edge ei(u, v) always exists between two nodes, node (u, j) and node (v, j + 1), where u, v ∈ {A, T, C, G} and 1 ≤ j ≤ l − 1.

Figure 4.1: Constructed graph

In SMMAS, two types of pheromone trails are modeled. First, a pheromone trail τ1i, i ∈ {A, T, C, G}, is associated with each node (i, 1). The value τ1i encodes the desirability of nucleotide i being at the first position of the motif.
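The graph traversal described in this part of the report can be sketched in Python as follows. This is a deliberately reduced illustration that samples from pheromone values only: the first nucleotide is drawn in proportion to the node pheromone at position 1, and each following nucleotide in proportion to the pheromone on the edge from the current node. The heuristic information from the background model B_n is left out, and the names `construct_motif`, `tau_first`, and `tau_edge` are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"


def construct_motif(tau_first, tau_edge, length, rng):
    """Build one candidate motif by walking the construction graph.

    tau_first[u]       pheromone on node (u, 1)
    tau_edge[j][(u,v)] pheromone on the edge between the node for u at
                       0-based position j and the node for v at j + 1
    rng                a random.Random instance
    """
    weights = [tau_first[u] for u in NUCLEOTIDES]
    motif = rng.choices(NUCLEOTIDES, weights=weights, k=1)
    for j in range(length - 1):
        u = motif[-1]
        weights = [tau_edge[j][(u, v)] for v in NUCLEOTIDES]
        motif += rng.choices(NUCLEOTIDES, weights=weights, k=1)
    return "".join(motif)


# All trails start at the same constant value, so the first motifs
# are drawn uniformly at random.
length = 7
tau_first = {u: 1.0 for u in NUCLEOTIDES}
tau_edge = [{(u, v): 1.0 for u in NUCLEOTIDES for v in NUCLEOTIDES}
            for _ in range(length - 1)]
print(construct_motif(tau_first, tau_edge, length, random.Random(0)))
```

As trails on good nodes and edges are reinforced over iterations, the same sampling procedure increasingly favors the nucleotides used by high-scoring motifs.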

Date posted: 12/04/2014, 15:42
