20 A Review of Reinforcement Learning Methods

Oded Maimon and Shahar Cohen
Department of Industrial Engineering, Tel-Aviv University, Ramat-Aviv 69978, Israel, maimon@eng.tau.ac.il

Summary. Reinforcement-Learning is learning how to best-react to situations, through trial and error. In the Machine-Learning community Reinforcement-Learning is researched with respect to artificial (machine) decision-makers, referred to as agents. The agents are assumed to be situated within an environment which behaves as a Markov Decision Process. This chapter provides a brief introduction to Reinforcement-Learning, and establishes its relation to Data-Mining.
Specifically, the Reinforcement-Learning problem is defined; a few key ideas for solving it are described; the relevance to Data-Mining is explained; and an instructive example is presented.

Key words: Reinforcement-Learning

20.1 Introduction

Reinforcement-Learning (RL) is learning how to best-react to situations, through trial-and-error. The learning takes place as a decision-maker interacts with the environment she lives in. On a sequential basis, the decision-maker recognizes her state within the environment, and reacts by initiating an action. Consequently she obtains a reward signal, and enters another state. Both the reward and the next state are affected by the current state and the action taken. In the Machine Learning (ML) community, RL is researched with respect to artificial (machine) decision-makers, referred to as agents.

The mechanism that generates reward signals and introduces new states is referred to as the dynamics of the environment. As the RL agent begins learning, it is unfamiliar with that dynamics, and therefore initially it cannot correctly predict the outcome of actions. However, as the agent interacts with the environment and observes the actual consequences of its decisions, it can gradually adapt its behavior accordingly. Through learning the agent chooses actions according to a policy. A policy is a means of deciding which action to choose when encountering a certain state. A policy is optimal if it maximizes an agreed-upon return function. A return function is usually some sort of expected weighted sum over the sequence of rewards obtained while following a specified policy. Typically the objective of the RL agent is to find an optimal policy.

RL research has been continually advancing over the past three decades. The aim of this chapter is to provide a brief introduction to this exciting research, and to establish its relation to Data-Mining (DM). For a more comprehensive RL survey, the reader is referred to Kaelbling et al. (1996). For a comprehensive introduction to RL, see Sutton and Barto (1998). A rigorous presentation of RL can be found in Bertsekas and Tsitsiklis (1996).

The rest of this chapter is organized as follows. Section 20.2 formally describes the basic mathematical model of RL, and reviews some key results for this model. Section 20.3 introduces some of the principles of computational methods in RL. Section 20.4 describes some extensions to the basic RL model and computation methods. Section 20.5 reviews several applications of RL. Section 20.6 discusses RL from a DM perspective. Finally, Section 20.7 presents an example of how RL is used to solve a typical problem.

20.2 The Reinforcement-Learning Model

RL is based on a well-known model called the Markov Decision Process (MDP). An MDP is a tuple ⟨S, A, R, P⟩, where S is a set of states, A is a set of actions, R : S×A → ℜ is a mean-reward function and P : S×A×S → [0,1] is a state-transition function. (It is possible to allow different sets of actions for different states, i.e. letting A(s) be the set of allowable actions in state s for all s ∈ S; for ease of notation, it is assumed that all actions are allowed in all states.) An MDP evolves through discrete time stages. On stage t, the agent recognizes the state of its environment s_t ∈ S and reacts by choosing an action a_t ∈ A. Consequently it obtains a reward r_t, whose mean value is R(s_t, a_t), and its environment is transited to a new state s_{t+1} with probability P(s_t, a_t, s_{t+1}). The two sets - of states and of actions - may be finite or infinite. In this chapter, unless specified otherwise, both sets are assumed to be finite.
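To make the model concrete, the following Python sketch simulates a few stages of interaction with a hypothetical two-state, two-action MDP. The particular rewards and transition probabilities are illustrative only; they are not taken from the chapter.

import random

S = [0, 1]                              # set of states
A = [0, 1]                              # set of actions
R = {(0, 0): 1.0, (0, 1): 0.0,          # mean-reward function R(s, a)
     (1, 0): 0.0, (1, 1): 2.0}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],   # state-transition function:
     (1, 0): [0.7, 0.3], (1, 1): [0.1, 0.9]}   # P[(s, a)][s2] = P(s, a, s2)

s = random.choice(S)                    # an arbitrary initial state
for t in range(1, 6):                   # a few discrete stages
    a = random.choice(A)                # a uniformly random policy
    r = R[(s, a)] + random.gauss(0, 0.1)          # reward with mean R(s, a)
    s_next = random.choices(S, weights=P[(s, a)])[0]
    print(f"t={t}: s={s}, a={a}, r={r:.2f}, s'={s_next}")
    s = s_next

The RL agent sees only the sampled rewards and transitions produced by such a loop; R and P themselves remain hidden from it.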
The RL agent begins interacting with the environment without any knowledge of the mean-reward function or the state-transition function. Situated within its environment, the agent seeks an optimal policy. A policy π : S×A → [0,1] is a mapping from state-action pairs to probabilities. Namely, an agent that observes the state s and follows the policy π will choose the action a ∈ A with probability π(s,a). Deterministic policies are of particular interest. A deterministic policy π_d is a policy in which for any state s ∈ S there exists an action a ∈ A so that π_d(s,a) = 1, and π_d(s,a') = 0 for all a' ≠ a. A deterministic policy is therefore a mapping from states to actions. For a deterministic policy π_d, the action a = π_d(s) is the action for which π_d(s,a) = 1. The subscript "d" is added to differentiate deterministic from non-deterministic policies (when it is clear from the context that a policy is deterministic, the subscript is omitted).

A policy (either deterministic or not) is optimal if it maximizes some agreed-upon return function. The most common return function is the expected geometrically-discounted infinite sum of rewards. Considering this return function, the objective of the agent is defined as follows:

E[ ∑_{t=1}^{∞} γ^{t−1} r_t ] → max,    (20.1)

where γ ∈ (0,1) is a discount factor representing the extent to which the agent is willing to compromise immediate rewards for the sake of future rewards. The discount factor can be interpreted either as a means of capturing some characteristics of the problem (for example an economic interest-rate) or as a mathematical trick that makes RL problems more tractable. Other useful return functions are defined by the expected finite-horizon sum of rewards, and the expected long-run average reward (see Kaelbling et al., 1996). This chapter assumes that the return function is the expected geometrically-discounted infinite sum of rewards.

Given a policy π, the value of the state s is defined by:

V^π(s) = E_π[ ∑_{t=1}^{∞} γ^{t−1} r_t | s_1 = s ],   s ∈ S,    (20.2)

where the operator E_π represents expectation given that actions are chosen according to the policy π. The value of a state for a specific policy represents the return associated with following the policy from a specific initial state. Similarly, given the policy π, the value of a state-action pair is defined by:

Q^π(s,a) = E_π[ ∑_{t=1}^{∞} γ^{t−1} r_t | s_1 = s, a_1 = a ],   s ∈ S; a ∈ A.    (20.3)

The value of a state-action pair for a specific policy is the return associated with first choosing a specific action while being in a specific state, and thereafter choosing actions according to the policy. The optimal value of states is defined by:

V*(s) = max_π V^π(s),   s ∈ S.    (20.4)

A policy π* is optimal if it achieves the optimal values for all states, i.e. if:

V^{π*}(s) = V*(s),   ∀s ∈ S.    (20.5)

If π* is an optimal policy it also maximizes the value of all state-action pairs:

Q*(s,a) = Q^{π*}(s,a) = max_π Q^π(s,a),   ∀s ∈ S; a ∈ A.    (20.6)

A well-known result is that under the assumed return function, any MDP has an optimal deterministic policy. This optimal policy, however, may not be unique. Any deterministic optimal policy must satisfy the following relation:

π*(s) = argmax_a Q*(s,a),   ∀s ∈ S.    (20.7)

Finally, the relation between the optimal values of states and of state-action pairs is established by the following set of equations:

Q*(s,a) = R(s,a) + γ ∑_{s'∈S} P(s,a,s') V*(s'),   V*(s) = max_a Q*(s,a),   s ∈ S; a ∈ A.    (20.8)

For a more extensive discussion of MDPs and related results, the reader is referred to Puterman (1994) or Ross (1983).
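Equation 20.2 suggests a direct, if naive, way to approximate V^π(s): simulate many trajectories that start at s and follow π, and average their discounted sums of rewards. The sketch below does so for a hypothetical two-state MDP and an arbitrary deterministic policy; truncating each trajectory after T stages is an added assumption, justified because γ^T quickly becomes negligible.

import random

S, A, gamma = [0, 1], [0, 1], 0.9
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.7, 0.3], (1, 1): [0.1, 0.9]}

def pi(s):                               # a fixed deterministic policy
    return 0 if s == 0 else 1

def estimate_value(s0, episodes=5000, T=100):
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(T):               # truncated discounted sum of rewards
            a = pi(s)
            ret += discount * R[(s, a)]  # using the mean reward leaves the
            discount *= gamma            # expectation unchanged
            s = random.choices(S, weights=P[(s, a)])[0]
        total += ret
    return total / episodes              # sample mean approximates V^pi(s0)

print(estimate_value(0), estimate_value(1))

Dynamic programming, discussed next, computes these values far more efficiently when R and P are known.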
Some RL tasks are continuous while others are episodic. An episodic task is one that terminates after a (possibly random) number of stages. As a repeat of an episodic task terminates, another may begin, possibly at a different initial state. Continuous tasks, on the other hand, never terminate. The objective defined by Equation 20.1 considers an infinite horizon and therefore might be seen as inappropriate for episodic tasks (which are finite by definition). However, by introducing the concept of an absorbing state, episodic tasks can be viewed as infinite-horizon tasks (Sutton and Barto, 1998). An absorbing state is one from which all actions result in a transition to the same state and with zero reward.

20.3 Reinforcement-Learning Algorithms

The environment in RL problems is modeled as an MDP with unknown mean-reward and state-transition functions. Many RL algorithms are generalizations of dynamic-programming (DP) algorithms (Bellman, 1957; Howard, 1960) for finding optimal policies in MDPs given these functions. Sub-section 20.3.1 introduces a few key DP principles. The reader is referred to Puterman (1994), Bertsekas (1987) or Ross (1983) for a more comprehensive discussion. Sub-section 20.3.2 introduces several issues related to generalizing DP algorithms to RL problems. Please see Sutton and Barto (1998) for a comprehensive introduction to RL algorithms, and Bertsekas and Tsitsiklis (1996) for a more extensive treatment.

20.3.1 Dynamic-Programming

Typical DP algorithms begin with an arbitrary policy and proceed by evaluating the values of states or state-action pairs for this policy. These evaluations are used to derive a new, improved policy, for which the values of states or state-action pairs are evaluated in turn, and so on. Given a deterministic policy π, the evaluation of the values of states may take place by incorporating an iterative sequence of updates. The sequence begins with arbitrary initializations V_1(s) for each s. On the k-th repeat of the sequence, the values V_k(s) are used to derive V_{k+1}(s) for all s:

V_{k+1}(s) = R(s, π(s)) + γ ∑_{s'∈S} P(s, π(s), s') V_k(s'),   ∀s ∈ S.    (20.9)

It can be shown that following this sequence, V_k(s) converges to V^π(s) for all s as k increases. Having the deterministic policy π and the values of states for this policy V^π(s), the values of state-action pairs are given by:

Q^π(s,a) = R(s,a) + γ ∑_{s'∈S} P(s,a,s') V^π(s'),   s ∈ S; a ∈ A.    (20.10)

The values of state-action pairs for a given policy π_k can be used to derive an improved deterministic policy π_{k+1}:

π_{k+1}(s) = argmax_{a∈A} Q^{π_k}(s,a),   ∀s ∈ S.    (20.11)

It can be shown that V^{π_{k+1}}(s) ≥ V^{π_k}(s) for all s, and that if the last relation is satisfied with equality for all states, then π_k is an optimal policy.
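The following sketch puts Equations 20.9-20.11 together: it repeatedly evaluates a deterministic policy by sweeping the update of Equation 20.9, and then improves the policy greedily through Equations 20.10 and 20.11. The two-state MDP and the starting policy are hypothetical.

S, A, gamma = [0, 1], [0, 1], 0.9
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.7, 0.3], (1, 1): [0.1, 0.9]}

def evaluate(pi, sweeps=200):
    V = {s: 0.0 for s in S}                              # arbitrary V_1
    for _ in range(sweeps):                              # Equation 20.9
        V = {s: R[(s, pi[s])]
                + gamma * sum(P[(s, pi[s])][s2] * V[s2] for s2 in S)
             for s in S}
    return V

def improve(V):
    Q = {(s, a): R[(s, a)]                               # Equation 20.10
                 + gamma * sum(P[(s, a)][s2] * V[s2] for s2 in S)
         for s in S for a in A}
    return {s: max(A, key=lambda a: Q[(s, a)]) for s in S}   # Equation 20.11

pi = {0: 0, 1: 0}                                        # arbitrary start
for _ in range(10):                                      # evaluate-improve loop
    pi = improve(evaluate(pi))
print(pi)

For this tiny example the evaluate-improve loop typically reaches a fixed point after a few iterations, at which point the policy is optimal.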
Improving the policy π may also be based on estimates of V^π(s) instead of the exact values: if V(s) estimates V^π(s), a new policy can be derived by calculating Q(s,a) according to Equation 20.10, with V(s) replacing V^π(s), and then calculating the improved policy according to Equation 20.11 based on Q(s,a). Estimates of V^π(s) are usually the result of executing the sequence of updates defined by Equation 20.9 without waiting for V_k(s) to converge to V^π(s) for all s ∈ S. In particular, it is possible to repeatedly execute a single repeat of the sequence defined in Equation 20.9, use the estimation results to derive a new policy as defined by Equations 20.10 and 20.11, re-execute a single repeat of Equation 20.9 starting from the current estimation results, and so on. This well-known approach, termed value-iteration, begins with an arbitrary initialization Q_1(s,a) for all s and a, and proceeds iteratively with the updates:

Q_{t+1}(s,a) = R(s,a) + γ ∑_{s'∈S} P(s,a,s') V_t(s'),   ∀s ∈ S; a ∈ A; t = 1, 2, ...,    (20.12)

where V_t(s) = max_a Q_t(s,a). It can be shown that using value-iteration, Q_t(s,a) converges to Q*(s,a) (there are also results concerning the rate of this convergence; see, for instance, the discussion in Puterman, 1994). The algorithm terminates using some stopping condition, e.g. when the change in the values Q_t(s,a) due to a single iteration is small enough (it is possible to establish a connection between the change in the values Q_t(s,a) due to a single iteration and the distance between Q_t(s,a) and Q*(s,a); see, for instance, the discussion in Puterman, 1994). Let the termination occur at stage T. The output policy is calculated according to:

π(s) = argmax_{a∈A} Q_T(s,a),   ∀s ∈ S.    (20.13)
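A sketch of value-iteration (Equations 20.12 and 20.13) on the same hypothetical two-state MDP follows. The stopping condition used here - stop when a full sweep changes no estimate by more than a small threshold - is one of the conditions mentioned above.

S, A, gamma = [0, 1], [0, 1], 0.9
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.7, 0.3], (1, 1): [0.1, 0.9]}

Q = {(s, a): 0.0 for s in S for a in A}                  # arbitrary Q_1
while True:
    V = {s: max(Q[(s, a)] for a in A) for s in S}        # V_t(s) = max_a Q_t(s, a)
    Q_new = {(s, a): R[(s, a)]                           # Equation 20.12
                     + gamma * sum(P[(s, a)][s2] * V[s2] for s2 in S)
             for s in S for a in A}
    delta = max(abs(Q_new[k] - Q[k]) for k in Q)         # change in one sweep
    Q = Q_new
    if delta < 1e-9:                                     # stopping condition
        break

pi = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}     # Equation 20.13
print(pi, Q)

Unlike the policy-iteration sketch above, no explicit policy is maintained until the very end.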
20.3.2 Generalization of Dynamic-Programming to Reinforcement-Learning

It should be noted that both the mean-reward and the state-transition functions are required in order to carry out the computations described in the previous sub-section. In RL, these functions are initially unknown. Two different approaches, indirect and direct, may be used to generalize the discussion to the absence of these functions.

According to the indirect approach, samples of the consequences of choosing various actions at various states are gathered. These samples are used to approximate the mean-reward and state-transition functions. Subsequently, the indirect approach uses the approximations to extract policies. During the extraction of policies, the approximated functions are used as if they were the exact ones. The direct approach, the more common of the two, involves continuously maintaining estimates of the optimal values of states and state-action pairs without having any explicitly approximated mean-reward and state-transition functions. The overview in this sub-section focuses on methods that take the direct approach.

In a typical direct method, the agent begins learning in a certain state, while having arbitrary estimates of the optimal values of states and state-action pairs. Subsequently the agent uses a while-learning policy to choose an action. Consequently a new state is encountered and an immediate reward is obtained (i.e. a new experience is gathered). The agent uses the new experience to update the estimates of optimal values for states and state-action pairs visited in previous stages.

The policy that the agent uses while it learns needs to resolve a dilemma, known as the exploration-exploitation dilemma. Exploitation means using the knowledge gathered in order to obtain desired outcomes. In order to obtain desired outcomes on a certain stage, the agent needs to choose the action whose optimal state-action value, given the state, is maximal. Since the exact optimal values of state-action pairs are unknown, the best the agent can do is maximize over the corresponding estimates. On the other hand, due to the random fluctuations of the reward signals and the random nature of the state-transition function, the agent's estimates are never accurate. In order to obtain better estimates, the agent must explore its possibilities. Exploration and exploitation are conflicting rationales, because by exploring possibilities the agent will sometimes choose actions that seem inferior at the time they are chosen.

In general, it is unknown how to best solve the exploration-exploitation dilemma. There are, however, several helpful heuristics, which typically work as follows. During learning, the action chosen while being in state s is randomly chosen from the entire set of actions, but with a probability function that favors actions for which the current optimal-value estimates are high (see Sutton and Barto, 1998 for a discussion of the exploration-exploitation dilemma and its heuristic solutions).

Many RL algorithms are stochastic variations of DP algorithms. Instead of using explicit mean-reward and state-transition functions, the agent uses the actual reward signals and state transitions observed while interacting with the environment. These actual outcomes implicitly estimate the real, unknown functions. There are several assumptions under which the estimates maintained by stochastic variations of DP algorithms converge to the optimal values of states and state-action pairs. Having the optimal values, a deterministic optimal policy may be derived according to Equation 20.7. The reader is referred to Bertsekas and Tsitsiklis (1996), Jaakkola et al. (1994) or Szepesvári and Littman (1999) for formal, general convergence results.

One of the most common RL algorithms is termed Q-Learning (Watkins, 1989; Watkins and Dayan, 1992). Q-Learning takes the direct approach, and can be regarded as the stochastic version of value-iteration. At stage t of Q-Learning, the agent holds Q_t(s,a) - estimates of the optimal values Q*(s,a) - for all state-action pairs. At this stage, the agent encounters state s_t and chooses the action a_t. Following the execution of a_t from s_t, the agent obtains the actual reward r_t, and faces the new state s_{t+1}. The tuple ⟨s_t, a_t, r_t, s_{t+1}⟩ is referred to as the experience gathered on stage t. Given that experience, the agent updates its estimate as follows:

Q_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Q_t(s_t, a_t) + α_t(s_t, a_t)(r_t + γ V_t(s_{t+1}))
                 = Q_t(s_t, a_t) + α_t(s_t, a_t)(r_t + γ V_t(s_{t+1}) − Q_t(s_t, a_t)),    (20.14)

where V_t(s) = max_a Q_t(s,a), and α_t(s_t, a_t) ∈ (0,1) is a step-size reflecting the extent to which the new experience needs to be blended into the current estimates (in general, a unique step-size is defined for each state, action and stage; usually step-sizes decrease with time). It can be shown that, under several assumptions, Q_t(s,a) converges to Q*(s,a) for all s and a as t → ∞. (Convergence proofs can be found in Watkins and Dayan, 1992; Jaakkola et al., 1996; Szepesvári and Littman, 1999; and Bertsekas and Tsitsiklis, 1996.)
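The sketch below implements the update of Equation 20.14 with an epsilon-greedy while-learning policy, one common heuristic for the exploration-exploitation dilemma. The two-state environment is again hypothetical, and R and P appear only inside the simulation of the environment: the learning code itself uses nothing but the sampled experiences ⟨s_t, a_t, r_t, s_{t+1}⟩.

import random

S, A, gamma, epsilon = [0, 1], [0, 1], 0.9, 0.1
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}     # hidden from agent
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],                 # hidden from agent
     (1, 0): [0.7, 0.3], (1, 1): [0.1, 0.9]}

Q = {(s, a): 0.0 for s in S for a in A}          # arbitrary initial estimates
visits = {(s, a): 0 for s in S for a in A}
s = random.choice(S)
for t in range(50000):
    if random.random() < epsilon:                # explore: random action
        a = random.choice(A)
    else:                                        # exploit current estimates
        a = max(A, key=lambda b: Q[(s, b)])
    r = R[(s, a)] + random.gauss(0, 0.1)         # environment's response
    s_next = random.choices(S, weights=P[(s, a)])[0]
    visits[(s, a)] += 1
    alpha = 1.0 / visits[(s, a)]                 # decreasing step-size
    target = r + gamma * max(Q[(s_next, b)] for b in A)
    Q[(s, a)] += alpha * (target - Q[(s, a)])    # Equation 20.14
    s = s_next

print({s: max(A, key=lambda b: Q[(s, b)]) for s in S})

With a decreasing step-size and persistent exploration, the estimates stabilize in this toy example and the greedy policy extracted at the end coincides with the one found by value-iteration.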
In order to understand the claim that Q-Learning is a stochastic version of value-iteration, it is helpful to read Equation 20.14 as an update of the estimated value of a certain state-action pair, Q_t(s_t, a_t), in the direction r_t + γ V_t(s_{t+1}), with a step-size α_t(s_t, a_t). With this interpretation, Q-Learning can be compared to value-iteration (Equation 20.12). Referring to a certain state-action pair, denoted s_t, a_t, and replacing the state-space index s' with s_{t+1}, Equation 20.12 can be re-phrased as:

Q_{t+1}(s_t, a_t) = R(s_t, a_t) + γ ∑_{s_{t+1}∈S} P(s_t, a_t, s_{t+1}) V_t(s_{t+1}).    (20.15)

Rewriting Equation 20.14 with α_t(s_t, a_t) = 1 results in:

Q_{t+1}(s_t, a_t) = r_t + γ V_t(s_{t+1}).    (20.16)

The only difference between Equations 20.15 and 20.16 lies in the use of r_t instead of R(s_t, a_t) and the use of V_t(s_{t+1}) instead of ∑_{s_{t+1}∈S} P(s_t, a_t, s_{t+1}) V_t(s_{t+1}). It can be shown that:

E[r_t + γ V_t(s_{t+1})] = R(s_t, a_t) + γ ∑_{s_{t+1}∈S} P(s_t, a_t, s_{t+1}) V_t(s_{t+1}),    (20.17)

namely Equation 20.16 is a stochastic version of Equation 20.15. It is appropriate to use a unit step-size when basing an update on exact values, but it is inappropriate to do so when basing the update on unbiased estimates, since the learning algorithm must be robust to the random fluctuations. In order to converge, Q-Learning is assumed to update each state-action pair infinitely often, but there is no explicit instruction as to which action to choose at each stage. However, in order to boost the rate at which the estimates converge to the optimal values, heuristics that address the exploration-exploitation dilemma are usually used.

20.4 Extensions to Basic Model and Algorithms

In general, the RL model is quite flexible and can be used to capture problems in a variety of domains. Applying an appropriate RL algorithm can lead to optimal solutions without requiring explicit mean-reward or state-transition functions. There are some problems, however, that the RL model described in Section 20.2 cannot capture. There may also be some serious difficulties in applying the RL algorithms described in Section 20.3. This section presents overviews of two extensions. The first extension involves a multi-agent RL scenario, where learning agents co-exist in a single environment. This is followed by an overview of the problem of large (or even infinite) sets of states and actions.

20.4.1 Multi-Agent RL

In RL, as in real life, the existence of one autonomous agent affects the outcomes obtained by other co-existing agents. An immediate approach for tackling multi-agent RL is to let each agent refer to its colleagues (or adversaries) as part of the environment. A learner that takes this approach is regarded as an independent learner (Claus and Boutilier, 1998). Notice that the environment of an independent learner consists of learning components (the other agents) and is therefore not stationary. The model described in Section 20.2, as well as the convergence results mentioned in Section 20.3, assumed a stationary environment (i.e. the mean-reward and state-transition functions do not change over time). Although convergence is not guaranteed when using independent learners in multi-agent problems, several authors have reported good empirical results for this approach. For example, Sen et al. (1994) used Q-Learning in multi-agent domains.
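As a minimal illustration of the independent-learner idea, the sketch below lets two agents learn simultaneously in a hypothetical single-state (stateless) coordination game: each agent runs the stateless form of the Q-Learning update over its own actions only, treating the other agent implicitly as part of the environment. The payoff matrix is invented for the example, and the single-state setting is a deliberate simplification of the full multi-state model discussed next.

import random

A1, A2 = [0, 1], [0, 1]                          # the two agents' action sets
payoff = {(0, 0): 1.0, (0, 1): 0.0,              # common reward for the joint
          (1, 0): 0.0, (1, 1): 2.0}              # action (a1, a2)

Q1 = {a: 0.0 for a in A1}                        # each agent keeps estimates
Q2 = {a: 0.0 for a in A2}                        # over its own actions only
epsilon, alpha = 0.1, 0.05
for t in range(20000):
    a1 = random.choice(A1) if random.random() < epsilon else max(A1, key=Q1.get)
    a2 = random.choice(A2) if random.random() < epsilon else max(A2, key=Q2.get)
    r = payoff[(a1, a2)]
    Q1[a1] += alpha * (r - Q1[a1])               # stateless Q-Learning-style update
    Q2[a2] += alpha * (r - Q2[a2])
print(max(A1, key=Q1.get), max(A2, key=Q2.get))

Because each agent's environment includes the other, still-learning agent, the process is non-stationary and convergence is not guaranteed in general, which is exactly the caveat raised above.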
Littman (1994) proposed Markov Games (often referred to as Stochastic Games) as the theoretical model appropriate for multi-agent RL problems. A k-agent Stochastic Game (SG) is defined by a tuple ⟨S, Ā, R̄, P⟩, where S is a finite set of states (as in the case of MDPs); Ā = A_1 × A_2 × ... × A_k is a Cartesian product of the k action sets available to the k agents; R̄ : S × Ā → ℜ^k is a collection of k mean-reward functions for the k agents; and P : S × Ā × S → [0,1] is a state-transition function. The evolution of an SG is controlled by k autonomous agents acting simultaneously rather than by a single agent. The notion of a game as a model for multi-agent RL problems raises the concept of Nash equilibrium as an optimality criterion. Generally speaking, a joint policy is said to be in Nash equilibrium if no agent can gain from being the only one to deviate from it. Several algorithms that rely on the SG model appear in the literature (Littman, 1994; Hu and Wellman, 1998; Littman, 2001). In order to assure convergence, agents in these algorithms are programmed as joint learners (Claus and Boutilier, 1998). A joint learner is aware of the existence of other agents and in one way or another adapts its own behavior to the behavior of its colleagues. In general, the problem of multi-agent RL is still the subject of ongoing research. For a comprehensive introduction to SGs the reader is referred to Filar and Vrieze (1997).

20.4.2 Tackling Large Sets of States and Actions

The algorithms described in Section 20.3 assumed that the agent maintains a look-up table with a unique entry corresponding to each state or state-action pair. As the agent gathers new experience, it retrieves the entry corresponding to the state or state-action pair for that experience, and updates the estimate stored within the entry. Representation of estimates of optimal values in the form of a look-up table is limited to problems with a reasonably small number of states and actions. Obviously, as the number of states and actions increases, the memory required for the look-up table increases, and so do the time and experience needed to fill up this table with reliable estimates. That is to say, if there is a large number of states and actions, the agent cannot have the privilege of exploring them all, but must incorporate some sort of generalization.

Generalization takes place through function-approximation, in which the agent maintains a single approximated value function from states or state-action pairs to (estimated) values. Function approximation is a central idea in Supervised-Learning (SL). The task in function-approximation is to find a function that will best approximate some unknown target function based on a limited set of observations. The approximating function is located through a search over the space of parameters of a decided-upon family of parameterized functions. For example, the unknown function Q* : S × A → ℜ may be approximated by an artificial neural network of pre-determined architecture. A certain network f : S × A → ℜ belongs to a family of parameterized networks Φ, where the parameters are the weights of the connections in the network.

RL with function approximation inherits the direct approach described in Section 20.3. That is, the agent repeatedly estimates the values of states or state-action pairs for its current policy; uses the estimates to derive an improved policy; estimates the values corresponding to the new policy; and so on. However, the function representation adds some complications to the process. In particular, when the number of actions is large, the principle of policy improvement by finding the action that maximizes the current estimates does not scale well. Moreover, convergence results characterizing look-up table representations usually cannot be generalized to function-approximation representations. The reader is referred to Bertsekas and Tsitsiklis (1996) for an extensive discussion of RL with function-approximation and the corresponding theoretical results.
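The sketch below replaces the look-up table with a parameterized function: Q(s, a) is represented as a linear combination of a small, hand-crafted feature vector, and the Q-Learning update of Equation 20.14 becomes a gradient-style adjustment of the weights. The feature map and the environment are hypothetical, and a linear approximator is used instead of the neural network mentioned above only to keep the example short.

import random

S, A, gamma = [0, 1], [0, 1], 0.9
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}     # hidden from agent
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],                 # hidden from agent
     (1, 0): [0.7, 0.3], (1, 1): [0.1, 0.9]}

def features(s, a):                      # a simple hand-crafted feature vector
    return [1.0, float(s), float(a), float(s * a)]

w = [0.0, 0.0, 0.0, 0.0]                 # parameters of the approximator

def q(s, a):                             # approximated value of the pair (s, a)
    return sum(wi * xi for wi, xi in zip(w, features(s, a)))

alpha, epsilon = 0.01, 0.1
s = random.choice(S)
for t in range(50000):
    if random.random() < epsilon:
        a = random.choice(A)
    else:
        a = max(A, key=lambda b: q(s, b))
    r = R[(s, a)] + random.gauss(0, 0.1)
    s_next = random.choices(S, weights=P[(s, a)])[0]
    target = r + gamma * max(q(s_next, b) for b in A)
    error = target - q(s, a)             # temporal-difference error
    w = [wi + alpha * error * xi for wi, xi in zip(w, features(s, a))]
    s = s_next

print([round(wi, 2) for wi in w])

For this tiny state space the four features can represent Q exactly, so the sketch behaves like the tabular algorithm; with genuinely large state spaces the same scheme generalizes across states, at the price of the weaker convergence guarantees noted above.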
20.5 Applications of Reinforcement-Learning

Using RL, an agent may learn how to best behave in a complex environment without any explicit knowledge regarding the nature or the dynamics of this environment. All that an agent needs in order to find an optimal policy is the opportunity to explore its options.

In some cases, RL occurs through interaction with the real environment under consideration. However, there are cases in which experience is expensive. For example, consider an agent that needs to learn a decision policy in a business environment.
