Scalable cooperative multiagent reinforcement learning in the context of an organization


SCALABLE COOPERATIVE MULTIAGENT REINFORCEMENT LEARNING IN THE CONTEXT OF AN ORGANIZATION

A Dissertation Presented by SHERIEF ABDALLAH

Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY, September 2006. Computer Science.

UMI Number: 3242334. UMI Microform 3242334. Copyright 2007 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346.

© Copyright by Sherief Abdallah 2006. All Rights Reserved.

Approved as to style and content by:
Victor Lesser, Chair
Abhi Deshmukh, Member
Sridhar Mahadevan, Member
Shlomo Zilberstein, Member
W. Bruce Croft, Department Chair, Computer Science

ABSTRACT

SCALABLE COOPERATIVE MULTIAGENT REINFORCEMENT LEARNING IN THE CONTEXT OF AN ORGANIZATION

SEPTEMBER 2006

SHERIEF ABDALLAH
B.Sc., CAIRO UNIVERSITY
M.Sc., CAIRO UNIVERSITY
M.Sc., UNIVERSITY OF MASSACHUSETTS
Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST

Directed by: Professor Victor Lesser

Reinforcement learning techniques have been used successfully to solve single-agent optimization problems, but many real problems involve multiple agents, i.e., multi-agent systems. This explains the growing interest in multi-agent reinforcement learning (MARL) algorithms. To be applicable in large real domains, MARL algorithms need to be both stable and scalable. A scalable MARL algorithm performs adequately as the number of agents increases; a MARL algorithm is stable if all agents (eventually) converge to a stable joint policy. Unfortunately, most previous approaches lack at least one of these two crucial properties.

This dissertation proposes a scalable and stable MARL framework using a network of mediator agents. The network connections restrict the space of valid policies, which reduces the search time and achieves scalability. Optimizing performance in such a system consists of two subproblems: optimizing the mediators' local policies and optimizing the structure of the network interconnecting mediators and servers. I present extensions to Markovian models that allow exponential savings in time and space. I also present the first integrated framework for MARL in a network, which includes both a MARL algorithm and a reorganization algorithm that work concurrently with one another. To evaluate performance, I use the distributed task allocation problem as a motivating domain.

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES

CHAPTER

1. INTRODUCTION
   1.1 The Distributed Task Allocation Problem, DTAP
   1.2 Modeling and Solving Multi-agent Decisions
       1.2.1 Decision in Single Agent Systems
       1.2.2 Decision in Multi Agent Systems
       1.2.3 Feedback Mechanisms for Computing Cost
   1.3 Contributions
   1.4 Summary

2. STUDYING THE EFFECT OF THE NETWORK STRUCTURE AND ABSTRACTION FUNCTION
   2.1 Problem definition
       2.1.1 Complexity
   2.2 Proposed Solution
       2.2.1 Architecture
       2.2.2 Local Decision
       2.2.3 State Abstraction
       2.2.4 Task Decomposition
       2.2.5 Learning
       2.2.6 Neural Nets
       2.2.7 Organization Structure
   2.3 Experiments and Results
   2.4 Related work
   2.5 Conclusion

3. EXTENDING AND GENERALIZING MDP MODELS
   3.1 Example
   3.2 Semi-Markov Decision Process, SMDP
   3.3 Randomly available actions
       3.3.1 The wait operator
   3.4 Extension to Concurrent Action Model
   3.5 Learning the Mediator's Decision Process
       3.5.1 Handling Multiple Tasks in Parallel
   3.6 Results
       3.6.1 The Taxi Domain
       3.6.2 The DTAP Experiments
       3.6.3 When Traditional SMDP Outperforms ℘-SMDP
   3.7 Related Work
   3.8 Conclusion

4. LEARNING DECOMPOSITIONS
   4.1 Motivating Example
   4.2 Multi-level policy gradient algorithm
       4.2.1 Learning
   4.3 Cycles
   4.4 Experimental Results
   4.5 Related Work
   4.6 Conclusion

5. WEIGHTED POLICY LEARNER, WPL
   5.1 Game Theory
       5.1.1 Learning and Convergence
   5.2 The Weighted Policy Learner (WPL) algorithm
       5.2.1 WPL Convergence
       5.2.2 Analyzing WPL Using Differential Equations
   5.3 Related Work
       5.3.1 Generalized Infinitesimal Gradient Ascent, GIGA
       5.3.2 GIGA-WoLF
   5.4 Results
       5.4.1 Computing Expected Reward
           5.4.1.1 Fixing Learning Parameters
       5.4.2 Benchmark Games
       5.4.3 The Task Allocation Game
   5.5 Conclusion

6. MULTI-STEP WEIGHTED POLICY LEARNING AND REORGANIZATION
   6.1 Performance Evaluation
   6.2 Optimizing Local Decision
   6.3 Updating the State
   6.4 MS-WPL Learning Algorithm
   6.5 Re-Organization Algorithm
   6.6 Algorithm Parameters
   6.7 Experimental Results
       6.7.1 MS-WPL
       6.7.2 Re-Organization
   6.8 Related Work
   6.9 Conclusion

7. RELATED WORK
   7.1 Scheduling
   7.2 Task Allocation
   7.3 Partially Observable MDP
   7.4 Markovian Models for Multi-agent Systems

8. CONCLUSION
   8.1 Summary
   8.2 Contributions
   8.3 Limitations and Future Work

APPENDICES
A. SYMBOLIC ANALYSIS OF WPL DIFFERENTIAL EQUATIONS
B. SOLVING WPL DIFFERENTIAL EQUATIONS NUMERICALLY USING MATHEMATICA

BIBLIOGRAPHY

APPENDIX A

SYMBOLIC ANALYSIS OF WPL DIFFERENTIAL EQUATIONS

• Segment T0 → T1: p(t) < p*, q(t) > q*. (I will use p and q instead of p(t) and q(t), respectively, from now on for brevity.)

$$\frac{dp}{dq} = \frac{(1-p)(u_1 q + u_2)}{(1-q)(u_3 p + u_4)}$$

Separating variables and integrating,

$$\int_{p_{min}}^{p^*} \frac{u_3 p + u_4}{1-p}\, dp = \int_{q^*}^{q_{max}} \frac{u_1 q + u_2}{1-q}\, dq$$

$$-u_3(p^* - p_{min}) + (u_3 + u_4)\ln\frac{1 - p_{min}}{1 - p^*} = -u_1(q_{max} - q^*) + (u_1 + u_2)\ln\frac{1 - q^*}{1 - q_{max}}$$

and by re-arranging,

$$(u_1 + u_2)\ln\frac{1 - q^*}{1 - q_{max}} - (u_3 + u_4)\ln\frac{1 - p_{min}}{1 - p^*} = -u_3(p^* - p_{min}) + u_1(q_{max} - q^*) > 0 \qquad (A.1)$$

because u3 < 0, u1 > 0, p* > p_min, and q_max > q*.

• Segment T1 → T2: p > p*, q > q*.

$$\frac{dp}{dq} = \frac{(1-p)(u_1 q + u_2)}{q(u_3 p + u_4)}$$

$$\int_{p^*}^{p_{max}} \frac{u_3 p + u_4}{1-p}\, dp = \int_{q_{max}}^{q^*} \frac{u_1 q + u_2}{q}\, dq$$

$$-u_3(p_{max} - p^*) + (u_3 + u_4)\ln\frac{1 - p^*}{1 - p_{max}} = u_1(q^* - q_{max}) + u_2\ln\frac{q^*}{q_{max}}$$

and by re-arranging,

$$u_2\ln\frac{q^*}{q_{max}} - (u_3 + u_4)\ln\frac{1 - p^*}{1 - p_{max}} = -u_3(p_{max} - p^*) + u_1(q_{max} - q^*) > 0 \qquad (A.2)$$

• Segment T2 → T3: p > p*, q < q*.

$$\frac{dp}{dq} = \frac{p(u_1 q + u_2)}{q(u_3 p + u_4)}$$

$$\int_{p_{max}}^{p^*} \frac{u_3 p + u_4}{p}\, dp = \int_{q^*}^{q_{min}} \frac{u_1 q + u_2}{q}\, dq$$

$$u_3(p^* - p_{max}) + u_4\ln\frac{p^*}{p_{max}} = u_1(q_{min} - q^*) + u_2\ln\frac{q_{min}}{q^*}$$

and by re-arranging,

$$u_2\ln\frac{q_{min}}{q^*} - u_4\ln\frac{p^*}{p_{max}} = -u_3(p_{max} - p^*) + u_1(q^* - q_{min}) > 0 \qquad (A.3)$$

• Segment T3 → T4: p < p*, q < q*.

$$\frac{dp}{dq} = \frac{p(u_1 q + u_2)}{(1-q)(u_3 p + u_4)}$$

$$\int_{p^*}^{p'_{min}} \frac{u_3 p + u_4}{p}\, dp = \int_{q_{min}}^{q^*} \frac{u_1 q + u_2}{1-q}\, dq$$

$$u_3(p'_{min} - p^*) + u_4\ln\frac{p'_{min}}{p^*} = -u_1(q^* - q_{min}) + (u_1 + u_2)\ln\frac{1 - q_{min}}{1 - q^*}$$

and by re-arranging,

$$(u_1 + u_2)\ln\frac{1 - q_{min}}{1 - q^*} - u_4\ln\frac{p'_{min}}{p^*} = -u_3(p^* - p'_{min}) + u_1(q^* - q_{min}) > 0 \qquad (A.4)$$

Since I have four equations (and the related inequalities) in five unknowns (p_min, p'_min, p_max, q_min, q_max), I thought I could relate p_min and p'_min in one equation and hence prove p'_min − p_min > 0. Unfortunately, solving any of these equations involves Lambert's function, which is difficult to manipulate.
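To see concretely where Lambert's function enters, consider isolating a boundary value such as p_min from (A.1) with q_max treated as known. The following is a brief sketch added for clarity; the constants a, b, and c simply collect the coefficient groupings from (A.1) and are not notation used in the derivation above. The p_min-dependent terms reduce the equation to the form

$$a\,x + b\ln x = c, \qquad x = 1 - p_{min},$$

and rewriting gives

$$\ln x + \frac{a}{b}\,x = \frac{c}{b}
\;\Longrightarrow\;
\frac{a}{b}\,x\, e^{(a/b)\,x} = \frac{a}{b}\, e^{c/b}
\;\Longrightarrow\;
x = \frac{b}{a}\, W\!\left(\frac{a}{b}\, e^{c/b}\right),$$

where W is Lambert's function (Mathematica's ProductLog), with the relevant branch determined by the signs of a and b. The boundary values therefore have no elementary closed form, which is why the equations are solved numerically in Appendix B.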
I tried solving the first segment using Mathematica with the following code:

(* The assumption list passed to Assuming was garbled in the extracted text;
   the inequalities below are a minimal reconstruction. *)
Assuming[{u1 > 0, u3 < u4, u1 > -u2},
 DSolve[p'[t] == (1 - p[t])*(u1*q[t] + u2) &&
   q'[t] == (1 - q[t])*(u3*p[t] + u4) &&
   p[0] == p0 && q[0] == -u4/u3, {p, q}, t]]

Figure A.2 shows a portion of the resulting solution. It is difficult to manipulate this solution symbolically and substitute values for p and q. This is the main reason for solving the equations numerically.

[Figure A.2: Symbolic solution, using Mathematica, of the first set of differential equations. The output is expressed in terms of ProductLog (Lambert's function) and InverseFunction and is not reproduced here.]

APPENDIX B

SOLVING WPL DIFFERENTIAL EQUATIONS NUMERICALLY USING MATHEMATICA

This appendix lists the code I have used to solve the WPL differential equations numerically and plot the figures in Chapter 5.

ClearAll[p, q]

(* Piecewise WPL dynamics: each component's update term is scaled by (1 - x)
   on one side of the switching threshold and by x on the other. *)
eq[u1_, u2_, u3_, u4_] :=
 {p'[t] == Piecewise[{{(1 - p[t])*(u1*q[t] + u2), q[t] > -u2/u1},
                      {p[t]*(u1*q[t] + u2),       q[t] <= -u2/u1}}],
  q'[t] == Piecewise[{{(1 - q[t])*(u3*p[t] + u4), p[t] < -u4/u3},
                      {q[t]*(u3*p[t] + u4),       p[t] >= -u4/u3}}]}

(* Numerical solution from the initial policies (p0, q0) over [t0, tf]. *)
eqsol[u1_, u2_, u3_, u4_, p0_, q0_, t0_, tf_] :=
 NDSolve[{eq[u1, u2, u3, u4], p[0] == p0, q[0] == q0}, {p, q}, {t, t0, tf}]

(* Phase-plane plot of a single trajectory (p(t), q(t)). *)
eqplot[u1_, u2_, u3_, u4_, p0_, q0_, t0_, tf_] :=
 ParametricPlot[
  Evaluate[{p[t], q[t]} /. eqsol[u1, u2, u3, u4, p0, q0, t0, tf]],
  {t, t0, tf}, PlotRange -> {{0, 1}, {0, 1}}, DisplayFunction -> Identity]

(* p(t) and q(t) against time for a single trajectory. *)
eqtimeplot[u1_, u2_, u3_, u4_, p0_, q0_, t0_, tf_] :=
 Plot[Evaluate[{p[t], q[t]} /. eqsol[u1, u2, u3, u4, p0, q0, t0, tf]],
  {t, t0, tf}, DisplayFunction -> Identity,
  PlotStyle -> {RGBColor[1, 0, 0], RGBColor[0, 0, 1]}, PlotRange -> {0, 1}]

Clear[plots];
(* Trajectories started along the four edges of the unit square, spaced by s.
   The 0/1 second coordinates in the first two tables (here and in timeplots
   below) were garbled in the extracted text and are reconstructed by symmetry
   with the last two. *)
plots[u1_, u2_, u3_, u4_, t0_, tf_, s_] :=
 Join[Table[eqplot[u1, u2, u3, u4, pq, 0, t0, tf], {pq, 0.0, 1, s}],
      Table[eqplot[u1, u2, u3, u4, pq, 1, t0, tf], {pq, 0.0, 1, s}],
      Table[eqplot[u1, u2, u3, u4, 0, pq, t0, tf], {pq, 0.0, 1, s}],
      Table[eqplot[u1, u2, u3, u4, 1, pq, t0, tf], {pq, 0.0, 1, s}]]

Clear[timeplots];
timeplots[u1_, u2_, u3_, u4_, t0_, tf_, s_] :=
 Join[Table[eqtimeplot[u1, u2, u3, u4, pq, 0, t0, tf], {pq, 0.0, 1, s}],
      Table[eqtimeplot[u1, u2, u3, u4, pq, 1, t0, tf], {pq, 0.0, 1, s}],
      Table[eqtimeplot[u1, u2, u3, u4, 0, pq, t0, tf], {pq, 0.0, 1, s}],
      Table[eqtimeplot[u1, u2, u3, u4, 1, pq, t0, tf], {pq, 0.0, 1, s}]]

ps = Flatten[Table[plots[0.5, u2, -0.5, u4, 700, 800, 0.25],
       {u2, 0.01, -0.49, -0.05}, {u4, 0.01, 0.49, 0.05}], 1]
Show[ps, DisplayFunction -> $DisplayFunction]

Show[plots[0.5, -0.45, -0.5, 0.45, 0, 1000, 0.025],
 DisplayFunction -> $DisplayFunction]
Show[timeplots[0.5, -0.45, -0.5, 0.45, 0, 1000, 0.025],
 DisplayFunction -> $DisplayFunction]
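For reference, the piecewise right-hand sides defined by eq above, which are the WPL dynamics integrated segment by segment in Appendix A, can be written out as follows. Denoting the switching thresholds by q* = −u2/u1 and p* = −u4/u3 is my shorthand for the conditions appearing in the code; the code itself does not name them.

$$\dot p =
\begin{cases}
(1-p)\,(u_1 q + u_2), & q > q^*\\
p\,(u_1 q + u_2), & q \le q^*
\end{cases}
\qquad
\dot q =
\begin{cases}
(1-q)\,(u_3 p + u_4), & p < p^*\\
q\,(u_3 p + u_4), & p \ge p^*
\end{cases}$$

In words, each component's rate of change is its update term scaled by (1 − value) on one side of the threshold and by the value itself on the other, which is what keeps the trajectories produced by plots inside the unit square.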
