Multi-Agent Systems on Wireless Sensor Networks: A Distributed Reinforcement Learning Approach

MULTI-AGENT SYSTEMS ON WIRELESS SENSOR NETWORKS:
A DISTRIBUTED REINFORCEMENT LEARNING APPROACH

JEAN-CHRISTOPHE RENAUD

NATIONAL UNIVERSITY OF SINGAPORE
2006

MULTI-AGENT SYSTEMS ON WIRELESS SENSOR NETWORKS:
A DISTRIBUTED REINFORCEMENT LEARNING APPROACH

JEAN-CHRISTOPHE RENAUD
(B.Eng., Institut National des Télécommunications, France)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF ENGINEERING
DEPARTMENT OF ELECTRICAL & COMPUTER ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2007

Acknowledgments

I consider myself extremely fortunate for having been given the opportunity and privilege of doing this research work at the National University of Singapore (NUS) as part of the Double-Degree program between NUS and the French "Grandes Ecoles". This experience has been a most valuable one.

I wish to express my deepest gratitude to my research supervisor, Associate Professor Chen-Khong Tham, for his expertise, advice, and support throughout the progress of this work. His kindness and optimism created a very motivating work environment that made this thesis possible.

Warm thanks to Lisa, who helped me finalize this work by reading and commenting on it, and for listening to my eternal, self-centered ramblings. I would also like to express my gratitude to all my friends with whom I discovered Singapore.

Finally, I would like to thank my wonderful family, in Redon and Paris, for providing the love and encouragement I needed to leave my home country for Singapore and complete this Master's degree. I dedicate this work to them.

À mes parents.

Contents

Acknowledgments
Contents
Summary
List of Tables
List of Figures
List of Abbreviations
Declaration

1 Introduction
  1.1 Wireless Sensor Networks and Multi-Agent Systems
  1.2 Challenges with Multi-Agent Systems
  1.3 Reinforcement Learning
  1.4 Markov Decision Processes
    1.4.1 Value functions
    1.4.2 The Q-learning algorithm
  1.5 Focus, motivation and contributions of this thesis

2 Literature review
  2.1 Multi-agent Learning
  2.2 Solutions to the curse of dimensionality
    2.2.1 Independent vs. cooperative agents
    2.2.2 Global optimality by local optimizations
    2.2.3 Exploiting the structure of the problem
  2.3 Solutions to partial observability
    2.3.1 Partially Observable MDPs
    2.3.2 Multi-agent learning with communication
  2.4 Other approaches
    2.4.1 The Game theoretic approach
    2.4.2 The Bayesian approach
  2.5 Summary

3 Distributed Reinforcement Learning Algorithms
  3.1 The common multi-agent extensions to Reinforcement Learning
    3.1.1 The centralized approach and the independent Q-learners algorithm
    3.1.2 The Global Reward DRL algorithm
  3.2 Schneider et al.'s Distributed Value Function algorithms
  3.3 Lauer and Riedmiller's optimistic assumption algorithm
    3.3.1 General framework: Multi-Agent MDP
    3.3.2 The Optimistic DRL algorithm
  3.4 Kapetanakis and Kudenko's FMQ heuristic
    3.4.1 Extensions of the FMQ heuristic to multi-state environments
  3.5 Guestrin's Coordinated Reinforcement Learning
    3.5.1 Description of the approach
    3.5.2 The Variable Elimination algorithm
    3.5.3 The Coordinated Q-Learning algorithm
  3.6 Bowling and Veloso's WoLF-PHC algorithm
  3.7 Summary and conclusion
    3.7.1 Conclusion

4 Design of a testbed for distributed learning of coordination
  4.1 The multi-agent lighting grid system testbed
    4.1.1 State-action spaces
    4.1.2 Reward functions
    4.1.3 Analysis of the light-grid problem for the CQL algorithm
  4.2 Distributed learning of coordination
    4.2.1 Single optimal joint-action setting
    4.2.2 Multiple optimal joint-actions settings
  4.3 Deterministic and stochastic environments
    4.3.1 Deterministic environments
    4.3.2 Stochastic environments
  4.4 Summary

5 Implementation on actual sensor motes and simulations
  5.1 Generalities
    5.1.1 Software and Hardware
    5.1.2 Parameters used in the simulations
  5.2 Energy considerations
  5.3 Results for the Deterministic environments
    5.3.1 Convergence and speed of convergence of the algorithms for the Deterministic environments
    5.3.2 Application-level results
  5.4 Results for Partially Stochastic environments
    5.4.1 Convergence and speed of convergence of the algorithms for the Partially Stochastic environments
    5.4.2 Application-level results
  5.5 Results for Fully Stochastic environments
    5.5.1 Convergence and speed of convergence of the algorithms for Fully Stochastic environments
    5.5.2 Application-level results
    5.5.3 Influence of stochasticity over the convergence performance of the DRL algorithms
  5.6 Conclusion

6 Conclusions and Future Work
  6.1 Contributions of this work
  6.2 Directions for future work

Bibliography

Appendix
  Appendix A - Pseudo-code of the DRL algorithms
    A-1 Independent Q-Learning and GR DRL
    A-2 Distributed Value Function DRL - Schneider et al.
    A-3 Optimistic DRL - Lauer and Riedmiller
    A-4 WoLF-PHC - Bowling and Veloso
    A-5 FMQ heuristics extended from Kudenko and Kapetanakis
    A-6 Coordinated Q-Learning - Guestrin
  Appendix B - List of Publications
    B-1 Published paper
    B-2 Pending Publication
    B-3 Submitted paper

Summary

Implementing a multi-agent system (MAS) on a wireless sensor network comprising sensor-actuator nodes is very promising, as it has the potential to tackle the resource constraints inherent in wireless sensor networks by efficiently coordinating the activities among the nodes. In fact, the processing and communication capabilities of sensor nodes enable them to make decisions and perform tasks in a coordinated manner in order to achieve some desired system-wide or global objective that they could not achieve on their own.

In this thesis, we review the research work about multi-agent learning and learning of coordination in cooperative MAS. We then study the behavior and performance of several distributed reinforcement learning (DRL) algorithms: (i) fully distributed Q-learning and its centralized counterpart, (ii) Global Reward DRL, (iii) Distributed Reward and Distributed Value Function, (iv) Optimistic DRL, (v) Frequency Maximum Q-learning (FMQ), which we have extended to multi-stage environments, (vi) Coordinated Q-Learning and (vii) WoLF-PHC. Furthermore, we have designed a general testbed in order to study the problem of coordination in a MAS and to analyze the aforementioned DRL algorithms in more detail. We present our experience and results from simulation studies and actual ...
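All of the DRL algorithms listed in the summary are built on the tabular Q-learning update of Section 1.4.2, which also appears verbatim in the Appendix A pseudo-code. For reference, in the thesis's notation for an agent i, with learning rate α and discount factor γ:

\[
Q^i(s^i_{t-1}, a^i_{t-1}) \;\leftarrow\; (1-\alpha)\, Q^i(s^i_{t-1}, a^i_{t-1}) \;+\; \alpha \Big[\, r^i_t(s^i_t) \;+\; \gamma \max_{a \in A^i} Q^i(s^i_t, a) \,\Big]
\]

The distributed variants differ mainly in what feeds the reward and bootstrap terms: a global reward in GR DRL and Optimistic DRL, a neighbour-weighted value estimate in DVF, and a heuristic evaluation value in FMQ.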
Bibliography

[24] D. H. Wolpert, K. R. Wheller, and K. Tumer, "General principles of learning-based multi-agent systems," in Proceedings of the Third International Conference on Autonomous Agents (AGENTS'99), O. Etzioni, J. P. Müller, and J. M. Bradshaw, Eds. Seattle, WA, USA: ACM Press, 1999, pp. 77–83.

[25] L. Peshkin, K.-E. Kim, N. Meuleau, and L. P. Kaelbling, "Learning to Cooperate via Policy Search," in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI'00). San Francisco, CA, USA: Morgan Kaufmann, 2000, pp. 307–314.

[26] E. Ferreira and P. Khosla, "Multiagent collaboration using distributed value functions," in Proceedings of the IEEE Intelligent Vehicles Symposium (IV'00), October 2000, pp. 404–409.

[27] M. G. Lagoudakis, R. E. Parr, and M. L. Littman, "Least-Squares Methods in Reinforcement Learning for Control," in Proceedings of the Second Hellenic Conference on Artificial Intelligence (SETN'02), April 2002, pp. 249–260.

[28] S. Babvey, O. Momtahan, and M. R. Meybodi, "Multi Mobile Robot Navigation Using Distributed Value Function Reinforcement Learning," in Proceedings of the International Conference on Robotics and Automation (ICRA'03), vol. 1, September 2003, pp. 957–962.

[29] C. E. Guestrin, D. Koller, and R. E. Parr, "Multiagent planning with factored MDPs," in Proceedings of the Fourteenth Neural Information Processing Systems (NIPS'01), Vancouver, Canada, 2002, pp. 1523–1530.

[30] S. Kapetanakis and D. Kudenko, "Reinforcement Learning of Coordination in Cooperative Multi-Agent Systems," in AAAI/IAAI, 2002, pp. 326–331.

[31] A. Dutech, O. Buffet, and F. Charpillet, "Multi-Agent Systems by Incremental Gradient Reinforcement Learning," in Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI'01), 2001, pp. 833–838.

[32] M. Littman and J. Boyan, "A Distributed Reinforcement Learning Scheme for Network Routing," Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA, Tech. Rep. CMU-CS-93-165, 1993.

[33] D. S. Bernstein, S. Zilberstein, and N. Immerman, "The Complexity of Decentralized Control of Markov Decision Processes," in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI'00), July 2000, pp. 32–37.

[34] N. Vlassis, R. Elhorst, and J. R. Kok, "Anytime Algorithms for Multiagent Decision Making Using Coordination Graphs," in Proceedings of the International Conference on Systems, Man and Cybernetics (SMC'04), 2004.

[35] J. R. Kok, M. T. J. Spaan, and N. Vlassis, "Multi-robot decision making using coordination graphs," in Proceedings of the International Conference on Advanced Robotics (ICAR'03), A. T. de Almeida and U. Nunes, Eds., Coimbra, Portugal, June 2003, pp. 1124–1129.

[36] J. R. Kok, R. de Boer, N. Vlassis, and F. Groen, "UvA Trilearn 2002 team description," in Proceedings of the CD RoboCup 2003 Symposium. Springer-Verlag, 2002.

[37] C. Boutilier, R. Dearden, and M. Goldszmidt, "Exploiting Structure in Policy Construction," in Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI'95), C. Mellish, Ed. San Francisco: Morgan Kaufmann, 1995, pp. 1104–1111.

[38] D. Koller and R. E. Parr, "Policy Iteration for Factored MDPs," in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI'00). San Francisco, CA, USA: Morgan Kaufmann, 2000, pp. 326–334.

[39] C. E. Guestrin, "Planning Under Uncertainty in Complex Structured Environments," Ph.D. dissertation, Stanford University, CA, USA, 2003.

[40] C. E. Guestrin, D. Koller, and R. E. Parr, "Max-norm Projections for Factored MDPs," in Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI'01). San Francisco, CA, USA: Morgan Kaufmann, August 2001, pp. 673–682.

[41] C. E. Guestrin and G. Gordon, "Distributed Planning in Hierarchical Factored MDPs," in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI'02). San Francisco, CA, USA: Morgan Kaufmann, 2002, pp. 197–206.

[42] C. E. Guestrin, D. Koller, R. E. Parr, and S. Venkataraman, "Efficient Solution Algorithms for Factored MDPs," Journal of Artificial Intelligence Research (JAIR), vol. 19, pp. 399–468, 2003.
[43] D. Schuurmans and R. Patrascu, "Direct value-approximation for factored MDPs," in Proceedings of the Fourteenth Neural Information Processing Systems (NIPS'01), 2001.

[44] C. Boutilier, T. Dean, and S. Hanks, "Decision-Theoretic Planning: Structural Assumptions and Computational Leverage," Journal of Artificial Intelligence Research (JAIR), vol. 11, pp. 1–94, 1999.

[45] M. G. Lagoudakis and R. E. Parr, "Model-Free Least-Squares Policy Iteration," in Proceedings of the Fourteenth Neural Information Processing Systems (NIPS'01), T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. MIT Press, December 2001, pp. 1547–1554.

[46] R. J. Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning," Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.

[47] L. C. Baird, "Residual Algorithms: Reinforcement Learning with Function Approximation," in Proceedings of the Twelfth International Conference on Machine Learning (ICML'95). Morgan Kaufmann Publishers, July 1995, pp. 30–37.

[48] R. E. Parr, "Hierarchical Control and Learning for Markov Decision Processes," Ph.D. dissertation, University of California, Berkeley, CA, USA, 1998.

[49] R. S. Sutton, D. Precup, and S. P. Singh, "Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning," Artificial Intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.

[50] T. Dietterich, "Hierarchical reinforcement learning with the MAXQ value function decomposition," in Proceedings of the Fifteenth International Conference on Machine Learning (ICML'98). Morgan Kaufmann, 1998.

[51] A. G. Barto and S. Mahadevan, "Recent Advances in Hierarchical Reinforcement Learning," Discrete Event Dynamic Systems, vol. 13, no. 4, pp. 341–379, 2003.

[52] R. Makar, S. Mahadevan, and M. Ghavamzadeh, "Hierarchical multi-agent reinforcement learning," in Proceedings of the Fifth International Conference on Autonomous Agents and Multiagent Systems (AAMAS'01), J. P. Müller, E. André, S. Sen, and C. Frasson, Eds. Montreal, Canada: ACM Press, 2001, pp. 246–253.

[53] M. Ghavamzadeh and S. Mahadevan, "Learning to Communicate and Act Using Hierarchical Reinforcement Learning," in Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS'04). IEEE Computer Society, 2004, pp. 1114–1121.

[54] Tony Cassandra's webpage about POMDPs. [Online]. Available: http://pomdp.org/

[55] D. V. Pynadath and M. Tambe, "The Communicative Multiagent Team Decision Problem: Analyzing Teamwork Theories and Models," Journal of Artificial Intelligence Research (JAIR), vol. 16, pp. 389–423, 2002.

[56] B. Rathnasabapathy and P. Gmytrasiewicz, "Formalizing multi-agent POMDPs in the context of network routing," in Proceedings of the 36th Annual Hawaii International Conference on System Sciences (HICSS'03), January 2003.

[57] E. Hansen, D. S. Bernstein, and S. Zilberstein, "Dynamic programming for partially observable stochastic games," in Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI'04), 2004, pp. 709–715.

[58] R. Nair, M. Tambe, M. Yokoo, D. Pynadath, and S. Marsella, "Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings," in Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI'03), 2003.

[59] I. Chades, B. Scherrer, and F. Charpillet, "A heuristic approach for solving decentralized POMDPs: Assessment on the pursuit problem," in Proceedings of the 2002 ACM Symposium on Applied Computing (SAC'02), 2002, pp. 57–62.
[60] R. Becker, S. Zilberstein, V. R. Lesser, and C. V. Goldman, "Solving Transition Independent Decentralized Markov Decision Processes," Journal of Artificial Intelligence Research (JAIR), vol. 22, pp. 423–455, 2004.

[61] R. Nair, P. Varakantham, M. Tambe, and M. Yokoo, "Networked Distributed POMDPs: A Synthesis of Distributed Constraint Optimization and POMDPs," in AAAI, M. M. Veloso and S. Kambhampati, Eds. AAAI Press / The MIT Press, 2005, pp. 133–139.

[62] D. Braziunas, "POMDP solution methods: a survey," Department of Computer Science, University of Toronto, Tech. Rep., 2003.

[63] S. Sen and G. Weiss, "Learning in multiagent systems," in Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence. Cambridge, MA, USA: MIT Press, 1999, pp. 259–298.

[64] P. Xuan, V. R. Lesser, and S. Zilberstein, "Communication decisions in multi-agent cooperation: model and experiments," in Proceedings of the Fifth International Conference on Autonomous Agents and Multiagent Systems (AAMAS'01). New York, NY, USA: ACM Press, 2001, pp. 616–623.

[65] P. Xuan, V. R. Lesser, and S. Zilberstein, "Communication in Multi-Agent Markov Decision Processes," in Proceedings of the Fourth International Conference on MultiAgent Systems (ICMAS'00). Washington, DC, USA: IEEE Computer Society, 2000, p. 467.

[66] R. Nair, M. Roth, M. Yokoo, and M. Tambe, "Communication for Improving Policy Computation in Distributed POMDPs," in Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS'04), 2004, pp. 1098–1105.

[67] C. V. Goldman and S. Zilberstein, "Optimizing information exchange in cooperative multi-agent systems," in Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS'03), 2003.

[68] C. V. Goldman and S. Zilberstein, "Decentralized control of cooperative systems: Categorization and complexity analysis," Journal of Artificial Intelligence Research (JAIR), vol. 22, pp. 143–174, 2004.

[69] M. L. Littman, "Markov Games as a Framework for Multi-Agent Reinforcement Learning," in Proceedings of the Eleventh International Conference on Machine Learning (ML'94). New Brunswick, NJ, USA: Morgan Kaufmann, 1994, pp. 157–163.

[70] J. Hu and M. P. Wellman, "Multiagent reinforcement learning: theoretical framework and an algorithm," in Proceedings of the Fifteenth International Conference on Machine Learning (ICML'98). San Francisco, CA, USA: Morgan Kaufmann, 1998, pp. 242–250.

[71] J. Hu and M. P. Wellman, "Nash Q-Learning for General-Sum Stochastic Games," Journal of Machine Learning Research, vol. 4, pp. 1039–1069, 2003.

[72] M. H. Bowling and M. M. Veloso, "Rational and Convergent Learning in Stochastic Games," in Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI'01), 2001, pp. 1021–1026.

[73] G. Chalkiadakis, "Multiagent reinforcement learning: stochastic games with multiple learning players," Department of Computer Science, University of Toronto, Tech. Rep., 2003.

[74] G. Chalkiadakis and C. Boutilier, "Coordination in multiagent reinforcement learning: a Bayesian approach," in Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS'03). ACM Press, 2003, pp. 709–716.

[75] P. Levis, N. Lee, M. Welsh, and D. Culler, "TOSSIM: Accurate and scalable simulation of entire TinyOS applications," in Proceedings of the First International Conference on Embedded Networked Sensor Systems (SenSys'03), November 2003.

[76] J. Hill, R. Szewczyk, A. Woo, S. Hollar, D. E. Culler, and K. S. J. Pister, "System Architecture Directions for Networked Sensors," in Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), November 2000, pp. 93–104.
[77] TinyOS website. [Online]. Available: http://www.tinyos.net/

[78] V. Shnayder, M. Hempstead, B.-r. Chen, G. W. Allen, and M. Welsh, "Simulating the Power Consumption of Large-Scale Sensor Network Applications," in Proceedings of the Second International Conference on Embedded Networked Sensor Systems (SenSys'04). New York, NY, USA: ACM Press, 2004, pp. 188–200.

[79] SensorGrid: sensor networks integrated with grid computing. [Online]. Available: http://www.sensorgrid.org

[80] C.-K. Tham and R. Buyya, "SensorGrid: Integrating sensor networks and grid computing," CSI Communications, Special Issue on Grid Computing, pp. 24–29, 2005.

Appendix A - Pseudo-code of the DRL algorithms

A-1 Independent Q-Learning and GR DRL

Algorithm: IL algorithm

  (* Phase I: Initialization *)
  1   Q^i(s^i, a^i) ← 0, ∀ (s^i, a^i) ∈ S^i × A^i
  2   Sense the initial state s^i_0
  3   a^i_0 ← random action
  (* Phase II: Learning phase *)
  4   repeat (for each time step t)
  5     t ← t + 1
  6     Sense the new state s^i_t
  7     Observe the reward r^i_t(s^i_t)
  8     Update Q-value:
          Q^i(s^i_{t-1}, a^i_{t-1}) ← (1 − α) Q^i(s^i_{t-1}, a^i_{t-1})
                                      + α [ r^i_t(s^i_t) + γ max_{a ∈ A^i} Q^i(s^i_t, a) ]
  9     Take local action a^i_t (ε-greedy action selection):
          a^i_t = argmax_{a ∈ A^i} Q^i(s^i_t, a) with probability (1 − ε),
                  random action with probability ε
  10    until terminal condition

Figure A-I: IL algorithm for agent i.

In the GR DRL case, the algorithm is the same except that the reward signal is global instead of local to agent i, i.e. all the agents of the MAS receive the same reward, which depends on the joint action taken by the agents at the previous time step.

A-2 Distributed Value Function DRL - Schneider et al.

Algorithm: DVF algorithm

  (* Phase I: Initialization *)
  1   Broadcast a message to determine Neigh(i)
  2   Compute f^i(j), ∀ j ∈ Neigh(i)
  3   AllVvaluesReceived ← False
  4   Q^i(s^i, a^i) ← 0, ∀ (s^i, a^i) ∈ S^i × A^i
  5   V^i(s^i) ← 0, ∀ s^i ∈ S^i
  6   Sense the initial state s^i_0
  7   a^i_0 ← random action
  (* Phase II: Learning phase *)
  8   repeat (for each time step t)
  9     t ← t + 1
  10    Sense the new state s^i_t
  11    Observe the reward r^i_t(s^i_t)
  12    Broadcast V^i(s^i_t)
  13    while AllVvaluesReceived == False
  14      waitForVvalues()
  15    Update Q-value:
          Q^i(s^i_{t-1}, a^i_{t-1}) ← (1 − α) Q^i(s^i_{t-1}, a^i_{t-1})
                                      + α [ r^i_t(s^i_t) + γ Σ_{j ∈ Neigh(i)} f^i(j) V^j(s^j_t) ]
  16    Update V-value: V^i(s^i_{t-1}) ← max_{a ∈ A^i} Q^i(s^i_{t-1}, a)
  17    Take local action a^i_t (ε-greedy action selection):
          a^i_t = argmax_{a ∈ A^i} Q^i(s^i_t, a) with probability (1 − ε),
                  random action with probability ε
  18    until terminal condition

Figure A-II: DVF algorithm for agent i.

In the DR case, the algorithm is the same except that the agents do not broadcast the value of the state they land in but the immediate reward they receive. Therefore, in Figure A-II, V^i(s^i_t) and V^j(s^j_t) should be replaced by r^i(s^i_t) and r^j(s^j_t) respectively at Lines 12 and 15.
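The thesis's actual implementation runs on TinyOS motes and is not reproduced here; purely as an illustration of the two update rules above, the following is a minimal Python sketch of the IL update (local reward, own bootstrap) next to the DVF update (bootstrap built from neighbours' broadcast V-values). The class names, the neighbor_weights mapping and the toy environment at the end are hypothetical, and the learning parameters are arbitrary.

    import random
    from collections import defaultdict

    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate

    class ILAgent:
        """Independent Q-learner: updates from its own local reward only (Figure A-I)."""
        def __init__(self, actions):
            self.actions = actions
            self.q = defaultdict(float)      # (state, action) -> Q-value

        def best_value(self, state):
            return max(self.q[(state, a)] for a in self.actions)

        def update(self, prev_state, prev_action, reward, new_state):
            key = (prev_state, prev_action)
            target = reward + GAMMA * self.best_value(new_state)
            self.q[key] = (1 - ALPHA) * self.q[key] + ALPHA * target

        def act(self, state):
            if random.random() < EPSILON:
                return random.choice(self.actions)
            return max(self.actions, key=lambda a: self.q[(state, a)])

    class DVFAgent(ILAgent):
        """DVF learner: the bootstrap term mixes neighbours' broadcast V-values (Figure A-II)."""
        def __init__(self, actions, neighbor_weights):
            super().__init__(actions)
            self.neighbor_weights = neighbor_weights   # {neighbour id j: f_i(j)}
            self.v = defaultdict(float)                # local V-table, broadcast every step

        def update(self, prev_state, prev_action, reward, neighbor_values):
            # neighbor_values: {neighbour id j: V_j(s_j_t)} received over the radio this step
            mixed = sum(w * neighbor_values[j] for j, w in self.neighbor_weights.items())
            key = (prev_state, prev_action)
            self.q[key] = (1 - ALPHA) * self.q[key] + ALPHA * (reward + GAMMA * mixed)
            self.v[prev_state] = self.best_value(prev_state)   # Line 16 of Figure A-II

    # Toy usage of the IL agent on a two-state, two-action problem.
    agent = ILAgent(actions=[0, 1])
    state = 0
    for _ in range(200):
        action = agent.act(state)
        new_state = 1 - state if action == 1 else state
        reward = 1.0 if new_state == 1 else 0.0
        agent.update(state, action, reward, new_state)
        state = new_state
    print({k: round(v, 2) for k, v in agent.q.items()})

In the DR variant, neighbor_values would carry the neighbours' immediate rewards instead of their V-values, exactly as noted under Figure A-II.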
A-3 Optimistic DRL - Lauer and Riedmiller

Algorithm: OptDRL algorithm

  (* Phase I: Initialization *)
  1   Q^i(S, a^i) ← 0, ∀ (S, a^i) ∈ S × A^i
  2   Π^i(S) ← 0, ∀ S ∈ S
  3   Determine the initial global state S_0
  4   a^i_0 ← random action
  (* Phase II: Learning phase *)
  5   repeat (for each time step t)
  6     t ← t + 1
  7     Determine the new global state S_t
  8     Observe the global reward R_t(S_t)
  9     Update Q-value:
          Q^i(S_{t-1}, a^i_{t-1}) ← max{ Q^i(S_{t-1}, a^i_{t-1}),
                                         (1 − α) Q^i(S_{t-1}, a^i_{t-1}) + α [ R_t(S_t) + γ max_{a ∈ A^i} Q^i(S_t, a) ] }
  10    Update policy Π:
          Π^i(S_{t-1}) ← a^i_{t-1} iff max_{a ∈ A^i} Q^i(S_{t-1}, a) = max_{a ∈ A^i} Q^i(S_t, a)
  11    Take local action a^i_t (ε-greedy action selection):
          a^i_t = argmax_{a ∈ A^i} Q^i(S_t, a) with probability (1 − ε),
                  random action with probability ε
  12    until terminal condition

Figure A-III: OptDRL algorithm for agent i.

A-4 WoLF-PHC - Bowling and Veloso

The pseudo-code of the WoLF-PHC algorithm for agent i is given in Figure 3.3 on page 55.

A-5 FMQ heuristics extended from Kudenko and Kapetanakis

Algorithm: FMQg heuristic

  (* Phase I: Initialization *)
  1   Q^i(S, a^i) ← 0, ∀ (S, a^i) ∈ S × A^i
  2   EV^i(S, a^i) ← 0, ∀ (S, a^i) ∈ S × A^i
  3   R^i_max(S, a^i) ← 0, ∀ (S, a^i) ∈ S × A^i
  4   CountR^i_max(S, a^i) ← 0, ∀ (S, a^i) ∈ S × A^i
  5   CountAction(S, a^i) ← 0, ∀ (S, a^i) ∈ S × A^i
  6   Determine the initial global state S_0
  7   a^i_0 ← random action
  (* Phase II: Learning phase *)
  8   repeat (for each time step t)
  9     t ← t + 1
  10    Determine the new global state S_t
  11    Observe the global reward R_t(S_t)
  12    Update CountR^i_max(S_{t-1}, a^i_{t-1}):
          if R_t(S_t) > R^i_max(S_{t-1}, a^i_{t-1})
            R^i_max(S_{t-1}, a^i_{t-1}) ← R_t(S_t)
            CountR^i_max(S_{t-1}, a^i_{t-1}) ← 1
          else if R_t(S_t) == R^i_max(S_{t-1}, a^i_{t-1})
            CountR^i_max(S_{t-1}, a^i_{t-1}) ← CountR^i_max(S_{t-1}, a^i_{t-1}) + 1
  13    Update Q-value:
          Q^i(S_{t-1}, a^i_{t-1}) ← (1 − α) Q^i(S_{t-1}, a^i_{t-1}) + α [ R_t(S_t) + γ max_{a ∈ A^i} Q^i(S_t, a) ]
  14    Update EV-value:
          EV^i(S_{t-1}, a^i_{t-1}) ← Q^i(S_{t-1}, a^i_{t-1})
                                     + C × [ CountR^i_max(S_{t-1}, a^i_{t-1}) / CountAction(S_{t-1}, a^i_{t-1}) ] × R^i_max(S_{t-1}, a^i_{t-1})
  15    Update π (Boltzmann action selection):
          π^i(S_{t-1}, a) = exp( EV^i(S_{t-1}, a) / T_t ) / Σ_{a' ∈ A^i} exp( EV^i(S_{t-1}, a') / T_t ), ∀ a ∈ A^i
  16    Take local action a^i_t with probability π^i(S_t, a^i_t)
  17    Update CountAction: CountAction(S_t, a^i_t) ← CountAction(S_t, a^i_t) + 1
  18    until terminal condition

Figure A-IV: FMQg algorithm for agent i.

In the FMQl case, the state relies on information locally available to every agent, i.e. S has to be replaced by s^i in Figure A-IV.

A-6 Coordinated Q-Learning - Guestrin

Algorithm: CQL algorithm

  (* Phase I: Initialization *)
  1   w_0 ← 1, initial values for the parameters
  2   Π^i(S) ← 0, ∀ S ∈ S
  3   O: elimination ordering
  4   O⁻: action selection ordering (reverse ordering of O)
  5   Determine the initial global state S_0
  6   a^i_0 ← random action
  (* Phase II: Learning phase *)
  7   repeat (for each time step t)
  8     t ← t + 1
  9     Determine the new global state S_t
  10    Receive the global reward R_t(S_t)
  11    Compute a*_t and V(S_t) using a distributed version of the VE algorithm presented in Figure 3.2 on page 52, i.e.:
  12      (i) Maximization:
  13        Wait for the elimination signal defined by O
  14        Collect the local functions (e^j- and Q^j-functions) influenced by the actions of agent i
  15        Define a new function e^i = max_{a ∈ A^i} [ Σ_j e^j + Σ_j Q^j ]
  16        Signal e^i to the next agent to be eliminated given by O
  17      (ii) Action selection and computation of V:
  18        Wait for the action choice of previous agents given by O⁻
  19        Determine a^i*_t = argmax_{a ∈ A^i} e^i(·)
  20        Compute V(S_t) ← V(S_t) + Q(S_t, a*_t)
  21        Signal a^i*_t and V(S_t) to the next agent according to O⁻
  22        The last agent defined by O⁻ broadcasts a*_t and V(S_t)
  23    Compute ∇_{w^i} Q^i(S_t, a_t)
  24    Update Q^i-function parameters w^i:
          w^i_{j,t} ← w^i_{j,t-1} + α Δ(S_{t-1}, a_t, R(S_t), S_t, w_{t-1}) ∇_{w^i_{j,t-1}} Q^i_w(S_{t-1}, a_{t-1})
  25    Take local action a^i_t (ε-greedy action selection):
          a^i_t = a^i*_t with probability (1 − ε), random action with probability ε
  26    until terminal condition

Figure A-V: CQL algorithm for agent i.
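Again only as an illustration, not the thesis's code, the following Python sketch isolates three ideas that distinguish OptDRL and FMQ from the plain update: the optimistic max-update of Figure A-III (Line 9), the FMQ evaluation value of Figure A-IV (Line 14) and Boltzmann selection over it (Line 15). The value of the weighting constant C and the temperature passed in are arbitrary, and the division guard is an addition not present in the pseudo-code.

    import math
    import random

    ALPHA, GAMMA = 0.1, 0.9
    C = 10.0   # FMQ weighting constant (value chosen arbitrarily here)

    def optimistic_update(q, state, action, reward, next_q_values):
        """OptDRL, Figure A-III Line 9: a Q-value is never allowed to decrease."""
        key = (state, action)
        target = (1 - ALPHA) * q[key] + ALPHA * (reward + GAMMA * max(next_q_values))
        q[key] = max(q[key], target)

    def fmq_evaluation(q, r_max, count_r_max, count_action, state, action):
        """FMQ, Figure A-IV Line 14: boost actions that frequently achieved their best reward."""
        key = (state, action)
        freq = count_r_max[key] / max(count_action[key], 1)   # guard against dividing by zero
        return q[key] + C * freq * r_max[key]

    def boltzmann_select(ev_values, temperature):
        """FMQ, Figure A-IV Line 15: sample an action index from softmax(EV / T)."""
        weights = [math.exp(ev / temperature) for ev in ev_values]
        threshold = random.random() * sum(weights)
        running = 0.0
        for index, weight in enumerate(weights):
            running += weight
            if running >= threshold:
                return index
        return len(weights) - 1

A high temperature makes the selection nearly uniform, while annealing it towards zero makes the choice greedy in EV, which is the usual way the exploration schedule is handled with a Boltzmann policy.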
Appendix B - List of Publications

B-1 Published paper

[1] C.-K. Tham and J.-C. Renaud, "Multi-Agent Systems on Sensor Networks: A Distributed Reinforcement Learning Approach," in Proceedings of the Second Intelligent Sensors, Sensor Networks and Information Processing Conference (ISSNIP'05), Melbourne, Australia, 2005, pp. 423–429.

B-2 Pending Publication

[1] J.-C. Renaud and C.-K. Tham, "Coordinated Sensing Coverage in Sensor Networks using Distributed Reinforcement Learning," to appear in the proceedings of the Second Workshop on Coordinated Quality of Service in Distributed Systems (COQODS-II), held in conjunction with the Fourteenth International Conference On Networks (ICON 2006), Singapore, September 2006.

B-3 Submitted paper

[1] Article under preparation for submission to an international journal.

Excerpts from Chapter 2 (Literature review):

2.1 Multi-agent Learning

A possible approach to multi-agent learning is to regard the MAS as a large single agent whose state and action spaces are the concatenation of ... (the size blow-up this implies is sketched after these excerpts)

... the MAXQ approach to multi-agent RL. The main idea of [52] is to take advantage of the hierarchy approach and enable communication at high-level tasks only. Each node uses the same MAXQ hierarchy ...

... observability is to enable the agents to exchange information via communication.

2.3.2 Multi-agent learning with communication

Allowing agents to communicate with one another in a multi-agent domain ...
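As promised above, a tiny Python sketch (with hypothetical sizes, not taken from the thesis) of why treating the MAS as one large agent is problematic: the joint learner's Q-table grows as the product of all local state and action spaces, whereas per-agent tables grow only linearly with the number of nodes, which is the curse of dimensionality that Section 2.2 and the distributed algorithms of Appendix A address.

    from math import prod

    def joint_table_size(local_state_sizes, local_action_sizes):
        """Q-table entries for a single 'big agent' over the concatenated state-action space."""
        return prod(local_state_sizes) * prod(local_action_sizes)

    # Hypothetical lighting-grid-style setting: n nodes, 4 local states and 2 local actions each.
    for n in (2, 4, 8, 16):
        joint = joint_table_size([4] * n, [2] * n)
        local = n * 4 * 2   # total entries if every agent keeps only its own local table
        print(f"{n:2d} agents: joint table = {joint:,} entries, local tables = {local} entries")

With 16 such nodes the joint table already holds 8^16, roughly 2.8 × 10^14, entries, which is why the independent, hierarchical and factored approaches surveyed in Chapter 2 are attractive on resource-constrained sensor nodes.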
