Types of Reinforcement Learning Algorithms
• Policy gradients: directly differentiate the objective above (a minimal sketch follows this list)
• Value-based: estimate the value function or Q-function of the optimal policy (no explicit policy is represented)
• Actor-critic: estimate the value function or Q-function of the current policy, and use it to improve the policy
• Model-based: estimate the transition model, and then either
  1. use the model for planning (no explicit policy),
  2. use the model to improve a policy, or
  3. something else
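To make "directly differentiate the objective" concrete, here is a minimal REINFORCE sketch for a discrete toy problem. The environment interface (`reset()` returning a state index, `step(a)` returning `(next_state, reward, done)`) and the tabular softmax policy are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_gradient(theta, env, gamma=0.99):
    """One-episode REINFORCE estimate of grad J(theta) =
    E[ sum_t grad log pi(a_t|s_t) * (reward-to-go)_t ].
    theta: (num_states, num_actions) table of policy logits (assumed)."""
    grads, rewards = [], []
    s, done = env.reset(), False
    while not done:
        probs = softmax(theta[s])
        a = np.random.choice(len(probs), p=probs)
        g = -probs                    # grad of log softmax wrt logits of s
        g[a] += 1.0                   # ... equals one_hot(a) - probs
        grads.append((s, g))
        s, r, done = env.step(a)      # assumed toy-environment interface
        rewards.append(r)
    G, returns = 0.0, []
    for r in reversed(rewards):       # discounted reward-to-go
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    grad = np.zeros_like(theta)
    for (s, g), G in zip(grads, returns):
        grad[s] += g * G
    return grad
```

A gradient-ascent step would then be `theta += lr * reinforce_gradient(theta, env)`, repeated over many episodes.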
Comparison:
• Supervised learning: almost always uses gradient descent
• Reinforcement learning: usually does not use gradient descent
Examples of specific algorithms:
• Value function methods (a Q-learning sketch follows this list)
  • Q-learning, DQN
  • Temporal difference learning
  • Fitted value iteration
• Policy gradient methods
  • REINFORCE
  • Natural policy gradient
  • Trust region policy optimization
• Actor-critic methods
  • Asynchronous advantage actor-critic (A3C)
  • Soft actor-critic (SAC)
• Model-based methods
  • Dyna
  • Guided policy search
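As a concrete instance of the value-function family above, here is a tabular Q-learning sketch (the TD update behind DQN, without the neural network). The environment interface is the same illustrative assumption as before.

```python
import numpy as np

def q_learning(env, num_states, num_actions,
               episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Off-policy TD update toward r + gamma * max_a' Q(s', a')."""
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration
            a = np.random.randint(num_actions) if np.random.rand() < eps \
                else int(Q[s].argmax())
            s_next, r, done = env.step(a)      # assumed env interface
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Note that this update does not follow the gradient of any fixed objective, which is the sense in which the comparison above says RL is "usually not gradient descent."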
Application examples:
Example 1: playing Atari games with Q-functions (a sketch follows the references). References:
• Playing Atari with deep reinforcement learning, Mnih et al. ’13
• Q-learning with convolutional neural networks
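A sketch of the "Q-learning with convolutional neural networks" idea: a small network maps a stack of recent frames to one Q-value per action, so a single forward pass scores every action. Layer sizes follow those reported in the Mnih et al. ’13 paper for 84x84 inputs, but treat this as an illustration rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Maps (batch, 4, 84, 84) stacked frames to (batch, num_actions) Q-values."""
    def __init__(self, num_actions, in_frames=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_frames, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),                      # 84x84 input -> 9x9 feature maps
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, frames):                 # frames scaled to [0, 1]
        return self.net(frames)

# Training regresses Q(s, a) toward the TD target
# y = r + gamma * max_a' Q_target(s', a') over a replay buffer.
q = AtariQNetwork(num_actions=6)
print(q(torch.zeros(1, 4, 84, 84)).shape)      # torch.Size([1, 6])
```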
Example 2: robots and model-based reinforcement learning (a planning sketch follows the references). References:
• End-to-end training of deep visuomotor policies, Levine*, Finn* ’16
• Guided policy search (model-based RL) for image-based robotic manipulation
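Guided policy search itself is more involved, but the core model-based recipe (fit a transition model, then plan with it) can be sketched with a simple random-shooting planner. Here `model` and `reward_fn` are hypothetical batched functions standing in for the learned dynamics and the task reward.

```python
import numpy as np

def plan_action(state, model, reward_fn, horizon=10,
                num_candidates=1000, action_dim=2):
    """Random-shooting MPC: roll random action sequences through the learned
    model, score them, and execute the first action of the best sequence."""
    actions = np.random.uniform(-1.0, 1.0,
                                size=(num_candidates, horizon, action_dim))
    states = np.repeat(state[None, :], num_candidates, axis=0)
    total = np.zeros(num_candidates)
    for t in range(horizon):
        total += reward_fn(states, actions[:, t])
        states = model(states, actions[:, t])   # learned dynamics: s' = f(s, a)
    return actions[int(total.argmax()), 0]      # replan at every step
```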
Example 3: learning to walk with policy gradients (an advantage-estimation sketch follows the references). References:
• High-dimensional continuous control with generalized advantage estimation, Schulman et al. ’16
• Trust region policy optimization with value function approximation
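The generalized advantage estimator in the Schulman et al. paper combines one-step TD errors with an exponential weight lambda. A minimal sketch of the standard recursion, ignoring episode-boundary handling for brevity:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """A_t = delta_t + gamma * lam * A_{t+1},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` has one extra entry for the state after the final reward."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# lam=0 recovers the one-step TD error; lam=1 recovers Monte Carlo returns
# minus the value baseline.
```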
Example 4: robotic grasping with Q-functions (an action-selection sketch follows the references). References:
• QT-Opt, Kalashnikov et al. ‘18
• Q-learning from images for real-world robotic grasping
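QT-Opt has no explicit actor: the continuous grasping action is chosen by maximizing the learned Q-function directly, in the paper via the cross-entropy method. Below is a sketch of that inner maximization, where `q_fn(state, actions)` is an assumed batched Q-function, not an API from the paper's code.

```python
import numpy as np

def cem_argmax_q(state, q_fn, action_dim, iters=3, pop=64, elite_frac=0.1):
    """Cross-entropy method: repeatedly refit a Gaussian over actions to the
    samples that score highest under Q(state, action)."""
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        actions = np.random.randn(pop, action_dim) * std + mean
        scores = q_fn(state, actions)                # shape: (pop,)
        elite = actions[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean    # approximate argmax_a Q(state, a)
```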