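These are notes I took while following David Silver's Reinforcement Learning course. My own background is limited and I worked through the lectures without translated subtitles, so some points are bound to be misunderstood; comments and discussion are welcome.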


Two references on Reinforcement Learning:

An Introduction to Reinforcement Learning


Algorithms for Reinforcement Learning


1、About Reinforcement Learning

Many Faces of Reinforcement Learning

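The three branches of Machine Learning: Supervised Learning, Unsupervised Learning, Reinforcement Learning

What makes reinforcement learning different from the other machine learning paradigms: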

  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non i.i.d data)
  • Agent’s actions affect the subsequent data it receives

2、The Reinforcement Learning Problem




Reinforcement learning is based on the reward hypothesis: all goals can be described by the maximisation of expected cumulative reward.



Sequential Decision Making





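state: the information used to determine what happens next. In other words, some of the information in the history is used to decide what will happen next, and the information used in this way is called the state.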

Here, "what happens next" has two parts:

  • The agent selects actions: which action the agent will choose
  • The environment selects observations/rewards: which observation/reward the environment will emit
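
The two halves of "what happens next" can be sketched as a simple interaction loop. This is only a minimal illustration: `ToyEnvironment`, `random_agent` and `run_episode` are hypothetical names made up for these notes, not something from the course. At each step the agent picks an action, the environment answers with an observation and a reward, and all three are appended to the history.

```python
import random


class ToyEnvironment:
    """Hypothetical two-action environment: action 1 pays reward 1, action 0 pays 0."""

    def step(self, action):
        observation = f"saw action {action}"
        reward = 1.0 if action == 1 else 0.0
        return observation, reward


def random_agent(history, rng):
    """Hypothetical agent that ignores the history and acts uniformly at random."""
    return rng.choice([0, 1])


def run_episode(env, agent, num_steps, seed=0):
    rng = random.Random(seed)
    history = []  # list of (action, observation, reward) triples
    for _ in range(num_steps):
        action = agent(history, rng)            # the agent selects an action
        observation, reward = env.step(action)  # the environment selects observation/reward
        history.append((action, observation, reward))
    return history


history = run_episode(ToyEnvironment(), random_agent, num_steps=5)
total_reward = sum(reward for _, _, reward in history)
```

A smarter agent would replace `random_agent` with a function that actually reads the history (or a state computed from it) before choosing its action.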

State is further divided into environment state, agent state and information state.

environment state: S_t^e is the environment's private representation. It is usually invisible to the agent, and even when it is visible it may contain irrelevant information.

agent state: S_t^a is the agent's internal representation. It can be any function of the history.

information state: also called Markov state. It therefore has the property that "The future is independent of the past given the present".
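
In symbols, writing S_t for the state at time t (the notation here follows the usual convention and is an assumption on my part), a state is Markov if and only if:

```latex
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]
```

Once the current state is known, the earlier states add nothing to the prediction of the next state, so the history can be thrown away.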

The environment state and the history are Markov


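Analysis (rat example): if the agent state is the last three items of the sequence, the prediction is that the rat receives an electric shock. If the agent state is the counts of lights, bells and lever presses, the prediction is that the rat gets cheese. If the agent state is the complete sequence, we cannot tell what will happen.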
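
The analysis above can be reproduced with a few lines of code. The three sequences below are an assumption: the figure is not reproduced in these notes, so they are written down as I remember them from the slide. The function names (`last_three`, `counts`, `full_sequence`, `predict`) are hypothetical. Each function is one possible choice of agent state, and the prediction is simply the outcome of a previously seen sequence that maps to the same state.

```python
from collections import Counter

# Assumed sequences and outcomes (reconstructed from memory, not taken from the notes).
seen = {
    ("light", "light", "lever", "bell"): "shock",
    ("bell", "light", "lever", "lever"): "cheese",
}
query = ("lever", "light", "lever", "bell")


def last_three(history):
    """Agent state = last three items of the sequence."""
    return tuple(history[-3:])


def counts(history):
    """Agent state = how many lights, bells and lever presses occurred."""
    return frozenset(Counter(history).items())


def full_sequence(history):
    """Agent state = the complete sequence."""
    return tuple(history)


def predict(state_fn, history):
    """Return the outcome of a seen sequence with the same agent state, else None."""
    target = state_fn(history)
    for past, outcome in seen.items():
        if state_fn(past) == target:
            return outcome
    return None


predictions = {fn.__name__: predict(fn, query) for fn in (last_three, counts, full_sequence)}
print(predictions)
```

The same history leads to three different predictions, which is the point of the example: what the agent expects depends entirely on how it summarises the history into a state.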

Fully Observable Environments: the agent directly observes the environment state. Formally, this is a Markov decision process (MDP).

agent state = environment state = information state

Partially Observable Environments: the agent indirectly observes the environment. Formally, this is a partially observable Markov decision process (POMDP).

  • A robot with camera vision isn’t told its absolute location
  • A trading agent only observes current prices
  • A poker playing agent only observes public cards

agent state ≠ environment state

3、Inside a Reinforcement Learning Agent


  • Policy: the agent's behaviour function
  • Value Function: how good each state and/or action is
  • Model: the agent's representation of the environment (what the environment looks like from the agent's point of view)

==Policy==: the agent's behaviour, that is, which action the agent takes in a given state. So a policy is a map from state to action.
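
A policy as a map from state to action can be sketched with a plain dictionary. The two-state problem and all names below are hypothetical. The notes only describe the deterministic view, a = π(s); the stochastic form, π(a | s) = P[A_t = a | S_t = s], is added here as the usual generalisation.

```python
import random

# Hypothetical two-state, two-action problem, used only for illustration.
deterministic_policy = {"s0": "left", "s1": "right"}  # a = pi(s)

stochastic_policy = {  # pi(a | s): one probability distribution over actions per state
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}


def act_deterministic(state):
    """Look up the single action the policy prescribes for this state."""
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """Sample an action according to the policy's distribution for this state."""
    actions = list(stochastic_policy[state])
    weights = [stochastic_policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
sample = act_stochastic("s0", rng)
```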

==Value Function==: a prediction of future reward, used to evaluate how good or bad a state is.
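
In symbols, the state value function under a policy π is usually written as the expected discounted sum of future rewards. The discount factor γ ∈ [0, 1] is not mentioned in these notes and is introduced here as an assumption:

```latex
v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]
```

The value depends on the policy: the same state can be good under one behaviour and bad under another.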


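example: the plot in the top-left corner is the state value function. The purple object on the game screen is the mothership, which is worth more points when shot down. So when the mothership appears from the right, the predicted future reward increases and the value function starts to rise. Once the mothership has passed, whether or not it was hit, the value function drops sharply, because only ordinary enemies remain and the expected future reward returns to its usual level.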

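There is also a Breakout example: the higher a brick sits, the more points it is worth. At the start of the game the value function is fairly smooth. Once many of the lower bricks have been cleared, the chance of reaching the deeper bricks rises, so the value function fluctuates more.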


==Model==: predicts what the environment will do next

P predicts the next state

R predicts the next (immediate) reward
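
One simple way to make P and R concrete is to estimate them from experience by counting. This count-based estimate is my own illustration, not a method taken from the lecture, and the states, actions and transitions below are made up.

```python
from collections import defaultdict

# Hypothetical experience: (state, action, reward, next_state) tuples.
transitions = [
    ("s0", "go", 0.0, "s1"),
    ("s0", "go", 0.0, "s1"),
    ("s0", "go", 1.0, "s0"),
    ("s1", "go", 2.0, "s0"),
]

next_state_counts = defaultdict(lambda: defaultdict(int))
reward_sums = defaultdict(float)
visit_counts = defaultdict(int)

for state, action, reward, next_state in transitions:
    next_state_counts[(state, action)][next_state] += 1
    reward_sums[(state, action)] += reward
    visit_counts[(state, action)] += 1


def P(state, action, next_state):
    """Estimated probability of reaching next_state after taking action in state.

    Only valid for (state, action) pairs that appear in the experience.
    """
    return next_state_counts[(state, action)][next_state] / visit_counts[(state, action)]


def R(state, action):
    """Estimated expected immediate reward for taking action in state.

    Only valid for (state, action) pairs that appear in the experience.
    """
    return reward_sums[(state, action)] / visit_counts[(state, action)]


print(P("s0", "go", "s1"), R("s0", "go"))
```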

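Based on the three components above, RL agents can be categorised as value based, policy based or actor critic, and as model free or model based.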

4、Problems within Reinforcement Learning

==Learning and Planning==

Two fundamental problems in sequential decision making

  • Reinforcement Learning:
    • The environment is initially unknown
    • The agent interacts with the environment
    • The agent improves its policy
  • Planning:
    • A model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy
    • a.k.a. deliberation, reasoning, introspection, pondering, thought, search

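Example of Reinforcement Learning: the rules of the game are unknown, so the agent can only learn by playing, choosing its next action from the observed score and game screen.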


==Exploration and Exploitation==

Reinforcement learning is like trial-and-error learning

  • Exploration: find out more information about the environment
  • Exploitation: use the information already known about the environment to maximise reward
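
A common way to trade off the two is the epsilon-greedy rule, sketched below on a multi-armed bandit. Neither epsilon-greedy nor the bandit setting appears in these notes; they are brought in only to make the trade-off concrete, and `true_means`, `epsilon` and the function names are illustrative assumptions.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best-looking action."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))                        # exploration
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploitation


def run_bandit(true_means, epsilon, num_steps, seed=0):
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)  # current estimate of each action's mean reward
    counts = [0] * len(true_means)       # how often each action has been tried
    for _ in range(num_steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[action], 1.0)  # noisy reward from the made-up environment
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]  # incremental mean
    return estimates, counts


estimates, counts = run_bandit(true_means=[0.0, 1.0, 2.0], epsilon=0.1, num_steps=2000)
print(estimates, counts)
```

With `epsilon=0` the agent never explores and can get stuck on whichever action looked best first; with `epsilon=1` it never uses what it has learned.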


==Prediction and Control==

  • Prediction: evaluate the future reward, given a policy
  • Control: optimise the future reward, find the best policy

Gridworld Example: I did not understand this one at first.

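Postscript: I now understand some of it. If the agent moves onto A, it jumps to A' and receives a reward of +10; if it moves onto B, it jumps to B' and receives a reward of +5.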
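
To check this understanding, here is a minimal sketch of the prediction problem on the gridworld: evaluating the uniform random policy by repeatedly applying the Bellman expectation backup. Several details are assumptions because the figure is not reproduced here: a 5x5 grid, the positions of A, A', B and B' (taken from the textbook version of this example), a reward of -1 for bumping into a wall, 0 for every other move, and a discount factor of 0.9.

```python
SIZE = 5
GAMMA = 0.9  # assumed discount factor
# Assumed layout as (row, column), following the textbook version of this example.
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Return (next_state, reward) for one move."""
    if state == A:
        return A_PRIME, 10.0  # every action taken in A jumps to A'
    if state == B:
        return B_PRIME, 5.0   # every action taken in B jumps to B'
    row, col = state[0] + action[0], state[1] + action[1]
    if 0 <= row < SIZE and 0 <= col < SIZE:
        return (row, col), 0.0
    return state, -1.0  # bumping into a wall: stay put, reward -1 (assumed)


def evaluate_random_policy(tolerance=1e-6):
    """Prediction: iterate the Bellman expectation backup for the uniform random policy."""
    values = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    while True:
        new_values = {}
        for state in values:
            total = 0.0
            for action in ACTIONS:
                next_state, reward = step(state, action)
                total += 0.25 * (reward + GAMMA * values[next_state])
            new_values[state] = total
        delta = max(abs(new_values[s] - values[s]) for s in values)
        values = new_values
        if delta < tolerance:
            return values


values = evaluate_random_policy()
for r in range(SIZE):
    print(" ".join(f"{values[(r, c)]:5.1f}" for c in range(SIZE)))
```

Under these assumptions the value of A comes out below its immediate reward of 10, because A' sits on the bottom edge where a random walker often bumps into the wall, while the value of B ends up slightly above 5. The control problem would instead look for the policy that maximises these values.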