These are notes I took while working through David Silver's Reinforcement Learning course. Since my background is limited and I watched the raw lectures without subtitles or translation, some misunderstandings are inevitable; discussion and corrections are welcome.

Course materials: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html

Two reference books on reinforcement learning:

Reinforcement Learning: An Introduction (Sutton and Barto)

https://webdocs.cs.ualberta.ca/~sutton/book/the-book-1st.html

Algorithms for Reinforcement Learning (Szepesvári)

https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf

1. About Reinforcement Learning

Many Faces of Reinforcement Learning

The three branches of machine learning: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

How RL differs from the other two:

  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non i.i.d data)
  • Agent’s actions affect the subsequent data it receives

2. The Reinforcement Learning Problem

This lecture introduces three concepts: reward, environment, and state.

==reward==

The reward $R_t$ is a scalar (just a number) measuring how well the agent is doing at step $t$; the agent's goal is to maximize cumulative reward.

Reinforcement learning is based on the reward hypothesis: "All goals can be described by the maximisation of expected cumulative reward."

In short, the hypothesis assumes that every goal can be expressed as maximizing expected cumulative reward.
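One way to make this concrete (my formalization, not a formula from the slides): a goal is encoded as a reward signal, and for an episode of length $T$ the agent's objective is

$$\max \; \mathbb{E}\left[\sum_{t=1}^{T} R_t\right]$$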


Sequential Decision Making: the goal is to select actions that maximize total future reward. Actions may have long-term consequences and rewards may be delayed, so it can be better to sacrifice immediate reward for more long-term reward.

==environment==

The agent and the environment interact in a loop: the agent executes an action, which influences the environment, and the environment responds with an observation and a reward.
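A minimal sketch of this loop in Python; `CoinEnv` and `RandomAgent` are toy stand-ins of my own, not code from the course:

```python
import random

# Toy illustration of the agent-environment loop; CoinEnv and RandomAgent
# are hypothetical examples, not code from the course.

class CoinEnv:
    """Environment: the agent tries to guess a fair coin flip each step."""
    def __init__(self, episode_length=10):
        self.t = 0
        self.episode_length = episode_length

    def step(self, action):
        coin = random.choice(["heads", "tails"])
        reward = 1.0 if action == coin else 0.0      # reward for a correct guess
        self.t += 1
        done = self.t >= self.episode_length
        return coin, reward, done                    # observation, reward, done

class RandomAgent:
    """Agent: ignores its feedback and guesses at random."""
    def act(self, observation, reward):
        return random.choice(["heads", "tails"])

env, agent = CoinEnv(), RandomAgent()
observation, reward, done, total = None, 0.0, False, 0.0
while not done:
    action = agent.act(observation, reward)          # agent -> action
    observation, reward, done = env.step(action)     # environment -> observation, reward
    total += reward
print("cumulative reward:", total)
```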

==state==

History: all the observations, rewards, and actions up to step $t$. Note that it stops short of $A_t$: the agent chooses its action based on the observations and rewards received so far, so at the moment of choosing, everything before that point is the past.

State: "the information used to determine what happens next". In other words, the state is whatever information from the history we use to determine what happens next.
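In the course's notation, the history is everything observable up to time $t$, and the state is a function of it:

$$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t, \qquad S_t = f(H_t)$$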

Here "what happens next" has two parts:

  • The agent selects actions: which action the agent will choose
  • The environment selects observations/rewards: which observation and reward the environment will emit

State comes in three kinds: environment state, agent state, and information state.

Environment state: $S_t^e$ is the environment's private representation. It is usually invisible to the agent, and even when it is visible it may contain irrelevant information.

Agent state: $S_t^a$ is the agent's internal representation. It can be any function of the history.

Information state: also called a Markov state; it captures all the useful information in the history, which is why "the future is independent of the past given the present".

Both the environment state $S_t^e$ and the history $H_t$ are Markov.
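The formal definition from the course: a state $S_t$ is Markov if and only if

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$$

i.e. the current state is a sufficient statistic of the history for predicting the future.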

Example: each row below is one episode. First: the light comes on twice, the rat presses the lever, then the bell rings, and the rat gets an electric shock. Second: the bell rings, the light comes on, then the rat presses the lever twice, and the rat gets cheese. Third: the rat presses the lever, the light comes on, the rat presses the lever again, then the bell rings. What will the rat get?

Analysis: if the agent state is the order of the last three events, the rat gets shocked (they match episode one). If the agent state is the counts of lights, bells, and lever presses, the rat gets cheese (they match episode two). If the agent state is the entire sequence, we cannot tell what will happen.

Fully observable environments: the agent directly observes the environment state. This setting is called a Markov decision process (MDP).

agent state = environment state = information state

Partially observable environments: the agent observes the environment only indirectly. This setting is called a partially observable Markov decision process (POMDP). For example:

  • A robot with camera vision isn’t told its absolute location
  • A trading agent only observes current prices
  • A poker playing agent only observes public cards

Here agent state ≠ environment state, so the agent must construct its own state representation $S_t^a$.
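The course lists several options for building that representation: remembering the complete history, maintaining a Bayesian belief over environment states, or a recurrent update:

$$S_t^a = H_t, \qquad S_t^a = \big(\mathbb{P}[S_t^e = s^1], \ldots, \mathbb{P}[S_t^e = s^n]\big), \qquad S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$$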

3. Inside a Reinforcement Learning Agent

The three main components of an RL agent:

  • Policy: the agent's behaviour function, i.e. how it chooses actions
  • Value Function: an assessment of how good each state and/or action is
  • Model: the agent's representation of the environment (how the environment looks through the agent's eyes)

==Policy==: the agent's behaviour, i.e. which action it takes in a given state; so a policy is a map from state to action.
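In the course's notation, a policy can be deterministic or stochastic:

$$a = \pi(s), \qquad \pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$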

==Value Function==: a prediction of future reward, used to evaluate how good a state is.

In its definition (below), $\gamma$ is the discount factor, which controls how much future rewards matter now: the further away a reward is, the less it counts. With $\gamma = 0.9$, for example, the prediction's effective horizon is only about thirty to forty steps into the future.
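The definition that this $\gamma$ comes from (the state value function, as given in the course):

$$v_\pi(s) = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right]$$

As a sanity check on the horizon claim: $0.9^{30} \approx 0.04$ and $0.9^{40} \approx 0.015$, so with $\gamma = 0.9$ rewards more than a few dozen steps away contribute almost nothing to the prediction.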

Example: the top-left plot is a state value function for an Atari game. On the game screen there is a purple mothership, which is worth far more points when shot down. When the mothership appears from the right, the predicted future reward rises, so the value function climbs; once it passes out of view, hit or not, the value function drops sharply, because only ordinary enemies remain and the expected future reward falls back to its usual level.

There is also a Breakout example: bricks nearer the top are worth more points, so at the start of a game the value function is fairly smooth; once many of the lower bricks have been cleared, the chance of breaking through to the deeper bricks rises, and the value function fluctuates much more.

==Model==: the agent's model of the environment, used to predict what the environment will do next (which state it will move to, and what reward it will give):

P predicts the next state

R predicts the next (immediate) reward
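In the course's notation, for states $s, s'$ and action $a$:

$$\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a], \qquad \mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$$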

Based on which of these three components an agent uses, RL agents are commonly categorized as follows:

  • Value Based: a value function only; the policy is implicit (e.g. act greedily with respect to the values)
  • Policy Based: a policy only, with no value function
  • Actor Critic: both a policy and a value function
  • Model Free: a policy and/or value function, but no model
  • Model Based: a policy and/or value function, plus a model

4. Problems within Reinforcement Learning

==Learning and Planning==

Two fundamental problems in sequential decision making

  • Reinforcement Learning:
    • The environment is initially unknown
    • The agent interacts with the environment
    • The agent improves its policy
  • Planning:
    • A model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy
    • a.k.a. deliberation, reasoning, introspection, pondering, thought, search

Example of reinforcement learning: the rules of the game are unknown, so the agent can only learn by playing, choosing its next action from the observed score and the game screen.

Example of planning: the game's mechanics are fully known, so the agent knows exactly what each next step will look like and can work out a complete strategy in advance (like playing a game with a walkthrough).

==Exploration and Exploitation==

Reinforcement learning is like trial-and-error learning

  • Exploration: gather more information about the environment
  • Exploitation: use the information already gathered to maximize reward

For example, when choosing a restaurant: exploration means trying a new restaurant, exploitation means going back to your usual favourite.
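One standard way to trade the two off is ε-greedy action selection (it appears later in the course; this sketch and its names are my own illustration):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit

# q_values[i] is the current estimate of how good action i is,
# e.g. the average reward observed so far at restaurant i.
q_values = [1.2, 0.4, 2.0]
print(epsilon_greedy(q_values))   # usually 2, occasionally a random choice
```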

==Prediction and Control==

  • Prediction: evaluate the future, given a policy
  • Control: optimize the future, i.e. find the best policy (formalized below)
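In terms of the value function defined earlier, prediction evaluates a fixed policy while control searches over policies:

$$\text{prediction: compute } v_\pi(s), \qquad \text{control: find } v_*(s) = \max_\pi v_\pi(s)$$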

Gridworld Example: I didn't understand this at first.

Postscript: I understand it better now. Moving onto square A teleports the agent to A' with reward +10, and moving onto square B teleports it to B' with reward +5; the prediction problem asks for the value function of a fixed (uniform random) policy on this grid, while the control problem asks for the optimal value function and policy.
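Iterative policy evaluation is only introduced in a later lecture, but as a sketch of the prediction problem here, the following assumes the standard Sutton & Barto gridworld dynamics (off-grid moves give reward −1 and leave the state unchanged; all other ordinary moves give 0) and a uniform random policy with $\gamma = 0.9$:

```python
import numpy as np

# Prediction on the 5x5 gridworld (Sutton & Barto, Fig. 3.2): uniform random
# policy, gamma = 0.9. From A=(0,1) every action jumps to A'=(4,1) with
# reward +10; from B=(0,3) every action jumps to B'=(2,3) with reward +5;
# moving off the grid gives reward -1 and leaves the state unchanged.

N, gamma = 5, 0.9
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(r, c, a):
    if (r, c) == (0, 1):
        return (4, 1), 10.0                    # square A teleports to A'
    if (r, c) == (0, 3):
        return (2, 3), 5.0                     # square B teleports to B'
    nr, nc = r + a[0], c + a[1]
    if 0 <= nr < N and 0 <= nc < N:
        return (nr, nc), 0.0
    return (r, c), -1.0                        # bumped into the wall

v = np.zeros((N, N))
for _ in range(1000):                          # Bellman expectation backup
    new_v = np.zeros_like(v)
    for r in range(N):
        for c in range(N):
            for a in actions:                  # each action has probability 1/4
                (nr, nc), reward = step(r, c, a)
                new_v[r, c] += 0.25 * (reward + gamma * v[nr, nc])
    v = new_v
print(np.round(v, 1))                          # v(A) comes out around 8.8
```

Note that $v(A) \approx 8.8 < 10$: after teleporting to A' the agent tends to wander into the bottom wall and collect −1 penalties, and discounting carries that back to A.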