These are notes I took while working through David Silver's Reinforcement Learning course. Since my background is limited and I watched the raw lectures without subtitles or translation, some misunderstandings are inevitable; discussion and corrections are welcome.

Course materials: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html

Two reference books on reinforcement learning:

Reinforcement Learning: An Introduction (Sutton and Barto)

https://webdocs.cs.ualberta.ca/~sutton/book/the-book-1st.html

Algorithms for Reinforcement Learning (Szepesvári)

https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf

1. About Reinforcement Learning

Many Faces of Reinforcement Learning

The three branches of machine learning: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

How RL differs from the other two:

  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non i.i.d data)
  • Agent’s actions affect the subsequent data it receives

2. The Reinforcement Learning Problem

This lecture introduces three concepts: reward, environment, and state.

==reward==

The reward $R_t$ is a scalar (just a number) measuring how well the agent is doing at step $t$; the agent's goal is to maximize cumulative reward.

Reinforcement learning is based on the reward hypothesis: "All goals can be described by the maximisation of expected cumulative reward."

In short, the hypothesis assumes that every goal can be expressed as maximizing expected cumulative reward.
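One way to make this concrete (my formalization, not a formula from the slides): a goal is encoded as a reward signal, and for an episode of length $T$ the agent's objective is

$$\max \; \mathbb{E}\left[\sum_{t=1}^{T} R_t\right]$$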


Sequential Decision Making: the goal is to select actions that maximize total future reward. Actions may have long-term consequences and rewards may be delayed, so it can be better to sacrifice immediate reward for more long-term reward.

==environment==

The agent and the environment interact in a loop: the agent executes an action, which influences the environment, and the environment responds with an observation and a reward.
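A minimal sketch of this loop in Python; `CoinEnv` and `RandomAgent` are toy stand-ins of my own, not code from the course:

```python
import random

# Toy illustration of the agent-environment loop; CoinEnv and RandomAgent
# are hypothetical examples, not code from the course.

class CoinEnv:
    """Environment: the agent tries to guess a fair coin flip each step."""
    def __init__(self, episode_length=10):
        self.t = 0
        self.episode_length = episode_length

    def step(self, action):
        coin = random.choice(["heads", "tails"])
        reward = 1.0 if action == coin else 0.0      # reward for a correct guess
        self.t += 1
        done = self.t >= self.episode_length
        return coin, reward, done                    # observation, reward, done

class RandomAgent:
    """Agent: ignores its feedback and guesses at random."""
    def act(self, observation, reward):
        return random.choice(["heads", "tails"])

env, agent = CoinEnv(), RandomAgent()
observation, reward, done, total = None, 0.0, False, 0.0
while not done:
    action = agent.act(observation, reward)          # agent -> action
    observation, reward, done = env.step(action)     # environment -> observation, reward
    total += reward
print("cumulative reward:", total)
```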

==state==

History: all the observations, rewards, and actions up to step $t$. Note that it stops short of $A_t$: the agent chooses its action based on the observations and rewards received so far, so at the moment of choosing, everything before that point is the past.

State: "the information used to determine what happens next". In other words, the state is whatever information from the history we use to determine what happens next.
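In the course's notation, the history is everything observable up to time $t$, and the state is a function of it:

$$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t, \qquad S_t = f(H_t)$$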

Here "what happens next" has two parts:

  • The agent selects actions: which action the agent will choose
  • The environment selects observations/rewards: which observation and reward the environment will emit

State comes in three kinds: environment state, agent state, and information state.

Environment state: $S_t^e$ is the environment's private representation. It is usually invisible to the agent, and even when it is visible it may contain irrelevant information.

Agent state: $S_t^a$ is the agent's internal representation. It can be any function of the history.

Information state: also called a Markov state; it captures all the useful information in the history, which is why "the future is independent of the past given the present".

Both the environment state $S_t^e$ and the history $H_t$ are Markov.
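The formal definition from the course: a state $S_t$ is Markov if and only if

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$$

i.e. the current state is a sufficient statistic of the history for predicting the future.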

Example: each row below is one episode. First: the light comes on twice, the rat presses the lever, then the bell rings, and the rat gets an electric shock. Second: the bell rings, the light comes on, then the rat presses the lever twice, and the rat gets cheese. Third: the rat presses the lever, the light comes on, the rat presses the lever again, then the bell rings. What will the rat get?

Analysis: if the agent state is the order of the last three events, the rat gets shocked (they match episode one). If the agent state is the counts of lights, bells, and lever presses, the rat gets cheese (they match episode two). If the agent state is the entire sequence, we cannot tell what will happen.

Fully observable environments: the agent directly observes the environment state. This setting is called a Markov decision process (MDP).

agent state = environment state = information state

Partially observable environments: the agent observes the environment only indirectly. This setting is called a partially observable Markov decision process (POMDP). For example:

  • A robot with camera vision isn’t told its absolute location
  • A trading agent only observes current prices
  • A poker playing agent only observes public cards

Here agent state ≠ environment state, so the agent must construct its own state representation $S_t^a$.
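The course lists several options for building that representation: remembering the complete history, maintaining a Bayesian belief over environment states, or a recurrent update:

$$S_t^a = H_t, \qquad S_t^a = \big(\mathbb{P}[S_t^e = s^1], \ldots, \mathbb{P}[S_t^e = s^n]\big), \qquad S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$$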

3. Inside a Reinforcement Learning Agent

The three main components of an RL agent:

  • Policy: the agent's behaviour function, i.e. how it chooses actions
  • Value Function: an assessment of how good each state and/or action is
  • Model: the agent's representation of the environment (how the environment looks through the agent's eyes)

==Policy==: the agent's behaviour, i.e. which action it takes in a given state; so a policy is a map from state to action.
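In the course's notation, a policy can be deterministic or stochastic:

$$a = \pi(s), \qquad \pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$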

==Value Function==: a prediction of future reward, used to evaluate how good a state is.

In its definition (below), $\gamma$ is the discount factor, which controls how much future rewards matter now: the further away a reward is, the less it counts. With $\gamma = 0.9$, for example, the prediction's effective horizon is only about thirty to forty steps into the future.
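The definition that this $\gamma$ comes from (the state value function, as given in the course):

$$v_\pi(s) = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right]$$

As a sanity check on the horizon claim: $0.9^{30} \approx 0.04$ and $0.9^{40} \approx 0.015$, so with $\gamma = 0.9$ rewards more than a few dozen steps away contribute almost nothing to the prediction.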

Example: the top-left plot is a state value function for an Atari game. On the game screen there is a purple mothership, which is worth far more points when shot down. When the mothership appears from the right, the predicted future reward rises, so the value function climbs; once it passes out of view, hit or not, the value function drops sharply, because only ordinary enemies remain and the expected future reward falls back to its usual level.

There is also a Breakout example: bricks nearer the top are worth more points, so at the start of a game the value function is fairly smooth; once many of the lower bricks have been cleared, the chance of breaking through to the deeper bricks rises, and the value function fluctuates much more.

==Model==: the agent's model of the environment, used to predict what the environment will do next (which state it will move to, and what reward it will give):

P predicts the next state

R predicts the next (immediate) reward
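In the course's notation, for states $s, s'$ and action $a$:

$$\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a], \qquad \mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$$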

Based on which of these three components an agent uses, RL agents are commonly categorized as follows:

  • Value Based: a value function only; the policy is implicit (e.g. act greedily with respect to the values)
  • Policy Based: a policy only, with no value function
  • Actor Critic: both a policy and a value function
  • Model Free: a policy and/or value function, but no model
  • Model Based: a policy and/or value function, plus a model

4. Problems within Reinforcement Learning

==Learning and Planning==

Two fundamental problems in sequential decision making

  • Reinforcement Learning:
    • The environment is initially unknown
    • The agent interacts with the environment
    • The agent improves its policy
  • Planning:
    • A model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy
    • a.k.a. deliberation, reasoning, introspection, pondering, thought, search

Example of reinforcement learning: the rules of the game are unknown, so the agent can only learn by playing, choosing its next action from the observed score and the game screen.

Example of planning: the game's mechanics are fully known, so the agent knows exactly what each next step will look like and can work out a complete strategy in advance (like playing a game with a walkthrough).

==Exploration and Exploitation==

Reinforcement learning is like trial-and-error learning

  • Exploration: gather more information about the environment
  • Exploitation: use the information already gathered to maximize reward

For example, when choosing a restaurant: exploration means trying a new restaurant, exploitation means going back to your usual favourite.
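One standard way to trade the two off is ε-greedy action selection (it appears later in the course; this sketch and its names are my own illustration):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit

# q_values[i] is the current estimate of how good action i is,
# e.g. the average reward observed so far at restaurant i.
q_values = [1.2, 0.4, 2.0]
print(epsilon_greedy(q_values))   # usually 2, occasionally a random choice
```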

==Prediction and Control==

  • Prediction: evaluate the future, given a policy
  • Control: optimize the future, i.e. find the best policy (formalized below)
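In terms of the value function defined earlier, prediction evaluates a fixed policy while control searches over policies:

$$\text{prediction: compute } v_\pi(s), \qquad \text{control: find } v_*(s) = \max_\pi v_\pi(s)$$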

Gridworld Example: I didn't understand this at first.

Postscript: I understand it better now. Moving onto square A teleports the agent to A' with reward +10, and moving onto square B teleports it to B' with reward +5; the prediction problem asks for the value function of a fixed (uniform random) policy on this grid, while the control problem asks for the optimal value function and policy.
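Iterative policy evaluation is only introduced in a later lecture, but as a sketch of the prediction problem here, the following assumes the standard Sutton & Barto gridworld dynamics (off-grid moves give reward −1 and leave the state unchanged; all other ordinary moves give 0) and a uniform random policy with $\gamma = 0.9$:

```python
import numpy as np

# Prediction on the 5x5 gridworld (Sutton & Barto, Fig. 3.2): uniform random
# policy, gamma = 0.9. From A=(0,1) every action jumps to A'=(4,1) with
# reward +10; from B=(0,3) every action jumps to B'=(2,3) with reward +5;
# moving off the grid gives reward -1 and leaves the state unchanged.

N, gamma = 5, 0.9
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(r, c, a):
    if (r, c) == (0, 1):
        return (4, 1), 10.0                    # square A teleports to A'
    if (r, c) == (0, 3):
        return (2, 3), 5.0                     # square B teleports to B'
    nr, nc = r + a[0], c + a[1]
    if 0 <= nr < N and 0 <= nc < N:
        return (nr, nc), 0.0
    return (r, c), -1.0                        # bumped into the wall

v = np.zeros((N, N))
for _ in range(1000):                          # Bellman expectation backup
    new_v = np.zeros_like(v)
    for r in range(N):
        for c in range(N):
            for a in actions:                  # each action has probability 1/4
                (nr, nc), reward = step(r, c, a)
                new_v[r, c] += 0.25 * (reward + gamma * v[nr, nc])
    v = new_v
print(np.round(v, 1))                          # v(A) comes out around 8.8
```

Note that $v(A) \approx 8.8 < 10$: after teleporting to A' the agent tends to wander into the bottom wall and collect −1 penalties, and discounting carries that back to A.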