Reinforcement Learning Part 1 - Introduction
These are notes taken while working through David Silver's Reinforcement Learning course. Since my background is limited and I studied the original English material directly, some misunderstandings are inevitable; discussion and corrections are welcome.
Course materials: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
Two reference books on Reinforcement Learning:
An Introduction to Reinforcement Learning
https://webdocs.cs.ualberta.ca/~sutton/book/the-book-1st.html
Algorithms for Reinforcement Learning
https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf
1. About Reinforcement Learning
Many Faces of Reinforcement Learning
The three branches of Machine Learning: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.
What distinguishes RL from the other two:
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non i.i.d data)
- Agent’s actions affect the subsequent data it receives
2. The Reinforcement Learning Problem
Three concepts are introduced: reward, environment, and state.
==reward==
Reward, written R_t, is a scalar (just a number) that measures how well the agent is doing at step t; the agent's goal is to maximise cumulative reward.
Reinforcement learning is based on the reward hypothesis (all goals can be described by the maximisation of expected cumulative reward).
In short, the hypothesis is that every goal can be expressed as maximising expected cumulative reward.
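In symbols, the idea is roughly the following (the return and discounting are defined formally in a later lecture; this is just a sketch of the objective):

$$\text{select actions to maximise} \quad \mathbb{E}\left[ R_{t+1} + R_{t+2} + R_{t+3} + \cdots \right]$$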
Examples: the slides list several concrete reward signals (e.g. winning or losing a game of Backgammon, increasing or decreasing the score in an Atari game).
Sequential Decision Making: the goal is to select actions that maximise total future reward; actions may have long-term consequences, reward may be delayed, and it may be better to sacrifice immediate reward to gain more long-term reward.
==environment==
The relationship between agent and environment: the agent executes an action, which affects the environment; the environment then feeds back an observation and a reward to the agent.
==state==
History: all the observations, rewards and actions up to step t. Note that it does not include A_t, because the agent chooses its action based on the observations and rewards; at the moment the action is chosen, everything before that point is the past.
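The formula from the slides:

$$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$$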
State: is the information used to determine what happens next. In other words, we use some of the information in the history to determine what happens next, and that information is called the state (formalised right after the list below).
Here, 'what happens next' has two parts:
- The agent selects actions: which action the agent will choose
- The environment selects observations/rewards: which observation/reward the environment will emit
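Formally (from the slides), the state is a function of the history:

$$S_t = f(H_t)$$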
The state can be further divided into the environment state, the agent state, and the information state.
Environment state: S_t^e is the environment's private representation. It is usually invisible to the agent, and even when it is visible it may contain irrelevant information.
Agent state: S_t^a is the agent's internal representation. It can be any function of the history.
Information state: also known as the Markov state. It has the property that 'the future is independent of the past given the present'.
The environment state and the history are Markov
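A state S_t is Markov if and only if (slide formula):

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$$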
Example (the rat experiment): each row is one trial. In the first, the light flashes twice, the rat presses the lever, the bell rings, and the rat gets an electric shock. In the second, the bell rings, the light flashes, the rat presses the lever twice, and the rat gets cheese. In the third, the rat presses the lever, the light flashes, the rat presses the lever again, and the bell rings; what will the rat get?
Analysis: if the agent state is the order of the last three events, the rat should expect a shock (the last three match trial 1). If the agent state is the counts of lights, bells and lever presses, the rat should expect cheese (the counts match trial 2). If the agent state is the entire sequence, then we cannot tell what will happen.
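A small Python sketch of this example (the event encoding and function names are my own, not from the course); it shows how the same history yields different predictions under different definitions of agent state:

```python
# Three trials from the rat example, encoded as event sequences.
history1 = ["light", "light", "lever", "bell"]   # -> electric shock
history2 = ["bell", "light", "lever", "lever"]   # -> cheese
history3 = ["lever", "light", "lever", "bell"]   # -> ?

def state_last3(history):
    """Agent state = the order of the last three events."""
    return tuple(history[-3:])

def state_counts(history):
    """Agent state = counts of lights, bells, and lever presses."""
    return (history.count("light"), history.count("bell"), history.count("lever"))

def state_full(history):
    """Agent state = the complete sequence."""
    return tuple(history)

# Compare trial 3 against the two known trials under each representation:
print(state_last3(history3) == state_last3(history1))    # True  -> predict shock
print(state_counts(history3) == state_counts(history2))  # True  -> predict cheese
print(state_full(history3) in (state_full(history1), state_full(history2)))
# False -> the full sequence matches neither trial, so we cannot say
```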
Fully Observable Environments: the agent directly observes the environment state. Formally, this setting is a Markov decision process (MDP).
agent state = environment state = information state
Partially Observable Environments: the agent indirectly observes the environment. Formally, this setting is a partially observable Markov decision process (POMDP). For example:
- A robot with camera vision isn’t told its absolute location
- A trading agent only observes current prices
- A poker playing agent only observes public cards
Here, agent state ≠ environment state: the agent must construct its own state representation, for example from the complete history.
3. Inside a Reinforcement Learning Agent
The three major components of an agent:
- Policy: the agent's behaviour function
- Value Function: evaluates how good each state and/or action is
- Model: the agent's representation of the environment (what the environment looks like in the agent's eyes)
==Policy==: the agent's strategy, i.e. which action the agent takes in a given state; a policy is a map from state to action.
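The two forms from the slides:

$$\text{Deterministic policy:} \quad a = \pi(s)$$

$$\text{Stochastic policy:} \quad \pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$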
==Value Function==: a prediction of future reward, used to evaluate how good a state is.
Here gamma is the discount factor, which controls how much future rewards matter now: the further away a reward, the smaller its weight. For example, with gamma = 0.9 the prediction effectively spans roughly the next thirty to forty steps, since 0.9^30 ≈ 0.04 and rewards beyond that contribute very little.
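The definition from the slides:

$$v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right]$$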
Example: the panel in the top-left shows the state value function. On the game screen there is a purple object, the mothership, which is worth far more points when shot down. When the mothership appears from the right, the predicted future reward increases and the value function rises. Once the mothership has passed, hit or not, the value function drops sharply, because only ordinary enemies remain and the expected future reward returns to its usual level.
There is also a Breakout example: bricks nearer the top are worth more points, so at the start of the game the value function is fairly smooth; once many of the lower bricks have been cleared, the probability of reaching deeper bricks rises and the value function fluctuates more.
==Model==: the model the agent builds of the environment, used to predict what the environment will do next (which state it will transition to and what reward it will give).
P predicts the next state
R predicts the next (immediate) reward
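In symbols (slide notation):

$$\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$$

$$\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$$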
Based on these three components, RL agents fall into the following categories: Value Based (value function, no explicit policy), Policy Based (policy, no value function), and Actor Critic (both policy and value function); orthogonally, Model Free (no model) versus Model Based (has a model).
4. Problems within Reinforcement Learning
==Learning and Planning==
Two fundamental problems in sequential decision making
- Reinforcement Learning:
- The environment is initially unknown
- The agent interacts with the environment
- The agent improves its policy
- Planning:
- A model of the environment is known
- The agent performs computations with its model (without any external interaction)
- The agent improves its policy
- a.k.a. deliberation, reasoning, introspection, pondering, thought, search
Example of Reinforcement Learning: the rules of the game are unknown; the agent can only learn by playing, using the observed score and the game screen to choose its next action.
Example of Planning: the rules of the game are fully known; the agent knows exactly what the next state will look like and can work out a complete strategy in advance (like playing a game with a walkthrough).
==Exploration and Exploitation==
Reinforcement learning is like trial-and-error learning
- Exploration: gather more information about the environment
- Exploitation: use the known information about the environment to maximise reward
For example, choosing where to eat: exploration is trying a new restaurant; exploitation is going to your usual favourite.
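A minimal Python sketch of one common way to implement this trade-off, epsilon-greedy action selection (epsilon-greedy is not introduced in this lecture; the function and variable names are my own):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action index from a list of estimated action values."""
    if random.random() < epsilon:
        # Explore: try a uniformly random action.
        return random.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Restaurant analogy: index 0 is the usual favourite; 1 and 2 are new places.
estimated_enjoyment = [8.0, 5.0, 5.0]
choice = epsilon_greedy(estimated_enjoyment, epsilon=0.2)
print(choice)  # usually 0, occasionally a new restaurant
```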
==Prediction and Control==
- Prediction: evaluate the future, i.e. estimate future reward, given a policy
- Control: optimise the future, i.e. find the best policy
Gridworld Example: I did not understand this at first.
Postscript: I have some understanding now. From cell A, any move teleports the agent to A' with a reward of +10; from cell B, any move teleports it to B' with a reward of +5.
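A minimal Python sketch of this transition rule, assuming the 5x5 layout from Sutton & Barto's gridworld figure; treat the exact cell coordinates and the -1 penalty for walking off the grid as my assumptions, since the lecture only shows the picture:

```python
# Special cells (row, column); positions assumed from Sutton & Barto.
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
SIZE = 5
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def step(state, action):
    """Return (next_state, reward) for one move on the 5x5 grid."""
    if state == A:             # any action taken in A teleports to A', reward +10
        return A_PRIME, 10.0
    if state == B:             # any action taken in B teleports to B', reward +5
        return B_PRIME, 5.0
    dr, dc = MOVES[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < SIZE and 0 <= c < SIZE):
        return state, -1.0     # off the grid: stay put, assumed reward -1
    return (r, c), 0.0         # ordinary move, reward 0

print(step(A, "south"))        # ((4, 1), 10.0)
print(step((0, 0), "north"))   # ((0, 0), -1.0)
```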