Decision Making | An Overview and Comparison of Artificial Intelligence, Machine Learning, and Reinforcement Learning

2020-10-03

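A while ago Fan asked me what the difference is between reinforcement learning and machine learning. I thought I knew, but by the end of our discussion I realized I was quite confused about both concepts. I looked things up at the time, but the Chinese-language material I found was vague and unsystematic; most articles were copied and pasted by authors who had not tried to understand the subject in depth.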

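Recently I have been reading background material on making decisions under uncertainty. I came across a review, followed it to the author's lab homepage, and found an article there discussing AI, machine learning, and reinforcement learning. I studied it carefully and wrote this post for future reference.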

I. Artificial Intelligence

Broadly speaking, there are three approaches to making computers behave intelligently:

Rule-based logic
Making estimates using data from the environment
Making decisions that interact with the environment

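1) Rule-based logic

Starting from a set of premises, rules are applied through logical inference to reach a conclusion. This resembles chaining many "if…else" statements to test conditions and choose among outcomes.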
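As a minimal sketch of this idea, the function below encodes a tiny rule base as chained conditions. The thermostat setting, the thresholds, and the name `thermostat_action` are hypothetical and chosen only for illustration.

```python
def thermostat_action(temperature_c: float, occupied: bool) -> str:
    """Hypothetical rule base: premises go in, a conclusion comes out."""
    if not occupied:
        return "off"      # rule 1: nobody home, do nothing
    if temperature_c < 18.0:
        return "heat"     # rule 2: too cold
    elif temperature_c > 26.0:
        return "cool"     # rule 3: too warm
    else:
        return "hold"     # default rule: comfortable range


print(thermostat_action(15.0, True))  # prints "heat"
```

Nothing here is learned from data: every behavior is fixed by a hand-written rule, which is what separates this approach from the two that follow.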

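2) Making estimates using data from the environment

This problem is usually handled with machine learning. Machine learning is closely tied to statistics and is also called statistical learning or, more broadly, data science. It is commonly divided into two categories: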

supervised learning

Train a machine learning model on labeled data.

unsupervised learning

Find structure in unlabeled data, most commonly by clustering.
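The contrast between the two categories can be sketched with one-dimensional toy data. The data, the nearest-centroid classifier, and the two-cluster routine below are illustrative assumptions, not methods taken from the source article.

```python
from statistics import mean


def fit_centroids(xs, ys):
    """Supervised: the labels ys say which group each x belongs to."""
    labels = sorted(set(ys))
    return {lab: mean([x for x, y in zip(xs, ys) if y == lab]) for lab in labels}


def predict(centroids, x):
    """Assign x to the label whose centroid is closest."""
    return min(centroids, key=lambda lab: abs(x - centroids[lab]))


def two_means(xs, iters=20):
    """Unsupervised: no labels; split 1-D points into two clusters."""
    c0, c1 = min(xs), max(xs)  # start from the two extreme points
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = mean(g0), mean(g1)
    return c0, c1


xs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
ys = ["low", "low", "low", "high", "high", "high"]

centroids = fit_centroids(xs, ys)   # uses the labels
print(predict(centroids, 4.0))      # prints "high"
print(two_means(xs))                # cluster centers found without labels
```

Both routines end up with roughly the same two group centers, but only the supervised one can attach the names "low" and "high" to them.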

3) Making decisions that interact with the environment

In short, methods for making decisions fall into the following categories:

Rule-based logic
Deterministic optimization
Reinforcement learning
Stochastic optimization

II. Making estimates using data from the environment

Estimates come in three forms, distinguished by their purpose:

classification: determine which category the input belongs to.
inference: infer properties of the input.
prediction: make a forecast based on the input.

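All three tasks share a common framework: given an input $x$, a statistical model $f(\cdot|\theta)$ produces the corresponding output $f(x|\theta)$. To obtain a model that performs well, we train it by minimizing the discrepancy between the model's output and the observed outcome.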
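A minimal sketch of this training loop, assuming a linear model $f(x|\theta)=\theta_0+\theta_1 x$, mean squared error as the discrepancy, and plain gradient descent. The data, learning rate, and step count are arbitrary illustrative choices.

```python
def f(x, theta):
    """Parametric model f(x | theta) = theta[0] + theta[1] * x."""
    return theta[0] + theta[1] * x


def train(xs, ys, lr=0.05, steps=2000):
    """Minimize the mean squared error between f(x | theta) and y."""
    theta = [0.0, 0.0]
    n = len(xs)
    for _ in range(steps):
        errs = [f(x, theta) - y for x, y in zip(xs, ys)]
        g0 = 2.0 * sum(errs) / n                              # d(MSE)/d(theta0)
        g1 = 2.0 * sum(e * x for e, x in zip(errs, xs)) / n   # d(MSE)/d(theta1)
        theta = [theta[0] - lr * g0, theta[1] - lr * g1]
    return theta


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 1 + 2x
theta = train(xs, ys)
print(theta)                      # close to [1.0, 2.0]
```

Classification, inference, and prediction differ in what $y$ means, but the loop of "compute output, measure discrepancy, adjust $\theta$" stays the same.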

Statistical models can be divided into three classes:

Lookup table
Parametric models
Nonparametric/locally parametric
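One way to read these three classes is sketched below on the same toy data. The concrete choices (a per-value average for the lookup table, a line through the origin for the parametric model, a k-nearest-neighbor average for the nonparametric one) are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

data = [(1, 2.0), (1, 2.2), (2, 3.9), (3, 6.1), (3, 5.9)]  # (x, y) pairs


def lookup_table(data):
    """Lookup table: one stored estimate per discrete input value."""
    groups = defaultdict(list)
    for x, y in data:
        groups[x].append(y)
    return {x: mean(ys) for x, ys in groups.items()}


def parametric(data):
    """Parametric: least-squares fit of f(x | theta) = theta * x."""
    theta = sum(x * y for x, y in data) / sum(x * x for x, _ in data)
    return lambda x: theta * x


def knn(data, k=2):
    """Nonparametric: average the k observations nearest to the query."""
    def estimate(x):
        nearest = sorted(data, key=lambda p: abs(p[0] - x))[:k]
        return mean([y for _, y in nearest])
    return estimate


table = lookup_table(data)
line = parametric(data)
local = knn(data)
print(table[3], line(2.5), local(2.5))
```

The lookup table can only answer for inputs it has seen, the parametric model compresses everything into one number $\theta$, and the nonparametric estimate keeps the data and answers locally.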

III. Making decisions that interact with the environment

The purposes of making decisions fall into two classes:

Decisions that change the environment.
Decisions to exchange information. These in turn fall into two types:
- Decisions to acquire information (for example, running a simulation).
- Decisions to communicate/disseminate information. These only pass information along and do not otherwise interact with the environment.

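In practice, most decisions both change the environment and exchange information, because once the environment changes, the available information almost always changes with it. Take the butterfly effect: a butterfly flaps its wings (the decision), which triggers a tornado (a change in the environment); the tornado is itself information that influences later decisions.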

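To describe decision making more precisely, we introduce notation for its key concepts. The "state", written $S_t$, contains all the information needed to make a decision. $S_t$ can include physical information $R_t$, such as a car's position and speed or the amount of water in a reservoir; other information $I_t$, such as today's weather; and beliefs $B_t$ about quantities we cannot observe directly.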

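What is the belief $B_t$? It can be thought of as a prior, for example the mean and variance of some data, or the distribution of a parameter. In a pure decision to acquire information, the information gained enriches our prior knowledge of the environment, so the only thing that changes is $B_t$. This is a learning problem whose sole aim is to learn $B_t$, and such problems are known as "multiarmed bandit" problems. Of course, in most cases the decisions made in order to learn about $B_t$ themselves change $B_t$; problems of this kind are called active learning problems.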
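A pure learning problem of this kind can be sketched as follows, assuming the belief about each arm is summarized by a running mean and an observation count, and assuming an epsilon-greedy rule for choosing which arm to try. The arm names, true means, and noise level are hypothetical.

```python
import random


def update_belief(belief, arm, observation):
    """Return the new belief after one observation from `arm`.
    belief[arm] = (estimated mean, number of observations)."""
    mean, n = belief[arm]
    n += 1
    mean += (observation - mean) / n   # running average
    new_belief = dict(belief)
    new_belief[arm] = (mean, n)
    return new_belief


def choose_arm(belief, epsilon, rng):
    """Epsilon-greedy: usually pick the best estimate, sometimes explore."""
    if rng.random() < epsilon:
        return rng.choice(sorted(belief))
    return max(belief, key=lambda a: belief[a][0])


rng = random.Random(0)
true_means = {"A": 0.2, "B": 0.5, "C": 0.8}     # hidden from the decision maker
belief = {arm: (0.0, 0) for arm in true_means}   # B_0: nothing observed yet
for t in range(2000):
    arm = choose_arm(belief, epsilon=0.1, rng=rng)   # decision
    reward = rng.gauss(true_means[arm], 1.0)          # information acquired
    belief = update_belief(belief, arm, reward)       # only B_t changes

print(belief)
```

There is no physical state here: pulling an arm does not alter the arms, it only alters what we believe about them. Because the choice of arm determines which belief gets refined, the decisions and the learning are coupled.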

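A few more concepts are needed. We use $C(S,x)$ to denote the contribution function (called the cost function when the problem is posed as a minimization). $C(S,x)$ measures how good decision $x$ is in state $S$, and this evaluation directly drives which decision is chosen: the best $x$ is the one that maximizes the contribution $C(S,x)$ (or, equivalently, minimizes the cost). We write the decision function, or policy, as $X^\pi(S)$, so that $x_t=X^\pi(S_t)$. The core of decision making is finding the best policy $X^\pi(S)$.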
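The notation can be made concrete with a small sketch. The contribution function below (revenue minus a quadratic production cost) and the rule of simply maximizing it in the current state are hypothetical; this is only one possible policy, and it ignores any effect of the decision on future states.

```python
def contribution(state, x):
    """Hypothetical C(S, x): revenue minus a quadratic production cost."""
    return state["price"] * x - x ** 2


def myopic_policy(state, decisions=range(0, 11)):
    """One possible X^pi(S): pick the feasible x with the largest C(S, x)."""
    return max(decisions, key=lambda x: contribution(state, x))


S_t = {"price": 8}
x_t = myopic_policy(S_t)            # x_t = X^pi(S_t)
print(x_t, contribution(S_t, x_t))  # prints "4 16"
```

The policy is a function from states to decisions, not a single decision: a different price yields a different $x_t$ from the same rule.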

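(1) What is reinforcement learning?

Reinforcement learning is one of several methods for making decisions, and it has attracted a great deal of attention in recent years. Reinforcement learning methods are generally formulated on "Markov decision processes". A central algorithm is Q-learning, whose update equations are: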
$$
\begin{array}{l}
\hat{q}^{n}\left(s^{n}, a^{n}\right)=r(s^n, a^n)+\max _{a^{\prime}} \bar{Q}^{n-1}\left(s^{n+1}, a^{\prime}\right) \\
\bar{Q}^{n}\left(s^{n}, a^{n}\right)=(1-\alpha) \bar{Q}^{n-1}\left(s^{n}, a^{n}\right)+\alpha \hat{q}^{n}\left(s^{n}, a^{n}\right)
\end{array}
$$
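Here $r(s^n, a^n)$ is the reward function, which evaluates how good it is to take action $a^n$ in state $s^n$. $\alpha$ is a smoothing factor that balances the previous estimate $\bar{Q}^{n-1}(s^n,a^n)$ against the new sample $\hat{q}^{n}(s^n,a^n)$.

Put simply, the idea of Q-learning is that once an action $a^n$ has been chosen, we work out its effect on the next state $s^{n+1}$ and fold that effect into our assessment of the action. Reinforcement learning evaluates decisions by computing $Q(s,a)$; the best decision is the one that maximizes $Q(s,a)$.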
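The two update equations can be sketched as tabular Q-learning on a tiny problem. The five-state chain, its rewards, the uniform exploration rule, and the parameter values are all assumptions made for illustration; the update itself follows the equations above, including the absence of a discount factor, which is workable here because every episode ends at the goal.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)


def step(s, a):
    """Hypothetical chain environment: move left or right, goal at the right end."""
    s_next = min(max(s + a, 0), GOAL)
    reward = 10.0 if s_next == GOAL else -1.0
    return s_next, reward


def q_learning(episodes=500, alpha=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = rng.choice(ACTIONS)   # explore uniformly at random
            s_next, r = step(s, a)
            # q_hat = r(s, a) + max_a' Q(s', a'); Q at the goal stays 0
            q_hat = r + max(Q[(s_next, b)] for b in ACTIONS)
            # Q <- (1 - alpha) * Q + alpha * q_hat
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * q_hat
            s = s_next
    return Q


Q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(policy)   # the learned rule: which direction to move in each state
```

Actions are chosen at random during learning, yet the table still converges toward the value of acting greedily, because the update always bootstraps from $\max_{a'}\bar{Q}$ rather than from the action actually taken next.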

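(2) Where reinforcement learning and machine learning overlap

Consider machine learning from another angle. If the goal of decision making is to imitate human behavior, then the person's state $s$ at each time step plays the role of $x$, and the action $a$ taken in that state plays the role of $y$. The machine learning model then learns how to map $x$ to $y$, that is, which action $a$ to take in state $s$.

But this is only "imitation". Decision making proper aims to maximize the performance of the decisions, and measuring and maximizing performance does not require labels $y$. Put simply, imitation can at best reproduce the behavior being imitated, so on its own it does not yield optimal decisions.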
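The difference can be sketched on a toy problem. The weather states, the demonstrations, and the contribution values below are hypothetical; the cloned policy is a simple majority vote per state, standing in for any supervised learner.

```python
from collections import Counter, defaultdict


def clone_policy(demonstrations):
    """Supervised view: state s is the input x, the observed action a is the label y."""
    votes = defaultdict(Counter)
    for s, a in demonstrations:
        votes[s][a] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}


def best_policy(states, actions, contribution):
    """Optimization view: no labels, only a performance measure to maximize."""
    return {s: max(actions, key=lambda a: contribution(s, a)) for s in states}


# A person who usually carries an umbrella even on sunny days.
demos = [("rain", "umbrella"), ("rain", "umbrella"),
         ("sun", "umbrella"), ("sun", "umbrella"), ("sun", "none")]

C = {("rain", "umbrella"): 1.0, ("rain", "none"): -1.0,
     ("sun", "umbrella"): -0.2, ("sun", "none"): 0.0}

cloned = clone_policy(demos)
optimal = best_policy(["rain", "sun"], ["umbrella", "none"], lambda s, a: C[(s, a)])
print(cloned["sun"], optimal["sun"])   # prints "umbrella none"
```

The cloned policy inherits the demonstrator's habit on sunny days, while the optimized policy never sees a label and still finds the better action.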

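(3) What is stochastic optimization?

Stochastic optimization addresses problems in which we make a decision, observe new information that was not known when the decision was made, and then choose the next decision based on the resulting state. The newly observed information changes the state and therefore the decisions that follow. Stated mathematically, the following cases arise: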

Make decision $x$, see information $W$, stop.
Make decision $x^0$, see information $W^1$, make another decision $x^1$, stop.
Sequential decisions and information: $S^{0}, x^{0}, W^{1}, S^{1}, x^{1}, W^{2}, S^{2}, \ldots, W^{N}, S^{N}, x^{N}$ where $S^n$ is the state variable, which captures everything we need to determine $x^n$. Note that state variables may include beliefs about unobservable parameters.
Infinite horizon problems, where $N \to \infty$, and where the information $W^n$ comes from a stationary distribution.

The core problem of stochastic optimization is to find a policy $\pi$ that determines each decision $x^n$ (written $X^{\pi}\left(S^{n}\right)$) so as to maximize the cumulative reward:
$$
\max_{\pi} \mathbb{E}\left(\sum_{n=0}^{N} C\left(S^{n}, X^{\pi}\left(S^{n}\right)\right) \mid S^{0}\right)
$$
or the final reward:
$$
\max _{\pi} \mathbb{E} F\left(x^{\pi, N}, \hat{W}\right)
$$
where $x^{\pi, N}$ denotes the decisions obtained after $N$ iterations under policy $\pi$.

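One point deserves attention and is analyzed in detail below. The overall optimization procedure is:

Use the policy $\pi$ to choose the decisions $x^{\pi, N}$.
Compute the cumulative reward or the final reward to evaluate how good the decisions are.
Improve the policy $\pi$.
Repeat the steps above until the cumulative reward or the final reward is maximized.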
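The four steps above can be sketched on a small inventory problem. Everything specific here is an assumption for illustration: the order-up-to rule with parameter `theta`, the prices and costs, the uniform demand, and the use of a grid search as the "improve the policy" step. Note also that the contribution in this sketch depends on the demand revealed after the decision.

```python
import random


def order_up_to(theta):
    """Hypothetical policy X^pi(S): order enough to raise inventory to theta."""
    return lambda inventory: max(theta - inventory, 0)


def cumulative_reward(policy, horizon, rng):
    """Simulate S^0, x^0, W^1, S^1, ... and sum the contributions on one sample path."""
    inventory, total = 0, 0.0
    for _ in range(horizon):
        x = policy(inventory)                # decision x^n = X^pi(S^n)
        demand = rng.randint(0, 10)          # new information W^{n+1}
        sales = min(inventory + x, demand)
        # revenue - ordering cost - holding cost on leftover stock
        total += 5.0 * sales - 2.0 * x - 0.5 * (inventory + x - sales)
        inventory = inventory + x - sales    # transition to S^{n+1}
    return total


def estimate_value(theta, n_paths=200, horizon=20):
    """Monte Carlo estimate of the expected cumulative reward under theta."""
    # Reusing the same seeds for every theta compares policies on identical demand paths.
    return sum(cumulative_reward(order_up_to(theta), horizon, random.Random(seed))
               for seed in range(n_paths)) / n_paths


# Steps 3 and 4: try each candidate policy and keep the best one.
best_theta = max(range(0, 11), key=estimate_value)
print(best_theta, estimate_value(best_theta))
```

The expectation in the objective is replaced by an average over simulated sample paths, and "optimizing $\pi$" reduces to tuning the single parameter `theta` of a fixed policy structure.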

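The point to note is this: given a policy $\pi$, how do we choose each decision $x^t$ so as to obtain the set of decisions $x^{\pi, N}$?

As we can see, the core problem of stochastic optimization is to improve the policy $\pi$ by maximizing the reward. At the heart of that problem is how the policy $\pi$ selects each decision $x^t$, which together form the set of decisions $x^{\pi, N}$ to be evaluated. This problem is called searching over policies.

Searching over policies resembles choosing a statistical model in machine learning. Specifically, machine learning adjusts the model's parameters according to a cost function, which amounts to selecting a model. Choosing a policy is more involved; the criteria used to select each policy are discussed in the next section.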

(4) The classes of policies for stochastic optimization

There are two basic strategies for designing policies, and each of them splits into two classes:

The policy search class: policies that have to be tuned over time to work well in terms of the performance metric. (My opinion: find a function that determines the decisions, such that the resulting set of decisions maximizes the performance metric.)
- Policy function approximations (PFAs)
  - map state to action
- Cost function approximations (CFAs)
  - parametrically modified optimization problems (to optimize cost function)

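Put simply, this class picks the policy that performs best overall, without explicitly considering how each individual decision affects the rest of the decision chain.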

The lookahead class: these policies work by finding the action now, given the state we are in, that maximizes the one-period reward plus an approximation of the downstream value of the state that the action takes us to (that is, they model the downstream impact on the future of a decision made now). (My opinion: find a function that, given the current state, makes each decision so that it leads to the best performance for the decisions that follow, somewhat like a greedy algorithm.)
- Policies based on value function approximations (VFAs)
  - estimate the value of being in this new state, choose the decision that makes the biggest value (like Q-learning)
- Direct lookaheads (DLAs)
  - make several actions and update them periodically

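The idea behind this class resembles a greedy approach: at each step, the search over policies looks for the decision with the most favorable impact on the decisions that follow.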
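The two strategies can be contrasted in a short sketch. The chain problem, the unit move cost, the distance-based value approximation, and the threshold rule are hypothetical stand-ins: `pfa_policy` represents the policy search class and `vfa_policy` the lookahead class.

```python
GOAL = 4
ACTIONS = (-1, +1)


def transition(s, a):
    """Hypothetical dynamics: move along a chain of states 0..GOAL."""
    return min(max(s + a, 0), GOAL)


def reward(s, a):
    """One-period reward: every move costs one unit."""
    return -1.0


def value_approx(s):
    """Hypothetical VFA: negative remaining distance to the goal."""
    return -float(GOAL - s)


def pfa_policy(s, theta=GOAL):
    """Policy search class (PFA): a direct state-to-action rule;
    theta is the parameter that would be tuned against the performance metric."""
    return +1 if s < theta else -1


def vfa_policy(s):
    """Lookahead class (VFA): maximize the one-period reward plus the
    approximate value of the state the action leads to."""
    return max(ACTIONS, key=lambda a: reward(s, a) + value_approx(transition(s, a)))


print([pfa_policy(s) for s in range(GOAL)])
print([vfa_policy(s) for s in range(GOAL)])
```

On this problem both policies move right in every state, but they get there differently: the PFA encodes the rule directly and relies on tuning `theta`, while the VFA policy derives each action from an estimate of what the next state is worth.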

IV. References

Powell, Warren B. “What is AI?” https://castlelab.princeton.edu/what-is-ai/
Powell, Warren B. “From Reinforcement Learning to Optimal Control: A unified framework for sequential decisions.” arXiv preprint arXiv:1912.03513 (2019).

Author: YA
Permalink: https://shiyuang-scu.github.io/2020/10/03/Decision-Making-人工智能、机器学习与强化学习的概述与比较/
License: Unless otherwise stated, all posts on this blog are released under the MIT License. Please credit the source when reposting.