【学习笔记】人工智能规划与决策

阅读数：23835次 2022-04-26

字数统计: 3.3k字 | 阅读时长≈ 20分

6 topics

Classical AI Planning and Acting
Decision Theory
- Rational and Judgmental Decision Making
- Decision Networks
Markov Decision Process
Reinforcement Learning
POMDP
Game Theory and Multi-agent Decision Making

Classical AI Planning and Acting

Rational and Judgmental Decision Making

Uncertainty

Partially Observable
Non-deterministic

Rational Decision Making

Combining belief and desire
- belief: judgment of unknown states (uncertainty)
- desire: preference of consequences (utility)

Utility Theory

utility function which expresses the desitability of a state: $U(s)$
expeted utility of an action given evidence: $EU(a \mid e)$
Principle of maximum expected utility (MEU): agent should choose actions which can maximize the expected utility.
Six Axioms

Decision Networks

Basian Network $\rightarrow$ DAG

Decision Network, A.K.A. , Influence Diagrams

Chance Node
Decision Node
Utility Node

Value of Information

$V P I_{\mathbf{e}}\left(E_{j}\right)=\left(\sum_{k} P\left(E_{j}=e_{j k} \mid \mathbf{e}\right) E U\left(\alpha_{e_{j k}} \mid \mathbf{e}, E_{j}=e_{j k}\right)-E U(\alpha \mid \mathbf{e})\right).$

Markov Decision Process

MDP = Transition Function + Reward Function
Expeted utility executing policy $\pi$ starting in state $s$ : $U^{\pi}(s)=E\left[\sum_{t=0}^{\infty} \gamma^{t} R\left(s_{t}\right)\right]$
Value function: $U^{\pi^{\}}(s) $. Expeted utility executing optimal policy$ \pi^{\} $starting in state$ s$.
Optimal policy starting in state $s$ : $\pi_{s}^{\}=\underset{\pi}{\operatorname{argmax}} U^{\pi}(s)$. The optimal policy is independent of the starting state, so optimal policy can be simply denoted as $\pi^{\}$. (see AIMA-3rd P651)
Utility function (Value function) of state s : Uπ\*(s) can be written as U(s).
- Utility function U(s) vs. reward function R(s)
  - $U(s)$ is long-term total reward from $s$ onward.
  - $R(s)$ is short-term reward of being $s$ .
After finding Utility function $U(s)$ , we can easily get the optimal policy — choose the action that maximizes the expected utility of the subsequent state:

$\pi^{\*}(s)=\underset{a \in A(s)}{\operatorname{argmax}} \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) U\left(s^{\prime}\right)$

Utility Function: the key to find the optimal policy.
Two methods to find optimal policy (get the utility function):
- Value Iteration (VI)
  1. calculate the utility of each state
  2. use the state utilities to select an optimal action in each state.
- Policy Iteration (PI)
  1. policy evaluation
  2. Policy improvement

Value Iteration

Bellman Equation

$U(s)=R(s)+\gamma \max_{a \in A(s)} \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) U\left(s^{\prime}\right) .$

Utility of a state $U(s)$ :

immediate reward for that state $R(s)$
expected utility of the next state $\sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) U\left(s^{\prime}\right)$ , assuming that the agent chooses the optimal action $a$ .

Note: the exact utilities of states are $U^{\pi}(s)=E\left[\sum_{t=0}^{\infty} \gamma^{t} R\left(s_{t}\right)\right]$ , they are exactly the unique solutions to the Bellman Equation.

Policy Iteration

Intuitive idea

If one action is clearly better than all others, then the exact utilities on the states need not be precise.

Two steps

Policy Evaluation:
- calculate the utility of executing policy $\pi_{i}$ starting in state $s$

$U_{i}(s)=R(s)+\gamma \sum_{s^{\prime}} P\left(s^{\prime} \mid s, \pi_{i}(s)\right) U_{i}\left(s^{\prime}\right)$

Policy Improvement:
- if the current action $a$ is not the optimal action:
  $\max_{a \in A(s)} \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) U\left[s^{\prime}\right]>\sum_{s^{\prime}} P\left(s^{\prime} \mid s, \pi[s]\right) U\left[s^{\prime}\right]$
- then update the policy and the utility function

Improvement Version of PI

Modified Policy Iteration
Asynchronous Policy Iteration
General Policy Iteration

Reinforcement Learning

Definition

lecture 7, P8-9

reply on MDP
- optimal policy = policy that maximize the expected total reward
use observed reward to learn the optimal policy
unknow transition function and reward function

Structure

Three agent designs

Model-based
- A utility-based agent, learns a Model $P$ and utility function $U$ on states, and uses it to select actions that maximize the expected outcome utility.
Model-free
- A Q-learning agent, learns an action-utility function $Q$ , or Q-function, giving the expected utility of taking a given action in a given state.
Reflex
- A reflex agent learns a policy $\pi$ that maps directly from states to actions.

Learning Strategy

Learn Utility Function U
- Direct Utility Estimation
- Passive Adaptive Dynamic Programming
- Passive Temporal-Difference Learning
Learn action-utility function Q (Q-function)
- Active Temporal-Difference Learning
Learn a policy π that maps directly from states to actions
- Policy Search

Known or Unknown Policy

Known Policy (Passive Learning)
- Direct Utility Estimation
- Passive Adaptive Dynamic Programming
- Passive Temporal-Difference Learning
Unknown Policy (Acitve Learning)
- Active Adaptive Dynamic Programming
- Q-Learning
  - Active Temporal-Difference Learning

Passive Learning

Known: policy $\pi$
Unknown: transition function $P\left(s^{\prime} \mid s, a\right)$ and reward function $R(s)$ .
Goal: evaluate how good the policy is, i.e. learn the utility function $U$ .

Note: The passive learning task is similar to the policy evaluation task, part of the policy iteration algorithm

Direct Utility Estimation (Monte Carlo Learning, MC)

The exact utility of state $s$ is

$U^{\pi}(s)=E\left[\sum_{t=0}^{\infty} \gamma^{t} R\left(s_{t}\right)\right]$

The core idea of MC
- The utility of a state is the expected total reward from that state onward (called the expected reward-to-go), and each trial provides a sample of this quantity for each state visited.
- Run infinite trails, the he expected total reward from that state onward will converge to the true utility.

Pros and Cons

Simple and Efficient: reduce RL problems to inducive learning problems.
Slow: need to wait until the end of episodes.
Converges very slowly: variance can be high as we accumulate the variance of every trial.
- Ignore the fact that utility of state is not independent, instead it is related to the successor states.
  - By ignoring the connections between states, direct utility estimation misses opportunities for learning. More broadly, we can view direct utility estimation as searching for $U$ in a hypothesis space that is much larger than it needs to be, in that it includes many functions that violate the Bellman equations. For this reason, the algorithm often converges very slowly.

Therefore, use Bellman equation is a better choice.

Adaptive Dynamic Programming (ADP)

Simplified Bellman Equation (as shown in Policy Iteration):

$U^{\pi}(s)=R(s)+\gamma \sum_{s^{\prime}} P\left(s^{\prime} \mid s, \pi(s)\right) U^{\pi}\left(s^{\prime}\right)$
Need to learn transition function

$P(s’ \mid s, \pi(s))$

Estimate the transition function P(t∣s,a) based on the trails
- $P(t \mid s, a) \leftarrow N_{s^{\prime} \mid s a}[t, s, a] / N_{s a}[s, a]$
Learn utility function U(s) by policy evaluation
- $U_{}(s)=R(s)+\gamma \sum_{s^{\prime}} P\left(s^{\prime} \mid s, \pi_{}(s)\right) U_{}\left(s^{\prime}\right)$

Temporal-Difference Learning (TD)

Another way to utilize Bellman equation (i.e. relation to the successor states)
- use the observed transitions to adjust the utilities of the observed states so that they agree with the constraint equations.
  - $U^{\pi}(s) \leftarrow U^{\pi}(s)+\alpha\left(R(s)+\gamma U^{\pi}\left(s^{\prime}\right)-U^{\pi}(s)\right)$
  - where $\alpha$ is the learning rate, it will converge if $\alpha$ decreases appropriately with the number of times the state has visited.
Similar to gradient descent.
TD vs. ADP
- ADP needs to estimate transition model while TD doesn’t need transition model.
  - The environment supplies the connection between neighboring states in the form of observed transitions.
- Both try to make local adjustments to the utility estimates in order to make each state “agree” with its successors (Bellman equation). But they still have differences:
  - TD adjusts a state to agree with its observed successor.
  - ADP adjusts the state to agree with all of the successors that might occur.
  - TD makes a single adjustment per observed transition
  - ADP makes as many as it needs to restore consistency between the utility estimates $U$ and the environment model $P$ .
- Thus, TD can be viewed as a crude but efficient first approximation to ADP.
TD (ADP) vs. MC
- TD (ADP) can perform online learning every step while MC need to wait until the end of episode.
- TD depends only on one measured reward while MC considers all rewards.
  - TD target has lower variance, but is biased, while MC target is unbiased but has higher variance.
- TD (ADP) usually converges faster than MC in practice.
ADP vs. TD (MC)
- ADP learns the model (transition function) and then solves for the value while TD and MC don’t need model.
- ADP is more data efficient
  - ADP it requires less data from the real-world while TD and MC require more data for learning.
- TD is more computationally efficient
  - TD doesn’t need to compute expectations but ADP needs to do so.

Active Learning

the agent will need to learn a complete model with outcome probabilities for all actions, rather than just the model for the fixed policy.
we need to take into account the fact that the agent has a choice of actions.

Adaptive Dynamic Programming

Note: policy $\pi$ is no longer fixed; changes as transitions & rewards are learned

Algorithm:

Repeat untill converges
- Estimate the transition function P(t∣s,a) based on the current optimal policy
  - $P(t \mid s, a) \leftarrow N_{s^{\prime} \mid s a}[t, s, a] / N_{s a}[s, a]$
- Perform Policy Iteration or Value Iteration
  - Learn utilities defined by the optimal policy
  - choose optimal policy based on utilities

The ADP agen is greedy, so need to trade off

Exploitation: maximize value as reflected by current estimate
Exploration: learn more about the model to potentially improve long term gain

How to balance Exploitation and Exploration:

Greedy in the Limit of Infinite Exploration, or GLIE:
- Try each action in each state an unbounded number of times to avoid a finite probability of missing an optimal action.
- Eventually become greedy so that it is optimal w.r.t the true model.

GLIE

ϵ-greedy exploration
- Choose greedy action with probability (1 − $\epsilon$ ); random action with probability $\epsilon$ .
optimistic estimate of the utility

Q-Learning

Model-free algorithm
- take action by refering to Q-table
Find optimal policy by maximizing the expected value of toal reward
How to update the Q-table?
- bellman equation (TD update)

$bellman-equation$

Function Approximation

Policy Search

POMDP

Basics

Unknown state but can receive the sensor information for state estimation.

Belief State: b. The probability distribution over states.
- b(s) the probability of being state s. s
  - For example, in a 2*2 drid world, the initial belief state is ${1/4, 1/4, 1/4, 1/4}$ because the agent don’t know anything about the world.
The agent can learn something about its actual state by sensing the environment:
- Sensor Model: $P(e \mid s)$ . Probability of sensing evidence $e$ when in state $s$ .
POMDPs can be formulated by 3 components:
- Transition Model: $P(s’ \mid s, a)$
- Reward Function: $R(s)$
- Sensor Model: $P(e \mid s)$
Belief Update (Filtering):
- $b^{\prime}\left(s^{\prime}\right)=\alpha P\left(e \mid s^{\prime}\right) \sum_{s} P\left(s^{\prime} \mid s, a\right) b(s)$
- First term: sensor model, probability of sensing evedence $e$ in state $s$
- Second term: sum over all the states that can take to s’ after performing $a$ .
- Ofen written as $b’ = FORWARD(b,a,e)$ .
Optimal Policy: $\pi^*(b)$

How to find the optimal policy?

POMDP VI
Dynamic Decision Network

POMDP to MDP

MDP consists of 2 components

Transition Function: $P(s’ \mid s, a)$
Reward Function: $R(s)$

Now we can convert POMDP problem into MDP problem in belief space.

Transition Function: $P(b’ \mid b,a)$
- $\begin{aligned} P\left(b^{\prime} \mid b, a\right) &=\sum_{e} P\left(b^{\prime} \mid e, a, b\right) P(e \mid a, b) \ &=\sum_{e} P\left(b^{\prime} \mid e, a, b\right) \sum_{s^{\prime}} P\left(e \mid s^{\prime}\right) \sum_{s} P\left(s^{\prime} \mid s, a\right) b(s) \end{aligned}$
- where P(e∣a,b) is the percept probability
  - $\begin{aligned} P(e \mid a, b) &=\sum_{s^{\prime}} P\left(e \mid a, s^{\prime}, b\right) P\left(s^{\prime} \mid a, b\right) \ &=\sum_{s^{\prime}} P\left(e \mid s^{\prime}\right) P\left(s^{\prime} \mid a, b\right) \ &=\sum_{s^{\prime}} P\left(e \mid s^{\prime}\right) \sum_{s} P\left(s^{\prime} \mid s, a\right) b(s) \end{aligned}$
  - where term $P(e \mid s’)$ is sensor model, term $P(s’ \mid s,a)$ is transition model.
Reward Function: $\rho(b)$
- $\rho(b)=\sum_{s} b(s) R(s)$

POMDPs in physical space is transfered to MDPs in belief state space.

Belief State Space: space of probability distributions over original states, i.e. space of belief state b
- For example, the initial belief state $b$ in a 2*2 drid world is ${1/4, 1/4, 1/4, 1/4}$ , where it is a point in belief state space.
- Thue, belief state space is continuous and multi-dimensional (e.g., 4 dimensions in a 2*2 grid world). The original VI or PI in MDP cannot work in a continuous space.

POMDP Value Iteration

Ideas:

Original VI (in discrete state space): Calculate utility value of each physical state.
POMDP VI (in continuous state sapce, i.e. belief state space): Calculate utility value of conditional plan.

Concepts:

$p$ : conditional plan, a policy at a belief $b$ .
$\alpha_p(s)$ : The utility of executing a fixed conditional plan $p$ starting in physical state $s$ .
$\sum_sb(s)\alpha_p(s)$ : The expected utility of executing $p$ in belief state $b$ , which is linearly to belief state $b$ .

Value Iteration in belief state space:

$\alpha_{p}(s)=R(s)+\gamma \left(\sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) \sum_{e} P\left(e \mid s^{\prime}\right) \alpha_{p . e}\left(s^{\prime}\right){ }^{\prime}\right).$
Problem

Exact POMPDP solvers only work for very small problems
Finding optimal policy for general POMDP is PSPACE-hard
Approximate solvers can scale to moderate sized problems

Online search tends to scale better.

Dynamic Decision Network

Execution of POMDP over time can be represented as a DDN
- The transition and sensor models are represented by a dynamic Bayesian network (DBN), as described in Chapter 15.
- The dynamic Bayesian network is extended with decision and utility nodes, as used in decision networks in Chapter 16. The resulting model is called a dynamic decision network, or DDN.
- A filtering algorithm is used to incorporate each new percept and action and to update the belief state representation.
- Decisions are made by projecting forward possible action sequences and choosing the best one.

Game Theory

Single-Move Game

Conponents
- Agents
- Actions
- Payoff Function
  - can be represented as a matrix
Game Strategies
- pure strategy: a deterministic policy
- mixed strategy: a randomized policy that selects actions according to a probability distribution
- game profile
  - game solutions

E.g. Prisoners’ Dilemma

	Alice: testify	Alice: refuse
Bob: testify	B=-5;A=-5	B=0;A=-10
Bob: refuse	B=-10;A=0	B=-1;A=-1

Alice finds that “testify” is a dominant strategy for the game, so she eliminates “refuse”, the same for Bob. They finally reach an equilibrium: both choose to “testify”.

Domination
- strictly dominate
  - for one player, if the outcom of adopting strategy s is always higher than the outcome of every choice of strategies for other players, than strategy s strictly dominate other strategies.
    - E.g. In prisoners’ dilemma, for Alice, Alice is always better than Bob when adopts “testify”., i.e. in (B: testify, A: testify), A=-5, b=-5, where -5>=-5; in (B: refuse, A: testify), A=0, B=-10, where 0>=-10.
- weekly dominate
Iterated elimination of strictly dominated strategies
Nash Equilibrium
- A strategy profile forms an equilibrium if no player can benefit from switching strategies, given that every other player sticks to the same strategies
  - Note: an equilibrium is a local optimum
- dominant strategy equilibrium
  - When each player has a dominant strategy, combination is called dominant strategy equilibrium

Although both choose “testify” and reach an equilibrium, the equilibrium is a local optimum, because they can both choose to “refuse”.

Pareto optimal
- An outcome is Pareto optimal if there is no other outcomes that all players would prefer
- Pareto dominate
  - An outcome is Pareto dominated by another outcome if all players would prefer the other outcome

(B: testify, A: testify) is the Nash equilibrium while (B: refuse, A: refuse) is Pareto dominating.

Another Example:

	Acme: bluray	Acme: dvd
Best: bluray	B=+9;A=+9	B=-1;A=-4
Best: dvd	B=-1;A=-3	B=+5;A=+5

We can find two Nash equilibria: (B: bluray, A: bluray) and (B: dvd, A: dvd). We can also find the Pareto Optimum: (B: bluray, A: bluray)). To choose the best strategy, the agent can either guess or communicate

Coordination Games
- Games where agents need to communicate like the example.

Another Example: two-finger Morra

	O: one	O: two
E: one	E=+2,O=-2	E=-3,O=+3
E: two	E=-3,O=+3	E=+4,O=-4

Both E and O cannot reach Nash equilibrium because they always want to swicth a strategy in a bad situation, i.e. O will ask for switch when in (E: one, O: one) or (E: two, O: two), the same for E. It’s a zero-sum game.

Thus, there is no pure-strategy Nash equilibria, which requires for mixed strategy

A simple algorithm to find pure-strategy Nash equilibrium:
1. For each column, mark cell if it has the maximum payoff for the row player
  - May be more than one
2. For each row, mark cell if it has the maximum payoff for the column player
  - May be more than one
3. Cells with both row and column marks are pure Nash equilibrium
Zero-Sum Games
- Games in which the sum of the payoffs is always zero (or a constant), in other word, the agents cannot get benefit at the same time.
- Minimax Algorithm
  - Minimax Game Tree
  - mixed strategy (with probability)
  - maximin equilibrium
- See example in Lecture12 slide No. 5-6

Multiple-Move Game (simpliest ver.: Repeated Game)

The players face the same choices repeatedly
- In other words, each time, history of all players’ previous choices is available
Two strategies
- perpetual punishment
  - suppose that after each round there is a 99% chance that the players will meet again. Then the expected number of rounds is still 100, but neither player knows for sure which round will be the last.
    - If always choose “refuse”:
    - $\sum_{t=0}^{\infty} 0.99^{t} \cdot(-1)=-100$
    - If one round choose “testify”:
    - $0+\sum_{t=1}^{\infty} 0.99^{t} \cdot(-5)=-495$
- tit-for-tat
  - calls for starting with “refuse” and then echoing the other player’s previous move on all subsequent moves.
  - So Alice would refuse as long as Bob refuses and would testify the move after Bob testified, but would go back to refusing if Bob did. Although very simple, this strategy has proven to be highly robust and effective against a wide variety of strategies.

本文作者： YA
本文链接： http://www.yuuuuang.com/2022/04/26/【学习笔记】人工智能规划与决策/
版权声明： 本博客所有文章除特别声明外，均采用 MIT 许可协议。转载请注明出处！