6 topics

- Classical AI Planning and Acting
- Decision Theory
  - Rational and Judgmental Decision Making
  - Decision Networks
- Markov Decision Process
- Reinforcement Learning
- POMDP
- Game Theory and Multi-agent Decision Making

## Classical AI Planning and Acting

## Rational and Judgmental Decision Making

Uncertainty

- Partially Observable
- Non-deterministic

Rational Decision Making

- Combining *belief* and *desire*
  - belief: judgment of unknown states (uncertainty)
  - desire: preference of consequences (utility)

Utility Theory

- *utility function*, which expresses the desirability of a state: $U(s)$
- *expected utility* of an action given evidence: $EU(a \mid e)$
- *Principle of maximum expected utility (MEU)*: the agent should choose the action that maximizes the expected utility.
- Six Axioms

## Decision Networks

Bayesian Network $\rightarrow$ DAG

Decision Network, a.k.a. Influence Diagram

- Chance Node
- Decision Node
- Utility Node

Value of Information

$$
\mathrm{VPI}_{\mathbf{e}}\left(E_{j}\right)=\left(\sum_{k} P\left(E_{j}=e_{jk} \mid \mathbf{e}\right) EU\left(\alpha_{e_{jk}} \mid \mathbf{e}, E_{j}=e_{jk}\right)\right)-EU(\alpha \mid \mathbf{e})
$$

## Markov Decision Process

- MDP = Transition Function + Reward Function
- Expected utility of executing policy $\pi$ starting in state $s$: $U^{\pi}(s)=E\left[\sum_{t=0}^{\infty} \gamma^{t} R\left(s_{t}\right)\right]$
- Value function: $U^{\pi^{*}}(s)$, the expected utility of executing the optimal policy $\pi^{*}$ starting in state $s$.
- Optimal policy starting in state $s$: $\pi_{s}^{*}=\underset{\pi}{\operatorname{argmax}}\, U^{\pi}(s)$. The optimal policy is *independent* of the starting state, so it can simply be denoted $\pi^{*}$. (see AIMA-3rd P651)
- Utility function (value function) of state $s$: $U^{\pi^{*}}(s)$ can be written as $U(s)$.
- Utility function $U(s)$ vs. reward function $R(s)$
  - $U(s)$ is the *long-term* total reward from $s$ onward.
  - $R(s)$ is the *short-term* reward of being in $s$.
- After finding the utility function $U(s)$, we can easily get the optimal policy: choose the action that maximizes the expected utility of the subsequent state:

$$

\pi^{*}(s)=\underset{a \in A(s)}{\operatorname{argmax}} \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) U\left(s^{\prime}\right)

$$

Utility Function: the *key* to finding the optimal policy.

Two methods to find the *optimal policy* (i.e. get the *utility function*):

Value Iteration (VI)

- calculate the utility of each state
- use the state utilities to select an optimal action in each state

Policy Iteration (PI)

- policy evaluation
- policy improvement

### Value Iteration

- Bellman Equation

$$

U(s)=R(s)+\gamma \max_{a \in A(s)} \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) U\left(s^{\prime}\right)

$$

Utility of a state $U(s)$:

- *immediate reward* for that state, $R(s)$
- *expected utility of the next state*, $\sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) U\left(s^{\prime}\right)$, assuming that the agent chooses the *optimal action* $a$.

**Note**: the exact utilities of the states are $U^{\pi}(s)=E\left[\sum_{t=0}^{\infty} \gamma^{t} R\left(s_{t}\right)\right]$; they are exactly the *unique* solutions to the Bellman Equation.
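
The Bellman update above can be sketched in a few lines of NumPy. This is only an illustration: the function name, the array layout `P[a][s][s']`, and the tiny 2-state, 2-action MDP in the usage note are my own assumptions, not from the lecture.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Iterate the Bellman update until the utilities stop changing.

    P[a] is the transition matrix of action a (P[a][s][s'] = P(s'|s,a)),
    R[s] is the reward of state s, gamma is the discount factor.
    """
    U = np.zeros(len(R))
    while True:
        # U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')
        U_new = R + gamma * np.max([P[a] @ U for a in range(len(P))], axis=0)
        if np.max(np.abs(U_new - U)) < eps:
            return U_new
        U = U_new
```

For a hypothetical 2-state world where action 0 stays put and action 1 swaps states, `value_iteration([np.eye(2), np.array([[0., 1.], [1., 0.]])], np.array([0., 1.]), gamma=0.5)` converges to $U = [1, 2]$.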

### Policy Iteration

Intuitive idea

- If one action is clearly better than all others, then the *exact utilities* of the states need not be precise.

Two steps

- Policy Evaluation: calculate the utility of executing policy $\pi_{i}$ starting in state $s$

$$

U_{i}(s)=R(s)+\gamma \sum_{s^{\prime}} P\left(s^{\prime} \mid s, \pi_{i}(s)\right) U_{i}\left(s^{\prime}\right)

$$

- Policy Improvement: if the current action $\pi_i(s)$ is not the optimal action, i.e.

$$
\max_{a \in A(s)} \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) U\left[s^{\prime}\right]>\sum_{s^{\prime}} P\left(s^{\prime} \mid s, \pi_i[s]\right) U\left[s^{\prime}\right]
$$

then update the *policy* and the *utility function*.
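
The two steps can be sketched as follows. A sketch under my own naming conventions (`policy_iteration`, layout `P[a][s][s']`); it uses exact policy evaluation by solving the linear system $U = R + \gamma P_{\pi} U$, one of several valid evaluation methods.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate exact policy evaluation and greedy improvement until stable."""
    n = len(R)
    pi = np.zeros(n, dtype=int)  # start from an arbitrary policy
    while True:
        # Policy evaluation: solve U = R + gamma * P_pi U as a linear system
        P_pi = np.array([P[pi[s]][s] for s in range(n)])
        U = np.linalg.solve(np.eye(n) - gamma * P_pi, R)
        # Policy improvement: act greedily w.r.t. the evaluated utilities
        pi_new = np.argmax([P[a] @ U for a in range(len(P))], axis=0)
        if np.array_equal(pi_new, pi):
            return pi, U
        pi = pi_new
```

On the same hypothetical 2-state MDP as in the value-iteration sketch (action 0 stays, action 1 swaps, rewards $[0, 1]$, $\gamma = 0.5$), this returns the policy "swap in state 0, stay in state 1" with utilities $[1, 2]$.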

Improved Versions of PI

- Modified Policy Iteration
- Asynchronous Policy Iteration
- General Policy Iteration

## Reinforcement Learning

### Definition

*lecture 7, P8-9*

Rely on MDP

- optimal policy = policy that maximizes the expected total reward

Use observed rewards to learn the optimal policy

- unknown transition function and reward function

### Structure

#### Three agent designs

Model-based

- A **utility-based agent** learns a model $P$ and a **utility function** $U$ on states, and uses them to select actions that maximize the expected outcome utility.

Model-free

- A **Q-learning agent** learns an **action-utility function** $Q$, or **Q-function**, giving the expected utility of taking a given action in a given state.

Reflex

- A **reflex agent** learns a policy $\pi$ that maps directly from states to actions.

#### Learning Strategy

Learn utility function $U$

- Direct Utility Estimation
- Passive Adaptive Dynamic Programming
- Passive Temporal-Difference Learning

Learn action-utility function $Q$ (Q-function)

- Active Temporal-Difference Learning

Learn a policy $\pi$ that maps directly from states to actions

- Policy Search

#### Known or Unknown Policy

Known Policy (Passive Learning)

- Direct Utility Estimation
- Passive Adaptive Dynamic Programming
- Passive Temporal-Difference Learning

Unknown Policy (Active Learning)

- Active Adaptive Dynamic Programming
- Q-Learning
- Active Temporal-Difference Learning

### Passive Learning

**Known:** policy $\pi$. **Unknown:** transition function $P\left(s^{\prime} \mid s, a\right)$ and reward function $R(s)$. **Goal:** evaluate how good the policy is, i.e. learn the utility function $U$.

*Note: The passive learning task is similar to the policy evaluation task, part of the policy iteration algorithm*

#### Direct Utility Estimation (Monte Carlo Learning, MC)

- The exact utility of state $s$ is

$$

U^{\pi}(s)=E\left[\sum_{t=0}^{\infty} \gamma^{t} R\left(s_{t}\right)\right]

$$

- The *core idea* of MC
  - The utility of a state is the expected total reward from that state onward (called the expected **reward-to-go**), and each trial provides a *sample* of this quantity for each state visited.
  - Run *infinite* trials, and the sampled reward-to-go from each state will converge to the true utility.

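Computing the reward-to-go samples from one trial can be sketched as follows (the function name and the backward-accumulation style are my own; the lecture does not prescribe an implementation):

```python
def reward_to_go(rewards, gamma=0.9):
    """Discounted reward-to-go at every step of one trial.

    Each returned value is one Monte Carlo sample of the utility of the
    state visited at that step.
    """
    G, returns = 0.0, []
    for r in reversed(rewards):      # accumulate from the end of the trial
        G = r + gamma * G            # G_t = r_t + gamma * G_{t+1}
        returns.append(G)
    return returns[::-1]             # restore chronological order
```

For example, `reward_to_go([1.0, 1.0], gamma=0.5)` yields `[1.5, 1.0]`: the last state's sample is its own reward, the earlier state adds the discounted tail.
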
Pros and Cons

- Simple and efficient: reduces RL problems to inductive learning problems.
- Slow: needs to wait until the end of episodes.
- Converges very slowly: variance can be high as we accumulate the variance of every trial.
- Ignores the fact that the utilities of states are not independent; each is related to the utilities of its successor states.
  - By ignoring the connections between states, direct utility estimation misses opportunities for learning. More broadly, we can view direct utility estimation as searching for $U$ in a hypothesis space that is much larger than it needs to be, in that it includes many functions that violate the Bellman equations. For this reason, the algorithm often converges very slowly.
Therefore, using the Bellman equation is a better choice.

#### Adaptive Dynamic Programming (ADP)

Simplified Bellman Equation (as shown in Policy Iteration):

$$

U^{\pi}(s)=R(s)+\gamma \sum_{s^{\prime}} P\left(s^{\prime} \mid s, \pi(s)\right) U^{\pi}\left(s^{\prime}\right)

$$

Need to learn the transition function $P(s^{\prime} \mid s, \pi(s))$:

- Estimate the transition function $P(t \mid s, a)$ from the trials
  - $P(t \mid s, a) \leftarrow N_{s^{\prime} \mid s a}[t, s, a] / N_{s a}[s, a]$
- Learn the utility function $U(s)$ by policy evaluation
  - $U(s)=R(s)+\gamma \sum_{s^{\prime}} P\left(s^{\prime} \mid s, \pi(s)\right) U\left(s^{\prime}\right)$
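
The count-based model estimate $N[t, s, a] / N[s, a]$ can be sketched as a small class (the class and method names are hypothetical; `prob` assumes the $(s, a)$ pair has been observed at least once):

```python
from collections import defaultdict

class TransitionEstimator:
    """Maximum-likelihood transition model for ADP:
    P(t | s, a) ~ N[t, s, a] / N[s, a]."""

    def __init__(self):
        self.n_sa = defaultdict(int)   # visit counts of (s, a)
        self.n_tsa = defaultdict(int)  # counts of observed outcome t

    def observe(self, s, a, t):
        self.n_sa[(s, a)] += 1
        self.n_tsa[(t, s, a)] += 1

    def prob(self, t, s, a):
        return self.n_tsa[(t, s, a)] / self.n_sa[(s, a)]
```

E.g. after observing `('A', 'go') -> 'B'` twice and `('A', 'go') -> 'C'` once, the estimate `prob('B', 'A', 'go')` is 2/3.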

#### Temporal-Difference Learning (TD)

Another way to utilize the Bellman equation (i.e. the relation to the successor states):

Use the *observed transitions* to adjust the utilities of the observed states so that they agree with the constraint equations.

$$
U^{\pi}(s) \leftarrow U^{\pi}(s)+\alpha\left(R(s)+\gamma U^{\pi}\left(s^{\prime}\right)-U^{\pi}(s)\right)
$$

where $\alpha$ is the learning rate; the update converges if $\alpha$ decreases appropriately with the number of times the state has been visited.

Similar to gradient descent.
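
The update rule is one line of code (a sketch; `td_update` and the dict representation of $U$ are my own choices):

```python
def td_update(U, s, s_next, reward, alpha=0.1, gamma=0.9):
    """One TD(0) step: move U(s) toward reward + gamma * U(s_next).

    U is a dict mapping states to utility estimates.
    """
    U[s] += alpha * (reward + gamma * U[s_next] - U[s])
    return U
```

With `U = {'A': 0.0, 'B': 1.0}`, observing the transition A→B with reward 0, `alpha=0.5`, `gamma=1.0` moves `U['A']` halfway toward the target: `0.5`.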

TD vs. ADP

- ADP needs to estimate a transition model while TD doesn't need one.
  - The environment supplies the connection between neighboring states in the form of *observed transitions*.
- Both try to make local adjustments to the utility estimates in order to make each state "agree" with its successors (Bellman equation). But they still differ:
  - TD adjusts a state to agree with its *observed* successor.
  - ADP adjusts the state to agree with *all* of the successors that might occur.
  - TD makes a single adjustment per observed transition.
  - ADP makes as many as it needs to restore consistency between the utility estimates $U$ and the environment model $P$.
- Thus, TD can be viewed as a crude but efficient first approximation to ADP.

TD (ADP) vs. MC

- TD (ADP) can perform online learning at every step, while MC needs to wait until the end of the episode.
- TD depends only on one measured reward while MC considers all rewards.
- The TD target has lower variance but is biased, while the MC target is unbiased but has higher variance.
- TD (ADP) usually converges faster than MC in practice.

ADP vs. TD (MC)

- ADP learns the model (transition function) and then solves for the values, while TD and MC don't need a model.
- ADP is more *data efficient*: it requires less data from the real world, while TD and MC require more data for learning.
- TD is more *computationally efficient*: TD doesn't need to compute expectations but ADP does.

### Active Learning

- The agent needs to learn a complete model with outcome probabilities for all actions, rather than just the model for the fixed policy.
- We need to take into account the fact that the agent has a choice of actions.

#### Adaptive Dynamic Programming

Note: policy $\pi$ is no longer fixed; it changes as transitions and rewards are learned.

Algorithm:

- Repeat until convergence:
  - Estimate the transition function $P(t \mid s, a)$ based on the current optimal policy
    - $P(t \mid s, a) \leftarrow N_{s^{\prime} \mid s a}[t, s, a] / N_{s a}[s, a]$
  - Perform *Policy Iteration* or *Value Iteration*:
    - learn the utilities defined by the current policy
    - choose the optimal policy based on the utilities

The ADP agent is greedy, so it needs to trade off

- Exploitation: maximize value as reflected by current estimate
- Exploration: learn more about the model to potentially improve long term gain

How to balance Exploitation and Exploration:

- Greedy in the Limit of Infinite Exploration, or **GLIE**:
  - Try each action in each state an unbounded number of times, to avoid a finite probability of missing an optimal action.
  - Eventually become greedy, so that it is optimal w.r.t. the true model.

GLIE schemes:

- $\epsilon$-greedy exploration
  - Choose the greedy action with probability $1-\epsilon$; a random action with probability $\epsilon$.
- optimistic estimates of the utility

#### Q-Learning

- Model-free algorithm
- takes actions by referring to a Q-table

Find the optimal policy by maximizing the expected total reward.

How to update the Q-table?

- Bellman equation (TD update)
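
The TD update of the Q-table, plus $\epsilon$-greedy action selection, can be sketched as below. The function names and the `(state, action)`-keyed dict layout are my own; in practice the Q-table is often a `defaultdict(float)` so unseen pairs start at 0.

```python
import random
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """TD-style update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """GLIE-style selection: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

Note that `q_update` needs no transition model: the observed transition $(s, a, r, s')$ supplies everything, which is what makes Q-learning model-free.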

### Function Approximation

### Policy Search

## POMDP

### Basics

Unknown state but can receive the *sensor information* for *state estimation*.

- Belief State: $b$. The probability distribution over states.
  - $b(s)$: the probability of being in state $s$.
  - For example, in a 2×2 grid world, the initial belief state is $\{1/4, 1/4, 1/4, 1/4\}$ because the agent doesn't know anything about the world.

The agent can learn something about its actual state by sensing the environment:

- Sensor Model: $P(e \mid s)$. Probability of sensing evidence $e$ when in state $s$.

POMDPs can be formulated by 3 components:

- Transition Model: $P(s^{\prime} \mid s, a)$
- Reward Function: $R(s)$
- Sensor Model: $P(e \mid s)$

Belief Update (*Filtering*):

$$
b^{\prime}\left(s^{\prime}\right)=\alpha P\left(e \mid s^{\prime}\right) \sum_{s} P\left(s^{\prime} \mid s, a\right) b(s)
$$

- First term: sensor model, the probability of sensing evidence $e$ in state $s^{\prime}$.
- Second term: sum over all the states that can lead to $s^{\prime}$ after performing $a$.
- Often written as $b^{\prime} = \operatorname{FORWARD}(b, a, e)$.

Optimal Policy: $\pi^*(b)$
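
The filtering step is a predict-then-weight computation. A sketch; the array layout (`T[a][s][s']` for the transition model, `O[e][s]` for the sensor model) and the function name are my own conventions:

```python
import numpy as np

def belief_update(b, a, e, T, O):
    """One filtering step b' = FORWARD(b, a, e)."""
    predicted = T[a].T @ b        # prediction: sum_s P(s'|s,a) b(s)
    unnorm = O[e] * predicted     # weight by sensor model P(e|s')
    return unnorm / unnorm.sum()  # normalization constant alpha
```

For a hypothetical 2-state world where the action leaves the state unchanged and the percept `'hot'` has likelihoods $[0.9, 0.1]$, a uniform belief $[0.5, 0.5]$ updates to $[0.9, 0.1]$.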

How to find the optimal policy?

- POMDP VI
- Dynamic Decision Network

### POMDP to MDP

MDP consists of 2 components

- Transition Function: $P(s^{\prime} \mid s, a)$
- Reward Function: $R(s)$

Now we can convert POMDP problem into MDP problem in *belief space*.

Transition Function: $P(b^{\prime} \mid b, a)$

$$
\begin{aligned}
P\left(b^{\prime} \mid b, a\right) &=\sum_{e} P\left(b^{\prime} \mid e, a, b\right) P(e \mid a, b) \\
&=\sum_{e} P\left(b^{\prime} \mid e, a, b\right) \sum_{s^{\prime}} P\left(e \mid s^{\prime}\right) \sum_{s} P\left(s^{\prime} \mid s, a\right) b(s)
\end{aligned}
$$

where $P(e \mid a, b)$ is the percept probability

$$
\begin{aligned}
P(e \mid a, b) &=\sum_{s^{\prime}} P\left(e \mid a, s^{\prime}, b\right) P\left(s^{\prime} \mid a, b\right) \\
&=\sum_{s^{\prime}} P\left(e \mid s^{\prime}\right) P\left(s^{\prime} \mid a, b\right) \\
&=\sum_{s^{\prime}} P\left(e \mid s^{\prime}\right) \sum_{s} P\left(s^{\prime} \mid s, a\right) b(s)
\end{aligned}
$$

where the term $P(e \mid s^{\prime})$ is the sensor model and the term $P(s^{\prime} \mid s, a)$ is the transition model.

Reward Function: $\rho(b)$

$$
\rho(b)=\sum_{s} b(s) R(s)
$$

POMDPs in physical space are transformed to MDPs in *belief state space*.
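
Both building blocks of the belief-space MDP, the reward $\rho(b)$ and the percept probability $P(e \mid a, b)$, are short expected-value computations. A sketch with my own function names and array layout (`T[a][s][s']`, `O[e][s]`):

```python
import numpy as np

def belief_reward(b, R):
    """rho(b) = sum_s b(s) R(s): expected reward of belief state b."""
    return float(np.dot(b, R))

def percept_probability(b, a, T, O):
    """P(e | a, b) for each percept e: predict the next-state distribution,
    then weight it by the sensor model."""
    predicted = T[a].T @ b  # sum_s P(s'|s,a) b(s)
    return {e: float(np.dot(O[e], predicted)) for e in O}
```

E.g. with belief $[0.5, 0.5]$ and rewards $[0, 10]$, `belief_reward` gives 5; with a state-preserving action and sensor likelihoods $[0.9, 0.1]$ / $[0.1, 0.9]$ for two percepts, each percept has probability 0.5.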

*Belief State Space*: the space of probability distributions over original states, i.e. the space of belief states $b$.

- For example, the initial belief state $b$ in a 2×2 grid world is $\{1/4, 1/4, 1/4, 1/4\}$, which is a point in belief state space.
- Thus, belief state space is *continuous* and *multi-dimensional* (e.g., 4 dimensions in a 2×2 grid world). The original VI or PI for MDPs cannot work in a continuous space.

### POMDP Value Iteration

Ideas:

- Original VI (in discrete state space): calculate the utility value of each *physical state*.
- POMDP VI (in continuous state space, i.e. belief state space): calculate the utility value of each *conditional plan*.

Concepts:

- $p$: conditional plan, a policy at a belief $b$.
- $\alpha_p(s)$: the utility of executing a *fixed* conditional plan $p$ starting in physical state $s$.
- $\sum_s b(s)\alpha_p(s)$: the expected utility of executing $p$ in belief state $b$, which is linear in the belief state $b$.

Value Iteration in belief state space:

$$

\alpha_{p}(s)=R(s)+\gamma \left(\sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) \sum_{e} P\left(e \mid s^{\prime}\right) \alpha_{p.e}\left(s^{\prime}\right)\right)

$$

Problem

- Exact POMDP solvers only work for very small problems.
- Finding the optimal policy for a general POMDP is PSPACE-hard.
- Approximate solvers can scale to moderately sized problems.
- *Online search* tends to scale better.

### Dynamic Decision Network

- Execution of a POMDP over time can be represented as a DDN.
- The transition and sensor models are represented by a **dynamic Bayesian network** (DBN), as described in Chapter 15.
- The dynamic Bayesian network is extended with decision and utility nodes, as used in **decision networks** in Chapter 16. The resulting model is called a **dynamic decision network**, or DDN.
- A filtering algorithm is used to incorporate each new percept and action and to update the belief state representation.
- Decisions are made by projecting forward possible action sequences and choosing the best one.

## Game Theory

### Single-Move Game

Components

- Agents
- Actions
- Payoff Function
  - can be represented as a matrix

Game Strategies

- pure strategy: a deterministic policy
- mixed strategy: a randomized policy that selects actions according to a probability distribution
- strategy profile
  - game solutions

E.g. Prisoners’ Dilemma

$$
\begin{array}{|c|c|c|}
\hline & \text { Alice: testify } & \text { Alice: refuse } \\
\hline \text { Bob: testify } & B=-5 ; A=-5 & B=0 ; A=-10 \\
\hline \text { Bob: refuse } & B=-10 ; A=0 & B=-1 ; A=-1 \\
\hline
\end{array}
$$

Alice finds that "testify" is a dominant strategy for the game, so she eliminates "refuse"; the same holds for Bob. They finally reach an equilibrium: both choose to "testify".

- Domination
  - strictly dominates: for one player, if the outcome of adopting strategy $s$ is always higher than the outcome of every other strategy, for every choice of strategies by the other players, then strategy $s$ strictly dominates the others.
    - E.g. in the Prisoners' Dilemma, "testify" strictly dominates "refuse" for Alice: if Bob testifies, Alice gets -5 by testifying vs. -10 by refusing; if Bob refuses, Alice gets 0 vs. -1.
  - weakly dominates
- Iterated elimination of strictly dominated strategies
- Nash Equilibrium
  - A strategy profile forms an equilibrium if no player can benefit from switching strategies, given that every other player sticks to the same strategies.
  - Note: an equilibrium is a local optimum.
- dominant strategy equilibrium
  - When each player has a dominant strategy, the combination is called a dominant strategy equilibrium.

Although both choose "testify" and reach an equilibrium, the equilibrium is a local optimum, because they would both be better off choosing "refuse".

- Pareto optimal
  - An outcome is Pareto optimal if there is no other outcome that all players would prefer.
- Pareto dominated
  - An outcome is Pareto dominated by another outcome if all players would prefer the other outcome.

(B: testify, A: testify) is the Nash equilibrium while (B: refuse, A: refuse) Pareto dominates it.

Another Example:

$$
\begin{array}{|c|c|c|}
\hline & \text { Acme: bluray } & \text { Acme: } d v d \\
\hline \text { Best: bluray } & B=+9 ; A=+9 & B=-1 ; A=-4 \\
\hline \text { Best: } d v d & B=-1 ; A=-3 & B=+5 ; A=+5 \\
\hline
\end{array}
$$

We can find two Nash equilibria: (B: bluray, A: bluray) and (B: dvd, A: dvd). We can also find the Pareto optimum: (B: bluray, A: bluray). To choose the best strategy, the agents can either guess or *communicate*.

- Coordination Games
  - Games where agents need to *communicate*, like the example above.

Another Example: two-finger Morra

$$
\begin{array}{|c|c|c|}
\hline & \text { O: one } & \text { O: two } \\
\hline E: \text { one } & E=+2, O=-2 & E=-3, O=+3 \\
\hline E: \text { two } & E=-3, O=+3 & E=+4, O=-4 \\
\hline
\end{array}
$$

Neither E nor O can reach a pure-strategy Nash equilibrium, because the losing player always wants to switch strategies, i.e. O will switch away from (E: one, O: one) or (E: two, O: two), and similarly for E. It's a zero-sum game. Thus there are no *pure-strategy* Nash equilibria, which calls for a *mixed strategy*.

A simple algorithm to find pure-strategy Nash equilibria:

- For each column, mark the cell(s) with the maximum payoff for the row player (may be more than one).
- For each row, mark the cell(s) with the maximum payoff for the column player (may be more than one).
- Cells with both a row mark and a column mark are pure-strategy Nash equilibria.
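
The marking algorithm translates directly into code. A sketch; the function name and the `payoffs[(i, j)] = (row_payoff, col_payoff)` encoding are my own:

```python
def pure_nash_equilibria(payoffs, rows, cols):
    """Find pure-strategy Nash equilibria of a 2-player matrix game."""
    row_marks, col_marks = set(), set()
    for j in cols:
        # mark cells where the row player's payoff is maximal in this column
        best = max(payoffs[(i, j)][0] for i in rows)
        row_marks |= {(i, j) for i in rows if payoffs[(i, j)][0] == best}
    for i in rows:
        # mark cells where the column player's payoff is maximal in this row
        best = max(payoffs[(i, j)][1] for j in cols)
        col_marks |= {(i, j) for j in cols if payoffs[(i, j)][1] == best}
    return row_marks & col_marks  # cells marked by both players
```

Running it on the Prisoners' Dilemma matrix above (Bob as the row player) returns the single equilibrium (testify, testify); on the two-finger Morra matrix it returns the empty set.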

Zero-Sum Games

Games in which the sum of the payoffs is always zero (or a constant); in other words, the agents cannot all benefit at the same time.

Minimax Algorithm

- Minimax Game Tree
- mixed strategy (with probabilities)
- maximin equilibrium
- See example in *Lecture 12*, slides 5-6.
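
For a 2×2 zero-sum game with no pure equilibrium (such as Morra), the maximin mixed strategy can be computed in closed form by making the opponent indifferent between their two actions. This is a sketch of that standard calculation, not the lecture's tree-based method; the function name is my own, and the formula assumes the game genuinely has no saddle point:

```python
from fractions import Fraction

def maximin_mixed_2x2(M):
    """Maximin mixed strategy for the row player of a 2x2 zero-sum game
    with no saddle point. M[i][j] is the row player's payoff."""
    (a, b), (c, d) = M
    p = Fraction(d - c, a - b - c + d)  # probability of row action 0
    value = a * p + c * (1 - p)         # game value for the row player
    return p, value
```

For two-finger Morra, with E as the row player and `M = [[2, -3], [-3, 4]]`, this gives `p = 7/12` for playing "one" and a game value of `-1/12`, i.e. E loses 1/12 point per game on average.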

### Multiple-Move Game (simplest ver.: Repeated Game)

The players face the same choices repeatedly

- In other words, at each round the history of all players' previous choices is available

Two strategies

perpetual punishment

Suppose that after each round there is a 99% chance that the players will meet again. Then the expected number of rounds is still 100, but neither player knows for sure which round will be the last.

If always choose “refuse”:

$$

\sum_{t=0}^{\infty} 0.99^{t} \cdot(-1)=-100

$$If one round choose “testify”:

$$

0+\sum_{t=1}^{\infty} 0.99^{t} \cdot(-5)=-495

$$
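
The two geometric sums above can be checked numerically with a finite-horizon approximation (the helper name is mine; at $\gamma = 0.99$ a horizon of 10,000 rounds makes the truncation error negligible):

```python
def discounted(reward_at, gamma=0.99, horizon=10_000):
    """Finite-horizon approximation of sum_{t>=0} gamma^t * reward_at(t)."""
    return sum(gamma ** t * reward_at(t) for t in range(horizon))

always_refuse = discounted(lambda t: -1)                  # both always refuse
deviate_once = discounted(lambda t: 0 if t == 0 else -5)  # testify once, punished after
```

`always_refuse` comes out at -100 and `deviate_once` at -495, so the threat of perpetual punishment makes cooperation the better long-run choice.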

tit-for-tat

- calls for starting with "refuse" and then echoing the other player's previous move on all subsequent moves.
- So Alice would refuse as long as Bob refuses, would testify the move after Bob testified, but would go back to refusing if Bob did. Although very simple, this strategy has proven to be highly robust and effective against a wide variety of strategies.

**Author:** Yuang

**Link:** http://www.yuuuuang.com/2022/04/26/%E3%80%90%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0%E3%80%91%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD%E8%A7%84%E5%88%92%E4%B8%8E%E5%86%B3%E7%AD%96/

**Copyright:** Unless otherwise stated, all articles on this blog are licensed under BY-NC-SA. Please credit the source when reposting!