Planning and Acting in Partially Observable Stochastic Domains


Leslie Pack Kaelbling, Michael L. Littman†, Anthony R. Cassandra

January 13, 1997

Abstract

In this paper, we bring techniques from operations research to bear on the problem of choosing optimal actions in partially observable stochastic domains. We begin by introducing the theory of Markov decision processes (MDPs) and partially observable MDPs (POMDPs). We then outline a novel algorithm for solving POMDPs offline and show how, in some cases, a finite-memory controller can be extracted from the solution to a POMDP. We conclude with a discussion of how our approach relates to previous work, the complexity of finding exact solutions to POMDPs, and some possibilities for finding approximate solutions.

Consider the problem of a robot navigating in a large office building. The robot can move from hallway intersection to intersection and can make local observations of its world.

Its actions are not completely reliable, however. Sometimes, when it intends to move, it stays where it is or goes too far; sometimes, when it intends to turn, it overshoots. It has similar problems with observation. Sometimes a corridor looks like a corner; sometimes a T-junction looks like an L-junction. How can such an error-plagued robot navigate, even given a map of the corridors?

In general, the robot will have to remember something about its history of actions and observations and use this information, together with its knowledge of the underlying dynamics of the world (the map and other information), to maintain an estimate of its location. Many engineering applications follow this approach, using methods like the Kalman filter [18] to maintain a running estimate of the robot's spatial uncertainty, expressed as an ellipsoid or normal distribution in Cartesian space. This approach will not do for our robot, though. Its uncertainty may be discrete: it might be almost certain that it is in the north-east corner of either the fourth or the seventh floors, though it admits a chance that it is on the fifth floor, as well.

Then, given an uncertain estimate of its location, the robot has to decide what actions to take. In some cases, it might be sufficient to ignore its uncertainty and take actions that would be appropriate for the most likely location. In other cases, it might be better for

This work was supported in part by NSF grants IRI-9453383 and IRI-9312395.

†This work was supported in part by Bellcore.


the robot to take actions for the purpose of gathering information, such as searching for a landmark or reading signs on the wall. In general, it will take actions that fulfill both purposes simultaneously.

1 Introduction

In this paper, we bring techniques from operations research to bear on the problem of choosing optimal actions in partially observable stochastic domains. Problems like the one described above can be modeled as partially observable Markov decision processes (POMDPs).

Of course, we are not interested only in problems of robot navigation. Similar problems come up in factory process control, oil exploration, transportation logistics, and a variety of other complex real-world situations.

This is essentially a planning problem: given a complete and correct model of the world dynamics and a reward structure, find an optimal way to behave. In the artificial intelligence (AI) literature, a deterministic version of this problem has been addressed by adding knowledge preconditions to traditional planning systems [34]. Because we are interested in stochastic domains, however, we must depart from the traditional AI planning model. Rather than taking plans to be sequences of actions, which may only rarely execute as expected, we take them to be mappings from situations to actions that specify the agent's behavior no matter what may happen. In many cases, we may not want a full policy; methods for developing partial policies and conditional plans for completely observable domains are the subject of much current interest [13, 11, 22]. A weakness of the methods described in this paper is that they require the states of the world to be represented enumeratively, rather than through compositional representations such as Bayes nets or probabilistic operator descriptions. However, this work has served as a substrate for development of more complex and efficient representations [6]. Section 6 describes the relation between the present approach and prior research in more detail.

One important facet of the POMDP approach is that there is no distinction drawn between actions taken to change the state of the world and actions taken to gain information. This is important because, in general, every action has both types of effect. Stopping to ask questions may delay the robot's arrival at the goal or spend extra energy; moving forward may give the robot information that it is in a dead-end because of the resulting crash. Thus, from the POMDP perspective, optimal performance involves something akin to a "value of information" calculation, only more complex; the agent chooses between actions that differ in the amount of information they provide, the amount of reward they produce, and how they change the state of the world.

Much of the content of this paper is a recapitulation of work in the operations-research literature [28, 33, 48, 50, 54]. We have developed new ways of viewing the problem that are, perhaps, more consistent with the AI perspective; for example, we give a novel development of exact finite-horizon POMDP algorithms in terms of "policy trees" instead of the classical algebraic approach [48]. We begin by introducing the theory of Markov decision processes (MDPs) and POMDPs. We then outline a novel algorithm for solving POMDPs offline and show how, in some cases, a finite-memory controller can be extracted from the solution to a POMDP. We conclude with a brief discussion of related work and of approximation methods.


Figure 1: An MDP models the synchronous interaction between agent and world.

2 Markov Decision Processes

Markov decision processes serve as a basis for solving the more complex partially observable problems that we are ultimately interested in. An MDP is a model of an agent interacting synchronously with a world. As shown in Figure 1, the agent takes as input the state of the world and generates as output actions, which themselves affect the state of the world. In the MDP framework, it is assumed that, although there may be a great deal of uncertainty about the effects of an agent's actions, there is never any uncertainty about the agent's current state: it has complete and perfect perceptual abilities.

Markov decision processes are described in depth in a variety of texts [3, 40]; we will just briefly cover the necessary background.

2.1 Basic Framework

A Markov decision process can be described as a tuple $\langle S, A, T, R \rangle$, where

- $S$ is a finite set of states of the world;

- $A$ is a finite set of actions;

- $T : S \times A \rightarrow \Pi(S)$ is the state-transition function, giving for each world state and agent action, a probability distribution over world states (we write $T(s, a, s')$ for the probability of ending in state $s'$, given that the agent starts in state $s$ and takes action $a$); and

- $R : S \times A \rightarrow \mathbb{R}$ is the reward function, giving the expected immediate reward gained by the agent for taking each action in each state (we write $R(s, a)$ for the expected reward for taking action $a$ in state $s$).

In this model, the next state and the expected reward depend only on the previous state and the action taken; even if we were to condition on additional previous states, the transition probabilities and the expected rewards would remain the same. This is known as the Markov property: the state and reward at time $t+1$ depend only on the state at time $t$ and the action at time $t$.

In fact, MDPs can have infinite state and action spaces. The algorithms that we describe in this section apply only to the finite case; however, in the context of POMDPs, we will consider a class of MDPs with uncountably infinite state spaces.
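In code, the tuple $\langle S, A, T, R \rangle$ maps directly onto simple data structures. The sketch below encodes an invented two-state, two-action MDP (not an example from the text) as plain Python dictionaries:

```python
# A minimal encoding of a finite MDP <S, A, T, R>.
# The states, actions, and numbers below are made up for illustration.
S = ["s0", "s1"]
A = ["stay", "go"]

# T[(s, a)] is a probability distribution over successor states:
# T[(s, a)][s2] = T(s, a, s2).
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "go"):   {"s0": 0.7, "s1": 0.3},
}

# R[(s, a)] is the expected immediate reward for taking a in s.
R = {
    ("s0", "stay"): 0.0, ("s0", "go"): 1.0,
    ("s1", "stay"): 2.0, ("s1", "go"): 0.0,
}

# Sanity check: each T(s, a, .) must be a probability distribution.
for dist in T.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```

The same dictionaries serve as the model for all of the dynamic-programming computations described below.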


2.2 Acting Optimally

We would like our agents to act in such a way as to maximize some measure of the long-run reward received. One such framework is finite-horizon optimality, in which the agent should act in order to maximize the expected sum of reward that it gets on the next $k$ steps; it should maximize
\[
E\left[\sum_{t=0}^{k-1} r_t\right],
\]
where $r_t$ is the reward received on step $t$. This model is somewhat inconvenient, because it is rare that an appropriate $k$ will be known exactly. We might prefer to consider an infinite lifetime for the agent. The most straightforward is the infinite-horizon discounted model, in which we sum the rewards over the infinite lifetime of the agent, but discount them geometrically using discount factor $0 < \gamma < 1$; the agent should act so as to optimize
\[
E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right].
\]
In this model, rewards received earlier in its lifetime have more value to the agent; the infinite lifetime is considered, but the discount factor ensures that the sum is finite. This sum is also the expected amount of reward received if a decision to terminate the run is made on each step with probability $1 - \gamma$. The larger the discount factor (closer to 1), the more effect future rewards have on current decision making. In our future discussions of finite-horizon optimality, we will also use a discount factor; when it has value one, it is equivalent to the simple finite-horizon case described above.

A policy is a description of the behavior of an agent. We consider two kinds of policies: stationary and non-stationary. A stationary policy, $\pi : S \rightarrow A$, is a situation-action mapping that specifies, for each state, an action to be taken. The choice of action depends only on the state and is independent of the time step. A non-stationary policy is a sequence of situation-action mappings, indexed by time. The policy $\pi_t$ is to be used to choose the action on the $t$th-to-last step as a function of the current state, $s_t$. In the finite-horizon model, the optimal policy is not typically stationary: the way an agent chooses its actions on the last step of its life is generally going to be very different from the way it chooses them when it has a long life ahead of it. In the infinite-horizon discounted model, the agent always has a constant expected amount of time remaining, so there is no reason to change action strategies: there is a stationary optimal policy.

Given a policy, we can evaluate it based on the long-run value that the agent expects to gain from executing it. In the finite-horizon case, let $V_{\pi,t}(s)$ be the expected sum of reward gained from starting in state $s$ and executing non-stationary policy $\pi$ for $t$ steps. Clearly, $V_{\pi,1}(s) = R(s, \pi_1(s))$; that is, on the last step, the value is just the expected reward for taking the action specified by the final element of the policy. Now, we can define $V_{\pi,t}(s)$ inductively as
\[
V_{\pi,t}(s) = R(s, \pi_t(s)) + \gamma \sum_{s' \in S} T(s, \pi_t(s), s')\, V_{\pi,t-1}(s') .
\]
The $t$-step value of being in state $s$ and executing non-stationary policy $\pi$ is the immediate reward, $R(s, \pi_t(s))$, plus the discounted expected value of the remaining $t-1$ steps. To evaluate the future, we must consider all possible resulting states $s'$, the likelihood of their occurrence $T(s, \pi_t(s), s')$, and their $(t-1)$-step value under policy $\pi$, $V_{\pi,t-1}(s')$. In the infinite-horizon discounted case, we write $V_\pi(s)$ for the expected discounted sum of future reward for starting in state $s$ and executing policy $\pi$. It is recursively defined by
\[
V_\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, V_\pi(s') .
\]
The value function, $V_\pi$, for policy $\pi$ is the unique simultaneous solution of this set of linear equations, one equation for each state $s$.
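One way to solve these linear equations, sketched below for a small invented two-state MDP and a fixed stationary policy, is successive approximation: start from zero and repeatedly apply the right-hand side of the recursion until the values stop changing (the update is a contraction for $\gamma < 1$):

```python
# Iterative evaluation of a fixed stationary policy pi under the
# infinite-horizon discounted criterion. The MDP and policy are
# invented for illustration.
gamma = 0.9
S = ["s0", "s1"]
pi = {"s0": "go", "s1": "stay"}          # a fixed stationary policy

# Transition and reward models, restricted to the actions pi chooses.
T = {("s0", "go"):   {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s0": 0.1, "s1": 0.9}}
R = {("s0", "go"): 1.0, ("s1", "stay"): 2.0}

V = {s: 0.0 for s in S}
for _ in range(1000):                    # plenty of iterations to converge
    V = {s: R[(s, pi[s])]
            + gamma * sum(p * V[s2] for s2, p in T[(s, pi[s])].items())
         for s in S}
```

After the loop, `V` satisfies the fixed-point equations to within floating-point tolerance; a direct linear solve over the $|S|$ equations would give the same answer.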

Now we know how to compute a value function, given a policy. Sometimes, we will need to go the opposite way, and compute a greedy policy given a value function. It really only makes sense to do this for the infinite-horizon discounted case; to derive a policy for the finite horizon, we would need a whole sequence of value functions. Given any value function $V$, a greedy policy with respect to that value function, $\pi_V$, is defined as
\[
\pi_V(s) = \operatorname{argmax}_a \left[ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V(s') \right] .
\]
This is the policy obtained by, at every step, taking the action that maximizes expected immediate reward plus the expected discounted value of the next state, as measured by $V$.

What is the optimal finite-horizon policy, $\pi^*$? The agent's last step is easy: it should maximize its final reward. So
\[
\pi^*_1(s) = \operatorname{argmax}_a R(s, a) .
\]
The optimal policy for the $t$th step, $\pi^*_t$, can be defined in terms of the optimal $(t-1)$-step value function $V_{\pi^*_{t-1}, t-1}$ (written for simplicity as $V^*_{t-1}$):
\[
\pi^*_t(s) = \operatorname{argmax}_a \left[ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^*_{t-1}(s') \right] ;
\]
$V^*_{t-1}$ is derived from $\pi^*_{t-1}$ and $V^*_{t-2}$.

In the infinite-horizon discounted case, for any initial state $s$, we want to execute the policy $\pi$ that maximizes $V_\pi(s)$. Howard [17] showed that there exists a stationary policy, $\pi^*$, that is optimal for every starting state. The value function for this policy, $V_{\pi^*}$, also written $V^*$, is defined by the set of equations
\[
V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s') \right] ,
\]
which has a unique solution. An optimal policy, $\pi^*$, is just a greedy policy with respect to $V^*$.

Another way to understand the infinite-horizon value function, $V^*$, is to approach it by using an ever-increasing discounted finite horizon. As the horizon, $t$, approaches infinity, $V^*_t$ approaches $V^*$. This is only guaranteed to occur when the discount factor, $\gamma$, is less than 1, which tends to wash out the details of exactly what happens at the end of the agent's life.


V_1(s) := 0 for all s
t := 1
loop
    t := t + 1
    for all s ∈ S and for all a ∈ A
        Q_t^a(s) := R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V_{t-1}(s')
        V_t(s) := max_a Q_t^a(s)
    end loop
until |V_t(s) − V_{t-1}(s)| < ε for all s ∈ S

Table 1: The value iteration algorithm for finite-state-space MDPs.

2.3 Computing an Optimal Policy

There are many methods for finding optimal policies for MDPs. In this section, we explore value iteration because it will also serve as the basis for finding policies in the partially observable case.

Value iteration proceeds by computing the sequence $V_t$ of discounted finite-horizon optimal value functions, as shown in Table 1 (the superscript $*$ is omitted, because we shall henceforth only be considering optimal value functions). It makes use of an auxiliary function, $Q^a_t(s)$, which is the $t$-step value of starting in state $s$, taking action $a$, then continuing with the optimal $(t-1)$-step non-stationary policy. The algorithm terminates when the maximum difference between two successive value functions (known as the Bellman error magnitude) is less than some $\epsilon$.

It can be shown [52] that there exists a $t^*$, polynomial in $|S|$, $|A|$, the magnitude of the largest value of $R(s, a)$, and $1/(1-\gamma)$, such that the greedy policy with respect to $V_{t^*}$ is equal to the optimal infinite-horizon policy, $\pi^*$. Rather than calculating a bound on $t^*$ in advance and running value iteration for that long, we instead use the following result regarding the Bellman error magnitude [55] in order to terminate with a near-optimal policy.

If $|V_t(s) - V_{t-1}(s)| < \epsilon$ for all $s$, then the value of the greedy policy with respect to $V_t$ does not differ from $V^*$ by more than $2\epsilon\gamma/(1-\gamma)$ at any state. That is,
\[
\max_{s \in S} \left| V_{\pi_{V_t}}(s) - V^*(s) \right| < \frac{2\epsilon\gamma}{1-\gamma} .
\]
It is often the case that $\pi_{V_t} = \pi^*$ long before $V_t$ is near $V^*$; tighter bounds may be obtained using the span semi-norm on the value function [40].
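The loop of Table 1, together with this stopping rule, can be sketched as follows; the two-state MDP is invented for illustration, and the greedy policy is extracted at the end:

```python
# Value iteration with the Bellman-error stopping criterion, followed
# by extraction of the greedy policy. The MDP is made up for
# illustration.
gamma, eps = 0.9, 1e-6
S, A = ["s0", "s1"], ["stay", "go"]
T = {("s0", "stay"): {"s0": 0.9, "s1": 0.1},
     ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
     ("s1", "go"):   {"s0": 0.7, "s1": 0.3}}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): 0.0}

def q(s, a, V):
    """Q_t^a(s): immediate reward plus discounted expected next value."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())

V = {s: 0.0 for s in S}
delta = float("inf")
while delta >= eps:                      # Bellman error magnitude test
    V_new = {s: max(q(s, a, V) for a in A) for s in S}
    delta = max(abs(V_new[s] - V[s]) for s in S)
    V = V_new

greedy = {s: max(A, key=lambda a: q(s, a, V)) for s in S}
```

For this model the loop settles on staying in the high-reward state `s1` and moving toward it from `s0`, which matches what one gets by solving the optimality equations by hand.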

3 Partial Observability

For MDPs, we can compute the optimal policy $\pi^*$ and use it to act by simply executing $\pi^*(s)$ for the current state $s$. What happens if the agent is no longer able to determine the state it is currently in with complete reliability? A naive approach would be for the agent to map the most recent observation directly into an action without remembering anything from the past.


In our hallway navigation example, this amounts to performing the same action in every location that looks the same: hardly a promising approach. Somewhat better results can be obtained by adding randomness to the agent's behavior: a policy can be a mapping from observations to probability distributions over actions [47]. Randomness effectively allows the agent to sometimes choose different actions in different locations with the same appearance, increasing the probability that it might choose a good action; in practice, deterministic observation-action mappings are prone to getting trapped in deterministic loops [24].

In order to behave truly effectively in a partially observable world, it is necessary to use memory of previous actions and observations to aid in the disambiguation of the states of the world. The POMDP framework provides a systematic method of doing just that.

3.1 POMDP Framework

A partially observable Markov decision process can be described as a tuple $\langle S, A, T, R, \Omega, O \rangle$, where

- $S$, $A$, $T$, and $R$ describe a Markov decision process;

- $\Omega$ is a finite set of observations the agent can experience of its world; and

- $O : S \times A \rightarrow \Pi(\Omega)$ is the observation function, which gives, for each action and resulting state, a probability distribution over possible observations (we write $O(s', a, o)$ for the probability of making observation $o$ given that the agent took action $a$ and landed in state $s'$).

A POMDP is an MDP in which the agent is unable to observe the current state. Instead, it makes an observation based on the action and resulting state.¹ The agent's goal remains to maximize expected discounted future reward.

3.2 Problem Structure

We decompose the problem of controlling a POMDP into two parts, as shown in Figure 2. The agent makes observations and generates actions. It keeps an internal belief state, $b$, that summarizes its previous experience. The component labeled SE is the state estimator: it is responsible for updating the belief state based on the last action, the current observation, and the previous belief state. The component labeled $\pi$ is the policy: as before, it is responsible for generating actions, but this time as a function of the agent's belief state rather than the state of the world.

What, exactly, is a belief state? One choice might be the most probable state of the world, given the past experience. Although this might be a plausible basis for action in some cases, it is not sufficient in general. In order to act effectively, an agent must take into account its own degree of uncertainty. If it is lost or confused, it might be appropriate for it to take sensing actions such as asking for directions, reading a map, or searching for

¹It is possible to formulate an equivalent model in which the observation depends on the previous state instead of, or in addition to, the resulting state, but it complicates the exposition and adds no more expressive power.


Figure 2: A POMDP agent can be decomposed into a state estimator (SE) and a policy ($\pi$).

Figure 3: Simple POMDP to illustrate belief state evolution. (Four states in a row; state 3 is the goal.)

a landmark. In the POMDP framework, such actions are not explicitly distinguished: their informational properties are described via the observation function.

Our choice for belief states will be probability distributions over states of the world.

These distributions encode the agent's subjective probability about the state of the world and provide a basis for acting under uncertainty. Furthermore, they comprise a sufficient statistic for the past history and initial belief state of the agent: given the agent's current belief state (properly computed), no additional data about its past actions or observations would supply any further information about the current state of the world. As a corollary of this, no additional data about the past would help to increase the agent's expected reward.

To illustrate the evolution of a belief state, we will use the simple example depicted in Figure 3. There are four states in this example, one of which is a goal state, indicated by the star. There are two possible observations: one is always made when the agent is in state 1, 2, or 4; the other, when it is in the goal state. There are two possible actions: east and west. These actions succeed with probability 0.9, and when they fail, the movement is in the opposite direction. If no movement is possible in a particular direction, then the agent remains in the same location.

Assume that the agent is initially equally likely to be in any of the three non-goal states.

Thus, its initial belief state is $\langle 0.333 \;\; 0.333 \;\; 0.000 \;\; 0.333 \rangle$, where the position in the belief vector corresponds to the state number.

If the agent takes action east and does not observe the goal, then the new belief state becomes $\langle 0.100 \;\; 0.450 \;\; 0.000 \;\; 0.450 \rangle$. If it takes action east again, and still does not observe the goal, then the probability mass becomes concentrated in the right-most state: $\langle 0.100 \;\; 0.164 \;\; 0.000 \;\; 0.736 \rangle$. Notice that as long as the agent does not observe the goal state, it will always have some non-zero belief that it is in any of the non-goal states, since the actions have non-zero probability of failing.


3.3 Computing Belief States

A belief state $b$ is a probability distribution over $S$. We let $b(s)$ denote the probability assigned to world state $s$ by belief state $b$. The axioms of probability require that $0 \leq b(s) \leq 1$ for all $s \in S$ and that $\sum_{s \in S} b(s) = 1$. The state estimator must compute a new belief state, $b'$, given an old belief state $b$, an action $a$, and an observation $o$. The new degree of belief in some state $s'$, $b'(s')$, can be obtained from basic probability theory as follows:
\begin{align*}
b'(s') &= \Pr(s' \mid o, a, b) \\
&= \frac{\Pr(o \mid s', a, b) \Pr(s' \mid a, b)}{\Pr(o \mid a, b)} \\
&= \frac{\Pr(o \mid s', a) \sum_{s \in S} \Pr(s' \mid a, b, s) \Pr(s \mid a, b)}{\Pr(o \mid a, b)} \\
&= \frac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)} .
\end{align*}
The denominator, $\Pr(o \mid a, b)$, can be treated as a normalizing factor, independent of $s'$, that causes $b'$ to sum to 1. The state estimation function $SE(b, a, o)$ has as its output the new belief state $b'$.

Thus, the state-estimation component of a POMDP controller can be constructed quite simply from a given model.
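Indeed, the state estimator is a few lines of code once the model is in hand. The sketch below reconstructs the corridor of Figure 3 from its description in Section 3.2 (state 3 is the goal; actions succeed with probability 0.9 and otherwise move in the opposite direction, staying put at a wall) and applies the belief-update equation to it:

```python
# The state estimator SE(b, a, o), implemented directly from the
# belief-update equation, for the four-state corridor of Figure 3.
states = [1, 2, 3, 4]
GOAL = 3

def T(s, a, s2):
    """T(s, a, s'): corridor transition probabilities."""
    step = 1 if a == "east" else -1
    intended = s + step if s + step in states else s   # wall: stay put
    opposite = s - step if s - step in states else s
    return 0.9 * (s2 == intended) + 0.1 * (s2 == opposite)

def O(s2, a, o):
    """O(s', a, o): the goal is observed exactly in the goal state."""
    return 1.0 if o == ("goal" if s2 == GOAL else "corridor") else 0.0

def SE(b, a, o):
    """New belief b' from old belief b, action a, observation o."""
    unnorm = {s2: O(s2, a, o) * sum(T(s, a, s2) * b[s] for s in states)
              for s2 in states}
    z = sum(unnorm.values())          # the normalizer Pr(o | a, b)
    return {s2: v / z for s2, v in unnorm.items()}

b0 = {1: 1/3, 2: 1/3, 3: 0.0, 4: 1/3}
b1 = SE(b0, "east", "corridor")   # approx. <0.100 0.450 0.000 0.450>
b2 = SE(b1, "east", "corridor")   # approx. <0.100 0.164 0.000 0.736>
```

The two updated beliefs reproduce, up to rounding, the vectors worked out in Section 3.2.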

3.4 Finding an Optimal Policy

The policy component of a POMDP agent must map the current belief state into action. Because the belief state is a sufficient statistic, the optimal policy is the solution of a continuous-space "belief MDP." It is defined as follows:

- $B$, the set of belief states, comprises the state space;

- $A$, the set of actions, remains the same;

- $\tau(b, a, b')$ is the state-transition function, which is defined as
\[
\tau(b, a, b') = \Pr(b' \mid a, b) = \sum_{o \in \Omega} \Pr(b' \mid a, b, o) \Pr(o \mid a, b) ,
\]
where
\[
\Pr(b' \mid a, b, o) = \begin{cases} 1 & \text{if } SE(b, a, o) = b' \\ 0 & \text{otherwise} \end{cases} ;
\]
and

- $\rho(b, a)$ is the reward function on belief states, constructed from the original reward function on world states:
\[
\rho(b, a) = \sum_{s \in S} b(s) R(s, a) .
\]


Figure 4: A $t$-step policy tree.

The reward function may seem strange; the agent appears to be rewarded for merely believing that it is in good states. However, because the state estimator is constructed from a correct observation and transition model of the world, the belief state represents the true occupation probabilities for all states $s \in S$, and therefore the reward function represents the true expected reward to the agent.

This belief MDP is such that an optimal policy for it, coupled with the correct state estimator, will give rise to optimal behavior (in the discounted infinite-horizon sense) for the original POMDP [50, 1]. The remaining problem, then, is to solve this MDP. It is very difficult to solve continuous-space MDPs in the general case, but, as we shall see in the next section, the optimal value function for the belief MDP has special properties that can be exploited to simplify the problem.

4 Value Functions for POMDPs

As in the case of discrete MDPs, if we can compute the optimal value function, then we can use it to directly determine the optimal policy. This section concentrates on finding an approximation to the optimal value function. We approach the problem using value iteration to construct, at each iteration, the optimal $t$-step discounted value function over belief space.

4.1 Policy Trees

When an agent has one step remaining, all it can do is take a single action. With two steps to go, it can take an action, make an observation, then take another action, perhaps depending on the previous observation. In general, an agent's non-stationary $t$-step policy can be represented by a policy tree as shown in Figure 4. It is a tree of depth $t$ that specifies a complete $t$-step policy. The top node determines the first action to be taken. Then, depending on the resulting observation, an arc is followed to a node on the next level, which determines the next action. This is a complete recipe for $t$ steps of conditional behavior.

Now, what is the expected discounted value to be gained from executing a policy tree $p$? It depends on the true state of the world when the agent starts. In the simplest case, $p$ is a 1-step policy tree (a single action). The value of executing that action in state $s$ is
\[
V_p(s) = R(s, a(p)) ,
\]
where $a(p)$ is the action specified in the top node of policy tree $p$. More generally, if $p$ is a $t$-step policy tree, then
\begin{align*}
V_p(s) &= R(s, a(p)) + \gamma \left( \text{Expected value of the future} \right) \\
&= R(s, a(p)) + \gamma \sum_{s' \in S} \Pr(s' \mid s, a(p)) \sum_{o_i \in \Omega} \Pr(o_i \mid s', a(p))\, V_{o_i(p)}(s') \\
&= R(s, a(p)) + \gamma \sum_{s' \in S} T(s, a(p), s') \sum_{o_i \in \Omega} O(s', a(p), o_i)\, V_{o_i(p)}(s') ,
\end{align*}
where $o_i(p)$ is the $(t-1)$-step policy subtree associated with observation $o_i$ at the top level of a $t$-step policy tree $p$. The expected value of the future is computed by first taking an expectation over possible next states, $s'$, then considering the value of each of those states. The value depends on which policy subtree will be executed which, itself, depends on which observation is made. So, we take another expectation, with respect to the possible observations, of the value of executing the associated subtree, $o_i(p)$, starting in state $s'$.

Because the agent will never know the exact state of the world, it must be able to determine the value of executing a policy tree, $p$, from some belief state $b$. This is just an expectation over world states of executing $p$ in each state:
\[
V_p(b) = \sum_{s \in S} b(s) V_p(s) .
\]
It will be useful, in the following exposition, to express this more compactly. If we let $\alpha_p = \langle V_p(s_1), \ldots, V_p(s_n) \rangle$, then $V_p(b) = b \cdot \alpha_p$.

Now we have the value of executing the policy tree $p$ in every possible belief state. To construct an optimal $t$-step policy, however, it will generally be necessary to execute different policy trees from different initial belief states. Let $\mathcal{P}$ be the finite set of all $t$-step policy trees. Then
\[
V_t(b) = \max_{p \in \mathcal{P}} b \cdot \alpha_p .
\]
That is, the optimal $t$-step value of starting in belief state $b$ is the value of executing the best policy tree in that belief state.

This definition of the value function leads us to some important geometric insights into its form. Each policy tree, $p$, induces a value function that is linear in $b$, and $V_t$ is the upper surface of those functions. So, $V_t$ is piecewise-linear and convex. Figure 5 illustrates this property. Consider a world with only two states. In such a world, a belief state consists of a vector of two non-negative numbers, $\langle b(s_1), b(s_2) \rangle$, that sum to 1. Because of this constraint, a single number is sufficient to describe the belief state. The value function associated with a policy tree $p_1$, $V_{p_1}$, is a linear function of $b(s_1)$ and is shown in the figure as a line. The value functions of other policy trees are similarly represented. Finally, $V_t$ is the maximum of all the $V_{p_i}$ at each point in the belief space, giving us the upper surface, which is drawn in the figure with a bold line.
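Evaluating this representation is inexpensive: given the $\alpha$-vectors, the optimal value of a belief state is a maximum of dot products, and the maximizing vector identifies which policy tree to execute. A sketch with three invented $\alpha$-vectors over a two-state belief space:

```python
# A piecewise-linear convex value function stored as labeled alpha
# vectors: V_t(b) = max_p b . alpha_p. The three vectors are invented
# for illustration, as in the schematic of Figure 5.
alphas = [
    ("p1", [1.0, 0.0]),
    ("p2", [0.6, 0.6]),
    ("p3", [0.0, 1.0]),
]

def dot(b, alpha):
    return sum(bi * ai for bi, ai in zip(b, alpha))

def value(b):
    """Optimal value of belief b and the label of the maximizing tree."""
    name, alpha = max(alphas, key=lambda pair: dot(b, pair[1]))
    return dot(b, alpha), name

# Different regions of the belief simplex select different trees:
# near a corner the "confident" trees p1/p3 win, in the middle p2 does.
```

With these numbers, `value([0.9, 0.1])` selects `p1` while `value([0.5, 0.5])` selects `p2`, illustrating the partition of belief space that the upper surface induces.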

Figure 5: The optimal $t$-step value function is the upper surface of the value functions associated with all $t$-step policy trees.

Figure 6: A value function in three dimensions.

When there are three world states, a belief state is determined by two values (again because of the simplex constraint, which requires the individual values to be non-negative and sum to 1). The belief space can be seen as the triangle in two-space with vertices $(0, 0)$, $(1, 0)$, and $(0, 1)$. The value function associated with a single policy tree is a plane in three-space, and the optimal value function is typically a bowl shape that is composed of planar facets; an example is shown in Figure 6. This general pattern repeats itself in higher dimensions, but becomes difficult to contemplate and even harder to draw!

The convexity of the optimal value function makes intuitive sense when we think about the value of belief states. States that are in the "middle" of the belief space have high entropy: the agent is very uncertain about the real underlying state of the world. In such belief states, the agent cannot select actions very appropriately and so tends to gain less long-term reward. In low-entropy belief states, which are near the corners of the simplex, the agent can take actions more likely to be appropriate for the current state of the world and, so, gain more reward. This has some connection to the notion of "value of information," where an agent can incur a cost to move it from a high-entropy to a low-entropy state; this


Figure 7: The optimal $t$-step policy is determined by projecting the optimal value function back down onto the belief space.

is only worthwhile when the value of the information (the difference in value between the two states) exceeds the cost of gaining the information.

Given a piecewise-linear convex value function and the $t$-step policy trees from which it was derived, it is straightforward to determine the optimal policy for execution on the $t$th step from the end. The optimal value function can be projected back down onto the belief space, yielding a partition into polyhedral regions. Within each region, there is some single policy tree $p$ such that $b \cdot \alpha_p$ is maximal over the entire region. The optimal action for each belief state in this region is $a(p)$, the action in the root node of policy tree $p$. Figure 7 shows the projection of the optimal value function down into a policy partition in the two-dimensional example introduced in Figure 5; over each of the intervals illustrated, a single policy tree can be executed to maximize expected reward.

4.2 The Infinite Horizon

In the previous section, we showed that the optimal $t$-step value function is always piecewise-linear and convex. This is not necessarily true for the infinite-horizon discounted value function; it remains convex [53], but may have infinitely many facets. Still, the optimal infinite-horizon discounted value function can be approximated arbitrarily closely by a finite-horizon value function for a sufficiently long horizon [50, 42].

The optimal infinite-horizon discounted value function can be approximated via value iteration, in which the series of $t$-step discounted value functions is computed; the iteration is stopped when the difference between two successive results is small, yielding an arbitrarily good piecewise-linear and convex approximation to the desired value function. From the approximate value function we can extract a stationary policy that is approximately optimal.

Recall that the value-iteration algorithm has the following form:

loop
    t := t + 1
    compute the optimal t-step discounted value function, V_t
until sup_b |V_t(b) − V_{t-1}(b)| < ε

We have already developed the components of an extremely naive version of this algorithm.

On each iteration, we can enumerate all of the $t$-step policy trees, then compute the maximum


Figure 8: Some policy trees may be totally dominated by others and can be ignored.

of their value functions to get Vt. If the value functions are represented by sets of policy trees, the test for termination can be implemented exactly using linear programming [27].

This is, of course, hopelessly computationally intractable. Each t-step policy tree contains (|Ω|^t - 1)/(|Ω| - 1) nodes (the branching factor is |Ω|, the number of possible observations). Each node can be labeled with one of |A| possible actions, so the total number of t-step policy trees is

    |A|^((|Ω|^t - 1)/(|Ω| - 1)),

which grows astronomically in t.
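A short sketch (our own helper, not from the paper) makes the growth rate concrete:

```python
def num_policy_trees(num_actions, num_obs, t):
    """Count t-step policy trees: |A| raised to the number of tree nodes."""
    nodes = (num_obs ** t - 1) // (num_obs - 1)  # complete |O|-ary tree of depth t
    return num_actions ** nodes

# Even a tiny problem explodes: 2 actions, 2 observations.
for t in range(1, 5):
    print(t, num_policy_trees(2, 2, t))  # 2, 8, 128, 32768
```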

It is possible, in principle, that each of these policy trees might represent the optimal strategy at some point in the belief space and, hence, that each would contribute to the computation of the optimal value function. Luckily, however, this seems rarely to be the case. There are generally many policy trees whose value functions are totally dominated by or tied with value functions associated with other policy trees. Figure 8 shows a situation in which the value function associated with policy p_d is completely dominated by (everywhere less than or equal to) the value function for policy p_b. The situation with the value function for policy p_c is somewhat more complicated; although it is not completely dominated by any single value function, it is completely dominated by p_a and p_b taken together.
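The simplest case, total domination by a single tree, can be checked componentwise; a minimal sketch (our own code, not the paper's pruning routine):

```python
def prune_pointwise(alphas):
    """Drop value vectors dominated componentwise by some other vector.

    Such a vector can never be maximal at any belief state, so it is safe
    to discard. Note this simple check does NOT catch vectors (like the
    p_c case in Figure 8) dominated only by several others jointly; that
    case requires linear programming.
    """
    kept = []
    for a in alphas:
        dominated = any(all(x <= y for x, y in zip(a, other))
                        for other in alphas if other is not a)
        if not dominated:
            kept.append(a)
    return kept

alphas = [[3.0, 1.0], [1.0, 3.0], [0.5, 0.5]]  # the last is totally dominated
print(prune_pointwise(alphas))  # -> [[3.0, 1.0], [1.0, 3.0]]
```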

Given a set of policy trees, V, it is possible to define a unique² minimal subset that represents the same value function that V represents. We will call the elements of this set the useful policy trees.

The ability to find the set of useful policy trees serves as a basis for a more efficient version of the value-iteration algorithm [33]. We let V stand for a set of policy trees, though for each tree we need only actually store the top-level action and the vector of values, α.

    t := 1
    V_1 := the set of 1-step policy trees (one for each action)
    loop
        t := t + 1
        compute V_t^+, the set of possibly useful t-step policy trees, from V_{t-1}
        prune V_t^+ to get V_t, the useful set of t-step policy trees
    until sup_b |V_t(b) - V_{t-1}(b)| < ε

²We assume here that two policy trees with the same value function are identical.

The idea behind this algorithm is the following: V_{t-1}, the set of useful (t-1)-step policy trees, can be used to construct a superset of the useful t-step policy trees. A t-step policy tree is composed of a root node with an associated action, a, and |Ω| subtrees, which are (t-1)-step policy trees. We propose to restrict our choice of subtrees to those (t-1)-step policy trees that were useful. For any belief state and any choice of policy subtree, there is always a useful subtree that is at least as good at that state, so there is never any reason to include a non-useful policy subtree.

The time complexity of a single iteration of this algorithm can be divided into two parts: generation and pruning. There are |A| · |V_{t-1}|^{|Ω|} elements in V_t^+: there are |A| different ways to choose the action, and all possible lists of length |Ω| may be chosen from the set V_{t-1} to form the subtrees. The value functions for the policy trees in V_t^+ can be computed efficiently from those of the subtrees.
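The generation step can be sketched with itertools.product (a toy illustration with made-up action and observation names, not the paper's implementation):

```python
import itertools

def generate_candidates(actions, observations, v_prev):
    """Build V_t^+ : every (root action, one (t-1)-step subtree per observation)."""
    candidates = []
    for a in actions:
        # itertools.product enumerates all |V_{t-1}|**|O| subtree assignments
        for subtrees in itertools.product(v_prev, repeat=len(observations)):
            candidates.append((a, dict(zip(observations, subtrees))))
    return candidates

v_prev = ["tree1", "tree2", "tree3"]  # placeholder (t-1)-step trees
cands = generate_candidates(["fwd", "turn"], ["corridor", "junction"], v_prev)
print(len(cands))  # -> |A| * |V|**|O| = 2 * 3**2 = 18
```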

Although this algorithm may represent a large computational savings over the naive enumeration strategy, it still does more work than may be necessary. Even if V_t is very small, we must go through the step of generating V_t^+, which always has size exponential in |Ω|. In the next section, we sketch a more efficient algorithm.

4.3 The Witness Algorithm

To improve the complexity of the value-iteration algorithm, we must avoid generating V_t^+; instead, we would like to generate the elements of V_t directly. If we could do this, we might be able to reach a computation time per iteration that is polynomial in |S|, |A|, |Ω|, |V_{t-1}|, and |V_t|. Cheng [8] and Smallwood and Sondik [48] also try to avoid generating all of V_t^+ by constructing V_t directly. However, their algorithms still have worst-case running times exponential in at least one of the problem parameters [25]. In fact, the existence of an algorithm that runs in time polynomial in |S|, |A|, |Ω|, |V_{t-1}|, and |V_t| would settle the long-standing complexity-theoretic question "Does NP = RP?" in the affirmative [25], so we will pursue a slightly different approach.

Instead of computing V_t directly, we will compute, for each action a, a set Q_t^a of t-step policy trees that have action a at their root. We can compute V_t by taking the union of the Q_t^a sets for all actions and pruning as described in the previous section. The witness algorithm is a method for computing Q_t^a in time polynomial in |S|, |A|, |Ω|, |V_{t-1}|, and |Q_t^a| (specifically, the running time is polynomial in the size of the inputs, the outputs, and an important intermediate result). It is possible that the Q_t^a are exponentially larger than V_t, but this seems rarely to be the case in practice³.

In what sense, then, is the witness algorithm superior to previous algorithms for solving pomdps? Experiments indicate that the witness algorithm is faster in practice over a wide range of problem sizes [25]. The primary complexity-theoretic difference is that the witness algorithm runs in time polynomial in the number of policy trees in Q_t^a. There are example problems that cause the other algorithms, although they never construct the Q_t^a's directly, to run in time exponential in the number of policy trees in Q_t^a. That means, if we restrict ourselves to problems in which |Q_t^a| is polynomial, that the resulting running time is polynomial. It is worth noting, however, that it is possible to create families of pomdps that Cheng's algorithm can solve in polynomial time but that take the witness algorithm exponential time to solve; they are problems in which S and V_t are very small and Q_t^a is exponentially larger for some action a.

    V_1 := {<0, 0, ..., 0>}
    t := 1
    loop
        t := t + 1
        foreach a in A
            Q_t^a := witness(V_{t-1}, a)
        prune ∪_a Q_t^a to get V_t
    until sup_b |V_t(b) - V_{t-1}(b)| < ε

Table 2: Outer loop of the witness algorithm.

³A more recent algorithm by Zhang [56], inspired by the witness algorithm, has the same asymptotic complexity but appears to be the current fastest algorithm empirically for this problem.

From the definition of the state estimator, SE, and the t-step value function, V_t(b), we can express Q_t^a(b) formally as

    Q_t^a(b) = Σ_{s∈S} b(s) R(s, a) + γ Σ_{o∈Ω} Pr(o | a, b) V_{t-1}(b'_o),

where b'_o is the belief state resulting from taking action a and observing o from belief state b; that is, b'_o = SE(b, a, o). Since V_t is the value of the best action, we have V_t(b) = max_a Q_t^a(b).
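This one-step lookahead can be sketched for a toy two-state, one-action pomdp (the T, O, R numbers and γ below are made up for illustration, and V_{t-1} is passed in as a stub; none of this is from the paper):

```python
GAMMA = 0.9
S, OBS = [0, 1], ["x", "y"]
T = {0: [0.9, 0.1], 1: [0.2, 0.8]}      # T[s][s'] for the single action
O = {"x": [0.7, 0.4], "y": [0.3, 0.6]}  # O[o][s'] = Pr(o | s')
R = [1.0, 0.0]                          # R(s, a) for the single action

def pr_obs(o, b):
    # Pr(o | a, b) = sum_{s'} O[o][s'] * sum_s T[s][s'] * b(s)
    return sum(O[o][s2] * sum(T[s][s2] * b[s] for s in S) for s2 in S)

def state_estimate(b, o):
    # b'_o = SE(b, a, o): Bayes update of the belief after acting and observing
    p = pr_obs(o, b)
    return [O[o][s2] * sum(T[s][s2] * b[s] for s in S) / p for s2 in S]

def q_value(b, v_prev):
    # Q_t^a(b) = expected immediate reward + gamma * expected V_{t-1}(b'_o)
    immediate = sum(b[s] * R[s] for s in S)
    future = sum(pr_obs(o, b) * v_prev(state_estimate(b, o)) for o in OBS)
    return immediate + GAMMA * future

print(q_value([0.5, 0.5], lambda b: 0.0))  # with V_{t-1} = 0: just E[R] = 0.5
```

With a zero continuation value the lookahead reduces to the expected immediate reward, which makes the two terms of the equation easy to check separately.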

Using arguments similar to those in the previous section, we can show that these Q-functions are piecewise-linear and convex and can be represented by collections of policy trees. Let Q_t^a be the collection of policy trees that specifies Q_t^a. Once again, we can define a unique minimal useful set of policy trees for each Q-function. Note that the policy trees needed to represent the function V_t are a subset of the policy trees needed to represent all of the Q_t^a functions: V_t ⊆ ∪_a Q_t^a. This is because maximizing over actions and then policy trees is the same as maximizing over the pooled sets of policy trees.

The code in Table 2 outlines our approach to solving pomdps. The basic structure remains that of value iteration. At iteration t, the algorithm has a representation of the optimal t-step value function. Within the value-iteration loop, separate Q-functions are found for each action, represented by sets of policy trees. The union of these sets forms a representation of the optimal value function. Since there may be extraneous policy trees in the combined set, it is pruned to yield the useful set of t-step policy trees, Vt.

4.3.1 Witness inner loop

The basic structure of the witness algorithm is as follows. We would like to find a minimal set of policy trees for representing Q_t^a for each a. We consider the Q-functions one at a time.

The set U of policy trees is initialized with a single policy tree that is the best for some arbitrary belief state (this is easy to do). At each iteration we ask: is there some belief state, b, for which the true value, Q_t^a(b), computed by one-step lookahead using V_{t-1}, is different
