Continuous-time Markov Decision Process - State-Dependent Routing in Large Networks: A Case Study of Improving Shortest-Path Routing

In this subsection, we introduce the basic idea of the continuous-time Markov decision process. We call a_ij the transition rate of a process from state i to state j, for i ≠ j. The quantity a_ij is defined as follows: in a short time interval dt, a process that is now in state i will make a transition to state j with probability a_ij dt (i ≠ j). The probability of two or more state transitions is of the order of (dt)^2 or higher and is assumed to be zero if dt is taken sufficiently small. We shall consider only those processes for which the transition rates a_ij are constants, and we describe the continuous-time Markov process by a transition rate matrix A with components a_ij. Note that the diagonal components have not yet been defined. The probability that the system occupies state i at a time t after the start of the process is the state probability π_i(t). We relate the state probabilities at a time t to those a short time dt later by the equations

$$\pi_j(t + dt) = \pi_j(t)\Bigl[1 - \sum_{i \neq j} a_{ji}\,dt\Bigr] + \sum_{i \neq j} \pi_i(t)\,a_{ij}\,dt, \qquad j = 1, 2, \dots, N. \qquad (3.1.1)$$

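Let us define the diagonal elements of the A matrix by

$$a_{jj} = -\sum_{i \neq j} a_{ji}.$$

If this definition is used in Eq. 3.1.1, we have

$$\pi_j(t + dt) = \pi_j(t)\,[1 + a_{jj}\,dt] + \sum_{i \neq j} \pi_i(t)\,a_{ij}\,dt = \pi_j(t) + \sum_{i=1}^{N} \pi_i(t)\,a_{ij}\,dt.$$

Subtracting π_j(t) from both sides, dividing by dt, and taking the limit as dt → 0, we obtain

$$\frac{d\pi_j(t)}{dt} = \sum_{i=1}^{N} \pi_i(t)\,a_{ij}, \qquad j = 1, 2, \dots, N. \qquad (3.1.2)$$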

Equations 3.1.2 are a set of N linear constant-coefficient differential equations that relate the state probabilities to the transition-rate matrix A. If a solution is to be obtained, the initial conditions π_i(0) for i = 1, 2, …, N must be specified. In matrix form we can write Eqs. 3.1.2 as

$$\frac{d\pi(t)}{dt} = \pi(t)\,A \qquad (3.1.3)$$

where π(t) is the row vector of state probabilities at time t. The elements of matrix A are given above, and as a result the rows of A sum to zero, or

$$\sum_{j=1}^{N} a_{ij} = 0, \qquad i = 1, 2, \dots, N.$$
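The relation between the transition-rate matrix and the state probabilities can be illustrated numerically. The following is a minimal sketch, not part of the original derivation: it fills in the diagonal of A so that each row sums to zero and integrates Eq. 3.1.3 with simple forward Euler steps. The two-state rates, the function names, and the step count are hypothetical choices made only for illustration.

```python
def make_rate_matrix(off_diagonal):
    """Set a_jj = -(sum of the other entries in row j) so every row sums to zero."""
    n = len(off_diagonal)
    A = [row[:] for row in off_diagonal]
    for j in range(n):
        A[j][j] = -sum(A[j][i] for i in range(n) if i != j)
    return A


def state_probabilities(A, pi0, t, steps=10_000):
    """Integrate d(pi)/dt = pi A (Eq. 3.1.3) from pi(0) = pi0 with forward Euler."""
    n = len(A)
    dt = t / steps
    pi = list(pi0)
    for _ in range(steps):
        # pi_j(t + dt) = pi_j(t) + sum_i pi_i(t) a_ij dt
        pi = [pi[j] + dt * sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical two-state process: rate 2 from state 1 to state 2, rate 1 back.
A = make_rate_matrix([[0.0, 2.0], [1.0, 0.0]])
pi_t = state_probabilities(A, [1.0, 0.0], t=10.0)
print(pi_t)  # close to the limiting probabilities (1/3, 2/3)
```

For this example the limiting probabilities follow from π A = 0, which gives 2 π_1 = π_2 and hence π = (1/3, 2/3). A matrix-exponential routine would be more accurate than Euler steps; the simple scheme is used here only because it mirrors the dt argument in the text.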

Now we define the concept of reward. Suppose that the system earns a reward at the rate of r_ii dollars per unit time during all the time that it occupies state i, and that when it makes a transition from state i to state j (i ≠ j) it receives a reward of r_ij dollars. We are interested in the expected earnings of the system when it operates for a time t with a given initial condition. We let v_i(t) be the expected total reward that the system will earn in a time t starting in state i, and then we can relate the total expected reward in a time t + dt, v_i(t + dt), to v_i(t) by Eq. 3.1.4:

$$v_i(t + dt) = \Bigl(1 - \sum_{j \neq i} a_{ij}\,dt\Bigr)\bigl[r_{ii}\,dt + v_i(t)\bigr] + \sum_{j \neq i} a_{ij}\,dt\,\bigl[r_{ij} + v_j(t)\bigr]. \qquad (3.1.4)$$

Eq. 3.1.4 is interpreted as follows. During the time interval dt the system may either remain in state i or make a transition to another state j. If it remains in state i for a time dt, it will earn a reward r_ii dt plus the expected reward that it will earn in the remaining t units of time, v_i(t). The probability that it remains in state i for a time dt is 1 minus the probability that it makes a transition in dt. On the other hand, it may make a transition to some state j ≠ i during the time interval dt with probability a_ij dt. In this case, the system will receive the reward r_ij plus the expected reward to be made starting in state j with time t remaining. Using the definition of the diagonal elements of A, we rewrite Eq. 3.1.4 as

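$$v_i(t + dt) = r_{ii}\,dt + v_i(t) + \sum_{j \neq i} a_{ij}\,r_{ij}\,dt + \sum_{j=1}^{N} a_{ij}\,v_j(t)\,dt$$

where terms of higher order than dt have been neglected. Finally, if we subtract v_i(t) from both sides of the equation and divide by dt, we have

$$\frac{v_i(t + dt) - v_i(t)}{dt} = r_{ii} + \sum_{j \neq i} a_{ij}\,r_{ij} + \sum_{j=1}^{N} a_{ij}\,v_j(t)$$

and, taking the limit as dt → 0, we obtain

$$\frac{dv_i(t)}{dt} = r_{ii} + \sum_{j \neq i} a_{ij}\,r_{ij} + \sum_{j=1}^{N} a_{ij}\,v_j(t), \qquad i = 1, 2, \dots, N.$$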

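We have a set of N linear constant-coefficient differential equations that completely define v_i(t) when the v_i(0) are known. Let us define a quantity q_i as the “earning rate” of state i:

$$q_i = r_{ii} + \sum_{j \neq i} a_{ij}\,r_{ij}.$$

For a given policy, the total expected reward of the system in time t is governed by Eqs. 3.1.5:

$$\frac{dv_i(t)}{dt} = q_i + \sum_{j=1}^{N} a_{ij}\,v_j(t), \qquad i = 1, 2, \dots, N. \qquad (3.1.5)$$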

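Since we are concerned only with processes whose termination is remote, we may use the asymptotic expression (Eq. 3.1.6) for v_i(t):

$$v_i(t) = t\,g_i + v_i \quad \text{for large } t, \qquad (3.1.6)$$

where g_i is the gain of state i and v_i is its relative value. Substituting Eq. 3.1.6 into Eqs. 3.1.5 gives

$$g_i = q_i + t \sum_{j=1}^{N} a_{ij}\,g_j + \sum_{j=1}^{N} a_{ij}\,v_j, \qquad i = 1, 2, \dots, N. \qquad (3.1.7)$$

If Eqs. 3.1.7 are to hold for all large t, then we obtain the two sets of linear algebraic equations

$$\sum_{j=1}^{N} a_{ij}\,g_j = 0, \qquad i = 1, 2, \dots, N, \qquad (3.1.8)$$

$$g_i = q_i + \sum_{j=1}^{N} a_{ij}\,v_j, \qquad i = 1, 2, \dots, N. \qquad (3.1.9)$$

Solution of Eqs. 3.1.8 expresses the gain of each state in terms of the gains of the recurrent chains in the process. The relative value of one state in each chain is set equal to zero, and Eqs. 3.1.9 are used to solve for the remaining relative values and the gains of the recurrent chains.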

Suppose that we have a policy that is optimal when t units of time remain, and that this policy has expected total rewards v_i(t). If we are considering what policy to follow if more time than t is available, we see from Eqs. 3.1.5 that we may maximize our rate of increase of v_i(t) by maximizing

$$q_i^k + \sum_{j=1}^{N} a_{ij}^k\,v_j(t) \qquad (3.1.10)$$

with respect to the alternatives k in state i. If t is large, we may use Eq. 3.1.6 to obtain

$$q_i^k + \sum_{j=1}^{N} a_{ij}^k\,(t\,g_j + v_j) \qquad (3.1.11)$$

as the quantity to be maximized in the ith state. For large t, Expression 3.1.11 is maximized by the alternative that maximizes

$$\sum_{j=1}^{N} a_{ij}^k\,g_j, \qquad (3.1.12)$$

the gain test quantity, using the gains of the old policy. However, when all alternatives produce the same value of Expression 3.1.12 or when a group of alternatives produces the same maximum value, then the tie is broken by the alternative that maximizes

$$q_i^k + \sum_{j=1}^{N} a_{ij}^k\,v_j, \qquad (3.1.13)$$

the value test quantity, using the relative values of the old policy. The relative values may be used for the value test because a constant difference will not affect decisions within a chain. The general iteration cycle is shown in Figure 3.1.

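If, as is usually the case, all possible policies of the problem are completely ergodic, the computational process may be considerably simplified. Since all states of each Markov process have the same gain g, the value-determination operation involves only the solution of the equations

$$g = q_i + \sum_{j=1}^{N} a_{ij}\,v_j, \qquad i = 1, 2, \dots, N, \qquad (3.1.14)$$

with one relative value set equal to zero. The solution for g and the remaining relative values is then used to find an improved policy. Multiplication of Eqs. 3.1.14 by the limiting state probability π_i and summation over i show that

$$g = \sum_{i=1}^{N} \pi_i\,q_i,$$

a result previously obtained.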

The policy-improvement routine becomes simply: for each state i, find the alternative k that maximizes

$$q_i^k + \sum_{j=1}^{N} a_{ij}^k\,v_j$$

using the relative values of the previous policy. This alternative becomes the new decision in the ith state. A new policy has been found when this procedure has been performed for every state. The iteration cycle for completely ergodic continuous-time systems is shown in Figure 3.2. Note that, if the iteration is started in the policy-improvement routine with all v_i = 0, the initial policy selected is the one that maximizes the earning rate of each state.
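The iteration cycle for the completely ergodic case can be sketched in a few lines of code. The following is an illustrative sketch under stated assumptions, not the implementation used in this work: each state has a list of alternatives given as a pair (earning rate q, full row of transition rates including the diagonal), the last state's relative value is the one fixed at zero, and the two-state data are hypothetical numbers chosen so the result can be checked by hand.

```python
def solve_linear(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def evaluate(alternatives, policy):
    """Value determination: solve g = q_i + sum_j a_ij v_j with v_N = 0."""
    n = len(policy)
    q = [alternatives[i][policy[i]][0] for i in range(n)]
    A = [alternatives[i][policy[i]][1] for i in range(n)]
    # Unknowns are (g, v_1, ..., v_{N-1}); row i reads g - sum_j a_ij v_j = q_i.
    M = [[1.0] + [-A[i][j] for j in range(n - 1)] for i in range(n)]
    x = solve_linear(M, q)
    return x[0], x[1:] + [0.0]


def improve(alternatives, v, old_policy):
    """Policy improvement: maximize q_i^k + sum_j a_ij^k v_j in each state."""
    new_policy = []
    for i, alts in enumerate(alternatives):
        scores = [q + sum(a * vj for a, vj in zip(row, v)) for q, row in alts]
        best = max(range(len(alts)), key=lambda k: scores[k])
        old = old_policy[i]
        if old is not None and scores[old] >= scores[best] - 1e-12:
            best = old  # keep the old decision when it is as good as any other
        new_policy.append(best)
    return new_policy


def policy_iteration(alternatives):
    n = len(alternatives)
    policy, v, g = [None] * n, [0.0] * n, None  # start with all v_i = 0
    while True:
        new_policy = improve(alternatives, v, policy)
        if new_policy == policy:
            return policy, g, v
        policy = new_policy
        g, v = evaluate(alternatives, policy)


# Hypothetical data: state 0 has two operating modes, state 1 two repair modes.
alternatives = [
    [(10.0, [-2.0, 2.0]), (8.0, [-0.5, 0.5])],
    [(-1.0, [1.0, -1.0]), (-5.0, [4.0, -4.0])],
]
policy, g, v = policy_iteration(alternatives)
print(policy, g, v)  # policy [1, 1], gain 59/9, relative values (26/9, 0)
```

Starting from all v_i = 0, the first improvement step picks the alternative with the largest earning rate in each state, as noted above; the second cycle switches both states to alternative 1 and the third confirms that policy.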

Policy Evaluation

Use the a_ij and q_i of the current policy to solve Eqs. 3.1.8 and 3.1.9 for the gains g_i and relative values v_i, setting the relative value of one state in each recurrent chain equal to zero.

Policy Improvement

For each state i, determine the alternative k that maximizes

$$\sum_{j=1}^{N} a_{ij}^k\,g_j$$

using the gains g_j of the previous policy, and make it the new decision in the ith state.

If this gain test quantity is the same for all alternatives, or if several alternatives are equally good according to this test, the decision must be made on the basis of relative values rather than gains. Therefore, if the gain test fails, break the tie by determining the alternative k that maximizes

$$q_i^k + \sum_{j=1}^{N} a_{ij}^k\,v_j$$

using the relative values of the previous policy, and by making it the new decision in the ith state.

Regardless of whether the policy-improvement test is based on gains or values, if the old decision in the ith state yields as high a value of the test quantity as any other alternative, leave the old decision unchanged. This rule assures convergence in the case of equivalent policies.

When this procedure has been repeated for all states, a new policy has been determined, and new [a_ij] and [q_i] matrices have been obtained. If the new policy is the same as the previous one, the iteration process has converged, and the best policy has been found; otherwise, enter the upper box.

Figure 3.1 General iteration cycle for continuous-time decision processes.

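Value-Determination Operation

Use the a_ij and q_i of the current policy to solve Eqs. 3.1.14 for the gain g and the relative values v_i, with one relative value set equal to zero.

Policy-Improvement Routine

For each state i, find the alternative k' that maximizes

$$q_i^k + \sum_{j=1}^{N} a_{ij}^k\,v_j$$

using the relative values v_i of the previous policy. Then k' becomes the new decision in the ith state, q_i^k' becomes q_i, and a_ij^k' becomes a_ij.

Figure 3.2 Iteration cycle for completely ergodic continuous-time decision processes.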

3.2 Lagrangian multiplier

The classic technique for dealing with constrained maximization problems is the method of Lagrangian multipliers. In a constrained maximization problem, one wishes to maximize a function f(x) over some set X of x-values, subject to the constraint

g(x) = b. (3.2.1)

If there are several constraints, then g and b must be regarded as vectors. The commonest example in an economic-technical context is that of allocation, when x represents a pattern of activity (such as the amounts of various goods which are to be manufactured), f(x) the consequent economic return, g(x) the consequent consumption of necessary resources (such as capital, labour, raw materials, energy, etc.) and b the amounts of these resources available. Actually, for this example one should rather modify the constraint (3.2.1) to

g(x) ≦ b (3.2.2)

since there is usually no compulsion to exhaust all resources.

The method of Lagrangian multipliers is based on the fact that, if x̂ is a value in X maximizing f(x) subject to (3.2.1), then under some conditions there exists a multiplier vector y with the property that the form f(x) - y^Tg(x) is stationary at x = x̂. This assertion, when true, we shall refer to as the weak or classical Lagrangian principle.

Under certain conditions a stronger conclusion can be deduced: that x̂ maximizes the Lagrangian form f(x) - y^Tg(x) absolutely in X. This assertion, when true, we shall refer to as the strong Lagrangian principle. The simple nature of the strong principle obviously makes it attractive. For instance, there is no mention of derivatives and, indeed, the principle may hold in cases when f(x) or g(x) does not possess derivatives at x̂ (although compensating conditions of some other nature are required). Neither does the fact that x̂ may be a boundary point of X affect the statement of the principle. Both Lagrangian principles can be adapted to the case of inequality constraints such as (3.2.2), or indeed, to very much more general forms of constraint.

Suppose that the point x̂ + εs, where ε is a non-negative scalar and s a vector, belongs to X for all non-negative ε less than some positive value ε(s). Then s will be described as directed into the interior of X from x̂, or as a feasible direction from x̂. Denote the row vector of derivatives

$$\left[\frac{\partial f}{\partial x_1},\; \frac{\partial f}{\partial x_2},\; \dots,\; \frac{\partial f}{\partial x_n}\right],$$

if this exists, by f_x. If x̂ maximizes f in X and the derivative f_x exists at x̂, then f_x s ≦ 0 at x̂ for feasible directions s from x̂. In particular, if x̂ is an interior point of X, then

f_x = 0 (3.2.3)

at x̂. Criterion (3.2.3) is the classic stationary condition, the condition that x̂ be a stationary point of f. For the weak Lagrangian principle, an appropriate stationary point of the Lagrangian form is located, and y is then adjusted until the constraint (3.2.1) is satisfied at the stationary point.
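The procedure of locating a maximizer of the Lagrangian form and then adjusting y until the constraint holds can be sketched on a small allocation problem. The example below is an assumption made purely for illustration and does not come from the text: the return is f(x) = Σ w_i √x_i, the single constraint is g(x) = Σ x_i = b, and the multiplier is found by bisection. The function name, weights, and bracket for y are hypothetical.

```python
import math


def allocate(weights, b, iterations=200):
    """Maximize sum(w_i * sqrt(x_i)) subject to sum(x_i) = b with x_i >= 0."""

    def best_x(y):
        # For fixed y > 0, w * sqrt(x) - y * x is maximized at x = (w / (2 y))**2,
        # so the Lagrangian form f(x) - y g(x) separates over the activities.
        return [(w / (2.0 * y)) ** 2 for w in weights]

    lo, hi = 1e-9, 1e9  # assumed bracket for the multiplier y
    for _ in range(iterations):
        y = math.sqrt(lo * hi)  # bisection on a logarithmic scale
        if sum(best_x(y)) > b:
            lo = y  # resource over-used: raise the price y
        else:
            hi = y  # resource under-used: lower the price y
    y = math.sqrt(lo * hi)
    return y, best_x(y)


y, x = allocate([3.0, 4.0], b=25.0)
print(y, x)  # y close to 0.5 and x close to [9, 16]
```

For this data the answer can be checked by hand: Σ x_i = (9 + 16) / (4 y²) = 25 gives y = 0.5 and x = (9, 16). The multiplier acts as a price on the resource, and the usage Σ x_i decreases as y increases, which is what makes bisection applicable here.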

3.3 Maximum Flow Algorithm

Network flow problems are linear programs with the particularly useful property that they possess optimal solutions in integers. In this subsection we review “classical” network flow theory, including the max-flow min-cut theorem and computation of minimum cost flows.

Suppose that each arc (i, j) of a directed graph G has assigned to it a non-negative number cij, the capacity of (i, j). The capacity can be thought of as representing the maximum amount of some commodity that can “flow” through the arc per unit time in a steady-state situation. Such a flow is permitted only in the indicated direction of the arc, i.e., from i to j.

Consider the problem of finding a maximal flow from a source node s to a sink node t, which can be formulated as follows. Let

xij = the amount of flow through arc (i, j).

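Then, 0 ≦ x_ij ≦ c_ij. (3.3.1)

A conservation law is observed at each of the nodes other than s or t. That is, what goes out of node i must be equal to what comes in. So we have

$$\sum_j x_{ji} - \sum_j x_{ij} = \begin{cases} -v, & i = s, \\ 0, & i \neq s, t, \\ v, & i = t. \end{cases} \qquad (3.3.2)$$

We call any set of numbers x = (x_ij) which satisfy (3.3.1) and (3.3.2) a feasible flow, or simply a flow, and v is its value. The problem of finding a maximum value flow from s to t is a linear program in which the objective is to maximize v subject to constraints (3.3.1) and (3.3.2).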

Let P be an undirected path from s to t. An arc (i, j) in P is said to be a forward arc if it is directed from s toward t and backward otherwise. P is said to be a flow augmenting path with respect to a given flow x = (xij) if xij < cij for each forward arc (i, j) and xij > 0 for each backward arc in P.

An (s, t)-cutset is identified by a pair (S, T) of complementary subsets of nodes, with s∈S and t∈T. The capacity of the cutset (S, T) is defined as

$$c(S, T) = \sum_{i \in S} \sum_{j \in T} c_{ij},$$

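i.e., the sum of the capacities of all arcs which are directed from S to T. The value of any (s, t)-flow cannot exceed the capacity of any (s, t)-cutset. Suppose that x = (x_ij) is a flow and (S, T) is an (s, t)-cutset. Sum the equations (3.3.2) identified with nodes i∈S to obtain

$$v = \sum_{i \in S} \sum_{j \in T} (x_{ij} - x_{ji}) \le \sum_{i \in S} \sum_{j \in T} c_{ij} = c(S, T),$$

since the terms with both i and j in S cancel, x_ij ≦ c_ij, and x_ji ≧ 0.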

It follows from the preceding analysis that, if the value of a flow equals the capacity of some cutset (S, T), the flow is maximal and the cutset has minimal capacity. Notice that in this case each arc (i, j) is saturated, i.e., x_ij = c_ij, if i∈S, j∈T, and void, i.e., x_ij = 0, if i∈T, j∈S. We now state three of the principal theorems of network flow theory. They will later be applied to yield good algorithms for maximal flow problems.

Theorem (Augmenting Path Theorem) A flow is maximal if and only if it admits no augmenting path from s to t.

Theorem (Integral Flow Theorem) If all arc capacities are integers, there is a maximal flow which is integral.

Theorem (Max-Flow Min-Cut Theorem) The maximum value of an (s, t)-flow is equal to the minimum capacity of an (s, t)-cutset.

The problem of finding a maximum capacity flow augmenting path is evidently quite similar to the problem of finding a shortest path, or, more precisely, a path in which the minimum arc length is maximum. We can make the similarity quite clear, as follows. Let

$$\hat{c}_{ij} = \max\{c_{ij} - x_{ij},\; x_{ji}\},$$

where c_ij = 0 if there is no arc (i, j). Let u_i = the capacity of a maximum capacity augmenting path from node s to node i. Then the analogues of Bellman’s equations are:

$$u_s = +\infty, \qquad u_i = \max_{k} \min\{u_k,\; \hat{c}_{ki}\}, \quad i \neq s.$$

It is clear that the u_i values and the corresponding maximum capacity paths can be found by a Dijkstra-like computation which is O(n²). Actually, we shall be satisfied with a computation which does not necessarily compute maximum capacity paths. A procedure in which labels are given to nodes is proposed. These labels are of the form (i⁺, δ_j) or (i⁻, δ_j). A label (i⁺, δ_j) indicates that there exists an augmenting path with capacity δ_j from the source to the node j in question, and that (i, j) is the last arc in this path. A label (i⁻, δ_j) indicates that (j, i) is the last arc in the path, i.e., (j, i) will be a backward arc if the path is extended to the sink t. Initially only the source node s is labeled with the special label (-, ∞). Thereafter, additional nodes are labeled in one of two ways:

If node i is labeled and there is an arc (i, j) for which x_ij < c_ij, then the unlabeled node j can be given the label (i⁺, δ_j), where δ_j = min{c_ij − x_ij, δ_i}.

If node i is labeled and there is an arc (j, i) for which x_ji > 0, then the unlabeled node j can be given the label (i⁻, δ_j), where δ_j = min{x_ji, δ_i}.

When the procedure succeeds in labeling node t, an augmenting path has been found and the value of the flow can be augmented by δ_t. If the procedure concludes without labeling node t, then no augmenting path exists. A minimum capacity cutset (S, T) is constructed by letting S contain all labeled nodes and T contain all unlabeled nodes.

A labeled node is either “scanned” or “unscanned.” A node is scanned by examining all incident arcs and applying labels to previously unlabeled adjacent nodes, according to the rules given above.

The maximal flow algorithm is as follows:

Step 0 (Start)

Let x = (x_ij) be any integral feasible flow, possibly the zero flow. Give node s the permanent label (-, ∞).

Step 1 (Labeling and Scanning)

(1.1) If all labeled nodes have been scanned, go to Step 3.

(1.2) Find a labeled but unscanned node i and scan it as follows: For each arc (i, j), if x_ij < c_ij and j is unlabeled, give j the label (i⁺, δ_j), where

$$\delta_j = \min\{c_{ij} - x_{ij},\; \delta_i\}.$$

For each arc (j, i), if x_ji > 0 and j is unlabeled, give j the label (i⁻, δ_j), where

$$\delta_j = \min\{x_{ji},\; \delta_i\}.$$

(1.3) If node t has been labeled, go to Step 2; otherwise go to Step 1.1.

Step 2 (Augmentation)

Starting at node t, use the index labels to construct an augmenting path. (The label on node t indicates the second-to-last node in the path, the label on that node indicates the third-to-last node, and so on.) Augment the flow by increasing and decreasing the arc flows by δ_t, as indicated by the superscripts on the index labels.

Erase all labels, except the label on node s. Go to Step 1.

Step 3 (Construction of Minimal Cut)

The existing flow is maximal. A cutset of minimum capacity is obtained by placing all labeled nodes in S and all unlabeled nodes in T. The computation is completed.
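Steps 0 to 3 can be sketched compactly in code. This is an illustrative sketch rather than the implementation used in this work: it assumes integer capacities stored in a dictionary keyed by arc (i, j), scans labeled nodes in first-in first-out order (the text leaves the scanning order open), and finds the arcs incident to a node by looping over all arcs, which is simple but not efficient. The five-arc network is a hypothetical example.

```python
from collections import deque


def max_flow(capacity, s, t):
    """Labeling algorithm; capacity maps each arc (i, j) to an integer c_ij >= 0."""
    flow = {arc: 0 for arc in capacity}  # Step 0: start from the zero flow
    nodes = {n for arc in capacity for n in arc} | {s, t}
    while True:
        # Step 1: label[j] = (i, sign, delta_j); the source gets (-, infinity).
        label = {s: (None, None, float("inf"))}
        unscanned = deque([s])
        while unscanned and t not in label:
            i = unscanned.popleft()
            delta_i = label[i][2]
            for (u, w), c in capacity.items():
                if u == i and w not in label and flow[(u, w)] < c:
                    # forward arc (i, w) with spare capacity
                    label[w] = (i, "+", min(c - flow[(u, w)], delta_i))
                    unscanned.append(w)
                elif w == i and u not in label and flow[(u, w)] > 0:
                    # backward arc (u, i) carrying positive flow
                    label[u] = (i, "-", min(flow[(u, w)], delta_i))
                    unscanned.append(u)
        if t not in label:
            # Step 3: labeled nodes form S, unlabeled nodes form T.
            S = set(label)
            return flow, S, nodes - S
        # Step 2: trace the labels back from t and augment by delta_t.
        delta_t = label[t][2]
        j = t
        while j != s:
            i, sign, _ = label[j]
            if sign == "+":
                flow[(i, j)] += delta_t
            else:
                flow[(j, i)] -= delta_t
            j = i


capacity = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3}
flow, S, T = max_flow(capacity, "s", "t")
value = sum(f for (i, j), f in flow.items() if i == "s")
cut = sum(c for (i, j), c in capacity.items() if i in S and j in T)
print(value, cut)  # both equal 5, as the max-flow min-cut theorem requires
```

In this example three augmentations (of 2, 2, and 1 units) saturate both arcs leaving s, so the final labeling pass labels only the source and the minimum cutset is S = {s}.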

Chapter 4 Proposed Scheme

In this chapter, we first formulate the routing problem as a constrained Markov decision process and find the optimal policy (the partitions on each link) corresponding to the value of the Lagrangian multiplier, here called λ, by using the value-determination and policy-improvement procedures. Second, we modify the original maximum flow algorithm and describe how each source node uses the modified algorithm, when a new call arrives, to choose the best feasible route or the shortest path so that fewer calls are lost in the future.

4.1 Link State Aggregation

We describe the network as a set of nodes N and a set of links K connecting the nodes. Hence K = {1, 2, …, N(N−1)/2}. Any call attempt of type k∈K is a call attempt on link k. Link k contains C_k ≧ 0 trunks. In this subsection, we first make some assumptions. Suppose that the network is offered only one class of call and operates in a lost-call model. That is, all arriving calls request the same number of trunks, which we take as the unit and give the value 1, and if an arriving call is blocked, it is cleared and no retransmission occurs. When a call is accepted, we assume that the call setup is instantaneous.

In treating traffic routing as a Markov decision process and obtaining, in the light of Markov decision theory, the cost for the network to carry a connection, the network state requires a detailed specification of the number of calls in progress on each feasible path. Because computing with this network state is enormously expensive, the network state should be approximated by a simpler definition for practical implementation. Two assumptions are made based on previous research: statistical link independence and separable cost. First, link independence assumes that when a call is routed on an n-link route, it is set up as n independent single-link calls with independent, identically distributed holding times, which also terminate independently. The holding time of each call is exponentially distributed with the same mean, µ⁻¹, which is used as the unit of time and given the value 1. This assumption is needed in order to reduce the routing information from the specification of the whole network to the specification of one trunk group. Second, separable cost assumes that the cost of carrying a call over a route is the sum of the costs of carrying a directed-link call over each link of the route. This further reduction of the state information helps to relieve the enormous computation. Under the assumptions above, the network state is decomposed into a vector of link states. Each link state is described by the number of occupied trunks on the link.
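Under these assumptions a single link can be viewed as a small continuous-time Markov process whose state is the number of occupied trunks. The sketch below is an illustration only and makes further assumptions that are not part of the scheme described here: calls arrive at the link as a Poisson stream with a hypothetical rate `arrival_rate`, every call is accepted whenever a trunk is free (no admission policy), and the holding time has mean 1. Under those assumptions the stationary probabilities satisfy arrival_rate · π_{n−1} = n · π_n, which is the classical Erlang loss model.

```python
def link_state_distribution(arrival_rate, trunks):
    """Stationary distribution of the number of occupied trunks on one link.

    Assumes Poisson arrivals, unit-mean exponential holding times, and that
    every call is accepted while a trunk is free.
    """
    weights = [1.0]
    for n in range(1, trunks + 1):
        # balance between states n-1 and n: arrival_rate * pi_{n-1} = n * pi_n
        weights.append(weights[-1] * arrival_rate / n)
    total = sum(weights)
    return [w / total for w in weights]


def blocking_probability(arrival_rate, trunks):
    """Probability that an arriving call finds all trunks occupied."""
    return link_state_distribution(arrival_rate, trunks)[-1]


print(blocking_probability(2.0, 2))  # 0.4
print(blocking_probability(3.0, 2))  # larger: a higher arrival rate loses more calls
```

With two trunks and arrival rate 2 the unnormalized weights are 1, 2, 2, so the blocking probability is 2/5. The state-dependent policies developed in this chapter change which calls are accepted in each link state, so this uncontrolled link is only a baseline.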

The call arrivals for different origin-destination pairs are independent Poisson traffic streams. On a given link, the probability that accepting a call leads to the blocking of later calls depends on the call arrival rate on that link; i.e., once a call has been accepted, a higher arrival rate makes subsequent calls more likely to be lost. It is important for each link to compute its cost from two quantities: the current state of the link and the number of calls that node pairs will set up over it in the future. Once a link determines that the cost of continuing to accept calls is higher than the cost of blocking a call, it should notify the nodes that send traffic over it, using the latest routing information, that it is overloaded or congested. Exchanging such up-to-date information results in considerable network overhead. If coarse routing information is used for cost computation instead of precise information, the computed cost is inaccurate and the overall expected blocking probability of each link increases. In a large network, if link information is flooded through the whole network as soon as a link state changes, source nodes receive accurate routing information for making decisions, but at the same time the network becomes congested by the large amount of flooding. This phenomenon is not
