
training, the Q-network's values shift, and if we use a constantly shifting set of values to adjust our network, the value estimates can easily spiral out of control. If the target Q-values and the estimated Q-values come from the same network, the network can become destabilized by falling into feedback loops between them. To mitigate that possibility, the target network's weights are fixed and only periodically or slowly updated toward the primary Q-network's values. The loss function therefore changes into the following form:

L(w) = \left(r + \gamma \max_{a'} Q(s', a'; w^-) - Q(s, a; w)\right)^2

As shown in the above loss function, the weights used to calculate the target Q-value are w^-, not w.
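To make the formula concrete, the following is a minimal PyTorch sketch of how such a loss could be computed; the names q_net, target_net and the batch layout are illustrative assumptions rather than the exact implementation used in this thesis.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Sketch of the DQN loss with a frozen target network (weights w^-).
    batch = (s, a, r, s_next, done) as batched tensors; a is a LongTensor."""
    s, a, r, s_next, done = batch
    # Q(s, a; w): predicted value of the action actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # r + gamma * max_a' Q(s', a'; w^-): target from the frozen network
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)

# Periodically sync the target network with the primary network:
# target_net.load_state_dict(q_net.state_dict())
```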

There are three main methods to improve the efficiency of DQN on Atari games: Double DQN [?], Prioritised Replay, and the Dueling Network [15]. David Silver introduced them in his ICML 2016 tutorial on deep reinforcement learning. The following summarizes these methods:

1. Double DQN: Remove the upward bias caused by \max_a Q(s, a; w).

   1. The current Q-network w is used to select actions.

   2. The older Q-network w^- is used to evaluate actions.

   L = \left(r + \gamma Q\!\left(s', \arg\max_{a'} Q(s', a'; w); w^-\right) - Q(s, a; w)\right)^2

2. Prioritised Replay: Store experience in a priority queue according to the DQN error

   \left|\, r + \gamma \max_{a'} Q(s', a'; w^-) - Q(s, a; w) \,\right|

3. Dueling Network: Split the Q-network into two channels.

   1. An action-independent value function V(s; w).

   2. An action-dependent advantage function A(s, a; w).

   Q(s, a; w) = V(s; w) + A(s, a; w)

The Double Q-network has two networks, as mentioned in the list above, and its main purpose is to reduce the computational error caused by taking the maximum Q-value, known as the over-estimation problem. You can think of it this way: if your work is double-checked, the error is significantly reduced.
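As an informal illustration, the Double DQN target could be computed as in the following PyTorch sketch, where q_net holds the current weights w and target_net the older weights w^- (both names are assumptions):

```python
import torch

@torch.no_grad()
def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Sketch of the Double DQN target: the current network selects the
    action, the older network evaluates it, which reduces over-estimation."""
    # argmax_a' Q(s', a'; w): action selection with the current network
    best_a = q_net(s_next).argmax(dim=1, keepdim=True)
    # Q(s', argmax_a' Q(s', a'; w); w^-): evaluation with the older network
    q_eval = target_net(s_next).gather(1, best_a).squeeze(1)
    return r + gamma * (1 - done) * q_eval
```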

Prioritized replay is just what its name implies. The original replay mechanism samples experience uniformly at random, whereas prioritized replay samples according to the priority of each experience and its replay weight. The priority is calculated from the difference between the target Q-value and the current Q-value.

If the priority of an experience is high, the probability that it is sampled is high.
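A minimal sketch of proportional prioritized sampling is shown below; it uses a plain Python list rather than the sum-tree of the original paper, and all names are illustrative:

```python
import numpy as np

class SimplePrioritizedBuffer:
    """Toy proportional prioritized replay buffer (illustrative sketch)."""
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:  # drop the oldest experience
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        # priority = |r + gamma * max_a' Q(s', a'; w^-) - Q(s, a; w)| + eps
        self.priorities.append(abs(td_error) + self.eps)

    def sample(self, batch_size):
        p = np.array(self.priorities) ** self.alpha
        p = p / p.sum()  # higher priority -> higher sampling probability
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        return [self.data[i] for i in idx], idx
```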

The Dueling Network separates the Q-network into two channels: one outputs V and the other outputs A. In other words, Q(s, a) is decomposed into two more fundamental notions of value, V(s) and A(s, a). The V-function is independent of the action and simply says how good it is to be in a given state. The A-function depends on the action: A(s, a) tells how much better taking a certain action would be compared to the others, which addresses the problem of reward bias, so we call it the advantage function. The goal of Dueling DQN is to have a network that separately computes the advantage and value functions and combines them back into a single Q-function only at the final layer [15].
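The following is a minimal PyTorch sketch of such an architecture; the layer sizes are illustrative, and the mean subtraction in the last line is the identifiability refinement used in [15] on top of the plain sum Q = V + A above.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling architecture sketch: a shared trunk splits into a state-value
    stream V(s) and an advantage stream A(s, a), recombined at the final layer."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, s):
        h = self.trunk(s)
        v, a = self.value(h), self.advantage(h)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)
```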

3.5 Policy-Based Method

We have analyzed the DQN algorithm in detail; it is a value-based algorithm.

We now analyze another kind of deep reinforcement learning algorithm, one based on the policy; the details are introduced in [?]. The value-based method computes a value for every action and state and then selects the maximum value. This is an indirect approach.


Figure 3.3: The dueling network.


Now we introduce a direct method that updates a policy network. A policy network is simply a neural network: its input is the state and its output is an action. We represent the policy by a deep network with weights u:

a = \pi(a \mid s; u) \quad \text{or} \quad a = \pi(s; u)

That is, the output can be a deterministic action or a probability distribution over actions, a = \pi(a \mid s; u). There are two advantages of policy-based methods over value-based methods. First, the output is a probability rather than a Q-value, so the agent does not always choose the same behaviour; a person never acts identically every time, but DQN cannot output probabilities, so a policy network is the better fit. Second, DQN needs to know the Q-value of every action, so if the action space is continuous, DQN becomes inappropriate.
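As an informal illustration, a stochastic policy network for a discrete action space could be sketched in PyTorch as follows (the layer sizes and names are assumptions):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Sketch of a policy network pi(a|s; u): states in, a probability
    distribution over discrete actions out."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        logits = self.net(s)
        # Returning a distribution lets the caller sample actions and
        # compute log-probabilities for the policy gradient.
        return torch.distributions.Categorical(logits=logits)
```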

In the policy network, we define the objective function as the total discounted reward:

J(u) = \mathbb{E}\!\left[\, r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid \pi(\cdot\,; u) \right]

This objective function is the expectation of the cumulative discounted rewards, and we want to maximize it. The policy network optimizes this objective end-to-end by SGD; in effect, it adjusts the policy weights u to achieve more reward. How to adjust the weights is given by the policy gradient theorem below.

Theorem 3.5.1. For any differentiable policy \pi_u(s, a) and for any of the policy objective functions J(u), the policy gradient is

\nabla_u J(u) = \mathbb{E}_{\pi_u}\!\left[\, \nabla_u \log \pi_u(s, a)\, Q^{\pi_u}(s, a) \right]

The theorem was published in [?]. The idea of the policy gradient is actually simple: if an action leads to a good result, we increase its probability; if an action leads to a bad result, we reduce the probability of it being chosen. In the following, we introduce the most famous policy gradient method, REINFORCE, and its pseudocode.

Algorithm 3: REINFORCE
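The full listing of Algorithm 3 is not reproduced here; as an informal complement, the following is a minimal PyTorch sketch of one REINFORCE update, assuming the policy's forward pass returns a torch.distributions.Categorical as in the sketch earlier in this section:

```python
import torch

def reinforce_update(policy, optimizer, episode, gamma=0.99):
    """One REINFORCE update from a single episode (sketch).
    episode is a list of (state, action, reward); states and actions
    are assumed to be tensors."""
    # Compute the discounted return G_t for every time step, backwards.
    returns, G = [], 0.0
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.insert(0, G)
    # Ascend the gradient of E[log pi(a_t|s_t; u) * G_t] by descending its negative.
    loss = 0.0
    for (s, a, _), G in zip(episode, returns):
        loss = loss - policy(s).log_prob(a) * G
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```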

Actor-Critic is also a TD method, and it combines the value-based and policy-based methods. The policy network is the actor, which outputs the action (action selection). The value network is the critic, which evaluates whether the action selected by the actor network is good or bad (action-value estimation).

Then the value network generates a TD error, which guides the updates of both the actor network and the critic network.


The following figure shows the structure of the actor-critic algorithm; DDPG is a famous algorithm of this type.

Figure 3.4: The structure of actor-critic algorithm.

Google's DDPG paper [10] successfully combined techniques from DQN and DPG, pushing deep reinforcement learning toward continuous control. In DDPG, the input of the actor network is the state and its output is the action; we use a DNN to fit the actor network. If the action is continuous, the output layer can be a tanh or sigmoid; if the action is discrete, a softmax layer can be used to output probabilities. The input of the critic network is the state and the action, and its output is the Q-value.
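As an informal illustration of the critic just described, a network taking both the state and the action and returning a scalar Q-value could be sketched as follows (layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class DDPGCritic(nn.Module):
    """Sketch of a DDPG-style critic: input is (state, action),
    output is a scalar Q-value."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        # Concatenate state and action before feeding them to the network.
        return self.net(torch.cat([s, a], dim=1))
```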

1. DPG

Section 3.5 supplies the formula and the proof of the policy gradient.

2. DQN

The critic network in DDPG uses the experience replay and target network techniques from DQN. Both are used to stabilize training.

3. Noise sample

If the action is continuous, reinforcement learning encounters an exploration problem. DDPG adds noise to the action (a minimal sketch follows this list):

a = \pi(s_t; u^{\pi}) + \varepsilon,

where ε is the exploration noise.
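As an informal sketch of this exploration step, the following uses Gaussian noise in place of the Ornstein-Uhlenbeck process used in the original DDPG paper, and assumes the actor returns a NumPy array:

```python
import numpy as np

def noisy_action(actor, state, noise_std=0.1, low=-1.0, high=1.0):
    """DDPG-style exploration sketch: a = pi(s_t; u^pi) + epsilon."""
    a = actor(state)                                             # deterministic actor output
    a = a + np.random.normal(0.0, noise_std, size=np.shape(a))   # add exploration noise
    return np.clip(a, low, high)                                 # keep the action in a valid range
```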

We will now introduce the asynchronous and advantage components of A3C.

1. Asynchronous

In 2015, Google published the Gorila framework [12], which describes an asynchronous distributed RL framework. Gorila adopts separate machines and a parameter server, and A3C is similar to it. The small difference between Gorila and A3C is that A3C uses multiple CPU threads on a single machine. Why abandon a distributed framework with many machines? The reason is that using a single machine saves the communication costs of sending gradients and parameters.

In [11], it is verified that the iterations are significantly faster. That is the first main advantage of A3C.

Now we introduce the second main advantage of A3C, published in 2016 [11].

A3C uses multiple actor-learners running in parallel and explicitly uses a different exploration policy in each actor-learner to maximize diversity. By running different exploration policies in different threads, the multiple actor-learners applying online updates in parallel are likely to be less correlated than a single agent. Hence, we do not need the experience replay mechanism used to train DQN.

In addition, A3C trains on the CPU rather than the GPU, because the training batches in RL are very small and a GPU would spend much of its time waiting for new data.

2. Advantage Actor-Critic

In section 3.5, we mentioned that standard REINFORCE updates the policy parameters \theta in the direction \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t, which is an unbiased estimate of \nabla_\theta \mathbb{E}[R_t]. We can subtract a learned function of the state, b_t(s_t), known as a baseline, from this estimate while keeping it unbiased; this trick reduces the variance of the estimate. The resulting gradient is \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(R_t - b_t(s_t)), where a learned estimate of the state-value function is used as the baseline.


Figure 3.5: The abstract structure of asynchronous method.

In addition, we use the value function Q^{\pi}(a_t, s_t) to estimate the reward R_t. When an approximate state-value function is used as the baseline and R_t is an estimate of Q^{\pi}(a_t, s_t), the quantity R_t - b_t(s_t) can be seen as an estimate of the advantage of action a_t in state s_t, i.e., A(a_t, s_t) = Q(a_t, s_t) - V(s_t). Therefore, the advantage function A(a_t, s_t) evaluates how good or bad action a_t is in state s_t.
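To make this update concrete, the advantage-weighted policy gradient term could be computed as in the following PyTorch sketch, where the tensor names are illustrative assumptions:

```python
import torch

def a2c_policy_loss(log_probs, values, returns):
    """Advantage actor-critic policy term (sketch).
    log_probs: log pi(a_t|s_t), values: critic estimates V(s_t),
    returns: estimates R_t of Q(a_t, s_t); all 1-D tensors."""
    # A(a_t, s_t) = R_t - V(s_t); detach the baseline so only the
    # policy gradient flows through this term.
    advantage = returns - values.detach()
    # Descend the negative of E[log pi(a_t|s_t) * A(a_t, s_t)].
    return -(log_probs * advantage).mean()
```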


Chapter 4
