... action a and then transitions from state s to s’ with a certain probability p, after which it generates an immediate reward r and feeds it back to the agent.
4. The agent updates its strategy π: S → A according to ... iteration [14] to satisfy (2) [15,16], without knowing R(s, a) and P_{s,s’}(a). Specifically, each state-action pair (s, a) is associated with a Q-value Q(s, a), and (2) is rewritten as: Q^π(s, ... “temperature” parameter, and decreases over the course of the Q-value iteration. Equation 8 expresses the basic idea that, as the Q-learning updates continue to iterate, the choice of state-action...