AI Notes

Bellman equation

  • Objective function (a mathematical function that describes the goal)

  • V value (state value; see the sketch after this list)

  • deterministic process vs non-deterministic process
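
A minimal value-iteration sketch for the V value using the Bellman equation. The 3-state MDP below (rewards, transition probabilities, gamma) is an illustrative assumption, not something from these notes:

```python
import numpy as np

# Bellman equation: V(s) = max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) * V(s') ]
n_states, n_actions = 3, 2
gamma = 0.9

# Reward for each (state, action) pair.
R = np.array([[0.0, 1.0],
              [0.0, 0.0],
              [5.0, 0.0]])

# P[s, a, s'] = probability of landing in s' after taking action a in state s.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0, 1] = 1.0
P[0, 1, 2] = 1.0
P[1, 0, 2] = 1.0
P[1, 1, 0] = 1.0
P[2, :, 2] = 1.0  # state 2 is absorbing

V = np.zeros(n_states)
for _ in range(100):
    Q = R + gamma * (P @ V)      # Q[s, a] = R[s, a] + gamma * E[V(s') | s, a]
    V_new = Q.max(axis=1)        # V(s) = max over actions of Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)  # converged state values
```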

Markov process property

Policy vs plan; living penalty (a small negative reward on every step)

Q-learning intuition

  • Action-utility function (the Q function)

  • Q(s, a) = R(s, a) + gamma * sum over s' of P(s' | s, a) * max over a' of Q(s', a') // gamma is the discount; probabilistic (stochastic) form

  • Q(s, a) = R(s, a) + gamma * max over a' of Q(s', a') // simpler deterministic form with a single known next state s'

  • derived from the V value: V(s) = max over a of Q(s, a) (see the sketch after this list)
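
A small sketch of the two Q backups above, probabilistic vs deterministic, and how Q relates to V. All numbers are made-up assumptions for illustration:

```python
import numpy as np

gamma = 0.9
reward = 1.0

# Probabilistic form: expectation over possible next states s',
# each weighted by P(s' | s, a), taking the best action's Q in each s'.
next_state_probs = np.array([0.8, 0.1, 0.1])   # P(s' | s, a) for three possible s'
max_q_next = np.array([5.0, 2.0, 0.0])         # max over a' of Q(s', a') for each s'
q_stochastic = reward + gamma * np.sum(next_state_probs * max_q_next)

# Deterministic form: only one possible next state, so the sum collapses to one term.
q_deterministic = reward + gamma * max_q_next[0]

# Relation to the V value: V(s) = max_a Q(s, a), i.e. Q is the Bellman
# equation written per action instead of per state.
print(q_stochastic, q_deterministic)
```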

Temporal difference

  • TD(s, a) = R(s, a) + gamma * max over a' of Q(s', a') - Q(s, a) // the new estimate minus the old one; shrinks as Q converges

  • Q_new(s, a) = Q_old(s, a) + alpha * TD(s, a) // alpha is the learning rate; incremental update (see the sketch below)
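
A sketch of one tabular Q-learning step using the TD error above. The table size, transition, alpha, and gamma are illustrative assumptions:

```python
import numpy as np

alpha, gamma = 0.1, 0.9
Q = np.zeros((5, 2))             # Q table: 5 states x 2 actions (assumed sizes)

# One observed transition (s, a, r, s').
s, a, r, s_next = 0, 1, 1.0, 3

# TD error = new estimate (reward + discounted best next Q) minus old estimate.
td = r + gamma * np.max(Q[s_next]) - Q[s, a]

# Incremental update: move Q(s, a) a fraction alpha toward the new estimate.
Q[s, a] = Q[s, a] + alpha * td
```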

Deep Q-learning

  • Q-learning

  • Neural network

    • Activation function

      • Threshold function - value = 1 if x >= 0, else 0.

      • Sigmoid function - value = 1 / (1 + e^-x); a smooth curve between 0 and 1.

      • Rectifier (ReLU) function - value = max(x, 0).

      • Hyperbolic tangent (tanh) function - value = (1 - e^-2x) / (1 + e^-2x); a smooth curve between -1 and 1 (see the sketch after this list).
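
A sketch of the four activation functions above, written out directly with numpy:

```python
import numpy as np

def threshold(x):
    # 1 if x >= 0, else 0.
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    # Smooth curve between 0 and 1.
    return 1.0 / (1.0 + np.exp(-x))

def rectifier(x):
    # ReLU: max(x, 0).
    return np.maximum(x, 0.0)

def tanh(x):
    # Smooth curve between -1 and 1.
    return (1.0 - np.exp(-2 * x)) / (1.0 + np.exp(-2 * x))

x = np.linspace(-3, 3, 7)
for f in (threshold, sigmoid, rectifier, tanh):
    print(f.__name__, f(x))
```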

    • Learning - Experience replay

      • Cost = 1/2 * (predicted output - target value)^2 // minimizing the cost is what determines the weights of each input (see the sketch below)
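
A minimal experience-replay sketch together with the squared-error cost above. The buffer size, batch size, and helper names (`remember`, `sample_batch`, `cost`) are assumptions for illustration, not from the notes:

```python
import random
from collections import deque

import numpy as np

# Replay buffer stores (state, action, reward, next_state, done) tuples;
# training samples random mini-batches from it to break the correlation
# between consecutive experiences.
replay_buffer = deque(maxlen=10_000)

def remember(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def sample_batch(batch_size=32):
    return random.sample(replay_buffer, batch_size)

def cost(predicted_q, target_q):
    # Cost = 1/2 * (predicted - target)^2; gradient descent on this
    # cost is what adjusts the network weights.
    return 0.5 * np.mean((np.asarray(predicted_q) - np.asarray(target_q)) ** 2)
```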

    • Acting - Action selection policies

      • exploration

        • Epsilon-greedy

        • Softmax

        • Epsilon-greedy VDBE (Value-Difference Based Exploration)

      • exploitation (taking the current best action; see the sketch after this list)
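
A sketch of the epsilon-greedy and softmax selection policies above, trading exploration off against exploitation. The epsilon and temperature values are assumptions; VDBE (not shown) adapts epsilon per state based on the TD error:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    # Explore a random action with probability epsilon, otherwise exploit argmax Q.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_policy(q_values, temperature=1.0):
    # Sample actions with probability proportional to exp(Q / temperature);
    # a higher temperature means more exploration.
    z = np.exp((np.asarray(q_values) - np.max(q_values)) / temperature)
    probs = z / z.sum()
    return int(rng.choice(len(q_values), p=probs))

q = np.array([1.0, 2.5, 0.3])
print(epsilon_greedy(q), softmax_policy(q))
```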
