AI Notes
Bellman equation
Objective function (a mathematical function that describes the goal)
V value
deterministic process vs non-deterministic process
Markov property (the next state depends only on the current state, not on the history)
Policy vs Plan
Living penalty
Q-learning intuition
Action-utility function
Q(s, a) = R(s, a) + γ * Σ_{s'} P(s' | s, a) * max_{a'} Q(s', a') // probabilistic form: expected value over all possible next states s'
Q(s, a) = R(s, a) + γ * max_{a'} Q(s', a') // simpler, deterministic form
Transformed from the V value: V(s) = max_a Q(s, a)
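A minimal sketch of the probabilistic form above, iterating the Bellman backup on a made-up three-state MDP (the transitions P, rewards R, and γ = 0.9 are illustrative assumptions, not from these notes):

```python
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9  # discount factor

# P[s, a, s'] = probability of landing in s' after taking action a in state s (toy values)
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]
P[0, 1] = [0.0, 0.9, 0.1]
P[1, 0] = [0.0, 0.5, 0.5]
P[1, 1] = [0.1, 0.0, 0.9]
P[2, :, 2] = 1.0  # state 2 is absorbing

# R[s, a] = immediate reward for taking action a in state s (toy values)
R = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

Q = np.zeros((n_states, n_actions))
for _ in range(100):
    # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * max_{a'} Q(s', a')
    Q = R + gamma * P @ Q.max(axis=1)

print(Q)  # converged Q-values for every (state, action) pair
```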
Temporal difference
TD(a, s) = R(s, a) + γ * max_{a'} Q(s', a') - Q(s, a) // the new estimate of Q minus the old estimate
Q_t(s, a) = Q_{t-1}(s, a) + α * TD_t(a, s) // α is the learning rate; incremental update
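A minimal sketch of this incremental update as one tabular Q-learning step; the function signature and the toy usage are illustrative, not a fixed API:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One incremental update: Q(s, a) += alpha * TD."""
    td = r + gamma * np.max(Q[s_next]) - Q[s, a]  # TD = new estimate - old estimate
    Q[s, a] += alpha * td
    return Q

# Toy usage: 4 states, 2 actions, one observed transition (s=0, a=1, r=1.0, s'=2)
Q = np.zeros((4, 2))
Q = q_learning_step(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])  # 0.1 after one update
```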
Deep Q-learning
Q-learning
Neural network
Activation function (the common choices below are sketched in code after this list)
Threshold function - value = 1 if x >= 0, else 0.
Sigmoid function - value = 1 / (1 + e^(-x)); a smooth curve between 0 and 1.
Rectifier (ReLU) function - value = max(x, 0).
Hyperbolic tangent (tanh) - value = (1 - e^(-2x)) / (1 + e^(-2x)); a smooth curve between -1 and 1.
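The four activation functions above written out in NumPy, vectorized so they accept arrays (the function names are mine):

```python
import numpy as np

def threshold(x):
    # 1 if x >= 0, else 0
    return np.where(np.asarray(x) >= 0, 1.0, 0.0)

def sigmoid(x):
    # smooth curve between 0 and 1
    return 1.0 / (1.0 + np.exp(-np.asarray(x)))

def rectifier(x):
    # ReLU: max(x, 0)
    return np.maximum(x, 0.0)

def tanh(x):
    # smooth curve between -1 and 1
    x = np.asarray(x)
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

xs = np.array([-2.0, 0.0, 2.0])
print(threshold(xs), sigmoid(xs), rectifier(xs), tanh(xs))
```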
Learning - Experience replay
Cost function = 1/2 * (output value - actual value)^2 // minimize the cost; the weights that reach the minimum are the learned weights.
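A minimal sketch of an experience replay buffer together with the squared-error cost above; the class name, capacity, and batch size are illustrative assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop off automatically

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # uniform random minibatch of past transitions
        return random.sample(self.buffer, batch_size)

def cost(output_value, actual_value):
    # Cost = 1/2 * (output - actual)^2, per the note above
    return 0.5 * (output_value - actual_value) ** 2
```

Sampling random minibatches from the buffer breaks the correlation between consecutive transitions, which is what makes the network's learning stable.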
Acting - Action selection policies (ε-greedy and Softmax are sketched in code after this list)
Exploration
ε-greedy
Softmax
ε-greedy VDBE (Value-Difference Based Exploration)
Exploitation
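A minimal sketch of ε-greedy and softmax selection over one row of Q-values; epsilon, tau, and the sample Q-values are illustrative (ε-greedy VDBE, which adapts ε from observed value differences, is left out):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the best-known action
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_action(q_values, tau=1.0):
    # Higher-valued actions get exponentially higher selection probability;
    # tau (temperature) controls how greedy the distribution is
    z = np.asarray(q_values, dtype=float) / tau
    z -= z.max()  # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(q_values), p=probs))

q = [0.2, 0.5, 0.1]
print(epsilon_greedy(q), softmax_action(q))
```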