AI Notes

Bellman equation

  • Objective function (a mathematical function that describes the goal)

  • V value (state value; see the sketch after this list)

  • deterministic process vs non-deterministic process
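
A minimal value-iteration sketch for the V value using the Bellman equation. The 3-state MDP below (rewards, transition probabilities, gamma) is an illustrative assumption, not something from these notes:

```python
import numpy as np

# Bellman equation: V(s) = max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) * V(s') ]
n_states, n_actions = 3, 2
gamma = 0.9

# Reward for each (state, action) pair.
R = np.array([[0.0, 1.0],
              [0.0, 0.0],
              [5.0, 0.0]])

# P[s, a, s'] = probability of landing in s' after taking action a in state s.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0, 1] = 1.0
P[0, 1, 2] = 1.0
P[1, 0, 2] = 1.0
P[1, 1, 0] = 1.0
P[2, :, 2] = 1.0  # state 2 is absorbing

V = np.zeros(n_states)
for _ in range(100):
    Q = R + gamma * (P @ V)      # Q[s, a] = R[s, a] + gamma * E[V(s') | s, a]
    V_new = Q.max(axis=1)        # V(s) = max over actions of Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)  # converged state values
```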

Markov process property

Policy vs plan; living penalty (a small negative reward on every step)

Q-learning intuition

  • Action-utility function (the Q function)

  • Q(s, a) = R(s, a) + gamma * sum over s' of P(s' | s, a) * max over a' of Q(s', a') // gamma is the discount; probabilistic (stochastic) form

  • Q(s, a) = R(s, a) + gamma * max over a' of Q(s', a') // simpler deterministic form with a single known next state s'

  • derived from the V value: V(s) = max over a of Q(s, a) (see the sketch after this list)
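
A small sketch of the two Q backups above, probabilistic vs deterministic, and how Q relates to V. All numbers are made-up assumptions for illustration:

```python
import numpy as np

gamma = 0.9
reward = 1.0

# Probabilistic form: expectation over possible next states s',
# each weighted by P(s' | s, a), taking the best action's Q in each s'.
next_state_probs = np.array([0.8, 0.1, 0.1])   # P(s' | s, a) for three possible s'
max_q_next = np.array([5.0, 2.0, 0.0])         # max over a' of Q(s', a') for each s'
q_stochastic = reward + gamma * np.sum(next_state_probs * max_q_next)

# Deterministic form: only one possible next state, so the sum collapses to one term.
q_deterministic = reward + gamma * max_q_next[0]

# Relation to the V value: V(s) = max_a Q(s, a), i.e. Q is the Bellman
# equation written per action instead of per state.
print(q_stochastic, q_deterministic)
```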

Temporal difference

  • TD(s, a) = R(s, a) + gamma * max over a' of Q(s', a') - Q(s, a) // the new estimate minus the old one; shrinks as Q converges

  • Q_new(s, a) = Q_old(s, a) + alpha * TD(s, a) // alpha is the learning rate; incremental update (see the sketch below)
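
A sketch of one tabular Q-learning step using the TD error above. The table size, transition, alpha, and gamma are illustrative assumptions:

```python
import numpy as np

alpha, gamma = 0.1, 0.9
Q = np.zeros((5, 2))             # Q table: 5 states x 2 actions (assumed sizes)

# One observed transition (s, a, r, s').
s, a, r, s_next = 0, 1, 1.0, 3

# TD error = new estimate (reward + discounted best next Q) minus old estimate.
td = r + gamma * np.max(Q[s_next]) - Q[s, a]

# Incremental update: move Q(s, a) a fraction alpha toward the new estimate.
Q[s, a] = Q[s, a] + alpha * td
```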

Deep Q-learning

  • Q-learning

  • Neural network

    • Activation function

      • Threshold function - value = 1 if x >= 0, else 0.

      • Sigmoid function - value = 1 / (1 + e^-x); a smooth curve between 0 and 1.

      • Rectifier (ReLU) function - value = max(x, 0).

      • Hyperbolic tangent (tanh) function - value = (1 - e^-2x) / (1 + e^-2x); a smooth curve between -1 and 1 (see the sketch after this list).
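
A sketch of the four activation functions above, written out directly with numpy:

```python
import numpy as np

def threshold(x):
    # 1 if x >= 0, else 0.
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    # Smooth curve between 0 and 1.
    return 1.0 / (1.0 + np.exp(-x))

def rectifier(x):
    # ReLU: max(x, 0).
    return np.maximum(x, 0.0)

def tanh(x):
    # Smooth curve between -1 and 1.
    return (1.0 - np.exp(-2 * x)) / (1.0 + np.exp(-2 * x))

x = np.linspace(-3, 3, 7)
for f in (threshold, sigmoid, rectifier, tanh):
    print(f.__name__, f(x))
```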

    • Learning - Experience replay

      • Cost = 1/2 * (predicted output - target value)^2 // minimizing the cost is what determines the weights of each input (see the sketch below)
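
A minimal experience-replay sketch together with the squared-error cost above. The buffer size, batch size, and helper names (`remember`, `sample_batch`, `cost`) are assumptions for illustration, not from the notes:

```python
import random
from collections import deque

import numpy as np

# Replay buffer stores (state, action, reward, next_state, done) tuples;
# training samples random mini-batches from it to break the correlation
# between consecutive experiences.
replay_buffer = deque(maxlen=10_000)

def remember(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def sample_batch(batch_size=32):
    return random.sample(replay_buffer, batch_size)

def cost(predicted_q, target_q):
    # Cost = 1/2 * (predicted - target)^2; gradient descent on this
    # cost is what adjusts the network weights.
    return 0.5 * np.mean((np.asarray(predicted_q) - np.asarray(target_q)) ** 2)
```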

    • Acting - Action selection policies

      • exploration

        • Epsilon-greedy

        • Softmax

        • Epsilon-greedy VDBE (Value-Difference Based Exploration)

      • exploitation (taking the current best action; see the sketch after this list)
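
A sketch of the epsilon-greedy and softmax selection policies above, trading exploration off against exploitation. The epsilon and temperature values are assumptions; VDBE (not shown) adapts epsilon per state based on the TD error:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    # Explore a random action with probability epsilon, otherwise exploit argmax Q.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_policy(q_values, temperature=1.0):
    # Sample actions with probability proportional to exp(Q / temperature);
    # a higher temperature means more exploration.
    z = np.exp((np.asarray(q_values) - np.max(q_values)) / temperature)
    probs = z / z.sum()
    return int(rng.choice(len(q_values), p=probs))

q = np.array([1.0, 2.5, 0.3])
print(epsilon_greedy(q), softmax_policy(q))
```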
