Q-learning
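I recently started studying a range of algorithms. My current approach is to get hands-on with code quickly, build an intuitive feel for the algorithm, and only then go back to the papers and formulas.
Q-learning is a very basic off-policy reinforcement learning algorithm and a good fit for beginners. Because the environment has to be tailored to each problem, I did not find a general-purpose library that covers it, so the program has to be written by hand. For a detailed explanation of the algorithm, see the Q-learning article on Wikipedia. Here I paste the two cases I used for practice: the 1-D case follows Morvan's (莫烦) tutorial, and I extended it to 2-D. Together they give an intuitive feel for a simple reinforcement learning algorithm.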
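Brief outline
- Prepare the state-action table (the Q-table) and the reward table
- Randomly initialize the state
- Choose an action according to the action table
- The environment changes
- The environment returns the reward and the new state
- Compute the Q-value for the previous state
- Update the Q-table from the previous state's two Q-values (the current estimate and the target)
- Move to the new state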
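To make the "two Q-values" in the outline concrete, here is a minimal sketch of a single update on a tiny Q-table. The table size, the reward, the learning rate `alpha = 0.1`, and the discount `gamma = 0.9` are illustrative assumptions, not values taken from the original listings.

```python
import numpy as np

# Hypothetical setup: 3 states, 2 actions; every number here is illustrative.
alpha, gamma = 0.1, 0.9
q_table = np.zeros((3, 2))
q_table[1] = [0.0, 2.0]      # pretend state 1 already has some estimates

# One observed transition: in state 0, action 1 gave reward 1.0 and led to state 1.
state, action, reward, next_state = 0, 1, 1.0, 1

q_predict = q_table[state, action]                      # current estimate: 0.0
q_target = reward + gamma * q_table[next_state].max()   # 1.0 + 0.9 * 2.0 = 2.8
q_table[state, action] += alpha * (q_target - q_predict)

print(q_table[state, action])
```

The entry moves from 0 only a tenth of the way toward the target 2.8, so it ends up at roughly 0.28. A small `alpha` keeps each update cautious.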
Pseudocode
- Initialize $Q(s,a)$, $\forall s \in S, a \in A(s)$, arbitrarily, and $Q(\text{terminal-state}, \cdot) = 0$
- Repeat for each episode:
  - Initialize $S$
  - Repeat for each step of the episode:
    - Choose $A$ from $S$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
    - Take action $A$, observe $R, S'$
    - $Q(S,A) \leftarrow Q(S,A) + \alpha [R + \gamma \max_a Q(S',a) - Q(S,A)]$
    - $S \leftarrow S'$
  - until $S$ is terminal
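The pseudocode maps almost line by line onto Python. The sketch below is one possible generic version, not the code used for the games that follow. The function names `epsilon_greedy`, `q_learning`, and `chain_step`, the three-state toy chain, and all hyperparameter values are assumptions made for illustration. Here `epsilon` is the probability of exploring.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon (or if nothing is learned yet), else act greedily."""
    if rng.random() < epsilon or not q_row.any():
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def q_learning(step, n_states, n_actions, start_state,
               episodes=200, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning; `step(s, a)` must return (next_state, reward, done)."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))   # terminal rows are never updated, so they stay 0
    for _ in range(episodes):             # repeat for each episode
        s, done = start_state, False      # initialize S
        while not done:                   # repeat for each step of the episode
            a = epsilon_greedy(q[s], epsilon, rng)
            s_next, r, done = step(s, a)  # take action A, observe R and S'
            target = r if done else r + gamma * q[s_next].max()
            q[s, a] += alpha * (target - q[s, a])
            s = s_next
    return q

def chain_step(s, a):
    """Hypothetical toy environment: states 0-1-2, action 0 = left, 1 = right, state 2 is terminal."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    done = s_next == 2
    return s_next, (1.0 if done else 0.0), done

q = q_learning(chain_step, n_states=3, n_actions=2, start_state=0)
print(np.round(q, 3))
```

Passing the environment in as a `step` function keeps the learning loop independent of any particular game, which is one way to reuse it for both a 1-D and a 2-D world.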
1-D exploration game
import numpy as np
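As a rough illustration of what a 1-D game of this kind can look like, here is a minimal sketch. It is not the original listing. The corridor length, the treasure at the right end, the reward of 1, the hyperparameters, and all function names are assumptions.

```python
import numpy as np

N_STATES = 6                  # assumed corridor length; the treasure is the last cell
ACTIONS = ("left", "right")
EPSILON = 0.1                 # assumed exploration probability
ALPHA, GAMMA = 0.1, 0.9       # assumed learning rate and discount
EPISODES = 50

def step(state, action):
    """Move one cell; reaching the last cell ends the episode with reward 1."""
    if ACTIONS[action] == "right":
        next_state = state + 1
    else:
        next_state = max(state - 1, 0)   # bumping into the left wall keeps the agent in place
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def render(state):
    """Draw the corridor with the agent as 'o' and the treasure as 'T'."""
    cells = ["-"] * (N_STATES - 1) + ["T"]
    cells[state] = "o"
    return "".join(cells)

def train(seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, len(ACTIONS)))
    steps_per_episode = []
    for _ in range(EPISODES):
        state, done, steps = 0, False, 0
        while not done:
            # Explore with probability EPSILON, or when this state has no estimates yet.
            if rng.random() < EPSILON or not q[state].any():
                action = int(rng.integers(len(ACTIONS)))
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, done = step(state, action)
            target = reward if done else reward + GAMMA * q[next_state].max()
            q[state, action] += ALPHA * (target - q[state, action])
            state, steps = next_state, steps + 1
        steps_per_episode.append(steps)
    return q, steps_per_episode

q_table, steps_per_episode = train()
print(render(0))
print(np.round(q_table, 3))
print("steps in first / last episode:", steps_per_episode[0], steps_per_episode[-1])
```

Comparing the step counts of the first and last episodes is a quick way to see whether the agent has stopped wandering. The shortest possible episode in this corridor takes five steps.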
2-D exploration game
import numpy as np
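A 2-D version can be sketched along the same lines. The code below is an illustration, not the original listing. The 6 x 8 grid and the position of `T` are read off the printed map further down. The row-major state numbering, the reward of 100, the random start, the hyperparameters, and all function names are assumptions.

```python
import numpy as np
import pandas as pd

N_ROWS, N_COLS = 6, 8                     # grid size taken from the printed map
GOAL = (2, 3)                             # position of 'T' in the printed map
ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1     # assumed hyperparameters
GOAL_REWARD = 100.0                       # assumed reward for reaching 'T'

def to_index(pos):
    """Flatten (row, col) into a row-major state index (an assumed convention)."""
    return pos[0] * N_COLS + pos[1]

def step(pos, action):
    """Apply a move, clipping at the walls; return (next_pos, reward, done)."""
    d_row, d_col = MOVES[action]
    row = min(max(pos[0] + d_row, 0), N_ROWS - 1)
    col = min(max(pos[1] + d_col, 0), N_COLS - 1)
    done = (row, col) == GOAL
    return (row, col), (GOAL_REWARD if done else 0.0), done

def train(episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_ROWS * N_COLS, len(ACTIONS)))
    for _ in range(episodes):
        # Random start, redrawn if it lands on the goal.
        pos = GOAL
        while pos == GOAL:
            pos = (int(rng.integers(N_ROWS)), int(rng.integers(N_COLS)))
        done = False
        while not done:
            s = to_index(pos)
            if rng.random() < EPSILON or not q[s].any():
                a = int(rng.integers(len(ACTIONS)))
            else:
                a = int(np.argmax(q[s]))
            pos, reward, done = step(pos, ACTIONS[a])
            target = reward if done else reward + GAMMA * q[to_index(pos)].max()
            q[s, a] += ALPHA * (target - q[s, a])
    return pd.DataFrame(q, columns=ACTIONS)

def greedy_path(q_table, start, max_steps=50):
    """Follow the highest-valued action from `start` until 'T' or max_steps."""
    path, pos = [start], start
    for _ in range(max_steps):
        if pos == GOAL:
            break
        action = q_table.iloc[to_index(pos)].idxmax()
        pos, _, _ = step(pos, action)
        path.append(pos)
    return path

q_table = train()
print(q_table.round(3))
print(greedy_path(q_table, start=(1, 3)))
```

Storing the Q-table as a `pandas.DataFrame` with one column per action gives a printout in the same shape as the table shown below. The numbers themselves will differ from run to run and from the original program.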
In the end the agent finds the shortest path, and the program prints the Q-table:
  0 1 2 3 4 5 6 7
0 - - - - - - - -
1 - - - 0 - - - -
2 - - - T - - - -
3 - - - - - - - -
4 - - - - - - - -
5 - - - - - - - -
Q-table:
up down left right
0 7.307904e-04 1.904577e-03 0.000000e+00 0.007859
1 7.858897e-03 7.556999e-01 0.000000e+00 0.402958
2 2.631690e-02 8.414524e+00 3.626621e-02 1.745023
3 8.100000e-02 2.261374e+01 1.975590e-02 0.013851
4 3.222180e-02 1.211320e+01 8.100000e-02 0.001837
5 2.368521e-03 0.000000e+00 6.363587e-02 0.000218
6 5.904900e-05 1.516812e-01 1.837080e-03 0.000164
7 3.472489e-05 3.289256e-02 5.904900e-05 0.000005
8 7.307904e-04 1.339231e-06 2.994458e-03 1.414500
9 3.988592e-02 4.304672e-08 7.212840e-02 11.509400
10 5.065821e-02 1.484958e-01 3.619704e-04 55.499094
11 2.268000e-01 9.749684e+01 3.653100e-01 5.496623
12 9.978600e-01 7.265790e+00 6.105641e+01 0.215999
13 5.509928e-03 5.940395e-03 2.289216e+01 0.097409
14 5.010639e-04 5.947947e-05 4.372300e+00 0.000478
15 4.903927e-04 1.827626e-05 5.142739e-01 0.002211
16 2.919956e-03 0.000000e+00 0.000000e+00 0.000000
17 2.114380e-01 7.775514e-02 4.782969e-07 3.301809
18 3.003703e+00 1.249748e+01 0.000000e+00 27.100000
19 1.521753e+01 0.000000e+00 9.000000e-01 0.000000
20 6.642014e+00 2.268000e-01 4.685590e+01 0.226800
21 6.600439e-02 2.631690e-02 6.609690e+00 0.000257
22 2.851123e-03 1.605384e-02 2.195100e-01 0.000000
23 1.665721e-02 0.000000e+00 1.975590e-02 0.001809
24 0.000000e+00 1.571279e-01 0.000000e+00 0.000000
25 1.591809e-01 2.455123e+00 0.000000e+00 11.233118
26 1.744314e+00 4.288338e+00 5.544728e-01 79.152640
27 9.972611e+01 4.540708e+00 6.069450e+00 18.925634
28 6.493028e+00 1.346001e+00 8.319825e+01 0.050002
29 2.268000e-01 6.949118e-02 5.408570e+01 0.436982
30 4.782969e-07 1.153245e-03 2.298841e+01 0.000002
31 1.778031e-03 1.661233e-04 0.000000e+00 0.000000
32 9.701053e-05 1.565183e-04 1.436938e-03 2.462021
33 2.041200e-02 5.733493e-03 1.985465e-02 18.339899
34 5.023670e+01 1.246590e-03 1.296716e-02 1.431899
35 2.967056e+01 0.000000e+00 8.428189e+00 0.241540
36 2.075822e+01 0.000000e+00 2.195100e-01 0.332408
37 9.607798e+00 1.826627e-03 3.878280e-02 0.000405
38 4.855764e+00 3.645154e-05 2.368521e-03 0.000019
39 0.000000e+00 6.233213e-06 3.998927e-02 0.000019
40 1.905050e-01 2.736390e-05 7.344711e-04 0.000213
41 4.621959e-01 7.653708e-04 1.102849e-04 0.001247
42 1.385100e-02 1.778031e-03 9.572210e-04 0.100822
43 2.909448e+00 7.163884e-04 0.000000e+00 0.015124
44 9.807243e-01 7.959871e-03 2.040065e-02 0.000961
45 3.069987e-02 0.000000e+00 0.000000e+00 0.000118
46 6.075773e-01 6.925874e-05 4.050171e-04 0.000023
47 4.556314e-04 8.890530e-06 3.241323e-02 0.000000
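One way to read a table like this is to take the highest-valued action in each row. The sketch below does that for four rows copied from the printout above. It assumes the states are numbered row by row across the 8-column map, so that `T` at row 2, column 3 is state 19. That numbering is my reading of the output, not something the listing confirms.

```python
import pandas as pd

# Four rows copied from the printed Q-table: the cells around state 19.
rows = {
    11: [2.268000e-01, 9.749684e+01, 3.653100e-01, 5.496623],
    18: [3.003703e+00, 1.249748e+01, 0.000000e+00, 27.100000],
    20: [6.642014e+00, 2.268000e-01, 4.685590e+01, 0.226800],
    27: [9.972611e+01, 4.540708e+00, 6.069450e+00, 18.925634],
}
q_subset = pd.DataFrame.from_dict(
    rows, orient="index", columns=["up", "down", "left", "right"]
)

# Greedy action per state: the column holding the largest Q-value in each row.
greedy = q_subset.idxmax(axis=1)
print(greedy)

# Assumed row-major numbering: state index -> (row, column) on the 8-column map.
print(divmod(19, 8))
```

Under that numbering the greedy actions are down for state 11, right for 18, left for 20, and up for 27. All four point at state 19, and `divmod(19, 8)` gives `(2, 3)`, the cell marked `T` on the map.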