The dataset is the train set from the SofaSofa data competition "问答网站问题、回答数量预测" (predicting the number of questions and answers on a Q&A website).
Goal: hand-write the training code in three different ways to predict the `questions` column.
SGD
Algorithm Overview
What this section calls SGD is really standard full-batch gradient descent (strictly speaking, SGD stands for stochastic gradient descent, which is covered in the next section): the gradient of the loss function is used as the feedback for each parameter update. The idea is simple and direct and is a common approach in convex optimization, so it is not elaborated further here.
A linear function is used as the model, the mean squared error (and later the root mean squared error) is used as the loss, and the learning rate is kept constant.
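Written out for this setup (a standard derivation; it matches the `grad` and `update` functions defined below), the loss, its gradient, and the update rule are:

$$L(m_1, m_0) = \frac{1}{N}\sum_{i=1}^{N}\bigl(m_1 x_i + m_0 - y_i\bigr)^2$$

$$\frac{\partial L}{\partial m_1} = \frac{2}{N}\sum_{i=1}^{N} x_i\bigl(m_1 x_i + m_0 - y_i\bigr), \qquad
\frac{\partial L}{\partial m_0} = \frac{2}{N}\sum_{i=1}^{N}\bigl(m_1 x_i + m_0 - y_i\bigr)$$

$$\begin{pmatrix} m_1 \\ m_0 \end{pmatrix} \leftarrow \begin{pmatrix} m_1 \\ m_0 \end{pmatrix} - \eta \begin{pmatrix} \partial L/\partial m_1 \\ \partial L/\partial m_0 \end{pmatrix}$$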
Custom Parameters and Functions
The following functions and parameters are all self-defined:
Parameters
The model is set to $y = m_{1} x + m_{0}$, and its coefficients are stored in an array `m` (in the code, `m[0]` is the slope $m_1$ and `m[1]` is the intercept $m_0$).
From the dataset "train.csv", the `id` column is taken as x and the `questions` column as y; the path is an absolute path and needs to be changed to your own.
The learning rate `eta` ($\eta$) is a fixed constant, and `loss_max` is a threshold on the change of the loss: when the loss changes by less than this threshold between checks, it is considered converged (an extremum has been reached).
Functions
Prediction function `predict`: used to predict values from the fitted coefficients
```python
def predict(m, x):
    return x*m[0] + m[1]
```
Loss function `loss`: defined here as the mean squared error
```python
def loss(m, x_all, y_all):
    return np.mean((x_all * m[0] + m[1] - y_all)**2)
```
Gradient function `grad`: the derivative of the loss function with respect to the coefficients
```python
def grad(m, x_all, y_all):
    grad_loss = [0, 0]
    grad_loss[0] = 2 * np.mean(x_all * (x_all * m[0] + m[1] - y_all))
    grad_loss[1] = 2 * np.mean(x_all * m[0] + m[1] - y_all)
    return np.array(grad_loss)
```
Update function `update`: updates `m` using the gradient returned by `grad`
```python
def update(eta, m, grad_loss):
    return np.array(m) - eta * grad_loss
```
Notes
- To make training converge more easily, x is normalized (divided by its maximum).
- Also to help convergence, the loss used for the stopping check can be switched to the root mean squared error (this is done in the later sections); taking the square root keeps the loss values on a much smaller scale.
- Note the difference between NumPy arrays (vectors) and Python lists: some element-wise computations cannot be done with lists (see the short sketch after this list).
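On the last point, a minimal standalone sketch (not part of the project code) of why NumPy arrays are needed for these element-wise formulas:

```python
import numpy as np

m_list = [1, 2, 3]
m_arr = np.array(m_list)

print(m_list * 2)   # [1, 2, 3, 1, 2, 3] -- list repetition, not arithmetic
print(m_arr * 2)    # [2 4 6]            -- element-wise multiplication
# m_list - 1 raises a TypeError, while m_arr - 1 gives [0 1 2],
# which is why update() converts m with np.array before subtracting eta * grad_loss
```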
Results
```
...
126: m: [4941.22094925 938.09814254]
127: loss: 282856.28634899214
127: m: [4941.42197907 937.99050829]
128: loss: 282856.1727332602
128: m: [4941.61243806 937.88853382]
129: loss: 282856.07075199106
129: m: [4941.79288205 937.79192152]
130: loss: 282855.979213794
130: m: [4941.96383768 937.70038943]
m1: 2.1935037007024962
m2: 937.7003894264082
time cost: 0.1464390754699707
```
Training took 130 iterations; the final coefficients are m1 ≈ 2.193 (the slope on the raw id) and m2 ≈ 937.7 (the intercept), and the run took about 0.15 s.
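As a quick sanity check (a hypothetical call, not part of the original run: `some_id` is an arbitrary illustration, and `x_max` is recovered here from the two printed slopes rather than read from the dataset), the learned line can be evaluated with `predict`. Note that `m[0]` was learned on the normalized feature `id/x_max`, so the input has to be normalized the same way:

```python
m_fit = [4941.96383768, 937.70038943]   # final m printed by the run above
m1 = 2.1935037007024962                 # slope on the raw id, also printed above
x_max = m_fit[0] / m1                   # recover the normalization constant used in training

some_id = 1000                          # arbitrary id, for illustration only
print(predict(m_fit, some_id / x_max))  # ~ 2.1935 * 1000 + 937.7 ≈ 3131.2
```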
Full Code
```python
import numpy as np
import pandas as ps
import time

def main():
    time_start = time.time()

    # model coefficients, learning rate and convergence threshold
    m = [1, 1]
    eta = 0.4
    loss_max = 0.1

    # load the data; x (the id column) is normalized by its maximum
    numlist = ps.read_csv('XXX\\train.csv')
    x = numlist['id']
    x_max = np.max(x)
    x_train = x/x_max
    y_train = numlist['questions']

    loss_last = loss(m, x_train, y_train)
    loss_now = 0

    print('-- start training ---')
    num = 0
    # iterate until the loss changes by less than loss_max
    while np.abs(loss_last - loss_now) >= loss_max:
        grad_loss = grad(m, x_train, y_train)
        m = update(eta, m, grad_loss)
        loss_last = loss_now
        loss_now = loss(m, x_train, y_train)
        num += 1
        print(str(num) + ': loss: ' + str(loss_now))
        print(str(num) + ': m: ' + str(m))
    time_end = time.time()
    # m[0] was learned on the normalized x, so divide by x_max for the slope on the raw id
    print('m1: %s \nm2: %s' % (m[0]/x_max, m[1]))
    print('time cost: ' + str(time_end - time_start))
    return

def predict(m, x):
    return x*m[0] + m[1]

def loss(m, x_all, y_all):
    return np.mean((x_all * m[0] + m[1] - y_all)**2)

def grad(m, x_all, y_all):
    grad_loss = [0, 0]
    grad_loss[0] = 2 * np.mean(x_all * (x_all * m[0] + m[1] - y_all))
    grad_loss[1] = 2 * np.mean(x_all * m[0] + m[1] - y_all)
    return np.array(grad_loss)

def update(eta, m, grad_loss):
    return np.array(m) - eta * grad_loss

if __name__ == "__main__":
    main()
```
Here `XXX` must be replaced with the absolute path where your data is located.
Stochastic SGD
Algorithm Overview
Difference from the previous method: computing the gradient over every sample at each step is expensive, so a single sample is drawn at random and the gradient is computed on that sample alone, which makes each update much cheaper.
Because the sample is drawn at random, the expected update is essentially the same as before; the only design choice is whether sampling is done with replacement or without replacement (a value that has been drawn is not drawn again).
However, changing the model based on a single sample makes each step noisy, so many more iterations are usually needed before the loss settles down (convergence is slower).
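A small standalone sketch of the sampling difference (illustration only; shuffled indices are one common way to sample without replacement, whereas the `grad_no` function later keeps a list of already-used indices instead):

```python
import numpy as np

n_samples = 5

# With replacement: every draw is independent, so an index may repeat
idx_with = np.random.randint(0, n_samples, size=n_samples)

# Without replacement: a shuffled pass over the indices uses each sample exactly once
idx_without = np.random.permutation(n_samples)

print(idx_with)     # duplicates are possible
print(idx_without)  # a permutation of 0..4, each index exactly once
```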
Sampling with Replacement
Code
The gradient computation is changed, and since evaluating the full loss costs as much as the original full-batch gradient, the loss is no longer checked on every iteration: `main` now checks the stopping condition only once every 1000 iterations.
At the same time, to make convergence easier, the monitored loss is switched to the root mean squared error and the learning rate is lowered.
The modified `main` is:
```python
def main():
    time_start = time.time()

    m = [1, 1]
    eta = 0.2
    loss_max = 0.1

    numlist = ps.read_csv('XXX\\train.csv')
    x = numlist['id']
    x_max = np.max(x)
    x_train = x/x_max
    y_train = numlist['questions']

    loss_last = loss(m, x_train, y_train)
    loss_now = 0

    print('-- start training ---')
    num = 1
    while np.abs(loss_last - loss_now) >= loss_max:
        # run single-sample updates and only check the loss every 1000 iterations
        while num % 1000 != 0:
            grad_loss = grad_yes(m, x_train, y_train)
            m = update(eta, m, grad_loss)
            num += 1
        num += 1
        loss_last = loss_now
        loss_now = loss(m, x_train, y_train)
        print(str(num) + ': loss: ' + str(loss_now))
        print(str(num) + ': m: ' + str(m))
    time_end = time.time()
    print('m1: %s \nm2: %s' % (m[0]/x_max, m[1]))
    print('time cost: ' + str(time_end - time_start))
    return
```
And the modified gradient function is:
```python
def grad_yes(m, x_all, y_all):
    grad_loss = [0, 0]
    n = np.random.randint(0, len(x_all))
    grad_loss[0] = 2 * x_all[n] * (x_all[n] * m[0] + m[1] - y_all[n])
    grad_loss[1] = 2 * (x_all[n] * m[0] + m[1] - y_all[n])
    return np.array(grad_loss)
```
Results
```
...
30001: m: [4574.0792231 1459.14024352]
31001: loss: 693.3256213246475
31001: m: [5359.3018607 1157.2512358]
32001: loss: 667.4950871908976
32001: m: [5372.91711498 1105.99657984]
33001: loss: 548.5828585929333
33001: m: [4956.27599193 1064.88519278]
34001: loss: 548.5393614856855
34001: m: [4522.56052786 1091.12039757]
m1: 2.0073504340254114
m2: 1091.120397570372
time cost: 0.5240442752838135
```
The stopping condition is met after about 34,000 iterations; each update is much cheaper, but convergence takes far more iterations.
Sampling without Replacement
Code
Same as the with-replacement version, except that a global list `x_take` records which indices have already been drawn; once every index has been used, the list is cleared so a new pass can begin.
The gradient function becomes:
```python
def grad_no(m, x_all, y_all):
    global x_take
    grad_loss = [0, 0]
    # once every index has been drawn, clear the record and start a new pass
    if len(x_take) == len(x_all):
        x_take = []
    # keep drawing until an index that has not been used in this pass is found
    while True:
        n = np.random.randint(0, len(x_all))
        if n not in x_take:
            break
    x_take.append(n)
    grad_loss[0] = 2 * x_all[n] * (x_all[n] * m[0] + m[1] - y_all[n])
    grad_loss[1] = 2 * (x_all[n] * m[0] + m[1] - y_all[n])
    return np.array(grad_loss)
```
Results
```
...
248001: m: [4880.01644085 840.39996252]
249001: loss: 628.2718124284057
249001: m: [5063.90178143 543.89033753]
250001: loss: 744.7237441508288
250001: m: [4868.60609451 453.4499488 ]
251001: loss: 546.3392904223941
251001: m: [5322.36473068 685.92958607]
252001: loss: 546.3997052005344
252001: m: [5335.40219647 795.55018315]
m1: 2.368132355291041
m2: 795.5501831533912
time cost: 3.8524723052978516
```
The run takes noticeably longer and the result is less stable.
Full Code
```python
import numpy as np
import pandas as ps
import time

# global record of the indices already drawn (used by grad_no)
x_take = []

def main():
    time_start = time.time()

    m = [1, 1]
    eta = 0.2
    loss_max = 0.1

    numlist = ps.read_csv('XXX\\train.csv')
    x = numlist['id']
    x_max = np.max(x)
    x_train = x/x_max
    y_train = numlist['questions']

    loss_last = loss(m, x_train, y_train)
    loss_now = 0

    print('-- start training ---')
    num = 1
    while np.abs(loss_last - loss_now) >= loss_max:
        # check the loss only once every 1000 single-sample updates;
        # replace grad_yes with grad_no for the without-replacement variant
        while num % 1000 != 0:
            grad_loss = grad_yes(m, x_train, y_train)
            m = update(eta, m, grad_loss)
            num += 1
        num += 1
        loss_last = loss_now
        loss_now = loss(m, x_train, y_train)
        print(str(num) + ': loss: ' + str(loss_now))
        print(str(num) + ': m: ' + str(m))
    time_end = time.time()
    print('m1: %s \nm2: %s' % (m[0]/x_max, m[1]))
    print('time cost: ' + str(time_end - time_start))
    return

def predict(m, x):
    return x*m[0] + m[1]

def loss(m, x_all, y_all):
    # root mean squared error, used only for the stopping check
    return np.sqrt(np.mean((x_all * m[0] + m[1] - y_all)**2))

def grad_yes(m, x_all, y_all):
    # single-sample gradient, sampling with replacement
    grad_loss = [0, 0]
    n = np.random.randint(0, len(x_all))
    grad_loss[0] = 2 * x_all[n] * (x_all[n] * m[0] + m[1] - y_all[n])
    grad_loss[1] = 2 * (x_all[n] * m[0] + m[1] - y_all[n])
    return np.array(grad_loss)

def grad_no(m, x_all, y_all):
    # single-sample gradient, sampling without replacement within each pass
    global x_take
    grad_loss = [0, 0]
    if len(x_take) == len(x_all):
        x_take = []          # every index has been used: start a new pass
    while True:
        n = np.random.randint(0, len(x_all))
        if n not in x_take:
            break
    x_take.append(n)
    grad_loss[0] = 2 * x_all[n] * (x_all[n] * m[0] + m[1] - y_all[n])
    grad_loss[1] = 2 * (x_all[n] * m[0] + m[1] - y_all[n])
    return np.array(grad_loss)

def update(eta, m, grad_loss):
    return np.array(m) - eta * grad_loss

if __name__ == "__main__":
    main()
```
Mini-batch SGD
Algorithm Overview
Mini-batch SGD combines the strengths of the two previous methods: it randomly selects a small batch of samples for each update, which balances the cost of a single gradient computation against the number of iterations needed (stability). By choosing a reasonable batch size `batch`, we get a good compromise between the two extremes.
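Concretely, with a random index set $B$ of size `batch`, the gradient of the mean squared error is estimated as (this is what the rewritten `grad` below computes):

$$\frac{\partial L}{\partial m_1} \approx \frac{2}{|B|}\sum_{i\in B} x_i\bigl(m_1 x_i + m_0 - y_i\bigr), \qquad
\frac{\partial L}{\partial m_0} \approx \frac{2}{|B|}\sum_{i\in B}\bigl(m_1 x_i + m_0 - y_i\bigr)$$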
Code
`main` is handled much like the stochastic SGD version: the stopping condition is checked only after many inner iterations (here, every 100); only the gradient computation changes.
A helper `randon_int` is added to draw a random batch of x and y values; it is defined as follows:
```python
def randon_int(n, x_all, y_all):
    # draw n random samples (with replacement) from x_all and y_all
    x_rand = []
    y_rand = []
    while n >= 1:
        n -= 1
        num_x = np.random.randint(0, len(x_all))
        x_rand.append(x_all[num_x])
        y_rand.append(y_all[num_x])
    return np.array(x_rand), np.array(y_rand)
```
The arrays returned by this helper replace the full arrays, and `grad` is rewritten as:
```python
def grad(m, x_all, y_all, batch):
    grad_loss = [0, 0]
    x_rand, y_rand = randon_int(batch, x_all, y_all)
    grad_loss[0] = 2 * np.mean(x_rand * (x_rand * m[0] + m[1] - y_rand))
    grad_loss[1] = 2 * np.mean(x_rand * m[0] + m[1] - y_rand)
    return np.array(grad_loss)
```
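Since the batch size is just an argument, the same `grad` also covers the two earlier methods as rough limiting cases (hypothetical calls, assuming `m`, `x_train`, and `y_train` are set up as in `main`; note that `randon_int` samples with replacement, so `batch = len(x_train)` only approximates full-batch gradient descent):

```python
grad(m, x_train, y_train, 1)              # batch = 1: behaves like single-sample SGD
grad(m, x_train, y_train, 10)             # batch = 10: the setting used in this section
grad(m, x_train, y_train, len(x_train))   # batch = N: approximately full-batch gradient descent
```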
Results
```
...
1602: m: [4968.82514484 877.14162992]
1702: loss: 532.5291872425305
1702: m: [4854.68393389 988.44482143]
1802: loss: 532.7930688072918
1802: m: [4860.3578364 958.02605338]
1902: loss: 532.5027597249214
1902: m: [4896.79865359 937.60044292]
2002: loss: 532.5811608417331
2002: m: [4916.35037578 977.216822 ]
m1: 2.182135097993372
m2: 977.2168220048245
time cost: 0.1969621181488037
```
Even though roughly 2,000 iterations are run, the total time stays small and the result is fairly stable.
Full Code
```python
import numpy as np
import pandas as ps
import time

def main():
    time_start = time.time()

    m = [1, 1]
    eta = 0.1
    loss_max = 0.1
    batch = 10

    numlist = ps.read_csv('XXX\\train.csv')
    x = numlist['id']
    x_max = np.max(x)
    x_train = x/x_max
    y_train = numlist['questions']

    loss_last = loss(m, x_train, y_train)
    loss_now = 0

    print('-- start training ---')
    num = 0
    while np.abs(loss_last - loss_now) >= loss_max:
        # run mini-batch updates and only check the loss every 100 iterations
        while num % 100 != 0:
            grad_loss = grad(m, x_train, y_train, batch)
            m = update(eta, m, grad_loss)
            num += 1
        num += 1
        loss_last = loss_now
        loss_now = loss(m, x_train, y_train)
        num += 1
        print(str(num) + ': loss: ' + str(loss_now))
        print(str(num) + ': m: ' + str(m))
    time_end = time.time()
    print('m1: %s \nm2: %s' % (m[0]/x_max, m[1]))
    print('time cost: ' + str(time_end - time_start))
    return

def predict(m, x):
    return x*m[0] + m[1]

def loss(m, x_all, y_all):
    # root mean squared error, used only for the stopping check
    return np.sqrt(np.mean((x_all * m[0] + m[1] - y_all)**2))

def randon_int(n, x_all, y_all):
    # draw n random samples (with replacement) from x_all and y_all
    x_rand = []
    y_rand = []
    while n >= 1:
        n -= 1
        num_x = np.random.randint(0, len(x_all))
        x_rand.append(x_all[num_x])
        y_rand.append(y_all[num_x])
    return np.array(x_rand), np.array(y_rand)

def grad(m, x_all, y_all, batch):
    # gradient estimated on a random mini-batch of `batch` samples
    grad_loss = [0, 0]
    x_rand, y_rand = randon_int(batch, x_all, y_all)
    grad_loss[0] = 2 * np.mean(x_rand * (x_rand * m[0] + m[1] - y_rand))
    grad_loss[1] = 2 * np.mean(x_rand * m[0] + m[1] - y_rand)
    return np.array(grad_loss)

def update(eta, m, grad_loss):
    return np.array(m) - eta * grad_loss

if __name__ == "__main__":
    main()
```