A First Look at Machine Learning with sklearn: the KNN Algorithm

Contents
  1. Preface
  2. Setting up Python and sklearn
    1. Python setup
    2. sklearn setup
  3. The KNN algorithm
    1. Overview
    2. The idea behind KNN
    3. How KNN works
  4. Functions used in the code
    1. load_iris()
    2. train_test_split(···)
    3. KNeighborsClassifier(k)
    4. accuracy_score(···)
  5. The code
    1. Inspecting the dataset structure
    2. Main program

Preface

Strictly speaking, this isn't my first encounter with sklearn. But after all this time I've forgotten most of what I once knew, and since my new machine only just got Python installed, it seemed like a good opportunity to document the environment setup and the pitfalls along the way.

IDE: VS Code; dataset: the Iris dataset; language: Python


Setting up Python and sklearn

Python setup

The editor is VS Code. Download the latest official Python release, and tick the "Add to PATH" option during installation.

Then install the Python extension in VS Code.

Create a new folder, print hello world — the interpreter works.

sklearn setup

In a terminal, run:

$ pip install scikit-learn

Take your hands off the keyboard and wait for the install to finish.

If the install fails with a permissions error:

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied.

add the --user flag and run it again: pip install --user scikit-learn


The KNN algorithm

Overview

  • Full name: k-nearest neighbors
  • Good for: automatic classification when the sample size is large
  • Drawbacks: computationally expensive; biased when classes are imbalanced

The idea behind KNN

If the majority of a sample's K nearest neighbors in the sample space belong to one class, the sample is assumed to belong to that class too.

"If it walks like a duck, looks like a duck, and quacks like a duck, then it's a duck."

How KNN works

  • Compute the distance from the test sample to every other sample in the sample space
  • Sort the distances and take the k nearest points
  • Look up the classes of those k samples, and assign the test sample to the class that appears most often among them
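The three steps above can be sketched in a few lines of plain Python (a toy illustration, not how sklearn implements it; the helper name and data are made up here):

```python
from collections import Counter
import math

def knn_predict(train_x, train_y, query, k):
    # Step 1: distance from the query point to every training sample
    dists = [(math.dist(query, x), y) for x, y in zip(train_x, train_y)]
    # Step 2: sort by distance and keep the k nearest neighbors
    nearest = sorted(dists, key=lambda d: d[0])[:k]
    # Step 3: majority vote among the k neighbors' labels
    votes = Counter(y for _, y in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(train_x, train_y, (0.5, 0.5), k=3))  # prints "a"
```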

Functions used in the code

load_iris()

A built-in sklearn function that returns the Iris dataset.

Module: sklearn.datasets

Useful members of the returned object:

  • keys(): list the available keys
  • data: the feature matrix (x)
  • target: the label array (y)
  • various other metadata
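Note that data and target are attributes (NumPy arrays), not methods. A quick sanity check of their shapes:

```python
from sklearn.datasets import load_iris

data = load_iris()
print(data.data.shape)          # (150, 4): 150 samples, 4 features
print(data.target.shape)        # (150,): one label per sample
print(list(data.target_names))  # ['setosa', 'versicolor', 'virginica']
```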

train_test_split(···)

A built-in sklearn function that splits the data into training and test sets.

Arguments, in order: data (feature matrix), target (labels), test_size=? (a float, the fraction of data used for testing), random_state=? (random seed; leave it out for a different split on every run, or pass a fixed integer for a reproducible split), stratify=? (pass the labels to preserve the class proportions of the original data; optional, not needed here)

Returns, in order: x_train, x_test, y_train, y_test

Module: sklearn.model_selection
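For example, a reproducible 7:3 split that also keeps the class proportions (the seed value here is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target,
    test_size=0.3,         # 30% of the samples go to the test set
    random_state=42,       # fixed seed -> same split on every run
    stratify=data.target)  # keep the 1:1:1 class ratio in both splits
print(len(x_train), len(x_test))  # 105 45
```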

KNeighborsClassifier(k)

A built-in sklearn class that handles everything KNN-related. k is the number of neighbors; an odd value is recommended to avoid ties.

Returns the constructed model.

Module: sklearn.neighbors

Methods:

  • the constructor itself: builds the untrained model
  • fit(x_train, y_train): trains the model on the training data x, y
  • predict(x_test): predicts labels for the test features x_test using the trained model, returning the predicted y
  • score(x_test, y_test): scores the model directly on the test data
  • others
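A minimal end-to-end use of these methods (the 1-D toy data here is made up for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# toy data: 1-D points in two well-separated groups
X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

model = KNeighborsClassifier(3)   # k = 3
model.fit(X, y)                   # train
print(model.predict([[1.5]]))     # [0] -- the 3 nearest neighbors are all class 0
print(model.score([[0.5], [11.5]], [0, 1]))  # 1.0 -- both predictions correct
```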

accuracy_score(···)

A built-in sklearn function that compares two label arrays (y) and returns a score.

Input: the two label arrays to compare

Output: the score (accuracy)

Module: sklearn.metrics

Note: it may be better to import the specific submodule directly rather than just import sklearn; otherwise some functions may not be found (the sklearn package is simply too large).

For example:

from sklearn import metrics
# usage:
print("score:", metrics.accuracy_score(lab_test, lab_pre))

The code

Wrapping the executable part in a main function is good practice:

if __name__ == '__main__':
    main()

Inspecting the dataset structure

Run:

def load_data():
    data = load_iris()

    # dataset structure
    print("Structure:")
    print("keys:", data.keys())
    print("target:", data.target)
    print("frame:", data.frame)
    print("target_names:", data.target_names)
    print("DESCR:", data.DESCR)
    print("feature_names:", data.feature_names)
    print("filename:", data.filename)
    print("data_module:", data.data_module)

Output:

Structure:
keys: dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
target: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
frame: None
target_names: ['setosa' 'versicolor' 'virginica']
DESCR: .. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
feature_names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
filename: iris.csv
data_module: sklearn.datasets.data

Main program

Based on the dataset structure, the data splitting, model training, and scoring code is as follows:

def load_data():
    data = load_iris()

    # split the data 7:3 into train/test
    x_train, x_test, lab_train, lab_test = train_test_split(data.data, data.target, test_size=0.3)
    return x_train, x_test, lab_train, lab_test


# main function
def main():
    x_train, x_test, lab_train, lab_test = load_data()
    print("------------ training and testing")
    # build the KNN model
    my_model = neighbors.KNeighborsClassifier(5)
    # train the model
    my_model.fit(x_train, lab_train)
    # predict on the test set
    lab_pre = my_model.predict(x_test)
    # score by comparing the predictions against the true labels
    print("score:", metrics.accuracy_score(lab_test, lab_pre))
    # the other scoring function:
    print("score:", my_model.score(x_test, lab_test))


if __name__ == '__main__':
    main()

Output:

score: 0.9555555555555556
score: 0.9555555555555556

The run succeeded. (Since random_state is not fixed, the split, and therefore the score, will vary between runs.)


Finally, the complete code:

from sklearn import neighbors
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load the data and inspect the dataset structure
def load_data():
    data = load_iris()

    # dataset structure
    print("Structure:")
    print("keys:", data.keys())
    print("target:", data.target)
    print("frame:", data.frame)
    print("target_names:", data.target_names)
    print("DESCR:", data.DESCR)
    print("feature_names:", data.feature_names)
    print("filename:", data.filename)
    print("data_module:", data.data_module)

    # split the data 7:3 into train/test
    x_train, x_test, lab_train, lab_test = train_test_split(data.data, data.target, test_size=0.3)
    return x_train, x_test, lab_train, lab_test


# main function
def main():
    x_train, x_test, lab_train, lab_test = load_data()
    print("------------ training and testing")
    # build the KNN model
    my_model = neighbors.KNeighborsClassifier(5)
    # train the model
    my_model.fit(x_train, lab_train)
    # predict on the test set
    lab_pre = my_model.predict(x_test)
    # score by comparing the predictions against the true labels
    print("score:", metrics.accuracy_score(lab_test, lab_pre))
    # the other scoring function:
    print("score:", my_model.score(x_test, lab_test))


if __name__ == '__main__':
    main()