Learning the scikit-learn Library: Regression Analysis


1. Raw Data

Many factors influence grain output. Here we analyze six of them: grain sown area, effectively irrigated area, pure-nutrient fertilizer application, total agricultural machinery power, rural electricity consumption, and disaster-affected area. The data are as follows:

Year	Grain output (10^4 t)	Grain sown area (10^3 ha)	Effectively irrigated area (10^3 ha)	Fertilizer applied, pure nutrient (10^4 t)	Total agricultural machinery power (10^4 kW)	Rural electricity use (10^8 kWh)	Disaster-affected area (10^3 ha)
1991	43529.3	112313.6	47822.07	2805.1	29388.6	963.2	27810
1992	44265.8	110559.7	48590.1	2930.2	30308.4	1107.1	25900
1993	45648.8	110508.7	48727.9	3151.9	31816.6	1244.9	23130
1994	44510.1	109543.7	48759.1	3317.9	33802.5	1473.9	31380
1995	46661.8	110060.4	49281.2	3593.7	36118.05	1655.66	22270
1996	50453.5	112547.92	50381.4	3827.93	38546.92	1812.7	21230
1997	49417.1	112912.1	51238.5	3980.7	42015.64	1980.1	30310
1998	51229.53	113787.4	52295.6	4085.6	45207.71	2042.1	25180
1999	50838.58	113160.98	53158.41	4124.32	48996.12	2173.4	26730
2000	46217.52	108462.54	53820.33	4146.41	52573.61	2421.3	34370
2001	45263.67	106080.04	54249.39	4253.76	55172.1	2610.8	31790
2002	45705.75	103890.83	54354.85	4339.39	57929.85	2993.4	27160
2003	43069.53	99410.37	54014.23	4411.56	60386.54	3432.9	32520
2004	46946.95	101606.03	54478.42	4636.58	64027.91	3933	16300
2005	48402.19	104278.38	55029.34	4766.22	68397.85	4375.7	19970
2006	49804.23	104957.7	55750.5	4927.69	72522.12	4895.82	24630
2007	50413.85	105998.62	56518.34	5107.83	76589.56	5509.93	25060
2008	53434.29	107544.51	58471.68	5239.02	82190.41	5713.15	22280
2009	53940.86	110255.09	59261.45	5404.35	87496.1	6104.44	21230
2010	55911.31	111695.42	60347.7	5561.68	92780.48	6632.35	18540
2011	58849.33	112980.35	61681.56	5704.24	97734.66	7139.62	12440
2012	61222.62	114368.04	62490.52	5838.85	102558.96	7508.46	11470
2013	63048.2	115907.54	63473.3	5911.86	103906.75	8549.52	14300
2014	63964.83	117455.18	64539.53	5995.94	108056.58	8884.45	12680
2015	66060.27	118962.81	65872.64	6022.6	111728.07	9026.92	12380
2016	66043.51	119230.06	67140.62	5984.41	97245.59	9238.26	13670
2017	66160.73	117989.06	67815.57	5859.41	98783.35	9524.42	9200
2018	65789.22	117038.21	68271.64	5653.42	100371.74	9358.54	10569
2019	66384.34	116063.6	68678.61	5403.59	102758.26	9482.87	7913
2020	66949.15	116768.17	69160.52	5250.65	105622.15	6210.98	7993
2021	68284.75	117630.82	69609.48	5191.26	107764.32	6736.3	4682
2022	68652.77	118332.11	70358.87	5079.2	110597.19	7765.57	4373
2023	69540.99	118968.54	71644	5021.74	113742.57	7991.9	4797

Data download address:

https://data.stats.gov.cn/easyquery.htm?cn=C01
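The program in the next section expects this table saved as 数据.xlsx, with the year as the index and yield in the first column (the Chinese column headers from the source file are kept, since the program's output labels come from them). A minimal sketch of assembling that layout with pandas, using only the first two rows above:

```python
import pandas as pd

# Column headers as they appear in the downloaded file
columns = ['粮食产量(万吨)', '粮食作物播种面积(千公顷)', '有效灌溉面积(千公顷)',
           '农用化肥施用折纯量(万吨)', '农业机械总动力(万千瓦)',
           '农村用电量(亿千瓦小时)', '成灾面积(千公顷)']
rows = {
    1991: [43529.3, 112313.6, 47822.07, 2805.1, 29388.6, 963.2, 27810],
    1992: [44265.8, 110559.7, 48590.1, 2930.2, 30308.4, 1107.1, 25900],
}
df = pd.DataFrame.from_dict(rows, orient='index', columns=columns)
# df.to_excel('数据.xlsx')  # writing the file requires openpyxl
print(df.shape)
```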

2. Python Program

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.svm import LinearSVR
from sklearn.linear_model import SGDRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

data = pd.read_excel(io='数据.xlsx', index_col=0)
X = data.iloc[:, 1:]
y = data.iloc[:, 0]
scaler_X = MinMaxScaler()
scaler_y = MinMaxScaler()  # separate scaler for y, so X's min/max parameters are not overwritten
X = scaler_X.fit_transform(X)
y = np.ravel(scaler_y.fit_transform(np.array(y).reshape(-1, 1)))

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=1)
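Note that the scaler above is fit on the full dataset before splitting, so the test set's minima and maxima leak into the scaling. A leak-free alternative, sketched here on synthetic stand-in data rather than the grain dataset, wraps the scaler and model in a Pipeline so scaling parameters are learned from the training fold only:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 33 samples, 6 features, near-linear target
rng = np.random.default_rng(1)
X_demo = rng.random((33, 6))
y_demo = X_demo @ rng.random(6) + 0.1 * rng.random(33)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, train_size=0.7, random_state=1)
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
pipe.fit(Xtr, ytr)            # scaler parameters come from the training split only
print(pipe.score(Xte, yte))   # R2 on the untouched test split
```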

# Linear regression
model = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1)
model.fit(X_train, y_train)
result = {'Intercept': [model.intercept_], **{f'{data.columns[i + 1]}': [coef] for i, coef in enumerate(model.coef_)}}
result = pd.DataFrame(result, index=['Coef']).T
print('Linear regression:\n', result)

y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')

# Partial least squares regression
model = PLSRegression(max_iter=5000)
model.fit(X_train, y_train)
result = {'Intercept': model.intercept_, **{f'{data.columns[i + 1]}': [coef] for i, coef in enumerate(model.coef_[0])}}
result = pd.DataFrame(result, index=['Coef']).T
print('\nPartial least squares regression:\n', result)

y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')

# Support vector regression
model = LinearSVR(dual='auto', max_iter=5000, random_state=1)
model.fit(X_train, y_train)
result = {'Intercept': model.intercept_, **{f'{data.columns[i + 1]}': [coef] for i, coef in enumerate(model.coef_)}}
result = pd.DataFrame(result, index=['Coef']).T
print('\nSupport vector regression:\n', result)

y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')

# Stochastic gradient descent regression
model = SGDRegressor(loss='epsilon_insensitive', max_iter=5000, random_state=1)
model.fit(X_train, y_train)
result = {'Intercept': model.intercept_, **{f'{data.columns[i + 1]}': [coef] for i, coef in enumerate(model.coef_)}}
result = pd.DataFrame(result, index=['Coef']).T
print('\nStochastic gradient descent regression:\n', result)

y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')

# Kernel ridge regression (KernelRidge, not plain ridge)
model = KernelRidge(alpha=1.0, kernel='linear')
model.fit(X_train, y_train)
print('\nKernel ridge regression:')

y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')

# K-nearest neighbors regression
model = KNeighborsRegressor(algorithm='auto')
model.fit(X_train, y_train)
print('\nK-nearest neighbors regression:')

y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')

# Gaussian process regression
model = GaussianProcessRegressor(random_state=1)
model.fit(X_train, y_train)
print('\nGaussian process regression:')

y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')

# Decision tree regression
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)
print('\nDecision tree regression:')

y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')

# Random forest regression
model = RandomForestRegressor(random_state=1)
model.fit(X_train, y_train)
print('\nRandom forest regression:')

y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')

# Neural network (MLP) regression
model = MLPRegressor(random_state=1)
model.fit(X_train, y_train)
print('\nNeural network regression:')

y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')
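The nine blocks above repeat the same fit-and-evaluate code. A more compact pattern, sketched here on synthetic stand-in data with a subset of the models, loops over a dict of estimators and collects the same metrics:

```python
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: 33 samples, 6 features, linear target
rng = np.random.default_rng(1)
X_demo = rng.random((33, 6))
y_demo = X_demo @ rng.random(6)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, train_size=0.7, random_state=1)

models = {
    'Linear regression': LinearRegression(),
    'K-nearest neighbors': KNeighborsRegressor(),
    'Random forest': RandomForestRegressor(random_state=1),
}
results = {}
for name, model in models.items():
    y_pred = model.fit(Xtr, ytr).predict(Xte)
    results[name] = {
        'MAE': metrics.mean_absolute_error(yte, y_pred),
        'RMSE': np.sqrt(metrics.mean_squared_error(yte, y_pred)),
        'R2': model.score(Xte, yte),
    }
for name, r in results.items():
    print(f"{name}: MAE={r['MAE']:.4f} RMSE={r['RMSE']:.4f} R2={r['R2']:.4f}")
```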

3. Output

Linear regression
                    Coef
Intercept     -0.042659
粮食作物播种面积(千公顷)  0.413728
有效灌溉面积(千公顷)    0.712346
农用化肥施用折纯量(万吨)  0.382265
农业机械总动力(万千瓦)  -0.355217
农村用电量(亿千瓦小时)  -0.003523
成灾面积(千公顷)     -0.246983

Explained Variance Score: 0.9793128550922058
MAE: 0.04283378501648115
MSE: 0.002849346595959882
RMSE: 0.053379271219827294
R2 Score: 0.9703413600563636
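The coefficients and metrics above are in min-max scaled units. To report errors in 万吨 (10^4 t), scaled predictions can be mapped back through a target scaler's inverse_transform; a sketch with hypothetical scaled prediction values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit a scaler on observed yield values; the min (43529.3) and max (69540.99)
# define the [0, 1] mapping
y_raw = np.array([43529.3, 50453.5, 61222.62, 69540.99]).reshape(-1, 1)
y_scaler = MinMaxScaler().fit(y_raw)

y_pred_scaled = np.array([[0.05], [0.30], [0.95]])  # hypothetical scaled predictions
print(y_scaler.inverse_transform(y_pred_scaled))    # values back in 10^4 t
```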

Partial least squares regression
                    Coef
Intercept      0.356992
粮食作物播种面积(千公顷)  0.110331
有效灌溉面积(千公顷)    0.066494
农用化肥施用折纯量(万吨)  0.024808
农业机械总动力(万千瓦)   0.046250
农村用电量(亿千瓦小时)   0.043094
成灾面积(千公顷)     -0.077288

Explained Variance Score: 0.988007896804864
MAE: 0.03196077200495074
MSE: 0.0014267205623277619
RMSE: 0.03777195470620712
R2 Score: 0.98514937020359

Support vector regression
                    Coef
Intercept     -0.014827
粮食作物播种面积(千公顷)  0.396596
有效灌溉面积(千公顷)    0.478833
农用化肥施用折纯量(万吨)  0.211099
农业机械总动力(万千瓦)   0.010051
农村用电量(亿千瓦小时)  -0.009123
成灾面积(千公顷)     -0.241092

Explained Variance Score: 0.9908459388803968
MAE: 0.033182266634767546
MSE: 0.0015271912200772659
RMSE: 0.039079294006894055
R2 Score: 0.984103578488634

Stochastic gradient descent regression
                    Coef
Intercept      0.039527
粮食作物播种面积(千公顷)  0.205599
有效灌溉面积(千公顷)    0.187572
农用化肥施用折纯量(万吨)  0.138283
农业机械总动力(万千瓦)   0.193506
农村用电量(亿千瓦小时)   0.183747
成灾面积(千公顷)     -0.154441

Explained Variance Score: 0.9310854851405879
MAE: 0.06917097462178992
MSE: 0.0068356266055746824
RMSE: 0.08267784833662933
R2 Score: 0.9288484635139364

Kernel ridge regression

Explained Variance Score: 0.9581051014199996
MAE: 0.05746447497254316
MSE: 0.004663208317882412
RMSE: 0.06828768789381005
R2 Score: 0.9514610062958476

K-nearest neighbors regression

Explained Variance Score: 0.9500953938148504
MAE: 0.06499139072797641
MSE: 0.005265357423870815
RMSE: 0.07256278263594096
R2 Score: 0.94519328037152

Gaussian process regression

Explained Variance Score: 0.9541859660302379
MAE: 0.057675077483357054
MSE: 0.005249304626932844
RMSE: 0.07245208504199754
R2 Score: 0.945360372758647

Decision tree regression

Explained Variance Score: 0.9154593097020374
MAE: 0.08799087772264931
MSE: 0.010441754256853264
RMSE: 0.10218490229409266
R2 Score: 0.8913125450153893

Random forest regression

Explained Variance Score: 0.9788265473254734
MAE: 0.042065818054613915
MSE: 0.002536649402791395
RMSE: 0.050365160605237776
R2 Score: 0.9735962022285022

Neural network regression

Explained Variance Score: 0.9643446618998314
MAE: 0.06354909495638342
MSE: 0.005136044450074681
RMSE: 0.07166620158815926
R2 Score: 0.9465392896409077
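With only 33 yearly observations, a 70/30 split leaves about 10 test points, so the rankings above are sensitive to random_state. K-fold cross-validation averages over several splits; a sketch on synthetic stand-in data (for time-ordered data like these yearly figures, TimeSeriesSplit may be more appropriate):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 33 samples, 6 features, near-linear target
rng = np.random.default_rng(1)
X_demo = rng.random((33, 6))
y_demo = X_demo @ rng.random(6) + 0.05 * rng.random(33)

# Five-fold cross-validated R2 instead of a single train/test split
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring='r2')
print(f'R2 per fold: {scores.round(4)}, mean: {scores.mean():.4f}')
```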
Last updated on 2025-06-20