Learning the Scikit-learn Library: Regression Analysis
1. Raw data
Grain output is influenced by many factors. This analysis starts from six of them: grain sown area, effectively irrigated area, chemical fertilizer application (pure nutrient weight), total agricultural machinery power, rural electricity consumption, and disaster-affected area. The data are as follows:
Year  Grain output (10,000 t)  Grain sown area (1,000 ha)  Effectively irrigated area (1,000 ha)  Fertilizer application, pure nutrient (10,000 t)  Agricultural machinery power (10,000 kW)  Rural electricity use (100 million kWh)  Disaster-affected area (1,000 ha)
1991 43529.3 112313.6 47822.07 2805.1 29388.6 963.2 27810
1992 44265.8 110559.7 48590.1 2930.2 30308.4 1107.1 25900
1993 45648.8 110508.7 48727.9 3151.9 31816.6 1244.9 23130
1994 44510.1 109543.7 48759.1 3317.9 33802.5 1473.9 31380
1995 46661.8 110060.4 49281.2 3593.7 36118.05 1655.66 22270
1996 50453.5 112547.92 50381.4 3827.93 38546.92 1812.7 21230
1997 49417.1 112912.1 51238.5 3980.7 42015.64 1980.1 30310
1998 51229.53 113787.4 52295.6 4085.6 45207.71 2042.1 25180
1999 50838.58 113160.98 53158.41 4124.32 48996.12 2173.4 26730
2000 46217.52 108462.54 53820.33 4146.41 52573.61 2421.3 34370
2001 45263.67 106080.04 54249.39 4253.76 55172.1 2610.8 31790
2002 45705.75 103890.83 54354.85 4339.39 57929.85 2993.4 27160
2003 43069.53 99410.37 54014.23 4411.56 60386.54 3432.9 32520
2004 46946.95 101606.03 54478.42 4636.58 64027.91 3933 16300
2005 48402.19 104278.38 55029.34 4766.22 68397.85 4375.7 19970
2006 49804.23 104957.7 55750.5 4927.69 72522.12 4895.82 24630
2007 50413.85 105998.62 56518.34 5107.83 76589.56 5509.93 25060
2008 53434.29 107544.51 58471.68 5239.02 82190.41 5713.15 22280
2009 53940.86 110255.09 59261.45 5404.35 87496.1 6104.44 21230
2010 55911.31 111695.42 60347.7 5561.68 92780.48 6632.35 18540
2011 58849.33 112980.35 61681.56 5704.24 97734.66 7139.62 12440
2012 61222.62 114368.04 62490.52 5838.85 102558.96 7508.46 11470
2013 63048.2 115907.54 63473.3 5911.86 103906.75 8549.52 14300
2014 63964.83 117455.18 64539.53 5995.94 108056.58 8884.45 12680
2015 66060.27 118962.81 65872.64 6022.6 111728.07 9026.92 12380
2016 66043.51 119230.06 67140.62 5984.41 97245.59 9238.26 13670
2017 66160.73 117989.06 67815.57 5859.41 98783.35 9524.42 9200
2018 65789.22 117038.21 68271.64 5653.42 100371.74 9358.54 10569
2019 66384.34 116063.6 68678.61 5403.59 102758.26 9482.87 7913
2020 66949.15 116768.17 69160.52 5250.65 105622.15 6210.98 7993
2021 68284.75 117630.82 69609.48 5191.26 107764.32 6736.3 4682
2022 68652.77 118332.11 70358.87 5079.2 110597.19 7765.57 4373
2023 69540.99 118968.54 71644 5021.74 113742.57 7991.9 4797
Data download address:
https://data.stats.gov.cn/easyquery.htm?cn=C01
2. Python script
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.svm import LinearSVR
from sklearn.linear_model import SGDRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
data = pd.read_excel(io='数据.xlsx', index_col=0)  # the table above, with the year as index
X = data.iloc[:, 1:]  # the six explanatory variables
y = data.iloc[:, 0]   # grain output (target)
# Use separate scalers for X and y so that either transform can be inverted later;
# refitting a single scaler on y would overwrite the parameters fitted on X.
x_scaler = MinMaxScaler()
X = x_scaler.fit_transform(X)
y_scaler = MinMaxScaler()
y = np.ravel(y_scaler.fit_transform(np.array(y).reshape(-1, 1)))
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=1)
# Linear regression
model = LinearRegression()  # copy_X, fit_intercept etc. are left at their defaults
model.fit(X_train, y_train)
result = {'Intercept': [model.intercept_], **{f'{data.columns[i + 1]}': [coef] for i, coef in enumerate(model.coef_)}}
result = pd.DataFrame(result, index=['Coef']).T
print('Linear regression:\n', result)
y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')
# Partial least squares (PLS) regression
model = PLSRegression(max_iter=5000)
model.fit(X_train, y_train)
result = {'Intercept': model.intercept_, **{f'{data.columns[i + 1]}': [coef] for i, coef in enumerate(model.coef_[0])}}
result = pd.DataFrame(result, index=['Coef']).T
print('\nPartial least squares regression:\n', result)
y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')
# Support vector regression
model = LinearSVR(dual='auto', max_iter=5000, random_state=1)
model.fit(X_train, y_train)
result = {'Intercept': model.intercept_, **{f'{data.columns[i + 1]}': [coef] for i, coef in enumerate(model.coef_)}}
result = pd.DataFrame(result, index=['Coef']).T
print('\nSupport vector regression:\n', result)
y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')
# Stochastic gradient descent (SGD) regression
model = SGDRegressor(loss='epsilon_insensitive', max_iter=5000, random_state=1)
model.fit(X_train, y_train)
result = {'Intercept': model.intercept_, **{f'{data.columns[i + 1]}': [coef] for i, coef in enumerate(model.coef_)}}
result = pd.DataFrame(result, index=['Coef']).T
print('\nSGD regression:\n', result)
y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')
# Kernel ridge regression (with a linear kernel this is equivalent to ridge regression)
model = KernelRidge(alpha=1.0, kernel='linear')
model.fit(X_train, y_train)
print('\nKernel ridge regression:')
y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')
# K-nearest neighbors regression
model = KNeighborsRegressor(algorithm='auto')
model.fit(X_train, y_train)
print('\nK-nearest neighbors regression:')
y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')
# Gaussian process regression
model = GaussianProcessRegressor(random_state=1)
model.fit(X_train, y_train)
print('\nGaussian process regression:')
y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')
# Decision tree regression
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)
print('\nDecision tree regression:')
y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')
# Random forest regression
model = RandomForestRegressor(random_state=1)
model.fit(X_train, y_train)
print('\nRandom forest regression:')
y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')
# Neural network (MLP) regression
model = MLPRegressor(random_state=1)  # the default max_iter=200 may trigger a ConvergenceWarning
model.fit(X_train, y_train)
print('\nNeural network regression:')
y_predict = model.predict(X_test)
ev = metrics.explained_variance_score(y_test, y_predict)
mae = metrics.mean_absolute_error(y_test, y_predict)
mse = metrics.mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
score = model.score(X_test, y_test)
print(f'\nExplained Variance Score:{ev}\nMAE:{mae}\nMSE:{mse}\nRMSE:{rmse}\nR2 Score:{score}')
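The ten per-model blocks above repeat the same fit/predict/metric boilerplate. As a hedged sketch (the `evaluate` helper name and the synthetic data are illustrative, not part of the original script), the evaluation could be factored into one function:

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit a model and return the same five metrics printed above."""
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    mse = metrics.mean_squared_error(y_test, y_predict)
    return {
        'EV': metrics.explained_variance_score(y_test, y_predict),
        'MAE': metrics.mean_absolute_error(y_test, y_predict),
        'MSE': mse,
        'RMSE': np.sqrt(mse),
        'R2': model.score(X_test, y_test),
    }

# Quick check on synthetic data with an exact linear relationship,
# so LinearRegression should recover it almost perfectly.
rng = np.random.default_rng(1)
X = rng.random((40, 6))
y = X @ np.array([1.0, 2.0, 0.5, -1.0, 0.3, 0.7])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=1)
scores = evaluate(LinearRegression(), X_tr, X_te, y_tr, y_te)
```

Each model in the script can then be scored with a single call, e.g. `evaluate(RandomForestRegressor(random_state=1), X_train, X_test, y_train, y_test)`.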
3. Output
Linear regression:
Coef
Intercept -0.042659
粮食作物播种面积(千公顷) 0.413728
有效灌溉面积(千公顷) 0.712346
农用化肥施用折纯量(万吨) 0.382265
农业机械总动力(万千瓦) -0.355217
农村用电量(亿千瓦小时) -0.003523
成灾面积(千公顷) -0.246983
Explained Variance Score:0.9793128550922058
MAE:0.04283378501648115
MSE:0.002849346595959882
RMSE:0.053379271219827294
R2 Score:0.9703413600563636
Partial least squares regression:
Coef
Intercept 0.356992
粮食作物播种面积(千公顷) 0.110331
有效灌溉面积(千公顷) 0.066494
农用化肥施用折纯量(万吨) 0.024808
农业机械总动力(万千瓦) 0.046250
农村用电量(亿千瓦小时) 0.043094
成灾面积(千公顷) -0.077288
Explained Variance Score:0.988007896804864
MAE:0.03196077200495074
MSE:0.0014267205623277619
RMSE:0.03777195470620712
R2 Score:0.98514937020359
Support vector regression:
Coef
Intercept -0.014827
粮食作物播种面积(千公顷) 0.396596
有效灌溉面积(千公顷) 0.478833
农用化肥施用折纯量(万吨) 0.211099
农业机械总动力(万千瓦) 0.010051
农村用电量(亿千瓦小时) -0.009123
成灾面积(千公顷) -0.241092
Explained Variance Score:0.9908459388803968
MAE:0.033182266634767546
MSE:0.0015271912200772659
RMSE:0.039079294006894055
R2 Score:0.984103578488634
SGD regression:
Coef
Intercept 0.039527
粮食作物播种面积(千公顷) 0.205599
有效灌溉面积(千公顷) 0.187572
农用化肥施用折纯量(万吨) 0.138283
农业机械总动力(万千瓦) 0.193506
农村用电量(亿千瓦小时) 0.183747
成灾面积(千公顷) -0.154441
Explained Variance Score:0.9310854851405879
MAE:0.06917097462178992
MSE:0.0068356266055746824
RMSE:0.08267784833662933
R2 Score:0.9288484635139364
Kernel ridge regression:
Explained Variance Score:0.9581051014199996
MAE:0.05746447497254316
MSE:0.004663208317882412
RMSE:0.06828768789381005
R2 Score:0.9514610062958476
K-nearest neighbors regression:
Explained Variance Score:0.9500953938148504
MAE:0.06499139072797641
MSE:0.005265357423870815
RMSE:0.07256278263594096
R2 Score:0.94519328037152
Gaussian process regression:
Explained Variance Score:0.9541859660302379
MAE:0.057675077483357054
MSE:0.005249304626932844
RMSE:0.07245208504199754
R2 Score:0.945360372758647
Decision tree regression:
Explained Variance Score:0.9154593097020374
MAE:0.08799087772264931
MSE:0.010441754256853264
RMSE:0.10218490229409266
R2 Score:0.8913125450153893
Random forest regression:
Explained Variance Score:0.9788265473254734
MAE:0.042065818054613915
MSE:0.002536649402791395
RMSE:0.050365160605237776
R2 Score:0.9735962022285022
Neural network regression:
Explained Variance Score:0.9643446618998314
MAE:0.06354909495638342
MSE:0.005136044450074681
RMSE:0.07166620158815926
R2 Score:0.9465392896409077
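For a quick side-by-side comparison, the R2 scores reported above can be collected into a Series and ranked (the numbers below are copied verbatim from the output):

```python
import pandas as pd

# Test-set R2 scores from the runs above.
r2 = {
    'LinearRegression': 0.9703413600563636,
    'PLSRegression': 0.98514937020359,
    'LinearSVR': 0.984103578488634,
    'SGDRegressor': 0.9288484635139364,
    'KernelRidge': 0.9514610062958476,
    'KNeighborsRegressor': 0.94519328037152,
    'GaussianProcessRegressor': 0.945360372758647,
    'DecisionTreeRegressor': 0.8913125450153893,
    'RandomForestRegressor': 0.9735962022285022,
    'MLPRegressor': 0.9465392896409077,
}
ranking = pd.Series(r2).sort_values(ascending=False)
print(ranking)  # PLSRegression ranks first, then LinearSVR and RandomForestRegressor
```

On a 33-row dataset a single 70/30 split is noisy, so this ranking should be read with caution; cross-validation would give a more stable comparison.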
Last updated on 2025-06-20