My scikit-learn Learning Journey: Clustering
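1. Raw data
This post uses the iris dataset. Its four feature variables are, in order, sepal length, sepal width, petal length, and petal width. The full data are listed below: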
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
5.1 3.5 1.4 0.2
4.9 3 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5 3.6 1.4 0.2
5.4 3.9 1.7 0.4
4.6 3.4 1.4 0.3
5 3.4 1.5 0.2
4.4 2.9 1.4 0.2
4.9 3.1 1.5 0.1
5.4 3.7 1.5 0.2
4.8 3.4 1.6 0.2
4.8 3 1.4 0.1
4.3 3 1.1 0.1
5.8 4 1.2 0.2
5.7 4.4 1.5 0.4
5.4 3.9 1.3 0.4
5.1 3.5 1.4 0.3
5.7 3.8 1.7 0.3
5.1 3.8 1.5 0.3
5.4 3.4 1.7 0.2
5.1 3.7 1.5 0.4
4.6 3.6 1 0.2
5.1 3.3 1.7 0.5
4.8 3.4 1.9 0.2
5 3 1.6 0.2
5 3.4 1.6 0.4
5.2 3.5 1.5 0.2
5.2 3.4 1.4 0.2
4.7 3.2 1.6 0.2
4.8 3.1 1.6 0.2
5.4 3.4 1.5 0.4
5.2 4.1 1.5 0.1
5.5 4.2 1.4 0.2
4.9 3.1 1.5 0.2
5 3.2 1.2 0.2
5.5 3.5 1.3 0.2
4.9 3.6 1.4 0.1
4.4 3 1.3 0.2
5.1 3.4 1.5 0.2
5 3.5 1.3 0.3
4.5 2.3 1.3 0.3
4.4 3.2 1.3 0.2
5 3.5 1.6 0.6
5.1 3.8 1.9 0.4
4.8 3 1.4 0.3
5.1 3.8 1.6 0.2
4.6 3.2 1.4 0.2
5.3 3.7 1.5 0.2
5 3.3 1.4 0.2
7 3.2 4.7 1.4
6.4 3.2 4.5 1.5
6.9 3.1 4.9 1.5
5.5 2.3 4 1.3
6.5 2.8 4.6 1.5
5.7 2.8 4.5 1.3
6.3 3.3 4.7 1.6
4.9 2.4 3.3 1
6.6 2.9 4.6 1.3
5.2 2.7 3.9 1.4
5 2 3.5 1
5.9 3 4.2 1.5
6 2.2 4 1
6.1 2.9 4.7 1.4
5.6 2.9 3.6 1.3
6.7 3.1 4.4 1.4
5.6 3 4.5 1.5
5.8 2.7 4.1 1
6.2 2.2 4.5 1.5
5.6 2.5 3.9 1.1
5.9 3.2 4.8 1.8
6.1 2.8 4 1.3
6.3 2.5 4.9 1.5
6.1 2.8 4.7 1.2
6.4 2.9 4.3 1.3
6.6 3 4.4 1.4
6.8 2.8 4.8 1.4
6.7 3 5 1.7
6 2.9 4.5 1.5
5.7 2.6 3.5 1
5.5 2.4 3.8 1.1
5.5 2.4 3.7 1
5.8 2.7 3.9 1.2
6 2.7 5.1 1.6
5.4 3 4.5 1.5
6 3.4 4.5 1.6
6.7 3.1 4.7 1.5
6.3 2.3 4.4 1.3
5.6 3 4.1 1.3
5.5 2.5 4 1.3
5.5 2.6 4.4 1.2
6.1 3 4.6 1.4
5.8 2.6 4 1.2
5 2.3 3.3 1
5.6 2.7 4.2 1.3
5.7 3 4.2 1.2
5.7 2.9 4.2 1.3
6.2 2.9 4.3 1.3
5.1 2.5 3 1.1
5.7 2.8 4.1 1.3
6.3 3.3 6 2.5
5.8 2.7 5.1 1.9
7.1 3 5.9 2.1
6.3 2.9 5.6 1.8
6.5 3 5.8 2.2
7.6 3 6.6 2.1
4.9 2.5 4.5 1.7
7.3 2.9 6.3 1.8
6.7 2.5 5.8 1.8
7.2 3.6 6.1 2.5
6.5 3.2 5.1 2
6.4 2.7 5.3 1.9
6.8 3 5.5 2.1
5.7 2.5 5 2
5.8 2.8 5.1 2.4
6.4 3.2 5.3 2.3
6.5 3 5.5 1.8
7.7 3.8 6.7 2.2
7.7 2.6 6.9 2.3
6 2.2 5 1.5
6.9 3.2 5.7 2.3
5.6 2.8 4.9 2
7.7 2.8 6.7 2
6.3 2.7 4.9 1.8
6.7 3.3 5.7 2.1
7.2 3.2 6 1.8
6.2 2.8 4.8 1.8
6.1 3 4.9 1.8
6.4 2.8 5.6 2.1
7.2 3 5.8 1.6
7.4 2.8 6.1 1.9
7.9 3.8 6.4 2
6.4 2.8 5.6 2.2
6.3 2.8 5.1 1.5
6.1 2.6 5.6 1.4
7.7 3 6.1 2.3
6.3 3.4 5.6 2.4
6.4 3.1 5.5 1.8
6 3 4.8 1.8
6.9 3.1 5.4 2.1
6.7 3.1 5.6 2.4
6.9 3.1 5.1 2.3
5.8 2.7 5.1 1.9
6.8 3.2 5.9 2.3
6.7 3.3 5.7 2.5
6.7 3 5.2 2.3
6.3 2.5 5 1.9
6.5 3 5.2 2
6.2 3.4 5.4 2.3
5.9 3 5.1 1.8
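The table above can be reproduced directly from scikit-learn rather than typed in by hand. The following is a minimal sketch, assuming only that scikit-learn is installed; the variable names `iris` and `X_raw` are illustrative. It also prints the per-feature minimum and maximum, which shows that the four columns cover different ranges.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X_raw = iris.data  # NumPy array with one row per flower, one column per feature

print(iris.feature_names)   # column order used in the table above
print(X_raw.shape)          # (number of samples, number of features)
print(X_raw[:3])            # first three rows of the table

# Per-feature ranges: the columns are on different scales
print('min:', X_raw.min(axis=0))
print('max:', X_raw.max(axis=0))
```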
2. Python program
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import AffinityPropagation
from sklearn.cluster import MeanShift
from sklearn.cluster import SpectralClustering
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import FeatureAgglomeration
from sklearn.cluster import DBSCAN
from sklearn.cluster import OPTICS
from sklearn.cluster import Birch
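# Show every column and keep wide DataFrames on one line when printing
pd.set_option('display.max_columns', None)
pd.set_option('expand_frame_repr', False)
# Load the iris data into a DataFrame and save a copy
# (writing .xlsx needs an Excel writer engine such as openpyxl)
iris_dataset = load_iris()
data = pd.DataFrame(iris_dataset.get('data'), columns=iris_dataset.get('feature_names'))
data.to_excel('iris dataset.xlsx', index=False)
print(data)
# Rescale every feature to [0, 1] before clustering
scaler = MinMaxScaler()
X = scaler.fit_transform(data)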
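# When the true cluster labels are unknown, clustering quality can be assessed with three indices:
# Silhouette coefficient, range [-1, 1]; the closer to 1, the better
# Calinski-Harabasz index, range [0, +∞); larger is better
# Davies-Bouldin index, range [0, +∞); smaller is better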
# K-means clustering
model = KMeans(n_init='auto', n_clusters=3, max_iter=5000, random_state=1)
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabasz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nK-means clustering:\n', scores)
# Mini-batch K-means clustering
model = MiniBatchKMeans(n_init='auto', n_clusters=3, max_iter=5000, random_state=1)
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabasz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nMini-batch K-means clustering:\n', scores)
# Affinity Propagation clustering (the number of clusters is chosen by the algorithm)
model = AffinityPropagation(max_iter=5000, random_state=1)
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabasz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nAffinity Propagation clustering:\n', scores)
# Mean shift clustering (the number of clusters is chosen by the algorithm)
model = MeanShift(max_iter=5000)
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabasz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nMean shift clustering:\n', scores)
# Spectral clustering
model = SpectralClustering(n_clusters=3, random_state=1)
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabasz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nSpectral clustering:\n', scores)
# Hierarchical (agglomerative) clustering
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabasz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nHierarchical clustering:\n', scores)
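# Feature agglomeration clustering
# FeatureAgglomeration groups features (columns), so it is fitted on X.T here to make
# the 150 samples the items being grouped; labels_ then has one entry per sample
model = FeatureAgglomeration(n_clusters=3, linkage='ward')
model.fit(X.T)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabasz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nFeature agglomeration clustering:\n', scores)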
# DBSCAN clustering
# Note: DBSCAN marks noise points with the label -1, and the three scores treat that label like any other cluster
model = DBSCAN(eps=0.3, algorithm='auto')
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabasz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nDBSCAN clustering:\n', scores)
# OPTICS clustering
# Note: eps only takes effect when cluster_method='dbscan'; with the default cluster_method='xi' the clusters are extracted using xi
model = OPTICS(eps=0.3, xi=0.3, algorithm='auto')
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabasz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nOPTICS clustering:\n', scores)
# Birch clustering (note: n_clusters=2 here, unlike the 3 used for the other algorithms that take n_clusters)
model = Birch(n_clusters=2)
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabasz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nBirch clustering:\n', scores)
3. Output
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
.. ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
[150 rows x 4 columns]
K-means clustering:
Score
Silhouette 0.482929
Calinski-Harabasz 351.295064
Davies-Bouldin 0.786733
Mini-batch K-means clustering:
Score
Silhouette 0.482775
Calinski-Harabasz 351.550040
Davies-Bouldin 0.788390
Affinity Propagation clustering:
Score
Silhouette 0.318824
Calinski-Harabasz 234.581396
Davies-Bouldin 1.044406
Mean shift clustering:
Score
Silhouette 0.630047
Calinski-Harabasz 354.365556
Davies-Bouldin 0.486167
Spectral clustering:
Score
Silhouette 0.486720
Calinski-Harabasz 323.788307
Davies-Bouldin 0.775937
Hierarchical clustering:
Score
Silhouette 0.504800
Calinski-Harabasz 349.254185
Davies-Bouldin 0.747977
Feature agglomeration clustering:
Score
Silhouette 0.504800
Calinski-Harabasz 349.254185
Davies-Bouldin 0.747977
DBSCAN clustering:
Score
Silhouette 0.468185
Calinski-Harabasz 185.451204
Davies-Bouldin 0.464860
OPTICS clustering:
Score
Silhouette 0.630047
Calinski-Harabasz 354.365556
Davies-Bouldin 0.486167
Birch clustering:
Score
Silhouette 0.438651
Calinski-Harabasz 193.643218
Davies-Bouldin 0.788608
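When reading these numbers, it may help to know how many clusters each algorithm actually produced. Mean shift, DBSCAN, and OPTICS choose the number of clusters themselves, and DBSCAN and OPTICS can additionally label points as noise (-1). The mean shift and OPTICS scores above are identical, which would be expected if both ended up with the same partition. The sketch below is one way to check this; it reuses the constructor arguments from the program, while `describe_labels` and `models` are illustrative names.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS, MeanShift
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = MinMaxScaler().fit_transform(load_iris().data)

# Same settings as in the program above
models = {
    'Mean shift': MeanShift(max_iter=5000),
    'DBSCAN': DBSCAN(eps=0.3, algorithm='auto'),
    'OPTICS': OPTICS(eps=0.3, xi=0.3, algorithm='auto'),
}


def describe_labels(labels):
    """Return (number of clusters excluding noise, number of noise points)."""
    labels = np.asarray(labels)
    n_noise = int(np.sum(labels == -1))
    n_clusters = len(set(labels.tolist()) - {-1})
    return n_clusters, n_noise


for name, model in models.items():
    n_clusters, n_noise = describe_labels(model.fit(X).labels_)
    print(f'{name}: {n_clusters} clusters, {n_noise} noise points')
```

If the cluster counts differ between algorithms, the three indices are comparing partitions of different granularity, so a higher score does not by itself mean a better match to the three iris species.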
Last updated on 2025-06-22