My scikit-learn Learning Journey: Clustering


1. Raw data

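This time we use the iris dataset. The feature variables are, in order: sepal length, sepal width, petal length, and petal width. The full data is shown below: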

sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
5.1	3.5	1.4	0.2
4.9	3	1.4	0.2
4.7	3.2	1.3	0.2
4.6	3.1	1.5	0.2
5	3.6	1.4	0.2
5.4	3.9	1.7	0.4
4.6	3.4	1.4	0.3
5	3.4	1.5	0.2
4.4	2.9	1.4	0.2
4.9	3.1	1.5	0.1
5.4	3.7	1.5	0.2
4.8	3.4	1.6	0.2
4.8	3	1.4	0.1
4.3	3	1.1	0.1
5.8	4	1.2	0.2
5.7	4.4	1.5	0.4
5.4	3.9	1.3	0.4
5.1	3.5	1.4	0.3
5.7	3.8	1.7	0.3
5.1	3.8	1.5	0.3
5.4	3.4	1.7	0.2
5.1	3.7	1.5	0.4
4.6	3.6	1	0.2
5.1	3.3	1.7	0.5
4.8	3.4	1.9	0.2
5	3	1.6	0.2
5	3.4	1.6	0.4
5.2	3.5	1.5	0.2
5.2	3.4	1.4	0.2
4.7	3.2	1.6	0.2
4.8	3.1	1.6	0.2
5.4	3.4	1.5	0.4
5.2	4.1	1.5	0.1
5.5	4.2	1.4	0.2
4.9	3.1	1.5	0.2
5	3.2	1.2	0.2
5.5	3.5	1.3	0.2
4.9	3.6	1.4	0.1
4.4	3	1.3	0.2
5.1	3.4	1.5	0.2
5	3.5	1.3	0.3
4.5	2.3	1.3	0.3
4.4	3.2	1.3	0.2
5	3.5	1.6	0.6
5.1	3.8	1.9	0.4
4.8	3	1.4	0.3
5.1	3.8	1.6	0.2
4.6	3.2	1.4	0.2
5.3	3.7	1.5	0.2
5	3.3	1.4	0.2
7	3.2	4.7	1.4
6.4	3.2	4.5	1.5
6.9	3.1	4.9	1.5
5.5	2.3	4	1.3
6.5	2.8	4.6	1.5
5.7	2.8	4.5	1.3
6.3	3.3	4.7	1.6
4.9	2.4	3.3	1
6.6	2.9	4.6	1.3
5.2	2.7	3.9	1.4
5	2	3.5	1
5.9	3	4.2	1.5
6	2.2	4	1
6.1	2.9	4.7	1.4
5.6	2.9	3.6	1.3
6.7	3.1	4.4	1.4
5.6	3	4.5	1.5
5.8	2.7	4.1	1
6.2	2.2	4.5	1.5
5.6	2.5	3.9	1.1
5.9	3.2	4.8	1.8
6.1	2.8	4	1.3
6.3	2.5	4.9	1.5
6.1	2.8	4.7	1.2
6.4	2.9	4.3	1.3
6.6	3	4.4	1.4
6.8	2.8	4.8	1.4
6.7	3	5	1.7
6	2.9	4.5	1.5
5.7	2.6	3.5	1
5.5	2.4	3.8	1.1
5.5	2.4	3.7	1
5.8	2.7	3.9	1.2
6	2.7	5.1	1.6
5.4	3	4.5	1.5
6	3.4	4.5	1.6
6.7	3.1	4.7	1.5
6.3	2.3	4.4	1.3
5.6	3	4.1	1.3
5.5	2.5	4	1.3
5.5	2.6	4.4	1.2
6.1	3	4.6	1.4
5.8	2.6	4	1.2
5	2.3	3.3	1
5.6	2.7	4.2	1.3
5.7	3	4.2	1.2
5.7	2.9	4.2	1.3
6.2	2.9	4.3	1.3
5.1	2.5	3	1.1
5.7	2.8	4.1	1.3
6.3	3.3	6	2.5
5.8	2.7	5.1	1.9
7.1	3	5.9	2.1
6.3	2.9	5.6	1.8
6.5	3	5.8	2.2
7.6	3	6.6	2.1
4.9	2.5	4.5	1.7
7.3	2.9	6.3	1.8
6.7	2.5	5.8	1.8
7.2	3.6	6.1	2.5
6.5	3.2	5.1	2
6.4	2.7	5.3	1.9
6.8	3	5.5	2.1
5.7	2.5	5	2
5.8	2.8	5.1	2.4
6.4	3.2	5.3	2.3
6.5	3	5.5	1.8
7.7	3.8	6.7	2.2
7.7	2.6	6.9	2.3
6	2.2	5	1.5
6.9	3.2	5.7	2.3
5.6	2.8	4.9	2
7.7	2.8	6.7	2
6.3	2.7	4.9	1.8
6.7	3.3	5.7	2.1
7.2	3.2	6	1.8
6.2	2.8	4.8	1.8
6.1	3	4.9	1.8
6.4	2.8	5.6	2.1
7.2	3	5.8	1.6
7.4	2.8	6.1	1.9
7.9	3.8	6.4	2
6.4	2.8	5.6	2.2
6.3	2.8	5.1	1.5
6.1	2.6	5.6	1.4
7.7	3	6.1	2.3
6.3	3.4	5.6	2.4
6.4	3.1	5.5	1.8
6	3	4.8	1.8
6.9	3.1	5.4	2.1
6.7	3.1	5.6	2.4
6.9	3.1	5.1	2.3
5.8	2.7	5.1	1.9
6.8	3.2	5.9	2.3
6.7	3.3	5.7	2.5
6.7	3	5.2	2.3
6.3	2.5	5	1.9
6.5	3	5.2	2
6.2	3.4	5.4	2.3
5.9	3	5.1	1.8

2. Python program

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import AffinityPropagation
from sklearn.cluster import MeanShift
from sklearn.cluster import SpectralClustering
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import FeatureAgglomeration
from sklearn.cluster import DBSCAN
from sklearn.cluster import OPTICS
from sklearn.cluster import Birch

pd.set_option('display.max_columns', None)
pd.set_option('expand_frame_repr', False)

iris_dataset = load_iris()
data = pd.DataFrame(iris_dataset.get('data'), columns=iris_dataset.get('feature_names'))
data.to_excel('iris dataset.xlsx', index=False)
print(data)

scaler = MinMaxScaler()
X = scaler.fit_transform(data)

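# When the true cluster labels are unknown, scikit-learn provides the following three internal indices for evaluating clustering quality:
# Silhouette coefficient, range [-1, 1]; the closer to 1, the better
# Calinski-Harabasz index, range [0, +∞); higher is better
# Davies-Bouldin index, range [0, +∞); lower is better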

# K-means clustering
model = KMeans(n_init='auto', n_clusters=3, max_iter=5000, random_state=1)
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabaz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nK-means clustering:\n', scores)

# Mini-batch K-means clustering
model = MiniBatchKMeans(n_init='auto', n_clusters=3, max_iter=5000, random_state=1)
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabaz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nMini-batch K-means clustering:\n', scores)

# Affinity Propagation clustering
model = AffinityPropagation(max_iter=5000, random_state=1)
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabaz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nAffinity Propagation clustering:\n', scores)

# Mean shift clustering
model = MeanShift(max_iter=5000)
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabaz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nMean shift clustering:\n', scores)

# Spectral clustering
model = SpectralClustering(n_clusters=3, random_state=1)
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabaz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nSpectral clustering:\n', scores)

# Hierarchical (agglomerative) clustering
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabaz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nHierarchical clustering:\n', scores)

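# Feature agglomeration clustering
# FeatureAgglomeration groups features (columns) rather than samples, so it is fitted on X.T
# to make labels_ hold one label per sample; this amounts to Ward agglomerative clustering of the samples.
model = FeatureAgglomeration(n_clusters=3, linkage='ward')
model.fit(X.T)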
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabaz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nFeature agglomeration clustering:\n', scores)

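# DBSCAN clustering
# Note: DBSCAN labels noise points -1, and the metrics below treat those points as one more cluster.
model = DBSCAN(eps=0.3, algorithm='auto')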
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabaz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nDBSCAN clustering:\n', scores)

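# OPTICS clustering
# Note: eps is documented as used only when cluster_method='dbscan'; with the default
# cluster_method='xi' the clusters are extracted using xi. Noise points are labeled -1, as in DBSCAN.
model = OPTICS(eps=0.3, xi=0.3, algorithm='auto')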
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabaz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nOPTICS clustering:\n', scores)

# Birch clustering (note: n_clusters=2 here, whereas the models above that take n_clusters use 3)
model = Birch(n_clusters=2)
model.fit(X)
sil_score = silhouette_score(X, labels=model.labels_)
ch_score = calinski_harabasz_score(X, labels=model.labels_)
db_score = davies_bouldin_score(X, labels=model.labels_)
scores = pd.DataFrame([sil_score, ch_score, db_score], index=['Silhouette', 'Calinski-Harabaz', 'Davies-Bouldin'],
                      columns=['Score'])
print('\nBirch clustering:\n', scores)

3. Output

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
..                 ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8

[150 rows x 4 columns]

K-means clustering:
                        Score
Silhouette          0.482929
Calinski-Harabaz  351.295064
Davies-Bouldin      0.786733

Mini-batch K-means clustering:
                        Score
Silhouette          0.482775
Calinski-Harabaz  351.550040
Davies-Bouldin      0.788390

Affinity Propagation clustering:
                        Score
Silhouette          0.318824
Calinski-Harabaz  234.581396
Davies-Bouldin      1.044406

Mean shift clustering:
                        Score
Silhouette          0.630047
Calinski-Harabaz  354.365556
Davies-Bouldin      0.486167

Spectral clustering:
                        Score
Silhouette          0.486720
Calinski-Harabaz  323.788307
Davies-Bouldin      0.775937

Hierarchical clustering:
                        Score
Silhouette          0.504800
Calinski-Harabaz  349.254185
Davies-Bouldin      0.747977

Feature agglomeration clustering:
                        Score
Silhouette          0.504800
Calinski-Harabaz  349.254185
Davies-Bouldin      0.747977

DBSCAN clustering:
                        Score
Silhouette          0.468185
Calinski-Harabaz  185.451204
Davies-Bouldin      0.464860

OPTICS clustering:
                        Score
Silhouette          0.630047
Calinski-Harabaz  354.365556
Davies-Bouldin      0.486167

Birch clustering:
                        Score
Silhouette          0.438651
Calinski-Harabaz  193.643218
Davies-Bouldin      0.788608
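
When comparing these scores, keep in mind that the models do not all produce the same number of clusters, and that DBSCAN and OPTICS label noise points -1, so any noise is scored as if it were a cluster of its own. The sketch below is an optional check, not part of the original program: it counts the clusters and noise points DBSCAN finds with the same `eps=0.3` and recomputes the Silhouette score with noise excluded. The variable names are illustrative.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Same preprocessing as the program above
X = MinMaxScaler().fit_transform(load_iris().data)

labels = DBSCAN(eps=0.3).fit(X).labels_

# DBSCAN marks noise points with the label -1
noise_mask = labels == -1
n_clusters = len(set(labels[~noise_mask]))
print('clusters found:', n_clusters)
print('noise points:', int(noise_mask.sum()))

# Silhouette needs at least two clusters; score only the non-noise points
if n_clusters >= 2:
    score = silhouette_score(X[~noise_mask], labels[~noise_mask])
    print('Silhouette (noise excluded):', score)
```

Whether to exclude noise before scoring is a judgment call; the point is only that scores computed with and without noise are not directly comparable.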
Last updated on 2025-06-22