Paper Reproduction: Pool-Based Sequential Active Learning for Regression

Published on 2021-04-17 20:21

[1] Wu D. Pool-Based Sequential Active Learning for Regression[J]. IEEE Transactions on Neural Networks and Learning Systems, 2018, PP(99): 1-12.

In practice, some clusters may contain multiple labeled samples, so usually, there are more than one cluster that do not contain any labeled sample. We then identify the largest cluster that does not contain any labeled sample as the current most representative cluster and select the sample closest to its centroid for labeling.
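
To make that selection step concrete before the full reproduction script below, here is a minimal sketch of a single query, assuming X is the pool feature matrix, labeled holds the indices of already-labeled samples, and the pool is clustered into len(labeled) + 1 k-means clusters as in the script; the function name rd_query is only illustrative:

import numpy as np
from sklearn.cluster import KMeans

def rd_query(X, labeled, n_clusters):
    """Pick the pool sample closest to the centroid of the largest cluster
    that currently contains no labeled sample."""
    assignments = KMeans(n_clusters=n_clusters).fit_predict(X)
    labeled_clusters = set(assignments[labeled])
    ids, counts = np.unique(assignments, return_counts=True)
    # Sizes of the clusters that have no labeled sample yet; with
    # n_clusters > len(labeled), at least one such cluster exists.
    empty = {c: n for c, n in zip(ids, counts) if c not in labeled_clusters}
    target = max(empty, key=empty.get)            # largest "unlabeled" cluster
    members = np.where(assignments == target)[0]
    centroid = X[members].mean(axis=0)
    dists = np.linalg.norm(X[members] - centroid, axis=1)
    return members[int(np.argmin(dists))]         # sample closest to the centroid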
 

  1. """
  2. Author:Daniel
  3. Date: 2021-04-16
  4. """
  5. import numpy as np
  6. import pandas as pd
  7. import xlwt
  8. from pathlib import Path
  9. from copy import deepcopy
  10. from collections import OrderedDict
  11. from sklearn.linear_model import LogisticRegression
  12. from sklearn.model_selection import StratifiedKFold
  13. from sklearn.metrics import accuracy_score,mean_absolute_error,f1_score,recall_score
  14. from sklearn.cluster import KMeans
  15. from time import time
  16. from sklearn import datasets
  17. import matplotlib.pyplot as plt
  18. class ALRD():
  19. def __init__(self, X_pool, y_pool, labeled, budget, X_test, y_test):
  20. self.X_pool = X_pool
  21. self.y_pool = y_pool
  22. self.X_test = X_test
  23. self.y_test = y_test
  24. self.nSample, self.nAtt = self.X_pool.shape
  25. self.labeled = list(deepcopy(labeled))
  26. self.unlabeled = self.init_unlabeled()
  27. self.labels = np.sort(np.unique(y_pool))
  28. self.nClass = len(self.labels)
  29. self.budgetLeft = deepcopy(budget)
  30. self.LATmodel = LogisticRegression()
  31. ## ----------------
  32. ## Evaluation criteria
  33. self.AccList = []
  34. self.MAEList = []
  35. self.RecallList = []
  36. self.FscoreList = []
  37. self.ALC_ACC = 0
  38. self.ALC_MAE = 0
  39. self.ALC_F1 = 0
  40. self.ALC_Recall = 0
  41. def init_unlabeled(self):
  42. unlabeled = [i for i in range(self.nSample)]
  43. for idx in self.labeled:
  44. unlabeled.remove(idx)
  45. return unlabeled
  46. def evaluation(self):
  47. self.LATmodel.fit(X=self.X_pool[self.labeled],y=self.y_pool[self.labeled])
  48. y_pred = self.LATmodel.predict(X=self.X_test)
  49. Acc = accuracy_score(y_pred=y_pred,y_true=self.y_test)
  50. MAE = mean_absolute_error(y_pred=y_pred,y_true=self.y_test)
  51. F1 = f1_score(y_pred=y_pred,y_true=self.y_test,average='macro')
  52. Recall = recall_score(y_pred=y_pred,y_true=self.y_test,average='macro')
  53. self.AccList.append(Acc)
  54. self.MAEList.append(MAE)
  55. self.FscoreList.append(F1)
  56. self.RecallList.append(Recall)
  57. self.ALC_ACC += Acc
  58. self.ALC_MAE += MAE
  59. self.ALC_F1 += F1
  60. self.ALC_Recall += Recall
  61. def select(self):
  62. while self.budgetLeft > 0:
  63. nCluster = len(self.labeled)+1
  64. y_pred = KMeans(n_clusters=nCluster).fit_predict(self.X_pool)
  65. cluster_labels, count = np.unique(y_pred,return_counts=True)
  66. cluster_dict = OrderedDict()
  67. for i in cluster_labels:
  68. cluster_dict[i] = []
  69. for idx in self.labeled:
  70. cluster_dict[y_pred[idx]].append(idx)
  71. empty_ids = OrderedDict()
  72. for i in cluster_labels:
  73. if len(cluster_dict[i]) == 0:
  74. empty_ids[i] = count[i]
  75. tar_label = max(empty_ids,key=empty_ids.get)
  76. tar_cluster_ids = []
  77. for idx in range(self.nSample):
  78. if y_pred[idx] == tar_label:
  79. tar_cluster_ids.append(idx)
  80. centroid = np.mean(self.X_pool[tar_cluster_ids],axis=0)
  81. tar_idx = None
  82. close_dist = np.inf
  83. for idx in tar_cluster_ids:
  84. if np.linalg.norm(self.X_pool[idx] - centroid) < close_dist:
  85. close_dist = np.linalg.norm(self.X_pool[idx] - centroid)
  86. tar_idx = idx
  87. self.labeled.append(tar_idx)
  88. self.unlabeled.remove(tar_idx)
  89. self.budgetLeft -= 1
  90. if __name__ == '__main__':
  91. add = [0, 6, 12, 19]
  92. data = None
  93. for i in range(4):
  94. X, yy = datasets.make_blobs(n_samples=200, n_features=2, center_box=(0, 0), cluster_std=3, random_state=45)
  95. X += add[i]
  96. if i == 0:
  97. data = X
  98. else:
  99. data = np.vstack((data, X))
  100. y = np.ones(800)
  101. y[:200] = 0
  102. y[200:400] = 1
  103. y[400:600] = 2
  104. y[600:800] = 3
  105. X = data
  106. labeled = [6, 206, 406, 606]
  107. budget = 40
  108. model = ALRD(X_pool=X,y_pool=y,labeled=labeled,budget=budget,X_test=X,y_test=y)
  109. model.select()
  110. plt.scatter(X[:,0],X[:,1],c=y)
  111. plt.scatter(X[model.labeled][:, 0], X[model.labeled][:, 1], c=y[model.labeled],edgecolors='r',linewidths=1)
  112. plt.show()

 

This algorithm is rather time-consuming: k-means is re-run on the entire pool at every query, with one more cluster each round, so the computational cost is high. The sample-selection quality is quite good, though, and much better than GS, which starts out selecting only outliers!
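
For comparison, the GS baseline mentioned above is greedy sampling in the input space (GSx in the same paper): each query maximizes the minimum distance to the samples already selected, which is why its first picks tend to be outliers on the fringe of the data. A minimal sketch of that criterion, assuming X is the pool matrix and selected holds the already-queried indices (gs_query is an illustrative name, not code from the original post):

import numpy as np

def gs_query(X, selected):
    """GSx-style greedy sampling: return the unlabeled sample whose minimum
    distance to the already-selected samples is largest."""
    chosen = set(selected)
    candidates = [i for i in range(len(X)) if i not in chosen]
    # Distance from each candidate to its nearest selected sample.
    min_dists = [np.min(np.linalg.norm(X[selected] - X[i], axis=1)) for i in candidates]
    return candidates[int(np.argmax(min_dists))]

Because the max-min criterion rewards points far from everything chosen so far, the early queries land on the boundary of the pool, whereas the RD strategy always anchors its query to the centroid of a large, still-unlabeled cluster.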

Original post: https://blog.csdn.net/DeniuHe/article/details/115702400





