
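Maximum Likelihood Estimation, Bayesian Estimation, and Minimum-Error-Rate Bayes Decision for Classifying Excel Data (Introduction + Python Implementation)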

Published 2020-11-09 16:47



Pattern recognition study notes: sharing a worked course example.

 

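Contents

  • Part 1: Histogram of Male and Female 50 m Sprint Times
  • Part 2: Maximum Likelihood Estimation of the Mean and Variance of Height, Weight, and 50 m Time for Male and Female Students
  • Part 3: Bayesian Estimation of the Mean Height and Weight of Male and Female Students
  • Part 4: Minimum-Error-Rate Bayes Decision: Classifying Input Data as Male or Female from Height and Weight (with Decision Surface Plot)
  • Part 5: Model Test
  • Part 6: Code Implementation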

 


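Part 1: Histogram of Male and Female 50 m Sprint Times

Figure 1 above shows the histogram of the male and female 50 m sprint times. Cyan bars are the male data, red bars are the female data, and brown marks the region where the two histograms overlap. The X axis is the sprint time in seconds; the Y axis is the normalized frequency (probability density) of the times that fall within each bin.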
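To make the Y axis concrete, here is a minimal sketch of a density-normalized histogram with shared bin edges. The data are synthetic stand-ins generated with made-up parameters, not the article's spreadsheet, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the 50 m times in seconds (NOT the article's data).
boy_times = rng.normal(7.5, 0.6, size=120)
girl_times = rng.normal(9.2, 0.8, size=90)

# Shared bin edges make the two histograms directly comparable.
lo = min(boy_times.min(), girl_times.min())
hi = max(boy_times.max(), girl_times.max())
edges = np.linspace(lo, hi, 21)  # 21 edges give 20 bins

boy_density, _ = np.histogram(boy_times, bins=edges, density=True)
girl_density, _ = np.histogram(girl_times, bins=edges, density=True)

# With density=True, bar height times bin width sums to 1 for each group,
# so groups of different sizes can be overlaid on one axis.
widths = np.diff(edges)
print(np.sum(boy_density * widths), np.sum(girl_density * widths))
```

Because each group is normalized separately, the bar heights are densities rather than counts, which is why the male and female groups can be compared even when their sample sizes differ.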

 

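Part 2: Maximum Likelihood Estimation of the Mean and Variance of Height, Weight, and 50 m Time for Male and Female Students

Assuming that the height, weight, and sprint data of male and female students all follow normal distributions, maximum likelihood estimation gives the mean and variance estimates shown in Figure 2. The underlying data set contains 725 height and weight records for the male and female students. On top of this, I added code that draws a given proportion of the 725-record data set at random before running maximum likelihood estimation. This makes it possible to observe how far the estimates deviate from the true values for different sample sizes, and to compare the result with the Bayesian estimation described below. The estimates in Figure 2 above were computed from a random 50% sample of the full data set.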
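The estimator described above can be sketched in a few lines. For a normal model the maximum likelihood estimates are the sample mean and the 1/N variance. The helper name `mle_mean_var` and the synthetic heights below are assumptions for illustration, not the article's data.

```python
import numpy as np

def mle_mean_var(samples):
    """Maximum likelihood estimates for a 1-D Gaussian: mean and 1/N variance."""
    x = np.asarray(samples, dtype=float)
    mu = x.sum() / x.size
    var = ((x - mu) ** 2).sum() / x.size  # divides by N, not N - 1
    return mu, var

rng = np.random.default_rng(42)
heights = rng.normal(173.0, 6.0, size=725)  # synthetic data, hypothetical parameters

# Estimate from random subsets of different sizes, drawn without replacement.
for proportion in (0.1, 0.5, 1.0):
    n = round(proportion * heights.size)
    subset = rng.choice(heights, size=n, replace=False)
    mu, var = mle_mean_var(subset)
    print(f"proportion={proportion:.1f}  n={n}  mean={mu:.2f}  variance={var:.2f}")
```

Note that the variance estimate divides by N, which matches `np.var` with its default `ddof=0`.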

 

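Part 3: Bayesian Estimation of the Mean Height and Weight of Male and Female Students

Assume that the height and weight data of male and female students follow normal distributions, and that the mean itself has a normal prior with its own mean and variance. After setting the prior mean and variance of the mean, Bayesian estimation gives the results in Figure 3. Repeated experiments show that the error of the Bayesian estimate depends strongly on these prior settings, so the comparison below holds only for the prior parameters set in the figure above. The Bayesian mean estimates in Figure 3 were obtained under the same conditions as the maximum likelihood estimates above: a random 50% sample of the 725-record data set was used as the training set. The Bayesian mean estimates differ from the actual values by about 0.30, slightly less than the error of the maximum likelihood mean estimates.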
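As a minimal sketch of the estimator used here: with a known data variance sigma^2 and a normal prior N(mu0, sigma0^2) on the mean, the posterior mean after N samples is (sigma0^2 * sum(x) + sigma^2 * mu0) / (N * sigma0^2 + sigma^2). The function name `bayes_mean` and the numbers below are illustrative assumptions.

```python
import numpy as np

def bayes_mean(samples, prior_mean, prior_var, known_var):
    """Posterior mean of a Gaussian mean with known data variance."""
    x = np.asarray(samples, dtype=float)
    denom = x.size * prior_var + known_var
    return (prior_var * x.sum() + known_var * prior_mean) / denom

rng = np.random.default_rng(1)
heights = rng.normal(173.0, 6.0, size=20)  # synthetic sample, hypothetical parameters

# The same data under two different priors: the prior pulls the estimate
# toward prior_mean, and the pull is stronger when there are fewer samples.
for prior_mean in (170.0, 150.0):
    est = bayes_mean(heights, prior_mean, prior_var=15.0, known_var=36.0)
    print(f"prior mean {prior_mean}: estimate {est:.2f}, sample mean {heights.mean():.2f}")
```

This also shows why the result is sensitive to the prior: a poorly chosen prior mean or a very small prior variance shifts the estimate away from the sample mean.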

 

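Part 4: Minimum-Error-Rate Bayes Decision: Classifying Input Data as Male or Female from Height and Weight

The decision surface in the figure above comes from the minimum-error-rate Bayes decision for two-dimensional, normally distributed features. It is a nonlinear, roughly elliptical boundary that decides from the two features, height and weight, whether a sample comes from a male or a female student. In the figure, red points are female samples, green points are male samples, and the blue line is the decision surface. Figure 4 shows that, under this decision surface, samples classified as male have a much lower error rate than samples classified as female. The equation of the decision surface, with height in cm and weight w in kg, is:

                          Height = 0.2478*w + 79.3793*sqrt(-0.000932*w^2 + 0.08356*w - 1) + 82.895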
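The decision rule behind this surface can be written as a single function. The sketch below uses the same sign convention as the article's listing: g(x) = 0.5*d1 - 0.5*d2 + 0.5*ln(|C1|/|C2|) - ln(P1/P2), where d1 and d2 are squared Mahalanobis distances, and g(x) < 0 means class 1. The function name and the test-style parameters are illustrative assumptions.

```python
import numpy as np

def discriminant(x, mean1, cov1, prior1, mean2, cov2, prior2):
    """Two-class Gaussian discriminant; negative means class 1, positive class 2."""
    d1 = x - mean1
    d2 = x - mean2
    m1 = d1 @ np.linalg.inv(cov1) @ d1  # squared Mahalanobis distance to class 1
    m2 = d2 @ np.linalg.inv(cov2) @ d2  # squared Mahalanobis distance to class 2
    log_det_ratio = np.log(np.linalg.det(cov1) / np.linalg.det(cov2))
    return 0.5 * m1 - 0.5 * m2 + 0.5 * log_det_ratio - np.log(prior1 / prior2)

# Hypothetical class parameters in (height cm, weight kg); not the article's values.
mean_boy = np.array([173.0, 65.0])
cov_boy = np.array([[30.0, 20.0], [20.0, 80.0]])
mean_girl = np.array([162.0, 51.0])
cov_girl = np.array([[25.0, 12.0], [12.0, 35.0]])

g = discriminant(np.array([178.0, 71.0]), mean_boy, cov_boy, 0.6,
                 mean_girl, cov_girl, 0.4)
print("male" if g < 0 else "female", g)
```

When the two covariance matrices differ, the quadratic terms do not cancel, which is what makes the boundary g(x) = 0 a curve rather than a straight line.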

 

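Part 5: Model Test (format: [height, weight] [decision value])

As the figure above shows, for an input with a height of 178 cm and a weight of 71 kg, the discriminant function outputs -10.22, well below zero, so the sample is classified as male with high confidence. For an input with a height of 170 cm and a weight of 52 kg, the output is -0.71, which is below zero but close to it, so the sample is still classified as male but with a considerably higher probability of error.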
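The "confidence" reading of the decision value can be made explicit. In the article's listing the value is g = ln[p(x|female)P(female)] - ln[p(x|male)P(male)], so, assuming the Gaussian class models and priors are correct, the posterior probability of the male class is 1 / (1 + exp(g)). The helper name below is an assumption for illustration.

```python
import math

def posterior_class1(g):
    """P(class 1 | x) for a decision value g = log of class-2 over class-1 joint density."""
    return 1.0 / (1.0 + math.exp(g))

# The two decision values reported in the text.
for g in (-10.22, -0.71):
    print(f"g = {g:6.2f}  ->  P(male | x) = {posterior_class1(g):.6f}")
```

Under that assumption, g = -10.22 corresponds to a male posterior above 0.9999, while g = -0.71 corresponds to roughly 0.67, that is, about a one-in-three chance that the male label is wrong.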

 

Part 6: Code Implementation

The code is as follows (example):

import math
import os
import random
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import openpyxl
import sympy


# Return the most frequent element of lst.
def array_frequent(lst):
    HF = Counter(lst).most_common(1)
    return HF[0][0]
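
# Read one feature column from the worksheet file_data and split it by gender.
# ClassNum is the feature number (1 = gender, 3 = height, ...); the worksheet
# column actually read is ClassNum + 1. Rows 2 to row_end - 1 are read.
def ReadInCol(file_data, ClassNum, row_end):
    BoyData = []
    GirData = []
    ClassNum = ClassNum + 1
    for i in range(2, row_end):
        G = file_data.cell(i, 2).value  # gender flag in column 2: truthy = male
        if G:
            BoyData.append(file_data.cell(i, ClassNum).value)
        else:
            GirData.append(file_data.cell(i, ClassNum).value)
    return BoyData, GirData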

# Histogram of the male and female 50 m sprint times.
# sheet: worksheet holding the data
def Meter50_HistShow(sheet):
    # Features: 1 gender, 2 hometown, 3 height, 4 weight, 5 shoe size, 6 50 m time,
    # 7 lung capacity, 8 favorite color, 9 favorite sport, 10 favorite literature
    Boy50, Gir50 = ReadInCol(sheet, 6, 210)  # male and female 50 m times
    Boy50_Max = max(Boy50)
    Boy50_Min = min(Boy50)
    Gir50_Max = max(Gir50)
    Gir50_Min = min(Gir50)
    DataMax = max(Boy50_Max, Gir50_Max)
    DataMin = min(Boy50_Min, Gir50_Min)
    # Shared bin edges for both groups
    X_Show = np.linspace(DataMin, DataMax, round(round(DataMax - DataMin) * 2))
    plt.hist(Boy50, X_Show, density=True, color='yellowgreen', histtype='bar', alpha=0.5, edgecolor='white', linewidth=4)
    plt.hist(Gir50, X_Show, density=True, color='pink', histtype='bar', alpha=0.5, edgecolor='white', linewidth=4)
    plt.xlabel('50 m time (s)')
    plt.ylabel('Density')
    plt.title('50MeterTime DistributionHist')

# Maximum likelihood estimates of the mean and variance.
# BoySouceData: full male data set
# GirSouceData: full female data set
# Proport: fraction of each data set drawn at random as the training sample
def MyMLE_Mean_Vari(BoySouceData, GirSouceData, Proport):
    BoyDataLenth = round(Proport * len(BoySouceData))
    GirDataLenth = round(Proport * len(GirSouceData))
    # Random indices, drawn without replacement
    BoyRam = random.sample(range(0, len(BoySouceData)), BoyDataLenth)
    GirRam = random.sample(range(0, len(GirSouceData)), GirDataLenth)
    BoyData = [BoySouceData[k] for k in BoyRam]
    GirData = [GirSouceData[k] for k in GirRam]
    BoyDataLenth_T = 1.0 / float(BoyDataLenth)
    GirDataLenth_T = 1.0 / float(GirDataLenth)
    BoyMLE_Param = [0, 0]  # element 0 is the mean, element 1 is the variance
    GirMLE_Param = [0, 0]
    for i in range(0, BoyDataLenth):
        BoyMLE_Param[0] = BoyMLE_Param[0] + BoyDataLenth_T * BoyData[i]
    for i in range(0, BoyDataLenth):
        BoyMLE_Param[1] = BoyMLE_Param[1] + math.pow(BoyData[i] - BoyMLE_Param[0], 2)
    BoyMLE_Param[1] = BoyDataLenth_T * BoyMLE_Param[1]
    for i in range(0, GirDataLenth):
        GirMLE_Param[0] = GirMLE_Param[0] + GirDataLenth_T * GirData[i]
    for i in range(0, GirDataLenth):
        GirMLE_Param[1] = GirMLE_Param[1] + math.pow(GirData[i] - GirMLE_Param[0], 2)
    GirMLE_Param[1] = GirDataLenth_T * GirMLE_Param[1]
    return BoyMLE_Param, GirMLE_Param

# Print the maximum likelihood estimates of the distribution parameters of
# height, weight, and 50 m time for male and female students.
# sheet: worksheet holding the data
# MLE_Proport: fraction of the data drawn at random for the estimate
def MLE_ProportShow(sheet, MLE_Proport):
    BoyHig, GirHig = ReadInCol(sheet, 3, 737)  # male and female heights, cm
    BoyWei, GirWei = ReadInCol(sheet, 4, 737)  # male and female weights, kg
    Boy50m, Gir50m = ReadInCol(sheet, 6, 210)  # male and female 50 m times, s
    BoyHigMLE_Param, GirHigMLE_Param = MyMLE_Mean_Vari(BoyHig, GirHig, MLE_Proport)
    BoyWeiMLE_Param, GirWeiMLE_Param = MyMLE_Mean_Vari(BoyWei, GirWei, MLE_Proport)
    # The second argument must be the female data (the original passed Boy50m twice)
    Boy50mMLE_Param, Gir50mMLE_Param = MyMLE_Mean_Vari(Boy50m, Gir50m, MLE_Proport)
    print("                MLE parameters      Actual parameters", "  (random fraction", MLE_Proport, ")")
    print("Item    Sex     Mean    Variance    Mean    Variance")
    print("Height  M  ", round(BoyHigMLE_Param[0], 2), "  ", round(BoyHigMLE_Param[1], 2),
          "  ", round(np.mean(BoyHig), 2), "  ", round(np.var(BoyHig), 2))
    print("        F  ", round(GirHigMLE_Param[0], 2), "  ", round(GirHigMLE_Param[1], 2),
          "  ", round(np.mean(GirHig), 2), "  ", round(np.var(GirHig), 2))
    print("Weight  M  ", round(BoyWeiMLE_Param[0], 2), "  ", round(BoyWeiMLE_Param[1], 2),
          "  ", round(np.mean(BoyWei), 2), "  ", round(np.var(BoyWei), 2))
    print("        F  ", round(GirWeiMLE_Param[0], 2), "  ", round(GirWeiMLE_Param[1], 2),
          "  ", round(np.mean(GirWei), 2), "  ", round(np.var(GirWei), 2))
    print("Sprint  M  ", round(Boy50mMLE_Param[0], 2), "  ", round(Boy50mMLE_Param[1], 2),
          "  ", round(np.mean(Boy50m), 2), "  ", round(np.var(Boy50m), 2))
    print("        F  ", round(Gir50mMLE_Param[0], 2), "  ", round(Gir50mMLE_Param[1], 2),
          "  ", round(np.mean(Gir50m), 2), "  ", round(np.var(Gir50m), 2))

# Bayesian estimate of the mean of the male and female height or weight
# distributions (variance known, mean estimated).
# BoySouceData: male data set
# GirSouceData: female data set
# Proport: fraction of each data set drawn at random
# BoyInitPrama: normal prior on the male mean, [mean, variance]
# GirInitPrama: normal prior on the female mean, [mean, variance]
def BayesEstim_Mean(BoySouceData, GirSouceData, Proport, BoyInitPrama, GirInitPrama):
    BoyDataLenth = round(Proport * len(BoySouceData))
    GirDataLenth = round(Proport * len(GirSouceData))
    # Random indices, drawn without replacement
    BoyRam = random.sample(range(0, len(BoySouceData)), BoyDataLenth)
    GirRam = random.sample(range(0, len(GirSouceData)), GirDataLenth)
    BoySum = 0.0
    GirSum = 0.0
    for i in range(0, BoyDataLenth):
        BoySum = BoySum + BoySouceData[BoyRam[i]]
    for i in range(0, GirDataLenth):
        GirSum = GirSum + GirSouceData[GirRam[i]]
    BoyVari = np.var(BoySouceData)  # variance of the full data set, treated as known
    GirVari = np.var(GirSouceData)  # variance of the full data set, treated as known
    # Posterior mean: (prior_var * sum(x) + data_var * prior_mean) / (N * prior_var + data_var)
    BayesEstim_NewMean = []
    BayesEstim_NewMean.append(BoyInitPrama[1] * BoySum / (BoyDataLenth * BoyInitPrama[1] + BoyVari)
                              + BoyVari * BoyInitPrama[0] / (BoyDataLenth * BoyInitPrama[1] + BoyVari))
    BayesEstim_NewMean.append(GirInitPrama[1] * GirSum / (GirDataLenth * GirInitPrama[1] + GirVari)
                              + GirVari * GirInitPrama[0] / (GirDataLenth * GirInitPrama[1] + GirVari))
    return BayesEstim_NewMean

# Print the Bayesian estimation results.
def BayesEstim_MeanShow(sheet, Proport, BoyHighPrama, GirHighPrama, BoyWeigPrama, GirWeigPrama):
    # Read the data sets
    BoyHig, GirHig = ReadInCol(sheet, 3, 737)  # male and female heights, cm
    BoyWei, GirWei = ReadInCol(sheet, 4, 737)  # male and female weights, kg
    # Bayesian estimates of the means
    HigMeanEsti = BayesEstim_Mean(BoyHig, GirHig, Proport, BoyHighPrama, GirHighPrama)
    WeiMeanEsti = BayesEstim_Mean(BoyWei, GirWei, Proport, BoyWeigPrama, GirWeigPrama)
    print("\n           Bayesian mean estimate    Actual mean")
    print("Height  M: ", round(HigMeanEsti[0], 2), "  ", round(np.mean(BoyHig), 2))
    print("        F: ", round(HigMeanEsti[1], 2), "  ", round(np.mean(GirHig), 2))
    print("Weight  M: ", round(WeiMeanEsti[0], 2), "  ", round(np.mean(BoyWei), 2))
    print("        F: ", round(WeiMeanEsti[1], 2), "  ", round(np.mean(GirWei), 2))

# 2x2 sample covariance matrix of two features (normalized by N - 1).
def CovEle(Data1, Data2, Mean1, Mean2):
    Length = len(Data1)
    temp = 1.0 / (Length - 1)
    Cov00 = 0
    Cov01 = 0
    Cov11 = 0
    for i in range(0, Length):
        Cov00 = Cov00 + temp * (Data1[i] - Mean1) * (Data1[i] - Mean1)
        Cov01 = Cov01 + temp * (Data1[i] - Mean1) * (Data2[i] - Mean2)
        Cov11 = Cov11 + temp * (Data2[i] - Mean2) * (Data2[i] - Mean2)
    Cov10 = Cov01  # the covariance matrix is symmetric
    Cov = np.array([[Cov00, Cov01], [Cov10, Cov11]])
    return Cov

# Classify one sample and print the result.
# New_x: 2x1 column vector [[height], [weight]]
def DiscriminantFunc_TwoClass(New_x, sheet):
    BoyHig, GirHig = ReadInCol(sheet, 3, 728)  # male and female heights, cm
    BoyWei, GirWei = ReadInCol(sheet, 4, 728)  # male and female weights, kg
    N = len(BoyHig) + len(GirHig)
    # Feature means
    BoyHigMean = np.mean(BoyHig)  # X11
    GirHigMean = np.mean(GirHig)  # X21
    BoyWeiMean = np.mean(BoyWei)  # X12
    GirWeiMean = np.mean(GirWei)  # X22
    # Class mean vectors
    BoyClassMean = np.array([[BoyHigMean], [BoyWeiMean]])  # Mean1
    GirClassMean = np.array([[GirHigMean], [GirWeiMean]])  # Mean2
    # Covariance matrices
    BoyCov = CovEle(BoyHig, BoyWei, BoyHigMean, BoyWeiMean)
    GirCov = CovEle(GirHig, GirWei, GirHigMean, GirWeiMean)
    # ln( |Cov1| / |Cov2| )
    ln_Value = math.log(np.linalg.det(BoyCov) / np.linalg.det(GirCov))
    # ln( P(w1) / P(w2) ), with priors taken from the class frequencies
    ln_P_w = math.log((len(BoyHig) / N) / (len(GirHig) / N))
    # Inverse covariance matrices
    BoyCov_N = np.linalg.inv(BoyCov)
    GirCov_N = np.linalg.inv(GirCov)
    # Discriminant function (a 1x1 array)
    Func = 0.5 * np.dot(np.dot(np.transpose(New_x - BoyClassMean), BoyCov_N), New_x - BoyClassMean) \
        - 0.5 * np.dot(np.dot(np.transpose(New_x - GirClassMean), GirCov_N), New_x - GirClassMean) \
        + 0.5 * ln_Value - ln_P_w
    # Decision: a negative value means male, otherwise female
    Discriminant = float(Func[0, 0])
    if Discriminant < 0:
        print("\nSample", np.transpose(New_x), Func, " classified as -> male")
    else:
        print("\nSample", np.transpose(New_x), Func, " classified as -> female")

# Solve for the decision surface symbolically and plot it with the samples.
# Cass1, Class2: feature numbers of the two features; DataNum: last row to read
def DecisionPlaneSolve(Cass1, Class2, DataNum, sheet):
    BoyHig, GirHig = ReadInCol(sheet, Cass1, DataNum)  # male and female heights, cm
    BoyWei, GirWei = ReadInCol(sheet, Class2, DataNum)  # male and female weights, kg
    N = len(BoyHig) + len(GirHig)
    # Feature means
    BoyHigMean = float(np.mean(BoyHig))  # X11
    GirHigMean = float(np.mean(GirHig))  # X21
    BoyWeiMean = float(np.mean(BoyWei))  # X12
    GirWeiMean = float(np.mean(GirWei))  # X22
    # Class mean vectors
    BoyClassMean = sympy.Matrix([[BoyHigMean], [BoyWeiMean]])  # Mean1
    GirClassMean = sympy.Matrix([[GirHigMean], [GirWeiMean]])  # Mean2
    # Covariance matrices, converted from NumPy arrays to SymPy matrices
    BoyCov = sympy.Matrix(CovEle(BoyHig, BoyWei, BoyHigMean, BoyWeiMean).tolist())
    GirCov = sympy.Matrix(CovEle(GirHig, GirWei, GirHigMean, GirWeiMean).tolist())
    # ln( |Cov1| / |Cov2| )
    ln_Value = sympy.ln(BoyCov.det() / GirCov.det())
    # ln( P(w1) / P(w2) )
    ln_P_w = sympy.ln((len(BoyHig) / N) / (len(GirHig) / N))
    # Discriminant function in the symbols h (height) and w (weight)
    h, w = sympy.symbols('h w')
    X = sympy.Matrix([[h], [w]])
    Func = (0.5 * (X - BoyClassMean).T * BoyCov.inv() * (X - BoyClassMean))[0, 0] \
        - (0.5 * (X - GirClassMean).T * GirCov.inv() * (X - GirClassMean))[0, 0] \
        + 0.5 * ln_Value - ln_P_w
    SolveFunc = sympy.solve(Func, [h, w])
    # Plot the decision surface. The coefficients are hard-coded from the
    # decision surface equation given above and are specific to this data set.
    Y_Weight = np.linspace(40, 70, 200)
    X_Hight = 0.247822699671129 * Y_Weight + 79.3792911179188 * np.sqrt(
        -0.000932386643406855 * Y_Weight ** 2 + 0.083558110408116 * Y_Weight - 1) + 82.8949394272113
    plt.plot(Y_Weight, X_Hight)
    # Plot the male and female samples (weight on the x axis, height on the y axis)
    plt.scatter(BoyWei, BoyHig, color='yellowgreen')
    plt.scatter(GirWei, GirHig, color='pink')
    plt.xlabel('Weight (kg)')
    plt.ylabel('Height (cm)')
    plt.title('Height-weight characteristic distribution map')
    return SolveFunc

####################################### main #############################################
# Path of the data workbook
Tain_set = openpyxl.load_workbook(os.path.abspath('data.xlsx'))
# Read the training data
Tain_sheet = Tain_set["Sheet1"]
# (1) Histogram of the male and female 50 m sprint times
Meter50_HistShow(Tain_sheet)
# (2) Maximum likelihood estimates of the distribution parameters of height, weight, and 50 m time
MLE_ProportShow(Tain_sheet, 0.5)
# (3) Bayesian estimates of the mean height and weight (variance known, mean estimated)
# The mean to be estimated follows a normal prior: element 0 = mean, element 1 = variance
BoyHigMeanEsti_Aver_Vari = [170, 15]  # prior, an empirical guess at the distribution
GirHigMeanEsti_Aver_Vari = [160, 10]
BoyWeiMeanEsti_Aver_Vari = [60, 10]
GirWeiMeanEsti_Aver_Vari = [42, 5]
BayesEstim_MeanShow(Tain_sheet, 0.5, BoyHigMeanEsti_Aver_Vari, GirHigMeanEsti_Aver_Vari,
                    BoyWeiMeanEsti_Aver_Vari, GirWeiMeanEsti_Aver_Vari)
# (4) Minimum-error-rate Bayes decision on height and weight
x = np.array([[178], [71]])
DiscriminantFunc_TwoClass(x, Tain_sheet)
x = np.array([[170], [52]])
DiscriminantFunc_TwoClass(x, Tain_sheet)
# Show the histogram, then solve for and plot the decision surface
plt.show()
DecisionPlane = DecisionPlaneSolve(3, 4, 728, Tain_sheet)
print(DecisionPlane)
plt.show()
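The listing above plots the decision surface from hard-coded coefficients, which only fit one data set. As an alternative sketch: for a fixed weight w, the discriminant g(h, w) is a quadratic in the height h, so the boundary heights can be found numerically with `np.roots`. The function name `boundary_heights` and the class parameters below are hypothetical, not values from the article.

```python
import numpy as np

def boundary_heights(w, mean1, cov1, prior1, mean2, cov2, prior2):
    """Real heights h at which g(h, w) = 0 for a fixed weight w."""
    a1, a2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    d = a1 - a2                        # quadratic part of g
    b = a1 @ mean1 - a2 @ mean2        # linear part of g
    k = (0.5 * (mean1 @ a1 @ mean1 - mean2 @ a2 @ mean2)
         + 0.5 * np.log(np.linalg.det(cov1) / np.linalg.det(cov2))
         - np.log(prior1 / prior2))    # constant part of g
    # g(h, w) = 0.5*d00*h^2 + (d01*w - b0)*h + (0.5*d11*w^2 - b1*w + k)
    coeffs = [0.5 * d[0, 0], d[0, 1] * w - b[0], 0.5 * d[1, 1] * w ** 2 - b[1] * w + k]
    roots = np.roots(coeffs)
    return np.sort(roots[np.abs(roots.imag) < 1e-9].real)

# Hypothetical class parameters in (height cm, weight kg).
mean_boy = np.array([173.0, 65.0])
cov_boy = np.array([[30.0, 20.0], [20.0, 80.0]])
mean_girl = np.array([162.0, 51.0])
cov_girl = np.array([[25.0, 12.0], [12.0, 35.0]])

print(boundary_heights(55.0, mean_boy, cov_boy, 0.6, mean_girl, cov_girl, 0.4))
```

An empty result means the boundary does not cross that weight. Evaluating the function over a range of weights traces the whole curve without re-deriving the coefficients whenever the data change.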

 

Original article: https://blog.csdn.net/weixin_41094315/article/details/109562669


Author: 集天地之正气

Link: https://www.pythonheidong.com/blog/article/611326/5a79387166be0021c89a/

Source: python黑洞网

Any reproduction in any form must credit the source. Infringement, once discovered, will be pursued under the law.
