

Python Machine Learning 15 — Detailed Usage of XGBoost and LightGBM (Cross-Validation, Grid Search, Feature Selection)

Published on 2022-10-05 08:34


This series mostly skips the mathematical theory and focuses on implementing machine learning methods with the most concise Python code possible.


Ensemble models have evolved into XGBoost and LightGBM, the mainstream algorithms in data-science competitions today, and they are genuinely worth using in real projects. Both offer advantages that plain GBM lacks, such as faster convergence, better accuracy, and higher speed. However, their cores are not written in Python and they are not part of scikit-learn, so they must be installed separately, and their usage is not entirely the same as scikit-learn's.

Both models have a native API and a scikit-learn-style API; regression and classification examples are shown for each below. Since these are only demos, all of them use scikit-learn's built-in datasets for convenience.


Module installation

Installation is simple: run the two commands below in a notebook cell, or press Win+R, open CMD, and enter them there.

    pip install xgboost
    pip install lightgbm

Then just wait for them to install automatically.
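To confirm the installs worked, a quick sanity check (my own addition, not in the original post) is to import both packages and print their versions:

    import xgboost
    import lightgbm

    # both packages expose __version__
    print(xgboost.__version__)
    print(lightgbm.__version__)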


Parameter details

Both models take a large number of hyperparameters, some of which come up later. Skim them now and refer back whenever something is unclear. The main XGBoost booster parameters:

eta — the learning rate; default 0.3.

gamma — the minimum loss reduction required to split a leaf node further; a node is split only when the split reduces the loss by more than this value; default 0.

max_leaves — the maximum number of leaf nodes; default 0.

max_bin — the maximum number of histogram bins; default 256.

min_child_weight — the minimum sum of instance weights required in a child node; guards against overfitting (larger values make overfitting less likely); default 1.

subsample — the fraction of training samples used per boosting round; if you split off a test set beforehand you can usually leave it alone; default 1.

colsample_bytree — the fraction of columns sampled per tree; default 1.

colsample_bylevel — the fraction of columns sampled at each split level; default 1.

scale_pos_weight — controls the balance between positive and negative class weights; a sensible value is (number of negative samples) / (number of positive samples); default 1.

predictor — the predictor type; default cpu_predictor; use gpu_predictor for GPU acceleration.

seed — the random seed; default 0.

silent — whether to print run-time messages; default 0 (messages are printed).

objective [default = reg:linear, renamed reg:squarederror in recent XGBoost versions]

reg:linear — linear regression

reg:logistic — logistic regression

binary:logistic — logistic regression for binary classification; outputs probabilities

binary:logitraw — logistic regression for binary classification; outputs the raw score w^T x before the sigmoid

count:poisson — Poisson regression for count data; outputs the Poisson mean. In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization)

multi:softmax — multiclass classification with the softmax objective; requires num_class (the number of classes)

multi:softprob — like softmax, but outputs an ndata * nclass vector giving the probability of each class for each sample

eval_metric [default depends on the objective]

rmse: root mean squared error

mae: mean absolute error

logloss: negative log-likelihood

error: binary classification error rate, i.e. (misclassified samples) / (all samples); predictions above 0.5 count as positive, the rest as negative. error@t sets a custom decision threshold t

merror: multiclass error rate

mlogloss: multiclass log loss

auc: area under the curve

ndcg: normalized discounted cumulative gain

map: mean average precision

In general, we train the model with xgboost.train(params, dtrain), where params holds the booster parameters.
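As a compact illustration of how the parameters above fit together (the values here are arbitrary placeholders, not tuned recommendations), a params dict for a binary classification task might look like this:

    params = {
        'objective': 'binary:logistic',  # binary classification, probability output
        'eval_metric': 'auc',            # evaluation metric
        'eta': 0.1,                      # learning rate
        'gamma': 0,                      # minimum loss reduction to split
        'max_depth': 5,
        'min_child_weight': 1,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'scale_pos_weight': 1,           # set to n_negative / n_positive for imbalanced data
        'seed': 0,
    }
    # model = xgb.train(params, dtrain, num_boost_round=100)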


XGBoost native usage

Classification

    import numpy as np
    import pandas as pd
    #import pickle
    import xgboost as xgb
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # the iris dataset
    iris = load_iris()
    X = iris.data
    y = iris.target
    X.shape, y.shape

The classic 3-class iris dataset.

Split into training and test sets, then convert the data to the DMatrix format xgb requires.

    # train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    xgb_train = xgb.DMatrix(X_train, y_train)
    xgb_test = xgb.DMatrix(X_test, y_test)

Set the parameters:

params = {'objective': 'multi:softmax', 'num_class': 3, 'booster': 'gbtree',
          'max_depth': 5, 'eta': 0.1, 'subsample': 0.7, 'colsample_bytree': 0.7}

Train:

    num_round = 50
    watchlist = [(xgb_train, 'train'), (xgb_test, 'test')]
    model = xgb.train(params, xgb_train, num_round, watchlist)

 

Predict:

    pred = model.predict(xgb_test)
    pred

    error_rate = np.sum(pred != y_test) / y_test.shape[0]
    error_rate  # error rate on the test set

 


Regression

    from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2
    X, y = load_boston(return_X_y=True)
    X.shape, y.shape

The Boston housing dataset, a classic regression dataset.

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
    xgb_train = xgb.DMatrix(X_train, y_train)
    xgb_test = xgb.DMatrix(X_test, y_test)

Set the parameters and train:

    params = {'objective': 'reg:squarederror', 'booster': 'gbtree', 'max_depth': 5, 'eta': 0.1, 'min_child_weight': 1}
    num_round = 50
    watchlist = [(xgb_train, 'train'), (xgb_test, 'test')]
    model = xgb.train(params, xgb_train, num_round, watchlist)

    pred = model.predict(xgb_test)
    pred, y_test

Compute the mean squared error and the goodness of fit (R^2):

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score

    # regress the predictions on the true values as an informal goodness-of-fit check
    reg = LinearRegression()
    reg.fit(y_test.reshape(-1, 1), pred.reshape(-1, 1))
    reg.score(y_test.reshape(-1, 1), pred.reshape(-1, 1))

    mean_squared_error(y_test, pred), r2_score(y_test, pred)

Cross-validation

    # cross-validation
    result = xgb.cv(params=params, dtrain=xgb_train, nfold=10, metrics='rmse',  # or 'auc'
                    num_boost_round=300, as_pandas=True, seed=123)
    result.shape
    result.head()
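A handy follow-up (my own addition, not in the original): since the returned DataFrame has one row per boosting round, the best number of rounds can be read off directly:

    # row index of the lowest mean test RMSE; +1 because rounds are counted from 1
    best_round = result['test-rmse-mean'].idxmin() + 1
    print(best_round, result['test-rmse-mean'].min())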

 

Plot the cross-validation error curves:

    # Plot CV Errors
    import matplotlib.pyplot as plt
    plt.plot(range(1, 301), result['train-rmse-mean'], 'k', label='Training Error')
    plt.plot(range(1, 301), result['test-rmse-mean'], 'b', label='Test Error')
    plt.xlabel('Number of Trees')
    plt.ylabel('RMSE')
    plt.axhline(0, linestyle='--', color='k', linewidth=1)
    plt.legend()
    plt.title('CV Errors for XGBoost')
    plt.show()

Custom objective and evaluation functions

XGBoost also supports user-defined loss (objective) functions and evaluation functions.

    from sklearn.datasets import load_breast_cancer
    X, y = load_breast_cancer(return_X_y=True)
    X.shape, y.shape

The breast cancer dataset.

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)
    print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
    xgb_train = xgb.DMatrix(X_train, y_train)
    xgb_test = xgb.DMatrix(X_test, y_test)

 

    params = {'booster': 'gbtree', 'max_depth': 5, 'eta': 0.1}
    num_round = 50
    watchlist = [(xgb_train, 'train'), (xgb_test, 'test')]

Define the loss (objective) function and the evaluation function:

    def logregobj(preds, dtrain):
        # gradient and hessian of the logistic loss w.r.t. the raw scores
        labels = dtrain.get_label()
        preds = 1.0 / (1.0 + np.exp(-preds))
        grad = preds - labels
        hess = preds * (1.0 - preds)
        return grad, hess

    def evalerror(preds, dtrain):
        # error rate, thresholding the raw scores at 0
        labels = dtrain.get_label()
        return 'error', float(sum(labels != (preds > 0.0))) / len(labels)
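A quick check on where these formulas come from: with p = 1/(1 + exp(-x)) for raw score x, the binary logistic loss for label y is -[y*log(p) + (1-y)*log(1-p)]; its first derivative with respect to x is p - y and its second derivative is p*(1-p), which is exactly the grad and hess that logregobj returns. XGBoost expects the objective to supply the per-sample gradient and Hessian of the raw score, which is also why evalerror thresholds the raw scores at 0 (equivalent to p > 0.5).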

Train:

model = xgb.train(params, xgb_train, num_round, watchlist, obj=logregobj, feval=evalerror)

Cross-validation can also use the custom functions:

    result = xgb.cv(params=params, dtrain=xgb_train, nfold=10, metrics='auc',
                    num_boost_round=300, as_pandas=True, seed=123, obj=logregobj, feval=evalerror)
    result.head()

    # Plot CV Errors
    import matplotlib.pyplot as plt
    plt.plot(range(1, 301), result['train-error-mean'], 'k', label='Training Error')
    plt.plot(range(1, 301), result['test-error-mean'], 'b', label='Test Error')
    plt.xlabel('Number of Trees')
    plt.ylabel('Error Rate')  # these curves are the custom error metric, not AUC
    plt.axhline(0, linestyle='--', color='k', linewidth=1)
    plt.legend()
    plt.title('CV Errors for XGBoost')
    plt.show()

 


XGBoost's scikit-learn interface

The scikit-learn interface is much more convenient: the usual scikit-learn machinery such as cross-validation, grid search, and feature selection all work with it.

Regression

    import xgboost as xgb
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import KFold, train_test_split, GridSearchCV
    from sklearn.metrics import confusion_matrix, mean_squared_error
    from sklearn.datasets import load_iris, load_boston
    from sklearn.datasets import load_breast_cancer

    X, y = load_boston(return_X_y=True)

Fit and evaluate (the classic scikit-learn pattern):

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
    model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=300, max_depth=6,
                             subsample=0.6, colsample_bytree=0.8, learning_rate=0.1, random_state=0)
    model.fit(X_train, y_train)
    model.score(X_test, y_test)  # R^2 on the test set

    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    rmse

 


Regression: cross-validation

    rng = np.random.RandomState(123)
    kf = KFold(n_splits=3, shuffle=True, random_state=rng)
    print("3-fold cross-validation")
    for train_index, test_index in kf.split(X):
        xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=300, max_depth=6, subsample=0.6,
                                     colsample_bytree=0.8, learning_rate=0.1, random_state=0).fit(X[train_index], y[train_index])
        predictions = xgb_model.predict(X[test_index])
        actuals = y[test_index]
        print("RMSE:")
        print(np.sqrt(mean_squared_error(actuals, predictions)))
        print("R^2:")
        print(xgb_model.score(X[test_index], y[test_index]))

 

Regression: grid search for the best hyperparameters

    # grid search for the best hyperparameters
    model = xgb.XGBRegressor(objective='reg:squarederror', subsample=0.6, colsample_bytree=0.8, random_state=0, nthread=8)
    param_dict = {'max_depth': [5, 6, 7, 8], 'n_estimators': [100, 200, 300], 'learning_rate': [0.05, 0.1, 0.2]}
    clf = GridSearchCV(model, param_dict, cv=10, verbose=1, scoring='r2')
    clf.fit(X_train, y_train)
    print(clf.best_score_)
    print(clf.best_params_)
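Since GridSearchCV refits the best configuration on the full training data by default (refit=True), the tuned model can then be evaluated on the held-out test set directly; a small follow-up sketch:

    best_model = clf.best_estimator_         # already refit on X_train with the best params
    print(best_model.score(X_test, y_test))  # R^2 of the tuned model on the test set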

 

Classification: cross-validation

    # binary classification
    rng = np.random.RandomState(123)
    X, y = load_breast_cancer(return_X_y=True)
    print(X.shape, y.shape)
    kf = KFold(n_splits=3, shuffle=True, random_state=rng)
    print("3-fold cross-validation")
    for train_index, test_index in kf.split(X):
        xgb_model = xgb.XGBClassifier(objective='binary:logistic', n_estimators=300, random_state=0, eta=0.1, max_depth=6,
                                      use_label_encoder=False, eval_metric=['logloss', 'auc', 'error']).fit(X[train_index], y[train_index])
        predictions = xgb_model.predict(X[test_index])
        actuals = y[test_index]
        print("Confusion matrix:")
        print(confusion_matrix(actuals, predictions))

    # multiclass: confusion matrix
    print("\nIris: multiclass")
    iris = load_iris()
    y = iris['target']
    X = iris['data']
    kf = KFold(n_splits=5, shuffle=True, random_state=rng)
    print("5-fold cross-validation")
    for train_index, test_index in kf.split(X):
        xgb_model = xgb.XGBClassifier(objective='multi:softmax', n_estimators=300, random_state=0, eta=0.1, max_depth=6,
                                      use_label_encoder=False, eval_metric=['mlogloss', 'merror']).fit(X[train_index], y[train_index])
        predictions = xgb_model.predict(X[test_index])
        actuals = y[test_index]
        print("Confusion matrix:")
        print(confusion_matrix(actuals, predictions))


Classification: grid search for the best hyperparameters

    # grid search for the best hyperparameters
    print("Hyperparameter optimization:")
    X, y = load_breast_cancer(return_X_y=True)
    xgb_model = xgb.XGBClassifier(objective='binary:logistic', random_state=0, use_label_encoder=False, eval_metric=['logloss', 'auc', 'error'])
    param_dict = {'max_depth': [2, 4, 6], 'n_estimators': [50, 100, 200], 'eta': [0.05, 0.1, 0.2]}
    clf = GridSearchCV(xgb_model, param_dict, verbose=1)
    clf.fit(X, y)
    print(clf.best_score_)
    print(clf.best_params_)


Early stopping

As with neural networks, early stopping can be used to prevent overfitting.

    # early stopping
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = xgb.XGBClassifier(objective='binary:logistic', use_label_encoder=False, random_state=0)
    clf.fit(X_train, y_train, early_stopping_rounds=10,
            eval_metric="auc", eval_set=[(X_val, y_val)])
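Note: in newer XGBoost releases (roughly 1.6 onward, where these fit() arguments were deprecated and later removed), early_stopping_rounds and eval_metric move into the constructor. Under those versions the equivalent sketch is:

    # assumes xgboost >= 1.6
    clf = xgb.XGBClassifier(objective='binary:logistic', random_state=0,
                            eval_metric='auc', early_stopping_rounds=10)
    clf.fit(X_train, y_train, eval_set=[(X_val, y_val)])
    print(clf.best_iteration)  # round where validation AUC stopped improving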


Feature importance

The xgb package's built-in plotting helper:

    # feature importance
    xgb.plot_importance(clf, height=0.5, importance_type='gain', max_num_features=10)

 

The scikit-learn-style attribute:

clf.feature_importances_

    cancer = load_breast_cancer()
    cancer.feature_names
    sorted_index = clf.feature_importances_.argsort()
    plt.figure(figsize=(10, 5))
    plt.barh(range(len(cancer.feature_names)), clf.feature_importances_[sorted_index])
    plt.yticks(np.arange(len(cancer.feature_names)), cancer.feature_names[sorted_index])
    plt.xlabel('Feature Importance')
    plt.ylabel('Feature')
    plt.title('XGBoost')
    #plt.savefig('feature_importance.png')
    plt.tight_layout()


Feature selection

Based on feature importance, features below a threshold are dropped.

    from sklearn.feature_selection import SelectFromModel
    selection = SelectFromModel(clf, threshold=0.05, prefit=True)
    select_X_train = selection.transform(X_train)
    select_X_train.shape

threshold=0.05 means any feature with importance below 0.05 is dropped; here only four features survive (consistent with the importance plot above).

Apply the same selection to the validation set:

    select_X_val = selection.transform(X_val)
    select_X_val.shape

Check which features were selected:

    print(selection.get_support())
    print(selection.get_support(True))
    [cancer.feature_names[i] for i in selection.get_support(True)]
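To close the loop, a small sketch of my own (not in the original post): retrain on the reduced feature matrix and check that accuracy holds up:

    selection_model = xgb.XGBClassifier(objective='binary:logistic', use_label_encoder=False,
                                        eval_metric='logloss', random_state=0)
    selection_model.fit(select_X_train, y_train)
    print(selection_model.score(select_X_val, y_val))  # accuracy with only the selected features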

 

That's it for xgb.


LightGBM works much like XGBoost in practice, with only minor differences in parameter names; the scikit-learn interface makes the two even more consistent. The native usage is shown first all the same.

LightGBM native usage

    from sklearn.datasets import load_iris
    import lightgbm as lgb
    from lightgbm import plot_importance
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # load the iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target
    # train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123457)
    # parameters
    params = {
        'boosting_type': 'gbdt',
        'objective': 'multiclass',  # for regression: 'objective': 'regression'
        'num_class': 3,
        'num_leaves': 31,
        'subsample': 0.8,
        'bagging_freq': 1,
        'feature_fraction': 0.8,
        'verbose': -1,              # suppress per-iteration logging
        'learning_rate': 0.01,
        'seed': 0
    }
    # build the datasets
    dtrain = lgb.Dataset(X_train, y_train)
    dtest = lgb.Dataset(X_test, y_test)
    num_rounds = 500
    model = lgb.train(params, dtrain, num_rounds, valid_sets=[dtrain, dtest],
                      verbose_eval=100, early_stopping_rounds=10)
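Heads-up: in LightGBM 4.0 and later, verbose_eval and early_stopping_rounds were removed from lgb.train in favor of callbacks, so on a recent install the training call becomes:

    # assumes lightgbm >= 4.0
    model = lgb.train(params, dtrain, num_rounds, valid_sets=[dtrain, dtest],
                      callbacks=[lgb.log_evaluation(period=100),
                                 lgb.early_stopping(stopping_rounds=10)])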

 

    # predict on the test set (returns per-class probabilities)
    y_pred = model.predict(X_test)
    # compute accuracy from the most probable class
    accuracy = accuracy_score(y_test, np.argmax(y_pred, axis=1))
    print('accuracy: %.2f%%' % (accuracy * 100))

    # plot feature importance
    plot_importance(model)
    plt.show()
    # save the model
    #model.save_model('model.txt')
    # load it back
    #model = lgb.Booster(model_file='model.txt')


LightGBM's scikit-learn interface

Regression

    from lightgbm import LGBMRegressor
    from lightgbm import plot_importance
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2
    from sklearn.metrics import mean_squared_error

    # load the dataset
    boston = load_boston()
    X, y = boston.data, boston.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LGBMRegressor(
        boosting_type='gbdt',
        num_leaves=31,
        max_depth=-1,
        learning_rate=0.1,
        n_estimators=100,
        objective='regression',  # LGBMRegressor already defaults to regression
        min_split_gain=0.0,
        min_child_samples=20,
        subsample=1.0,
        subsample_freq=0,
        colsample_bytree=1.0,
        reg_alpha=0.0,
        reg_lambda=0.0,
        random_state=None,
        silent=True  # dropped in lightgbm >= 4.0
    )
    model.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)],
              verbose=100, early_stopping_rounds=50)

    # predict on the test set
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print('mse', mse)
    # plot feature importance
    plot_importance(model)
    plt.show()

Classification

    from lightgbm import LGBMClassifier
    from sklearn.datasets import load_iris
    from lightgbm import plot_importance
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # load the dataset
    iris = load_iris()
    X, y = iris.data, iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12343)
    model = LGBMClassifier(
        max_depth=3,
        learning_rate=0.1,
        n_estimators=200,        # number of weak learners
        objective='multiclass',  # num_class is inferred from y by the sklearn wrapper
        boosting_type='gbdt',
        min_child_weight=2,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_alpha=0,
        reg_lambda=1,
        random_state=0           # random seed
    )
    model.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)],
              verbose=100, early_stopping_rounds=50)
    # predict on the test set
    y_pred = model.predict(X_test)
    # model.predict_proba(X_test) would give per-class probabilities instead
    # compute accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print('accuracy: %.2f%%' % (accuracy * 100))
    # plot feature importance
    plot_importance(model)
    plt.show()

Of course, using a validation set for early stopping and evaluation as above is a bit of a hassle; once the model is defined, training can be as simple as:

model.fit(X_train, y_train)

Evaluation and prediction then work exactly like any scikit-learn estimator (see my earlier posts); the native usage likewise mirrors xgb.

    print(model.score(X_test, y_test))
    model.predict(X_test)

 

Update

LGBM actually has a great many parameters (an overview chart appeared here in the original post).

The tuning workflow is roughly as follows (a code sketch follows the list):

1. Start with a relatively large learning rate to speed up convergence.

2. Tune the tree parameters: max_depth, num_leaves, subsample, colsample_bytree.

3. Then tune the regularization parameters: min_child_weight, lambda...

4. Finally, lower the learning rate and adjust the number of estimators to match.

Early stopping can also be used when necessary.
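As a rough illustration of that strategy (a sketch under assumptions: breast-cancer data, arbitrary grids, not tuned recommendations), each stage can be a small GridSearchCV run whose winners feed the next stage:

    from lightgbm import LGBMClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV

    X, y = load_breast_cancer(return_X_y=True)

    # stages 1-2: large learning rate, tune the tree shape
    base = LGBMClassifier(learning_rate=0.1, n_estimators=100, random_state=0)
    grid1 = GridSearchCV(base, {'max_depth': [3, 5, -1], 'num_leaves': [15, 31, 63]}, cv=5)
    grid1.fit(X, y)

    # stage 3: tune regularization around the best tree shape
    best = grid1.best_params_
    grid2 = GridSearchCV(LGBMClassifier(learning_rate=0.1, n_estimators=100, random_state=0, **best),
                         {'reg_lambda': [0, 0.1, 1], 'min_child_weight': [1e-3, 1, 5]}, cv=5)
    grid2.fit(X, y)

    # stage 4: lower the learning rate and give it more trees to compensate
    final = LGBMClassifier(learning_rate=0.01, n_estimators=1000, random_state=0,
                           **best, **grid2.best_params_)
    final.fit(X, y)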

Original article: https://blog.csdn.net/weixin_46277779/article/details/125835301


