Published 2020-01-16 22:12
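I want to select variables for a multiple regression analysis. I tried the code from http://planspace.org/20150423-forward_selection_with_statsmodels/. The problem is that I want to select from 50 variables, and that takes too long. I tried to speed it up with Numba and wrote the following code: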
@jit
def forward_selected(data, response):
    """Linear model designed by forward selection.

    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response

    response: string, name of response column in data

    Returns:
    --------
    model: an "optimal" fitted statsmodels linear model
           with an intercept
           selected by forward selection
           evaluated by adjusted R-squared
    """
    remaining = set(data.columns)
    remaining.remove(response)
    selected = [str]
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = [str]
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response,
                                           ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response,
                                   ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model

model = forward_selected(df, col)
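But it raises the following error:
TypeError: sequence item 0: expected str instance, type found
Please tell me how to fix it. If my question is unclear, I will be happy to provide more information in the comments.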
Traceback (most recent call last):
  File "~/PycharmProjects/anacondaenv/touhu_1.py", line 164, in <module>
    submit = predict(col)
  File "~/PycharmProjects/anacondaenv/touhu_1.py", line 75, in predict
    model = forward_selected(df, col)
TypeError: sequence item 0: expected str instance, type found
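I think one of the best ways to see whether numba can really act as a booster is to try the njit decorator instead of the jit decorator. njit forces no-python-mode and fails if anything falls back to Python (which offers no speed advantage at all). Short answer: don't use anything except np.ndarrays. So no strings, no tuples, no lists, and NO calls to functions that are not jitted.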
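To make the njit point concrete, here is a minimal sketch of the kind of function that suits no-python-mode: a plain numeric loop over an np.ndarray. The function name sum_of_squares is made up for illustration, and the import guard is an assumption added only so that the sketch still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without Numba, run the function undecorated.
    def njit(func):
        return func


@njit
def sum_of_squares(values):
    """Sum the squares of a 1-D float array using only numeric operations."""
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total


x = np.arange(5, dtype=np.float64)
print(sum_of_squares(x))  # 0 + 1 + 4 + 9 + 16 = 30.0
```

By contrast, a function that receives a pandas DataFrame or calls smf.ols would be expected to fail to compile under njit, which is exactly the signal that Numba has nothing to speed up there.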
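So I fixed the following error: numba does not allow an empty list to be created in the main body of the function... I don't know why (maybe a bug?!), but if you move it inside the while block, it works.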
import statsmodels.formula.api as smf
import numba as nb

@nb.jit
def forward_selected_nojit(data, response):
    """Linear model designed by forward selection.

    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response

    response: string, name of response column in data

    Returns:
    --------
    model: an "optimal" fitted statsmodels linear model
           with an intercept
           selected by forward selection
           evaluated by adjusted R-squared
    """
    remaining = set(data.columns)
    remaining.remove(response)
    selected = None  # Changed this line
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        if selected is None:  # Changed this and next line
            selected = []
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response,
                                           ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response,
                                   ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model
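
For reference, the TypeError reported in the question can be reproduced without Numba or statsmodels: a list written as [str] contains the type object str itself, and str.join only accepts string items. The sketch below uses a hypothetical helper, build_formula, that mirrors the formula line of the function; the column names sl and rk are taken from the output shown further down.

```python
def build_formula(response, selected, candidate):
    """Build a formula string the same way the selection loop does."""
    return "{} ~ {} + 1".format(response, " + ".join(selected + [candidate]))


# A list initialised as [str] holds the type `str`, so join() rejects it.
try:
    build_formula("sl", [str], "rk")
except TypeError as exc:
    print("TypeError:", exc)

# Starting from an empty list works as intended.
print(build_formula("sl", [], "rk"))  # sl ~ rk + 1
```

This suggests that replacing [str] with [] for both selected and scores_with_candidates is what removes the error, independently of where the list is created.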
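This could be solved more elegantly, but what matters here is the timing. First, though, check that numba does not do anything strange to the result: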
# With numba
sl ~ rk + yr + 1
0.835190760538
# Without numba
sl ~ rk + yr + 1
0.835190760538
So the results are identical. Now let's see how the two versions perform:
# with numba
10 loops, best of 3: 264 ms per loop
# without numba
10 loops, best of 3: 252 ms per loop
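
The "10 loops, best of 3" figures look like IPython %timeit output. A comparable measurement can be sketched with the standard-library timeit module; the helper time_call and the stand-in workload busy_work are hypothetical names used only for illustration, not the functions timed above.

```python
import timeit


def time_call(func, *args, repeat=3, number=10):
    """Return the best average seconds per call over `repeat` runs of `number` calls."""
    runs = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(runs) / number


def busy_work(n):
    """Stand-in workload; replace with the function you want to measure."""
    return sum(i * i for i in range(n))


best = time_call(busy_work, 10_000)
print("{} loops, best of {}: {:.3f} ms per loop".format(10, 3, best * 1e3))
```

When timing a Numba-decorated function this way, it is worth calling it once beforehand so that compilation time is not counted in the measurement.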
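So this is exactly what I expected. With Python types and calls to external functions that are not jitted, you get no speedup at all. You can use numba to make code faster, but make sure you have read the numba documentation and checked what is supported: Python types and Numpy types.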
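
If the goal is simply a faster forward selection, one option is to keep the inner loop purely numeric so that it works on arrays instead of formula strings. The following is a minimal sketch under stated assumptions, not the method from the linked article: it swaps smf.ols for np.linalg.lstsq, the names adjusted_r2 and forward_select_columns are hypothetical, the data is synthetic, and no speedup has been measured here.

```python
import numpy as np


def adjusted_r2(X, y):
    """Adjusted R-squared of an OLS fit of y on the columns of X plus an intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = residuals @ residuals
    centered = y - y.mean()
    ss_tot = centered @ centered
    return 1.0 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))


def forward_select_columns(X, y):
    """Greedily add the column that most improves adjusted R-squared."""
    remaining = set(range(X.shape[1]))
    selected = []
    current_score = 0.0
    while remaining:
        score, best = max((adjusted_r2(X[:, selected + [j]], y), j)
                          for j in remaining)
        if score <= current_score:
            break  # no candidate improves the score any further
        remaining.remove(best)
        selected.append(best)
        current_score = score
    return selected, current_score


# Synthetic data (an assumption for this sketch): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] + 3.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

selected, score = forward_select_columns(X, y)
print(selected, round(score, 4))
```

With a DataFrame, the predictor matrix could be obtained with something like df.drop(columns=[response]).to_numpy(), and the returned indices mapped back to column names before fitting the final statsmodels model.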
Author: 黑洞官方问答小能手
Link: https://www.pythonheidong.com/blog/article/226451/d8e68e3c82d9862133d3/
Source: python黑洞网