Problem with pandas

Posted 2022-06-25 19:31



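Sorry for the vague title, but I really don't know where the problem lies. I want to load a CSV file, split it into two arrays, and run a function on each of them. It works for the first array, but the second one fails, even though everything is done the same way. I'm really stuck. The code is as follows: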

from wordutility import wordutility
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import cross_validation
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
import pandas as pd
import numpy as np

data = pd.read_csv('sts_gold_tweet.csv', header=None, delimiter=';',
               quotechar='"')

# test = pd.read_csv('output.csv', header=None,
#                   delimiter=';', quotechar='"')

split_ratio = 0.9
train = data[:round(len(data)*split_ratio)]
test = data[round(len(data)*split_ratio):]

y = data[1]

print("Cleaning and parsing tweets data...\n")

traindata = []

for i in range(0, len(train[0])):
    traindata.append(" ".join(wordutility.tweet_to_wordlist(train[0][i], False)))

testdata = []

for i in range(0, len(test[0])):
    testdata.append(" ".join(wordutility.tweet_to_wordlist(test[0][i], False)))

The program runs fine up to the last line. The error is:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.4/site-packages/pandas/core/series.py", line 509, in __getitem__
    result = self.index.get_value(self, key)
  File "/usr/lib/python3.4/site-packages/pandas/core/index.py", line   1417, in get_value
    return self._engine.get_value(s, k)
  File "pandas/index.pyx", line 100, in pandas.index.IndexEngine.get_value (pandas/index.c:3097)
  File "pandas/index.pyx", line 108, in pandas.index.IndexEngine.get_value (pandas/index.c:2826)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3692)
  File "pandas/hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7201)
  File "pandas/hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7139)
KeyError: 0

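(The traceback says line 2 because I was trying the code in the Python shell, so line 2 refers to the last line of the code above.)

I hope someone can help me :). Thanks

Edit

OK, it looks like the split doesn't work the way I thought. I do get the two arrays I wanted, but the rows are still numbered as if they belonged to one file. The array train runs from 0 to 1830 and the array test from 1831 to 2034, so the range in my loop is wrong. How would I split the CSV file "correctly"?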

Edit 2

>>> print(train[0:5])
                                               0         1
0  the angel is going to miss the athlete this we...  negative 
1  It looks as though Shaq is getting traded to C...  negative
2     @clarianne APRIL 9TH ISN'T COMING SOON ENOUGH   negative
3  drinking a McDonalds coffee and not understand...  negative
4  So dissapointed Taylor Swift doesnt have a Twi...  negative

>>> print(test[0:5])
                                                  0         1
1831  Why is my PSP always dead when I want to use it?   negative
1832  @hillaryrachel oh i know how you feel. i took ...  negative
1833  @daveknox awesome-  corporate housing took awa...  negative
1834  @lakersnation Is this a joke?  I can't find them   negative
1835                              XBox Live still down   negative

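So you can see that the array "test" starts at row 1831. I had expected it to start at 0. I have now solved my problem by changing the range in the for loop:

for i in range(len(train[0]), len(data)):

So my original problem is fixed; I'm just curious and eager to learn to write better code. Is this an acceptable way to do it, or should I split the CSV file differently?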


Solution


When you write test[0][i], test[0] does not give you the first row of test; it selects the column labelled 0, which is a Series. The second lookup, [i], is then done by row label, not by position. When you slice a pandas DataFrame in two, the original row labels are preserved. This means the test DataFrame has no row labelled 0, because that row ended up in the first DataFrame.

Let me give you an example. Suppose you have the following DataFrame:

    0   1
0   0  10
1   1  11
2   2  12
3   3  13

When you split it, you end up with these DataFrames:

    0   1
0   0  10
1   1  11

and:

    0   1
2   2  12
3   3  13

Notice that the row labels of the second DataFrame start at 2, not 0, because those were the labels before the split. So when you ask one of its columns for label 0, it isn't there. That is the source of your KeyError: 0.

The simplest solution is to index by position rather than by label. So instead of test[0][i], use test[0].iloc[i]. That returns the value at position i, whatever its label is.
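As a minimal sketch of the behaviour described above, the following reproduces the KeyError on a tiny hand-made DataFrame and shows three ways around it. The sample tweets, the split point `split_at`, and the use of `str.split` in place of `wordutility.tweet_to_wordlist` are assumptions made for illustration only; they are not taken from the original data or module.

```python
import pandas as pd

# Hypothetical stand-in for the tweet file: column 0 is text, column 1 is the label.
data = pd.DataFrame({
    0: ["tweet a", "tweet b", "tweet c", "tweet d", "tweet e"],
    1: ["negative", "positive", "negative", "positive", "negative"],
})

split_at = 3  # hypothetical split point, chosen only for this example
train = data[:split_at]
test = data[split_at:]

# Slicing keeps the original row labels, so test starts at label 3, not 0.
test_labels = list(test.index)

# A label-based lookup of 0 fails because test has no row labelled 0.
try:
    test[0][0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Option 1: positional lookup with .iloc ignores the labels.
first_by_position = test[0].iloc[0]

# Option 2: renumber the rows from 0 right after splitting.
test_reset = test.reset_index(drop=True)
first_after_reset = test_reset[0][0]

# Option 3: iterate over the column values directly, so no index is needed.
# str.split stands in for wordutility.tweet_to_wordlist here (an assumption).
testdata = [" ".join(tweet.split()) for tweet in test[0]]
```

Of the three, iterating over the column directly is probably the least error-prone for a loop like the one in the question, since it does not depend on how the rows are labelled. Resetting the index may be preferable if later code expects both DataFrames to be numbered from 0.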








Author: 黑洞官方问答小能手

Link: https://www.pythonheidong.com/blog/article/1607341/b74351f50bfe65082adc/

Source: python黑洞网

Reproduction in any form must credit the source; any infringement, once discovered, will be pursued under the law.
