How do I find the most common words in specific rows and a specific column of data.csv and list how often they occur? [duplicate]

Posted 2022-09-14 22:41


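I want to use Python to get the 20 most common words in the descriptions of the 10 longest movies in data.csv. So far I have the 10 longest movies, but I cannot get the most common words from those specific movies; my code only returns the most common words in the whole of data.csv. I have tried Counter, Pandas, Numpy, and Mathlib, but I don't know how to make Python look for the most common words in specific rows and a specific column (the movie descriptions) of the table.

My code: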

import pandas as pd
import numpy as np
df = pd.read_csv("data.csv")
small_df = df[['title','duration_min','description']]
result_time = small_df.sort_values('duration_min', ascending=False)
print("TOP 10 LONGEST: ")
print(result_time.head(n=10))

most_common = pd.Series(' '.join(result_time['description']).lower().split()).value_counts()[:20]
print("20 Most common words from TOP 10 longest movies: ")
print(most_common)

My output:

TOP 10 LONGEST: 
                             title  duration_min                                        description
6840        The School of Mischief         253.0  A high school teacher volunteers to transform ...
4482                No Longer kids         237.0  Hoping to prevent their father from skipping t...
3687            Lock Your Girls In         233.0  A widower believes he must marry off his three...
5100               Raya and Sakina         230.0  When robberies and murders targeting women swe...
5367                        Sangam         228.0  Returning home from war after being assumed de...
3514                        Lagaan         224.0  In 1890s India, an arrogant British commander ...
3190                  Jodhaa Akbar         214.0  In 16th-century India, what begins as a strate...
6497                  The Irishman         209.0  Hit man Frank Sheeran looks back at the secret...
3277      Kabhi Khushi Kabhie Gham         209.0  Years after his father disowns his adopted bro...
4476  No Direction Home: Bob Dylan         208.0  Featuring rare concert footage and interviews ...
20 Most common words from TOP 10 longest movies: 
a        10134
the       7153
to        5653
and       5573
of        4691
in        3840
his       3005
with      1967
her       1803
an        1727
for       1558
on        1528
their     1468
when      1320
this      1240
from      1114
as        1050
is         988
by         894
after      865
dtype: int64

Here is the data table: https://www.dropbox.com/s/hxch4v08bkthvz1/data.csv?dl=1


Solution


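You can select the first 10 rows of the DataFrame with iloc[0:10]. In the original code, head(n=10) only limits what is printed; result_time still contains every row, so the word count ran over all descriptions.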
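The difference between printing a preview and actually selecting rows can be checked on a tiny table. The following is a minimal sketch; the three rows and the names toy, ranked, preview, and top2 are made up for illustration and are not taken from data.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for data.csv
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "duration_min": [90.0, 200.0, 150.0],
    "description": ["a short film", "a very long epic", "a medium drama"],
})

ranked = toy.sort_values("duration_min", ascending=False)

# head() returns a new object; ranked itself still holds every row
preview = ranked.head(n=2)
print(len(preview), len(ranked))

# iloc[0:2] selects the first two rows by position, i.e. the two longest
top2 = ranked.iloc[0:2]
print(top2["title"].tolist())
```

Any later expression that uses ranked directly, such as ranked['description'], therefore still sees all rows unless it is sliced first.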

In this case the solution looks like this, with minimal changes to the existing code:

import pandas as pd
import numpy as np
df = pd.read_csv("data.csv")
# Keep only the columns needed for this task
small_df = df[['title','duration_min','description']]
# Sort by duration, longest first
result_time = small_df.sort_values('duration_min', ascending=False)
print("TOP 10 LONGEST: ")
print(result_time.head(n=10))

# iloc[0:10] restricts the word count to the 10 longest movies
most_common = pd.Series(' '.join(result_time.iloc[0:10]['description']).lower().split()).value_counts()[:20]
print("20 Most common words from TOP 10 longest movies: ")
print(most_common)
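The same idea can also be written without joining all descriptions into one string. The sketch below is one possible variant, not the answer's method: the function name top_words, its parameters, the optional stop_words filter, and the toy rows are all assumptions added for illustration. It uses nlargest to pick the longest movies and explode to get one word per row.

```python
import pandas as pd

def top_words(df, n_movies=10, n_words=20, stop_words=frozenset()):
    """Count the most frequent words in the descriptions of the longest movies."""
    # Pick the n_movies rows with the largest duration
    longest = df.nlargest(n_movies, "duration_min")
    # Lowercase, split on whitespace, and flatten to one word per row
    words = (
        longest["description"]
        .dropna()
        .str.lower()
        .str.split()
        .explode()
    )
    # Optionally drop filler words (hypothetical extra, not in the original)
    words = words[~words.isin(list(stop_words))]
    return words.value_counts().head(n_words)

# Hypothetical toy data; the column names match those used in the question
toy = pd.DataFrame({
    "title": ["W", "X", "Y", "Z"],
    "duration_min": [90.0, 200.0, 150.0, 100.0],
    "description": [
        "A cat sleeps",
        "A king and a queen",
        "a king returns",
        "A dog barks",
    ],
})

print(top_words(toy, n_movies=2, n_words=3, stop_words={"a", "and"}))
```

With the real file, a call such as top_words(pd.read_csv("data.csv")) should give the same kind of result as the iloc version. Passing a stop-word set is optional; it helps when words like "a" and "the" would otherwise dominate the list, as they do in the output shown in the question.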


Author: 黑洞官方问答小能手

Link: https://www.pythonheidong.com/blog/article/1736418/ad26d563373991084d1d/

Source: python黑洞网
