发布于2020-02-14 20:37 阅读(493) 评论(0) 点赞(30) 收藏(3)
数据清洗是数据分析关键的一步,直接影响之后的处理工作
数据需要修改吗?有什么需要修改的吗?数据应该怎么调整才能适用于接下来的分析和挖掘?
是一个迭代的过程,实际项目中可能需要不止一次地执行这些清洗操作
处理缺失数据:pd.fillna(),pd.dropna()
数据连接(pd.merge)
pd.merge
根据单个或多个键将不同DataFrame的行连接起来
类似数据库的连接操作
- import pandas as pd
- import numpy as np
-
- df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
- 'data1' : np.random.randint(0,10,7)})
- df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
- 'data2' : np.random.randint(0,10,3)})
-
- print(df_obj1)
- print(df_obj2)
运行结果:
- data1 key
- data1 key
- 0 8 b
- 1 8 b
- 2 3 a
- 3 5 c
- 4 4 a
- 5 9 a
- 6 6 b
-
- data2 key
- 0 9 a
- 1 0 b
- 2 3 d
1. 默认将重叠列的列名作为“外键”进行连接
- # 默认将重叠列的列名作为“外键”进行连接
- print(pd.merge(df_obj1, df_obj2))
运行结果:
- data1 key data2
- 0 8 b 0
- 1 8 b 0
- 2 6 b 0
- 3 3 a 9
- 4 4 a 9
- 5 9 a 9
2. on显示指定“外键”
- # on显示指定“外键”
- print(pd.merge(df_obj1, df_obj2, on='key'))
运行结果:
- data1 key data2
- 0 8 b 0
- 1 8 b 0
- 2 6 b 0
- 3 3 a 9
- 4 4 a 9
- 5 9 a 9
3. left_on,左侧数据的“外键”,right_on,右侧数据的“外键”
- # left_on,right_on分别指定左侧数据和右侧数据的“外键”
-
- # 更改列名
- df_obj1 = df_obj1.rename(columns={'key':'key1'})
- df_obj2 = df_obj2.rename(columns={'key':'key2'})
-
- print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2'))
运行结果:
- data1 key1 data2 key2
- 0 8 b 0 b
- 1 8 b 0 b
- 2 6 b 0 b
- 3 3 a 9 a
- 4 4 a 9 a
- 5 9 a 9 a
默认是“内连接”(inner),即结果中的键是交集
how
指定连接方式
4. “外连接”(outer),结果中的键是并集
- # “外连接”
- print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer'))
运行结果:
- data1 key1 data2 key2
- 0 8.0 b 0.0 b
- 1 8.0 b 0.0 b
- 2 6.0 b 0.0 b
- 3 3.0 a 9.0 a
- 4 4.0 a 9.0 a
- 5 9.0 a 9.0 a
- 6 5.0 c NaN NaN
- 7 NaN NaN 3.0 d
5. “左连接”(left)
- # 左连接
- print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left'))
运行结果:
- data1 key1 data2 key2
- 0 8 b 0.0 b
- 1 8 b 0.0 b
- 2 3 a 9.0 a
- 3 5 c NaN NaN
- 4 4 a 9.0 a
- 5 9 a 9.0 a
- 6 6 b 0.0 b
6. “右连接”(right)
- # 右连接
- print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right'))
运行结果:
- data1 key1 data2 key2
- 0 8.0 b 0 b
- 1 8.0 b 0 b
- 2 6.0 b 0 b
- 3 3.0 a 9 a
- 4 4.0 a 9 a
- 5 9.0 a 9 a
- 6 NaN NaN 3 d
7. 处理重复列名
suffixes,默认为_x, _y
-
- # 处理重复列名
- df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
- 'data' : np.random.randint(0,10,7)})
- df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
- 'data' : np.random.randint(0,10,3)})
-
- print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))
运行结果:
- data_left key data_right
- 0 9 b 1
- 1 5 b 1
- 2 1 b 1
- 3 2 a 8
- 4 2 a 8
- 5 5 a 8
8. 按索引连接
left_index=True或right_index=True
-
- # 按索引连接
- df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
- 'data1' : np.random.randint(0,10,7)})
- df_obj2 = pd.DataFrame({'data2' : np.random.randint(0,10,3)}, index=['a', 'b', 'd'])
-
- print(pd.merge(df_obj1, df_obj2, left_on='key', right_index=True))
运行结果:
- data1 key data2
- 0 3 b 6
- 1 4 b 6
- 6 8 b 6
- 2 6 a 0
- 4 3 a 0
- 5 0 a 0
数据合并(pd.concat)
1. NumPy的concat
np.concatenate
- import numpy as np
- import pandas as pd
-
- arr1 = np.random.randint(0, 10, (3, 4))
- arr2 = np.random.randint(0, 10, (3, 4))
-
- print(arr1)
- print(arr2)
-
- print(np.concatenate([arr1, arr2]))
- print(np.concatenate([arr1, arr2], axis=1))
运行结果:
- # print(arr1)
- [[3 3 0 8]
- [2 0 3 1]
- [4 8 8 2]]
-
- # print(arr2)
- [[6 8 7 3]
- [1 6 8 7]
- [1 4 7 1]]
-
- # print(np.concatenate([arr1, arr2]))
- [[3 3 0 8]
- [2 0 3 1]
- [4 8 8 2]
- [6 8 7 3]
- [1 6 8 7]
- [1 4 7 1]]
-
- # print(np.concatenate([arr1, arr2], axis=1))
- [[3 3 0 8 6 8 7 3]
- [2 0 3 1 1 6 8 7]
- [4 8 8 2 1 4 7 1]]
2. pd.concat
注意指定轴方向,默认axis=0
join指定合并方式,默认为outer
Series合并时查看行索引有无重复
1) index 没有重复的情况
- # index 没有重复的情况
- ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0,5))
- ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5,9))
- ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9,12))
-
- print(ser_obj1)
- print(ser_obj2)
- print(ser_obj3)
-
- print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
- print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
运行结果:
- # print(ser_obj1)
- 0 1
- 1 8
- 2 4
- 3 9
- 4 4
- dtype: int64
-
- # print(ser_obj2)
- 5 2
- 6 6
- 7 4
- 8 2
- dtype: int64
-
- # print(ser_obj3)
- 9 6
- 10 2
- 11 7
- dtype: int64
-
- # print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
- 0 1
- 1 8
- 2 4
- 3 9
- 4 4
- 5 2
- 6 6
- 7 4
- 8 2
- 9 6
- 10 2
- 11 7
- dtype: int64
-
- # print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
- 0 1 2
- 0 1.0 NaN NaN
- 1 5.0 NaN NaN
- 2 3.0 NaN NaN
- 3 2.0 NaN NaN
- 4 4.0 NaN NaN
- 5 NaN 9.0 NaN
- 6 NaN 8.0 NaN
- 7 NaN 3.0 NaN
- 8 NaN 6.0 NaN
- 9 NaN NaN 2.0
- 10 NaN NaN 3.0
- 11 NaN NaN 3.0
2) index 有重复的情况
- # index 有重复的情况
- ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
- ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
- ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))
-
- print(ser_obj1)
- print(ser_obj2)
- print(ser_obj3)
-
- print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
运行结果:
- # print(ser_obj1)
- 0 0
- 1 3
- 2 7
- 3 2
- 4 5
- dtype: int64
-
- # print(ser_obj2)
- 0 5
- 1 1
- 2 9
- 3 9
- dtype: int64
-
- # print(ser_obj3)
- 0 8
- 1 7
- 2 9
- dtype: int64
-
- # print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
- 0 0
- 1 3
- 2 7
- 3 2
- 4 5
- 0 5
- 1 1
- 2 9
- 3 9
- 0 8
- 1 7
- 2 9
- dtype: int64
-
- # print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner'))
- # join='inner' 将去除NaN所在的行或列
- 0 1 2
- 0 0 5 8
- 1 3 1 7
- 2 7 9 9
3) DataFrame合并时同时查看行索引和列索引有无重复
- df_obj1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)), index=['a', 'b', 'c'],
- columns=['A', 'B'])
- df_obj2 = pd.DataFrame(np.random.randint(0, 10, (2, 2)), index=['a', 'b'],
- columns=['C', 'D'])
- print(df_obj1)
- print(df_obj2)
-
- print(pd.concat([df_obj1, df_obj2]))
- print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
运行结果:
- # print(df_obj1)
- A B
- a 3 3
- b 5 4
- c 8 6
-
- # print(df_obj2)
- C D
- a 1 9
- b 6 8
-
- # print(pd.concat([df_obj1, df_obj2]))
- A B C D
- a 3.0 3.0 NaN NaN
- b 5.0 4.0 NaN NaN
- c 8.0 6.0 NaN NaN
- a NaN NaN 1.0 9.0
- b NaN NaN 6.0 8.0
-
- # print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
- A B C D
- a 3 3 1 9
- b 5 4 6 8
数据重构
1. stack
将列索引旋转为行索引,完成层级索引
DataFrame->Series
- import numpy as np
- import pandas as pd
-
- df_obj = pd.DataFrame(np.random.randint(0,10, (5,2)), columns=['data1', 'data2'])
- print(df_obj)
-
- stacked = df_obj.stack()
- print(stacked)
运行结果:
- # print(df_obj)
- data1 data2
- 0 7 9
- 1 7 8
- 2 8 9
- 3 4 1
- 4 1 2
-
- # print(stacked)
- 0 data1 7
- data2 9
- 1 data1 7
- data2 8
- 2 data1 8
- data2 9
- 3 data1 4
- data2 1
- 4 data1 1
- data2 2
- dtype: int64
2. unstack
将层级索引展开
Series->DataFrame
认操作内层索引,即level=-1
- # 默认操作内层索引
- print(stacked.unstack())
-
- # 通过level指定操作索引的级别
- print(stacked.unstack(level=0))
运行结果:
- # print(stacked.unstack())
- data1 data2
- 0 7 9
- 1 7 8
- 2 8 9
- 3 4 1
- 4 1 2
-
- # print(stacked.unstack(level=0))
- 0 1 2 3 4
- data1 7 7 8 4 1
- data2 9 8 9 1 2
数据转换
1 duplicated()
返回布尔型Series表示每行是否为重复行
- import numpy as np
- import pandas as pd
-
- df_obj = pd.DataFrame({'data1' : ['a'] * 4 + ['b'] * 4,
- 'data2' : np.random.randint(0, 4, 8)})
- print(df_obj)
-
- print(df_obj.duplicated())
运行结果:
- # print(df_obj)
- data1 data2
- 0 a 3
- 1 a 2
- 2 a 3
- 3 a 3
- 4 b 1
- 5 b 0
- 6 b 3
- 7 b 0
-
- # print(df_obj.duplicated())
- 0 False
- 1 False
- 2 True
- 3 True
- 4 False
- 5 False
- 6 False
- 7 True
- dtype: bool
2 drop_duplicates()
过滤重复行
默认判断全部列
可指定按某些列判断
- print(df_obj.drop_duplicates())
- print(df_obj.drop_duplicates('data2'))
运行结果:
- # print(df_obj.drop_duplicates())
- data1 data2
- 0 a 3
- 1 a 2
- 4 b 1
- 5 b 0
- 6 b 3
-
- # print(df_obj.drop_duplicates('data2'))
- data1 data2
- 0 a 3
- 1 a 2
- 4 b 1
- 5 b 0
3. 根据map
传入的函数对每行或每列进行转换
map
传入的函数对每行或每列进行转换示例代码:
- ser_obj = pd.Series(np.random.randint(0,10,10))
- print(ser_obj)
-
- print(ser_obj.map(lambda x : x ** 2))
运行结果:
- # print(ser_obj)
- 0 1
- 1 4
- 2 8
- 3 6
- 4 8
- 5 6
- 6 6
- 7 4
- 8 7
- 9 3
- dtype: int64
-
- # print(ser_obj.map(lambda x : x ** 2))
- 0 1
- 1 16
- 2 64
- 3 36
- 4 64
- 5 36
- 6 36
- 7 16
- 8 49
- 9 9
- dtype: int64
replace
根据值的内容进行替换
- # 单个值替换单个值
- print(ser_obj.replace(1, -100))
-
- # 多个值替换一个值
- print(ser_obj.replace([6, 8], -100))
-
- # 多个值替换多个值
- print(ser_obj.replace([4, 7], [-100, -200]))
运行结果:
- # print(ser_obj.replace(1, -100))
- 0 -100
- 1 4
- 2 8
- 3 6
- 4 8
- 5 6
- 6 6
- 7 4
- 8 7
- 9 3
- dtype: int64
-
- # print(ser_obj.replace([6, 8], -100))
- 0 1
- 1 4
- 2 -100
- 3 -100
- 4 -100
- 5 -100
- 6 -100
- 7 4
- 8 7
- 9 3
- dtype: int64
-
- # print(ser_obj.replace([4, 7], [-100, -200]))
- 0 1
- 1 -100
- 2 8
- 3 6
- 4 8
- 5 6
- 6 6
- 7 -100
- 8 -200
- 9 3
- dtype: int64
作者:232hdsjdh
链接:https://www.pythonheidong.com/blog/article/231520/564b6c69007301e21988/
来源:python黑洞网
任何形式的转载都请注明出处,如有侵权 一经发现 必将追究其法律责任
昵称:
评论内容:(最多支持255个字符)
---无人问津也好,技不如人也罢,你都要试着安静下来,去做自己该做的事,而不是让内心的烦躁、焦虑,坏掉你本来就不多的热情和定力
Copyright © 2018-2021 python黑洞网 All Rights Reserved 版权所有,并保留所有权利。 京ICP备18063182号-1
投诉与举报,广告合作请联系vgs_info@163.com或QQ3083709327
免责声明:网站文章均由用户上传,仅供读者学习交流使用,禁止用做商业用途。若文章涉及色情,反动,侵权等违法信息,请向我们举报,一经核实我们会立即删除!