A question about cleaning data in a Python list of dicts

I have a list of dicts like this:

l = [{'name': 'aa', 'type': '游戏'}, {'name': 'bb', 'type': '游戏'}, {'name': 'cc', 'type': '学习'}]

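Given a list like the one above, where every dict has a type key, how can I filter out the dicts whose type differs from the majority?

For example, say the list has 8 dicts in total: 6 with type 游戏 (games), 1 with type 学习 (study), and 1 with type 玩耍 (play). How do I filter out the last two?

Of course the types are not known in advance. The most frequent one isn't necessarily 游戏; it could just as well be 吃饭 (eating) or 睡觉 (sleeping).

Could anyone point me in the right direction?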

ipwx

2018-12-05 11:00:51 +08:00

Compute the percentage of each type, then pick a threshold based on Zipf's Law and drop the types whose percentage is small.
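
A minimal sketch of this idea, for illustration only. The reply does not say how to derive the threshold from Zipf's Law, so the fixed `min_share=0.5` below is an assumption, and `drop_rare_types` is a hypothetical helper name rather than anything from the thread.

```python
from collections import Counter

def drop_rare_types(records, min_share=0.5):
    """Keep records whose type makes up at least min_share of all records.

    min_share is an assumed, hand-picked cutoff; tune it for real data.
    """
    counts = Counter(r['type'] for r in records)
    total = len(records)
    # Share of each type as a fraction between 0 and 1
    shares = {t: n / total for t, n in counts.items()}
    return [r for r in records if shares[r['type']] >= min_share]

# The scenario from the question: 6 x 游戏, 1 x 学习, 1 x 玩耍
records = (
    [{'name': f'g{i}', 'type': '游戏'} for i in range(6)]
    + [{'name': 'x', 'type': '学习'}, {'name': 'y', 'type': '玩耍'}]
)
kept = drop_rare_types(records)
```

With this sample, 游戏 has a share of 0.75 and the other two types 0.125 each, so only the six 游戏 records remain.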

necomancer

2018-12-05 11:24:04 +08:00

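If the data set is small:
# Sort by type so that records of the same type sit next to each other
lst = sorted(l, key=lambda x: x.get('type'))
ret = [[]]
for prv, nxt in zip(lst[:-1], lst[1:]):
    ret[-1].append(prv)
    # Start a new group whenever the type changes
    if prv['type'] != nxt['type']:
        ret.append([])
ret[-1].append(lst[-1])  # the loop never appends the last record
Then take the longest group in ret, or just use groupby:
from itertools import groupby
[list(g) for c, g in groupby(lst, key=lambda x: x.get('type'))]
Either way, the list has to be sorted first.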
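
Put together, the sort-then-group approach might look like the sketch below. `largest_group` and the `sample` data are illustrative names, and the sketch assumes every dict has a `'type'` key.

```python
from itertools import groupby
from operator import itemgetter

def largest_group(records):
    """Return the records that share the most frequent type."""
    by_type = itemgetter('type')
    # groupby only merges adjacent items, so sort by the same key first
    ordered = sorted(records, key=by_type)
    groups = [list(g) for _, g in groupby(ordered, key=by_type)]
    return max(groups, key=len) if groups else []

sample = [
    {'name': 'aa', 'type': '游戏'},
    {'name': 'cc', 'type': '学习'},
    {'name': 'bb', 'type': '游戏'},
    {'name': 'dd', 'type': '游戏'},
]
majority = largest_group(sample)
```

Because `sorted` is stable, records within a group keep their original relative order.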

Or use pandas:
import pandas as pd
l = [{'name': 'aa', 'type': '游戏'},
     {'name': 'cc', 'type': '学习'},
     {'name': 'bb', 'type': '游戏'}]  # order does not matter

list(pd.DataFrame(l).groupby('type')) does the job. The output is a list of tuples, one per category:

[(group name 1, DataFrame with group 1's rows), (group name 2, DataFrame with group 2's rows), ...]. The size of each group can be read from the DataFrame's shape.

In [40]: list(pd.DataFrame(l).groupby('type'))
Out[40]:
[('学习',   name type
  1   cc   学习),
 ('游戏',   name type
  0   aa   游戏
  2   bb   游戏)]

In [41]: p = list(pd.DataFrame(l).groupby('type'))[1][1]

In [42]: p.shape
Out[42]: (2, 2)

In [43]: p
Out[43]:
  name type
0   aa   游戏
2   bb   游戏

For a moderate amount of data, pandas is already quite efficient. If the data gets larger still, consider the approach in #1.
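
As a hedged sketch of how the pandas route could go straight to the filtered result: instead of listing the groups, count the types with `value_counts` and keep the rows of the most frequent one. `keep_majority_type` is a hypothetical helper name, and the sketch assumes a non-empty list in which every dict has a `'type'` key.

```python
import pandas as pd

def keep_majority_type(records):
    """Return the records whose type is the most frequent one."""
    df = pd.DataFrame(records)
    # value_counts() counts each type; idxmax() gives the type with the highest count
    top_type = df['type'].value_counts().idxmax()
    return df[df['type'] == top_type].to_dict('records')

l = [{'name': 'aa', 'type': '游戏'},
     {'name': 'cc', 'type': '学习'},
     {'name': 'bb', 'type': '游戏'}]
result = keep_majority_type(l)
```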

cyy564

2018-12-05 11:24:30 +08:00

@ipwx I got stuck at the very first step: I couldn't think of a good way to count the percentage of each type.

necomancer

2018-12-05 11:24:39 +08:00

from itertools import groupby
[ list(g) for c, g in groupby(lst, key=(lambda x : x.get('type'))) ]

necomancer

2018-12-05 11:27:13 +08:00

@cyy564 The percentages are easy to count:

ret = {}
for i in l:
    if not ret.get(i['type']):
        ret[i['type']] = 0
    ret.get(i['type']) +=1

This works easily even when you don't know in advance how many types there are.

necomancer

2018-12-05 11:28:25 +08:00

Sorry,

ret = {}
for i in l:
    # First time this type is seen: start its count at 0
    if not ret.get(i['type']):
        ret[i['type']] = 0
    ret[i['type']] += 1

cyy564

2018-12-05 11:31:56 +08:00

@necomancer Thanks, this was a huge help: [list(g) for c, g in groupby(lst, key=(lambda x: x.get('type')))]

cyy564

2018-12-05 11:43:13 +08:00

@necomancer

Hmm... if l becomes [{'name': 'aa', 'type': '游戏'}, {'name': 'bb', 'type': '游戏'}, {'name': 'cc', 'type': '学习'}, {'name': 'dd', 'type': '游戏'}]

then [list(g) for c, g in groupby(l, key=(lambda x: x.get('type')))] actually splits them apart.

Output: [[{'name': 'aa', 'type': '游戏'}, {'name': 'bb', 'type': '游戏'}], [{'name': 'cc', 'type': '学习'}], [{'name': 'dd', 'type': '游戏'}]]

That is not the result I want, so I'll look into groupby in pandas instead.

necomancer

2018-12-05 12:01:13 +08:00

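@cyy564 As I already said in #2, this needs the list to be sorted first, whereas pandas does not care about order. So for small data, plain Python sorted + itertools.groupby is enough; for somewhat larger data, consider pandas.DataFrame.groupby; and if it is really, really huge, consider the approach in #1.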
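
For completeness, here is an alternative that is not proposed in the thread: collecting the records into a dict of lists in a single pass also ignores input order and needs neither sorting nor pandas. `group_by_type` is a hypothetical helper name; the data is the list from the previous post.

```python
from collections import defaultdict

def group_by_type(records):
    """Group records by their 'type' value, regardless of input order."""
    groups = defaultdict(list)
    for r in records:
        groups[r['type']].append(r)
    return dict(groups)

l = [{'name': 'aa', 'type': '游戏'}, {'name': 'bb', 'type': '游戏'},
     {'name': 'cc', 'type': '学习'}, {'name': 'dd', 'type': '游戏'}]
groups = group_by_type(l)
# The biggest group is the one to keep
largest = max(groups.values(), key=len)
```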

darkTianTian

2018-12-06 00:25:27 +08:00

If name isn't needed, you can do:
from collections import Counter
Counter([x['type'] for x in l]).most_common()
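
If the names do matter, the same `Counter` can be used just to find the dominant type and then filter the original list. This is a sketch under that assumption; `keep_most_common_type` is a hypothetical helper name, and ties between equally frequent types are not handled specially.

```python
from collections import Counter

def keep_most_common_type(records):
    """Keep the full records (names included) of the most frequent type."""
    if not records:
        return []
    # most_common(1) returns a one-element list of (type, count)
    top_type, _ = Counter(r['type'] for r in records).most_common(1)[0]
    return [r for r in records if r['type'] == top_type]

l = [{'name': 'aa', 'type': '游戏'}, {'name': 'bb', 'type': '游戏'},
     {'name': 'cc', 'type': '学习'}]
filtered = keep_most_common_type(l)
```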


https://www.v2ex.com/t/514458
