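A Python performance question

There are many schools, say ten million for now. Each school has many classes, and each class has many boy/girl pairs (boys and girls always come in pairs, so their numbers are equal).

For each class I need to compute some difference between the boys and the girls, e.g. in height or age. All you need to know is that this step is somewhat time-consuming. At the end the results are written to a file, class by class.

It has to be implemented in Python, and speed matters.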

lithbitren

2020-06-23 20:50:54 +08:00

import collections
import random

# One million random students spread over 2000 classes.
students = [
    {
        'class': random.randrange(2000),
        'sex': random.randint(0, 1),  # 1 is treated as male below, 0 as female
        'height': random.randrange(150, 190)
    }
    for _ in range(1_000_000)
]

# Per-class running sums and counts, created on first access.
collect = collections.defaultdict(lambda: {
    'maleSum': 0,
    'maleCount': 0,
    'femaleSum': 0,
    'femaleCount': 0
})

for student in students:
    stats = collect[student['class']]
    if student['sex']:
        stats['maleSum'] += student['height']
        stats['maleCount'] += 1
    else:
        stats['femaleSum'] += student['height']
        stats['femaleCount'] += 1

# Mean male height minus mean female height, one value per class.
result = [
    stats['maleSum'] / stats['maleCount'] - stats['femaleSum'] / stats['femaleCount']
    for stats in collect.values()
]

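I tested it: at the million-row scale the query definitely takes no more than half a second, and that is with keyed dicts. Swapping the temporary dicts for arrays would probably make it several times faster, and splitting the data into typed numpy arrays and turning on numba would probably make it several times faster again. I can't believe you actually sat there waiting for tens of minutes...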
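For what it's worth, here is a rough sketch of the "typed numpy arrays" idea. It uses np.bincount for the per-class sums and counts, which is my own choice and not something tested above, and it leaves numba out entirely. The column layout (class_id, sex, height) and the helper name mean_height_gap are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_classes = 1_000_000, 2000

# Assumed layout: one typed array per field instead of a list of dicts.
class_id = rng.integers(0, n_classes, size=n_students)
sex = rng.integers(0, 2, size=n_students)  # 1 = male, 0 = female
height = rng.integers(150, 190, size=n_students).astype(np.float64)


def mean_height_gap(class_id, sex, height, n_classes):
    """Mean male height minus mean female height for every class id."""
    male = sex == 1
    # bincount with weights gives per-class sums; without weights, per-class counts.
    male_sum = np.bincount(class_id[male], weights=height[male], minlength=n_classes)
    male_cnt = np.bincount(class_id[male], minlength=n_classes)
    female_sum = np.bincount(class_id[~male], weights=height[~male], minlength=n_classes)
    female_cnt = np.bincount(class_id[~male], minlength=n_classes)
    # A class with no boys or no girls yields nan instead of raising.
    with np.errstate(divide="ignore", invalid="ignore"):
        return male_sum / male_cnt - female_sum / female_cnt


gap = mean_height_gap(class_id, sex, height, n_classes)
```

Here gap[i] is the result for class i, so unlike the dict version the output order is fixed by class id. I have not timed this against the dict loop, so treat any speedup as something to measure on your own data.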

necomancer

2020-06-23 21:07:48 +08:00

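numpy is enough. Anaconda's numpy build comes with MKL acceleration. Take heights: put the data in an array of shape (10, 5, 50, 2), meaning 10 schools, 5 classes per school, and 50 boys and 50 girls per class as two groups of heights. Then np.mean(data, axis=(0,1)) averages over the school and class axes. Whatever other statistics you need, numpy has ready-made functions for them.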
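A small sketch of that layout, with random heights standing in for real data. Which index on the last axis means boys and which means girls is an assumption here. Note that for the per-class difference the original question asks about, the reduction would go over the student axis (axis=2); axis=(0, 1) pools all schools and classes together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Axes: (school, class, pair index, sex). Assumed: sex 0 = boys, 1 = girls.
data = rng.integers(150, 190, size=(10, 5, 50, 2)).astype(np.float64)

# Average over the 50 students of each class: one boys' mean and one girls' mean per class.
per_class_mean = data.mean(axis=2)  # shape (10, 5, 2)

# Boys' mean minus girls' mean for every (school, class).
gap = per_class_mean[..., 0] - per_class_mean[..., 1]  # shape (10, 5)

# For comparison, axis=(0, 1) averages across schools and classes instead.
pooled = np.mean(data, axis=(0, 1))  # shape (50, 2)
```

With a fixed-shape array like this, gap[s, c] is the value for class c of school s. This only works as-is when every class has the same number of pairs; ragged class sizes would need a different layout.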

linvaux

2020-06-23 23:27:28 +08:00

@sss495088732 that's seriously impressive

btv2bt

2020-06-29 01:52:18 +08:00

pyspark?


https://www.v2ex.com/t/684133
