Numerical Computing Performance: M1 Chip vs. Kunpeng 920

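Inspired by this thread, which is also the source of all data other than the Kunpeng 920 results: https://v2ex.com/t/733777

The comparison uses NumPy-based numerical computation (accelerated with Neon SIMD on the Arm chips). The test script is: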

https://gist.github.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276
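For readers who do not want to open the link, here is a minimal sketch of how timings of this kind can be taken. It is not the code from the gist: the helper names `mean_seconds` and `run_numpy_benchmarks`, the repeat count, and the small default matrix size are all assumptions made for illustration (the thread itself uses 4096x4096 and 2048x2048 matrices).

```python
import time


def mean_seconds(fn, repeats=3):
    """Call fn() `repeats` times and return the mean wall-clock time in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def run_numpy_benchmarks(size=512):
    """Time a few dense linear-algebra calls on random size x size matrices."""
    import numpy as np  # imported lazily so mean_seconds works without NumPy

    rng = np.random.default_rng(0)
    a = rng.random((size, size))
    b = rng.random((size, size))
    # Cholesky needs a symmetric positive definite matrix, so build one.
    spd = a @ a.T + size * np.eye(size)

    return {
        "matmul": mean_seconds(lambda: a @ b),
        "svd": mean_seconds(lambda: np.linalg.svd(a, full_matrices=False)),
        "cholesky": mean_seconds(lambda: np.linalg.cholesky(spd)),
        "eig": mean_seconds(lambda: np.linalg.eig(a)),
    }
```

Calling `print(run_numpy_benchmarks())` on a machine with NumPy installed prints one mean time per operation. The absolute numbers depend heavily on which BLAS/LAPACK build NumPy is linked against.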

Without further ado, here are the results:

Item	M1	Kunpeng 920 (12 cores)	Kunpeng 920 (24 cores)	Core i9
4096x4096 matrix multiplication	0.53 s	1.48 s	0.76 s	0.45 s
Dot product of 524288-element vectors	0.25 ms	0.49 ms	0.48 ms	0.05 ms
2048x1024 SVD	0.59 s	1.10 s	0.93 s	0.32 s
2048x2048 Cholesky decomposition	0.08 s	0.14 s	0.13 s	0.08 s
2048x2048 eigendecomposition	4.74 s	8.36 s	7.66 s	3.53 s

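Conclusions:

Because NumPy calls low-level acceleration libraries, it can use multiple cores for numerical computation. Overall, even the 24-core Kunpeng 920 takes roughly 1.4 to 1.9 times as long as the M1: the vector dot product takes nearly twice as long (0.48 ms vs. 0.25 ms) and the SVD about 1.6 times as long (0.93 s vs. 0.59 s).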
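The ratios behind this comparison can be recomputed directly from the table. The sketch below only restates the posted timings; the dictionary layout and the function name `slowdown_vs` are made up for illustration.

```python
# Timings copied from the table in this post.
# Units: seconds, except the dot product row, which is in milliseconds.
TIMINGS = {
    "matmul 4096x4096": {"M1": 0.53, "Kunpeng 12c": 1.48, "Kunpeng 24c": 0.76, "Core i9": 0.45},
    "dot 524288 (ms)": {"M1": 0.25, "Kunpeng 12c": 0.49, "Kunpeng 24c": 0.48, "Core i9": 0.05},
    "SVD 2048x1024": {"M1": 0.59, "Kunpeng 12c": 1.10, "Kunpeng 24c": 0.93, "Core i9": 0.32},
    "Cholesky 2048x2048": {"M1": 0.08, "Kunpeng 12c": 0.14, "Kunpeng 24c": 0.13, "Core i9": 0.08},
    "eig 2048x2048": {"M1": 4.74, "Kunpeng 12c": 8.36, "Kunpeng 24c": 7.66, "Core i9": 3.53},
}


def slowdown_vs(baseline, timings=TIMINGS):
    """Return time(chip) / time(baseline) per task; values above 1 mean slower than baseline."""
    return {
        task: {chip: t / row[baseline] for chip, t in row.items()}
        for task, row in timings.items()
    }


if __name__ == "__main__":
    for task, row in slowdown_vs("M1").items():
        cells = ", ".join(f"{chip}: {ratio:.2f}x" for chip, ratio in row.items())
        print(f"{task}: {cells}")
```

Because each ratio is taken within one row, the mixed units (seconds and milliseconds) do not affect the result.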

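The Core i9 results were obtained by @pb941129 from the original thread on a 16-inch MBP with an i9. Numerical computing is a traditional Intel strength, and with MKL underneath it matches or beats the M1 (measured by @YUX from the original thread) in every item; the Cholesky decomposition is a tie at 0.08 s.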

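Notes:

1. The Kunpeng 920 was tested on Huawei Cloud.

2. Except for the Core i9, NumPy was installed via Miniforge in all cases, with the following acceleration library configuration: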

blas_info:
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/miniforge3/lib']
include_dirs = ['/root/miniforge3/include']
language = c
define_macros = [('HAVE_CBLAS', None)]

blas_opt_info:
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/miniforge3/lib']
include_dirs = ['/root/miniforge3/include']
language = c

lapack_info:
libraries = ['lapack', 'blas', 'lapack', 'blas']
library_dirs = ['/root/miniforge3/lib']
language = f77

lapack_opt_info:
libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/miniforge3/lib']
language = c
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
include_dirs = ['/root/miniforge3/include']

23 replies • 2021-04-25 16:28:50 +08:00

felixcode

Apr 23, 2021

Was a dedicated server used on Huawei Cloud?
Did the i9 hit its power or thermal limits?

YRInc

Apr 23, 2021

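@felixcode The power limit probably doesn't matter here, since the full-load computation only lasts a few seconds. On Huawei Cloud I used a single node and also ran it at full load; I don't know whether it is a dedicated server.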

0987363

Apr 23, 2021

Multi-core results aren't much of a reference. Are there any single-core results?

YRInc

Apr 23, 2021 via iPhone

@0987363 Since this tests numerical computation, it mainly compares multi-core and SIMD performance; general-purpose computing was not compared.

0987363

Apr 23, 2021

Dotted two 4096x4096 matrices in 0.35 s.
Dotted two vectors of length 524288 in 0.03 ms.
SVD of a 2048x1024 matrix in 0.30 s.
Cholesky decomposition of a 2048x2048 matrix in 0.05 s.
Eigendecomposition of a 2048x2048 matrix in 3.05 s.

Ran it on a Hackintosh with a 10850K.
It only uses up to 10 threads.

YRInc

Apr 23, 2021 via iPhone

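@0987363 Right, it depends on how the acceleration library is configured. By default (I don't know exactly why), the maximum thread count is capped at the number of physical cores and hyper-threading is not used. MATLAB and Mathematica adopt the same strategy.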
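As a rough illustration of how the thread cap can be overridden: common BLAS backends read their thread count from environment variables, which have to be set before NumPy is imported. The sketch below lists the variables for OpenMP, OpenBLAS, Intel MKL and Apple Accelerate; which of them a given NumPy build actually honors depends on the library it is linked against, and the helper names `thread_env` and `apply_thread_env` are made up for this example.

```python
import os

# Environment variables read by common BLAS / OpenMP runtimes.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",         # OpenMP runtimes
    "OPENBLAS_NUM_THREADS",    # OpenBLAS
    "MKL_NUM_THREADS",         # Intel MKL
    "VECLIB_MAXIMUM_THREADS",  # Apple Accelerate (vecLib)
)


def thread_env(num_threads):
    """Build the variable -> value mapping that requests `num_threads` threads."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    return {name: str(num_threads) for name in THREAD_ENV_VARS}


def apply_thread_env(num_threads, environ=os.environ):
    """Write the settings into `environ`; do this before `import numpy`."""
    environ.update(thread_env(num_threads))
```

For example, calling `apply_thread_env(os.cpu_count())` at the top of a script requests one thread per logical CPU (`os.cpu_count()` counts hyper-threads as well). Whether that is faster than one thread per physical core has to be measured on the machine in question.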

0987363

Apr 23, 2021

@YRInc On Debian it can use hyper-threading; this is a 2630L v4
Dotted two 4096x4096 matrices in 0.91 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 0.90 s.
Cholesky decomposition of a 2048x2048 matrix in 0.18 s.
Eigendecomposition of a 2048x2048 matrix in 12.48 s.

YRInc

Apr 23, 2021 via iPhone

@0987363 Nice. The Xeon does fall a bit short by comparison.

YRInc

Apr 23, 2021 via iPhone

@0987363 About even with the 24-core Kunpeng; each wins some items.

Deepseafish

Apr 23, 2021

Not run on an idle machine; the first two items fluctuate quite a bit
E5-2680 v4
Dotted two 4096x4096 matrices in 0.33 s.
Dotted two vectors of length 524288 in 0.03 ms.
SVD of a 2048x1024 matrix in 0.34 s.
Cholesky decomposition of a 2048x2048 matrix in 0.10 s.
Eigendecomposition of a 2048x2048 matrix in 3.82 s.

Xeon(R) Platinum 8170
Dotted two 4096x4096 matrices in 0.77 s.
Dotted two vectors of length 524288 in 0.13 ms.
SVD of a 2048x1024 matrix in 0.48 s.
Cholesky decomposition of a 2048x2048 matrix in 0.30 s.
Eigendecomposition of a 2048x2048 matrix in 5.94 s.

E5-2690 v4
Dotted two 4096x4096 matrices in 0.93 s.
Dotted two vectors of length 524288 in 0.14 ms.
SVD of a 2048x1024 matrix in 1.60 s.
Cholesky decomposition of a 2048x2048 matrix in 0.13 s.
Eigendecomposition of a 2048x2048 matrix in 6.90 s.

yanwen

Apr 23, 2021

It's on Huawei Cloud... so the performance takes a hit.

secondwtq

Apr 23, 2021 via iPhone

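Comparing the 12-core and 24-core Kunpeng results, it looks like the algorithms other than matrix multiplication can't really use multiple cores "effectively".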
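This observation can be put into numbers from the two Kunpeng columns of the table. The sketch below computes the speedup from 12 to 24 cores and the resulting scaling efficiency (1.0 would be perfect scaling); the names `KUNPENG_TIMES` and `scaling` are made up for illustration. Since both runs were on cloud instances, this is not a controlled comparison.

```python
# (time on 12 cores, time on 24 cores), copied from the table in the opening post.
# Units: seconds, except the dot product, which is in milliseconds.
KUNPENG_TIMES = {
    "matmul 4096x4096": (1.48, 0.76),
    "dot 524288 (ms)": (0.49, 0.48),
    "SVD 2048x1024": (1.10, 0.93),
    "Cholesky 2048x2048": (0.14, 0.13),
    "eig 2048x2048": (8.36, 7.66),
}


def scaling(t_small, t_large, core_ratio=2.0):
    """Return (speedup, efficiency) when the core count grows by `core_ratio`."""
    speedup = t_small / t_large
    return speedup, speedup / core_ratio


if __name__ == "__main__":
    for task, (t12, t24) in KUNPENG_TIMES.items():
        speedup, efficiency = scaling(t12, t24)
        print(f"{task}: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Only the matrix multiplication comes close to doubling in speed; the other four items gain well under 20% from the extra 12 cores.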

alphatoad

Apr 23, 2021

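The i9 is still strong. Is it using AVX?
But considering the M1 is only a low-power first attempt, I'm optimistic about the follow-up product line.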

YRInc

Apr 23, 2021 via iPhone

@secondwtq Yes, probably not every operation can be parallelized. It also depends on how the underlying acceleration library is implemented.

YRInc

Apr 23, 2021 via iPhone

@y
@alphatoad Yes, it's AVX, while Arm uses Neon. With such low power draw and solid performance, the M1's future looks really promising.

jr55475f112iz2tu

Apr 23, 2021 via Android

Kunpeng is a server CPU
M1 is a consumer CPU
I don't see what there is to compare

YRInc

Apr 23, 2021 via iPhone

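@czfy Well, doesn't that just show that even at server class, with this many cores and this much power draw, a gap still remains and there is plenty of room for improvement?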

dayeye2006199

Apr 24, 2021

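Has there been any technical progress in numerical computing on Arm? Can the underlying libraries even out the performance differences caused by the instruction sets? Would appreciate an explanation.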

YRInc

Apr 24, 2021 via iPhone

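@dayeye2006199 All I know is that the next-generation Armv9 updates the SIMD instruction set with SVE2. Numerical computing capability should keep getting stronger.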

dabaibai

Apr 24, 2021

Was a dedicated server used on Huawei Cloud? If it's a cloud VM, the results have no reference value at all.

datou

Apr 25, 2021

Dotted two 4096x4096 matrices in 0.67 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 0.96 s.
Cholesky decomposition of a 2048x2048 matrix in 0.28 s.
Eigendecomposition of a 2048x2048 matrix in 6.69 s.

3500X win10

neosfung

Apr 25, 2021

I tested a 2019 16-inch MacBook and a server, for reference

Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz * 2
Dotted two 4096x4096 matrices in 1.45 s.
Dotted two vectors of length 524288 in 0.17 ms.
SVD of a 2048x1024 matrix in 0.90 s.
Cholesky decomposition of a 2048x2048 matrix in 0.20 s.
Eigendecomposition of a 2048x2048 matrix in 8.83 s.

Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Dotted two 4096x4096 matrices in 0.73 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.52 s.
Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
Eigendecomposition of a 2048x2048 matrix in 4.89 s.
