PTX and Threads Scheduler

Assignment 1. Analyze the GPU PTX, SSE assembly (or NEON assembly), and CPU assembly instruction queues (and, optionally, Cambricon [1]) for matrix operations such as matrix (vector) addition and multiplication.

Explain why the GPU is faster at matrix operations (and, optionally, why Cambricon is more efficient than a GPU for DNN computations).
Identify which instructions load data and which perform SIMD operations, and compare them with traditional x86 instructions (with x86 scalar instructions, a matrix operation is always organized as a loop).
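
As a rough way to frame the comparison asked for above, the sketch below counts "instructions" for a vector addition done one element at a time versus four elements at a time. It is only a counting model, not real assembly: the names `add_scalar`, `add_simd` and `LANES` are made up for illustration, loop-control instructions are ignored, and the width of 4 assumes 32-bit floats in a 128-bit SSE or NEON register. In real code the scalar loop body would correspond to instructions such as `movss`/`addss`, the SSE version to `movups`/`addps`, and a PTX kernel to `ld.global.f32`/`add.f32`/`st.global.f32` executed by every thread of a warp at once.

```python
LANES = 4  # assumed SIMD width: four 32-bit floats per 128-bit register


def add_scalar(a, b):
    """Model of a scalar loop: each iteration handles ONE element."""
    out, issued = [], 0
    for x, y in zip(a, b):
        out.append(x + y)  # load a[i], load b[i], add, store c[i]
        issued += 4        # four instructions per element
    return out, issued


def add_simd(a, b, lanes=LANES):
    """Model of a SIMD loop: each iteration handles `lanes` elements."""
    out, issued = [], 0
    for i in range(0, len(a), lanes):
        va, vb = a[i:i + lanes], b[i:i + lanes]     # two vector loads
        out.extend(x + y for x, y in zip(va, vb))   # one vector add, one vector store
        issued += 4        # still four instructions, but per group of lanes
    return out, issued


if __name__ == "__main__":
    a = [float(i) for i in range(16)]
    b = [1.0] * 16
    print(add_scalar(a, b)[1], add_simd(a, b)[1])
```

Both functions produce the same result; only the number of modeled instructions differs, which is the point to look for when reading the actual PTX, SSE/NEON and scalar x86 listings.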

Assignment 2. Study the thread scheduler of a GPGPU by analyzing the warp scheduler.

Read the relevant warp-scheduler code in GPGPU-Sim and find where the scoreboarding algorithm is implemented. Chart the algorithm and the warp controller structure (that is, draw a flow diagram and a structure diagram).
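
Before reading the simulator source, it may help to have a toy model of what a scoreboard does inside a warp scheduler. The sketch below is NOT GPGPU-Sim code; it is a simplified model under stated assumptions: one instruction issued per cycle, a loose round-robin policy, ALU results available the next cycle, a fixed memory latency, and a made-up program in which each load into `r0` is followed by a few ALU operations that read `r0`. The function name `simulate` and all parameters are hypothetical.

```python
def simulate(num_warps, iters=8, mem_latency=20, alu_per_load=3):
    """Toy single-issue warp scheduler; returns (total_cycles, issue_utilization)."""
    # Every warp runs the same program: a load into r0, then ALU ops reading r0.
    program = []
    for _ in range(iters):
        program.append(("ld", "r0", ()))
        program += [("alu", "r1", ("r0",))] * alu_per_load

    pc = [0] * num_warps  # next instruction index of each warp
    pending = {}          # scoreboard: (warp, reg) -> cycle when the write completes
    cycle = issued = last = 0

    while any(p < len(program) for p in pc):
        # Release scoreboard entries whose results have arrived.
        pending = {k: t for k, t in pending.items() if t > cycle}
        # Round-robin: start searching just after the last warp that issued.
        for k in range(num_warps):
            w = (last + 1 + k) % num_warps
            if pc[w] == len(program):
                continue  # this warp has finished
            op, dst, srcs = program[pc[w]]
            # Scoreboard check: stall if any register used is still pending.
            if any((w, r) in pending for r in (dst,) + srcs):
                continue
            pending[(w, dst)] = cycle + (mem_latency if op == "ld" else 1)
            pc[w] += 1
            issued += 1
            last = w
            break  # at most one instruction issued per cycle
        cycle += 1
    return cycle, issued / cycle


if __name__ == "__main__":
    for n in (1, 8, 32):
        cycles, util = simulate(n)
        print(f"warps={n:2d} cycles={cycles:5d} issue utilization={util:.2f}")
```

With a single warp the scheduler sits idle while each load is outstanding; with enough warps there is almost always some other warp whose registers are ready, so the same memory latency stops showing up as idle issue cycles. That is the effect the performance comparison in this assignment is meant to expose, and the loop body (release, search, check, reserve, issue) is one possible outline for the flow diagram.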
Illustrate the performance depending on whether the memory access latency is hidden by the warp scheduler (the key to this problem is simply to construct enough operations for the scheduler).

Note: A LaTeX template has been uploaded to overleaf.com at https://www.overleaf.com/read/vkyjvtnzrczh

Reference

[1] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An Instruction Set Architecture for Neural Networks,” presented at the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 393–405.


Helping a friend find a classmate who knows this area to work through four GPU-related problems
