Ubuntu + dual NVIDIA GPUs, local DS-R1 deployment: asking for performance tuning advice

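Ubuntu + dual A6000 + R1-q4-70b model

GPU load and output speed are shown in the image below.

Does anyone have suggestions for tuning the environment configuration? Also, would switching to llama.cpp give a significant improvement?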

maskerTUI

164 days ago

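ollama is essentially still calling llama.cpp under the hood. To get a real improvement you need to switch to a different backend inference engine, such as vLLM.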
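For anyone who wants to try the vLLM route on a two-GPU box, here is a minimal sketch that assembles a `vllm serve` command with tensor parallelism across both cards. The model path is a hypothetical placeholder and the `0.90` memory fraction is just an assumed starting value, not something from this thread; pick a quantized checkpoint that actually fits in your VRAM.

```python
import shlex


def build_vllm_command(model, num_gpus=2, gpu_memory_utilization=0.90):
    """Build an argument list for `vllm serve` that shards the model over num_gpus."""
    return [
        "vllm", "serve", model,
        # Split each layer's weights across the GPUs (tensor parallelism).
        "--tensor-parallel-size", str(num_gpus),
        # Fraction of each GPU's memory vLLM may claim (assumed value, tune it).
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


if __name__ == "__main__":
    # Hypothetical local path; replace with your own model directory or repo id.
    cmd = build_vllm_command("/path/to/quantized-r1-distill-70b")
    print(shlex.join(cmd))
```

Running the script only prints the command, so you can review it before launching the server yourself.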

crac

164 days ago

@maskerTUI A question: in your experience, roughly how much improvement could I expect in this setup if I switched to vLLM?

Chihaya0824

164 days ago

R1-Llama-70B-Distill-Q5KM-GGUF
vLLM
Single request (similar to ollama):
Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
Two GPUs, concurrent (12 requests in parallel):
Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 98.7 tokens/s, Running: 12 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.7%, CPU KV cache usage: 0.0%.
So roughly 3 to 4 times the total throughput.
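To make the comparison concrete, here is a small sketch that pulls the generation throughput and request count out of the two log lines above and works out the ratio. The log strings are shortened copies of the ones quoted in this post; the regex and variable names are my own and only assume that log format.

```python
import re

# Shortened copies of the two vLLM log lines quoted above.
LOG_SINGLE = "Avg generation throughput: 28.3 tokens/s, Running: 1 reqs"
LOG_BATCH = "Avg generation throughput: 98.7 tokens/s, Running: 12 reqs"

PATTERN = re.compile(
    r"Avg generation throughput: ([\d.]+) tokens/s, Running: (\d+) reqs"
)


def parse(line):
    """Return (tokens_per_second, running_requests) from one log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognized log line: " + line)
    return float(match.group(1)), int(match.group(2))


single_tps, single_reqs = parse(LOG_SINGLE)
batch_tps, batch_reqs = parse(LOG_BATCH)

# Aggregate speedup from batching 12 requests versus a single request.
speedup = batch_tps / single_tps
# Average speed seen by each individual request inside the batch.
per_request_tps = batch_tps / batch_reqs

print(f"aggregate speedup: {speedup:.2f}x")
print(f"per-request speed in batch: {per_request_tps:.1f} tokens/s")
```

Note that the gain is in total throughput: each of the 12 concurrent requests on its own runs slower than the single-request case, so a single-user chat would not see this multiplier.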

maskerTUI

164 days ago

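@crac In real-world use: I tested deepseek-r1:32b on identical hardware at my company. ollama tops out at about 30 characters per second, while vLLM tops out at about 60 characters per second. In practice that is a big improvement.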

crac

164 days ago

@maskerTUI Thanks, I'll look into it.

crac

160 days ago

@Chihaya0824 Reporting back: after switching to vLLM, output speed doubled.


https://www.v2ex.com/t/1121661
