[分享] 用腾讯开源的混元翻译模型 1.8B 给翻译插件当本地 API

This topic created in 45 days ago, the information mentioned may be changed or developed.

最近在找本地离线翻译大模型，测试了腾讯开源的 **混元翻译模型 1.8B **。

我用了几篇不同的技术文章进行深度对比，它的翻译质量明显高于 Google 翻译和微软翻译，术语和语序都更符合中文习惯，1.8B 的体量能有这个效果让人非常惊喜。

这里分享一下我的部署和启动参数优化经验。

1. 模型下载

建议下载 GGUF 格式，方便用 llama.cpp / llama-server 直接跑：

Hugging Face 地址： tencent/HY-MT1.5-1.8B-GGUF

2. 启动与优化指令

如果你使用的是 RTX 3060 6GB 显卡，可以使用我优化后的 llama-server 启动命令。

这里开启了 --flash-attn 以及 KV 缓存量化（q8_0），基本可以把模型完全塞进显存，速度飞快：

llama-server -hf tencent/Hy-MT2-1.8B-GGUF:Q8_0 \
  -c 8192 \
  --port 8080 \
  -ngl 99 \
  --flash-attn on \
  -t 6 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --parallel 1 \
  --jinja \
  --n-predict -1 \
  --verbosity 1

3. PowerShell 测试指令

服务启动后，兼容 OpenAI 的 API 格式。在 Windows 下可以用以下 PowerShell 命令直接测试：

Invoke-RestMethod -Method Post -Uri "[http://127.0.0.1:8080/v1/chat/completions]( http://127.0.0.1:8080/v1/chat/completions)" `
  -ContentType "application/json" `
  -Body (@{
    model = "gpt-3.5-turbo"
    messages = @(
        @{role = "user"; content = "Translate to Chinese: Comparing Rust and C++ performance is a topic that all software developers should consider. In this guide, they are compared in terms of memory safety, concurrency models, and compilation performance. You will understand why C++ provides the best performance, and Rust does not compromise on safety as a trade-off. Simple differentiation and real-life examples will help you be prepared to make the correct choice of the right tool 。"}
    )
    stream = $false
  } | ConvertTo-Json)

4. 运行结果与性能 (RTX 3060 6GB)

在我的 3060 上，生成的 Token 速度非常理想，完全能喂饱翻译插件的并发需求：

choices            : {@{finish_reason=stop; index=0; message=}}
created            : 1779531650
model              : tencent/Hy-MT2-1.8B-GGUF:Q8_0
system_fingerprint : b9294-0f3cb3fc8
object             : chat.completion
usage              : @{completion_tokens=68; prompt_tokens=88; total_tokens=156; prompt_tokens_details=}
id                 : chatcmpl-5XADKRfaVh7iZ1Iva7bt1P1oRJkQBr5Q

# 性能耗时指标：
timings            : @{
    cache_n=0; 
    prompt_n=88; 
    prompt_ms=312.969; 
    prompt_per_token_ms=3.556; 
    prompt_per_second=281.178; 
    predicted_n=68; 
    predicted_ms=555.212; 
    predicted_per_token_ms=8.164; 
    predicted_per_second=122.475
}