LLM quantization question: why not quantize lm_head?

daweii · 42 days ago

I've been taking a course on model quantization recently.

When quantizing the model below, the instructor recommends not quantizing the final lm_head:

CodeGenForCausalLM(
  (transformer): CodeGenModel(
    (wte): Embedding(51200, 1024)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-19): 20 x CodeGenBlock(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): CodeGenAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (qkv_proj): W8A16LinearLayer()
          (out_proj): W8A16LinearLayer()
        )
        (mlp): CodeGenMLP(
          (fc_in): W8A16LinearLayer()
          (fc_out): W8A16LinearLayer()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=51200, bias=True)
)
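
For context, here is my rough sketch of the selective replacement step, i.e. swapping every nn.Linear except lm_head. The helper name and the assumption that the replacement class takes the same constructor arguments as nn.Linear are mine, not the course's actual code:

import torch.nn as nn

# Hypothetical helper: recursively swap nn.Linear children for a quantized
# linear class, skipping any module whose local name is in `exclude`
# (e.g. "lm_head"). The course's W8A16LinearLayer would be passed as
# target_cls; its real constructor signature may differ from the
# nn.Linear-like one assumed here.
def replace_linear_except(module, target_cls, exclude=("lm_head",)):
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name not in exclude:
            setattr(module, name, target_cls(child.in_features,
                                             child.out_features,
                                             child.bias is not None))
        else:
            replace_linear_except(child, target_cls, exclude)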

His exact words from the transcript:


2:14 And as I said, we're not going to quantize the language model head,
2:18 because since the model is an autoregressive model, it uses
2:22 the output from the previous iteration to get the output of the next iteration.
2:27 If you quantize the language model head, a lot of errors might
2:31 be accumulating over the generation steps.
2:34 And you will most likely end up having some gibberish after some tokens.
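
As I understand it, the loop he's referring to is just greedy decoding, where each step's lm_head logits pick the token that gets fed back in. A minimal sketch I wrote (Hugging Face's generate() does the equivalent when sampling is disabled):

import torch

# Sketch of the autoregressive feedback loop: the lm_head logits at step t
# choose the token that becomes part of the input at step t+1, so every
# later step sees the consequences of earlier predictions.
@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens=20):
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits              # (batch, seq_len, vocab), from lm_head
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids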

I don't follow his reasoning. Why would quantizing lm_head cause errors to accumulate? Could someone explain it in a simple, intuitive way?

Course page: https://learn.deeplearning.ai/courses/quantization-in-depth/lesson/12/quantize-any-open-source-pytorch-model
