Running LLaMA on Apple M1

2023-03-13 11:27:28 +08:00
 charslee013


TL;DR

#!/usr/bin/env bash

# clone repo and install dependencies
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
python -m pip install torch numpy sentencepiece

# download 7B model
mkdir -p models/7B/
wget -P models/7B/ https://huggingface.co/nyanko7/LLaMA-7B/resolve/main/consolidated.00.pth
wget -P models/7B/ https://huggingface.co/nyanko7/LLaMA-7B/raw/main/params.json
wget -P models/7B/ https://huggingface.co/nyanko7/LLaMA-7B/raw/main/checklist.chk
wget -P models/ https://huggingface.co/nyanko7/LLaMA-7B/resolve/main/tokenizer.model

# converts the model to "ggml FP16 format"
python convert-pth-to-ggml.py models/7B/ 1
# quantizes the model to 4-bits
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

# enjoy
./main -m ./models/7B/ggml-model-q4_0.bin \
  -t 8 \
  -n 128 \
  -p 'I Have a Dream'

Setting up dependencies


Choosing a model

The LLaMA models known so far come in four sizes: 7B, 13B, 30B, and 65B.

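After 4-bit quantization, each model file takes roughly 4 GB of memory (the 13B model used below consists of two such files), so pick a model that fits your machine's RAM.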
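The sizing rule above can be sketched as a quick check. This is a minimal sketch, assuming about 4 GB per quantized model file; the helper names `quantized_size_gb` and `fits_in_ram` and the 2 GB headroom are hypothetical choices for illustration, not something from llama.cpp.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rough memory needed by a 4-bit quantized model,
# assuming about 4 GB per model file (the figure quoted in this post).
quantized_size_gb() (
  parts=$1
  echo $(( parts * 4 ))
)

# Succeeds when the model fits in RAM with some headroom left for the OS.
# The 2 GB headroom is an arbitrary assumption, not a measured value.
fits_in_ram() (
  ram_gb=$1
  parts=$2
  need=$(( $(quantized_size_gb "$parts") + 2 ))
  [ "$ram_gb" -ge "$need" ]
)

# Example: a 16 GB machine and a model split into two files.
if fits_in_ram 16 2; then
  echo "2 files (~$(quantized_size_gb 2) GB) should fit in 16 GB"
fi
```

On macOS, `sysctl -n hw.memsize` prints the installed RAM in bytes, which can be divided down to gigabytes before calling the check.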

Download links

Meta has not published hashes for the model files, so decide for yourself whether to run them. The leaked copies known so far are:

Someone "accidentally" submitted a magnet link for the models to the official repository:

magnet:?xt=urn:btih:ZXXDAUWYLRUXXBHUYEMS6Q5CE5WA3LVA&dn=LLaMA

A repository found through the new Bing; it appears to download through the author's own API endpoint:

curl -o- https://raw.githubusercontent.com/shawwn/llama-dl/56f50b96072f42fb2520b1ad5a1d6ef30351f23c/llama.sh | bash

Or via a magnet link:

magnet:?xt=urn:btih:b8287ebfa04f879b048d4d4404108cf3e8014352&dn=LLaMA&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337%2fannounce

As direct downloads, only the 7B and 65B models have been found so far:

https://huggingface.co/nyanko7/LLaMA-7B/tree/main

https://huggingface.co/datasets/nyanko7/LLaMA-65B/tree/main

Software/hardware requirements

The author's machine is an Apple M1 with 8 cores and 16 GB of RAM.

The OS is macOS 12.5.1.

The clang version is:

❯ c++ -v
Apple clang version 14.0.0 (clang-1400.0.29.102)
Target: arm64-apple-darwin21.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Python

This walkthrough currently uses Python 3.10.

If you do not have that Python version, create a virtual environment with pipenv or conda:

pipenv shell --python 3.10

or

conda create -n llama python=3.10
conda activate llama
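To confirm that the active environment really is on 3.10 before installing anything, a small check like the following can help. It is only a sketch; `is_python_310` is a hypothetical helper that inspects a version string and does not call Python itself.

```shell
#!/usr/bin/env bash
# Succeeds when a "major.minor[.patch]" version string is 3.10 or 3.10.x.
is_python_310() {
  case "$1" in
    3.10|3.10.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Demo with literal strings, so the snippet runs without Python installed.
is_python_310 "3.10.9" && echo "3.10.9 is fine"
is_python_310 "3.11.2" || echo "3.11.2 is a different minor version"
```

In a real shell the version string can come from `python -c 'import platform; print(platform.python_version())'`.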

Install the dependencies

pip install torch numpy sentencepiece

Running the model

Clone the project

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Build the main and quantize binaries

make 

Make sure the model has been downloaded into the matching folder.

The steps below use the 7B model as the example.

ls ./models
7B
tokenizer.model
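Listing the folder shows only the top level, so a slightly stricter check can be sketched as below. The file names come from the download commands in the TL;DR; `check_model_files` is a hypothetical helper, and the demo builds a temporary directory instead of touching a real `models` folder.

```shell
#!/usr/bin/env bash
# Report any file the 7B walkthrough expects but cannot find.
# Returns non-zero when something is missing.
check_model_files() {
  models_dir=$1
  missing=0
  for f in 7B/consolidated.00.pth 7B/params.json tokenizer.model; do
    if [ ! -f "$models_dir/$f" ]; then
      echo "missing: $models_dir/$f"
      missing=1
    fi
  done
  return "$missing"
}

# Demo against a temporary directory that mimics the expected layout.
demo=$(mktemp -d)
mkdir -p "$demo/7B"
touch "$demo/7B/consolidated.00.pth" "$demo/7B/params.json" "$demo/tokenizer.model"
check_model_files "$demo" && echo "all files present"
rm -rf "$demo"
```

From inside the llama.cpp checkout the call would be `check_model_files ./models`.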

Convert the model to ggml FP16 format

python convert-pth-to-ggml.py models/7B/ 1

This step produces a 13 GB file, models/7B/ggml-model-f16.bin.
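The 13 GB figure is consistent with simple arithmetic: FP16 stores each weight in 2 bytes. The sketch below assumes the 7B model has about 6.7 billion parameters, a number taken from the published LLaMA model sizes rather than from this post; `fp16_size_gb` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# FP16 uses 2 bytes per weight, so the size in GB is roughly
# (parameters in billions) * 2. File headers and vocabulary data are ignored.
fp16_size_gb() {
  LC_ALL=C awk -v p="$1" 'BEGIN { printf "%.1f\n", p * 2 }'
}

# Assumed parameter count for the 7B model, in billions.
fp16_size_gb 6.7
```

The result is in decimal gigabytes, so it lands slightly above what a tool reporting binary units would show.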

Next, quantize the model to 4 bits

./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

If your model is split into several files, quantize each file separately.

For example, the two model files of 13B:

./quantize ./models/13B/ggml-model-f16.bin   ./models/13B/ggml-model-q4_0.bin 2
./quantize ./models/13B/ggml-model-f16.bin.1 ./models/13B/ggml-model-q4_0.bin.1 2
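For models with more files, the same pattern can be generated with a loop. This is a sketch under the assumption that every FP16 file follows the `ggml-model-f16.bin[.N]` naming shown above; `q4_name` and `quantize_all_dry_run` are hypothetical helpers, and the loop only prints the commands.

```shell
#!/usr/bin/env bash
# Map an FP16 file name to its 4-bit counterpart, keeping any ".N" suffix.
q4_name() {
  printf '%s\n' "$1" | sed 's/ggml-model-f16\.bin/ggml-model-q4_0.bin/'
}

# Print (rather than run) one quantize command per FP16 file in a directory.
# Drop the "echo" to execute the commands for real.
quantize_all_dry_run() {
  for part in "$1"/ggml-model-f16.bin*; do
    [ -e "$part" ] || continue   # skip when the glob matched nothing
    echo ./quantize "$part" "$(q4_name "$part")" 2
  done
}

quantize_all_dry_run ./models/13B
```

Printing first makes it easy to review the file pairs before spending time on the actual quantization.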

Time to enjoy the AI

The author ran the 13B model. -t sets the number of threads, -n the number of tokens to generate, and -p the prompt.

❯ ./main -m models/13B/ggml-model-q4_0.bin -t 8 -n 409600 -p 'I Have a Dream'
main: seed = 1678677633
llama_model_load: loading model from 'models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size =   800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from 'models/13B/ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363
llama_model_load: loading model part 2/2 from 'models/13B/ggml-model-q4_0.bin.1'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363

main: prompt: 'I Have a Dream'
main: number of tokens in prompt = 5
     1 -> ''
 29902 -> 'I'
  6975 -> ' Have'
   263 -> ' a'
 16814 -> ' Dream'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


I Have a Dream: A Handbook for Teachers and Students on Martin Luther King, Jr.
Culture is always changing and being influenced by the people around us who we can observe. Ways of thinking about culture are more important than which one you believe in because it could be dangerous if your way off believing in something that isn’t true but also that means there will be changes over time so everyone should learn these things when they start school
Added: Sun, April 29th 2018 [end of text]


main: mem per token = 22439492 bytes
main:     load time =  4974.55 ms
main:   sample time =   300.81 ms
main:  predict time = 90728.84 ms / 824.81 ms per token
main:    total time = 98585.49 ms
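The timing summary is easier to read as throughput. The sketch below converts the "ms per token" figure into tokens per second and estimates how many tokens were predicted; both helper names are hypothetical, and the inputs are copied from the log above.

```shell
#!/usr/bin/env bash
# Convert "ms per token" from the timing summary into tokens per second.
tokens_per_second() {
  LC_ALL=C awk -v ms="$1" 'BEGIN { printf "%.2f\n", 1000 / ms }'
}

# Approximate number of predicted tokens: predict time / time per token.
predicted_tokens() {
  LC_ALL=C awk -v total="$1" -v per="$2" 'BEGIN { printf "%.0f\n", total / per }'
}

tokens_per_second 824.81          # 13B, 4-bit, 8 threads on the author's M1
predicted_tokens 90728.84 824.81
```

For this run that works out to a little over one token per second.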

References

Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp

ggerganov/llama.cpp

1 reply
NealLason
2023-03-21 14:09:30 +08:00
The 7B model's Chinese support is hopelessly dumb..

Original thread: https://www.v2ex.com/t/923536
