Large models such as DeepSeek and Qwen routinely have tens of billions of parameters, and storing and running them at full precision (FP32) consumes huge amounts of GPU memory while slowing inference. Model quantization compresses floating-point values into low-precision integers, slimming a model down to 1/4 of its original size or less and significantly improving inference efficiency. For example, a 175B-parameter model needs about 700GB of memory in FP32, but only about 87.5GB when quantized to INT4.
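A quick back-of-the-envelope check of those numbers (plain arithmetic, nothing model-specific; the only inputs are bytes per weight):

params = 175e9                  # 175B parameters
fp32_gb = params * 4 / 1e9      # 4 bytes per FP32 weight   -> 700.0 GB
int4_gb = params * 0.5 / 1e9    # 0.5 bytes per INT4 weight -> 87.5 GB
print(fp32_gb, int4_gb)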
1. Concept Overview
Quantization maps a floating-point range [min, max] onto an integer grid using two parameters: a scale and a zero point. Suppose that, for one layer, the weights span min = -1.2, max = 0.8 and the activations span min = 0.1, max = 5.6.

scale = (max - min) / (2^n - 1), where n is the quantization bit width (for INT8, n = 8 and 2^8 - 1 = 255)
zero_point = round(-min / scale), the integer code that the float value 0 maps to; this choice sends min to code 0, so every quantized value lands inside [0, 2^n - 1] and nothing spills into negative codes

For the weights above under INT8:
scale = (0.8 - (-1.2)) / 255 ≈ 0.00784
zero_point = round(-(-1.2) / 0.00784) ≈ 153

Quantize: q = round(x / scale) + zero_point (maps a float x to an integer q)
Dequantize: x' = (q - zero_point) * scale (recovers an approximate float x' from q)

Compared with FP32, INT8 compresses the model 4x and INT4 compresses it 8x: the "elephant" becomes an "ant".
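The formulas above are easy to sanity-check in a few lines of PyTorch. This is a minimal sketch; the tensor values simply fill out the example range and are not data from the article:

import torch

# Example weight tensor spanning roughly [-1.2, 0.8]
x = torch.tensor([-1.2, -0.5, 0.0, 0.3, 0.8])

n = 8                                   # quantization bit width (INT8)
qmin, qmax = 0, 2 ** n - 1              # integer range 0..255

scale = (x.max() - x.min()) / (qmax - qmin)        # ≈ 0.00784
zero_point = int(torch.round(-x.min() / scale))    # ≈ 153

q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)  # quantize
x_hat = (q - zero_point) * scale                                  # dequantize

print(q)      # integer codes in [0, 255]
print(x_hat)  # reconstruction, off from x by roughly scale/2 at most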
2. Technical Implementation
The simplest route is dynamic quantization: weights are converted to INT8 ahead of time, while activations are quantized on the fly during inference.

import torch
from torch.quantization import quantize_dynamic

# Load the pretrained FP32 model
model = torch.load('model.pth')
model.eval()

# Dynamic quantization (quantizes Linear and LSTM layers)
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear, torch.nn.LSTM},  # layer types to quantize
    dtype=torch.qint8
)
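A quick sanity check on the result: printing the model shows the Linear/LSTM layers replaced by their dynamically quantized counterparts, and comparing checkpoints gives a rough feel for the size reduction. The file names below are placeholders, not from the article:

import os

print(quantized_model)  # Linear/LSTM layers now appear as dynamically quantized modules

# Rough on-disk size comparison (placeholder file names)
torch.save(model.state_dict(), 'fp32.pth')
torch.save(quantized_model.state_dict(), 'int8_dynamic.pth')
print(os.path.getsize('fp32.pth') / 1e6, 'MB vs', os.path.getsize('int8_dynamic.pth') / 1e6, 'MB')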
Post-training static quantization goes further: both weights and activations are quantized ahead of time, and a calibration pass over representative data is used to measure the activation ranges.

from torch.quantization import prepare, convert

# Run a calibration dataset through the model to record activation ranges
def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for inputs in data_loader:
            model(inputs)

# Configure the quantization scheme
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model_prepared = prepare(model)               # insert Observer nodes
calibrate(model_prepared, data_loader)        # calibrate activation ranges
quantized_model = convert(model_prepared)     # convert to the quantized model
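One caveat worth noting: the eager-mode static flow above assumes the model already routes its inputs and outputs through QuantStub/DeQuantStub. A minimal sketch of such a model follows; the layer sizes are arbitrary:

import torch.nn as nn
from torch.quantization import QuantStub, DeQuantStub

class QuantReadyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # marks where float inputs become INT8
        self.fc = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()  # marks where INT8 outputs go back to float

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc(x))
        return self.dequant(x)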
Quantization-aware training (QAT) inserts fake-quantization nodes and fine-tunes the model, letting it learn to compensate for the quantization error.

from torch.quantization import prepare_qat

# Define the QAT model (the model must be in training mode before prepare_qat)
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model.train()
model_prepared = prepare_qat(model)  # insert fake-quantization nodes

# Training loop (simulates quantization error in the forward/backward pass)
optimizer = torch.optim.SGD(model_prepared.parameters(), lr=0.01)
for inputs, labels in train_loader:
    optimizer.zero_grad()
    outputs = model_prepared(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

# Convert to the final quantized model
model_prepared.eval()
quantized_model = convert(model_prepared)
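After conversion, the INT8 model is called like any other module. The snippet below assumes the network wraps its forward in QuantStub/DeQuantStub (as in the static example) and uses a placeholder input shape:

quantized_model.eval()
with torch.no_grad():
    sample = torch.randn(1, 128)        # placeholder input shape
    prediction = quantized_model(sample)
print(prediction.shape)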