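[GiantPandaCV Introduction] This article explains the principles of quantization-aware training, implements a quantization-aware training Demo based on OneFlow, and walks through the details of that implementation. It is intended for readers who want to learn quantization-aware training, and is shared for study and discussion only.

0x0. Preface

This article explains the principles of quantization-aware training and implements a Demo-level, manual version of it based on OneFlow.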

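0x1. Principles of Post-Training Quantization and Quantization-Aware Training

Quantization here generally refers to Google's TFLite quantization scheme, described in Google's paper Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. The TFLite scheme is not especially difficult, but it involves many practical details that are hard to cover briefly.

I therefore recommend the following series of articles on the principles of TFLite post-training quantization and quantization-aware training. After reading them, this article should be easy to follow.

Introduction to Neural Network Quantization -- Basic Principles
Introduction to Neural Network Quantization -- Post-Training Quantization
Introduction to Neural Network Quantization -- Quantization-Aware Training
Introduction to Neural Network Quantization -- Folding BN ReLU Code Implementation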

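To summarize briefly: both the TFLite quantization scheme and TensorRT's post-training quantization scheme compute a scaling factor scale and a zero point zero_point from the value ranges of the original data and the quantized data. zero_point may be 0 (symmetric quantization) or nonzero (asymmetric quantization). The original data is divided by scale and then shifted by zero_point to obtain the quantized data. The key question is how to compute scale and zero_point. Google's TFLite uses the following formulas:

$$r = S(q - Z), \qquad q = \mathrm{round}\left(\frac{r}{S} + Z\right)$$

$$S = \frac{r_{max} - r_{min}}{q_{max} - q_{min}}, \qquad Z = \mathrm{round}\left(q_{max} - \frac{r_{max}}{S}\right)$$

Here $r$ is the floating-point real value and $q$ is the quantized fixed-point integer. $r_{max}$ and $r_{min}$ are the maximum and minimum of $r$, and $q_{max}$ and $q_{min}$ are the maximum and minimum of $q$. For signed 8-bit quantization, $q_{min} = -128$ and $q_{max} = 127$; for unsigned 8-bit quantization, $q_{min} = 0$ and $q_{max} = 255$. $S$ is scale and $Z$ is zero_point.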

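Computing scale and zero_point therefore comes down to accurately estimating the maximum and minimum of the original floating-point values; once these are known, they can be substituted into the formulas above. Both post-training quantization and quantization-aware training thus aim to record the scale and zero_point of every activation feature map and weight parameter.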
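The mapping between a value range and (scale, zero_point) can be sketched in a few lines of NumPy. This is a minimal illustration of the TFLite-style affine formulas for unsigned 8-bit quantization, not OneFlow's implementation; the helper names compute_scale_zero_point, quantize and dequantize are hypothetical.

```python
import numpy as np

def compute_scale_zero_point(r_min, r_max, q_min, q_max):
    """Return (S, Z) for the affine mapping r ~= S * (q - Z)."""
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(np.round(q_max - r_max / scale))
    return scale, zero_point

def quantize(r, scale, zero_point, q_min, q_max):
    """q = round(r / S + Z), clipped to the integer range."""
    q = np.round(r / scale + zero_point)
    return np.clip(q, q_min, q_max).astype(np.int32)

def dequantize(q, scale, zero_point):
    """r_hat = S * (q - Z)."""
    return scale * (q.astype(np.float32) - zero_point)

r = np.array([-1.0, -0.25, 0.0, 0.5, 2.0], dtype=np.float32)
# Unsigned 8-bit range: q_min = 0, q_max = 255.
scale, zero_point = compute_scale_zero_point(float(r.min()), float(r.max()), 0, 255)
q = quantize(r, scale, zero_point, 0, 255)
r_hat = dequantize(q, scale, zero_point)
print(scale, zero_point, q, r_hat)
```

The round trip loses at most about half a quantization step per element, which is why an accurate estimate of the min/max range matters so much.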

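In post-training quantization, the usual approach is to run inference on part of the validation set and record the maximum and minimum values of the activation feature maps and weight parameters along the way, from which scale and zero_point are computed. Quantization-aware training instead records these maxima and minima during training. The main difference between the two is that quantization-aware training applies a simulated quantization operation, FP32->INT8->FP32, to the activations and weights. This simulates how quantization actually runs and exposes the resulting quantization error to the network during training, so quantization-aware training generally achieves better accuracy than post-training quantization.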
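The calibration idea behind post-training quantization can be sketched as follows. This is a toy illustration under stated assumptions: the random batches stand in for activations recorded while running calibration data through a network, and calibrate_min_max is a hypothetical helper rather than part of any library.

```python
import numpy as np

def calibrate_min_max(batches):
    """Track the global min/max over a stream of activation batches."""
    running_min, running_max = np.inf, -np.inf
    for batch in batches:
        running_min = min(running_min, float(batch.min()))
        running_max = max(running_max, float(batch.max()))
    return running_min, running_max

rng = np.random.default_rng(0)
# Stand-in for activations recorded during calibration inference.
batches = [rng.normal(size=(4, 8)).astype(np.float32) for _ in range(10)]
act_min, act_max = calibrate_min_max(batches)

# Unsigned 8-bit affine parameters derived from the recorded range
# (q_min = 0, so the zero point reduces to round(-r_min / S)).
scale = (act_max - act_min) / 255.0
zero_point = int(round(-act_min / scale))
print(act_min, act_max, scale, zero_point)
```

Quantization-aware training gathers the same statistics, but inside the training loop, and additionally feeds the fake-quantized tensors forward.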

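0x2. Components

The previous section mentioned recording the scale and zero_point of activations and weights, as well as simulated quantization and quantization. These correspond to the three basic components used in quantization training: MinMaxObserver, FakeQuantization, and Quantization. Their OneFlow implementations are described in turn below.

Component 1. MinMaxObserver

As the documentation shows, the MinMaxObserver operation is wrapped as the Module oneflow.nn.MinMaxObserver. (Module corresponds to torch.nn.Module in Pytorch; OneFlow's interface is converging with Pytorch's and provides oneflow.nn.Module, so the operation is wrapped as an oneflow.nn.Module.) The Module takes the following parameters:

quantization_bit: the number of quantization bits
quantization_scheme: the quantization mode, either symmetric or affine (asymmetric); in symmetric quantization, floating-point 0 maps to 0 in the quantized space
quantization_formula: the quantization scheme, either Google or Cambricon (Cambricon is the Chinese AI chip company)
per_layer_quantization: whether the input Tensor is quantized PerLayer or PerChannel; set it to True for PerLayer quantization. Activation feature maps are generally quantized PerLayer, while weights can use either PerLayer or PerChannel.

Usage at the Python level looks like this: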

>>> import numpy as np
>>> import oneflow as flow

>>> weight = (np.random.random((2, 3, 4, 5)) - 0.5).astype(np.float32)

>>> input_tensor = flow.Tensor(
...    weight, dtype=flow.float32
... )

>>> quantization_bit = 8
>>> quantization_scheme = "symmetric"
>>> quantization_formula = "google"
>>> per_layer_quantization = True

>>> min_max_observer = flow.nn.MinMaxObserver(quantization_formula=quantization_formula, quantization_bit=quantization_bit,
... quantization_scheme=quantization_scheme, per_layer_quantization=per_layer_quantization)

>>> scale, zero_point = min_max_observer(
...    input_tensor, )

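Once the quantization configuration is set, passing in a Tensor computes the scale and zero_point of that Tensor under the given settings.

That covers the Python front-end interface and usage. Now for the implementation of this Module inside OneFlow, taking the CPU version as the example (the GPU and CPU Kernels implement the same logic). The file is oneflow/user/kernels/min_max_observer_kernel.cpp, and its core is the following three functions: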

// TFLite quantization scheme, symmetric quantization
template<typename T>
void GenQuantScaleSymmetric(const T* in_ptr, const int32_t quantization_bit,
                            const int64_t num_elements, T* scale, T* zero_point) {
  T in_max = *std::max_element(in_ptr, in_ptr + num_elements);
  T in_min = *std::min_element(in_ptr, in_ptr + num_elements);

  in_max = std::max(std::abs(in_max), std::abs(in_min));

  T denominator = static_cast<T>(pow(2.0, quantization_bit - 1)) - 1;

  *scale = in_max / denominator;
  *zero_point = 0;
}

// TFLite quantization scheme, asymmetric (affine) quantization
template<typename T>
void GenQuantScaleAffine(const T* in_ptr, const int32_t quantization_bit,
                         const int64_t num_elements, T* scale, T* zero_point) {
  T in_max = *std::max_element(in_ptr, in_ptr + num_elements);
  T in_min = *std::min_element(in_ptr, in_ptr + num_elements);

  T denominator = static_cast<T>(pow(2.0, quantization_bit)) - 1;

  *scale = (in_max - in_min) / denominator;
  *zero_point = -std::nearbyint(in_min / (*scale));
}

// Cambricon quantization scheme
template<typename T>
void GenQuantScaleCambricon(const T* in_ptr, const int32_t quantization_bit,
                            const int64_t num_elements, T* scale, T* zero_point) {
  T in_max = *std::max_element(in_ptr, in_ptr + num_elements);
  T in_min = *std::min_element(in_ptr, in_ptr + num_elements);

  in_max = std::max(std::abs(in_max), std::abs(in_min));

  *scale = std::floor(std::log2(in_max)) - (quantization_bit - 2);
  *zero_point = 0;
}
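To make the three kernels easier to experiment with, here is a NumPy port of their arithmetic. It is a sketch that mirrors the C++ above line by line for the per-layer case; the Python function names are hypothetical and this is not OneFlow's Python API.

```python
import numpy as np

def gen_quant_scale_symmetric(x, quantization_bit):
    """Mirror of GenQuantScaleSymmetric: scale from max |x|, zero_point = 0."""
    in_max = max(abs(float(x.max())), abs(float(x.min())))
    denominator = 2.0 ** (quantization_bit - 1) - 1
    return in_max / denominator, 0.0

def gen_quant_scale_affine(x, quantization_bit):
    """Mirror of GenQuantScaleAffine: scale from the full range, nonzero zero_point."""
    in_max, in_min = float(x.max()), float(x.min())
    denominator = 2.0 ** quantization_bit - 1
    scale = (in_max - in_min) / denominator
    return scale, -float(np.rint(in_min / scale))

def gen_quant_scale_cambricon(x, quantization_bit):
    """Mirror of GenQuantScaleCambricon: the returned "scale" is a power-of-two exponent."""
    in_max = max(abs(float(x.max())), abs(float(x.min())))
    shift = float(np.floor(np.log2(in_max)) - (quantization_bit - 2))
    return shift, 0.0

x = np.array([-0.5, 0.1, 0.25, 1.0], dtype=np.float32)
sym = gen_quant_scale_symmetric(x, 8)
aff = gen_quant_scale_affine(x, 8)
cam = gen_quant_scale_cambricon(x, 8)
print(sym, aff, cam)
```

Note that the Cambricon variant stores an exponent rather than a multiplier: the effective step size is 2 raised to that value, which is how the FakeQuantization kernel for this scheme later interprets it.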

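Besides these three functions, the other key point is how the per_layer_quantization parameter is handled: with PerChannel quantization, one scale and one zero_point are computed for each output channel, whereas PerLayer quantization computes a single pair for the whole Tensor. For more on PerLayer and PerChannel quantization, see the article Neural Network Quantization -- per-channel Quantization.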
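The difference between the two modes can be sketched in NumPy for the symmetric scheme. This assumes the weight layout is (out_channels, in_channels, kh, kw), so that axis 0 is the output channel; the helper names are hypothetical.

```python
import numpy as np

def symmetric_scale_per_channel(weight, quantization_bit=8):
    """One symmetric scale per output channel (axis 0, an assumption about layout)."""
    out_channels = weight.shape[0]
    flat = np.abs(weight.reshape(out_channels, -1))
    denominator = 2.0 ** (quantization_bit - 1) - 1
    return flat.max(axis=1) / denominator

def symmetric_scale_per_layer(weight, quantization_bit=8):
    """A single symmetric scale for the whole tensor."""
    denominator = 2.0 ** (quantization_bit - 1) - 1
    return np.abs(weight).max() / denominator

rng = np.random.default_rng(0)
weight = (rng.random((2, 3, 4, 5)) - 0.5).astype(np.float32)
weight[1] *= 0.1  # make the second output channel much smaller

per_channel = symmetric_scale_per_channel(weight)
per_layer = symmetric_scale_per_layer(weight)
print(per_channel, per_layer)
```

With PerLayer quantization the small channel would share the large channel's step size and lose most of its resolution; PerChannel gives it a proportionally finer step.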

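Component 2. FakeQuantization

FakeQuantization is likewise wrapped as an oneflow.nn.Module. As noted in the previous section, the main difference between quantization-aware training and post-training quantization is that the former applies a simulated quantization operation, FP32->INT8->FP32, to the activations and weight parameters. This exposes the quantization error to the network during training, with the aim of better accuracy when the model is actually deployed in quantized form. The interface has the following parameters:

scale: the quantization scale computed by the MinMaxObserver component
zero_point: the quantization zero_point computed by the MinMaxObserver component
quantization_bit: the number of quantization bits
quantization_scheme: the quantization mode, either symmetric or affine (asymmetric); in symmetric quantization, floating-point 0 maps to 0 in the quantized space
quantization_formula: the quantization scheme, either Google or Cambricon (Cambricon is the Chinese AI chip company)

Example usage at the Python level: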

>>> import numpy as np
>>> import oneflow as flow

>>> weight = (np.random.random((2, 3, 4, 5)) - 0.5).astype(np.float32)

>>> input_tensor = flow.Tensor(
...    weight, dtype=flow.float32
... )

>>> quantization_bit = 8
>>> quantization_scheme = "symmetric"
>>> quantization_formula = "google"
>>> per_layer_quantization = True

>>> min_max_observer = flow.nn.MinMaxObserver(quantization_formula=quantization_formula, quantization_bit=quantization_bit,
... quantization_scheme=quantization_scheme, per_layer_quantization=per_layer_quantization)
>>> fake_quantization = flow.nn.FakeQuantization(quantization_formula=quantization_formula, quantization_bit=quantization_bit,
... quantization_scheme=quantization_scheme)

>>> scale, zero_point = min_max_observer(
...    input_tensor,
... )

>>> output_tensor = fake_quantization(
...    input_tensor,
...    scale,
...    zero_point,
... )

Running FakeQuantization requires the scale and zero_point of the input Tensor, which come from the MinMaxObserver component above.

Next is the C++ implementation of the FakeQuantization component, which again has three core functions:

// TFLite quantization scheme, symmetric quantization
template<typename T>
void FakeQuantizationPerLayerSymmetric(const T* in_ptr, const T scale,
                                       const int32_t quantization_bit, const int64_t num_elements,
                                       T* out_ptr) {
  T upper_bound = static_cast<T>(pow(2.0, quantization_bit - 1)) - 1;
  T lower_bound = -upper_bound - 1;
  FOR_RANGE(int64_t, i, 0, num_elements) {
    T out = std::nearbyint(in_ptr[i] / scale);
    out = out > upper_bound ? upper_bound : out;
    out = out < lower_bound ? lower_bound : out;
    out_ptr[i] = out * scale;
  }
}

// TFLite quantization scheme, asymmetric (affine) quantization
template<typename T>
void FakeQuantizationPerLayerAffine(const T* in_ptr, const T scale, const T zero_point,
                                    const int32_t quantization_bit, const int64_t num_elements,
                                    T* out_ptr) {
  T upper_bound = static_cast<T>(pow(2.0, quantization_bit)) - 1;
  T lower_bound = 0;
  uint8_t zero_point_uint8 = static_cast<uint8_t>(std::round(zero_point));
  FOR_RANGE(int64_t, i, 0, num_elements) {
    T out = std::nearbyint(in_ptr[i] / scale + zero_point_uint8);
    out = out > upper_bound ? upper_bound : out;
    out = out < lower_bound ? lower_bound : out;
    out_ptr[i] = (out - zero_point_uint8) * scale;
  }
}
// Cambricon quantization scheme
template<typename T>
void FakeQuantizationPerLayerCambricon(const T* in_ptr, const T shift,
                                       const int32_t quantization_bit, const int64_t num_elements,
                                       T* out_ptr) {
  T upper_bound = static_cast<T>(pow(2.0, quantization_bit - 1)) - 1;
  T lower_bound = -upper_bound - 1;
  T scale = static_cast<T>(pow(2.0, static_cast<int32_t>(shift)));
  FOR_RANGE(int64_t, i, 0, num_elements) {
    T out = std::nearbyint(in_ptr[i] / scale);
    out = out > upper_bound ? upper_bound : out;
    out = out < lower_bound ? lower_bound : out;
    out_ptr[i] = out * scale;
  }
}
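The same three kernels can be written compactly in NumPy, which makes the FP32->INT8->FP32 round trip easy to inspect. This is a sketch of the per-layer arithmetic shown above with hypothetical function names, not OneFlow's API.

```python
import numpy as np

def fake_quant_symmetric(x, scale, quantization_bit):
    """Quantize to a signed integer grid, clamp, and map back to floating point."""
    upper = 2.0 ** (quantization_bit - 1) - 1
    lower = -upper - 1
    q = np.clip(np.rint(x / scale), lower, upper)
    return q * scale

def fake_quant_affine(x, scale, zero_point, quantization_bit):
    """Quantize to an unsigned integer grid offset by zero_point, then map back."""
    upper = 2.0 ** quantization_bit - 1
    q = np.clip(np.rint(x / scale + zero_point), 0.0, upper)
    return (q - zero_point) * scale

def fake_quant_cambricon(x, shift, quantization_bit):
    """Cambricon: the step size is a power of two given by the integer shift."""
    return fake_quant_symmetric(x, 2.0 ** int(shift), quantization_bit)

x = np.array([-0.5, 0.1, 0.25, 1.0])
sym_scale = 1.0 / 127.0          # max|x| / (2^7 - 1)
aff_scale = 1.5 / 255.0          # (max - min) / (2^8 - 1)
y_sym = fake_quant_symmetric(x, sym_scale, 8)
y_aff = fake_quant_affine(x, aff_scale, 85.0, 8)
y_cam = fake_quant_cambricon(x, -6.0, 8)
print(y_sym, y_aff, y_cam)
```

For values inside the representable range, the output differs from the input by at most about half a step, and that difference is exactly the quantization error the network sees during training.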

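One point to note: because FakeQuantization takes part in training, its gradient has to be considered. All three core functions above use std::nearbyint, which corresponds to numpy's round operation. The gradient of round is 0 almost everywhere, so if this function sits in the network, the back-propagated gradient becomes 0 as well.

The Straight Through Estimator (STE) is introduced to solve this. The gradient of the convolution layer (used here as the example; the same applies to fully connected layers and any other layer trained with quantization) is passed straight back to the weight before fake quantization. Because the weight used inside the convolution has been fake-quantized, the forward pass simulates the quantization error; passing the gradient back to the original weight updates it so that it adapts to that error, and quantization training proceeds normally.

The implementation is very simple: dy is assigned directly to dx, which in OneFlow is done with the identity Op.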
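The effect of the Straight Through Estimator can be seen in a small hand-written training loop. This is a NumPy toy for a linear model with manually derived gradients, assuming a fixed scale for simplicity; it illustrates the idea only and is not how OneFlow's autograd is implemented.

```python
import numpy as np

def fake_quant(w, scale, bits=8):
    """Symmetric fake quantization: round to the integer grid and map back."""
    upper = 2.0 ** (bits - 1) - 1
    return np.clip(np.rint(w / scale), -upper - 1, upper) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4))
true_w = np.array([0.5, -0.25, 0.75, 0.1])
target = x @ true_w

w = np.zeros(4)        # latent FP32 weights that the optimizer updates
scale = 1.0 / 127      # fixed scale (an assumption made for this toy)
losses = []
for _ in range(200):
    w_q = fake_quant(w, scale)            # forward pass uses the quantized weights
    err = x @ w_q - target
    losses.append(float(np.mean(err ** 2)))
    grad_w_q = 2.0 * x.T @ err / len(x)   # gradient w.r.t. the quantized weights
    grad_w = grad_w_q                     # STE: dx = dy, rounding treated as identity
    w -= 0.1 * grad_w

print(losses[0], losses[-1])
```

Had the true gradient of the rounding step been used, grad_w would be zero everywhere and w would never move; with the identity pass-through the loss falls until only the rounding error remains.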

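Component 3. Quantization

FakeQuantization above implements the FP32->INT8->FP32 process. A Quantization component is also implemented for later use. It differs from FakeQuantization in that it omits the INT8->FP32 step and outputs the fixed-point result directly. Its interface and C++ implementation are therefore almost identical to FakeQuantization's (no backward pass is needed) and are not repeated here. It exists as a separate component so that, once training finishes, the network's weights can be stored directly in fixed-point form, as the Demo below shows.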
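The storage benefit of a standalone quantization step can be sketched in NumPy. This assumes symmetric 8-bit quantization so that the result fits in int8; quantize_symmetric is a hypothetical helper, not the oneflow.nn.Quantization interface.

```python
import numpy as np

def quantize_symmetric(x, scale, quantization_bit=8):
    """FP32 -> fixed point only; the caller keeps scale to interpret the integers."""
    upper = 2 ** (quantization_bit - 1) - 1
    q = np.clip(np.rint(x / scale), -upper - 1, upper)
    return q.astype(np.int8)  # valid for quantization_bit <= 8

rng = np.random.default_rng(0)
weight = (rng.random((2, 3, 4, 5)) - 0.5).astype(np.float32)
scale = float(np.abs(weight).max()) / 127.0

weight_int8 = quantize_symmetric(weight, scale)
restored = weight_int8.astype(np.float32) * scale
print(weight.nbytes, weight_int8.nbytes)
```

The int8 tensor takes a quarter of the FP32 storage, and multiplying by scale recovers the weights to within about half a quantization step.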

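0x3. Quantization-Aware Training of AlexNet with OneFlow

Taking AlexNet as the example, this section builds a quantization-aware training Demo on top of OneFlow's three quantization components. The experimental results are summarized first.

The training dataset is a subset of ImageNet; details are available at https://github.com/Oneflow-Inc/models/pull/78 . At 8 bits, accuracy after quantization-aware training shows no obvious drop, regardless of whether the Google or Cambricon formula, the symmetric or asymmetric scheme, or PerLayer or PerChannel quantization is used. Once the bit width is reduced from 8 to 4, accuracy drops noticeably under the same hyperparameter configuration.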
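One way to build intuition for the 8-bit versus 4-bit gap is to compare the raw rounding error at the two bit widths. The sketch below does this for symmetric fake quantization on random Gaussian data; it illustrates only the growth of quantization noise and says nothing about the accuracy figures of the experiment itself.

```python
import numpy as np

def fake_quant_symmetric(x, quantization_bit):
    """Symmetric per-layer fake quantization with the scale taken from max |x|."""
    upper = 2.0 ** (quantization_bit - 1) - 1
    scale = np.abs(x).max() / upper
    return np.clip(np.rint(x / scale), -upper - 1, upper) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=10000)

# Mean squared round-trip error for each bit width.
mse = {bits: float(np.mean((fake_quant_symmetric(x, bits) - x) ** 2)) for bits in (8, 4)}
print(mse)
```

Going from 8 to 4 bits makes the step size roughly 18 times larger (127 levels versus 7 on each side), so the squared error grows by more than two orders of magnitude.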

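The approach taken by this OneFlow-based quantization-aware training Demo is as follows.

The code is structured like this:

- quantization
 - quantization_ops  fake-quantization OP implementations
     - q_module.py  implements the QParam class, which manages fake-quantization parameters and operations, and the QModule base class, which manages the fake-quantization OP implementations
     - conv.py  inherits from QModule and implements fake quantization for convolution
     - linear.py  inherits from QModule and implements fake quantization for fully connected layers
     - ...
 - models  quantized model implementations
     - q_alexnet.py  quantized AlexNet model
 - quantization_aware_training.py  quantization training implementation
 - quantization_infer.py  quantized inference implementation
 - train.sh  quantization training script
 - infer.sh  quantized inference script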

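Quantization training first has to collect the scale and zero_point of the input samples and the intermediate layers, and it also performs quantization and dequantization frequently, so a QParam class is implemented to encapsulate this functionality.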
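To give a feel for what such a wrapper does, here is a hypothetical NumPy stand-in. The method names update, quantize_tensor and fake_quantize_tensor follow the calls that appear in the QConv2d code later in this article, but the bodies are assumptions (symmetric, per-layer, scale re-estimated on every update) and not the repository's actual QParam.

```python
import numpy as np

class QParamSketch:
    """Hypothetical NumPy stand-in for QParam (symmetric, per-layer only)."""

    def __init__(self, quantization_bit=8):
        self.quantization_bit = quantization_bit
        self.upper = 2.0 ** (quantization_bit - 1) - 1
        self.scale = None
        self.zero_point = 0.0

    def update(self, tensor):
        # Re-estimate scale from the tensor seen in the current step (assumption).
        self.scale = float(np.abs(tensor).max()) / self.upper

    def quantize_tensor(self, tensor):
        # FP32 -> integer grid, clamped to the signed range.
        q = np.rint(tensor / self.scale)
        return np.clip(q, -self.upper - 1, self.upper)

    def fake_quantize_tensor(self, tensor):
        # FP32 -> integer grid -> FP32.
        return self.quantize_tensor(tensor) * self.scale

x = np.array([-1.0, 0.3, 0.5])
qp = QParamSketch(quantization_bit=8)
qp.update(x)
q = qp.quantize_tensor(x)
x_fake = qp.fake_quantize_tensor(x)
print(qp.scale, q, x_fake)
```

Keeping the statistics and the quantize/dequantize helpers in one object lets each quantized layer hold one such object for its input, its weight, and its output.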
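A quantization base class QModule is also implemented. It provides the member functions __init__ and freeze.

__init__: besides quantization_bit, quantization_scheme, quantization_formula and per_layer_quantization, this function also takes flags saying whether the module owns an input quantization parameter (qi) and an output quantization parameter (qo). Not every network module needs to collect the scale and zero_point of its input: most intermediate layers use the previous layer's qo as their own qi, and some intermediate activation layers directly reuse the previous layer's qi as both their qi and qo.
freeze: this function comes into play after scale and zero_point have been collected, and it relates to post-training quantization and model conversion. Many terms in the post-quantization formula for fixed-point computation can be computed in advance; freeze fixes those terms ahead of time and also converts the network's weights from floating-point values to fixed-point integers.

[Figure: post-quantization formula, fixed-point computation]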
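For reference, the fixed-point formula can be reconstructed from the affine mapping $r = S(q - Z)$. The derivation below is a sketch for a single output element of a convolution or fully connected layer; the subscripts $x$, $w$, $b$ and $a$ (input, weight, bias, output) are notation chosen here, and the choice $S_b = S_w S_x$, $Z_b = 0$ for the bias follows what the freeze function of QConv2d does in the code later in this article.

```latex
% Floating-point layer: a = \sum_i w_i x_i + b
% Substitute r = S (q - Z) for every tensor:
S_a (q_a - Z_a) = \sum_i S_w (q_{w,i} - Z_w) \, S_x (q_{x,i} - Z_x) + S_b (q_b - Z_b)

% Quantize the bias with S_b = S_w S_x and Z_b = 0, then solve for q_a:
q_a = M \left( \sum_i (q_{w,i} - Z_w)(q_{x,i} - Z_x) + q_b \right) + Z_a,
\qquad M = \frac{S_w S_x}{S_a}
```

Everything except $q_{x,i}$ is known once training ends: $M$, the shifted weights $q_{w,i} - Z_w$ and the integer bias $q_b$ are exactly the quantities that freeze precomputes.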

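QConv2d, QReLU, QConvBN and other modules are defined on top of this QModule base class.

QConvBN fuses Conv and BN and then applies simulated quantization; the principle is covered in the fourth reference listed in section 0x1. Here is the implementation of QConv2d as an example: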

import oneflow as flow
from quantization_ops.q_module import QModule, QParam

__all__ = ["QConv2d"]


class QConv2d(QModule):

    def __init__(self, conv_module, qi=True, qo=True, quantization_bit=8, quantization_scheme='symmetric', quantization_formula='google', per_layer_quantization=True):
        super(QConv2d, self).__init__(qi=qi, qo=qo, quantization_bit=quantization_bit, quantization_scheme=quantization_scheme,
                                      quantization_formula=quantization_formula, per_layer_quantization=per_layer_quantization)
        self.quantization_bit = quantization_bit
        self.quantization_scheme = quantization_scheme
        self.quantization_formula = quantization_formula
        self.per_layer_quantization = per_layer_quantization
        self.conv_module = conv_module
        self.fake_quantization = flow.nn.FakeQuantization(
            quantization_formula=quantization_formula, quantization_bit=quantization_bit, quantization_scheme=quantization_scheme)
        self.qw = QParam(quantization_bit=quantization_bit, quantization_scheme=quantization_scheme,
                         quantization_formula=quantization_formula, per_layer_quantization=per_layer_quantization)
        self.quantization = flow.nn.Quantization(
            quantization_bit=32, quantization_scheme="affine", quantization_formula="google")

    def forward(self, x):
        if hasattr(self, 'qi'):
            self.qi.update(x)
            x = self.qi.fake_quantize_tensor(x)

        self.qw.update(self.conv_module.weight)

        x = flow.F.conv2d(x, self.qw.fake_quantize_tensor(self.conv_module.weight), self.conv_module.bias,
                          stride=self.conv_module.stride,
                          padding=self.conv_module.padding, dilation=self.conv_module.dilation,
                          groups=self.conv_module.groups)

        if hasattr(self, 'qo'):
            self.qo.update(x)
            x = self.qo.fake_quantize_tensor(x)

        return x

    def freeze(self, qi=None, qo=None):

        if hasattr(self, 'qi') and qi is not None:
            raise ValueError('qi has been provided in init function.')
        if not hasattr(self, 'qi') and qi is None:
            raise ValueError('qi is not existed, should be provided.')

        if hasattr(self, 'qo') and qo is not None:
            raise ValueError('qo has been provided in init function.')
        if not hasattr(self, 'qo') and qo is None:
            raise ValueError('qo is not existed, should be provided.')

        if qi is not None:
            self.qi = qi
        if qo is not None:
            self.qo = qo
        self.M = self.qw.scale.numpy() * self.qi.scale.numpy() / self.qo.scale.numpy()

        self.conv_module.weight = flow.nn.Parameter(
            self.qw.quantize_tensor(self.conv_module.weight) - self.qw.zero_point)
        self.conv_module.bias = flow.nn.Parameter(self.quantization(
            self.conv_module.bias, self.qi.scale * self.qw.scale, flow.Tensor([0])))

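In QConv2d's __init__, conv_module is the original FP32 convolution module, and the remaining arguments are quantization configuration parameters that must be specified when the model is defined. The forward function simulates quantization with FakeQuantization, and the freeze function fixes the weight parameters to fixed-point values. The other quantized Modules are implemented similarly.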
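What freeze buys at inference time can be sketched with a fully connected layer in NumPy. This is a toy under stated assumptions: symmetric quantization everywhere (all zero points are 0), scales taken directly from the tensors rather than from training statistics, and a floating-point multiplier M where real integer-only deployments typically use a fixed-point multiply and shift. The helper names are hypothetical.

```python
import numpy as np

def symmetric_scale(t, bits=8):
    return float(np.abs(t).max()) / (2.0 ** (bits - 1) - 1)

def quantize(t, scale, bits=8):
    upper = 2.0 ** (bits - 1) - 1
    return np.clip(np.rint(t / scale), -upper - 1, upper).astype(np.int32)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
w = rng.normal(size=(16, 4)) * 0.1
b = rng.normal(size=4) * 0.1
y_fp32 = x @ w + b

# Scales that training would have recorded in qi, qw and qo.
s_x, s_w, s_y = symmetric_scale(x), symmetric_scale(w), symmetric_scale(y_fp32)

# "freeze": precompute M and convert weight and bias to integers.
M = s_w * s_x / s_y
q_w = quantize(w, s_w)
q_b = np.rint(b / (s_x * s_w)).astype(np.int32)  # bias scale s_x * s_w, zero point 0

# Inference: integer accumulation, then a single rescale by M.
q_x = quantize(x, s_x)
q_y = np.clip(np.rint(M * (q_x @ q_w + q_b)), -128, 127).astype(np.int32)
y_hat = q_y * s_y

print(np.max(np.abs(y_hat - y_fp32)))
```

The matrix product runs entirely on integers, and the only floating-point quantity left at inference time is the precomputed multiplier M.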

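With these Modules, the quantized AlexNet model structure can be defined (https://github.com/Oneflow-Inc/models/blob/add_quantization_model/quantization/models/q_alexnet.py ), and quantization-aware training and fixed-point freezing of the model parameters can be carried out.

For the complete training and testing code, see the https://github.com/Oneflow-Inc/models repository.

0x4. Note: the implementation above is only Demo-level

Looking at the implementation above, the biggest weakness is that quantizing a model requires adjusting the model structure by hand. This is not unique to the OneFlow Demo: Pytorch's first-generation quantization scheme, the one that predates the FX quantization scheme introduced in Pytorch 1.8.0, works the same way. Here is an excerpt from a survey report.

Pytorch's first-generation quantization is called Eager Mode Quantization, and FX Graph Mode Quantization was introduced starting with 1.8. Eager Mode Quantization requires users to modify the model manually and to specify by hand which Ops to fuse. FX Graph Mode Quantization removes that burden: quantization is automatic, with no need to modify the model or deal with internal operations. This change is illustrated in the figure below.

The following code illustrates the difference between the two Pytorch quantization approaches.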

Eager Mode Quantization

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self, num_channels=1):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(num_channels, 40, 3, 1)
        self.conv2 = nn.Conv2d(40, 40, 3, 1)
        self.fc = nn.Linear(5*5*40, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.reshape(-1, 5*5*40)
        x = self.fc(x)
        return x

Pytorch lets you build the network freely inside a Module's forward: you can call Modules, call Functionals, and even write control logic such as if statements. The downside is that the model's graph structure is hard to obtain. As a result, quantizing this network with Eager Mode Quantization requires manual modification:

class NetQuant(nn.Module):

    def __init__(self, num_channels=1):
        super(NetQuant, self).__init__()
        self.conv1 = nn.Conv2d(num_channels, 40, 3, 1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(40, 40, 3, 1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2, 2)
        self.fc = nn.Linear(5*5*40, 10)

        self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu1(self.conv1(x))
        x = self.pool1(x)
        x = self.relu2(self.conv2(x))
        x = self.pool2(x)
        x = x.reshape(-1, 5*5*40)
        x = self.fc(x)
        x = self.dequant(x)
        return x

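In other words, besides Modules with parameters such as Conv and Linear, ReLU and MaxPool2d must also be defined in __init__ before Eager Mode Quantization can handle them.

In addition, some nodes have to be fused before quantization, Conv + ReLU for example, so the layers to fold must also be specified manually. Folding is currently supported for Conv + BN, Conv + BN + ReLU, Conv + ReLU, Linear + ReLU, and BN + ReLU.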

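model = NetQuant()
model.eval()  # post-training static quantization expects the model in eval mode
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
modules_to_fuse = [['conv1', 'relu1'], ['conv2', 'relu2']]  # names of the layers to fuse
model_fused = torch.quantization.fuse_modules(model, modules_to_fuse)
model_prepared = torch.quantization.prepare(model_fused)
# post_training_quantize is a user-defined helper that feeds calibration data through the model
post_training_quantize(model_prepared, train_loader)   # post-training quantization (calibration) step
model_int8 = torch.quantization.convert(model_prepared)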

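The whole workflow is rather cumbersome, and it is unclear how many people actually use it.

FX Graph Mode Quantization

FX automatically traces the code inside forward, so it genuinely records every node in the network, which makes it friendlier than Eager mode for fusing and for dynamically inserting quantization nodes. The model code above needs no changes; FX modifies the network automatically: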

from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx
model = Net()
model.eval()  # prepare_fx expects the model in eval mode for post-training quantization
qconfig = get_default_qconfig("fbgemm")
qconfig_dict = {"": qconfig}
# Later Pytorch releases also require an example_inputs argument to prepare_fx
model_prepared = prepare_fx(model, qconfig_dict)
post_training_quantize(model_prepared, train_loader)      # post-training quantization (calibration) step
model_int8 = convert_fx(model_prepared)
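To see what "tracing the code inside forward" produces, the sketch below runs torch.fx.symbolic_trace on a small toy model and lists the recorded nodes. TinyNet is a hypothetical model made up for this illustration; only symbolic_trace and the graph/node attributes come from torch.fx.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.fx import symbolic_trace

class TinyNet(nn.Module):  # hypothetical toy model for illustration
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 4, 3, 1)
        self.fc = nn.Linear(4, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))   # functional call, not a registered Module
        x = x.mean(dim=(2, 3))      # tensor method call
        return self.fc(x)

traced = symbolic_trace(TinyNet())
# Each node records what kind of operation it is and what it targets.
node_summary = [(node.op, str(node.target)) for node in traced.graph.nodes]
for op, target in node_summary:
    print(op, target)
```

The functional F.relu call shows up as its own call_function node even though it was never declared in __init__, which is what lets a graph-level pass find fusion patterns and insert quantization nodes without the user rewriting the model.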

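My Understanding

In my view, quantization-aware training on OneFlow's Eager interface should also become fully automatic. (OneFlow's Eager interface will be fully aligned with Pytorch's, so users can migrate algorithms at zero cost and benefit from OneFlow's speed on multi-machine, multi-GPU setups.) The advantage of Pytorch FX is that it can transform a Module into a similar Module by inserting Passes; once a developer has implemented a Pass, users no longer need to worry about it. Automatic quantization for the OneFlow Eager version is under development (the Lazy version already supports one-click automatic quantization training), so stay tuned. As a plug, OneFlow is at https://github.com/Oneflow-Inc/oneflow .

0x5. Summary

This article shared a recent piece of the author's work: quantization-aware training on the OneFlow Eager version. Doing quantization-aware training by hand is currently not user-friendly, but for readers who want to learn the technique, the Demo is a reasonable way to pick up some of the tricks. The article also surveyed Pytorch FX's automatic quantization scheme, which is indeed friendlier than Pytorch's first-generation scheme. Our goal is likewise a more automatic and friendlier quantization training interface.

0x6. References

Introduction to Neural Network Quantization -- Basic Principles
Introduction to Neural Network Quantization -- Post-Training Quantization
Introduction to Neural Network Quantization -- Quantization-Aware Training
Introduction to Neural Network Quantization -- Folding BN ReLU Code Implementation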


Quantization-Aware Training Based on OneFlow