[CVPR 2020] Regularizing Class-Wise Predictions via Self-Knowledge Distillation
Contents
- Introduction
- Class-wise self-knowledge distillation (CS-KD)
  - Class-wise regularization
  - Effects of class-wise regularization
- Experiments
  - Classification accuracy
- References
Introduction
- To alleviate overfitting, the authors propose Class-wise self-knowledge distillation (CS-KD), which distills the predicted class probabilities of other samples from the same class into the model itself, so that the model produces more meaningful and more consistent predictions.
Class-wise self-knowledge distillation (CS-KD)
Class-wise regularization
- Class-wise regularization loss: it pushes the predicted class distributions of samples from the same class toward each other, which amounts to distilling the model's own dark knowledge (i.e., the knowledge on wrong predictions) into itself:
$$\mathcal{L}_{\mathrm{cls}}(\mathbf{x}, \mathbf{x}'; \theta, T) := KL\big(P(y \mid \mathbf{x}'; \tilde{\theta}, T) \,\|\, P(y \mid \mathbf{x}; \theta, T)\big)$$
Here, $\mathbf{x}, \mathbf{x}'$ are different samples belonging to the same class, $P(y \mid \mathbf{x}; \theta, T) = \frac{\exp\left(f_y(\mathbf{x}; \theta)/T\right)}{\sum_{i=1}^{C} \exp\left(f_i(\mathbf{x}; \theta)/T\right)}$ is the softened class probability, and $T$ is the temperature parameter. Note that $\tilde{\theta}$ is a fixed copy of the parameters $\theta$: gradients are not back-propagated through $\tilde{\theta}$, which avoids model collapse (cf. Miyato et al.).
- Total training loss: the standard cross-entropy loss plus the class-wise regularization term weighted by $\lambda_{\mathrm{cls}}$ (a minimal PyTorch sketch is given below):
$$\mathcal{L}_{\mathrm{tot}}(\mathbf{x}, \mathbf{x}', y; \theta, T) = \mathcal{L}_{\mathrm{CE}}(\mathbf{x}, y; \theta) + \lambda_{\mathrm{cls}} \cdot T^2 \cdot \mathcal{L}_{\mathrm{cls}}(\mathbf{x}, \mathbf{x}'; \theta, T)$$
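The following is a minimal PyTorch sketch of this objective, not the official alinlab/cs-kd implementation; the names `cls_kd_loss`, `cs_kd_total_loss`, `x_same_class`, `lamda`, and the default `temperature=4.0` are illustrative assumptions, and the same-class pairing of samples is assumed to be provided by the data loader.

```python
import torch
import torch.nn.functional as F


def cls_kd_loss(logits, logits_other, temperature=4.0):
    """KL(P(y | x'; theta~, T) || P(y | x; theta, T)), averaged over the batch."""
    # Student side: softened log-probabilities for sample x.
    log_p = F.log_softmax(logits / temperature, dim=1)
    # Teacher side: softened probabilities for the same-class sample x',
    # detached so no gradient flows through the fixed copy theta~.
    q = F.softmax(logits_other.detach() / temperature, dim=1)
    # Scaling by T^2 keeps the gradient magnitude comparable across
    # temperatures (the usual knowledge-distillation convention).
    return F.kl_div(log_p, q, reduction="batchmean") * (temperature ** 2)


def cs_kd_total_loss(model, x, x_same_class, y, lamda=1.0, temperature=4.0):
    """Cross-entropy on (x, y) plus the class-wise regularization term.

    `x_same_class` holds, for each element of `x`, a different sample
    drawn from the same class (e.g. via a pair-sampling data loader).
    """
    logits = model(x)
    with torch.no_grad():  # the second forward pass plays the role of theta~
        logits_other = model(x_same_class)
    ce = F.cross_entropy(logits, y)
    return ce + lamda * cls_kd_loss(logits, logits_other, temperature)
```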
Effects of class-wise regularization
- Reducing the intra-class variations.
- Preventing overconfident predictions. CS-KD avoids overconfident predictions by using the predicted class distribution of another sample from the same class as a soft target; such soft labels are more ‘realistic’ than those produced by plain label smoothing, since they reflect the model's learned class similarities (a toy comparison is sketched below).
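As a toy illustration of this point, the snippet below contrasts a label-smoothing target with a CS-KD-style soft target for a 4-class problem; the values of `eps`, `T`, and `logits_other` are hypothetical and chosen only for the example.

```python
import torch
import torch.nn.functional as F

num_classes, eps, T = 4, 0.1, 4.0
true_class = 2

# Label smoothing: a fixed, class-agnostic mixture of the one-hot target
# and the uniform distribution -> [0.025, 0.025, 0.925, 0.025].
one_hot = F.one_hot(torch.tensor(true_class), num_classes).float()
ls_target = (1 - eps) * one_hot + eps / num_classes

# CS-KD: the soft target is the model's own (softened) prediction on a
# different sample of the same class, so the probability mass on wrong
# classes reflects learned class similarities rather than a flat constant.
logits_other = torch.tensor([1.0, 0.5, 3.0, -1.0])  # hypothetical f(x'; theta~)
cskd_target = F.softmax(logits_other / T, dim=0)
```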
Experiments
Classification accuracy
- Comparison with output regularization methods.
- Comparison with self-distillation methods.
- Evaluation on large-scale datasets.
- Compatibility with other regularization methods.
- Ablation study.
(1) Feature embedding analysis.
(2) Hierarchical image classification.
- Calibration effects.
References
- Yun, Sukmin, et al. “Regularizing Class-Wise Predictions via Self-Knowledge Distillation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- code: https://github.com/alinlab/cs-kd