目标检测 pytorch复现SSD目标检测项目

文章列表

0、简介
1、模型整体框架（以VGG16为特征提取网络）
3、默认框（default box）的生成--相当于Faster-RCNN中生成的anchor
4、预测层的实现原理：
5、正负样本的选取
6、损失的计算原理
6、以ResNet50作为特征提取backbone
7、ResNet50+SSD网络模型搭建
8、Default Box的生成原理
、训练自己的SSD目标检测模型

0、简介

SSD（Single Shot MultiBox Detector）是大神Wei Liu在 ECCV 2016上发表的一种的目标检测算法。对于输入图像大小300x300的版本在VOC2007数据集上达到了72.1%mAP的准确率并且检测速度达到了惊人的58FPS（ Faster RCNN：73.2%mAP，7FPS； YOLOv1： 63.4%mAP，45FPS ），500x500的版本达到了75.1%mAP的准确率。当然算法YOLOv2已经赶上了SSD，YOLOv3已经超越SSD，但SSD算法依旧值得研究。
目标检测 pytorch复现SSD目标检测项目

1、模型整体框架（以VGG16为特征提取网络）

目标检测 pytorch复现SSD目标检测项目

即根据论文，体征提取backbonk对应着上图中VGG16黄色框选的区域，也就是Conv5_3输出数据为提取的数据特征

在不同特征层上，匹配不同尺度大小的目标：
目标检测 pytorch复现SSD目标检测项目
上图中8x8的特征矩阵相对于4x4的特征矩阵，保留的特征信息会更多些，所以，在相对底层的特征矩阵上去预测小目标，在相对高层的特征矩阵上去预测大些的目标。

3、默认框（default box）的生成–相当于Faster-RCNN中生成的anchor

目标检测 pytorch复现SSD目标检测项目

class DefaultBoxes(object):def __init__(self, fig_size, feat_size, steps, scales, aspect_ratios, scale_xy=0.1, scale_wh=0.2):self.fig_size = fig_size   # 输入网络的图像大小 300# [38, 19, 10, 5, 3, 1]self.feat_size = feat_size  # 每个预测层的feature map尺寸self.scale_xy_ = scale_xyself.scale_wh_ = scale_wh# According to https://github.com/weiliu89/caffe# Calculation method slightly different from paper# [8, 16, 32, 64, 100, 300]self.steps = steps    # 每个特征层上的一个cell在原图上的跨度# [21, 45, 99, 153, 207, 261, 315]self.scales = scales  # 每个特征层上预测的default box的scale  [21, 45, 99, 153, 207, 261, 315]fk = fig_size / np.array(steps)     # 计算每层特征层的fk# [[2], [2, 3], [2, 3], [2, 3], [2], [2]]self.aspect_ratios = aspect_ratios  # 每个预测特征层上预测的default box的ratiosself.default_boxes = []# size of feature and number of feature# 遍历每层特征层，计算default boxfor idx, sfeat in enumerate(self.feat_size):sk1 = scales[idx] / fig_size  # scale转为相对值[0-1]sk2 = scales[idx + 1] / fig_size  # scale转为相对值[0-1]sk3 = sqrt(sk1 * sk2)# 先添加两个1:1比例的default box宽和高all_sizes = [(sk1, sk1), (sk3, sk3)]# 再将剩下不同比例的default box宽和高添加到all_sizes中for alpha in aspect_ratios[idx]:w, h = sk1 * sqrt(alpha), sk1 / sqrt(alpha)all_sizes.append((w, h))all_sizes.append((h, w))# 计算当前特征层对应原图上的所有default boxfor w, h in all_sizes:for i, j in itertools.product(range(sfeat), repeat=2):  # i -> 行（y）， j -> 列（x）# 计算每个default box的中心坐标（范围是在0-1之间）cx, cy = (j + 0.5) / fk[idx], (i + 0.5) / fk[idx]self.default_boxes.append((cx, cy, w, h))# 将default_boxes转为tensor格式self.dboxes = torch.as_tensor(self.default_boxes, dtype=torch.float32)  # 这里不转类型会报错self.dboxes.clamp_(min=0, max=1)  # 将坐标（x, y, w, h）都限制在0-1之间# For IoU calculation# ltrb is left top coordinate and right bottom coordinate# 将(x, y, w, h)转换成(xmin, ymin, xmax, ymax)，方便后续计算IoU(匹配正负样本时)self.dboxes_ltrb = self.dboxes.clone()self.dboxes_ltrb[:, 0] = self.dboxes[:, 0] - 0.5 * self.dboxes[:, 2]   # xminself.dboxes_ltrb[:, 1] = self.dboxes[:, 1] - 0.5 * self.dboxes[:, 3]   # yminself.dboxes_ltrb[:, 2] = self.dboxes[:, 0] + 0.5 * self.dboxes[:, 2]   # xmaxself.dboxes_ltrb[:, 3] = self.dboxes[:, 1] + 0.5 * self.dboxes[:, 3]   # ymax@propertydef scale_xy(self):return self.scale_xy_@propertydef scale_wh(self):return self.scale_wh_def __call__(self, order='ltrb'):# 根据需求返回对应格式的default boxif order == 'ltrb':return self.dboxes_ltrbif order == 'xywh':return self.dboxes

4、预测层的实现原理：

假设生成k个Default box，则：
目标检测 pytorch复现SSD目标检测项目
在Feature Map上的每个像素上，都会生成k个default box
即对每个Default box，都会预测C个类别分数（C包括了背景类别）

5、正负样本的选取

正样本的选取：
目标检测 pytorch复现SSD目标检测项目
匹配准则1：
对于每一个GT box ，去匹配与其IOU值最大的Default Box
匹配准则2：
对于任意的Default Box，只要与任何一个GT Box的IOU值大于0.5，也认为其为正样本

负样本的选取：
目标检测 pytorch复现SSD目标检测项目
对于剩下的负样本，选取Confidence Loss（即该值越大，表示网络将对应的box预测为目标的概率越大）靠前的样本作为负样本
负样本与正样本的比例为3:1

6、损失的计算原理

类别损失：
目标检测 pytorch复现SSD目标检测项目

定位损失：
定位损失值针对于正样本而言

class Loss(nn.Module):"""Implements the loss as the sum of the followings:1. Confidence Loss: All labels, with hard negative mining2. Localization Loss: Only on positive labelsSuppose input dboxes has the shape 8732x4"""def __init__(self, dboxes):super(Loss, self).__init__()# Two factor are from following links# http://jany.st/post/2017-11-05-single-shot-detector-ssd-from-scratch-in-tensorflow.htmlself.scale_xy = 1.0 / dboxes.scale_xy  # 10self.scale_wh = 1.0 / dboxes.scale_wh  # 5self.location_loss = nn.SmoothL1Loss(reduction='none')# [num_anchors, 4] -> [4, num_anchors] -> [1, 4, num_anchors]self.dboxes = nn.Parameter(dboxes(order="xywh").transpose(0, 1).unsqueeze(dim=0),requires_grad=False)self.confidence_loss = nn.CrossEntropyLoss(reduction='none')def _location_vec(self, loc):# type: (Tensor) -> Tensor"""Generate Location Vectors计算ground truth相对anchors的回归参数:param loc: anchor匹配到的对应GTBOX Nx4x8732:return:"""gxy = self.scale_xy * (loc[:, :2, :] - self.dboxes[:, :2, :]) / self.dboxes[:, 2:, :]  # Nx2x8732gwh = self.scale_wh * (loc[:, 2:, :] / self.dboxes[:, 2:, :]).log()  # Nx2x8732return torch.cat((gxy, gwh), dim=1).contiguous()def forward(self, ploc, plabel, gloc, glabel):# type: (Tensor, Tensor, Tensor, Tensor) -> Tensor"""ploc, plabel: Nx4x8732, Nxlabel_numx8732predicted location and labelsgloc, glabel: Nx4x8732, Nx8732ground truth location and labels"""# 获取正样本的mask  Tensor: [N, 8732]mask = torch.gt(glabel, 0)  # (gt: >)# mask1 = torch.nonzero(glabel)# 计算一个batch中的每张图片的正样本个数 Tensor: [N]pos_num = mask.sum(dim=1)# 计算gt的location回归参数 Tensor: [N, 4, 8732]vec_gd = self._location_vec(gloc)# sum on four coordinates, and mask# 计算定位损失(只有正样本)loc_loss = self.location_loss(ploc, vec_gd).sum(dim=1)  # Tensor: [N, 8732]loc_loss = (mask.float() * loc_loss).sum(dim=1)  # Tenosr: [N]# hard negative mining Tenosr: [N, 8732]con = self.confidence_loss(plabel, glabel)# positive mask will never selected# 获取负样本con_neg = con.clone()con_neg[mask] = 0.0# 按照confidence_loss降序排列 con_idx(Tensor: [N, 8732])_, con_idx = con_neg.sort(dim=1, descending=True)_, con_rank = con_idx.sort(dim=1)  # 这个步骤比较巧妙# number of negative three times positive# 用于损失计算的负样本数是正样本的3倍（在原论文Hard negative mining部分），# 但不能超过总样本数8732neg_num = torch.clamp(3 * pos_num, max=mask.size(1)).unsqueeze(-1)neg_mask = torch.lt(con_rank, neg_num)  # (lt: <) Tensor [N, 8732]# confidence最终loss使用选取的正样本loss+选取的负样本losscon_loss = (con * (mask.float() + neg_mask.float())).sum(dim=1)  # Tensor [N]# avoid no object detected# 避免出现图像中没有GTBOX的情况total_loss = loc_loss + con_loss# eg. [15, 3, 5, 0] -> [1.0, 1.0, 1.0, 0.0]num_mask = torch.gt(pos_num, 0).float()  # 统计一个batch中的每张图像中是否存在正样本pos_num = pos_num.float().clamp(min=1e-6)  # 防止出现分母为零的情况ret = (total_loss * num_mask / pos_num).mean(dim=0)  # 只计算存在正样本的图像损失return ret

6、以ResNet50作为特征提取backbone

目标检测 pytorch复现SSD目标检测项目
ResNet50+SSD整体架构：↓↓↓

7、ResNet50+SSD网络模型搭建

res50_backbone.py ResNet50特征提取网络搭建

import torch.nn as nn
import torchclass Bottleneck(nn.Module):expansion = 4def __init__(self, in_channel, out_channel, stride=1, downsample=None):super(Bottleneck, self).__init__()self.conv1 = nn.Conv2d(in_channels=in_channel, out_channels=out_channel,kernel_size=1, stride=1, bias=False)  # squeeze channelsself.bn1 = nn.BatchNorm2d(out_channel)# -----------------------------------------self.conv2 = nn.Conv2d(in_channels=out_channel, out_channels=out_channel,kernel_size=3, stride=stride, bias=False, padding=1)self.bn2 = nn.BatchNorm2d(out_channel)# -----------------------------------------self.conv3 = nn.Conv2d(in_channels=out_channel, out_channels=out_channel*self.expansion,kernel_size=1, stride=1, bias=False)  # unsqueeze channelsself.bn3 = nn.BatchNorm2d(out_channel*self.expansion)self.relu = nn.ReLU(inplace=True)self.downsample = downsampledef forward(self, x):identity = xif self.downsample is not None:identity = self.downsample(x)out = self.conv1(x)out = self.bn1(out)out = self.relu(out)out = self.conv2(out)out = self.bn2(out)out = self.relu(out)out = self.conv3(out)out = self.bn3(out)out += identityout = self.relu(out)return outclass ResNet(nn.Module):def __init__(self, block, blocks_num, num_classes=1000, include_top=True):super(ResNet, self).__init__()self.include_top = include_topself.in_channel = 64self.conv1 = nn.Conv2d(3, self.in_channel, kernel_size=7, stride=2,padding=3, bias=False)self.bn1 = nn.BatchNorm2d(self.in_channel)self.relu = nn.ReLU(inplace=True)self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)self.layer1 = self._make_layer(block, 64, blocks_num[0])self.layer2 = self._make_layer(block, 128, blocks_num[1], stride=2)self.layer3 = self._make_layer(block, 256, blocks_num[2], stride=2)self.layer4 = self._make_layer(block, 512, blocks_num[3], stride=2)if self.include_top:self.avgpool = nn.AdaptiveAvgPool2d((1, 1))  # output size = (1, 1)self.fc = nn.Linear(512 * block.expansion, num_classes)for m in self.modules():if isinstance(m, nn.Conv2d):nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')def _make_layer(self, block, channel, block_num, stride=1):downsample = Noneif stride != 1 or self.in_channel != channel * block.expansion:downsample = nn.Sequential(nn.Conv2d(self.in_channel, channel * block.expansion, kernel_size=1, stride=stride, bias=False),nn.BatchNorm2d(channel * block.expansion))layers = []layers.append(block(self.in_channel, channel, downsample=downsample, stride=stride))self.in_channel = channel * block.expansionfor _ in range(1, block_num):layers.append(block(self.in_channel, channel))return nn.Sequential(*layers)def forward(self, x):x = self.conv1(x)x = self.bn1(x)x = self.relu(x)x = self.maxpool(x)x = self.layer1(x)x = self.layer2(x)x = self.layer3(x)x = self.layer4(x)if self.include_top:x = self.avgpool(x)x = torch.flatten(x, 1)x = self.fc(x)return xdef resnet50(num_classes=1000, include_top=True):return ResNet(Bottleneck, [3, 4, 6, 3], num_classes=num_classes, include_top=include_top)

ssd_model.py ResNet50+SSD结构搭建

import torch
from torch import nn, Tensor
from torch.jit.annotations import Listfrom .res50_backbone import resnet50
from .utils import dboxes300_coco, Encoder, PostProcessclass Backbone(nn.Module):def __init__(self, pretrain_path=None):super(Backbone, self).__init__()net = resnet50()self.out_channels = [1024, 512, 512, 256, 256, 256]   # 对应着每一个预测特征层的channelsif pretrain_path is not None:net.load_state_dict(torch.load(pretrain_path))self.feature_extractor = nn.Sequential(*list(net.children())[:7])   # 构建特征提取部分conv4_block1 = self.feature_extractor[-1][0]# 修改conv4_block1的步距，从2->1conv4_block1.conv1.stride = (1, 1)conv4_block1.conv2.stride = (1, 1)conv4_block1.downsample[0].stride = (1, 1)def forward(self, x):x = self.feature_extractor(x)return xclass SSD300(nn.Module):def __init__(self, backbone=None, num_classes=21):super(SSD300, self).__init__()if backbone is None:raise Exception("backbone is None")if not hasattr(backbone, "out_channels"):raise Exception("the backbone not has attribute: out_channel")self.feature_extractor = backboneself.num_classes = num_classes# out_channels = [1024, 512, 512, 256, 256, 256] for resnet50self._build_additional_features(self.feature_extractor.out_channels)self.num_defaults = [4, 6, 6, 6, 4, 4]location_extractors = []confidence_extractors = []# out_channels = [1024, 512, 512, 256, 256, 256] for resnet50for nd, oc in zip(self.num_defaults, self.feature_extractor.out_channels):# nd is number_default_boxes, oc is output_channellocation_extractors.append(nn.Conv2d(oc, nd * 4, kernel_size=3, padding=1))confidence_extractors.append(nn.Conv2d(oc, nd * self.num_classes, kernel_size=3, padding=1))self.loc = nn.ModuleList(location_extractors)self.conf = nn.ModuleList(confidence_extractors)self._init_weights()default_box = dboxes300_coco()self.compute_loss = Loss(default_box)self.encoder = Encoder(default_box)self.postprocess = PostProcess(default_box)def _build_additional_features(self, input_size):"""为backbone(resnet50)添加额外的一系列卷积层，得到相应的一系列特征提取器:param input_size::return:"""additional_blocks = []# input_size = [1024, 512, 512, 256, 256, 256] for resnet50middle_channels = [256, 256, 128, 128, 128]for i, (input_ch, output_ch, middle_ch) in enumerate(zip(input_size[:-1], input_size[1:], middle_channels)):padding, stride = (1, 2) if i < 3 else (0, 1)layer = nn.Sequential(nn.Conv2d(input_ch, middle_ch, kernel_size=1, bias=False),nn.BatchNorm2d(middle_ch),nn.ReLU(inplace=True),nn.Conv2d(middle_ch, output_ch, kernel_size=3, padding=padding, stride=stride, bias=False),nn.BatchNorm2d(output_ch),nn.ReLU(inplace=True),)additional_blocks.append(layer)self.additional_blocks = nn.ModuleList(additional_blocks)def _init_weights(self):layers = [*self.additional_blocks, *self.loc, *self.conf]for layer in layers:for param in layer.parameters():if param.dim() > 1:nn.init.xavier_uniform_(param)# Shape the classifier to the view of bboxesdef bbox_view(self, features, loc_extractor, conf_extractor):locs = []confs = []for f, l, c in zip(features, loc_extractor, conf_extractor):# [batch, n*4, feat_size, feat_size] -> [batch, 4, -1]locs.append(l(f).view(f.size(0), 4, -1))# [batch, n*classes, feat_size, feat_size] -> [batch, classes, -1]confs.append(c(f).view(f.size(0), self.num_classes, -1))locs, confs = torch.cat(locs, 2).contiguous(), torch.cat(confs, 2).contiguous()return locs, confsdef forward(self, image, targets=None):x = self.feature_extractor(image)# Feature Map 38x38x1024, 19x19x512, 10x10x512, 5x5x256, 3x3x256, 1x1x256detection_features = torch.jit.annotate(List[Tensor], [])  # [x]detection_features.append(x)for layer in self.additional_blocks:x = layer(x)detection_features.append(x)# Feature Map 38x38x4, 19x19x6, 10x10x6, 5x5x6, 3x3x4, 1x1x4locs, confs = self.bbox_view(detection_features, self.loc, self.conf)# For SSD 300, shall return nbatch x 8732 x {nlabels, nlocs} results# 38x38x4 + 19x19x6 + 10x10x6 + 5x5x6 + 3x3x4 + 1x1x4 = 8732if self.training:if targets is None:raise ValueError("In training mode, targets should be passed")# bboxes_out (Tensor 8732 x 4), labels_out (Tensor 8732)bboxes_out = targets['boxes']bboxes_out = bboxes_out.transpose(1, 2).contiguous()# print(bboxes_out.is_contiguous())labels_out = targets['labels']# print(labels_out.is_contiguous())# ploc, plabel, gloc, glabelloss = self.compute_loss(locs, confs, bboxes_out, labels_out)return {"total_losses": loss}# 将预测回归参数叠加到default box上得到最终预测box，并执行非极大值抑制虑除重叠框# results = self.encoder.decode_batch(locs, confs)results = self.postprocess(locs, confs)return resultsclass Loss(nn.Module):"""Implements the loss as the sum of the followings:1. Confidence Loss: All labels, with hard negative mining2. Localization Loss: Only on positive labelsSuppose input dboxes has the shape 8732x4"""def __init__(self, dboxes):super(Loss, self).__init__()# Two factor are from following links# http://jany.st/post/2017-11-05-single-shot-detector-ssd-from-scratch-in-tensorflow.htmlself.scale_xy = 1.0 / dboxes.scale_xy  # 10self.scale_wh = 1.0 / dboxes.scale_wh  # 5self.location_loss = nn.SmoothL1Loss(reduction='none')# [num_anchors, 4] -> [4, num_anchors] -> [1, 4, num_anchors]self.dboxes = nn.Parameter(dboxes(order="xywh").transpose(0, 1).unsqueeze(dim=0),requires_grad=False)self.confidence_loss = nn.CrossEntropyLoss(reduction='none')def _location_vec(self, loc):# type: (Tensor) -> Tensor"""Generate Location Vectors计算ground truth相对anchors的回归参数:param loc: anchor匹配到的对应GTBOX Nx4x8732:return:"""gxy = self.scale_xy * (loc[:, :2, :] - self.dboxes[:, :2, :]) / self.dboxes[:, 2:, :]  # Nx2x8732gwh = self.scale_wh * (loc[:, 2:, :] / self.dboxes[:, 2:, :]).log()  # Nx2x8732return torch.cat((gxy, gwh), dim=1).contiguous()def forward(self, ploc, plabel, gloc, glabel):# type: (Tensor, Tensor, Tensor, Tensor) -> Tensor"""ploc, plabel: Nx4x8732, Nxlabel_numx8732predicted location and labelsgloc, glabel: Nx4x8732, Nx8732ground truth location and labels"""# 获取正样本的mask  Tensor: [N, 8732]mask = torch.gt(glabel, 0)  # (gt: >)# mask1 = torch.nonzero(glabel)# 计算一个batch中的每张图片的正样本个数 Tensor: [N]pos_num = mask.sum(dim=1)# 计算gt的location回归参数 Tensor: [N, 4, 8732]vec_gd = self._location_vec(gloc)# sum on four coordinates, and mask# 计算定位损失(只有正样本)loc_loss = self.location_loss(ploc, vec_gd).sum(dim=1)  # Tensor: [N, 8732]loc_loss = (mask.float() * loc_loss).sum(dim=1)  # Tenosr: [N]# hard negative mining Tenosr: [N, 8732]con = self.confidence_loss(plabel, glabel)# positive mask will never selected# 获取负样本con_neg = con.clone()con_neg[mask] = 0.0# 按照confidence_loss降序排列 con_idx(Tensor: [N, 8732])_, con_idx = con_neg.sort(dim=1, descending=True)_, con_rank = con_idx.sort(dim=1)  # 这个步骤比较巧妙# number of negative three times positive# 用于损失计算的负样本数是正样本的3倍（在原论文Hard negative mining部分），# 但不能超过总样本数8732neg_num = torch.clamp(3 * pos_num, max=mask.size(1)).unsqueeze(-1)neg_mask = torch.lt(con_rank, neg_num)  # (lt: <) Tensor [N, 8732]# confidence最终loss使用选取的正样本loss+选取的负样本losscon_loss = (con * (mask.float() + neg_mask.float())).sum(dim=1)  # Tensor [N]# avoid no object detected# 避免出现图像中没有GTBOX的情况total_loss = loc_loss + con_loss# eg. [15, 3, 5, 0] -> [1.0, 1.0, 1.0, 0.0]num_mask = torch.gt(pos_num, 0).float()  # 统计一个batch中的每张图像中是否存在正样本pos_num = pos_num.float().clamp(min=1e-6)  # 防止出现分母为零的情况ret = (total_loss * num_mask / pos_num).mean(dim=0)  # 只计算存在正样本的图像损失return ret

8、Default Box的生成原理

、训练自己的SSD目标检测模型

ResNet50官方预训练权重
预训练权重下载地址（下载后放入src文件夹中）：
ResNet50+SSD: https://ngc.nvidia.com/catalog/models
搜索ssd -> 找到SSD for PyTorch(FP32) -> download FP32 -> 解压文件

目标检测 pytorch复现SSD目标检测项目

目标检测 pytorch复现SSD目标检测项目