PaddleOCR usage notes 3 (meter recognition)


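Key points:

Text recognition

1 Text recognition algorithm theory

This chapter covers the theory behind text recognition algorithms: background, how the algorithms are categorized, and the ideas behind several classic papers.

After this chapter you will understand:

the goal of text recognition
how text recognition algorithms are categorized
the typical ideas behind each category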

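1.1 Background

Text recognition is a subtask of OCR (Optical Character Recognition): it recognizes the text content of a given image region. In two-stage OCR pipelines it follows text detection and turns image information into text.

Concretely, the model takes an already located text-line image as input and predicts the text it contains together with a confidence score.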

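Text recognition has many applications, such as document recognition, road-sign recognition, license-plate recognition, and industrial serial-number recognition. Depending on the scenario, the task falls into two broad categories: regular text recognition and irregular text recognition.

Regular text recognition: mainly printed fonts, scanned text, and similar material, where the text lies roughly along a horizontal line.
Irregular text recognition: common in natural scenes, where text varies greatly in curvature, orientation, and deformation; it is often not horizontal and may be curved, occluded, or blurred.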

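Samples from IC15 and IC13 represent irregular and regular text respectively. Irregular text often suffers from distortion, blur, and large font variation; it is closer to real-world conditions and more challenging.

For this reason, most current algorithms compete for higher scores on irregular-text datasets.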

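When comparing recognition algorithms, results are usually reported on public datasets from both categories, and the commonly used English evaluation sets are likewise grouped into regular and irregular text.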

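1.2 Categories of text recognition algorithms

Traditional text recognition splits the task into three steps: image preprocessing, character segmentation, and character recognition. These methods model one specific scenario and break down as soon as the scenario changes. With complex text backgrounds and changing scenes, deep-learning methods perform better.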

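Most existing recognition algorithms can be described by a unified framework that divides the pipeline into four stages: image transformation (Transform), feature extraction (Backbone), sequence modeling (Neck), and prediction (Head).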

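1 Text recognition in practice

The theory chapter introduced the main approaches to text recognition. CRNN was one of the earliest to be proposed and is still one of the most widely used in industry. This chapter explains how to build, train, evaluate, and run a CRNN text recognition model with PaddleOCR. The dataset is ICDAR 2015, with 4,468 training images and 2,077 test images.

After this chapter you will know:

how to run text recognition quickly with the PaddleOCR whl package
the basic principle and network structure of CRNN
the essential steps of model training and how to tune it
how to train the network on a custom dataset

Note: "paddleocr" refers to the PaddleOCR whl package.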

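1.1 Installing dependencies and the whl package

First make sure paddle and paddleocr are installed; skip this step if they already are.

# Install the GPU build of PaddlePaddle
!pip install paddlepaddle-gpu
# Install the PaddleOCR whl package
!pip install -U pip
!pip install paddleocr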

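1.2 Quick text prediction

The PaddleOCR whl package automatically downloads the lightweight PP-OCR models and uses them by default.

The following shows how to run recognition with the whl package.

Test image (word_19.png):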

from paddleocr import PaddleOCR

ocr = PaddleOCR()  # need to run only once to download and load model into memory
img_path = '/home/aistudio/work/word_19.png'
# det=False: skip text detection and run recognition only on the whole image
result = ocr.ocr(img_path, det=False)
for line in result:
    print(line)

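Running the code above returns the recognized text and its confidence:

('SLOW', 0.9776376)

You now know how to run prediction with the paddleocr whl package. More test images are available under ./work/; try them as well.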

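1.2 How prediction works

The quick prediction above used paddleocr to load a trained CRNN recognition model. This section explains CRNN's principle and pipeline in detail.

1.2.1 Category

CRNN is a CTC-based algorithm and sits in the CTC-based branch of the taxonomy from the theory part. CRNN mainly targets regular text, and CTC-based algorithms predict quickly and handle long text well, which is why PP-OCR chose CRNN as its Chinese recognition algorithm.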

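1.2.3 Code implementation

The whole network is compact and so is its code; the modules can be built one by one in the order data flows through prediction. This section covers data input, the backbone, the neck, and the head.

[Data input]

Before entering the network, images are resized to a uniform size (3, 32, 320) and normalized. The data augmentation used during training is omitted here; a single image illustrates the required preprocessing steps (source):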

import cv2
import math
import numpy as np


def resize_norm_img(img):
    """Resize and normalize an image.
    :param img: input image (H, W, C)
    """
    # Default input size
    imgC = 3
    imgH = 32
    imgW = 320
    # Actual height and width of the image
    h, w = img.shape[:2]
    # Actual aspect ratio
    ratio = w / float(h)
    # Resize while keeping the aspect ratio
    if math.ceil(imgH * ratio) > imgW:
        # Wider than the default width: use imgW
        resized_w = imgW
    else:
        # Otherwise use the width implied by the real aspect ratio
        resized_w = int(math.ceil(imgH * ratio))
    resized_image = cv2.resize(img, (resized_w, imgH))
    resized_image = resized_image.astype('float32')
    # Normalize: HWC -> CHW, scale to [0, 1], then to [-1, 1]
    resized_image = resized_image.transpose((2, 0, 1)) / 255
    resized_image -= 0.5
    resized_image /= 0.5
    # Zero-pad the columns beyond resized_w
    padding_im = np.zeros((imgC, imgH, imgW), dtype=np.float32)
    padding_im[:, :, 0:resized_w] = resized_image
    # Transpose the padded image back to HWC for visualization
    draw_img = padding_im.transpose((1, 2, 0))
    return padding_im, draw_img

import matplotlib.pyplot as plt
# Read the image
raw_img = cv2.imread("/home/aistudio/work/word_1.png")
plt.figure()
plt.subplot(2,1,1)
# Show the original image
plt.imshow(raw_img)
# Resize and normalize
padding_im, draw_img = resize_norm_img(raw_img)
plt.subplot(2,1,2)
# Show the network input
plt.imshow(draw_img)
plt.show()
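The width logic in resize_norm_img can be checked without OpenCV. The sketch below uses a hypothetical helper, `target_width` (not part of PaddleOCR), that reproduces only the arithmetic: scale the height to 32, keep the aspect ratio, cap the width at 320, and report how many columns end up zero-padded.

```python
import math


def target_width(h, w, img_h=32, img_w=320):
    """Return (resized_w, pad_w) as computed by resize_norm_img's width logic."""
    ratio = w / float(h)
    if math.ceil(img_h * ratio) > img_w:
        # Too wide: squeeze to the maximum width, nothing left to pad
        resized_w = img_w
    else:
        # Keep the aspect ratio; the remaining columns are zero padding
        resized_w = int(math.ceil(img_h * ratio))
    return resized_w, img_w - resized_w


print(target_width(64, 200))   # (100, 220)
print(target_width(20, 400))   # (320, 0)
```

Very long text lines are therefore squeezed horizontally to 320 columns rather than cropped.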

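[Network structure]

backbone

PaddleOCR uses MobileNetV3 as the backbone, and the code builds the layers in the same order as the network structure. First, define the shared building blocks (source): ConvBNLayer, SEModule, ResidualUnit, and make_divisible.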

import paddle
import paddle.nn as nn
import paddle.nn.functional as F


class ConvBNLayer(nn.Layer):
    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size,
                 stride,
                 padding,
                 groups=1,
                 if_act=True,
                 act=None):
        """Convolution + BatchNorm layer.
        :param in_channels: number of input channels
        :param out_channels: number of output channels
        :param kernel_size: kernel size
        :param stride: stride
        :param padding: padding
        :param groups: number of groups of the 2D convolution
        :param if_act: whether to apply an activation
        :param act: activation function name
        """
        super(ConvBNLayer, self).__init__()
        self.if_act = if_act
        self.act = act
        self.conv = nn.Conv2D(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            groups=groups,
            bias_attr=False)
        self.bn = nn.BatchNorm(num_channels=out_channels, act=None)

    def forward(self, x):
        # convolution
        x = self.conv(x)
        # batch normalization
        x = self.bn(x)
        # optional activation
        if self.if_act:
            if self.act == "relu":
                x = F.relu(x)
            elif self.act == "hardswish":
                x = F.hardswish(x)
            else:
                print("The activation function({}) is selected incorrectly.".format(self.act))
                exit()
        return x


class SEModule(nn.Layer):
    def __init__(self, in_channels, reduction=4):
        """Squeeze-and-Excitation module.
        :param in_channels: number of input channels
        :param reduction: channel reduction ratio
        """
        super(SEModule, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2D(1)
        self.conv1 = nn.Conv2D(
            in_channels=in_channels,
            out_channels=in_channels // reduction,
            kernel_size=1,
            stride=1,
            padding=0)
        self.conv2 = nn.Conv2D(
            in_channels=in_channels // reduction,
            out_channels=in_channels,
            kernel_size=1,
            stride=1,
            padding=0)

    def forward(self, inputs):
        # global average pooling
        outputs = self.avg_pool(inputs)
        # first 1x1 convolution
        outputs = self.conv1(outputs)
        # relu activation
        outputs = F.relu(outputs)
        # second 1x1 convolution
        outputs = self.conv2(outputs)
        # hardsigmoid gives per-channel weights in [0, 1]
        outputs = F.hardsigmoid(outputs, slope=0.2, offset=0.5)
        return inputs * outputs


class ResidualUnit(nn.Layer):
    def __init__(self,
                 in_channels,
                 mid_channels,
                 out_channels,
                 kernel_size,
                 stride,
                 use_se,
                 act=None):
        """Inverted residual block.
        :param in_channels: number of input channels
        :param mid_channels: number of expanded (middle) channels
        :param out_channels: number of output channels
        :param kernel_size: kernel size of the depthwise convolution
        :param stride: stride
        :param use_se: whether to use the SE module
        :param act: activation function name
        """
        super(ResidualUnit, self).__init__()
        self.if_shortcut = stride == 1 and in_channels == out_channels
        self.if_se = use_se
        # 1x1 expansion
        self.expand_conv = ConvBNLayer(
            in_channels=in_channels, out_channels=mid_channels,
            kernel_size=1, stride=1, padding=0, if_act=True, act=act)
        # depthwise convolution (groups == channels)
        self.bottleneck_conv = ConvBNLayer(
            in_channels=mid_channels, out_channels=mid_channels,
            kernel_size=kernel_size, stride=stride,
            padding=int((kernel_size - 1) // 2), groups=mid_channels,
            if_act=True, act=act)
        if self.if_se:
            self.mid_se = SEModule(mid_channels)
        # 1x1 linear projection (no activation)
        self.linear_conv = ConvBNLayer(
            in_channels=mid_channels, out_channels=out_channels,
            kernel_size=1, stride=1, padding=0, if_act=False, act=None)

    def forward(self, inputs):
        x = self.expand_conv(inputs)
        x = self.bottleneck_conv(x)
        if self.if_se:
            x = self.mid_se(x)
        x = self.linear_conv(x)
        if self.if_shortcut:
            x = paddle.add(inputs, x)
        return x


def make_divisible(v, divisor=8, min_value=None):
    """Round v to a multiple of divisor (8 by default) without losing more than 10%."""
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v

Build the backbone from these blocks:

class MobileNetV3(nn.Layer):
    def __init__(self,
                 in_channels=3,
                 model_name='small',
                 scale=0.5,
                 small_stride=None,
                 disable_se=False,
                 **kwargs):
        super(MobileNetV3, self).__init__()
        self.disable_se = disable_se
        if small_stride is None:
            small_stride = [1, 2, 2, 2]
        if model_name == "small":
            cfg = [
                # k: kernel size, exp: expanded channels, c: output channels,
                # se: use SE, nl: nonlinearity, s: stride (height, width)
                # k, exp, c,  se,     nl,  s,
                [3, 16, 16, True, 'relu', (small_stride[0], 1)],
                [3, 72, 24, False, 'relu', (small_stride[1], 1)],
                [3, 88, 24, False, 'relu', 1],
                [5, 96, 40, True, 'hardswish', (small_stride[2], 1)],
                [5, 240, 40, True, 'hardswish', 1],
                [5, 240, 40, True, 'hardswish', 1],
                [5, 120, 48, True, 'hardswish', 1],
                [5, 144, 48, True, 'hardswish', 1],
                [5, 288, 96, True, 'hardswish', (small_stride[3], 1)],
                [5, 576, 96, True, 'hardswish', 1],
                [5, 576, 96, True, 'hardswish', 1],
            ]
            cls_ch_squeeze = 576
        else:
            raise NotImplementedError("mode[" + model_name +
                                      "_model] is not implemented!")

        supported_scale = [0.35, 0.5, 0.75, 1.0, 1.25]
        assert scale in supported_scale, \
            "supported scales are {} but input scale is {}".format(supported_scale, scale)

        inplanes = 16
        # conv1
        self.conv1 = ConvBNLayer(
            in_channels=in_channels,
            out_channels=make_divisible(inplanes * scale),
            kernel_size=3,
            stride=2,
            padding=1,
            groups=1,
            if_act=True,
            act='hardswish')
        block_list = []
        inplanes = make_divisible(inplanes * scale)
        for (k, exp, c, se, nl, s) in cfg:
            se = se and not self.disable_se
            block_list.append(
                ResidualUnit(
                    in_channels=inplanes,
                    mid_channels=make_divisible(scale * exp),
                    out_channels=make_divisible(scale * c),
                    kernel_size=k,
                    stride=s,
                    use_se=se,
                    act=nl))
            inplanes = make_divisible(scale * c)
        self.blocks = nn.Sequential(*block_list)
        self.conv2 = ConvBNLayer(
            in_channels=inplanes,
            out_channels=make_divisible(scale * cls_ch_squeeze),
            kernel_size=1,
            stride=1,
            padding=0,
            groups=1,
            if_act=True,
            act='hardswish')
        self.pool = nn.MaxPool2D(kernel_size=2, stride=2, padding=0)
        self.out_channels = make_divisible(scale * cls_ch_squeeze)

    def forward(self, x):
        x = self.conv1(x)
        x = self.blocks(x)
        x = self.conv2(x)
        x = self.pool(x)
        return x

That completes the backbone. paddle.summary can be used to visualize the whole network structure:

# Define the network input shape
IMAGE_SHAPE_C = 3
IMAGE_SHAPE_H = 32
IMAGE_SHAPE_W = 320
# Visualize the network structure
paddle.summary(MobileNetV3(),[(1, IMAGE_SHAPE_C, IMAGE_SHAPE_H, IMAGE_SHAPE_W)])

# Feed the image into the backbone
backbone = MobileNetV3()
# Convert the numpy array to a Tensor
input_data = paddle.to_tensor([padding_im])
# Backbone output
feature = backbone(input_data)
# Check the shape of the feature map
print("backbone output:", feature.shape)
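Why does the backbone output have height 1 and width 80 for a 32x320 input? The sketch below traces the spatial size through the layers defined above using the standard convolution output formula, out = floor((in + 2p - k) / s) + 1. The helper names are hypothetical; the layer settings are copied from the "small" config with scale=0.5 and the default strides.

```python
def conv_out(size, k, s, p):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * p - k) // s + 1


def make_divisible(v, divisor=8, min_value=None):
    # Same rounding rule as in the backbone code above
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v


def backbone_output_shape(h=32, w=320, scale=0.5):
    # conv1: 3x3 kernel, stride 2, padding 1 on both axes
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    # (kernel, height stride) of each residual block; the width stride is always 1
    blocks = [(3, 1), (3, 2), (3, 1), (5, 2), (5, 1), (5, 1),
              (5, 1), (5, 1), (5, 2), (5, 1), (5, 1)]
    for k, sh in blocks:
        p = (k - 1) // 2
        h, w = conv_out(h, k, sh, p), conv_out(w, k, 1, p)
    # conv2 is 1x1 and keeps the size; the 2x2 max pool halves both axes
    h, w = conv_out(h, 2, 2, 0), conv_out(w, 2, 2, 0)
    return make_divisible(scale * 576), h, w


print(backbone_output_shape())  # (288, 1, 80)
```

The neck's in_channels=288 and the 80 time steps seen later both follow from these settings.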

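neck

The neck turns the visual feature map produced by the backbone into a sequence of feature vectors, feeds it into an LSTM, and outputs sequence features (source):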

class Im2Seq(nn.Layer):
    def __init__(self, in_channels, **kwargs):
        """Convert image features into sequence features.
        :param in_channels: number of input channels
        """
        super().__init__()
        self.out_channels = in_channels

    def forward(self, x):
        B, C, H, W = x.shape
        # The backbone must have reduced the height to 1
        assert H == 1
        x = x.squeeze(axis=2)
        x = x.transpose([0, 2, 1])  # (NWC)(batch, width, channels)
        return x


class EncoderWithRNN(nn.Layer):
    def __init__(self, in_channels, hidden_size):
        super(EncoderWithRNN, self).__init__()
        # A bidirectional LSTM concatenates both directions, doubling the width
        self.out_channels = hidden_size * 2
        self.lstm = nn.LSTM(in_channels, hidden_size, direction='bidirectional', num_layers=2)

    def forward(self, x):
        x, _ = self.lstm(x)
        return x


class SequenceEncoder(nn.Layer):
    def __init__(self, in_channels, hidden_size=48, **kwargs):
        """Sequence encoder.
        :param in_channels: number of input channels
        :param hidden_size: LSTM hidden size
        """
        super(SequenceEncoder, self).__init__()
        self.encoder_reshape = Im2Seq(in_channels)
        self.encoder = EncoderWithRNN(self.encoder_reshape.out_channels, hidden_size)
        self.out_channels = self.encoder.out_channels

    def forward(self, x):
        x = self.encoder_reshape(x)
        x = self.encoder(x)
        return x

neck = SequenceEncoder(in_channels=288)
sequence = neck(feature)
print("sequence shape:", sequence.shape)
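Im2Seq has no learnable weights: it drops the height axis (which must be 1) and swaps channels and width, so each of the W columns becomes one time step. The plain-Python sketch below (a hypothetical helper working on nested lists instead of tensors) performs the same rearrangement. After the bidirectional LSTM with hidden_size=48, each time step carries 2 x 48 = 96 features, which matches the head's in_channels=96 below.

```python
def im2seq(x):
    """[B][C][1][W] nested list -> [B][W][C], mirroring Im2Seq.forward."""
    out = []
    for sample in x:
        assert all(len(ch) == 1 for ch in sample), "height must be 1"
        channels = [ch[0] for ch in sample]      # squeeze the height axis: [C][W]
        width = len(channels[0])
        # transpose so that every width position becomes one time step: [W][C]
        out.append([[channels[c][t] for c in range(len(channels))]
                    for t in range(width)])
    return out


# One sample, 2 channels, height 1, width 3
x = [[[[1, 2, 3]], [[10, 20, 30]]]]
print(im2seq(x))  # [[[1, 10], [2, 20], [3, 30]]]
```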

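head

The prediction head consists of a fully connected layer and a softmax and computes the label probability distribution at each time step of the sequence features. This example only recognizes lowercase English letters and digits, 36 (26 + 10) characters; CTC adds one blank class, so the head outputs 37 classes (source):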

class CTCHead(nn.Layer):
    def __init__(self, in_channels, out_channels, **kwargs):
        """CTC prediction layer.
        :param in_channels: number of input channels
        :param out_channels: number of output channels
        """
        super(CTCHead, self).__init__()
        self.fc = nn.Linear(in_channels, out_channels)
        # out_channels = number of characters + 1 CTC blank (36 + 1 = 37 here)
        self.out_channels = out_channels

    def forward(self, x):
        predicts = self.fc(x)
        result = predicts
        # Softmax only at inference; training feeds raw logits to the CTC loss
        if not self.training:
            predicts = F.softmax(predicts, axis=2)
            result = predicts
        return result
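The head's out_channels comes from the character dictionary plus one CTC blank. A minimal sketch, assuming the blank sits at index 0 (as the decode function below does with its '-' placeholder) and a dictionary of digits followed by lowercase letters, as used in this example; `build_charset` is a hypothetical helper.

```python
import string


def build_charset(dict_chars):
    """Prepend the CTC blank (index 0) to the dictionary characters."""
    return ["[blank]"] + list(dict_chars)


dict_chars = string.digits + string.ascii_lowercase   # 10 + 26 = 36 characters
charset = build_charset(dict_chars)
num_classes = len(charset)
print(num_classes)            # 37 -> CTCHead(out_channels=37)
print(charset.index("a"))     # 11
```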

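With randomly initialized weights the output is meaningless. After softmax, taking the most probable class at each time step gives the prediction, where pred_id is the predicted label ID and pred_scores is the confidence of each prediction: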

ctc_head = CTCHead(in_channels=96, out_channels=37)
predict = ctc_head(sequence)
print("predict shape:", predict.shape)
result = F.softmax(predict, axis=2)
pred_id = paddle.argmax(result, axis=2)
pred_scores = paddle.max(result, axis=2)
print("pred_id:", pred_id)
print("pred_scores:", pred_scores)
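What softmax followed by argmax/max does at each time step can be reproduced in plain Python. The sketch below uses a hypothetical helper that takes one sample's logits as a nested list of shape [T][C] and returns the same two quantities as the pred_id and score tensors computed above.

```python
import math


def greedy_step_predictions(logits):
    """For each time step, return (argmax class id, its softmax probability)."""
    ids, scores = [], []
    for step in logits:
        m = max(step)                                # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in step]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        ids.append(best)
        scores.append(probs[best])
    return ids, scores


logits = [[2.0, 0.5, 0.1],    # step 0: class 0 (blank) wins
          [0.0, 3.0, 0.0]]    # step 1: class 1 wins
ids, scores = greedy_step_predictions(logits)
print(ids)   # [0, 1]
```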

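Post-processing

The network returns the index with the highest probability at each time step, but the desired output is the corresponding text. CRNN's post-processing is therefore a decoding step, whose main logic is: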

def decode(text_index, text_prob=None, is_remove_duplicate=False):
    """Convert text indices into text labels."""
    character = "-0123456789abcdefghijklmnopqrstuvwxyz"
    # Accept paddle Tensors as well as numpy arrays or lists
    if hasattr(text_index, "numpy"):
        text_index = text_index.numpy()
    if text_prob is not None and hasattr(text_prob, "numpy"):
        text_prob = text_prob.numpy()
    result_list = []
    # Token 0 is the CTC blank and is ignored
    ignored_tokens = [0]
    batch_size = len(text_index)
    for batch_idx in range(batch_size):
        char_list = []
        conf_list = []
        for idx in range(len(text_index[batch_idx])):
            if text_index[batch_idx][idx] in ignored_tokens:
                continue
            # Merge repeated characters that are not separated by a blank
            if is_remove_duplicate:
                # only for predict
                if idx > 0 and text_index[batch_idx][idx - 1] == text_index[batch_idx][idx]:
                    continue
            # Append the decoded character
            char_list.append(character[int(text_index[batch_idx][idx])])
            # Record the confidence
            if text_prob is not None:
                conf_list.append(text_prob[batch_idx][idx])
            else:
                conf_list.append(1)
        text = ''.join(char_list)
        # Output (text, mean confidence); an all-blank prediction gets 0.0
        result_list.append((text, np.mean(conf_list) if conf_list else 0.0))
    return result_list

Decoding the prediction of the randomly initialized head from above gives:

pred_id = paddle.argmax(result, axis=2)
pred_scores = paddle.max(result, axis=2)
print(pred_id)
decode_out = decode(pred_id, pred_scores)
print("decode out:", decode_out)

Quick check: if you feed in the indices predicted by a trained model, is the decoded result correct?

# Replace 'xxxxxxxxxxxxx' with the integer index sequence predicted by a trained model
right_pred_id = paddle.to_tensor([['xxxxxxxxxxxxx']])
tmp_scores = paddle.ones(shape=right_pred_id.shape)
out = decode(right_pred_id, tmp_scores)
print("out:",out)
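The quiz hinges on CTC's collapse rule: a trained model emits each character over several consecutive time steps, so repeated indices must be merged before blanks are dropped, and a genuine double letter needs a blank between its two copies. The pure-Python sketch below uses the same character string as decode; the index sequences are made up for illustration, not taken from a real model.

```python
CHARACTER = "-0123456789abcdefghijklmnopqrstuvwxyz"   # index 0 is the CTC blank


def ctc_greedy_decode(indices, remove_duplicate=True, blank=0):
    """Collapse a per-time-step index sequence into text."""
    chars = []
    prev = None
    for idx in indices:
        # keep a symbol only if it is not blank and (optionally) not a repeat
        if idx != blank and not (remove_duplicate and idx == prev):
            chars.append(CHARACTER[idx])
        prev = idx
    return "".join(chars)


slow = [29, 29, 0, 22, 0, 25, 25, 33]   # s s - l - o o w
print(ctc_greedy_decode(slow))                          # slow
print(ctc_greedy_decode(slow, remove_duplicate=False))  # ssloow
print(ctc_greedy_decode([12, 25, 0, 25, 21]))           # book
```

With its default is_remove_duplicate=False, the decode function above keeps the repeats and would print ssloow for such a sequence; passing is_remove_duplicate=True yields the intended word.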

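The steps above build the network and implement a simple forward prediction.

An untrained network cannot predict correctly, so we still need a loss function and an optimization strategy to actually train it. The training principle is explained in detail below.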

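1.3 Training in detail

1.3.1 Preparing training data

PaddleOCR supports two data formats:

lmdb: for datasets stored in lmdb format (LMDBDataSet);
general data: for datasets indexed by text files (SimpleDataSet);

Only the general data format is covered here.

The default training data path is ./train_data. Extract the data with:

!cd /home/aistudio/work/train_data/ && tar xf ic15_data.tar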

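After extraction, the training images are all in one folder, and a txt file (rec_gt_train.txt) records each image path and its label. Its content looks like this:

" image file name         image annotation "
train/word_1.png	Genaxis Theatre
train/word_2.png	[06]
...

Note: by default the txt file separates the image path and the label with \t; any other separator will cause errors during training.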
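Labels such as "Genaxis Theatre" contain spaces, which is why the separator must be a tab: splitting on whitespace would cut the label apart. Below is a sketch of a strict parser (a hypothetical helper, `parse_label_line`, not PaddleOCR code) that splits on the first tab only and rejects lines without one.

```python
def parse_label_line(line, delimiter="\t"):
    """Split 'path<TAB>label' into (path, label); the label may contain spaces."""
    line = line.rstrip("\n")
    if delimiter not in line:
        raise ValueError("no tab separator in line: {!r}".format(line))
    path, label = line.split(delimiter, 1)
    return path, label


print(parse_label_line("train/word_1.png\tGenaxis Theatre\n"))
# ('train/word_1.png', 'Genaxis Theatre')
```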

The dataset should have the following structure:

|-train_data
    |-ic15_data
        |- rec_gt_train.txt
        |- train
            |- word_001.png
            |- word_002.jpg
            |- word_003.jpg
            | ...
        |- rec_gt_test.txt
        |- test
            |- word_001.png
            |- word_002.jpg
            |- word_003.jpg
            | ...

Confirm that the data paths in the configuration file are correct, taking rec_icdar15_train.yml as an example.

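1.3.2 Data preprocessing

Training data fed into the network must have consistent dimensions within a batch, and features should be numerically comparable, so the data is resized to a uniform scale and normalized.

To make the model more robust, suppress overfitting, and improve generalization, some data augmentation is also needed.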

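Resizing and normalization

This was covered in 1.2.3 Code implementation; it is the last step before an image enters the network. Call resize_norm_img to resize, pad, and normalize the image.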
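The normalization in resize_norm_img maps each 8-bit pixel value x in [0, 255] to [-1, 1]:

```latex
x' = \frac{x/255 - 0.5}{0.5} = \frac{2x}{255} - 1,
\qquad x = 0 \mapsto -1, \quad x = 255 \mapsto 1 .
```

Because the padding is applied after normalization, the zero-filled columns correspond to a mid-gray pixel value of 127.5 rather than to black.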

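Data augmentation

PaddleOCR implements many augmentations, such as color inversion, random cropping, affine transforms, and random noise. Random cropping serves as a simple example here; see rec_img_aug.py for more.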

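import random


def get_crop(image):
    """Random crop: remove a few rows from the top or the bottom."""
    h, w, _ = image.shape
    top_min = 1
    top_max = 8
    # Number of rows to remove: 1 to 8, but always keep at least one row
    top_crop = int(random.randint(top_min, top_max))
    top_crop = min(top_crop, h - 1)
    crop_img = image.copy()
    ratio = random.randint(0, 1)
    if ratio:
        # crop from the top
        crop_img = crop_img[top_crop:h, :, :]
    else:
        # crop from the bottom
        crop_img = crop_img[0:h - top_crop, :, :]
    return crop_img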

# Read the image
raw_img = cv2.imread("/home/aistudio/work/word_1.png")
plt.figure()
plt.subplot(2,1,1)
# Show the original image
plt.imshow(raw_img)
# Random crop
crop_img = get_crop(raw_img)
plt.subplot(2,1,2)
# Show the augmented image
plt.imshow(crop_img)
plt.show()
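Augmentations are easier to debug when the random source can be fixed. The sketch below restates get_crop's logic on a list of image rows with an injectable random.Random; the helper name and the rows-as-list representation are assumptions for illustration. It removes 1 to 8 rows from either the top or the bottom and always keeps at least one row.

```python
import random


def crop_rows(rows, rng, top_min=1, top_max=8):
    """Drop 1..8 rows from the top or the bottom, following get_crop."""
    h = len(rows)
    top_crop = min(rng.randint(top_min, top_max), h - 1)
    if rng.randint(0, 1):
        return rows[top_crop:]          # crop from the top
    return rows[:h - top_crop]          # crop from the bottom


rows = list(range(32))                  # stand-in for a 32-row image
cropped = crop_rows(rows, random.Random(0))
print(len(rows) - len(cropped))         # rows removed, between 1 and 8
```

Passing the same seed reproduces the same crop, which helps when comparing augmented samples across runs.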

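1.3.3 Main training program

The training entry point is train.py. It shows every module training needs: build dataloader, build post process, build model, build loss, build optim, and build metric. Chain them together and training can start.

Building the dataloader

Training groups the data into batches of a given size and yields them one by one. This example uses SimpleDataSet, which is implemented in PaddleOCR.

Slightly modified from the original code, its main logic for returning a single sample is: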

import os


def __getitem__(data_line, data_dir):
    delimiter = '\t'
    try:
        substr = data_line.strip("\n").split(delimiter)
        file_name = substr[0]
        label = substr[1]
        img_path = os.path.join(data_dir, file_name)
        data = {'img_path': img_path, 'label': label}
        if not os.path.exists(img_path):
            raise Exception("{} does not exist!".format(img_path))
        # Read the raw image bytes
        with open(data['img_path'], 'rb') as f:
            img = f.read()
            data['image'] = img
        # Preprocessing ops, commented out here
        # outs = transform(data, self.ops)
        outs = data
    except Exception as e:
        print("When parsing line {}, error happened with msg: {}".format(data_line, e))
        outs = None
    return outs

Suppose the current label line is train/word_1.png Genaxis Theatre and the training data lives in /home/aistudio/work/train_data/ic15_data/. The parsed result is a dictionary with three fields, img_path, label, and image:

data_line = "train/word_1.png\tGenaxis Theatre"
data_dir = "/home/aistudio/work/train_data/ic15_data/"
item = __getitem__(data_line, data_dir)
print(item)

With the single-sample logic in place, paddle.io.DataLoader combines the data into batches; see build_dataloader for details.
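The CTC loss below reads batch[1] (labels) and batch[2] (label lengths), so before batching each text label has to become a fixed-length list of class indices. The sketch below shows one way to do that under stated assumptions: the same 37-symbol character set as decode (blank at index 0), lowercase conversion, characters outside the dictionary dropped, and zero padding to a hypothetical maximum length of 25. PaddleOCR performs this step in its own label-encoding transform; this is not that code.

```python
CHARACTER = "-0123456789abcdefghijklmnopqrstuvwxyz"   # index 0 is the CTC blank
CHAR_TO_ID = {c: i for i, c in enumerate(CHARACTER) if i > 0}


def encode_label(text, max_len=25):
    """Return (padded index list, true length), or None if the label is unusable."""
    ids = [CHAR_TO_ID[c] for c in text.lower() if c in CHAR_TO_ID]
    if not ids or len(ids) > max_len:
        return None                      # skip empty or over-long labels
    # Padding uses 0; the loss ignores it because it receives the true length
    return ids + [0] * (max_len - len(ids)), len(ids)


label, length = encode_label("Slow")
print(label[:6], length)    # [29, 22, 25, 33, 0, 0] 4
```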

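build model

build model constructs the main network structure described in 1.2.3 Code implementation, so it is not repeated here; see modeling for the code of each module.

build loss

The CRNN loss function is CTC loss. Paddle already provides common loss functions, so it only needs to be called: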

import paddle
import paddle.nn as nn


class CTCLoss(nn.Layer):
    def __init__(self, use_focal_loss=False, **kwargs):
        super(CTCLoss, self).__init__()
        # blank is CTC's "no character" symbol, at index 0
        self.loss_func = nn.CTCLoss(blank=0, reduction='none')

    def forward(self, predicts, batch):
        if isinstance(predicts, (list, tuple)):
            predicts = predicts[-1]
        # Transpose the head output from [B, T, C] to the time-major [T, B, C] layout
        predicts = predicts.transpose((1, 0, 2))  # e.g. [80, 1, 37]
        N, B, _ = predicts.shape
        preds_lengths = paddle.to_tensor([N] * B, dtype='int64')
        labels = batch[1].astype("int32")
        label_lengths = batch[2].astype('int64')
        # Per-sample loss, then the batch mean
        loss = self.loss_func(predicts, labels, preds_lengths, label_lengths)
        loss = loss.mean()
        return {'loss': loss}
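paddle.nn.CTCLoss takes unnormalized logits and sums the probability of every alignment that collapses to the target label. The pure-Python sketch below (hypothetical helpers, with per-step probabilities given directly instead of logits) computes that sum with the standard forward (alpha) recursion over the blank-extended label, and checks it against brute-force enumeration of all paths on a tiny example. The loss is the negative log of this probability.

```python
import math
from itertools import product


def collapse(path, blank=0):
    """CTC collapse: merge repeats, then drop blanks."""
    out, prev = [], None
    for c in path:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return out


def ctc_prob(probs, label, blank=0):
    """P(label | x) via the forward recursion; probs is a [T][C] list."""
    ext = [blank]
    for c in label:
        ext += [c, blank]                     # e.g. [1, 2] -> [0, 1, 0, 2, 0]
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][blank]             # a path may start with a blank...
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]        # ...or with the first label symbol
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s] + (alpha[t - 1][s - 1] if s >= 1 else 0.0)
            # skipping a blank is allowed only between two different symbols
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]
    # a path may end on the last symbol or on the trailing blank
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)


def brute_force_prob(probs, label, blank=0):
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        if collapse(path, blank) == list(label):
            total += math.prod(probs[t][c] for t, c in enumerate(path))
    return total


probs = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.6, 0.1, 0.3]]  # T=3, C=3, blank=0
p = ctc_prob(probs, [1, 2])
print(p, brute_force_prob(probs, [1, 2]))
print("CTC loss:", -math.log(p))
```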

build post process

This is also covered in detail in 1.2.3 Code implementation; the logic is the same as before.

build optim

The optimizer is Adam, again through the Paddle API: paddle.optimizer.Adam
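For reference, the update behind paddle.optimizer.Adam is essentially the standard Adam rule with bias-corrected first and second moment estimates. The scalar sketch below minimizes f(x) = (x - 3)^2 with the common defaults beta1 = 0.9, beta2 = 0.999, eps = 1e-8; the learning rate and step count are arbitrary choices for this toy problem, and the helper is hypothetical.

```python
import math


def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Scalar Adam with bias-corrected moment estimates."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g          # first moment (running mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (running mean of squares)
        m_hat = m / (1 - beta1 ** t)             # correct the bias from zero initialization
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3
x_star = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_star, 4))
```

Dividing by the square root of the second moment makes the step size roughly independent of the gradient's scale, which is one reason Adam is a convenient default here.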

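build metric

The metric part computes model metrics. In PaddleOCR text recognition, a prediction counts as correct only if the whole string matches, so the main accuracy logic is: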

def metric(preds, labels):
    correct_num = 0
    all_num = 0
    for pred, target in zip(preds, labels):
        # Spaces are ignored when comparing
        pred = pred.replace(" ", "")
        target = target.replace(" ", "")
        if pred == target:
            correct_num += 1
        all_num += 1
    return {'acc': correct_num / all_num}

preds = ["aaa", "bbb", "ccc", "123", "456"]
labels = ["aaa", "bbb", "ddd", "123", "444"]
acc = metric(preds, labels)
print("acc:", acc)
# 3 of the 5 predictions match exactly, so the accuracy should be 0.6
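Exact-match accuracy gives no credit for near misses. A common complementary score is based on the normalized edit distance. The sketch below assumes each pair's Levenshtein distance is divided by the longer string's length and reports 1 minus the mean; this normalization is an assumption, not necessarily PaddleOCR's exact formula.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def norm_edit_score(preds, labels):
    """1 - mean(edit distance / longer length); 1.0 means all predictions are exact."""
    total = 0.0
    for pred, target in zip(preds, labels):
        total += levenshtein(pred, target) / max(len(pred), len(target), 1)
    return 1 - total / max(len(preds), 1)


preds = ["aaa", "bbb", "ccc", "123", "456"]
labels = ["aaa", "bbb", "ddd", "123", "444"]
print(round(norm_edit_score(preds, labels), 4))  # 0.6667
```

Here "456" vs "444" earns partial credit (distance 2 of 3), whereas the accuracy metric counts it simply as wrong.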

Putting all the parts together gives the complete training flow (the main function of train.py):

import paddle
import paddle.distributed as dist

# Helpers from the PaddleOCR repository (run from the repository root)
from ppocr.data import build_dataloader
from ppocr.modeling.architectures import build_model
from ppocr.losses import build_loss
from ppocr.optimizer import build_optimizer
from ppocr.postprocess import build_post_process
from ppocr.metrics import build_metric
from ppocr.utils.save_load import load_model
import tools.program as program


def main(config, device, logger, vdl_writer):
    # init dist environment
    if config['Global']['distributed']:
        dist.init_parallel_env()

    global_config = config['Global']

    # build dataloader
    train_dataloader = build_dataloader(config, 'Train', device, logger)
    if len(train_dataloader) == 0:
        logger.error(
            "No Images in train dataset, please ensure\n" +
            "\t1. The images num in the train label_file_list should be larger than or equal with batch size.\n"
            + "\t2. The annotation file and path in the configuration file are provided normally.")
        return

    if config['Eval']:
        valid_dataloader = build_dataloader(config, 'Eval', device, logger)
    else:
        valid_dataloader = None

    # build post process
    post_process_class = build_post_process(config['PostProcess'], global_config)

    # build model
    # for rec algorithm: the head's out_channels follows the character set size
    if hasattr(post_process_class, 'character'):
        char_num = len(getattr(post_process_class, 'character'))
        if config['Architecture']["algorithm"] in ["Distillation", ]:  # distillation model
            for key in config['Architecture']["Models"]:
                config['Architecture']["Models"][key]["Head"]['out_channels'] = char_num
        else:  # base rec model
            config['Architecture']["Head"]['out_channels'] = char_num
    model = build_model(config['Architecture'])
    if config['Global']['distributed']:
        model = paddle.DataParallel(model)

    # build loss
    loss_class = build_loss(config['Loss'])

    # build optim
    optimizer, lr_scheduler = build_optimizer(
        config['Optimizer'],
        epochs=config['Global']['epoch_num'],
        step_each_epoch=len(train_dataloader),
        parameters=model.parameters())

    # build metric
    eval_class = build_metric(config['Metric'])
    # load pretrain model
    pre_best_model_dict = load_model(config, model, optimizer)
    logger.info('train dataloader has {} iters'.format(len(train_dataloader)))
    if valid_dataloader is not None:
        logger.info('valid dataloader has {} iters'.format(len(valid_dataloader)))

    use_amp = config["Global"].get("use_amp", False)
    if use_amp:
        AMP_RELATED_FLAGS_SETTING = {
            'FLAGS_cudnn_batchnorm_spatial_persistent': 1,
            'FLAGS_max_inplace_grad_add': 8,
        }
        paddle.fluid.set_flags(AMP_RELATED_FLAGS_SETTING)
        scale_loss = config["Global"].get("scale_loss", 1.0)
        use_dynamic_loss_scaling = config["Global"].get("use_dynamic_loss_scaling", False)
        scaler = paddle.amp.GradScaler(
            init_loss_scaling=scale_loss,
            use_dynamic_loss_scaling=use_dynamic_loss_scaling)
    else:
        scaler = None

    # start train
    program.train(config, train_dataloader, valid_dataloader, device, model,
                  loss_class, optimizer, lr_scheduler, post_process_class,
                  eval_class, pre_best_model_dict, logger, vdl_writer, scaler)

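1.4 Full training task

1.4.1 Starting training

As with detection, PaddleOCR recognition tasks take their parameters from a configuration file.

For full model training, first download the whole project and install its dependencies: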

# Clone the PaddleOCR code
#!git clone https://gitee.com/paddlepaddle/PaddleOCR
# Change the default working directory to /home/aistudio/PaddleOCR
import os
os.chdir("/home/aistudio/PaddleOCR")
# Install PaddleOCR's third-party dependencies
!pip install -r requirements.txt

Create a symlink so the training data sits inside the PaddleOCR project:

!ln -s /home/aistudio/work/train_data/ /home/aistudio/PaddleOCR/

Download the pretrained model:

To speed up convergence, it is recommended to download an already trained model and fine-tune it on the ICDAR 2015 data.

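# The working directory is already /home/aistudio/PaddleOCR (set by os.chdir above);
# a "!cd" command would only affect its own subshell
# Download the MobileNetV3 pretrained model
!wget -nc -P ./pretrain_models/ https://paddleocr.bj.bcebos.com/dygraph_v2.0/en/rec_mv3_none_bilstm_ctc_v2.0_train.tar
# Extract the model parameters
!tar -xf pretrain_models/rec_mv3_none_bilstm_ctc_v2.0_train.tar && rm -rf pretrain_models/rec_mv3_none_bilstm_ctc_v2.0_train.tar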

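Starting training is simple: just specify the configuration file. Values in the configuration file can also be overridden on the command line with -o. The training command is shown below,

where:

Global.pretrained_model: path of the pretrained model to load
Global.character_dict_path: dictionary path (here only the 26 lowercase letters and digits are supported)
Global.eval_batch_step: evaluation frequency
Global.epoch_num: total number of training epochs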

!python3 tools/train.py -c configs/rec/rec_icdar15_train.yml \
   -o Global.pretrained_model=rec_mv3_none_bilstm_ctc_v2.0_train/best_accuracy \
   Global.character_dict_path=ppocr/utils/ic15_dict.txt \
   Global.eval_batch_step=[0,200] \
   Global.epoch_num=40
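Each -o argument is a dotted key path plus a value, such as Global.epoch_num=40. The sketch below shows how such overrides can be merged into a nested configuration dictionary. apply_overrides is a hypothetical helper, and parsing values with ast.literal_eval (falling back to a plain string) is an assumption, not PaddleOCR's actual parser.

```python
import ast


def apply_overrides(config, overrides):
    """Merge 'A.B.C=value' strings into a nested dict in place."""
    for item in overrides:
        key, _, raw = item.partition("=")
        try:
            value = ast.literal_eval(raw)        # numbers, lists, booleans ...
        except (ValueError, SyntaxError):
            value = raw                          # paths and other plain strings
        node = config
        *parents, leaf = key.split(".")
        for name in parents:
            node = node.setdefault(name, {})     # create missing sections
        node[leaf] = value
    return config


# Illustrative starting values, not copied from the real config file
config = {"Global": {"epoch_num": 100, "eval_batch_step": [0, 2000]}}
apply_overrides(config, [
    "Global.epoch_num=40",
    "Global.eval_batch_step=[0,200]",
    "Global.character_dict_path=ppocr/utils/ic15_dict.txt",
])
print(config["Global"])
```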

Depending on the save_model_dir field in the configuration file, the following files are saved:

output/rec/ic15
├── best_accuracy.pdopt  
├── best_accuracy.pdparams  
├── best_accuracy.states  
├── config.yml  
├── iter_epoch_3.pdopt  
├── iter_epoch_3.pdparams  
├── iter_epoch_3.states  
├── latest.pdopt  
├── latest.pdparams  
├── latest.states  
└── train.log

best_accuracy.* is the best model on the evaluation set; iter_epoch_x.* are models saved every save_epoch_step epochs; latest.* is the model from the last epoch.

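Summary:

To train on your own data, modify:

the training and evaluation data paths (required)
the dictionary path (required)
the pretrained model (optional)
the learning rate, image shape, and network structure (optional)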

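1.4.2 Model evaluation

The evaluation dataset is set via label_file_list in the Eval section of configs/rec/rec_icdar15_train.yml.

By default the ICDAR 2015 evaluation set is used; load the weights just trained:

!python tools/eval.py -c configs/rec/rec_icdar15_train.yml -o Global.checkpoints=output/rec/ic15/best_accuracy \
   Global.character_dict_path=ppocr/utils/ic15_dict.txt

After evaluation you can see the trained model's accuracy on the validation set.

PaddleOCR supports alternating training and evaluation. Set the evaluation frequency with eval_batch_step in configs/rec/rec_icdar15_train.yml; by default evaluation runs every 2000 iterations. During evaluation, the model with the best acc is saved as output/rec/ic15/best_accuracy by default.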
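Global.eval_batch_step takes two numbers, [start, interval]. A sketch of the resulting schedule, assuming evaluation fires at global steps after start that lie a whole multiple of interval past it; the helper is hypothetical and reflects our reading of PaddleOCR's training loop rather than its code.

```python
def should_evaluate(global_step, eval_batch_step):
    """True when the training loop would run evaluation at this global step."""
    start, interval = eval_batch_step
    return global_step > start and (global_step - start) % interval == 0


# With -o Global.eval_batch_step=[0,200], as in the training command above
steps = [s for s in range(1, 1001) if should_evaluate(s, [0, 200])]
print(steps)  # [200, 400, 600, 800, 1000]
```

A larger start delays the first evaluation, which is useful when early checkpoints are not worth scoring.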

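If the validation set is large, evaluation takes a long time; consider evaluating less often, or evaluating only after training finishes.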

1.4.3 Prediction

A model trained with PaddleOCR can be used for quick prediction with the following script.

Predicting images:

By default, the image to predict is the one set in infer_img; load the trained parameter file with -o Global.checkpoints:

!python tools/infer_rec.py -c configs/rec/rec_icdar15_train.yml -o Global.checkpoints=output/rec/ic15/best_accuracy Global.character_dict_path=ppocr/utils/ic15_dict.txt

This gives the prediction for the input image:

infer_img: doc/imgs_words_en/word_19.png
result: slow    0.8795223
