
A Detailed Look at the SynthText Text Dataset


Contents

1. Official description of the dataset

2. Characteristics of the data

2.1 imnames

2.2 wordBB: word level

2.3 charBB: character-level bounding boxes

2.4 txt: text level


1. Official description of the dataset

SynthText in the Wild Dataset
-----------------------------
Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman
Visual Geometry Group, University of Oxford, 2016

Data format:
------------

SynthText.zip (size = 42074172 bytes (41GB)) contains 858,750 synthetic
scene-image files (.jpg) split into 200 directories, with
7,266,866 word-instances, and 28,971,487 characters.

Ground-truth annotations are contained in the file "gt.mat" (Matlab format).
The file "gt.mat" contains the following cell-arrays, each of size 1x858750:

  1. imnames :  names of the image files

  2. wordBB  :  word-level bounding-boxes for each image, represented by
                tensors of size 2x4xNWORDS_i, where:
                   - the first dimension is 2 for x and y respectively,
                   - the second dimension corresponds to the 4 points
                     (clockwise, starting from top-left), and
                   - the third dimension of size NWORDS_i, corresponds to
                     the number of words in the i_th image.

  3. charBB  :  character-level bounding-boxes,
                each represented by a tensor of size 2x4xNCHARS_i
                (format is same as wordBB's above)

  4. txt     :  text-strings contained in each image (char array).

                Words which belong to the same "instance", i.e.,
                those rendered in the same region with the same font, color,
                distortion etc., are grouped together; the instance
                boundaries are demarcated by the line-feed character (ASCII: 10)

                A "word" is any contiguous substring of non-whitespace
                characters.

                A "character" is defined as any non-whitespace character.

For any questions or comments, contact Ankush Gupta at:
removethisifyouarehuman-ankush@robots.ox.ac.uk

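2. Characteristics of the data

The dataset is organized as follows.

(1) The dataset is about 41 GB in total and holds 858,750 synthetic images in JPG format. According to the official description the images are split into 200 scene directories (that is, grouped by background image), although there are in fact 202 scenes. In total there are 7,266,866 word instances and 28,971,487 characters.

(2) The annotation file is in MAT format. Once loaded, it holds the following fields.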

2.1 imnames

Stores the relative path of each image file.
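A minimal sketch of loading the annotations and listing the image paths. It assumes SciPy's `scipy.io.loadmat` is used and that the 1 x N cell array arrives as a nested object array, so that entry `i` is reached as `gt["imnames"][0][i][0]`. The helper names `load_gt` and `image_names` are hypothetical and not part of the dataset.

```python
def load_gt(path):
    """Load gt.mat into a dict keyed by 'imnames', 'wordBB', 'charBB', 'txt'."""
    # Imported lazily so image_names() below can be used without SciPy.
    from scipy.io import loadmat
    return loadmat(path)


def image_names(gt):
    """Return the relative image paths as a plain list of str.

    Assumes the layout loadmat typically produces for a 1xN cell array of
    strings: gt["imnames"][0][i] is a length-1 sequence holding the path.
    """
    return [str(entry[0]) for entry in gt["imnames"][0]]
```

Typical use would be `gt = load_gt("SynthText/gt.mat")` (the path is a placeholder) followed by `image_names(gt)[:3]` to inspect the first few paths. Loading the whole file at once needs a fair amount of memory.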

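2.2 wordBB: word level

Each image has one annotation tensor of size (2, 4, n_word_i): the first dimension, of size 2, holds the x and y coordinates; the second, of size 4, holds the four corner points, clockwise from the top-left; n_word_i is the number of words in the i-th image.

A "word" is any contiguous substring of non-whitespace characters.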
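The (2, 4, n_word_i) layout is awkward to iterate over word by word, so a common first step is to reorder it. The sketch below uses NumPy and assumes that an image with a single word may arrive as a 2 x 4 array (MATLAB drops trailing singleton dimensions); the function names are hypothetical.

```python
import numpy as np


def word_quads(word_bb):
    """Convert one image's wordBB (2 x 4 x N) to an (N, 4, 2) array.

    Each output row holds the four (x, y) corners of one word, clockwise
    from the top-left. A 2 x 4 input (single word) is treated as N = 1.
    """
    bb = np.asarray(word_bb, dtype=np.float64)
    if bb.ndim == 2:               # assumed single-word case: shape (2, 4)
        bb = bb[:, :, np.newaxis]
    return bb.transpose(2, 1, 0)   # axes become (word, point, xy)


def axis_aligned(quads):
    """Enclosing (xmin, ymin, xmax, ymax) box for each quadrilateral."""
    mins = quads.min(axis=1)       # (N, 2): smallest x and y per word
    maxs = quads.max(axis=1)       # (N, 2): largest x and y per word
    return np.concatenate([mins, maxs], axis=1)
```

Keeping the four corners rather than only the enclosing box preserves the rotation and perspective of each word, which matters for detectors that predict quadrilaterals.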

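2.3 charBB: character-level bounding boxes

The size is likewise (2, 4, n_char_i), where n_char_i is the number of characters in the i-th image; the layout is the same as for wordBB.

A "character" is any non-whitespace character.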
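Because characters exclude whitespace, the number of character boxes in an image should equal the summed length of its words. The sketch below uses that to cut a charBB tensor into one slice per word. It assumes the character boxes are stored in the same order as the words, which the official description does not state explicitly; `chars_per_word` is a hypothetical helper.

```python
import numpy as np


def chars_per_word(char_bb, words):
    """Split a 2 x 4 x NCHARS array into one (2, 4, len(word)) slice per word.

    Assumes character boxes follow the order of the words. Raises ValueError
    when the box count does not match the total word length.
    """
    bb = np.asarray(char_bb)
    if bb.ndim == 2:               # assumed single-character case: shape (2, 4)
        bb = bb[:, :, np.newaxis]
    total = sum(len(w) for w in words)
    if total != bb.shape[2]:
        raise ValueError(f"expected {total} character boxes, got {bb.shape[2]}")
    slices, start = [], 0
    for w in words:
        slices.append(bb[:, :, start:start + len(w)])
        start += len(w)
    return slices
```

The length check doubles as a sanity test on the annotations: a mismatch suggests the words were split differently from how the dataset defines them.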

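2.4 txt: text level

The text strings contained in each image (a char array).

Take the image ballet_106_0.jpg as an example. Its annotation contains 8 text instances; words rendered in the same region with the same font, color, distortion, and so on are treated as one text instance.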
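A small sketch of turning the `txt` field of one image into words and lines, following the definitions in the official description (words are runs of non-whitespace characters; line feeds mark boundaries). It assumes the field arrives as a sequence of strings, possibly padded with trailing spaces because it comes from a char array. The function names and the sample strings are hypothetical.

```python
def split_words(txt_entries):
    """Flatten one image's txt entries into a list of words.

    str.split() with no arguments splits on any whitespace run, so spaces,
    line feeds and trailing padding are all handled the same way.
    """
    return [word for entry in txt_entries for word in str(entry).split()]


def split_lines(txt_entries):
    """Split entries on line feeds and drop empty or whitespace-only pieces."""
    lines = []
    for entry in txt_entries:
        for line in str(entry).split("\n"):
            if line.strip():
                lines.append(line.strip())
    return lines
```

For example, `split_words(["Lines:\nI lost\nKevin ", "will   "])` yields five words, and the length of that list is what should match the third dimension of the image's wordBB tensor.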