A Detailed Look at the SynthText Text Dataset
Contents
1. Official description of the dataset
2. Dataset characteristics
2.1 imnames
2.2 wordBB: word-level bbox
2.3 charBB: character-level bbox
2.4 txt: text level
1. Official description of the dataset
SynthText in the Wild Dataset
-----------------------------
Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman
Visual Geometry Group, University of Oxford, 2016

Data format:
------------

SynthText.zip (size = 42074172 bytes (41GB)) contains 858,750 synthetic
scene-image files (.jpg) split into 200 directories, with
7,266,866 word-instances, and 28,971,487 characters.

Ground-truth annotations are contained in the file "gt.mat" (Matlab format).
The file "gt.mat" contains the following cell-arrays, each of size 1x858750:

  1. imnames : names of the image files

  2. wordBB  : word-level bounding-boxes for each image, represented by
               tensors of size 2x4xNWORDS_i, where:
                 - the first dimension is 2 for x and y respectively,
                 - the second dimension corresponds to the 4 points
                   (clockwise, starting from top-left), and
                 - the third dimension of size NWORDS_i, corresponds to
                   the number of words in the i_th image.

  3. charBB  : character-level bounding-boxes,
               each represented by a tensor of size 2x4xNCHARS_i
               (format is same as wordBB's above)

  4. txt     : text-strings contained in each image (char array).

               Words which belong to the same "instance", i.e.,
               those rendered in the same region with the same font, color,
               distortion etc., are grouped together; the instance
               boundaries are demarcated by the line-feed character (ASCII: 10)

               A "word" is any contiguous substring of non-whitespace
               characters.

               A "character" is defined as any non-whitespace character.

For any questions or comments, contact Ankush Gupta at:
removethisifyouarehuman-ankush@robots.ox.ac.uk
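2. Dataset characteristics
The files in the dataset are as follows.
(1) The dataset totals 41 GB and contains 858,750 synthetic images in JPG format. The images are split into 200 scene directories (that is, grouped by background image; there are actually 202 scenes), with 7,266,866 word instances and 28,971,487 characters.
(2) The annotation file is in MAT format; once loaded, it contains the following fields.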
2.1 imnames
Stores the relative path of each image file.
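As a rough illustration of how the four fields might be read, here is a minimal Python sketch. It assumes the file is loaded with `scipy.io.loadmat` and that each 1x858750 cell array comes back as a nested array indexed as `gt[field][0][i]`. The helper names `load_gt` and `get_sample` and the `root` parameter are hypothetical and not part of the dataset.

```python
import os


def load_gt(mat_path):
    """Load gt.mat into a dict keyed by field name (assumes scipy is installed)."""
    # Imported lazily so that get_sample can be used without scipy.
    from scipy.io import loadmat
    return loadmat(mat_path)


def get_sample(gt, i, root=""):
    """Collect the annotation of the i-th image into a plain dict.

    Assumes gt[field][0][i] addresses the i-th cell of each cell array,
    and that the image name is wrapped in one extra array level.
    """
    name = str(gt["imnames"][0][i][0])
    return {
        "path": os.path.join(root, name),   # imnames holds a relative path
        "wordBB": gt["wordBB"][0][i],       # 2 x 4 x n_word_i
        "charBB": gt["charBB"][0][i],       # 2 x 4 x n_char_i
        "txt": list(gt["txt"][0][i]),       # text strings for this image
    }
```

A call such as `get_sample(load_gt("gt.mat"), 0, root="SynthText")` would then return the path and annotations of the first image, provided the assumptions above hold for your copy of the file.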
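2.2 wordBB: word-level bbox
Each image has one annotation tensor of size (2, 4, n_word_i): the 2 is for the x and y coordinates; the 4 is for the four corner points, listed clockwise starting from the top-left; n_word_i is the number of words in the i-th image.
A "word" is any contiguous string of non-whitespace characters.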
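The (2, 4, n) layout is easy to misread, so the sketch below rearranges it into one list of four (x, y) corners per box. It is written in plain Python and assumes the tensor is indexable as `bb[axis][corner][box]`, for example a nested list. It also assumes that an image with a single box may arrive as a 2 x 4 tensor with the last dimension dropped, which is worth checking on your own data. The function name `quads_from_bb` is hypothetical.

```python
def quads_from_bb(bb):
    """Convert a 2 x 4 x N box tensor into a list of N quadrilaterals.

    bb[0][k][i] is the x coordinate of corner k of box i, and
    bb[1][k][i] is the matching y coordinate. Corners keep the
    original order: clockwise, starting from the top-left.
    """
    xs, ys = bb[0], bb[1]
    # Assumed edge case: a single box stored as 2 x 4, so each corner
    # entry is a scalar. Wrap scalars so the indexing below still works.
    if not hasattr(xs[0], "__len__"):
        xs = [[v] for v in xs]
        ys = [[v] for v in ys]
    n = len(xs[0])
    return [
        [(float(xs[k][i]), float(ys[k][i])) for k in range(4)]
        for i in range(n)
    ]
```

The same function applies to charBB, since it uses the same layout.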
2.3 charBB: character-level bbox
The size is likewise (2, 4, n_char_i), with the same meaning as for wordBB.
A "character" is any non-whitespace character.
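2.4 txt: text level
The text strings contained in each image (a char array).
Take the image ballet_106_0.jpg as an example: its annotation contains 8 text instances. Words rendered in the same region with the same font, color, distortion, and other properties are treated as one text instance.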
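Using the definitions quoted in section 1 (instances separated by the line-feed character, words separated by whitespace), the txt field can be broken down roughly as follows. This is a sketch that assumes the entries for one image have already been converted to a list of Python strings. The names `split_text` and `count_words_and_chars` are hypothetical.

```python
def split_text(entries):
    """Split the txt entries of one image into instances of words.

    Each entry may hold several instances separated by a line feed;
    each instance is then split on whitespace into words.
    """
    instances = []
    for entry in entries:
        for line in entry.split("\n"):
            words = line.split()   # also drops padding spaces
            if words:              # skip empty lines
                instances.append(words)
    return instances


def count_words_and_chars(instances):
    """Return (number of words, number of non-whitespace characters)."""
    words = [w for inst in instances for w in inst]
    return len(words), sum(len(w) for w in words)
```

Under the definitions above, the two counts should match n_word_i in wordBB and n_char_i in charBB for the same image, which gives a simple consistency check when reading the annotations.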