[Hung-yi Lee] Deep Learning: HW5 Machine Translation
Machine Translation
1. Goal
Given a passage of English, translate it into Traditional Chinese.
2. Introduction
2.1 Dataset
Training data
- TED2020: TED talks with transcriptions translated by a global community of volunteers into more than 100 languages.
- We will use the (en, zh-tw) aligned pairs.
Monolingual data
- More TED talks in Traditional Chinese.
2.2. Evaluation
How do we evaluate the model's performance?
We use BLEU.
brevity penalty: penalizes short hypotheses
c is the length of the hypothesis, r is the length of the reference
In other words, the sentence produced by the model is compared against the reference sentence; the more overlapping n-grams the two share, the higher the BLEU score.
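For reference, the standard BLEU definition (the formula image from the slide is not reproduced here) combines the modified n-gram precisions $p_n$ with the brevity penalty $BP$:

$$
BP =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\qquad
\mathrm{BLEU} = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)
$$

with uniform weights $w_n = 1/N$ and typically $N = 4$.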
2.3. Workflow
Preprocessing
- download raw data
- clean and normalize
- remove bad data (too long/short)
Training
- initialize a model
- train it with training data
Testing
- generate translations of the test data
- evaluate the performance
2.4. Training tips
- Tokenize data with subword units
  - For one, it reduces the vocabulary size
  - For another, it alleviates the open-vocabulary problem
  - example: transportation => trans port ation
- Label smoothing regularization
  - When calculating the loss, reserve some probability mass for incorrect labels
  - Avoids overfitting
- Learning rate scheduling
  - Linearly increase the lr, then decay it by the inverse square root of the number of update steps
  - Stabilizes training of Transformers in the early stage
2.5 Back-translation(BT)
Monolingual data is easy to obtain: if you want Chinese text, you can simply crawl it from the web, but not every English sentence comes with a Chinese translation. So here we translate the Chinese we do have (the monolingual data in the dataset) back into English. This back-translation gives us another training set; with more data the model gets more training and its performance may improve. (However, in the provided dataset, the monolingual data, i.e. test/test.zh, consists entirely of '。'.)
3. Code
3.1 Data preprocessing
Datasets
TED2020 bilingual parallel corpus:
Raw: 398,066 sentence pairs
Processed: 393,980 sentence pairs
Test data:
Size: 4,000 sentences
The Chinese translations are not publicly released; every line is a '。'
- Steps
- Download and extract the archives
- Rename the files
# Download the archives and extract them
data_dir = './DATA/rawdata'
dataset_name = 'ted2020'
urls = (
    '"https://onedrive.live.com/download?cid=3E549F3B24B238B4&resid=3E549F3B24B238B4%214989&authkey=AGgQ-DaR8eFSl1A"',
    '"https://onedrive.live.com/download?cid=3E549F3B24B238B4&resid=3E549F3B24B238B4%214987&authkey=AA4qP_azsicwZZM"',
    # # If the above links die, use the following instead.
    # "https://www.csie.ntu.edu.tw/~r09922057/ML2021-hw5/ted2020.tgz",
    # "https://www.csie.ntu.edu.tw/~r09922057/ML2021-hw5/test.tgz",
    # # If the above links die, use the following instead.
    # "https://mega.nz/#!vEcTCISJ!3Rw0eHTZWPpdHBTbQEqBDikDEdFPr7fI8WxaXK9yZ9U",
    # "https://mega.nz/#!zNcnGIoJ!oPJX9AvVVs11jc0SaK6vxP_lFUNTkEcK2WbxJpvjU5Y",
)
file_names = (
    'ted2020.tgz', # train & dev
    'test.tgz',    # test
)
prefix = Path(data_dir).absolute() / dataset_name
prefix.mkdir(parents=True, exist_ok=True)

for u, f in zip(urls, file_names):
    path = prefix/f
    if not path.exists():
        if 'mega' in u:
            !megadl {u} --path {path}
        else:
            !wget {u} -O {path}
    if path.suffix == ".tgz":
        !tar -xvf {path} -C {prefix}
    elif path.suffix == ".zip":
        !unzip -o {path} -d {prefix}
# Rename the files with the train_dev/test prefixes
!mv {prefix/'raw.en'} {prefix/'train_dev.raw.en'}
!mv {prefix/'raw.zh'} {prefix/'train_dev.raw.zh'}
!mv {prefix/'test.en'} {prefix/'test.raw.en'}
!mv {prefix/'test.zh'} {prefix/'test.raw.zh'}

# Set the language pair
src_lang = 'en'
tgt_lang = 'zh'

data_prefix = f'{prefix}/train_dev.raw'
test_prefix = f'{prefix}/test.raw'

!head {data_prefix+'.'+src_lang} -n 5
!head {data_prefix+'.'+tgt_lang} -n 5
Thank you so much, Chris.
And it’s truly a great honor to have the opportunity to come to this stage twice; I’m extremely grateful.
I have been blown away by this conference, and I want to thank all of you for the many nice comments about what I had to say the other night.
And I say that sincerely, partly because I need that.
Put yourselves in my position.
非常謝謝你,克里斯。能有這個機會第二度踏上這個演講台
真是一大榮幸。我非常感激。
這個研討會給我留下了極為深刻的印象,我想感謝大家 對我之前演講的好評。
我是由衷的想這麼說,有部份原因是因為 —— 我真的有需要!
請你們設身處地為我想一想!
- Steps
- Convert full-width characters in the strings to half-width
- Separate special characters from the surrounding text with spaces
- Remove or replace some special characters
# Remove or replace some special characters
def clean_s(s, lang):
    if lang == 'en':
        s = re.sub(r"\([^()]*\)", "", s) # remove ([text])
        s = s.replace('-', '') # remove '-'
        s = re.sub('([.,;!?()\"])', r' \1 ', s) # keep punctuation
    elif lang == 'zh':
        s = strQ2B(s) # full-width to half-width (strQ2B is defined earlier in the notebook)
        s = re.sub(r"\([^()]*\)", "", s) # remove ([text])
        s = s.replace(' ', '')
        s = s.replace('—', '')
        s = s.replace('“', '"')
        s = s.replace('”', '"')
        s = s.replace('_', '')
        s = re.sub('([。,;!?()\"~「」])', r' \1 ', s) # keep punctuation
    s = ' '.join(s.strip().split()) # collapse whitespace into single spaces
    return s

def len_s(s, lang):
    if lang == 'zh':
        return len(s)
    return len(s.split())

# After cleaning, the files are: train_dev.raw.clean.en, train_dev.raw.clean.zh, test.raw.clean.en, test.raw.clean.zh
def clean_corpus(prefix, l1, l2, ratio=9, max_len=1000, min_len=1):
    if Path(f'{prefix}.clean.{l1}').exists() and Path(f'{prefix}.clean.{l2}').exists():
        print(f'{prefix}.clean.{l1} & {l2} exists. skipping clean.')
        return
    with open(f'{prefix}.{l1}', 'r', encoding='utf-8') as l1_in_f:
        with open(f'{prefix}.{l2}', 'r', encoding='utf-8') as l2_in_f:
            with open(f'{prefix}.clean.{l1}', 'w', encoding='utf-8') as l1_out_f:
                with open(f'{prefix}.clean.{l2}', 'w', encoding='utf-8') as l2_out_f:
                    for s1 in l1_in_f:
                        s1 = s1.strip()
                        s2 = l2_in_f.readline().strip()
                        s1 = clean_s(s1, l1)
                        s2 = clean_s(s2, l2)
                        s1_len = len_s(s1, l1)
                        s2_len = len_s(s2, l2)
                        if min_len > 0: # remove short sentences
                            if s1_len < min_len or s2_len < min_len:
                                continue
                        if max_len > 0: # remove long sentences
                            if s1_len > max_len or s2_len > max_len:
                                continue
                        if ratio > 0: # remove pairs whose length ratio is too large
                            if s1_len / s2_len > ratio or s2_len / s1_len > ratio:
                                continue
                        print(s1, file=l1_out_f)
                        print(s2, file=l2_out_f)

clean_corpus(data_prefix, src_lang, tgt_lang)
clean_corpus(test_prefix, src_lang, tgt_lang, ratio=-1, min_len=-1, max_len=-1)
Thank you so much , Chris .
And it’s truly a great honor to have the opportunity to come to this stage twice ; I’m extremely grateful .
I have been blown away by this conference , and I want to thank all of you for the many nice comments about what I had to say the other night .
And I say that sincerely , partly because I need that .
Put yourselves in my position .
非常謝謝你 , 克里斯 。 能有這個機會第二度踏上這個演講台
真是一大榮幸 。 我非常感激 。
這個研討會給我留下了極為深刻的印象 , 我想感謝大家對我之前演講的好評 。
我是由衷的想這麼說 , 有部份原因是因為我真的有需要 !
請你們設身處地為我想一想 !
3.2 Splitting into training and validation sets
The validation set only needs about 3,000-4,000 sentences. We split the processed training data and write the results to train.clean.en/.zh and valid.clean.en/.zh.
# Split into training and validation sets
# about 3000~4000 sentences are enough for validation
valid_ratio = 0.01
train_ratio = 1 - valid_ratio
data_dir = './data'
dataset_name = 'prefix'

# The split produces train.clean.en, train.clean.zh, valid.clean.en and valid.clean.zh
if Path(f'{prefix}/train.clean.{src_lang}').exists() \
        and Path(f'{prefix}/train.clean.{tgt_lang}').exists() \
        and Path(f'{prefix}/valid.clean.{src_lang}').exists() \
        and Path(f'{prefix}/valid.clean.{tgt_lang}').exists():
    print(f'train/valid splits exists. skipping split.')
else:
    line_num = sum(1 for line in open(f'{data_prefix}.clean.{src_lang}', encoding='utf-8'))
    labels = list(range(line_num))
    random.shuffle(labels)
    for lang in [src_lang, tgt_lang]:
        train_f = open(os.path.join(data_dir, dataset_name, f'train.clean.{lang}'), 'w', encoding='utf-8')
        valid_f = open(os.path.join(data_dir, dataset_name, f'valid.clean.{lang}'), 'w', encoding='utf-8')
        count = 0
        for line in open(f'{data_prefix}.clean.{lang}', 'r', encoding='utf-8'):
            if labels[count] / line_num < train_ratio:
                train_f.write(line)
            else:
                valid_f.write(line)
            count += 1
        train_f.close()
        valid_f.close()
The result of the split is the four files train.clean.en/.zh and valid.clean.en/.zh.
3.3 Subword units (tokenization)
A major problem in translation is out-of-vocabulary words, which can be mitigated by using subword units as the modeling unit.
- Use the sentencepiece package
- With either the unigram or the byte-pair encoding (BPE) algorithm
# Subword units
# Train a subword model on the training and validation sets with sentencepiece.
# The model is saved as spm8000.model and also produces the vocabulary spm8000.vocab.
# Then apply the model to the training, validation and test sets to obtain
# train.en, train.zh, valid.en, valid.zh, test.en, test.zh
import sentencepiece as spm
vocab_size = 8000
if Path(f'{prefix}/spm{vocab_size}.model').exists():
    print(f'{prefix}/spm{vocab_size}.model exists. skipping spm_train.')
else:
    spm.SentencePieceTrainer.train(
        input=','.join([f'{prefix}/train.clean.{src_lang}',
                        f'{prefix}/valid.clean.{src_lang}',
                        f'{prefix}/train.clean.{tgt_lang}',
                        f'{prefix}/valid.clean.{tgt_lang}']),
        model_prefix=f'{prefix}/spm{vocab_size}',
        vocab_size=vocab_size,
        character_coverage=1,
        model_type='unigram', # 'bpe' also works
        input_sentence_size=1e6,
        shuffle_input_sentence=True,
        normalization_rule_name='nmt_nfkc_cf',
    )

spm_model = spm.SentencePieceProcessor(model_file=str(f'{prefix}/spm{vocab_size}.model'))
in_tag = {
    'train': 'train.clean',
    'valid': 'valid.clean',
    'test': 'test.raw.clean',
}
for split in ['train', 'valid', 'test']:
    for lang in [src_lang, tgt_lang]:
        out_path = Path(f'{prefix}/{split}.{lang}')
        if out_path.exists():
            print(f"{out_path} exists. skipping spm_encode.")
        else:
            with open(f'{prefix}/{split}.{lang}', 'w', encoding='utf-8') as out_f:
                with open(f'{prefix}/{in_tag[split]}.{lang}', 'r', encoding='utf-8') as in_f:
                    for line in in_f:
                        line = line.strip()
                        tok = spm_model.encode(line, out_type=str)
                        print(' '.join(tok), file=out_f)
Training the subword model also produces the vocabulary file spm8000.vocab.
The tokenized train.en and the corresponding train.zh look like this:
▁thank ▁you ▁so ▁much ▁, ▁chris ▁.
▁and ▁it ’ s ▁ t ru ly ▁a ▁great ▁ho n or ▁to ▁have ▁the ▁ op port un ity ▁to ▁come ▁to ▁this ▁st age ▁ t wi ce ▁; ▁i ’ m ▁ex t re me ly ▁gr ate ful ▁.
▁i ▁have ▁been ▁ bl ow n ▁away ▁by ▁this ▁con fer ence ▁, ▁and ▁i ▁want ▁to ▁thank ▁all ▁of ▁you ▁for ▁the ▁many ▁ ni ce ▁ com ment s ▁about ▁what ▁i ▁had ▁to ▁say ▁the ▁other ▁night ▁.
▁and ▁i ▁say ▁that ▁since re ly ▁, ▁part ly ▁because ▁i ▁need ▁that ▁.
▁put ▁your s el ve s ▁in ▁my ▁po s ition ▁.
▁ 非常 謝 謝 你 ▁, ▁ 克 里 斯 ▁。 ▁ 能 有 這個 機會 第二 度 踏 上 這個 演講 台
▁ 真 是 一 大 榮 幸 ▁。 ▁我 非常 感 激 ▁。
▁這個 研 討 會 給我 留 下 了 極 為 深 刻 的 印 象 ▁, ▁我想 感 謝 大家 對我 之前 演講 的 好 評 ▁。
▁我 是由 衷 的 想 這麼 說 ▁, ▁有 部份 原因 是因為 我 真的 有 需要 ▁!
▁ 請 你們 設 身 處 地 為 我想 一 想 ▁!
3.4 Binarizing the data with fairseq
The cell below is meant to be run in Jupyter/Colab (it uses `!` shell escapes); outside a notebook, run the equivalent fairseq_cli.preprocess command in a shell.
# Binarize the data with fairseq; the resulting files are written to ./data/data_bin
binpath = Path('./data/data_bin')
if binpath.exists():
    print(binpath, "exists, will not overwrite!")
else:
    !python -m fairseq_cli.preprocess \
        --source-lang en \
        --target-lang zh \
        --trainpref ./data/prefix/train \
        --validpref ./data/prefix/valid \
        --testpref ./data/prefix/test \
        --destdir ./data/data_bin \
        --joined-dictionary \
        --workers 2

This generates a set of files under the data_bin directory.
4. Experiment setup
4.1 Configuration
config = Namespace(
    datadir = "./data/data_bin",
    savedir = "./checkpoints/rnn",
    source_lang = "en",
    target_lang = "zh",
    # cpu threads when fetching & processing data.
    num_workers=2,
    # batch size in terms of tokens. gradient accumulation increases the effective batch size.
    max_tokens=8192,
    accum_steps=2,
    # the lr is calculated from the Noam lr scheduler. you can tune the maximum lr by this factor.
    lr_factor=2.,
    lr_warmup=4000,
    # clipping gradient norm helps alleviate gradient exploding
    clip_norm=1.0,
    # maximum epochs for training
    max_epoch=30,
    start_epoch=1,
    # beam size for beam search
    beam=5,
    # generate sequences of maximum length ax + b, where x is the source length
    max_len_a=1.2,
    max_len_b=10,
    # when decoding, post process sentence by removing sentencepiece symbols.
    post_process = "sentencepiece",
    # checkpoints
    keep_last_epochs=5,
    resume=None, # if resume from checkpoint name (under config.savedir)
    # logging
    use_wandb=False,
)
4.2 logging
# The logging package records general messages; wandb records the training loss, BLEU, model weights, etc.
logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level="INFO", # "DEBUG" "WARNING" "ERROR"
    stream=sys.stdout,
)
proj = "hw5.seq2seq"
logger = logging.getLogger(proj)
if config.use_wandb:
    import wandb
    wandb.init(project=proj, name=Path(config.savedir).stem, config=config)
4.3 CUDA environment
cuda_env = utils.CudaEnvironment()
utils.CudaEnvironment.pretty_print_cuda_env_list([cuda_env])
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
4.4 Loading the dataset
We borrow fairseq's TranslationTask:
- It loads the binarized data created above
- It provides a well-implemented data iterator (dataloader)
- The dictionaries task.source_dictionary and task.target_dictionary are also handy
- It has beam search already implemented
from fairseq.tasks.translation import TranslationConfig, TranslationTask

## setup task
task_cfg = TranslationConfig(
    data=config.datadir,
    source_lang=config.source_lang,
    target_lang=config.target_lang,
    train_subset="train",
    required_seq_len_multiple=8,
    dataset_impl="mmap",
    upsample_primary=1,
)
task = TranslationTask.setup_task(task_cfg)

logger.info("loading data for epoch 1")
task.load_dataset(split="train", epoch=1, combine=True) # combine if you have back-translation data.
task.load_dataset(split="valid", epoch=1)

sample = task.dataset("valid")[1]
pprint.pprint(sample)
pprint.pprint(
    "Source: " + \
    task.source_dictionary.string(
        sample['source'],
        config.post_process,
    )
)
pprint.pprint(
    "Target: " + \
    task.target_dictionary.string(
        sample['target'],
        config.post_process,
    )
)
Output:

{'id': 1,
 'source': tensor([  18,   14,    6, 2234,   60,   19,   80,    5,  256,   16,  405, 1407,
        1706,    7,    2]),
 'target': tensor([ 140,  690,   28,  270,   45,  151, 1142,  660,  606,  369, 3114, 2434,
        1434,  192,    2])}
"Source: that's exactly what i do optical mind control ."
'Target: 這實在就是我所做的–光學操控思想'
4.5 Dataset iterator
- Controls each batch to contain at most N tokens, which uses GPU memory more efficiently
- Shuffles the training set differently for every epoch
- Filters out sentences that are too long
- Pads all sentences in a batch to the same length, enabling parallel GPU computation
- Adds eos and shifts the sequence by one position
  - teacher forcing: to train the model to predict the next token given a prefix, the decoder input is the output target sequence shifted one position to the right.
  - Usually a bos token is prepended to the decoder input; fairseq instead moves the eos to the beginning, which trains about equally well. For example (a small tensor sketch follows the example):

Output target (target) and decoder input (prev_output_tokens), with eos = 2:
  target             = 419, 711, 238, 888, 792, 60, 968, 8, 2
  prev_output_tokens = 2, 419, 711, 238, 888, 792, 60, 968, 8
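A minimal sketch (standalone, not part of the homework code) of this eos-rotation on a plain PyTorch tensor, with eos = 2 as in the example above:

```python
import torch

eos = 2
target = torch.tensor([419, 711, 238, 888, 792, 60, 968, 8, eos])

# fairseq-style shift: rotate the trailing eos to the front instead of prepending bos
prev_output_tokens = torch.roll(target, shifts=1)

print(prev_output_tokens)  # tensor([  2, 419, 711, 238, 888, 792,  60, 968,   8])
```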
def load_data_iterator(task, split, epoch=1, max_tokens=4000, num_workers=1, cached=True):
    batch_iterator = task.get_batch_iterator(
        dataset=task.dataset(split),
        max_tokens=max_tokens,
        max_sentences=None,
        max_positions=utils.resolve_max_positions(
            task.max_positions(),
            max_tokens,
        ),
        ignore_invalid_inputs=True,
        seed=seed,
        num_workers=num_workers,
        epoch=epoch,
        disable_iterator_cache=not cached,
        # Set this to False to speed up. However, if set to False, changing max_tokens beyond
        # first call of this method has no effect.
    )
    return batch_iterator

if __name__ == '__main__':
    demo_epoch_obj = load_data_iterator(task, "valid", epoch=1, max_tokens=20, num_workers=1, cached=False)
    demo_iter = demo_epoch_obj.next_epoch_itr(shuffle=True)
    sample = next(demo_iter)
    print(sample)
The output, with an explanation of each field:

{'id': tensor([723]),  # id of each example
 'nsentences': 1,      # batch size in sentences
 'ntokens': 18,        # batch size in tokens
 'net_input': {
     'src_tokens': tensor([[   1,    1,    1,    1,    1,   18,   26,   82,    8,  480,   15,  651,
         1361,   38,    6,  176, 2696,   39,    5,  822,   92,  260,    7,    2]]),  # source-language token sequence
     'src_lengths': tensor([19]),  # length of each sentence before padding
     'prev_output_tokens': tensor([[   2,  140,  296,  318, 1560,   51,  568,  316,  225, 1952,  254,   78,
          151, 2691,    9,  215, 1680,   10,    1,    1,    1,    1,    1,    1]])  # the shifted target sequence described above
 },
 'target': tensor([[ 140,  296,  318, 1560,   51,  568,  316,  225, 1952,  254,   78,  151,
         2691,    9,  215, 1680,   10,    2,    1,    1,    1,    1,    1,    1]])  # the target sequence
}
5. Model architecture
- We subclass fairseq's Encoder, Decoder and Model classes, so that at test time we can directly use the beam search functions it already provides.
5.1 Encoder
The encoder of a seq2seq model is an RNN or a Transformer encoder; the explanation below uses an RNN as the example.
For each input token, the encoder outputs a vector and a hidden state, and carries the hidden state over to the next timestep. In other words, the encoder reads the input sequence step by step, outputs a single vector at every timestep, and outputs its hidden state (the context vector) at the final timestep.
Below are some notes on the GRU used in this homework.

This homework uses a GRU; its inputs and outputs are as follows.

The inputs are input and h_0.

Inputs: input, h_0

① shape of input

The shape of input is (seq_len, batch, input_size): a tensor containing the features of the input sequence. The input can also be a packed variable-length sequence; see torch.nn.utils.rnn.pack_padded_sequence for details.

② shape of h_0

As the description below shows, this argument is optional and defaults to zeros.

The shape of h_0 is (num_layers * num_directions, batch, hidden_size): a tensor containing the initial hidden state for each element in the batch. Defaults to zero if not provided. If the RNN is bidirectional, num_directions should be 2, else it should be 1.

The outputs are output and h_n.

① output

The shape of output is (seq_len, batch, num_directions * hidden_size): a tensor containing the output features h_t from the last layer of the GRU, for each t.

If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence. For the unpacked case, the directions can be separated using output.view(seq_len, batch, num_directions, hidden_size), with forward and backward being direction 0 and 1 respectively. Similarly, the directions can be separated in the packed case.

② h_n

The shape of h_n is (num_layers * num_directions, batch, hidden_size): a tensor containing the hidden state for t = seq_len.

Like output, the layers can be separated using h_n.view(num_layers, num_directions, batch, hidden_size).

When bidirectional=True, h_n contains both the final forward and the final backward hidden states, which the encoder below concatenates.

The shape bookkeeping is roughly as in the sketch below.
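A quick standalone shape check (a sketch with illustrative sizes, not the homework's actual configuration) for a bidirectional GRU like the one the encoder uses:

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size, num_layers = 7, 3, 256, 512, 1
rnn = nn.GRU(input_size, hidden_size, num_layers, bidirectional=True)

x = torch.randn(seq_len, batch, input_size)            # T x B x C
h0 = torch.zeros(2 * num_layers, batch, hidden_size)   # (num_layers * num_directions) x B x H

output, h_n = rnn(x, h0)
print(output.shape)  # torch.Size([7, 3, 1024]) = seq_len x batch x (num_directions * hidden_size)
print(h_n.shape)     # torch.Size([2, 3, 512])  = (num_layers * num_directions) x batch x hidden_size
```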
# Model architecture
# We subclass fairseq's Encoder, Decoder and Model classes
class RNNEncoder(FairseqEncoder):
    def __init__(self, args, dictionary, embed_tokens):
        '''
        :param args:
            encoder_embed_dim: embedding dimension; compresses the one-hot word vectors to this size
            encoder_ffn_embed_dim: dimension of the RNN outputs and hidden states (hidden dimension)
            encoder_layers: number of stacked RNN layers
            dropout: probability of zeroing out a unit, mainly to prevent overfitting; generally used during training
        :param dictionary: the dictionary provided by fairseq; used here to get the padding index, and in turn the encoder padding mask
        :param embed_tokens: the pre-built token embeddings (nn.Embedding)
        '''
        super().__init__(dictionary)
        self.embed_tokens = embed_tokens

        self.embed_dim = args.encoder_embed_dim
        self.hidden_dim = args.encoder_ffn_embed_dim
        self.num_layers = args.encoder_layers

        self.dropout_in_module = nn.Dropout(args.dropout)
        self.rnn = nn.GRU(
            self.embed_dim,
            self.hidden_dim,
            self.num_layers,
            dropout=args.dropout,
            batch_first=False,
            bidirectional=True,
        )
        self.dropout_out_module = nn.Dropout(args.dropout)

        self.padding_idx = dictionary.pad()

    def combine_bidir(self, outs, bsz: int):
        out = outs.view(self.num_layers, 2, bsz, -1).transpose(1, 2).contiguous()
        return out.view(self.num_layers, bsz, -1)

    def forward(self, src_tokens, **unused):
        '''
        :param src_tokens: the source (English) token id sequence
        :return:
            outputs: the top RNN layer's output at every timestep, which can later be processed with attention
            final_hiddens: the final hidden state of every layer, passed to the decoder for decoding
            encoder_padding_mask: tells us which positions are padding and carry no information
        '''
        bsz, seqlen = src_tokens.size()

        # get embeddings
        x = self.embed_tokens(src_tokens)
        x = self.dropout_in_module(x)

        # B x T x C => T x B x C
        x = x.transpose(0, 1)

        # pass through the bidirectional RNN
        h0 = x.new_zeros(2 * self.num_layers, bsz, self.hidden_dim)
        x, final_hiddens = self.rnn(x, h0)
        outputs = self.dropout_out_module(x)
        # outputs = [sequence len, batch size, hid dim * directions]: output of the top RNN layer
        # hidden  = [num_layers * directions, batch size, hid dim]

        # since the encoder is bidirectional, concatenate the hidden states of both directions
        final_hiddens = self.combine_bidir(final_hiddens, bsz)
        # hidden = [num_layers x batch x num_directions*hidden]

        encoder_padding_mask = src_tokens.eq(self.padding_idx).t()
        return tuple(
            (
                outputs,              # seq_len x batch x hidden
                final_hiddens,        # num_layers x batch x num_directions*hidden
                encoder_padding_mask, # seq_len x batch
            )
        )

    def reorder_encoder_out(self, encoder_out, new_order):
        return tuple(
            (
                encoder_out[0].index_select(1, new_order),
                encoder_out[1].index_select(1, new_order),
                encoder_out[2].index_select(1, new_order),
            )
        )
5.2 Attention
- When the input is too long, or the context vector alone cannot capture the meaning of the whole input, the Attention Mechanism gives the decoder more information.
- Based on the current decoder embeddings, compute how strongly each position of the encoder outputs is related to them, and use these scores to average the encoder outputs as an additional input to the decoder RNN.
- A common form of attention uses a neural network / dot product to compute the relation between the query (decoder embeddings) and the keys (encoder outputs), applies a softmax over the scores to get a distribution, and finally takes a weighted sum of the values (encoder outputs) according to that distribution; see the formulas below.
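In matrix form, the AttentionLayer implemented below computes, with $Q$ the decoder states (queries), $S$ the encoder outputs (keys and values), and $W_{in}$, $W_{out}$ its two linear layers:

$$
\tilde{Q} = Q W_{in}, \qquad
\alpha = \mathrm{softmax}\bigl(\tilde{Q} S^{\top}\bigr), \qquad
c = \alpha S, \qquad
\mathrm{output} = \tanh\bigl([\,c\,;\,Q\,]\, W_{out}\bigr)
$$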
# Attention
class AttentionLayer(nn.Module):
    def __init__(self, input_embed_dim, source_embed_dim, output_embed_dim, bias=False):
        '''
        :param input_embed_dim: dimension of the query, i.e. the decoder vectors doing the attending
        :param source_embed_dim: dimension of the keys/values, i.e. the vectors being attended to (encoder outputs)
        :param output_embed_dim: dimension expected by the next layer after attention
        :param bias: whether the linear projections use a bias
        '''
        super().__init__()

        self.input_proj = nn.Linear(input_embed_dim, source_embed_dim, bias=bias)
        self.output_proj = nn.Linear(input_embed_dim + source_embed_dim, output_embed_dim, bias=bias)

    def forward(self, inputs, encoder_outputs, encoder_padding_mask):
        '''
        :param inputs: the query, i.e. the decoder states that attend to the encoder
        :param encoder_outputs: the keys/values, i.e. the vectors being attended to
        :param encoder_padding_mask: tells us which positions are padding and carry no information
        :return:
            output: the context vectors after attention
            attention score: the attention distribution
        '''
        # inputs: T, B, dim
        # encoder_outputs: S x B x dim
        # padding mask: S x B

        # convert all to batch first
        inputs = inputs.transpose(1, 0)                            # B, T, dim
        encoder_outputs = encoder_outputs.transpose(1, 0)          # B, S, dim
        encoder_padding_mask = encoder_padding_mask.transpose(1, 0) # B, S

        # project the query to the dimension of encoder_outputs
        x = self.input_proj(inputs)

        # (B, T, dim) x (B, dim, S) = (B, T, S)
        attn_scores = torch.bmm(x, encoder_outputs.transpose(1, 2))

        # block attention at padding positions
        if encoder_padding_mask is not None:
            # broadcast: B, S -> (B, 1, S)
            encoder_padding_mask = encoder_padding_mask.unsqueeze(1)
            attn_scores = (
                attn_scores.float()
                .masked_fill_(encoder_padding_mask, float("-inf"))  # padded positions get zero weight after softmax
                .type_as(attn_scores)                               # cast back to the original dtype
            )

        # softmax over the source dimension
        attn_scores = F.softmax(attn_scores, dim=-1)

        # weighted average: (B, T, S) x (B, S, dim) = (B, T, dim)
        x = torch.bmm(attn_scores, encoder_outputs)

        # (B, T, dim)
        x = torch.cat((x, inputs), dim=-1)
        x = torch.tanh(self.output_proj(x))  # concat + linear + tanh

        # restore the shape: (B, T, dim) -> (T, B, dim)
        return x.transpose(1, 0), attn_scores
5.3 Decoder
- The decoder's hidden states are initialized from the encoder's final hidden states
- At each timestep the decoder takes the previous timestep's output as input, updates its hidden states, and produces an output
- Adding attention improves its performance
- We write the seq2seq decoding steps inside the decoder, so that later the Seq2Seq class works with both RNNs and Transformers without modification
# Decoder
class RNNDecoder(FairseqIncrementalDecoder):
    def __init__(self, args, dictionary, embed_tokens):
        super().__init__(dictionary)
        self.embed_tokens = embed_tokens

        assert args.decoder_layers == args.encoder_layers, f"""seq2seq rnn requires that encoder
        and decoder have same layers of rnn. got: {args.encoder_layers, args.decoder_layers}"""
        assert args.decoder_ffn_embed_dim == args.encoder_ffn_embed_dim * 2, f"""seq2seq-rnn requires
        that decoder hidden to be 2*encoder hidden dim. got: {args.decoder_ffn_embed_dim, args.encoder_ffn_embed_dim * 2}"""

        self.embed_dim = args.decoder_embed_dim
        self.hidden_dim = args.decoder_ffn_embed_dim
        self.num_layers = args.decoder_layers

        self.dropout_in_module = nn.Dropout(args.dropout)
        self.rnn = nn.GRU(
            self.embed_dim,
            self.hidden_dim,
            self.num_layers,
            dropout=args.dropout,
            batch_first=False,
            bidirectional=False,
        )
        self.attention = AttentionLayer(
            self.embed_dim, self.hidden_dim, self.embed_dim, bias=False
        )
        # self.attention = None
        self.dropout_out_module = nn.Dropout(args.dropout)

        if self.hidden_dim != self.embed_dim:
            self.project_out_dim = nn.Linear(self.hidden_dim, self.embed_dim)
        else:
            self.project_out_dim = None

        if args.share_decoder_input_output_embed:
            self.output_projection = nn.Linear(
                self.embed_tokens.weight.shape[1],
                self.embed_tokens.weight.shape[0],
                bias=False,
            )
            self.output_projection.weight = self.embed_tokens.weight
        else:
            self.output_projection = nn.Linear(self.embed_dim, len(dictionary), bias=False)
            nn.init.normal_(self.output_projection.weight, mean=0, std=self.embed_dim ** -0.5)

    def forward(self, prev_output_tokens, encoder_out, incremental_state=None, **unused):
        # take out the encoder outputs
        encoder_outputs, encoder_hiddens, encoder_padding_mask = encoder_out
        # outputs:          seq_len x batch x num_directions*hidden
        # encoder_hiddens:  num_layers x batch x num_directions*encoder_hidden
        # padding_mask:     seq_len x batch

        if incremental_state is not None and len(incremental_state) > 0:
            # if we kept the information from the previous timestep, we can continue from there
            # instead of starting over from bos
            prev_output_tokens = prev_output_tokens[:, -1:]
            cache_state = self.get_incremental_state(incremental_state, "cached_state")
            prev_hiddens = cache_state["prev_hiddens"]
        else:
            # no incremental state: this is training, or the first step at test time
            # prepare seq2seq: pass the encoder_hiddens into the decoder's hidden states
            prev_hiddens = encoder_hiddens

        bsz, seqlen = prev_output_tokens.size()

        # embed tokens
        x = self.embed_tokens(prev_output_tokens)
        x = self.dropout_in_module(x)

        # B x T x C -> T x B x C
        x = x.transpose(0, 1)

        # decoder-to-encoder attention
        if self.attention is not None:
            x, attn = self.attention(x, encoder_outputs, encoder_padding_mask)

        # pass through the unidirectional RNN
        x, final_hiddens = self.rnn(x, prev_hiddens)
        # outputs = [sequence len, batch size, hid dim]
        # hidden  = [num_layers * directions, batch size, hid dim]
        x = self.dropout_out_module(x)

        # project back to the embedding size (needed when hidden and embed size differ and share_embedding is True)
        if self.project_out_dim != None:
            x = self.project_out_dim(x)

        # project to the distribution over the vocabulary
        x = self.output_projection(x)

        # T x B x C -> B x T x C
        x = x.transpose(1, 0)

        # if incremental, record this timestep's hidden states so the next timestep can read them back
        cache_state = {
            "prev_hiddens": final_hiddens,
        }
        self.set_incremental_state(incremental_state, "cached_state", cache_state)

        return x, None

    def reorder_incremental_state(
        self,
        incremental_state,
        new_order,
    ):
        # used during beam search; the details are not essential here
        cache_state = self.get_incremental_state(incremental_state, "cached_state")
        prev_hiddens = cache_state["prev_hiddens"]
        prev_hiddens = [p.index_select(0, new_order) for p in prev_hiddens]
        cache_state = {
            "prev_hiddens": torch.stack(prev_hiddens),
        }
        self.set_incremental_state(incremental_state, "cached_state", cache_state)
        return
5.4 Seq2Seq
- Composed of an Encoder and a Decoder
- Receives the input and passes it to the Encoder
- Passes the Encoder outputs to the Decoder
- The Decoder decodes based on the outputs of the previous timestep and the Encoder outputs
- Once decoding is finished, returns the Decoder outputs
# Seq2Seq
class Seq2Seq(FairseqEncoderDecoderModel):
    def __init__(self, args, encoder, decoder):
        super().__init__(encoder, decoder)
        self.args = args

    def forward(self, src_tokens, src_lengths, prev_output_tokens, return_all_hiddens: bool = True):
        encoder_out = self.encoder(
            src_tokens, src_lengths=src_lengths, return_all_hiddens=return_all_hiddens
        )
        logits, extra = self.decoder(
            prev_output_tokens,
            encoder_out=encoder_out,
            src_lengths=src_lengths,
            return_all_hiddens=return_all_hiddens,
        )
        return logits, extra
5.5 Model initialization

# Build and initialize the model
def build_model(args, task):
    src_dict, tgt_dict = task.source_dictionary, task.target_dictionary

    # token embeddings
    encoder_embed_tokens = nn.Embedding(len(src_dict), args.encoder_embed_dim, src_dict.pad())
    decoder_embed_tokens = nn.Embedding(len(tgt_dict), args.decoder_embed_dim, tgt_dict.pad())

    # encoder and decoder
    encoder = RNNEncoder(args, src_dict, encoder_embed_tokens)
    decoder = RNNDecoder(args, tgt_dict, decoder_embed_tokens)

    # sequence-to-sequence model
    model = Seq2Seq(args, encoder, decoder)

    # initialization is important for seq2seq models and needs special handling
    def init_params(module):
        from fairseq.modules import MultiheadAttention
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.bias is not None:
                module.bias.data.zero_()
        if isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        if isinstance(module, MultiheadAttention):
            module.q_proj.weight.data.normal_(mean=0.0, std=0.02)
            module.k_proj.weight.data.normal_(mean=0.0, std=0.02)
            module.v_proj.weight.data.normal_(mean=0.0, std=0.02)
        if isinstance(module, nn.RNNBase):
            for name, param in module.named_parameters():
                if "weight" in name or "bias" in name:
                    param.data.uniform_(-0.1, 0.1)

    # apply the initialization
    model.apply(init_params)
    return model
5.6 Architecture hyperparameters

arch_args = Namespace(
    encoder_embed_dim=256,
    encoder_ffn_embed_dim=512,
    encoder_layers=1,
    decoder_embed_dim=256,
    decoder_ffn_embed_dim=1024,
    decoder_layers=1,
    share_decoder_input_output_embed=True,
    dropout=0.3,
)

model = build_model(arch_args, task)
logger.info(model)
Seq2Seq(
(encoder): RNNEncoder(
(embed_tokens): Embedding(8000, 256, padding_idx=1)
(dropout_in_module): Dropout(p=0.3, inplace=False)
(rnn): GRU(256, 512, dropout=0.3, bidirectional=True)
(dropout_out_module): Dropout(p=0.3, inplace=False)
)
(decoder): RNNDecoder(
(embed_tokens): Embedding(8000, 256, padding_idx=1)
(dropout_in_module): Dropout(p=0.3, inplace=False)
(rnn): GRU(256, 1024, dropout=0.3)
(attention): AttentionLayer(
(input_proj): Linear(in_features=256, out_features=1024, bias=False)
(output_proj): Linear(in_features=1280, out_features=256, bias=False)
)
(dropout_out_module): Dropout(p=0.3, inplace=False)
(project_out_dim): Linear(in_features=1024, out_features=256, bias=True)
(output_projection): Linear(in_features=256, out_features=8000, bias=False)
)
)
5.7 Optimization
Loss: Label Smoothing Regularization
- Makes the model learn to output a less peaked distribution, preventing over-confidence
- The ground truth is not always the only acceptable answer, so when computing the loss we reserve some probability mass for labels other than the correct one
- Effectively prevents overfitting (see the formula below)
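In formula form, the criterion implemented below computes, with $\varepsilon$ the smoothing value and $V$ the vocabulary size:

$$
\mathcal{L} = (1-\varepsilon)\,\bigl(-\log p_\theta(y \mid x)\bigr) + \frac{\varepsilon}{V} \sum_{k=1}^{V} \bigl(-\log p_\theta(k \mid x)\bigr)
$$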
class LabelSmoothedCrossEntropyCriterion(nn.Module):
    def __init__(self, smoothing, ignore_index=None, reduce=True):
        super().__init__()
        self.smoothing = smoothing
        self.ignore_index = ignore_index
        self.reduce = reduce

    def forward(self, lprobs, target):
        if target.dim() == lprobs.dim() - 1:
            target = target.unsqueeze(-1)
        # nll: negative log likelihood, the cross-entropy loss when the target is one-hot. Same as F.nll_loss
        nll_loss = -lprobs.gather(dim=-1, index=target)
        # reserve part of the probability for the other labels, so the cross-entropy sums the log probs of all labels
        smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
        if self.ignore_index is not None:
            pad_mask = target.eq(self.ignore_index)
            nll_loss.masked_fill_(pad_mask, 0.0)
            smooth_loss.masked_fill_(pad_mask, 0.0)
        else:
            nll_loss = nll_loss.squeeze(-1)
            smooth_loss = smooth_loss.squeeze(-1)
        if self.reduce:
            nll_loss = nll_loss.sum()
            smooth_loss = smooth_loss.sum()
        # add the loss assigned to the other labels when computing the cross-entropy
        eps_i = self.smoothing / lprobs.size(-1)
        loss = (1.0 - self.smoothing) * nll_loss + eps_i * smooth_loss
        return loss

# a smoothing value of 0.1 generally works well
criterion = LabelSmoothedCrossEntropyCriterion(
    smoothing=0.1,
    ignore_index=task.target_dictionary.pad(),
)
5.8 Optimizer: Adam + lr scheduling
Inverse square root scheduling is important for stabilizing Transformer training in the early stage, and was later also applied to RNNs. The learning rate is updated according to the formula below: it grows linearly during warmup and then decays in proportion to the inverse square root of the update step.
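In formula form (matching the NoamOpt implementation below, where $d_{\text{model}}$ is the model size, here encoder_embed_dim):

$$
lrate = factor \cdot d_{\text{model}}^{-0.5} \cdot \min\left(step^{-0.5},\; step \cdot warmup^{-1.5}\right)
$$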
class NoamOpt:
    "Optim wrapper that implements rate."
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0

    @property
    def param_groups(self):
        return self.optimizer.param_groups

    def multiply_grads(self, c):
        """Multiplies grads by a constant *c*."""
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.grad.data.mul_(c)

    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()

    def rate(self, step=None):
        "Implement `lrate` above"
        if step is None:
            step = self._step
        return 0 if not step else self.factor * \
            (self.model_size ** (-0.5) *
             min(step ** (-0.5), step * self.warmup ** (-1.5)))
Visualizing the schedule

optimizer = NoamOpt(
    model_size=arch_args.encoder_embed_dim,
    factor=config.lr_factor,
    warmup=config.lr_warmup,
    optimizer=torch.optim.AdamW(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9, weight_decay=0.0001))
plt.plot(np.arange(1, 100000), [optimizer.rate(i) for i in range(1, 100000)])
plt.legend([f"{optimizer.model_size}:{optimizer.warmup}"])
None
6. Training

def train_one_epoch(epoch_itr, model, task, criterion, optimizer, accum_steps=1):
    itr = epoch_itr.next_epoch_itr(shuffle=True)
    itr = iterators.GroupedIterator(itr, accum_steps)  # gradient accumulation: update every accum_steps samples

    stats = {"loss": []}
    scaler = GradScaler()  # automatic mixed precision (amp)

    model.train()
    progress = tqdm.tqdm(itr, desc=f"train epoch {epoch_itr.epoch}", leave=False)
    for samples in progress:
        model.zero_grad()
        accum_loss = 0
        sample_size = 0
        # gradient accumulation: update every accum_steps samples
        for i, sample in enumerate(samples):
            if i == 1:
                # emptying the CUDA cache after the first step can reduce the chance of OOM
                torch.cuda.empty_cache()

            sample = utils.move_to_cuda(sample, device=device)
            target = sample["target"]
            sample_size_i = sample["ntokens"]
            sample_size += sample_size_i

            # mixed precision training
            with autocast():
                net_output = model.forward(**sample["net_input"])
                lprobs = F.log_softmax(net_output[0], -1)
                loss = criterion(lprobs.view(-1, lprobs.size(-1)), target.view(-1))

                # logging
                accum_loss += loss.item()
                # back-prop
                scaler.scale(loss).backward()

        scaler.unscale_(optimizer)
        optimizer.multiply_grads(1 / (sample_size or 1.0))  # (sample_size or 1.0) handles the case of a zero gradient
        gnorm = nn.utils.clip_grad_norm_(model.parameters(), config.clip_norm)  # gradient clipping prevents gradient explosion

        scaler.step(optimizer)
        scaler.update()

        # logging
        loss_print = accum_loss / sample_size
        stats["loss"].append(loss_print)
        progress.set_postfix(loss=loss_print)
        # if config.use_wandb:
        #     wandb.log({
        #         "train/loss": loss_print,
        #         "train/grad_norm": gnorm.item(),
        #         "train/lr": optimizer.rate(),
        #         "train/sample_size": sample_size,
        #     })

    loss_print = np.mean(stats["loss"])
    logger.info(f"training loss: {loss_print:.4f}")
    return stats
7. Validation & Inference
To guard against overfitting, we run validation every epoch to measure the model's performance on data it has not seen.
- The procedure is basically the same as training, with inference added on top
- After validation, we can save the model weights
- The validation loss alone cannot describe the model's real performance, so we directly generate translations (hypotheses) with the current model and compute the BLEU score against the correct answers (references); a toy BLEU example follows this list
- We use fairseq's ready-made sequence generator to run beam search and produce the translations
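A toy standalone example (not part of the homework code) of how sacrebleu scores a hypothesis against a reference with its Chinese tokenizer:

```python
import sacrebleu

hyps = ["非常謝謝你,克里斯。"]                        # model outputs (hypotheses)
refs = ["非常謝謝你,克里斯。能有這個機會真是一大榮幸。"]  # ground-truth translations (references)

# tokenize='zh' applies sacrebleu's Chinese tokenization before computing BLEU
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize='zh')
print(bleu.score)
```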
# fairseq's beam search generator
# given the model and an input sequence, generate translations with beam search
sequence_generator = task.build_generator([model], config)

def decode(toks, dictionary):
    # convert from a Tensor to a human-readable sentence
    s = dictionary.string(
        toks.int().cpu(),
        config.post_process,
    )
    return s if s else "<unk>"

def inference_step(sample, model):
    gen_out = sequence_generator.generate([model], sample)
    srcs = []
    hyps = []
    refs = []
    for i in range(len(gen_out)):
        # for each sample, collect the input, the hypothesis and the reference; BLEU is computed later
        srcs.append(decode(
            utils.strip_pad(sample["net_input"]["src_tokens"][i], task.source_dictionary.pad()),
            task.source_dictionary,
        ))
        hyps.append(decode(
            gen_out[i][0]["tokens"],  # 0 selects the top-scoring hypothesis in the beam
            task.target_dictionary,
        ))
        refs.append(decode(
            utils.strip_pad(sample["target"][i], task.target_dictionary.pad()),
            task.target_dictionary,
        ))
    return srcs, hyps, refs
import shutil
import sacrebleu

def validate(model, task, criterion, log_to_wandb=True):
    logger.info('begin validation')
    itr = load_data_iterator(task, "valid", 1, config.max_tokens, config.num_workers).next_epoch_itr(shuffle=False)

    stats = {"loss": [], "bleu": 0, "srcs": [], "hyps": [], "refs": []}
    srcs = []
    hyps = []
    refs = []

    model.eval()
    progress = tqdm.tqdm(itr, desc=f"validation", leave=False)
    with torch.no_grad():
        for i, sample in enumerate(progress):
            # validation loss
            sample = utils.move_to_cuda(sample, device=device)
            net_output = model.forward(**sample["net_input"])

            lprobs = F.log_softmax(net_output[0], -1)
            target = sample["target"]
            sample_size = sample["ntokens"]
            loss = criterion(lprobs.view(-1, lprobs.size(-1)), target.view(-1)) / sample_size
            progress.set_postfix(valid_loss=loss.item())
            stats["loss"].append(loss)

            # run inference
            s, h, r = inference_step(sample, model)
            srcs.extend(s)
            hyps.extend(h)
            refs.extend(r)

    tok = 'zh' if task.cfg.target_lang == 'zh' else '13a'
    stats["loss"] = torch.stack(stats["loss"]).mean().item()
    stats["bleu"] = sacrebleu.corpus_bleu(hyps, [refs], tokenize=tok)  # compute the BLEU score
    stats["srcs"] = srcs
    stats["hyps"] = hyps
    stats["refs"] = refs

    if config.use_wandb and log_to_wandb:
        wandb.log({
            "valid/loss": stats["loss"],
            "valid/bleu": stats["bleu"].score,
        }, commit=False)

    showid = np.random.randint(len(hyps))
    logger.info("example source: " + srcs[showid])
    logger.info("example hypothesis: " + hyps[showid])
    logger.info("example reference: " + refs[showid])

    # show bleu results
    logger.info(f"validation loss:\t{stats['loss']:.4f}")
    logger.info(stats["bleu"].format())
    return stats
8. Saving and loading model parameters

def validate_and_save(model, task, criterion, optimizer, epoch, save=True):
    stats = validate(model, task, criterion)
    bleu = stats['bleu']
    loss = stats['loss']
    if save:
        # save epoch checkpoints
        savedir = Path(config.savedir).absolute()
        savedir.mkdir(parents=True, exist_ok=True)

        check = {
            "model": model.state_dict(),
            "stats": {"bleu": bleu.score, "loss": loss},
            "optim": {"step": optimizer._step}
        }
        torch.save(check, savedir/f"checkpoint{epoch}.pt")
        shutil.copy(savedir/f"checkpoint{epoch}.pt", savedir/f"checkpoint_last.pt")
        logger.info(f"saved epoch checkpoint: {savedir}/checkpoint{epoch}.pt")

        # save epoch samples
        with open(savedir/f"samples{epoch}.{config.source_lang}-{config.target_lang}.txt", "w") as f:
            for s, h in zip(stats["srcs"], stats["hyps"]):
                f.write(f"{s}\t{h}\n")

        # keep track of the best validation bleu
        if getattr(validate_and_save, "best_bleu", 0) < bleu.score:
            validate_and_save.best_bleu = bleu.score
            torch.save(check, savedir/f"checkpoint_best.pt")

        # delete checkpoints older than keep_last_epochs
        del_file = savedir / f"checkpoint{epoch - config.keep_last_epochs}.pt"
        if del_file.exists():
            del_file.unlink()
    return stats

def try_load_checkpoint(model, optimizer=None, name=None):
    name = name if name else "checkpoint_last.pt"
    checkpath = Path(config.savedir)/name
    if checkpath.exists():
        check = torch.load(checkpath)
        model.load_state_dict(check["model"])
        stats = check["stats"]
        step = "unknown"
        if optimizer != None:
            optimizer._step = step = check["optim"]["step"]
        logger.info(f"loaded checkpoint {checkpath}: step={step} loss={stats['loss']} bleu={stats['bleu']}")
    else:
        logger.info(f"no checkpoints found at {checkpath}!")
9. Start training

model = model.to(device=device)
criterion = criterion.to(device=device)

logger.info("task: {}".format(task.__class__.__name__))
logger.info("encoder: {}".format(model.encoder.__class__.__name__))
logger.info("decoder: {}".format(model.decoder.__class__.__name__))
logger.info("criterion: {}".format(criterion.__class__.__name__))
logger.info("optimizer: {}".format(optimizer.__class__.__name__))
logger.info("num. model params: {:,} (num. trained: {:,})".format(
    sum(p.numel() for p in model.parameters()),
    sum(p.numel() for p in model.parameters() if p.requires_grad),
))
logger.info(f"max tokens per batch = {config.max_tokens}, accumulate steps = {config.accum_steps}")

epoch_itr = load_data_iterator(task, "train", config.start_epoch, config.max_tokens, config.num_workers)
try_load_checkpoint(model, optimizer, name=config.resume)
while epoch_itr.next_epoch_idx <= config.max_epoch:
    # train for one epoch
    train_one_epoch(epoch_itr, model, task, criterion, optimizer, config.accum_steps)
    stats = validate_and_save(model, task, criterion, optimizer, epoch=epoch_itr.epoch)
    logger.info("end of epoch {}".format(epoch_itr.epoch))
    epoch_itr = load_data_iterator(task, "train", epoch_itr.next_epoch_idx, config.max_tokens, config.num_workers)
It finally runs!!!
Even though the translations are still a bit off, haha.