[Hung-yi Lee] Deep Learning: HW5 Machine Translation
Machine Translation
1. Goal
Given a passage of English, translate it into Traditional Chinese.
2. Introduction
2.1 Dataset
Training data
- TED2020: TED talks with transcriptions translated by a global community of volunteers into more than 100 languages.
- We will use the (en, zh-tw) aligned pairs.
Monolingual data
- More TED talks in Traditional Chinese.
2.2. Evaluation
How do we evaluate the model's performance?
We use BLEU.
brevity penalty: penalizes short hypotheses
c is the length of the hypothesis, r is the length of the reference
In other words, the sentence produced by the model is compared against the reference sentence; the more overlapping n-grams the two share, the higher the BLEU score.
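For reference, the standard BLEU definition (the formula image from the slide is not reproduced here) combines the modified n-gram precisions $p_n$ with the brevity penalty $BP$:

$$
BP =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\qquad
\mathrm{BLEU} = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)
$$

with uniform weights $w_n = 1/N$ and typically $N = 4$.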
2.3. Workflow
Preprocessing
- download raw data
- clean and normalize
- remove bad data (too long/short)
Training
- initialize a model
- train it with training data
Testing
- generate translations of the test data
- evaluate the performance
2.4. Training tips
- Tokenize data with subword units
  - For one, it reduces the vocabulary size
  - For another, it alleviates the open-vocabulary problem
  - example: transportation => trans port ation
- Label smoothing regularization
  - When calculating the loss, reserve some probability mass for incorrect labels
  - Avoids overfitting
- Learning rate scheduling
  - Linearly increase the lr, then decay it by the inverse square root of the number of update steps
  - Stabilizes training of Transformers in the early stage
2.5 Back-translation(BT)
Monolingual data is easy to obtain: if you want Chinese text, you can simply crawl it from the web, but not every English sentence comes with a Chinese translation. So here we translate the Chinese we do have (the monolingual data in the dataset) back into English. This back-translation gives us another training set; with more data the model gets more training and its performance may improve. (However, in the provided dataset, the monolingual data, i.e. test/test.zh, consists entirely of '。'.)
3. Code
3.1 Data preprocessing
Datasets
TED2020 bilingual parallel corpus:
Raw: 398,066 sentence pairs
Processed: 393,980 sentence pairs
Test data:
Size: 4,000 sentences
The Chinese translations are not publicly released; every line is a '。'
- Steps
- Download and extract the archives
- Rename the files
# Download the archives and extract them
data_dir = './DATA/rawdata'
dataset_name = 'ted2020'
urls = (
    '"https://onedrive.live.com/download?cid=3E549F3B24B238B4&resid=3E549F3B24B238B4%214989&authkey=AGgQ-DaR8eFSl1A"',
    '"https://onedrive.live.com/download?cid=3E549F3B24B238B4&resid=3E549F3B24B238B4%214987&authkey=AA4qP_azsicwZZM"',
    # # If the above links die, use the following instead.
    # "https://www.csie.ntu.edu.tw/~r09922057/ML2021-hw5/ted2020.tgz",
    # "https://www.csie.ntu.edu.tw/~r09922057/ML2021-hw5/test.tgz",
    # # If the above links die, use the following instead.
    # "https://mega.nz/#!vEcTCISJ!3Rw0eHTZWPpdHBTbQEqBDikDEdFPr7fI8WxaXK9yZ9U",
    # "https://mega.nz/#!zNcnGIoJ!oPJX9AvVVs11jc0SaK6vxP_lFUNTkEcK2WbxJpvjU5Y",
)
file_names = (
    'ted2020.tgz', # train & dev
    'test.tgz',    # test
)
prefix = Path(data_dir).absolute() / dataset_name
prefix.mkdir(parents=True, exist_ok=True)

for u, f in zip(urls, file_names):
    path = prefix/f
    if not path.exists():
        if 'mega' in u:
            !megadl {u} --path {path}
        else:
            !wget {u} -O {path}
    if path.suffix == ".tgz":
        !tar -xvf {path} -C {prefix}
    elif path.suffix == ".zip":
        !unzip -o {path} -d {prefix}
# Rename the files with the train_dev/test prefixes
!mv {prefix/'raw.en'} {prefix/'train_dev.raw.en'}
!mv {prefix/'raw.zh'} {prefix/'train_dev.raw.zh'}
!mv {prefix/'test.en'} {prefix/'test.raw.en'}
!mv {prefix/'test.zh'} {prefix/'test.raw.zh'}

# Set the language pair
src_lang = 'en'
tgt_lang = 'zh'

data_prefix = f'{prefix}/train_dev.raw'
test_prefix = f'{prefix}/test.raw'

!head {data_prefix+'.'+src_lang} -n 5
!head {data_prefix+'.'+tgt_lang} -n 5
Thank you so much, Chris.
And it’s truly a great honor to have the opportunity to come to this stage twice; I’m extremely grateful.
I have been blown away by this conference, and I want to thank all of you for the many nice comments about what I had to say the other night.
And I say that sincerely, partly because I need that.
Put yourselves in my position.
非常謝謝你,克里斯。能有這個機會第二度踏上這個演講台
真是一大榮幸。我非常感激。
這個研討會給我留下了極為深刻的印象,我想感謝大家 對我之前演講的好評。
我是由衷的想這麼說,有部份原因是因為 —— 我真的有需要!
請你們設身處地為我想一想!
- Steps
- Convert full-width characters in the strings to half-width
- Separate special characters from the surrounding text with spaces
- Remove or replace some special characters
# Remove or replace some special characters
def clean_s(s, lang):
    if lang == 'en':
        s = re.sub(r"\([^()]*\)", "", s) # remove ([text])
        s = s.replace('-', '') # remove '-'
        s = re.sub('([.,;!?()\"])', r' \1 ', s) # keep punctuation
    elif lang == 'zh':
        s = strQ2B(s) # full-width to half-width (strQ2B is defined earlier in the notebook)
        s = re.sub(r"\([^()]*\)", "", s) # remove ([text])
        s = s.replace(' ', '')
        s = s.replace('—', '')
        s = s.replace('“', '"')
        s = s.replace('”', '"')
        s = s.replace('_', '')
        s = re.sub('([。,;!?()\"~「」])', r' \1 ', s) # keep punctuation
    s = ' '.join(s.strip().split()) # collapse whitespace into single spaces
    return s

def len_s(s, lang):
    if lang == 'zh':
        return len(s)
    return len(s.split())

# After cleaning, the files are: train_dev.raw.clean.en, train_dev.raw.clean.zh, test.raw.clean.en, test.raw.clean.zh
def clean_corpus(prefix, l1, l2, ratio=9, max_len=1000, min_len=1):
    if Path(f'{prefix}.clean.{l1}').exists() and Path(f'{prefix}.clean.{l2}').exists():
        print(f'{prefix}.clean.{l1} & {l2} exists. skipping clean.')
        return
    with open(f'{prefix}.{l1}', 'r', encoding='utf-8') as l1_in_f:
        with open(f'{prefix}.{l2}', 'r', encoding='utf-8') as l2_in_f:
            with open(f'{prefix}.clean.{l1}', 'w', encoding='utf-8') as l1_out_f:
                with open(f'{prefix}.clean.{l2}', 'w', encoding='utf-8') as l2_out_f:
                    for s1 in l1_in_f:
                        s1 = s1.strip()
                        s2 = l2_in_f.readline().strip()
                        s1 = clean_s(s1, l1)
                        s2 = clean_s(s2, l2)
                        s1_len = len_s(s1, l1)
                        s2_len = len_s(s2, l2)
                        if min_len > 0: # remove short sentences
                            if s1_len < min_len or s2_len < min_len:
                                continue
                        if max_len > 0: # remove long sentences
                            if s1_len > max_len or s2_len > max_len:
                                continue
                        if ratio > 0: # remove pairs whose length ratio is too large
                            if s1_len / s2_len > ratio or s2_len / s1_len > ratio:
                                continue
                        print(s1, file=l1_out_f)
                        print(s2, file=l2_out_f)

clean_corpus(data_prefix, src_lang, tgt_lang)
clean_corpus(test_prefix, src_lang, tgt_lang, ratio=-1, min_len=-1, max_len=-1)
Thank you so much , Chris .
And it’s truly a great honor to have the opportunity to come to this stage twice ; I’m extremely grateful .
I have been blown away by this conference , and I want to thank all of you for the many nice comments about what I had to say the other night .
And I say that sincerely , partly because I need that .
Put yourselves in my position .
非常謝謝你 , 克里斯 。 能有這個機會第二度踏上這個演講台
真是一大榮幸 。 我非常感激 。
這個研討會給我留下了極為深刻的印象 , 我想感謝大家對我之前演講的好評 。
我是由衷的想這麼說 , 有部份原因是因為我真的有需要 !
請你們設身處地為我想一想 !
3.2 Splitting into training and validation sets
The validation set only needs about 3,000-4,000 sentences. We split the processed training data and write the results to train.clean.en/.zh and valid.clean.en/.zh.
# Split into training and validation sets
# about 3000~4000 sentences are enough for validation
valid_ratio = 0.01
train_ratio = 1 - valid_ratio
data_dir = './data'
dataset_name = 'prefix'

# The split produces train.clean.en, train.clean.zh, valid.clean.en and valid.clean.zh
if Path(f'{prefix}/train.clean.{src_lang}').exists() \
        and Path(f'{prefix}/train.clean.{tgt_lang}').exists() \
        and Path(f'{prefix}/valid.clean.{src_lang}').exists() \
        and Path(f'{prefix}/valid.clean.{tgt_lang}').exists():
    print(f'train/valid splits exists. skipping split.')
else:
    line_num = sum(1 for line in open(f'{data_prefix}.clean.{src_lang}', encoding='utf-8'))
    labels = list(range(line_num))
    random.shuffle(labels)
    for lang in [src_lang, tgt_lang]:
        train_f = open(os.path.join(data_dir, dataset_name, f'train.clean.{lang}'), 'w', encoding='utf-8')
        valid_f = open(os.path.join(data_dir, dataset_name, f'valid.clean.{lang}'), 'w', encoding='utf-8')
        count = 0
        for line in open(f'{data_prefix}.clean.{lang}', 'r', encoding='utf-8'):
            if labels[count] / line_num < train_ratio:
                train_f.write(line)
            else:
                valid_f.write(line)
            count += 1
        train_f.close()
        valid_f.close()
The result of the split is the four files train.clean.en/.zh and valid.clean.en/.zh.
3.3 Subword units (tokenization)
A major problem in translation is out-of-vocabulary words, which can be mitigated by using subword units as the modeling unit.
- Use the sentencepiece package
- With either the unigram or the byte-pair encoding (BPE) algorithm
# Subword units
# Train a subword model on the training and validation sets with sentencepiece.
# The model is saved as spm8000.model and also produces the vocabulary spm8000.vocab.
# Then apply the model to the training, validation and test sets to obtain
# train.en, train.zh, valid.en, valid.zh, test.en, test.zh
import sentencepiece as spm
vocab_size = 8000
if Path(f'{prefix}/spm{vocab_size}.model').exists():
    print(f'{prefix}/spm{vocab_size}.model exists. skipping spm_train.')
else:
    spm.SentencePieceTrainer.train(
        input=','.join([f'{prefix}/train.clean.{src_lang}',
                        f'{prefix}/valid.clean.{src_lang}',
                        f'{prefix}/train.clean.{tgt_lang}',
                        f'{prefix}/valid.clean.{tgt_lang}']),
        model_prefix=f'{prefix}/spm{vocab_size}',
        vocab_size=vocab_size,
        character_coverage=1,
        model_type='unigram', # 'bpe' also works
        input_sentence_size=1e6,
        shuffle_input_sentence=True,
        normalization_rule_name='nmt_nfkc_cf',
    )

spm_model = spm.SentencePieceProcessor(model_file=str(f'{prefix}/spm{vocab_size}.model'))
in_tag = {
    'train': 'train.clean',
    'valid': 'valid.clean',
    'test': 'test.raw.clean',
}
for split in ['train', 'valid', 'test']:
    for lang in [src_lang, tgt_lang]:
        out_path = Path(f'{prefix}/{split}.{lang}')
        if out_path.exists():
            print(f"{out_path} exists. skipping spm_encode.")
        else:
            with open(f'{prefix}/{split}.{lang}', 'w', encoding='utf-8') as out_f:
                with open(f'{prefix}/{in_tag[split]}.{lang}', 'r', encoding='utf-8') as in_f:
                    for line in in_f:
                        line = line.strip()
                        tok = spm_model.encode(line, out_type=str)
                        print(' '.join(tok), file=out_f)
Training the subword model also produces the vocabulary file spm8000.vocab.
The tokenized train.en and the corresponding train.zh look like this:
▁thank ▁you ▁so ▁much ▁, ▁chris ▁.
▁and ▁it ’ s ▁ t ru ly ▁a ▁great ▁ho n or ▁to ▁have ▁the ▁ op port un ity ▁to ▁come ▁to ▁this ▁st age ▁ t wi ce ▁; ▁i ’ m ▁ex t re me ly ▁gr ate ful ▁.
▁i ▁have ▁been ▁ bl ow n ▁away ▁by ▁this ▁con fer ence ▁, ▁and ▁i ▁want ▁to ▁thank ▁all ▁of ▁you ▁for ▁the ▁many ▁ ni ce ▁ com ment s ▁about ▁what ▁i ▁had ▁to ▁say ▁the ▁other ▁night ▁.
▁and ▁i ▁say ▁that ▁since re ly ▁, ▁part ly ▁because ▁i ▁need ▁that ▁.
▁put ▁your s el ve s ▁in ▁my ▁po s ition ▁.
▁ 非常 謝 謝 你 ▁, ▁ 克 里 斯 ▁。 ▁ 能 有 這個 機會 第二 度 踏 上 這個 演講 台
▁ 真 是 一 大 榮 幸 ▁。 ▁我 非常 感 激 ▁。
▁這個 研 討 會 給我 留 下 了 極 為 深 刻 的 印 象 ▁, ▁我想 感 謝 大家 對我 之前 演講 的 好 評 ▁。
▁我 是由 衷 的 想 這麼 說 ▁, ▁有 部份 原因 是因為 我 真的 有 需要 ▁!
▁ 請 你們 設 身 處 地 為 我想 一 想 ▁!
3.4 Binarizing the data with fairseq
The cell below is meant to be run in Jupyter/Colab (it uses `!` shell escapes); outside a notebook, run the equivalent fairseq_cli.preprocess command in a shell.
# Binarize the data with fairseq; the resulting files are written to ./data/data_bin
binpath = Path('./data/data_bin')
if binpath.exists():
    print(binpath, "exists, will not overwrite!")
else:
    !python -m fairseq_cli.preprocess \
        --source-lang en \
        --target-lang zh \
        --trainpref ./data/prefix/train \
        --validpref ./data/prefix/valid \
        --testpref ./data/prefix/test \
        --destdir ./data/data_bin \
        --joined-dictionary \
        --workers 2

This generates a set of files under the data_bin directory.
4. Experiment setup
4.1 Configuration
config = Namespace(
    datadir = "./data/data_bin",
    savedir = "./checkpoints/rnn",
    source_lang = "en",
    target_lang = "zh",
    # cpu threads when fetching & processing data.
    num_workers=2,
    # batch size in terms of tokens. gradient accumulation increases the effective batch size.
    max_tokens=8192,
    accum_steps=2,
    # the lr is calculated from the Noam lr scheduler. you can tune the maximum lr by this factor.
    lr_factor=2.,
    lr_warmup=4000,
    # clipping gradient norm helps alleviate gradient exploding
    clip_norm=1.0,
    # maximum epochs for training
    max_epoch=30,
    start_epoch=1,
    # beam size for beam search
    beam=5,
    # generate sequences of maximum length ax + b, where x is the source length
    max_len_a=1.2,
    max_len_b=10,
    # when decoding, post process sentence by removing sentencepiece symbols.
    post_process = "sentencepiece",
    # checkpoints
    keep_last_epochs=5,
    resume=None, # if resume from checkpoint name (under config.savedir)
    # logging
    use_wandb=False,
)
4.2 logging
# The logging package records general messages; wandb records the training loss, BLEU, model weights, etc.
logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level="INFO", # "DEBUG" "WARNING" "ERROR"
    stream=sys.stdout,
)
proj = "hw5.seq2seq"
logger = logging.getLogger(proj)
if config.use_wandb:
    import wandb
    wandb.init(project=proj, name=Path(config.savedir).stem, config=config)
4.3 CUDA environment
cuda_env = utils.CudaEnvironment()
utils.CudaEnvironment.pretty_print_cuda_env_list([cuda_env])
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
4.4 Loading the dataset
We borrow fairseq's TranslationTask:
- It loads the binarized data created above
- It provides a well-implemented data iterator (dataloader)
- The dictionaries task.source_dictionary and task.target_dictionary are also handy
- It has beam search already implemented
from fairseq.tasks.translation import TranslationConfig, TranslationTask

## setup task
task_cfg = TranslationConfig(
    data=config.datadir,
    source_lang=config.source_lang,
    target_lang=config.target_lang,
    train_subset="train",
    required_seq_len_multiple=8,
    dataset_impl="mmap",
    upsample_primary=1,
)
task = TranslationTask.setup_task(task_cfg)

logger.info("loading data for epoch 1")
task.load_dataset(split="train", epoch=1, combine=True) # combine if you have back-translation data.
task.load_dataset(split="valid", epoch=1)

sample = task.dataset("valid")[1]
pprint.pprint(sample)
pprint.pprint(
    "Source: " + \
    task.source_dictionary.string(
        sample['source'],
        config.post_process,
    )
)
pprint.pprint(
    "Target: " + \
    task.target_dictionary.string(
        sample['target'],
        config.post_process,
    )
)
Output:

{'id': 1,
 'source': tensor([  18,   14,    6, 2234,   60,   19,   80,    5,  256,   16,  405, 1407,
        1706,    7,    2]),
 'target': tensor([ 140,  690,   28,  270,   45,  151, 1142,  660,  606,  369, 3114, 2434,
        1434,  192,    2])}
"Source: that's exactly what i do optical mind control ."
'Target: 這實在就是我所做的–光學操控思想'
4.5 Dataset iterator
- Controls each batch to contain at most N tokens, which uses GPU memory more efficiently
- Shuffles the training set differently for every epoch
- Filters out sentences that are too long
- Pads all sentences in a batch to the same length, enabling parallel GPU computation
- Adds eos and shifts the sequence by one position
  - teacher forcing: to train the model to predict the next token given a prefix, the decoder input is the output target sequence shifted one position to the right.
  - Usually a bos token is prepended to the decoder input; fairseq instead moves the eos to the beginning, which trains about equally well. For example (a small tensor sketch follows the example):

Output target (target) and decoder input (prev_output_tokens), with eos = 2:
  target             = 419, 711, 238, 888, 792, 60, 968, 8, 2
  prev_output_tokens = 2, 419, 711, 238, 888, 792, 60, 968, 8
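A minimal sketch (standalone, not part of the homework code) of this eos-rotation on a plain PyTorch tensor, with eos = 2 as in the example above:

```python
import torch

eos = 2
target = torch.tensor([419, 711, 238, 888, 792, 60, 968, 8, eos])

# fairseq-style shift: rotate the trailing eos to the front instead of prepending bos
prev_output_tokens = torch.roll(target, shifts=1)

print(prev_output_tokens)  # tensor([  2, 419, 711, 238, 888, 792,  60, 968,   8])
```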
def load_data_iterator(task, split, epoch=1, max_tokens=4000, num_workers=1, cached=True):
    batch_iterator = task.get_batch_iterator(
        dataset=task.dataset(split),
        max_tokens=max_tokens,
        max_sentences=None,
        max_positions=utils.resolve_max_positions(
            task.max_positions(),
            max_tokens,
        ),
        ignore_invalid_inputs=True,
        seed=seed,
        num_workers=num_workers,
        epoch=epoch,
        disable_iterator_cache=not cached,
        # Set this to False to speed up. However, if set to False, changing max_tokens beyond
        # first call of this method has no effect.
    )
    return batch_iterator

if __name__ == '__main__':
    demo_epoch_obj = load_data_iterator(task, "valid", epoch=1, max_tokens=20, num_workers=1, cached=False)
    demo_iter = demo_epoch_obj.next_epoch_itr(shuffle=True)
    sample = next(demo_iter)
    print(sample)
The output, with an explanation of each field:

{'id': tensor([723]),  # id of each example
 'nsentences': 1,      # batch size in sentences
 'ntokens': 18,        # batch size in tokens
 'net_input': {
     'src_tokens': tensor([[   1,    1,    1,    1,    1,   18,   26,   82,    8,  480,   15,  651,
         1361,   38,    6,  176, 2696,   39,    5,  822,   92,  260,    7,    2]]),  # source-language token sequence
     'src_lengths': tensor([19]),  # length of each sentence before padding
     'prev_output_tokens': tensor([[   2,  140,  296,  318, 1560,   51,  568,  316,  225, 1952,  254,   78,
          151, 2691,    9,  215, 1680,   10,    1,    1,    1,    1,    1,    1]])  # the shifted target sequence described above
 },
 'target': tensor([[ 140,  296,  318, 1560,   51,  568,  316,  225, 1952,  254,   78,  151,
         2691,    9,  215, 1680,   10,    2,    1,    1,    1,    1,    1,    1]])  # the target sequence
}
5. Model architecture
- We subclass fairseq's Encoder, Decoder and Model classes, so that at test time we can directly use the beam search functions it already provides.
5.1 Encoder
The encoder of a seq2seq model is an RNN or a Transformer encoder; the explanation below uses an RNN as the example.
For each input token, the encoder outputs a vector and a hidden state, and carries the hidden state over to the next timestep. In other words, the encoder reads the input sequence step by step, outputs a single vector at every timestep, and outputs its hidden state (the context vector) at the final timestep.
Below are some notes on the GRU used in this homework.

This homework uses a GRU; its inputs and outputs are as follows.

The inputs are input and h_0.

Inputs: input, h_0

① shape of input

The shape of input is (seq_len, batch, input_size): a tensor containing the features of the input sequence. The input can also be a packed variable-length sequence; see torch.nn.utils.rnn.pack_padded_sequence for details.

② shape of h_0

As the description below shows, this argument is optional and defaults to zeros.

The shape of h_0 is (num_layers * num_directions, batch, hidden_size): a tensor containing the initial hidden state for each element in the batch. Defaults to zero if not provided. If the RNN is bidirectional, num_directions should be 2, else it should be 1.

The outputs are output and h_n.

① output

The shape of output is (seq_len, batch, num_directions * hidden_size): a tensor containing the output features h_t from the last layer of the GRU, for each t.

If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence. For the unpacked case, the directions can be separated using output.view(seq_len, batch, num_directions, hidden_size), with forward and backward being direction 0 and 1 respectively. Similarly, the directions can be separated in the packed case.

② h_n

The shape of h_n is (num_layers * num_directions, batch, hidden_size): a tensor containing the hidden state for t = seq_len.

Like output, the layers can be separated using h_n.view(num_layers, num_directions, batch, hidden_size).

When bidirectional=True, h_n contains both the final forward and the final backward hidden states, which the encoder below concatenates.

The shape bookkeeping is roughly as in the sketch below.
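A quick standalone shape check (a sketch with illustrative sizes, not the homework's actual configuration) for a bidirectional GRU like the one the encoder uses:

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size, num_layers = 7, 3, 256, 512, 1
rnn = nn.GRU(input_size, hidden_size, num_layers, bidirectional=True)

x = torch.randn(seq_len, batch, input_size)            # T x B x C
h0 = torch.zeros(2 * num_layers, batch, hidden_size)   # (num_layers * num_directions) x B x H

output, h_n = rnn(x, h0)
print(output.shape)  # torch.Size([7, 3, 1024]) = seq_len x batch x (num_directions * hidden_size)
print(h_n.shape)     # torch.Size([2, 3, 512])  = (num_layers * num_directions) x batch x hidden_size
```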
# Model architecture
# We subclass fairseq's Encoder, Decoder and Model classes
class RNNEncoder(FairseqEncoder):
    def __init__(self, args, dictionary, embed_tokens):
        '''
        :param args:
            encoder_embed_dim: embedding dimension; compresses the one-hot word vectors to this size
            encoder_ffn_embed_dim: dimension of the RNN outputs and hidden states (hidden dimension)
            encoder_layers: number of stacked RNN layers
            dropout: probability of zeroing out a unit, mainly to prevent overfitting; generally used during training
        :param dictionary: the dictionary provided by fairseq; used here to get the padding index, and in turn the encoder padding mask
        :param embed_tokens: the pre-built token embeddings (nn.Embedding)
        '''
        super().__init__(dictionary)
        self.embed_tokens = embed_tokens

        self.embed_dim = args.encoder_embed_dim
        self.hidden_dim = args.encoder_ffn_embed_dim
        self.num_layers = args.encoder_layers

        self.dropout_in_module = nn.Dropout(args.dropout)
        self.rnn = nn.GRU(
            self.embed_dim,
            self.hidden_dim,
            self.num_layers,
            dropout=args.dropout,
            batch_first=False,
            bidirectional=True,
        )
        self.dropout_out_module = nn.Dropout(args.dropout)

        self.padding_idx = dictionary.pad()

    def combine_bidir(self, outs, bsz: int):
        out = outs.view(self.num_layers, 2, bsz, -1).transpose(1, 2).contiguous()
        return out.view(self.num_layers, bsz, -1)

    def forward(self, src_tokens, **unused):
        '''
        :param src_tokens: the source (English) token id sequence
        :return:
            outputs: the top RNN layer's output at every timestep, which can later be processed with attention
            final_hiddens: the final hidden state of every layer, passed to the decoder for decoding
            encoder_padding_mask: tells us which positions are padding and carry no information
        '''
        bsz, seqlen = src_tokens.size()

        # get embeddings
        x = self.embed_tokens(src_tokens)
        x = self.dropout_in_module(x)

        # B x T x C => T x B x C
        x = x.transpose(0, 1)

        # pass through the bidirectional RNN
        h0 = x.new_zeros(2 * self.num_layers, bsz, self.hidden_dim)
        x, final_hiddens = self.rnn(x, h0)
        outputs = self.dropout_out_module(x)
        # outputs = [sequence len, batch size, hid dim * directions]: output of the top RNN layer
        # hidden  = [num_layers * directions, batch size, hid dim]

        # since the encoder is bidirectional, concatenate the hidden states of both directions
        final_hiddens = self.combine_bidir(final_hiddens, bsz)
        # hidden = [num_layers x batch x num_directions*hidden]

        encoder_padding_mask = src_tokens.eq(self.padding_idx).t()
        return tuple(
            (
                outputs,              # seq_len x batch x hidden
                final_hiddens,        # num_layers x batch x num_directions*hidden
                encoder_padding_mask, # seq_len x batch
            )
        )

    def reorder_encoder_out(self, encoder_out, new_order):
        return tuple(
            (
                encoder_out[0].index_select(1, new_order),
                encoder_out[1].index_select(1, new_order),
                encoder_out[2].index_select(1, new_order),
            )
        )
5.2 Attention
- When the input is too long, or the context vector alone cannot capture the meaning of the whole input, the Attention Mechanism gives the decoder more information.
- Based on the current decoder embeddings, compute how strongly each position of the encoder outputs is related to them, and use these scores to average the encoder outputs as an additional input to the decoder RNN.
- A common form of attention uses a neural network / dot product to compute the relation between the query (decoder embeddings) and the keys (encoder outputs), applies a softmax over the scores to get a distribution, and finally takes a weighted sum of the values (encoder outputs) according to that distribution; see the formulas below.
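In matrix form, the AttentionLayer implemented below computes, with $Q$ the decoder states (queries), $S$ the encoder outputs (keys and values), and $W_{in}$, $W_{out}$ its two linear layers:

$$
\tilde{Q} = Q W_{in}, \qquad
\alpha = \mathrm{softmax}\bigl(\tilde{Q} S^{\top}\bigr), \qquad
c = \alpha S, \qquad
\mathrm{output} = \tanh\bigl([\,c\,;\,Q\,]\, W_{out}\bigr)
$$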
# Attention
class AttentionLayer(nn.Module):
    def __init__(self, input_embed_dim, source_embed_dim, output_embed_dim, bias=False):
        '''
        :param input_embed_dim: dimension of the query, i.e. the decoder vectors doing the attending
        :param source_embed_dim: dimension of the keys/values, i.e. the vectors being attended to (encoder outputs)
        :param output_embed_dim: dimension expected by the next layer after attention
        :param bias: whether the linear projections use a bias
        '''
        super().__init__()

        self.input_proj = nn.Linear(input_embed_dim, source_embed_dim, bias=bias)
        self.output_proj = nn.Linear(input_embed_dim + source_embed_dim, output_embed_dim, bias=bias)

    def forward(self, inputs, encoder_outputs, encoder_padding_mask):
        '''
        :param inputs: the query, i.e. the decoder states that attend to the encoder
        :param encoder_outputs: the keys/values, i.e. the vectors being attended to
        :param encoder_padding_mask: tells us which positions are padding and carry no information
        :return:
            output: the context vectors after attention
            attention score: the attention distribution
        '''
        # inputs: T, B, dim
        # encoder_outputs: S x B x dim
        # padding mask: S x B

        # convert all to batch first
        inputs = inputs.transpose(1, 0)                            # B, T, dim
        encoder_outputs = encoder_outputs.transpose(1, 0)          # B, S, dim
        encoder_padding_mask = encoder_padding_mask.transpose(1, 0) # B, S

        # project the query to the dimension of encoder_outputs
        x = self.input_proj(inputs)

        # (B, T, dim) x (B, dim, S) = (B, T, S)
        attn_scores = torch.bmm(x, encoder_outputs.transpose(1, 2))

        # block attention at padding positions
        if encoder_padding_mask is not None:
            # broadcast: B, S -> (B, 1, S)
            encoder_padding_mask = encoder_padding_mask.unsqueeze(1)
            attn_scores = (
                attn_scores.float()
                .masked_fill_(encoder_padding_mask, float("-inf"))  # padded positions get zero weight after softmax
                .type_as(attn_scores)                               # cast back to the original dtype
            )

        # softmax over the source dimension
        attn_scores = F.softmax(attn_scores, dim=-1)

        # weighted average: (B, T, S) x (B, S, dim) = (B, T, dim)
        x = torch.bmm(attn_scores, encoder_outputs)

        # (B, T, dim)
        x = torch.cat((x, inputs), dim=-1)
        x = torch.tanh(self.output_proj(x))  # concat + linear + tanh

        # restore the shape: (B, T, dim) -> (T, B, dim)
        return x.transpose(1, 0), attn_scores
5.3 Decoder
- The decoder's hidden states are initialized from the encoder's final hidden states
- At each timestep the decoder takes the previous timestep's output as input, updates its hidden states, and produces an output
- Adding attention improves its performance
- We write the seq2seq decoding steps inside the decoder, so that later the Seq2Seq class works with both RNNs and Transformers without modification
# Decoder
class RNNDecoder(FairseqIncrementalDecoder):
    def __init__(self, args, dictionary, embed_tokens):
        super().__init__(dictionary)
        self.embed_tokens = embed_tokens

        assert args.decoder_layers == args.encoder_layers, f"""seq2seq rnn requires that encoder
        and decoder have same layers of rnn. got: {args.encoder_layers, args.decoder_layers}"""
        assert args.decoder_ffn_embed_dim == args.encoder_ffn_embed_dim * 2, f"""seq2seq-rnn requires
        that decoder hidden to be 2*encoder hidden dim. got: {args.decoder_ffn_embed_dim, args.encoder_ffn_embed_dim * 2}"""

        self.embed_dim = args.decoder_embed_dim
        self.hidden_dim = args.decoder_ffn_embed_dim
        self.num_layers = args.decoder_layers

        self.dropout_in_module = nn.Dropout(args.dropout)
        self.rnn = nn.GRU(
            self.embed_dim,
            self.hidden_dim,
            self.num_layers,
            dropout=args.dropout,
            batch_first=False,
            bidirectional=False,
        )
        self.attention = AttentionLayer(
            self.embed_dim, self.hidden_dim, self.embed_dim, bias=False
        )
        # self.attention = None
        self.dropout_out_module = nn.Dropout(args.dropout)

        if self.hidden_dim != self.embed_dim:
            self.project_out_dim = nn.Linear(self.hidden_dim, self.embed_dim)
        else:
            self.project_out_dim = None

        if args.share_decoder_input_output_embed:
            self.output_projection = nn.Linear(
                self.embed_tokens.weight.shape[1],
                self.embed_tokens.weight.shape[0],
                bias=False,
            )
            self.output_projection.weight = self.embed_tokens.weight
        else:
            self.output_projection = nn.Linear(self.embed_dim, len(dictionary), bias=False)
            nn.init.normal_(self.output_projection.weight, mean=0, std=self.embed_dim ** -0.5)

    def forward(self, prev_output_tokens, encoder_out, incremental_state=None, **unused):
        # take out the encoder outputs
        encoder_outputs, encoder_hiddens, encoder_padding_mask = encoder_out
        # outputs:          seq_len x batch x num_directions*hidden
        # encoder_hiddens:  num_layers x batch x num_directions*encoder_hidden
        # padding_mask:     seq_len x batch

        if incremental_state is not None and len(incremental_state) > 0:
            # if we kept the information from the previous timestep, we can continue from there
            # instead of starting over from bos
            prev_output_tokens = prev_output_tokens[:, -1:]
            cache_state = self.get_incremental_state(incremental_state, "cached_state")
            prev_hiddens = cache_state["prev_hiddens"]
        else:
            # no incremental state: this is training, or the first step at test time
            # prepare seq2seq: pass the encoder_hiddens into the decoder's hidden states
            prev_hiddens = encoder_hiddens

        bsz, seqlen = prev_output_tokens.size()

        # embed tokens
        x = self.embed_tokens(prev_output_tokens)
        x = self.dropout_in_module(x)

        # B x T x C -> T x B x C
        x = x.transpose(0, 1)

        # decoder-to-encoder attention
        if self.attention is not None:
            x, attn = self.attention(x, encoder_outputs, encoder_padding_mask)

        # pass through the unidirectional RNN
        x, final_hiddens = self.rnn(x, prev_hiddens)
        # outputs = [sequence len, batch size, hid dim]
        # hidden  = [num_layers * directions, batch size, hid dim]
        x = self.dropout_out_module(x)

        # project back to the embedding size (needed when hidden and embed size differ and share_embedding is True)
        if self.project_out_dim != None:
            x = self.project_out_dim(x)

        # project to the distribution over the vocabulary
        x = self.output_projection(x)

        # T x B x C -> B x T x C
        x = x.transpose(1, 0)

        # if incremental, record this timestep's hidden states so the next timestep can read them back
        cache_state = {
            "prev_hiddens": final_hiddens,
        }
        self.set_incremental_state(incremental_state, "cached_state", cache_state)

        return x, None

    def reorder_incremental_state(
        self,
        incremental_state,
        new_order,
    ):
        # used during beam search; the details are not essential here
        cache_state = self.get_incremental_state(incremental_state, "cached_state")
        prev_hiddens = cache_state["prev_hiddens"]
        prev_hiddens = [p.index_select(0, new_order) for p in prev_hiddens]
        cache_state = {
            "prev_hiddens": torch.stack(prev_hiddens),
        }
        self.set_incremental_state(incremental_state, "cached_state", cache_state)
        return
5.4 Seq2Seq
- Composed of an Encoder and a Decoder
- Receives the input and passes it to the Encoder
- Passes the Encoder outputs to the Decoder
- The Decoder decodes based on the outputs of the previous timestep and the Encoder outputs
- Once decoding is finished, returns the Decoder outputs
# Seq2Seq
class Seq2Seq(FairseqEncoderDecoderModel):
    def __init__(self, args, encoder, decoder):
        super().__init__(encoder, decoder)
        self.args = args

    def forward(self, src_tokens, src_lengths, prev_output_tokens, return_all_hiddens: bool = True):
        encoder_out = self.encoder(
            src_tokens, src_lengths=src_lengths, return_all_hiddens=return_all_hiddens
        )
        logits, extra = self.decoder(
            prev_output_tokens,
            encoder_out=encoder_out,
            src_lengths=src_lengths,
            return_all_hiddens=return_all_hiddens,
        )
        return logits, extra
5.5 Model initialization

# Build and initialize the model
def build_model(args, task):
    src_dict, tgt_dict = task.source_dictionary, task.target_dictionary

    # token embeddings
    encoder_embed_tokens = nn.Embedding(len(src_dict), args.encoder_embed_dim, src_dict.pad())
    decoder_embed_tokens = nn.Embedding(len(tgt_dict), args.decoder_embed_dim, tgt_dict.pad())

    # encoder and decoder
    encoder = RNNEncoder(args, src_dict, encoder_embed_tokens)
    decoder = RNNDecoder(args, tgt_dict, decoder_embed_tokens)

    # sequence-to-sequence model
    model = Seq2Seq(args, encoder, decoder)

    # initialization is important for seq2seq models and needs special handling
    def init_params(module):
        from fairseq.modules import MultiheadAttention
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.bias is not None:
                module.bias.data.zero_()
        if isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        if isinstance(module, MultiheadAttention):
            module.q_proj.weight.data.normal_(mean=0.0, std=0.02)
            module.k_proj.weight.data.normal_(mean=0.0, std=0.02)
            module.v_proj.weight.data.normal_(mean=0.0, std=0.02)
        if isinstance(module, nn.RNNBase):
            for name, param in module.named_parameters():
                if "weight" in name or "bias" in name:
                    param.data.uniform_(-0.1, 0.1)

    # apply the initialization
    model.apply(init_params)
    return model
5.6 Architecture hyperparameters

arch_args = Namespace(
    encoder_embed_dim=256,
    encoder_ffn_embed_dim=512,
    encoder_layers=1,
    decoder_embed_dim=256,
    decoder_ffn_embed_dim=1024,
    decoder_layers=1,
    share_decoder_input_output_embed=True,
    dropout=0.3,
)

model = build_model(arch_args, task)
logger.info(model)
Seq2Seq(
(encoder): RNNEncoder(
(embed_tokens): Embedding(8000, 256, padding_idx=1)
(dropout_in_module): Dropout(p=0.3, inplace=False)
(rnn): GRU(256, 512, dropout=0.3, bidirectional=True)
(dropout_out_module): Dropout(p=0.3, inplace=False)
)
(decoder): RNNDecoder(
(embed_tokens): Embedding(8000, 256, padding_idx=1)
(dropout_in_module): Dropout(p=0.3, inplace=False)
(rnn): GRU(256, 1024, dropout=0.3)
(attention): AttentionLayer(
(input_proj): Linear(in_features=256, out_features=1024, bias=False)
(output_proj): Linear(in_features=1280, out_features=256, bias=False)
)
(dropout_out_module): Dropout(p=0.3, inplace=False)
(project_out_dim): Linear(in_features=1024, out_features=256, bias=True)
(output_projection): Linear(in_features=256, out_features=8000, bias=False)
)
)
5.7 Optimization
Loss: Label Smoothing Regularization
- Makes the model learn to output a less peaked distribution, preventing over-confidence
- The ground truth is not always the only acceptable answer, so when computing the loss we reserve some probability mass for labels other than the correct one
- Effectively prevents overfitting (see the formula below)
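In formula form, the criterion implemented below computes, with $\varepsilon$ the smoothing value and $V$ the vocabulary size:

$$
\mathcal{L} = (1-\varepsilon)\,\bigl(-\log p_\theta(y \mid x)\bigr) + \frac{\varepsilon}{V} \sum_{k=1}^{V} \bigl(-\log p_\theta(k \mid x)\bigr)
$$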
class LabelSmoothedCrossEntropyCriterion(nn.Module):
    def __init__(self, smoothing, ignore_index=None, reduce=True):
        super().__init__()
        self.smoothing = smoothing
        self.ignore_index = ignore_index
        self.reduce = reduce

    def forward(self, lprobs, target):
        if target.dim() == lprobs.dim() - 1:
            target = target.unsqueeze(-1)
        # nll: negative log likelihood, the cross-entropy loss when the target is one-hot. Same as F.nll_loss
        nll_loss = -lprobs.gather(dim=-1, index=target)
        # reserve part of the probability for the other labels, so the cross-entropy sums the log probs of all labels
        smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
        if self.ignore_index is not None:
            pad_mask = target.eq(self.ignore_index)
            nll_loss.masked_fill_(pad_mask, 0.0)
            smooth_loss.masked_fill_(pad_mask, 0.0)
        else:
            nll_loss = nll_loss.squeeze(-1)
            smooth_loss = smooth_loss.squeeze(-1)
        if self.reduce:
            nll_loss = nll_loss.sum()
            smooth_loss = smooth_loss.sum()
        # add the loss assigned to the other labels when computing the cross-entropy
        eps_i = self.smoothing / lprobs.size(-1)
        loss = (1.0 - self.smoothing) * nll_loss + eps_i * smooth_loss
        return loss

# a smoothing value of 0.1 generally works well
criterion = LabelSmoothedCrossEntropyCriterion(
    smoothing=0.1,
    ignore_index=task.target_dictionary.pad(),
)
5.8 Optimizer: Adam + lr scheduling
Inverse square root scheduling is important for stabilizing Transformer training in the early stage, and was later also applied to RNNs. The learning rate is updated according to the formula below: it grows linearly during warmup and then decays in proportion to the inverse square root of the update step.
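In formula form (matching the NoamOpt implementation below, where $d_{\text{model}}$ is the model size, here encoder_embed_dim):

$$
lrate = factor \cdot d_{\text{model}}^{-0.5} \cdot \min\left(step^{-0.5},\; step \cdot warmup^{-1.5}\right)
$$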
class NoamOpt:
    "Optim wrapper that implements rate."
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0

    @property
    def param_groups(self):
        return self.optimizer.param_groups

    def multiply_grads(self, c):
        """Multiplies grads by a constant *c*."""
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.grad.data.mul_(c)

    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()

    def rate(self, step=None):
        "Implement `lrate` above"
        if step is None:
            step = self._step
        return 0 if not step else self.factor * \
            (self.model_size ** (-0.5) *
             min(step ** (-0.5), step * self.warmup ** (-1.5)))
Visualizing the schedule

optimizer = NoamOpt(
    model_size=arch_args.encoder_embed_dim,
    factor=config.lr_factor,
    warmup=config.lr_warmup,
    optimizer=torch.optim.AdamW(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9, weight_decay=0.0001))
plt.plot(np.arange(1, 100000), [optimizer.rate(i) for i in range(1, 100000)])
plt.legend([f"{optimizer.model_size}:{optimizer.warmup}"])
None
6. Training

def train_one_epoch(epoch_itr, model, task, criterion, optimizer, accum_steps=1):
    itr = epoch_itr.next_epoch_itr(shuffle=True)
    itr = iterators.GroupedIterator(itr, accum_steps)  # gradient accumulation: update every accum_steps samples

    stats = {"loss": []}
    scaler = GradScaler()  # automatic mixed precision (amp)

    model.train()
    progress = tqdm.tqdm(itr, desc=f"train epoch {epoch_itr.epoch}", leave=False)
    for samples in progress:
        model.zero_grad()
        accum_loss = 0
        sample_size = 0
        # gradient accumulation: update every accum_steps samples
        for i, sample in enumerate(samples):
            if i == 1:
                # emptying the CUDA cache after the first step can reduce the chance of OOM
                torch.cuda.empty_cache()

            sample = utils.move_to_cuda(sample, device=device)
            target = sample["target"]
            sample_size_i = sample["ntokens"]
            sample_size += sample_size_i

            # mixed precision training
            with autocast():
                net_output = model.forward(**sample["net_input"])
                lprobs = F.log_softmax(net_output[0], -1)
                loss = criterion(lprobs.view(-1, lprobs.size(-1)), target.view(-1))

                # logging
                accum_loss += loss.item()
                # back-prop
                scaler.scale(loss).backward()

        scaler.unscale_(optimizer)
        optimizer.multiply_grads(1 / (sample_size or 1.0))  # (sample_size or 1.0) handles the case of a zero gradient
        gnorm = nn.utils.clip_grad_norm_(model.parameters(), config.clip_norm)  # gradient clipping prevents gradient explosion

        scaler.step(optimizer)
        scaler.update()

        # logging
        loss_print = accum_loss / sample_size
        stats["loss"].append(loss_print)
        progress.set_postfix(loss=loss_print)
        # if config.use_wandb:
        #     wandb.log({
        #         "train/loss": loss_print,
        #         "train/grad_norm": gnorm.item(),
        #         "train/lr": optimizer.rate(),
        #         "train/sample_size": sample_size,
        #     })

    loss_print = np.mean(stats["loss"])
    logger.info(f"training loss: {loss_print:.4f}")
    return stats
7. Validation & Inference
To guard against overfitting, we run validation every epoch to measure the model's performance on data it has not seen.
- The procedure is basically the same as training, with inference added on top
- After validation, we can save the model weights
- The validation loss alone cannot describe the model's real performance, so we directly generate translations (hypotheses) with the current model and compute the BLEU score against the correct answers (references); a toy BLEU example follows this list
- We use fairseq's ready-made sequence generator to run beam search and produce the translations
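A toy standalone example (not part of the homework code) of how sacrebleu scores a hypothesis against a reference with its Chinese tokenizer:

```python
import sacrebleu

hyps = ["非常謝謝你,克里斯。"]                        # model outputs (hypotheses)
refs = ["非常謝謝你,克里斯。能有這個機會真是一大榮幸。"]  # ground-truth translations (references)

# tokenize='zh' applies sacrebleu's Chinese tokenization before computing BLEU
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize='zh')
print(bleu.score)
```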
# fairseq's beam search generator
# given the model and an input sequence, generate translations with beam search
sequence_generator = task.build_generator([model], config)

def decode(toks, dictionary):
    # convert from a Tensor to a human-readable sentence
    s = dictionary.string(
        toks.int().cpu(),
        config.post_process,
    )
    return s if s else "<unk>"

def inference_step(sample, model):
    gen_out = sequence_generator.generate([model], sample)
    srcs = []
    hyps = []
    refs = []
    for i in range(len(gen_out)):
        # for each sample, collect the input, the hypothesis and the reference; BLEU is computed later
        srcs.append(decode(
            utils.strip_pad(sample["net_input"]["src_tokens"][i], task.source_dictionary.pad()),
            task.source_dictionary,
        ))
        hyps.append(decode(
            gen_out[i][0]["tokens"],  # 0 selects the top-scoring hypothesis in the beam
            task.target_dictionary,
        ))
        refs.append(decode(
            utils.strip_pad(sample["target"][i], task.target_dictionary.pad()),
            task.target_dictionary,
        ))
    return srcs, hyps, refs
import shutil
import sacrebleu

def validate(model, task, criterion, log_to_wandb=True):
    logger.info('begin validation')
    itr = load_data_iterator(task, "valid", 1, config.max_tokens, config.num_workers).next_epoch_itr(shuffle=False)

    stats = {"loss": [], "bleu": 0, "srcs": [], "hyps": [], "refs": []}
    srcs = []
    hyps = []
    refs = []

    model.eval()
    progress = tqdm.tqdm(itr, desc=f"validation", leave=False)
    with torch.no_grad():
        for i, sample in enumerate(progress):
            # validation loss
            sample = utils.move_to_cuda(sample, device=device)
            net_output = model.forward(**sample["net_input"])

            lprobs = F.log_softmax(net_output[0], -1)
            target = sample["target"]
            sample_size = sample["ntokens"]
            loss = criterion(lprobs.view(-1, lprobs.size(-1)), target.view(-1)) / sample_size
            progress.set_postfix(valid_loss=loss.item())
            stats["loss"].append(loss)

            # run inference
            s, h, r = inference_step(sample, model)
            srcs.extend(s)
            hyps.extend(h)
            refs.extend(r)

    tok = 'zh' if task.cfg.target_lang == 'zh' else '13a'
    stats["loss"] = torch.stack(stats["loss"]).mean().item()
    stats["bleu"] = sacrebleu.corpus_bleu(hyps, [refs], tokenize=tok)  # compute the BLEU score
    stats["srcs"] = srcs
    stats["hyps"] = hyps
    stats["refs"] = refs

    if config.use_wandb and log_to_wandb:
        wandb.log({
            "valid/loss": stats["loss"],
            "valid/bleu": stats["bleu"].score,
        }, commit=False)

    showid = np.random.randint(len(hyps))
    logger.info("example source: " + srcs[showid])
    logger.info("example hypothesis: " + hyps[showid])
    logger.info("example reference: " + refs[showid])

    # show bleu results
    logger.info(f"validation loss:\t{stats['loss']:.4f}")
    logger.info(stats["bleu"].format())
    return stats
8. Saving and loading model parameters

def validate_and_save(model, task, criterion, optimizer, epoch, save=True):
    stats = validate(model, task, criterion)
    bleu = stats['bleu']
    loss = stats['loss']
    if save:
        # save epoch checkpoints
        savedir = Path(config.savedir).absolute()
        savedir.mkdir(parents=True, exist_ok=True)

        check = {
            "model": model.state_dict(),
            "stats": {"bleu": bleu.score, "loss": loss},
            "optim": {"step": optimizer._step}
        }
        torch.save(check, savedir/f"checkpoint{epoch}.pt")
        shutil.copy(savedir/f"checkpoint{epoch}.pt", savedir/f"checkpoint_last.pt")
        logger.info(f"saved epoch checkpoint: {savedir}/checkpoint{epoch}.pt")

        # save epoch samples
        with open(savedir/f"samples{epoch}.{config.source_lang}-{config.target_lang}.txt", "w") as f:
            for s, h in zip(stats["srcs"], stats["hyps"]):
                f.write(f"{s}\t{h}\n")

        # keep track of the best validation bleu
        if getattr(validate_and_save, "best_bleu", 0) < bleu.score:
            validate_and_save.best_bleu = bleu.score
            torch.save(check, savedir/f"checkpoint_best.pt")

        # delete checkpoints older than keep_last_epochs
        del_file = savedir / f"checkpoint{epoch - config.keep_last_epochs}.pt"
        if del_file.exists():
            del_file.unlink()
    return stats

def try_load_checkpoint(model, optimizer=None, name=None):
    name = name if name else "checkpoint_last.pt"
    checkpath = Path(config.savedir)/name
    if checkpath.exists():
        check = torch.load(checkpath)
        model.load_state_dict(check["model"])
        stats = check["stats"]
        step = "unknown"
        if optimizer != None:
            optimizer._step = step = check["optim"]["step"]
        logger.info(f"loaded checkpoint {checkpath}: step={step} loss={stats['loss']} bleu={stats['bleu']}")
    else:
        logger.info(f"no checkpoints found at {checkpath}!")
9. Start training

model = model.to(device=device)
criterion = criterion.to(device=device)

logger.info("task: {}".format(task.__class__.__name__))
logger.info("encoder: {}".format(model.encoder.__class__.__name__))
logger.info("decoder: {}".format(model.decoder.__class__.__name__))
logger.info("criterion: {}".format(criterion.__class__.__name__))
logger.info("optimizer: {}".format(optimizer.__class__.__name__))
logger.info("num. model params: {:,} (num. trained: {:,})".format(
    sum(p.numel() for p in model.parameters()),
    sum(p.numel() for p in model.parameters() if p.requires_grad),
))
logger.info(f"max tokens per batch = {config.max_tokens}, accumulate steps = {config.accum_steps}")

epoch_itr = load_data_iterator(task, "train", config.start_epoch, config.max_tokens, config.num_workers)
try_load_checkpoint(model, optimizer, name=config.resume)
while epoch_itr.next_epoch_idx <= config.max_epoch:
    # train for one epoch
    train_one_epoch(epoch_itr, model, task, criterion, optimizer, config.accum_steps)
    stats = validate_and_save(model, task, criterion, optimizer, epoch=epoch_itr.epoch)
    logger.info("end of epoch {}".format(epoch_itr.epoch))
    epoch_itr = load_data_iterator(task, "train", epoch_itr.next_epoch_idx, config.max_tokens, config.num_workers)
It finally runs!!!
Even though the translations are still a bit off, haha.