
Implementing Named Entity Recognition with huggingface.transformers.AutoModelForTokenClassification

诸神缄默不语 - personal CSDN blog post index
Notes on the huggingface transformers package documentation (continuously updated…)

This post walks through using AutoModelForTokenClassification to fine-tune a BERT-family model (DistilBERT here) on a typical sequence labeling task, namely named entity recognition (NER).
It mainly follows the official Hugging Face tutorial: Token classification.

The example in this post uses an English dataset and trains with transformers.Trainer; I may later add a version with Chinese data and training code written in plain PyTorch.
Plain PyTorch training isn't hard anyway; you can adapt it the same way as in the text classification post: fine-tuning a pretrained model for text classification with huggingface.transformers.AutoModelForSequenceClassification.
The whole thing was written in VS Code's built-in Jupyter Notebook editor, so the code is split into cells.

I won't explain what sequence labeling and NER are, and I'll try not to repeat anything already covered in my earlier notes.
This post loads the pretrained model from a locally downloaded checkpoint folder. The checkpoint was downloaded from: https://huggingface.co/distilbert-base-uncased
Packages to install beforehand: pip install datasets evaluate seqeval

Table of contents

  • 1. Log in to huggingface
  • 2. Dataset: WNUT 17
  • 3. Data preprocessing
  • 4. Building the evaluation metric
  • 5. Training
  • 6. Inference
    • 6.1 Using pipeline directly
    • 6.2 Inference with the model directly
  • 7. Other references used while writing this post

1. Log in to huggingface

Strictly speaking you don't need to, but log in anyway (if push_to_hub is set to True in the training section later, the model can be uploaded straight to the Hub):

from huggingface_hub import notebook_login
notebook_login()

Output:

Login successful
Your token has been saved to my_path/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store
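
If you're running a plain Python script instead of a notebook, a sketch like the following should also work (my own addition, not from the tutorial; huggingface_hub.login() and the huggingface-cli login command are the usual alternatives to notebook_login()):

from huggingface_hub import login

# "hf_xxx" is a hypothetical placeholder for your own Hub access token;
# alternatively, run `huggingface-cli login` once in a terminal.
login(token="hf_xxx")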

2. Dataset: WNUT 17

Running load_dataset() directly throws a ConnectionError on my machine, so you can follow my earlier post on fixing huggingface.datasets failing to load datasets and metrics, download the dataset locally first, and then load it:

import datasets
wnut=datasets.load_from_disk('/data/datasets_file/wnut17')

(screenshot: the loaded dataset and a sample from the training set, showing its tokens and ner_tags fields)

The labels that the numbers in ner_tags correspond to:
(screenshot of the label list; the full mapping appears again as id2label in section 5)
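
For reference, this is roughly what the screenshots were showing; a quick sketch of how to inspect the dataset (the label-names line also reappears as label_list in section 4):

print(wnut)                                               # a DatasetDict with train/validation/test splits
print(wnut["train"][0])                                   # one sample: "id", "tokens" and "ner_tags" fields
print(wnut["train"].features["ner_tags"].feature.names)   # the 13 label names: O, B-corporation, I-corporation, ...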

3. Data preprocessing

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("/data/pretrained_model/distilbert-base-uncased")

As shown above, the text is already split into words. An example of tokenizing this kind of pre-split text with DistilBERT:

example = wnut["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

Output:
(screenshot of the resulting token list, with added special tokens and subword pieces)

As you can see, special tokens have been added and words have been split into subwords, so the original label sequence no longer lines up with the token sequence. We therefore need to rebuild a label sequence that matches the tokens:
  1. Use the token-to-word mapping returned by word_ids() [1] to find each token's original word, and hence that word's original label. Only the first subword of a word keeps the label (a small word_ids() sketch follows this list).
  2. Second and later subwords, as well as special tokens, are labeled -100, which makes PyTorch ignore these tokens automatically when computing the cross-entropy loss. They still need separate handling later when computing metrics.
  (A note on this -100: the tutorial says it will be ignored by the PyTorch loss function, which surprised me at first, because the tutorial never handles -100 at the loss function, and the DistilBertForTokenClassification source code [2] clearly does nothing special about it when computing the loss either.
  So I searched around and found that... it's simply that PyTorch's CrossEntropyLoss ignores the target value -100 by default (facepalm):
  (screenshot from the PyTorch documentation [3])
  I had even asked about this on the huggingface forum, guessing it was some other mechanism; of course nobody answered [4], and in the end I had to figure it out myself.)
  3. truncation=True: truncate the text to the model's maximum length
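
As a quick illustration of step 1 (a minimal sketch, not from the tutorial), word_ids() on the single example tokenized above maps every token back to the index of the word it came from:

word_ids = tokenized_input.word_ids()   # same as tokenized_input.word_ids(batch_index=0)
for token, word_id in zip(tokens, word_ids):
    print(token, word_id)               # special tokens print None; subwords of one word share the same index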

Here is the batched preprocessing code:

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        # word_ids(): for the i-th sequence in the batch, the word id each token belongs to (None for special tokens)
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:  # Set the special tokens to -100.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
  1. is_split_into_words: the input text is given as a list of words
  2. return_offsets_mapping: return each token's (char_start, char_end) (not actually used in the code above)
  3. truncation / padding (padding isn't applied here, presumably because DataCollatorForTokenClassification takes care of it later)

Apply the batched preprocessing function to the whole dataset (the batched=True argument lets map() process multiple elements at once):

tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)

(screenshot of the resulting tokenized dataset)

To get mini-batches with plain PyTorch you would build Dataset and DataLoader objects; here you can simply use DataCollatorForTokenClassification instead, which dynamically pads each batch to the length of its longest sequence (rather than padding the whole dataset up front) and also pads the labels:

from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

(I didn't check exactly how the labels are padded, presumably with -100, which is the collator's default label_pad_token_id. In any case, if you want to preprocess the data by hand, you can start from a list whose values are all -100.)
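
To see what the collator actually does (a small sketch I added, not part of the tutorial), feed it a couple of tokenized samples of different lengths and look at the padded batch:

# Take two tokenized training samples and keep only the fields the collator needs.
features = [
    {k: tokenized_wnut["train"][i][k] for k in ("input_ids", "attention_mask", "labels")}
    for i in range(2)
]
batch = data_collator(features)
print(batch["input_ids"].shape)   # both sequences are padded to the longer of the two lengths
print(batch["labels"])            # the shorter sequence's labels are right-padded with -100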

4. Building the evaluation metric

The evaluate library: 🤗 Evaluate

import evaluate
seqeval = evaluate.load("seqeval")

Build the metric-computation function: convert the logits into predicted labels (via argmax) and drop the tokens labeled -100:

import numpy as np

label_list = wnut["train"].features["ner_tags"].feature.names  # the BIO label names

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

Documentation for compute(): https://huggingface.co/docs/evaluate/v0.4.0/en/package_reference/main_classes#evaluate.EvaluationModule.compute
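
As a quick sanity check on the output format (a toy sketch, not part of the tutorial), seqeval.compute() takes lists of label sequences and returns per-entity-type scores plus the overall_* keys used above:

toy = seqeval.compute(
    predictions=[["O", "B-person", "I-person", "O"]],
    references=[["O", "B-person", "O", "O"]],
)
print(toy["overall_precision"], toy["overall_recall"], toy["overall_f1"], toy["overall_accuracy"])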

5. Training

id2label = {
    0: "O", 1: "B-corporation", 2: "I-corporation", 3: "B-creative-work", 4: "I-creative-work",
    5: "B-group", 6: "I-group", 7: "B-location", 8: "I-location", 9: "B-person", 10: "I-person",
    11: "B-product", 12: "I-product",
}
label2id = {
    "O": 0, "B-corporation": 1, "I-corporation": 2, "B-creative-work": 3, "I-creative-work": 4,
    "B-group": 5, "I-group": 6, "B-location": 7, "I-location": 8, "B-person": 9, "I-person": 10,
    "B-product": 11, "I-product": 12,
}

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(
    "/data/pretrained_model/distilbert-base-uncased/", num_labels=13, id2label=id2label, label2id=label2id
)
  1. Define the training hyperparameters (the output folder for saving the model must be specified)
    Setting push_to_hub to True uploads the trained model to the Hub automatically, which requires being logged in to a huggingface account (the full repo name, including the namespace, can be set with the hub_model_id argument, e.g. sgugger/bert-finetuned-ner)
    The seqeval metrics are evaluated at the end of every epoch
    https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.TrainingArguments
    (not passed in this post; the first argument defaults to the model name)
  2. Pass the hyperparameters, model, datasets and metric-computation function to Trainer
    https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.Trainer
  3. Fine-tune the model
    https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.Trainer.train

(Note that this code automatically pulls in the wandb package... which, honestly, I did not expect. The project is named huggingface and the run name is the value of output_dir.)
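
If you don't want the automatic wandb logging, one way to switch it off (an assumption about your setup, not something the tutorial does) is either the report_to argument of TrainingArguments or an environment variable:

import os

os.environ["WANDB_DISABLED"] = "true"   # disable the wandb integration entirely
# alternatively, pass report_to="none" when constructing TrainingArguments below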

training_args = TrainingArguments(
    output_dir="/data/wanghuijuan/pretrained_model/my_awesome_wnut_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_wnut["train"],
    eval_dataset=tokenized_wnut["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()

The model is automatically trained on multiple GPUs. All 4 GPUs on my server are available, so each batch actually processes 64 samples.

(screenshots of the training log and the per-epoch evaluation metrics)

Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from /data/pretrained_model/my_awesome_wnut_model/checkpoint-108 (score: 0.3335016071796417).
TrainOutput(global_step=108, training_loss=0.42667872817428026, metrics={'train_runtime': 38.1861, 'train_samples_per_second': 177.761, 'train_steps_per_second': 2.828, 'total_flos': 103933773439080.0, 'train_loss': 0.42667872817428026, 'epoch': 2.0})
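
After training, the final metrics on the eval set can be printed with something like this (a sketch; Trainer.evaluate() reuses the eval_dataset and compute_metrics passed in above):

metrics = trainer.evaluate()
print(metrics)   # keys such as eval_loss, eval_precision, eval_recall, eval_f1, eval_accuracy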

I didn't bother taking screenshots of the wandb plots.

If you want to push the model to the Hub: trainer.push_to_hub() (https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.Trainer.push_to_hub)

An alternative version of the whole workflow: https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb#scrollTo=7sZOdRlRIrJd (mostly the same; the most noticeable differences are that the metrics are computed with the datasets library and that the dataset follows the more standard train/validation/test split)

6. Inference

text = "The Golden State Warriors are an American professional basketball team based in San Francisco."

(The results of my first run were broken, the model couldn't predict anything correctly, so I reran it; I'm not going to redo the screenshots in the previous section.)

6.1 Using pipeline directly

from transformers import pipeline
classifier = pipeline("ner", model="/data/pretrained_model/my_awesome_wnut_model/checkpoint-108")
classifier(text)

Warnings printed while loading:

loading configuration file /data/pretrained_model/my_awesome_wnut_model/checkpoint-108/config.json
Model config DistilBertConfig {
  "_name_or_path": "/data/pretrained_model/my_awesome_wnut_model/checkpoint-108",
  "activation": "gelu",
  "architectures": ["DistilBertForTokenClassification"],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {"0": "O", "1": "B-corporation", "2": "I-corporation", "3": "B-creative-work", "4": "I-creative-work", "5": "B-group", "6": "I-group", "7": "B-location", "8": "I-location", "9": "B-person", "10": "I-person", "11": "B-product", "12": "I-product"},
  "initializer_range": 0.02,
  "label2id": {"B-corporation": 1, "B-creative-work": 3, "B-group": 5, "B-location": 7, "B-person": 9, "B-product": 11, "I-corporation": 2, "I-creative-work": 4, "I-group": 6, "I-location": 8, "I-person": 10, "I-product": 12, "O": 0},
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.21.1",
  "vocab_size": 30522
}
loading configuration file /data/pretrained_model/my_awesome_wnut_model/checkpoint-108/config.json
loading weights file /data/pretrained_model/my_awesome_wnut_model/checkpoint-108/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForTokenClassification.
All the weights of DistilBertForTokenClassification were initialized from the model checkpoint at /data/pretrained_model/my_awesome_wnut_model/checkpoint-108.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForTokenClassification for predictions without further training.
Didn't find file /data/pretrained_model/my_awesome_wnut_model/checkpoint-108/added_tokens.json. We won't load it.
loading file /data/pretrained_model/my_awesome_wnut_model/checkpoint-108/vocab.txt
loading file /data/pretrained_model/my_awesome_wnut_model/checkpoint-108/tokenizer.json
loading file None
loading file /data/pretrained_model/my_awesome_wnut_model/checkpoint-108/special_tokens_map.json
loading file /data/pretrained_model/my_awesome_wnut_model/checkpoint-108/tokenizer_config.json
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

Output:

[{'entity': 'B-group', 'score': 0.28309435, 'index': 1, 'word': 'the', 'start': 0, 'end': 3},
 {'entity': 'B-location', 'score': 0.41352233, 'index': 2, 'word': 'golden', 'start': 4, 'end': 10},
 {'entity': 'I-location', 'score': 0.44743603, 'index': 3, 'word': 'state', 'start': 11, 'end': 16},
 {'entity': 'B-group', 'score': 0.2455212, 'index': 4, 'word': 'warriors', 'start': 17, 'end': 25},
 {'entity': 'B-location', 'score': 0.2583066, 'index': 7, 'word': 'american', 'start': 33, 'end': 41},
 {'entity': 'B-location', 'score': 0.54653203, 'index': 13, 'word': 'san', 'start': 80, 'end': 83},
 {'entity': 'B-location', 'score': 0.43548092, 'index': 14, 'word': 'francisco', 'start': 84, 'end': 93},
 {'entity': 'I-location', 'score': 0.16240601, 'index': 15, 'word': '.', 'start': 93, 'end': 94}]
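
The output above is one entry per token, so subwords and B-/I- pieces stay separate. If you'd rather get whole entity spans, the ner pipeline accepts an aggregation_strategy argument; a sketch (not in the tutorial):

classifier_grouped = pipeline(
    "ner",
    model="/data/pretrained_model/my_awesome_wnut_model/checkpoint-108",
    aggregation_strategy="simple",   # merge consecutive tokens of one entity into a single span
)
classifier_grouped(text)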

6.2 Inference with the model directly

import torch
model.to('cpu')
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predictions = torch.argmax(logits, dim=2)
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
predicted_token_class
['O', 'B-group', 'B-location', 'I-location', 'I-group', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-location', 'B-location', 'I-location', 'I-location']

In principle there should be one more step here that lines the predicted labels up with the original tokens; the tutorial doesn't include it, and I was too lazy to write it out properly, but a rough sketch follows.
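
For completeness, a rough sketch of that alignment step, using only the variables defined in the cell above:

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, label in zip(tokens, predicted_token_class):
    print(f"{token}\t{label}")   # one (token, predicted label) pair per line, including [CLS]/[SEP]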

7. Other references used while writing this post

  1. 利用huggingface的transformers库在自己的数据集上实现中文NER(命名实体识别) - 知乎
  2. HuggingFace Datasets来写一个数据加载脚本_名字填充中的博客-CSDN博客: on how to build your own dataset in the datasets format
  3. huggingface使用BERT对自己的数据集进行命名实体识别方法_vanilla_hxy的博客-CSDN博客: code adapted from the official transformers token classification example
  4. Huggingface-transformers项目源码剖析及Bert命名实体识别实战_野猪向前冲_真的博客-CSDN博客: the first half looks decent, but by section 5 the images are broken... plenty of other posts out there, so I stopped reading
  5. PyTorch学习—12.损失函数_ignore_index is not supported for floating point t_哎呦-_-不错的博客-CSDN博客

  [1] https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/tokenizer#transformers.BatchEncoding.word_ids
    (Incidentally, the hyperlink in the original documentation seems to be wrong; I've already filed an issue: Report a hyperlink mistake · Issue #739 · huggingface/hub-docs) ↩︎

  [2] https://github.com/huggingface/transformers/blob/fe1f5a639d93c9272856c670cff3b0e1a10d5b2b/src/transformers/models/distilbert/modeling_distilbert.py#L939 ↩︎

  [3] https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss ↩︎

  [4] https://discuss.huggingface.co/t/will-trainer-loss-functions-automatically-ignore-100/36134 ↩︎