A PyTorch pipeline for text tasks
2022/3/2 23:46:46
Contents
- 1. Reading text data into a PyTorch dataset
- 2. Tokenizing sentences to build a vocabulary
- 3. Building the DataLoader
https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
https://github.com/lankuohsing/pytorch-study/tree/main/utils/text_pipline
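1. Reading text data into a PyTorch dataset
Take a CSV file as an example, where each row has the format label,text. The code below produces an IterableDataset object that the later steps consume.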
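    import io

    from torchtext.utils import unicode_csv_reader
    # Note: _RawTextIterableDataset is a private torchtext class (leading underscore),
    # so it may not be available in every torchtext release.
    from torchtext.data.datasets_utils import _RawTextIterableDataset


    def _create_data_from_csv(data_path):
        # Yield (label, text) pairs; every column after the first is joined into one string.
        with io.open(data_path, encoding="utf8") as f:
            reader = unicode_csv_reader(f)
            for row in reader:
                yield int(row[0]), ' '.join(row[1:])


    def _create_raw_text_iterable_dataset(description=None, path=None):
        # First pass: count the rows, because the dataset needs the total up front.
        iterator = _create_data_from_csv(path)
        length = sum(1 for _ in iterator)
        # The generator is exhausted after counting, so create a fresh one for the dataset.
        iterator = _create_data_from_csv(path)
        # print("length: ", length)
        return _RawTextIterableDataset(description=description, full_num_lines=length, iterator=iterator)


    train_iter = _create_raw_text_iterable_dataset(description="AG_NEWS", path="../dataset/AG_NEWS/train.csv")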
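To make the row format concrete, here is a minimal sketch that parses the same label,text layout using only the standard library's csv module instead of torchtext's unicode_csv_reader. The helper name iter_label_text and the two sample rows are made up for illustration; they are not part of the original code or of the AG_NEWS files.

```python
import csv
import io


def iter_label_text(lines):
    """Yield (label, text) pairs from CSV rows shaped like: label,text[,more text...]."""
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        # The first column is the integer label; the remaining columns are joined with spaces.
        yield int(row[0]), " ".join(row[1:])


# Hypothetical sample rows in a label,title,description layout.
sample = io.StringIO(
    '3,"Wall St. Bears Claw Back","Short-sellers are seeing green again."\n'
    '4,"New chip announced","A vendor, unnamed, unveiled a processor."\n'
)
pairs = list(iter_label_text(sample))
for label, text in pairs:
    print(label, text)
```

A generator like this can only be consumed once, which is why the dataset-building function above creates the iterator twice: once to count the rows and once to hand to the dataset.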
2. Tokenizing sentences to build a vocabulary
    from torchtext.data import get_tokenizer
    from torchtext.vocab import build_vocab_from_iterator

    # The tokenizer takes a string and returns a list of words.
    tokenizer = get_tokenizer(tokenizer='basic_english', language='en')
    tokens = tokenizer("You can now install TorchText using pip!")
    print(tokens)


    def yield_tokens(data_iter):
        # Ignore the label and yield the token list of each text.
        for _, text in data_iter:
            yield tokenizer(text)


    vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
    # Map out-of-vocabulary tokens to the index of "<unk>".
    vocab.set_default_index(vocab["<unk>"])
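Note that the tokenizer argument can also be another tokenizing function. If it is None, the text is simply split on whitespace, so punctuation attached to a word stays attached. If it is basic_english, the text is normalized first: it is lowercased and punctuation marks are split off as separate tokens.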
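The difference between the two modes is easiest to see side by side. The sketch below uses plain Python only: simple_normalize_tokenize is a rough, hypothetical approximation of the idea behind basic_english (lowercase, then separate a few punctuation marks), not torchtext's actual rule set.

```python
import re


def whitespace_tokenize(text):
    # Comparable to passing tokenizer=None: split on whitespace only.
    return text.split()


def simple_normalize_tokenize(text):
    # Rough approximation of the normalization idea (assumed rules, not torchtext's):
    # lowercase, then put spaces around a few punctuation marks so they become tokens.
    text = text.lower()
    text = re.sub(r"([.,!?()])", r" \1 ", text)
    return text.split()


sentence = "You can now install TorchText using pip!"
print(whitespace_tokenize(sentence))
# ['You', 'can', 'now', 'install', 'TorchText', 'using', 'pip!']
print(simple_normalize_tokenize(sentence))
# ['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
```

With whitespace splitting, "pip!" and "pip" would end up as two different vocabulary entries, which is the main reason to normalize before building the vocabulary.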
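Commonly used methods of vocab
- len(vocab) → int: returns the size of the vocabulary
- vocab[token: str] → int: returns the index of the given token
- set_default_index(index: Optional[int]) → None: sets the index that is returned when an out-of-vocabulary (OOV) token is looked up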
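To show how these three operations fit together, here is a toy stand-in written in plain Python. ToyVocab is hypothetical and is not torchtext's implementation; in particular, its ordering rule (special tokens first, then tokens by descending frequency) is an assumption made to keep the example deterministic.

```python
from collections import Counter


class ToyVocab:
    """Hypothetical stand-in that mimics three Vocab behaviours for illustration only."""

    def __init__(self, token_lists, specials=()):
        counts = Counter(tok for toks in token_lists for tok in toks)
        # Assumed ordering: specials first, then by descending count, ties alphabetical.
        by_freq = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        ordered = list(specials) + [tok for tok, _ in by_freq if tok not in specials]
        self._index = {tok: i for i, tok in enumerate(ordered)}
        self._default = None

    def __len__(self):
        return len(self._index)

    def __getitem__(self, token):
        if token in self._index:
            return self._index[token]
        if self._default is None:
            # Without a default index, an OOV lookup has no defined result.
            raise RuntimeError(f"token {token!r} not found and no default index set")
        return self._default

    def set_default_index(self, index):
        self._default = index


toy = ToyVocab([["here", "is", "an", "example"], ["here", "is", "another"]], specials=["<unk>"])
print(len(toy))        # 6
print(toy["here"])     # 1
toy.set_default_index(toy["<unk>"])
print(toy["missing"])  # 0, the index of "<unk>"
```

Calling set_default_index right after building the vocabulary, as the code in this section does, means unseen words at inference time map to "<unk>" instead of raising an error.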
3. Building the DataLoader
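    import torch
    from torch.utils.data import DataLoader

    # Two handy lambdas: text -> list of token indices, raw label -> zero-based class index.
    text_pipeline = lambda x: vocab(tokenizer(x))
    label_pipeline = lambda x: int(x) - 1

    print(text_pipeline('here is the an example'))
    print(label_pipeline('10'))

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


    def collate_batch(batch):
        # offsets starts with 0 and then collects the length of every sample.
        label_list, text_list, offsets = [], [], [0]
        for (_label, _text) in batch:
            label_list.append(label_pipeline(_label))
            processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
            text_list.append(processed_text)
            offsets.append(processed_text.size(0))
        label_list = torch.tensor(label_list, dtype=torch.int64)
        # Drop the last length and take the cumulative sum to get each sample's start position.
        offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
        # Concatenate all samples into one flat tensor of token indices.
        text_list = torch.cat(text_list)
        return label_list.to(device), text_list.to(device), offsets.to(device)


    # train_iter was consumed while building the vocabulary, so create it again.
    train_iter = _create_raw_text_iterable_dataset(description="AG_NEWS", path="../dataset/AG_NEWS/train.csv")
    dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)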
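The least obvious part of collate_batch is the offsets computation: all samples in a batch are concatenated into one flat sequence, and offsets records where each sample starts. The sketch below reproduces that arithmetic with plain Python lists so it can be checked by hand. The function name flatten_with_offsets and the token IDs are made up for illustration.

```python
from itertools import accumulate


def flatten_with_offsets(sequences):
    """Concatenate variable-length sequences and return the start index of each one."""
    lengths = [len(seq) for seq in sequences]
    # Same steps as collate_batch: prepend 0, drop the last length, take the cumulative sum.
    offsets = list(accumulate([0] + lengths))[:-1]
    flat = [tok for seq in sequences for tok in seq]
    return flat, offsets


# Hypothetical batch of three samples with 3, 2 and 4 token IDs.
batch = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
flat, offsets = flatten_with_offsets(batch)
print(flat)     # [11, 12, 13, 21, 22, 31, 32, 33, 34]
print(offsets)  # [0, 3, 5]

# Each sample can be recovered by slicing between consecutive offsets.
bounds = offsets + [len(flat)]
recovered = [flat[bounds[i]:bounds[i + 1]] for i in range(len(offsets))]
print(recovered == batch)  # True
```

This flat-tensor-plus-offsets layout avoids padding every sample to the same length, and it is the input form accepted by torch.nn.EmbeddingBag, which the linked PyTorch tutorial uses as the first layer of its model.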