TFIDF
2022/4/20 23:15:33
This article introduces TF-IDF and may serve as a reference for programming problems that involve text vectorization.
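What TF-IDF is:
A text vectorization tool.
Basic concepts of TF-IDF:
TF: the number of times a term occurs in a document, divided by the total number of terms in that document
IDF: ln(total number of documents in the corpus / (number of documents containing the term + 1))
TF-IDF = TF * IDF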
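To make the formulas concrete, here is a minimal sketch in plain Python. The toy corpus and the helper names `tf`, `idf`, and `tf_idf` are assumptions made for illustration, and the documents are taken to be already tokenized.

```python
import math


def tf(term, doc):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term, corpus):
    """Inverse document frequency, with +1 added to the document count."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (df + 1))


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


# Hypothetical, already tokenized corpus of four short documents
corpus = [
    ["smuggling", "sentencing", "rules"],
    ["wechat", "fraud", "help"],
    ["wechat", "fraud", "report"],
    ["loan", "unpaid", "lawsuit"],
]

print(tf_idf("smuggling", corpus[0], corpus))  # (1/3) * ln(4/2)
print(tf_idf("wechat", corpus[1], corpus))     # (1/3) * ln(4/3)
```

The rarer term receives the larger weight. Note that with +1 in the denominator, a term that occurs in every document gets a slightly negative IDF.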
Forward longest matching:

def forward_segment(text, dic):
    word_list = []
    i = 0
    while i < len(text):
        # Fall back to a single character if no dictionary word starts at i
        longest_word = text[i]
        # Try every candidate starting at i; the last hit is the longest one
        for j in range(i + 1, len(text) + 1):
            word = text[i:j]
            if word in dic:
                longest_word = word
        word_list.append(longest_word)
        i += len(longest_word)
    return word_list
Backward longest matching:

def backward_segment(text, dic):
    word_list = []
    i = len(text) - 1
    while i >= 0:
        # Fall back to a single character if no dictionary word ends at i
        longest_word = text[i]
        # Candidates go from longest to shortest, so the first hit wins
        for j in range(0, i):
            word = text[j:i + 1]
            if word in dic:
                longest_word = word
                break
        word_list.insert(0, longest_word)
        i -= len(longest_word)
    return word_list
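Bidirectional longest matching:

def count_single(word_list):
    # Number of single-character words in a segmentation result
    return sum(1 for word in word_list if len(word) == 1)

def bidirectional_segment(text, dic):
    forward_result = forward_segment(text, dic)
    backward_result = backward_segment(text, dic)
    # Default to the backward segmentation
    final_result = backward_result
    if len(forward_result) < len(backward_result):
        # Prefer the result with fewer words
        final_result = forward_result
    elif (len(forward_result) == len(backward_result)
          and count_single(forward_result) < count_single(backward_result)):
        # Same word count: prefer the result with fewer single-character words
        final_result = forward_result
    return final_result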
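The three strategies can be compared on a small input. The sketch below restates the functions in compact form so that it runs on its own. The four-word dictionary and the sample text are assumptions chosen so that the forward and backward results differ. The tie-breaking rule used here (fewer words first, then fewer single-character words, otherwise the backward result) is one common reading of the approach.

```python
def forward_segment(text, dic):
    words, i = [], 0
    while i < len(text):
        longest = text[i]
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in dic:
                longest = text[i:j]
        words.append(longest)
        i += len(longest)
    return words


def backward_segment(text, dic):
    words, i = [], len(text) - 1
    while i >= 0:
        longest = text[i]
        for j in range(0, i):
            if text[j:i + 1] in dic:
                longest = text[j:i + 1]
                break
        words.insert(0, longest)
        i -= len(longest)
    return words


def count_single(words):
    # Number of single-character words in a segmentation result
    return sum(1 for w in words if len(w) == 1)


def bidirectional_segment(text, dic):
    f, b = forward_segment(text, dic), backward_segment(text, dic)
    if len(f) != len(b):
        return f if len(f) < len(b) else b
    return f if count_single(f) < count_single(b) else b


dic = {"研究", "研究生", "生命", "起源"}  # hypothetical toy dictionary
text = "研究生命起源"

print(forward_segment(text, dic))        # ['研究生', '命', '起源']
print(backward_segment(text, dic))       # ['研究', '生命', '起源']
print(bidirectional_segment(text, dic))  # ['研究', '生命', '起源']
```

Both directions produce three words, but the forward result contains the single-character word 命, so the backward result is selected.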
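Text vectorization steps
1. Iterate over the documents and tokenize each one with a word segmentation tool such as jieba to build the document collection. Numerals (m), non-language tokens (x), Chinese stop words, and similar items are usually excluded from the corpus.
2. Remove duplicate tokens to build the vocabulary.
3. Compute TF and IDF.
4. Output the TF-IDF result (TF * IDF).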
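Steps 1 and 2 can be sketched without a tokenizer by starting from (word, part-of-speech flag) pairs. The pairs below are hand-written for illustration and are not actual jieba output. The stop-word list and the helper names `filter_tokens` and `build_vocab` are likewise assumptions.

```python
STOP_WORDS = {"的", "了", "吗"}  # hypothetical, deliberately tiny stop-word list
EXCLUDED_FLAGS = {"m", "x"}      # numerals and non-language tokens


def filter_tokens(tagged_tokens):
    """Keep words whose flag is allowed and that are not stop words."""
    return [
        word
        for word, flag in tagged_tokens
        if flag not in EXCLUDED_FLAGS and word not in STOP_WORDS
    ]


def build_vocab(documents):
    """Deduplicate tokens across all documents; sort for a stable order."""
    return sorted({word for doc in documents for word in doc})


# Hand-written (word, flag) pairs standing in for a POS tagger's output
tagged_docs = [
    [("走私", "v"), ("是", "v"), ("如何", "r"), ("量刑", "n"), ("的", "uj"), ("?", "x")],
    [("有人", "r"), ("走私", "v"), ("两万", "m"), ("元", "q"), ("?", "x")],
]

docs = [filter_tokens(tagged) for tagged in tagged_docs]
vocab = build_vocab(docs)
print(docs)
print(vocab)
```

Sorting the vocabulary is a design choice: it fixes the column order of the resulting vectors, which a plain `set` does not guarantee.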
Example Python code for text vectorization:

import jieba.posseg as pseg
import numpy as np


def process(documents):
    """Tokenize every document and collect the vocabulary."""
    file_list = []
    vocab = set()
    for content in documents:
        psresult = pseg.cut(content)
        # Drop non-language tokens (x) and numerals (m)
        document = [x.word for x in psresult if x.flag not in ['x', 'm']]
        file_list.append(document)
        vocab.update(document)
    return file_list, vocab


class TFIDFVector:
    def __init__(self, documents):
        self.documents = documents
        self.file_list = []
        self.vocab = set()
        self.preprocess()

    # Preprocess the documents: build the token lists and the vocabulary
    def preprocess(self):
        self.file_list, self.vocab = process(self.documents)

    # Vectorize the text
    def vector(self):
        # Sort so that the column order is the same on every run
        self.vocab_list = sorted(self.vocab)
        # TF: term count divided by document length (guard against empty documents)
        tf_list = [np.array([sent.count(word) for word in self.vocab_list]) / max(len(sent), 1)
                   for sent in self.file_list]
        # IDF: ln(N / df). No +1 is applied here; every vocabulary word occurs
        # in at least one document, so the denominator is never zero.
        idf_list = [np.log(len(self.file_list) / sum(1 for sent in self.file_list if word in sent))
                    for word in self.vocab_list]
        tf_arrays = np.array(tf_list)
        idf_arrays = np.array(idf_list)
        # TF-IDF = TF * IDF, broadcast across the rows (one row per document)
        result = tf_arrays * idf_arrays
        return result


if __name__ == '__main__':
    #dirName = r'd:\deepblue\course\NO.11\datas\datas'# r'D:\AI课程\05.NLP基础\代码\datas'
    #documents = load_data(dirName)
    documents = ['行政机关强行解除行政协议造成损失,如何索取赔偿?',
                 '借钱给朋友到期不还得什么时候可以起诉?怎么起诉?',
                 '我在微信上被骗了,请问被骗多少钱才可以立案?',
                 '公民对于选举委对选民的资格申诉的处理决定不服,能不能去法院起诉吗?',
                 '有人走私两万元,怎么处置他?',
                 '法律上餐具、饮具集中消毒服务单位的责任是不是对消毒餐具、饮具进行检验?',
                 '走私是如何量刑的?',
                 '如果微信上被骗怎么办?']
    vector = TFIDFVector(documents)
    print(vector.vector())