TF and IDF

(1) Definition

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and text mining.

TF-IDF is a statistical measure of how important a word is to one document within a collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but is offset by how frequently it appears across the corpus. Variants of TF-IDF weighting are widely used by search engines as a measure or ranking of the relevance between a document and a user query.

(2) What TF and IDF Each Mean

1. TF (term frequency)

The number of times a given word appears in the document. This count is usually normalized (the numerator is generally smaller than the denominator) to prevent a bias toward long documents: regardless of how important a word actually is, it is likely to occur more often in a long document than in a short one.

2. IDF (inverse document frequency)

A measure of a word's general importance. The key idea: the fewer documents contain a term t, the larger its IDF, and the better the term discriminates between categories. The IDF of a term is obtained by dividing the total number of documents by the number of documents containing the term, then taking the logarithm of the quotient.

(3) Mathematical Definition

TF-IDF is the product of two statistics, term frequency and inverse document frequency. There are several ways to define the exact value of each.

1. TF

For a term \(t_i\) in a particular document \(d_j\), its importance can be expressed as:

\(tf_{i,j}=\frac{n_{i,j}}{\sum_{k}n_{k,j}}\)

where \(n_{i,j}\) is the number of occurrences of the term in document \(d_j\), and the denominator is the total number of occurrences of all terms in document \(d_j\). In plainer terms:

\(TF(t)=\frac{\text{Number of times term } t \text{ appears in a document}}{\text{Total number of terms in the document}}\)
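As a quick sanity check, the normalized TF can be computed in a couple of lines (a toy document, not taken from this article's corpus):

```python
from collections import Counter

# Toy document: 6 tokens in total, "the" appears twice
doc = "the cat sat on the mat".split()
counts = Counter(doc)

# tf("the") = occurrences of "the" / total tokens = 2 / 6
tf_the = counts["the"] / sum(counts.values())
print(round(tf_the, 4))  # 0.3333
```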

2. IDF

Inverse document frequency (IDF) measures a term's general importance. The IDF of a term is the logarithm of the total number of documents divided by the number of documents containing the term:

\(idf_{i}=\log\frac{\left | D \right |}{\left | \left \{ j:t_{i}\in d_{j} \right \} \right |}\)

where \(\left | D \right |\) is the total number of documents in the corpus, and \(\left | \left \{ j:t_{i}\in d_{j} \right \} \right |\) is the number of documents containing term \(t_{i}\). If the term does not appear in the corpus at all, the denominator would be zero, so in practice \(\left | \left \{ j:t_{i}\in d_{j} \right \} \right |+1\) is used. Again, in plainer terms:

\(IDF(t)=\log\frac{\text{Total number of documents}}{\text{Number of documents with term } t \text{ in it}}\)
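In code, the smoothed IDF reads as follows (a hypothetical mini-corpus, with each document reduced to its set of terms):

```python
import math

# Hypothetical corpus of four documents, each as a set of terms
docs = [{"python", "snake"}, {"python", "movie"}, {"colt", "revolver"}, {"python"}]

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # documents containing the term
    return math.log(len(docs) / (1 + df))    # +1 smoothing avoids division by zero

print(round(idf("revolver", docs), 4))  # log(4/2) ≈ 0.6931
print(round(idf("python", docs), 4))    # log(4/4) = 0.0
```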

3. TF-IDF


TF-IDF(t) = TF(t) * IDF(t)

From the formula we can see that TF-IDF is proportional to how often a word appears within a document, and inversely related to how often that word appears across the whole corpus.

A high term frequency within a particular document, combined with a low document frequency across the whole collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words while keeping the distinctive ones.
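A minimal end-to-end sketch (using the unsmoothed IDF here, for brevity) shows both effects on a toy corpus: a term frequent in one document but rare overall gets a high weight, while a term present in every document scores zero.

```python
import math

# Toy corpus: "python" appears in every document, "movie" in only one
docs = [["python", "python", "movie"], ["python", "snake"], ["colt", "python"]]

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)            # normalized term frequency
    df = sum(1 for d in docs if term in d)     # document frequency
    return tf * math.log(len(docs) / df)       # unsmoothed idf

print(round(tfidf("movie", docs[0], docs), 4))   # (1/3)*log(3) ≈ 0.3662
print(round(tfidf("python", docs[0], docs), 4))  # (2/3)*log(1) = 0.0
```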

(4) Implementing TF-IDF in Python

import math
import string
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer


# Tokenize (stripping punctuation)
def get_tokens(text):
    lowers = text.lower()
    # remove punctuation using the character-deletion form of str.translate
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
    no_punctuation = lowers.translate(remove_punctuation_map)
    tokens = nltk.word_tokenize(no_punctuation)
    return tokens

# Stemming (stemmer is the stemming algorithm)
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

# Term frequency of word within one document (a Counter of its terms)
def tf(word, count):
    return count[word] / sum(count.values())

# Number of documents that contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# IDF of word (with +1 smoothing in the denominator)
def idf(word, count_list):
    return math.log(len(count_list) / (1 + n_containing(word, count_list)))

# TF-IDF
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

text1 = "Python is a 2000 made-for-TV horror movie directed by Richard \
Clabaugh. The film features several cult favorite actors, including William \
Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy, \
Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the \
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean \
Whalen. The film concerns a genetically engineered snake, a python, that \
escapes and unleashes itself on a small town. It includes the classic final \
girl scenario evident in films like Friday the 13th. It was filmed in Los Angeles, \
California and Malibu, California. Python was followed by two sequels: Python \
II (2002) and Boa vs. Python (2004), both also made-for-TV films."

text2 = "Python, from the Greek word (πύθων/πύθωνας), is a genus of \
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are \
recognised.[2] A member of this genus, P. reticulatus, is among the longest \
snakes known."

text3 = "The Colt Python is a .357 Magnum caliber revolver formerly \
manufactured by Colt's Manufacturing Company of Hartford, Connecticut. \
It is sometimes referred to as a \"Combat Magnum\".[1] It was first introduced \
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued \
Colt Python targeted the premium revolver market segment. Some firearm \
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy \
Thompson, Renee Smeets and Martin Dougherty have described the Python as the \
finest production revolver ever made."

# Quick look at the ten most common (unstemmed) non-stopwords in text1
tokens = get_tokens(text1)
filtered = [w for w in tokens if w not in stopwords.words('english')]
count = Counter(filtered)
print(count.most_common(10))

# NLTK's Porter stemming algorithm
stemmer = PorterStemmer()

tokens1 = get_tokens(text1)
filtered1 = [w for w in tokens1 if w not in stopwords.words('english')]
stemmed1 = stem_tokens(filtered1, stemmer)
count1 = Counter(stemmed1)

tokens2 = get_tokens(text2)
filtered2 = [w for w in tokens2 if w not in stopwords.words('english')]
stemmed2 = stem_tokens(filtered2, stemmer)
count2 = Counter(stemmed2)

tokens3 = get_tokens(text3)
filtered3 = [w for w in tokens3 if w not in stopwords.words('english')]
stemmed3 = stem_tokens(filtered3, stemmer)
count3 = Counter(stemmed3)

countlist = [count1, count2, count3]
for i, count in enumerate(countlist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))


Top words in document 1
Word: film, TF-IDF: 0.02829
Word: madefortv, TF-IDF: 0.00943
Word: includ, TF-IDF: 0.00943
Top words in document 2
Word: genu, TF-IDF: 0.03686
Word: greek, TF-IDF: 0.01843
Word: word, TF-IDF: 0.01843
Top words in document 3
Word: colt, TF-IDF: 0.02097
Word: revolv, TF-IDF: 0.02097
Word: magnum, TF-IDF: 0.01398

(5) TF-IDF with scikit-learn

TF-IDF is so common in text mining that Python's machine-learning packages ship a built-in implementation. In scikit-learn, the main entry point is TfidfVectorizer(); here is a simple example.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']
vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())  # use get_feature_names_out() on scikit-learn >= 1.0
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(vectorizer.fit_transform(corpus).toarray())


[[0. 0.43877674 0.54197657 0.43877674 0. 0.
0.35872874 0. 0.43877674]
[0. 0.27230147 0. 0.27230147 0. 0.85322574
0.22262429 0. 0.27230147]
[0.55280532 0. 0. 0. 0.55280532 0.
0.28847675 0.55280532 0. ]
[0. 0.43877674 0.54197657 0.43877674 0. 0.
0.35872874 0. 0.43877674]]
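Note that these numbers differ from the hand-rolled formulas above: by default TfidfVectorizer uses a smoothed idf, \(\ln\frac{1+N}{1+df}+1\), and L2-normalizes each row. Under those defaults, the entry 0.54197657 for "first" in document 1 can be reproduced by hand:

```python
import math

# scikit-learn defaults: idf(t) = ln((1 + N) / (1 + df)) + 1, rows L2-normalized
N = 4
idf = lambda df: math.log((1 + N) / (1 + df)) + 1

# Terms of document 1 ("This is the first document.") mapped to their
# document frequencies in the 4-document corpus; each term occurs once (tf = 1)
dfs = {"this": 3, "is": 3, "the": 4, "first": 2, "document": 3}
weights = {t: 1 * idf(df) for t, df in dfs.items()}
norm = math.sqrt(sum(w * w for w in weights.values()))
print(round(weights["first"] / norm, 8))  # 0.54197657
```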


------------- This is an original post by the author; please credit the source [https://ssjcoding.github.io] when reposting -------------