Latent Semantic Analysis (LSA) Tutorial

Part 2 - Modify the Counts with TFIDF

In sophisticated Latent Semantic Analysis systems, the raw matrix counts are usually modified so that rare words are weighted more heavily than common words. For example, a word that occurs in only 5% of the documents should probably be weighted more heavily than a word that occurs in 90% of the documents. The most popular weighting is TFIDF (Term Frequency - Inverse Document Frequency). Under this method, the count in each cell is replaced by the following formula.

$$\mathrm{TFIDF}_{i,j} = \frac{N_{i,j}}{N_{*,j}} \cdot \log\left(\frac{D}{D_i}\right)$$

where

  • $N_{i,j}$ = the number of times word i appears in document j (the original cell count).
  • $N_{*,j}$ = the total number of words in document j (just add the counts in column j).
  • $D$ = the number of documents (the number of columns).
  • $D_i$ = the number of documents in which word i appears (the number of non-zero columns in row i).

In this formula, words that concentrate in certain documents are emphasized (by the $N_{i,j} / N_{*,j}$ ratio) and words that appear in only a few documents are also emphasized (by the $\log(D / D_i)$ term).
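To get a feel for the two factors, here is a small sketch that evaluates the formula for a single cell. The function name tfidf_cell and the sample numbers are made up for illustration; they are not part of the LSA class. It assumes the natural logarithm, which is what Python's math.log computes by default.

```python
from math import log

def tfidf_cell(count, words_in_doc, num_docs, docs_with_word):
    """TFIDF for one cell: (N_ij / N_*j) * log(D / D_i), natural log.

    Hypothetical helper for illustration only.
    """
    tf = count / float(words_in_doc)               # how concentrated the word is in this document
    idf = log(float(num_docs) / docs_with_word)    # how rare the word is across documents
    return tf * idf

# A word used 2 times in a 10-word document, in a 100-document collection.
rare = tfidf_cell(2, 10, 100, 5)           # word appears in 5% of the documents
common = tfidf_cell(2, 10, 100, 90)        # word appears in 90% of the documents
everywhere = tfidf_cell(2, 10, 100, 100)   # word appears in every document

print(rare, common, everywhere)
```

With the same count in the same document, the rare word ends up with a weight of roughly 0.60 and the common word with roughly 0.02. A word that appears in every document gets a weight of exactly zero, because log(1) = 0.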

Since our example is so small, we will skip this step and move on to the heart of LSA: the singular value decomposition of our matrix of counts. However, if we did want to add TFIDF to our LSA class, we could add the following two lines at the beginning of our Python file to import the log, asarray, and sum functions.

from math import log
from numpy import asarray, sum

Then we would add the following TFIDF method to our LSA class. WordsPerDoc ($N_{*,j}$) just holds the sum of each column, which is the total number of index words in each document. DocsPerWord ($D_i$) uses asarray to create an array of what would be True and False values, depending on whether the cell value is greater than 0 or not, but the 'i' argument (an integer type code) turns them into 1s and 0s instead. Each row is then summed, which tells us how many documents each word appears in. Finally, we step through each cell and apply the formula. We convert cols (the number of documents) to a float to prevent integer division, which matters under Python 2, where dividing two integers with / discards the remainder.

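def TFIDF(self):
    # Total number of index words in each document (column sums)
    WordsPerDoc = sum(self.A, axis=0)
    # Number of documents each word appears in (non-zero cells per row)
    DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)
    rows, cols = self.A.shape
    # Replace each count with its TFIDF weight
    for i in range(rows):
        for j in range(cols):
            self.A[i,j] = (self.A[i,j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])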
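To try the weighting outside the LSA class, the following sketch applies the same formula to a small made-up count matrix. The functions tfidf_loop and tfidf_vectorized and the counts array are hypothetical names and data introduced here for illustration. The vectorized variant is an alternative that relies on NumPy broadcasting instead of the double loop. Both assume that every document contains at least one index word and every word appears in at least one document, since otherwise the formula divides by zero.

```python
from math import log
import numpy as np

def tfidf_loop(A):
    """Cell-by-cell TFIDF, mirroring the method above on a float copy of A."""
    A = np.array(A, dtype=float)             # float copy so the weights are not truncated
    words_per_doc = A.sum(axis=0)            # N_*j: column sums
    docs_per_word = (A > 0).sum(axis=1)      # D_i: non-zero cells per row
    rows, cols = A.shape
    for i in range(rows):
        for j in range(cols):
            A[i, j] = (A[i, j] / words_per_doc[j]) * log(float(cols) / docs_per_word[i])
    return A

def tfidf_vectorized(A):
    """Same weighting using broadcasting (assumes no empty rows or columns)."""
    A = np.array(A, dtype=float)
    words_per_doc = A.sum(axis=0)                        # shape (cols,)
    docs_per_word = (A > 0).sum(axis=1)                  # shape (rows,)
    idf = np.log(float(A.shape[1]) / docs_per_word)      # one IDF value per word (row)
    # A / words_per_doc divides each column j by N_*j;
    # idf[:, np.newaxis] scales each row i by log(D / D_i).
    return (A / words_per_doc) * idf[:, np.newaxis]

# Rows are words, columns are documents (made-up counts).
counts = np.array([[1, 0, 2],
                   [1, 1, 1],
                   [0, 3, 0]])

weighted = tfidf_vectorized(counts)
print(weighted)
```

The second word appears in all three documents, so its whole row becomes zero (log(3/3) = 0), while the third word, which appears in only one document, receives the largest IDF factor.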


 