| Latent Semantic Analysis (LSA) Tutorial - Part 2 - Modify the Counts with TFIDF |
|
|
|
|
Page 4 of 8
Part 2 - Modify the Counts with TFIDFIn sophisticated Latent Semantic Analysis systems, the raw matrix counts are usually modified so that rare words are weighted more heavily than common words. For example, a word that occurs in only 5% of the documents should probably be weighted more heavily than a word that occurs in 90% of the documents. The most popular weighting is TFIDF (Term Frequency - Inverse Document Frequency). Under this method, the count in each cell is replaced by the following formula. TFIDFi,j = ( Ni,j / N*,j ) * log( D / Di ) where
In this formula, words that concentrate in certain documents are emphasized (by the Ni,j / N*,j ratio) and words that only appear in a few documents are also emphasized (by the log( D / Di ) term). Since we have such a small example, we will skip this step and move on the heart of LSA, doing the singular value decomposition of our matrix of counts. However, if we did want to add TFIDF to our LSA class we could add the following two lines at the beginning of our python file to import the log, asarray, and sum functions. from math import log Then we would add the following TFIDF method to our LSA class. WordsPerDoc (N*,j) just holds the sum of each column, which is the total number of index words in each document. DocsPerWord (Di) uses asarray to create an array of what would be True and False values, depending on whether the cell value is greater than 0 or not, but the 'i' argument turns it into 1's and 0's instead. Then each row is summed up which tells us how many documents each word appears in. Finally, we just step through each cell and apply the formula. We do have to change cols (which is the number of documents) into a float to prevent integer division.
|


