|Latent Semantic Analysis (LSA) Tutorial|
Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI) literally means analyzing documents to find the underlying meaning or concepts of those documents. If each word only meant one concept, and each concept was only described by one word, then LSA would be easy since there is a simple mapping from words to concepts.
Unfortunately, this problem is difficult because English has different words that mean the same thing (synonyms), words with multiple meanings, and all sorts of ambiguities that obscure the concepts to the point where even people can have a hard time understanding.
For example, the word bank when used together with mortgage, loans, and rates probably means a financial institution. However, the word bank when used together with lures, casting, and fish probably means a stream or river bank.
How Latent Semantic Analysis Works
Latent Semantic Analysis arose from the problem of how to find relevant documents from search words. The fundamental difficulty arises when we compare words to find relevant documents, because what we really want to do is compare the meanings or concepts behind the words. LSA attempts to solve this problem by mapping both words and documents into a "concept" space and doing the comparison in this space.
Since authors have a wide choice of words available when they write, the concepts can be obscured due to different word choices from different authors. This essentially random choice of words introduces noise into the word-concept relationship. Latent Semantic Analysis filters out some of this noise and also attempts to find the smallest set of concepts that spans all the documents.
In order to make this difficult problem solvable, LSA introduces some dramatic simplifications.
To see a small example of LSA, take a look at the next section.
A Small Example
As a small example, I searched for books using the word “investing” at Amazon.com and took the top 10 book titles that appeared. One of these titles was dropped because it had only one index word in common with the other titles. An index word is any word that:
In this example we have removed the following stop words: “and”, “edition”, “for”, “in”, “little”, “of”, “the”, “to”.
Here are the 9 remaining tiles. The index words (words that appear in 2 or more titles and are not stop words) are underlined.
Once Latent Semantic Analysis has been run on this example, we can plot the index words and titles on an XY graph and identify clusters of titles. The 9 titles are plotted with blue circles and the 11 index words are plotted with red squares. Not only can we spot clusters of titles, but since index words can be plotted along with titles, we can label the clusters. For example, the blue cluster, containing titles T7 and T9, is about real estate. The green cluster, with titles T2, T4, T5, and T8, is about value investing, and finally the red cluster, with titles T1 and T3, is about the stock market. The T6 title is an outlier, off on its own.
In the next few sections, we'll go through all steps needed to run Latent Semantic Analysis on this example.
Part 1 - Creating the Count Matrix
The first step in Latent Semantic Analysis is to create the word by title (or document) matrix. In this matrix, each index word is a row and each title is a column. Each cell contains the number of times that word occurs in that title. For example, the word "book" appears one time in title T3 and one time in title T4, whereas "investing" appears one time in every title. In general, the matrices built during LSA tend to be very large, but also very sparse (most cells contain 0). That is because each title or document usually contains only a small number of all the possible words. This sparseness can be taken advantage of in both memory and time by more sophisticated LSA implementations.
In the following matrix, we have left out the 0's to reduce clutter.
Python - Getting Started
Download the python code here.
Throughout this article, we'll give Python code that implements all the steps necessary for doing Latent Semantic Analysis. We'll go through the code section by section and explain everything. The Python code used in this article can be downloaded here and then run in Python. You need to have already installed the Python NumPy and SciPy libraries.
Python - Import Functions
First we need to import a few functions from Python libraries to handle some of the math we need to do. NumPy is the Python numerical library, and we'll import zeros, a function that creates a matrix of zeros that we use when building our words by titles matrix. From the linear algebra part of the scientific package (scipy.linalg) we import the svd function that actually does the singular value decomposition, which is the heart of LSA.
from numpy import zeros
Python - Define Data
Next, we define the data that we are using. Titles holds the 9 book titles that we have gathered, stopwords holds the 8 common words that we are going to ignore when we count the words in each title, and ignorechars has all the punctuation characters that we will remove from words. We use Python's triple quoted strings, so there are actually only 4 punctuation symbols we are removing: comma (,), colon (:), apostrophe ('), and exclamation point (!).
Python - Define LSA Class
The LSA class has methods for initialization, parsing documents, building the matrix of word counts, and calculating. The first method is the __init__ method, which is called whenever an instance of the LSA class is created. It stores the stopwords and ignorechars so they can be used later, and then initializes the word dictionary and the document count variables.
Python - Parse Documents
The parse method takes a document, splits it into words, removes the ignored characters and turns everything into lowercase so the words can be compared to the stop words. If the word is a stop word, it is ignored and we move on to the next word. If it is not a stop word, we put the word in the dictionary, and also append the current document number to keep track of which documents the word appears in.
The documents that each word appears in are kept in a list associated with that word in the dictionary. For example, since the word book appears in titles 3 and 4, we would have self.wdict['book'] = [3, 4] after all titles are parsed.
After processing all words from the current document, we increase the document count in preparation for the next document to be parsed.
Python - Build the Count Matrix
Once all documents are parsed, all the words (dictionary keys) that are in more than 1 document are extracted and sorted, and a matrix is built with the number of rows equal to the number of words (keys), and the number of columns equal to the document count. Finally, for each word (key) and document pair the corresponding matrix cell is incremented.
Python - Print the Count Matrix
The printA() method is very simple, it just prints out the matrix that we have built so it can be checked.
Python - Test the LSA Class
After defining the LSA class, it's time to try it out on our 9 book titles. First we create an instance of LSA, called mylsa, and pass it the stopwords and ignorechars that we defined. During creation, the __init__ method is called which stores the stopwords and ignorechars and initializes the word dictionary and document count.
Next, we call the parse method on each title. This method extracts the words in each title, strips out punctuation characters, converts each word to lower case, throws out stop words, and stores remaining words in a dictionary along with what title number they came from.
Finally we call the build() method to create the matrix of word by title counts. This extracts all the words we have seen so far, throws out words that occur in less than 2 titles, sorts them, builds a zero matrix of the right size, and then increments the proper cell whenever a word appears in a title.
mylsa = LSA(stopwords, ignorechars)
Here is the raw output produced by printA(). As you can see, it's the same as the matrix that we showed earlier.
[[ 0. 0. 1. 1. 0. 0. 0. 0. 0.]
Part 2 - Modify the Counts with TFIDF
In sophisticated Latent Semantic Analysis systems, the raw matrix counts are usually modified so that rare words are weighted more heavily than common words. For example, a word that occurs in only 5% of the documents should probably be weighted more heavily than a word that occurs in 90% of the documents. The most popular weighting is TFIDF (Term Frequency - Inverse Document Frequency). Under this method, the count in each cell is replaced by the following formula.
TFIDFi,j = ( Ni,j / N*,j ) * log( D / Di ) where
In this formula, words that concentrate in certain documents are emphasized (by the Ni,j / N*,j ratio) and words that only appear in a few documents are also emphasized (by the log( D / Di ) term).
Since we have such a small example, we will skip this step and move on the heart of LSA, doing the singular value decomposition of our matrix of counts. However, if we did want to add TFIDF to our LSA class we could add the following two lines at the beginning of our python file to import the log, asarray, and sum functions.
from math import log
Then we would add the following TFIDF method to our LSA class. WordsPerDoc (N*,j) just holds the sum of each column, which is the total number of index words in each document. DocsPerWord (Di) uses asarray to create an array of what would be True and False values, depending on whether the cell value is greater than 0 or not, but the 'i' argument turns it into 1's and 0's instead. Then each row is summed up which tells us how many documents each word appears in. Finally, we just step through each cell and apply the formula. We do have to change cols (which is the number of documents) into a float to prevent integer division.
Part 3 - Using the Singular Value Decomposition
Once we have built our (words by titles) matrix, we call upon a powerful but little known technique called Singular Value Decomposition or SVD to analyze the matrix for us. The "Singular Value Decomposition Tutorial" is a gentle introduction for readers that want to learn more about this powerful and useful algorithm.
The reason SVD is useful, is that it finds a reduced dimensional representation of our matrix that emphasizes the strongest relationships and throws away the noise. In other words, it makes the best possible reconstruction of the matrix with the least possible information. To do this, it throws out noise, which does not help, and emphasizes strong patterns and trends, which do help. The trick in using SVD is in figuring out how many dimensions or "concepts" to use when approximating the matrix. Too few dimensions and important patterns are left out, too many and noise caused by random word choices will creep back in.
The SVD algorithm is a little involved, but fortunately Python has a library function that makes it simple to use. By adding the one line method below to our LSA class, we can factor our matrix into 3 other matrices. The U matrix gives us the coordinates of each word on our “concept” space, the Vt matrix gives us the coordinates of each document in our “concept” space, and the S matrix of singular values gives us a clue as to how many dimensions or “concepts” we need to include.
In order to choose the right number of dimensions to use, we can make a histogram of the square of the singular values. This graphs the importance each singular value contributes to approximating our matrix. Here is the histogram in our example.
For large collections of documents, the number of dimensions used is in the 100 to 500 range. In our little example, since we want to graph it, we’ll use 3 dimensions, throw out the first dimension, and graph the second and third dimensions.
The reason we throw out the first dimension is interesting. For documents, the first dimension correlates with the length of the document. For words, it correlates with the number of times that word has been used in all documents. If we had centered our matrix, by subtracting the average column value from each column, then we would use the first dimension. As an analogy, consider golf scores. We don’t want to know the actual score, we want to know the score after subtracting it from par. That tells us whether the player made a birdie, bogie, etc.
The reason we don't center the matrix when using LSA, is that we would turn a sparse matrix into a dense matrix and dramatically increase the memory and computation requirements. It's more efficient to not center the matrix and then throw out the first dimension.
Here is the complete 3 dimensional Singular Value Decomposition of our matrix. Each word has 3 numbers associated with it, one for each dimension. The first number tends to correspond to the number of times that word appears in all titles and is not as informative as the second and third dimensions, as we discussed. Similarly, each title also has 3 numbers associated with it, one for each dimension. Once again, the first dimension is not very interesting because it tends to correspond to the number of words in the title.
Part 4 - Clustering by Color
We can also turn the numbers into colors. For instance, here is a color display that corresponds to the first 3 dimensions of the Titles matrix that we showed above. It contains exactly the same information, except that blue shows negative numbers, red shows positive numbers, and numbers close to 0 are white. For example, Title 9, which is strongly positive in all 3 dimensions, is also strongly red in all 3 dimensions.
We can use these colors to cluster the titles. We ignore the first dimension for clustering because all titles are red. In the second dimension, we have the following result.
Using the third dimension, we can split each of these groups again the same way. For example, looking at the third dimension, title 6 is blue, but title 7 and title 9 are still red. Doing this for both groups, we end up with these 4 groups.
It’s interesting to compare this table with what we get when we graph the results in the next section.
Part 5 - Clustering by Value
Leaving out the first dimension, as we discussed, let's graph the second and third dimensions using a XY graph. We'll put the second dimension on the X axis and the third dimension on the Y axis and graph each word and title. It's interesting to compare the XY graph with the table we just created that clusters the documents.
In the graph below, words are represented by red squares and titles are represented by blue circles. For example the word "book" has dimension values (0.15, -0.27, 0.04). We ignore the first dimension value 0.15 and graph "book" to position (x = -0.27, y = 0.04) as can be seen in the graph. Titles are similarly graphed.
One advantage of this technique is that both words and titles are placed on the same graph. Not only can we identify clusters of titles, but we can also label the clusters by looking at what words are also in the cluster. For example, the lower left cluster has titles 1 and 3 which are both about stock market investing. The words "stock" and "market" are conveniently located in the cluster, making it easy to see what the cluster is about. Another example is the middle cluster which has titles 2, 4, 5, and, to a somewhat lesser extent, title 8. Titles 2, 4, and 5 are close to the words "value" and "investing" which summarizes those titles quite well.
Advantages, Disadvantages, and Applications of LSA
Latent Semantic Analysis has many nice properties that make it widely applicable to many problems.
There are a few limitations that must be considered when deciding whether to use LSA. Some of these are:
In spite of these limitations, LSA is widely used for finding and organizing search results, grouping documents into clusters, spam filtering, speech recognition, patent searches, automated essay evaluation, etc.
As an example, iMetaSearch uses LSA to map search results and words to a “concept” space. Users can then find which results are closest to which words and vice versa. The LSA results are also used to cluster search results together so that you save time when looking for related results.
revised on January 30, 2010