Lighthouse Puffinware LLC

Our Products
iMetaSearch
Information
Information & Support
News
OtherSVD and LSI Tutorial
Latent Semantic Analysis Tutorial
Latent Sematic Analysis or LSA is a way of finding patterns among a collection of documents such as web pages. It is increasingly used by major search engines, such as Google, in ranking websites and determining what AdSense ads to show on a page. To see how Latent Semantic Analysis works, imagine that you have a collection of documents such as web pages, or in the simple example we show below, book titles. How would you go about finding similarities and differences between the documents?

One way is to form a large matrix with each column representing a document, and each row representing a word that has been extracted from the documents. Then, each cell of the matrix is simply the number of times that word appears in that document. For example, if the word "farm" appears in the first document 7 times, then that cell would have a 7 in it. Each cell is simply a count of the number of times that word appears in that document.

The cell numbers are usually massaged, so that whatever patterns are present can be seen more clearly. This step corresponds to "cleaning the data" so that, for example, frequent words are not weighted too heavily, some natural language constructs are simplified, and long documents don't have an unfair advantage. Some ways that cell numbers are massaged are:

  • Log of Counts - the log of the counts in each cell may be used instead of the actual counts.
  • Stemming - related words may have the same root word and should be considered the same (such as golf and golfing).
  • TF-IDF - (term frequency - inverse document frequency) attempts to measure the importance of a term or word.
  • Entropy - another way to measure term importance based on the distribution of the term through documents.
  • Normalization - sets each document vector to length 1 so documents with more words don't have an unfair advantage.

In order to give an example of Latent Sematic Analysis, I went to Amazon.com, searched on "investing", and took the top 10 books titles that were displayed. One of those book titles only had one index word so I dropped it. Here are the other 9 titles with the index words underlined. To be an index word, the word must occur in 2 or more titles, and not be a "noise" or stop word such as "the", "to", "of", etc.

  1. The Neatest Little Guide to Stock Market Investing
  2. Investing For Dummies, 4th Edition
  3. The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
  4. The Little Book of Value Investing
  5. Value Investing: From Graham to Buffett and Beyond
  6. Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
  7. Investing in Real Estate, 5th Edition
  8. Stock Investing For Dummies
  9. Rich Dad's Advisors®: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss

We can represent this information as a matrix. For this matrix, each index word is a row and each title is a column. Each cell contains the number of times that word occurs in that title. For example, the word "book" appears one time in title 3 and one time in title 4, whereas "investing" appears one time in every title. In general matrices of this type, where we count the number of words in each document, tend to be sparse, that is most entries are 0. Instead of displaying all the 0's, we have just left them out so as not to clutter the matrix, but in calculations the 0's are used.

In this example we have not massaged the counts at all, but in more sophisticated systems the counts are usually processed by one or more ways as described above. Once the problem is reduced to matrix form, we can analyze it by Latent Semantic Analysis.

Titles
Index Words 1 2 3 4 5 6 7 8 9
book     1 1          
dad's           1     1
dummies   1           1  
estate             1   1
guide 1         1      
investing 1 1 1 1 1 1 1 1 1
market 1   1            
real             1   1
rich           2     1
stock 1   1         1  
value       1 1        

Once we have built our (words x titles) matrix, we call upon a powerful but little known technique called Singular Value Decomposition or SVD to analyze the matrix for us. The "Singular Value Decomposition Tutorial" is a gentle introduction for those readers that want to learn more about this powerful and useful algorithm. SVD is the muscle behind Latent Semantic Analysis, but for this discussion we'll just accept the results that it gives us. Briefly, Singular Value Decomposition allows you to find patterns in the matrix and identify which words and documents are similar to each other.

For this example we will do a 3 factor Singular Value Decomposition of the matrix. Therefore each word and title will have 3 numbers associated with it, one for each factor. The first factor tends to be the average of the counts for the words and titles and is not as informative as the second and third factors. Here is the SVD factorization of this matrix.

book 0.15 -0.27 -0.04
dad's 0.24 0.38 0.09
dummies 0.13 -0.17 -0.07
estate 0.18 0.19 -0.45
guide 0.22 0.09 0.46
investing 0.74 -0.21 -0.21
market 0.18 -0.30 0.28
real 0.18 0.19 -0.45
rich 0.36 0.59 0.34
stock 0.25 -0.42 0.28
value 0.12 -0.14 -0.23
*
3.91 0 0
0 2.61 0
0 0 2.00
*
1 2 3 4 5 6 7 8 9
0.35 0.22 0.34 0.26 0.22 0.49 0.28 0.29 0.44
-0.32 -0.15 -0.46 -0.24 -0.14 0.55 0.07 -0.31 0.44
0.41 -0.14 0.16 -0.25 -0.22 0.51 -0.55 0.00 -0.34

One interesting method of looking at these results is to treat the factors as xy coordinates in a graph and see where the words and titles end up. This helps find patterns in how words and titles are related to each other. In the graph below, words are represented by pink squares and titles are represented by blue diamonds. For example the word "book" is associated with the factors (0.15, -0.27, -0.04). We ignore the first factor 0.15 and graph "book" to position (x = -0.27, y = -0.04) as can be seen in the graph. Titles are similarly graphed.

Once advantage of this technique is that both words and titles are placed on the same graph. Not only can we identify clusters of titles, but we can also label the clusters by looking at what words are also in the cluster. For example, the top left cluster has titles 1 and 3 which are both about stock market investing. The words "stock" and "market" are conveniently located in the cluster, making it easy to see what the cluster is about. Another example is the middle cluster which has titles 2, 4, 5, and, to a somewhat lesser extent, title 8. Titles 2, 4, and 5 are close to the words "value" and "investing" which summarizes those titles quite well.

2-dimensional LSI example showing clustering of titles and index words

The larger the set of documents or web pages, the more information and connections there are to work with, and the better clusters and cluster descriptions that can be found. This example illustrates, on a small scale, the methods that iMetaSearch uses to analyze searches. Of course iMetaSearch deals with much larger amounts of data, does pre-processing of cell counts, works with phrases in addition to words, and uses sorted lists instead of a graph but the general principles are the same.

Copyright © Puffinware LLC 2007. All trademarks are copyright of their respective owners.