| Latent
Sematic Analysis or LSA is a way of finding patterns
among a collection of documents such as web pages. It is
increasingly used by major search engines, such as Google, in
ranking websites and determining what AdSense ads to show on
a page. To see how Latent Semantic Analysis works, imagine
that you have a collection of documents such as web pages, or
in the simple example we show below, book titles. How would
you go about finding similarities and differences between the
documents?
One way is to form a large matrix with each column
representing a document, and each row representing a word
that has been extracted from the documents. Then, each cell
of the matrix is simply the number of times that word appears
in that document. For example, if the word "farm" appears in
the first document 7 times, then that cell would have a 7 in
it. Each cell is simply a count of the number of times that
word appears in that document.
The cell numbers are usually massaged, so that whatever
patterns are present can be seen more clearly. This step
corresponds to "cleaning the data" so that, for example,
frequent words are not weighted too heavily, some natural
language constructs are simplified, and long documents don't
have an unfair advantage. Some ways that cell numbers are
massaged are:
- Log of Counts - the log of the counts in each cell may
be used instead of the actual counts.
- Stemming
- related words may have the same root word and should be
considered the same (such as golf and golfing).
- TF-IDF
- (term frequency - inverse document frequency) attempts
to measure the importance of a term or word.
- Entropy
- another way to measure term importance based on the
distribution of the term through documents.
- Normalization
- sets each document vector to length 1 so documents with
more words don't have an unfair advantage.
In order to give an example of Latent Sematic Analysis, I
went to Amazon.com, searched on "investing", and took the top
10 books titles that were displayed. One of those book titles
only had one index word so I dropped it. Here are the other 9
titles with the index words underlined. To be an index word,
the word must occur in 2 or more titles, and not be a "noise"
or stop word such as "the", "to", "of", etc.
- The Neatest Little Guide to
Stock Market Investing
- Investing For Dummies, 4th Edition
- The Little Book of Common
Sense Investing: The Only Way
to Guarantee Your Fair Share of Stock Market Returns
- The Little Book of Value Investing
- Value Investing: From Graham to
Buffett and Beyond
- Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor
and the Middle Class Do Not!
- Investing in Real Estate, 5th Edition
- Stock Investing For Dummies
- Rich Dad's Advisors®: The ABC's
of Real Estate Investing: The Secrets of
Finding Hidden Profits Most Investors Miss
We can represent this information as a matrix. For this
matrix, each index word is a row and each title is a column.
Each cell contains the number of times that word occurs in
that title. For example, the word "book" appears one time in
title 3 and one time in title 4, whereas "investing" appears
one time in every title. In general matrices of this type,
where we count the number of words in each document, tend to
be sparse, that is most entries are 0. Instead of displaying
all the 0's, we have just left them out so as not to clutter
the matrix, but in calculations the 0's are used.
In this example we have not massaged the counts at all,
but in more sophisticated systems the counts are usually
processed by one or more ways as described above. Once the
problem is reduced to matrix form, we can analyze it by
Latent Semantic Analysis.
|
Titles |
| Index Words |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
| book |
|
|
1 |
1 |
|
|
|
|
|
| dad's |
|
|
|
|
|
1 |
|
|
1 |
| dummies |
|
1 |
|
|
|
|
|
1 |
|
| estate |
|
|
|
|
|
|
1 |
|
1 |
| guide |
1 |
|
|
|
|
1 |
|
|
|
| investing |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
| market |
1 |
|
1 |
|
|
|
|
|
|
| real |
|
|
|
|
|
|
1 |
|
1 |
| rich |
|
|
|
|
|
2 |
|
|
1 |
| stock |
1 |
|
1 |
|
|
|
|
1 |
|
| value |
|
|
|
1 |
1 |
|
|
|
|
Once we have built our (words x titles) matrix, we call
upon a powerful but little known technique called Singular
Value Decomposition or SVD to analyze the matrix for us. The
"Singular
Value Decomposition Tutorial" is a gentle introduction
for those readers that want to learn more about this powerful
and useful algorithm. SVD is the muscle behind Latent
Semantic Analysis, but for this discussion we'll just accept
the results that it gives us. Briefly, Singular Value
Decomposition allows you to find patterns in the matrix and
identify which words and documents are similar to each
other.
For this example we will do a 3 factor Singular Value
Decomposition of the matrix. Therefore each word and title
will have 3 numbers associated with it, one for each factor.
The first factor tends to be the average of the counts for
the words and titles and is not as informative as the second
and third factors. Here is the SVD factorization of this
matrix.
|
| book |
0.15 |
-0.27 |
-0.04 |
| dad's |
0.24 |
0.38 |
0.09 |
| dummies |
0.13 |
-0.17 |
-0.07 |
| estate |
0.18 |
0.19 |
-0.45 |
| guide |
0.22 |
0.09 |
0.46 |
| investing |
0.74 |
-0.21 |
-0.21 |
| market |
0.18 |
-0.30 |
0.28 |
| real |
0.18 |
0.19 |
-0.45 |
| rich |
0.36 |
0.59 |
0.34 |
| stock |
0.25 |
-0.42 |
0.28 |
| value |
0.12 |
-0.14 |
-0.23 |
|
* |
| 3.91 |
0 |
0 |
| 0 |
2.61 |
0 |
| 0 |
0 |
2.00 |
|
* |
| 1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
| 0.35 |
0.22 |
0.34 |
0.26 |
0.22 |
0.49 |
0.28 |
0.29 |
0.44 |
| -0.32 |
-0.15 |
-0.46 |
-0.24 |
-0.14 |
0.55 |
0.07 |
-0.31 |
0.44 |
| 0.41 |
-0.14 |
0.16 |
-0.25 |
-0.22 |
0.51 |
-0.55 |
0.00 |
-0.34 |
|
One interesting method of looking at these results is to
treat the factors as xy coordinates in a graph and see where
the words and titles end up. This helps find patterns in how
words and titles are related to each other. In the graph
below, words are represented by pink squares and titles are
represented by blue diamonds. For example the word "book" is
associated with the factors (0.15, -0.27, -0.04). We ignore
the first factor 0.15 and graph "book" to position (x =
-0.27, y = -0.04) as can be seen in the graph. Titles are
similarly graphed.
Once advantage of this technique is that both words and
titles are placed on the same graph. Not only can we identify
clusters of titles, but we can also label the clusters by
looking at what words are also in the cluster. For example,
the top left cluster has titles 1 and 3 which are both about
stock market investing. The words "stock" and "market" are
conveniently located in the cluster, making it easy to see
what the cluster is about. Another example is the middle
cluster which has titles 2, 4, 5, and, to a somewhat lesser
extent, title 8. Titles 2, 4, and 5 are close to the words
"value" and "investing" which summarizes those titles quite
well.

The larger the set of documents or web pages, the more
information and connections there are to work with, and the
better clusters and cluster descriptions that can be found.
This example illustrates, on a small scale, the methods that
iMetaSearch uses to analyze searches. Of course iMetaSearch
deals with much larger amounts of data, does pre-processing
of cell counts, works with phrases in addition to words, and
uses sorted lists instead of a graph but the general
principles are the same.
|