Latent Semantic Analysis (LSA) Tutorial

A Small Example

As a small example, I searched for books using the word “investing” at Amazon.com and took the top 10 book titles that appeared. One of these titles was dropped because it had only one index word in common with the other titles. An index word is any word that:

  • appears in 2 or more titles, and
  • is not a very common word such as “and”, “the”, and so on (known as stop words). These words are not included because they contribute little, if any, meaning.

In this example we have removed the following stop words: “and”, “edition”, “for”, “in”, “little”, “of”, “the”, “to”.

Here are the 9 remaining titles. The index words (words that appear in 2 or more titles and are not stop words) are: book, dad's, dummies, estate, guide, investing, market, real, rich, stock, and value.

  1. The Neatest Little Guide to Stock Market Investing
  2. Investing For Dummies, 4th Edition
  3. The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
  4. The Little Book of Value Investing
  5. Value Investing: From Graham to Buffett and Beyond
  6. Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
  7. Investing in Real Estate, 5th Edition
  8. Stock Investing For Dummies
  9. Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss
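
To make the index word rule concrete, here is a minimal Python sketch that applies it to the 9 titles above. The tokenization is my own assumption rather than anything the rule prescribes: I lowercase everything, drop apostrophes (so “Dad's” becomes “dads”), and split on anything that is not a letter or digit. The names `tokenize`, `doc_freq`, and `index_words` are just illustrative.

```python
import re
from collections import Counter

titles = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss",
]

# Stop words removed in this example.
stop_words = {"and", "edition", "for", "in", "little", "of", "the", "to"}

def tokenize(title):
    # Assumed normalization: lowercase, drop apostrophes, split on non-alphanumerics.
    cleaned = title.lower().replace("'", "").replace("\u2019", "")
    return [w for w in re.findall(r"[a-z0-9]+", cleaned) if w not in stop_words]

# Count how many titles each word appears in (not how many times it occurs overall).
doc_freq = Counter()
for title in titles:
    doc_freq.update(set(tokenize(title)))

# An index word appears in 2 or more titles and is not a stop word.
index_words = sorted(w for w, n in doc_freq.items() if n >= 2)
print(len(index_words), index_words)
```

Under these assumptions the sketch finds 11 index words: book, dads, dummies, estate, guide, investing, market, real, rich, stock, and value. Note that “Rich” occurs twice in title 6 but is counted only once for that title, since the rule is about how many titles a word appears in.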

Once Latent Semantic Analysis has been run on this example, we can plot the index words and titles on an XY graph and identify clusters of titles. The 9 titles are plotted with blue circles and the 11 index words are plotted with red squares. Not only can we spot clusters of titles, but since index words can be plotted along with titles, we can label the clusters. For example, the blue cluster, containing titles T7 and T9, is about real estate. The green cluster, with titles T2, T4, T5, and T8, is about value investing, and finally the red cluster, with titles T1 and T3, is about the stock market. The T6 title is an outlier, off on its own.

[Figure: XY graph of the 9 titles and 11 index words, showing the clusters described above]

In the next few sections, we'll go through all the steps needed to run Latent Semantic Analysis on this example.



 