Latent Semantic Analysis (LSA) Tutorial - Part 3 - Using the Singular Value Decomposition
Part 3 - Using the Singular Value Decomposition
Once we have built our (words by titles) matrix, we call upon a powerful but little-known technique called Singular Value Decomposition, or SVD, to analyze the matrix for us. The "Singular Value Decomposition Tutorial" is a gentle introduction for readers who want to learn more about this powerful and useful algorithm.
SVD is useful because it finds a reduced-dimensional representation of our matrix that emphasizes the strongest relationships and throws away the noise. In other words, it makes the best possible reconstruction of the matrix with the least possible information: it discards noise, which does not help, and emphasizes strong patterns and trends, which do. The trick in using SVD is figuring out how many dimensions or "concepts" to use when approximating the matrix. With too few dimensions, important patterns are left out; with too many, noise caused by random word choices creeps back in.
The SVD algorithm is a little involved, but fortunately Python's numerical libraries (for example NumPy or SciPy) provide a function that makes it simple to use. By adding a one-line method that calls this function to our LSA class, we can factor our matrix into 3 other matrices. The U matrix gives us the coordinates of each word in our “concept” space, the Vt matrix gives us the coordinates of each document in our “concept” space, and the S matrix of singular values gives us a clue as to how many dimensions or “concepts” we need to include.
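The factorization step might look roughly like the sketch below. It is an illustration under stated assumptions, not the tutorial's own listing: the `LSA` class here is a hypothetical stand-in holding the matrix in an attribute `A`, the method name `calc` is assumed, `numpy.linalg.svd` is used as the library function, and the small count matrix is made-up data.

```python
import numpy as np


class LSA:
    """Hypothetical stand-in for the tutorial's LSA class."""

    def __init__(self, A):
        # A is assumed to be the (words by titles) count matrix.
        self.A = np.asarray(A, dtype=float)

    def calc(self):
        # Factor A so that A == U @ diag(S) @ Vt.
        # full_matrices=False returns the compact form of the factors.
        self.U, self.S, self.Vt = np.linalg.svd(self.A, full_matrices=False)


# Toy 4-word by 3-title count matrix (made-up data).
toy = [[1, 0, 1],
       [0, 1, 1],
       [1, 1, 0],
       [0, 0, 2]]

lsa = LSA(toy)
lsa.calc()

# Multiplying the three factors back together recovers the original matrix.
reconstructed = lsa.U @ np.diag(lsa.S) @ lsa.Vt
print(np.allclose(reconstructed, lsa.A))
```

Each row of `U` holds one word's coordinates, each column of `Vt` holds one title's coordinates, and `S` is returned as a one-dimensional array of singular values sorted from largest to smallest.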
To choose the right number of dimensions, we can make a histogram of the squares of the singular values. This graphs how much each singular value contributes to approximating our matrix. Here is the histogram in our example.
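As a rough sketch of how the values behind such a histogram could be computed, the snippet below squares the singular values and reports each one's share of the total. The matrix is made-up data, and the text bars are only a stand-in for a real plot; a plotting library could draw the same numbers as a bar chart.

```python
import numpy as np

# Toy 4-word by 3-title count matrix (made-up data).
A = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 2]], dtype=float)

# compute_uv=False returns only the singular values, largest first.
S = np.linalg.svd(A, compute_uv=False)

# The squared singular values sum to the sum of the squared matrix entries,
# so each share is the fraction of the matrix "explained" by that dimension.
squared = S ** 2
share = squared / squared.sum()

for i, (sq, frac) in enumerate(zip(squared, share), start=1):
    bar = "#" * int(round(40 * frac))
    print(f"dimension {i}: S^2 = {sq:7.3f}  share = {frac:6.1%}  {bar}")
```

A sharp drop in the bars after some dimension suggests that the remaining dimensions mostly carry noise.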
For large collections of documents, the number of dimensions used is typically in the 100 to 500 range. In our little example, since we want to graph it, we’ll use 3 dimensions, throw out the first dimension, and graph the second and third dimensions.
The reason we throw out the first dimension is interesting. For documents, the first dimension correlates with the length of the document. For words, it correlates with the number of times that word has been used in all documents. If we had centered our matrix, by subtracting the average column value from each column, then we would use the first dimension. As an analogy, consider golf scores. We don’t want to know the actual score; we want to know the score relative to par, that is, after subtracting par from it. That tells us whether the player made a birdie, bogey, etc. The reason we don't center the matrix when using LSA is that doing so would turn a sparse matrix into a dense one and dramatically increase the memory and computation requirements. It's more efficient to leave the matrix uncentered and then throw out the first dimension.
Here is the complete 3-dimensional Singular Value Decomposition of our matrix. Each word has 3 numbers associated with it, one for each dimension. The first number tends to correspond to the number of times that word appears in all titles and is not as informative as the second and third dimensions, as we discussed. Similarly, each title also has 3 numbers associated with it, one for each dimension. Once again, the first dimension is not very interesting because it tends to correspond to the number of words in the title.
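The truncation and the centering trade-off can be sketched as follows. This is a minimal illustration on made-up data using `numpy.linalg.svd`; the variable names are hypothetical, and a dense array stands in for what would be a sparse matrix in a real document collection.

```python
import numpy as np

# Toy 5-word by 4-title count matrix (made-up data).
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1],
              [2, 0, 0, 1],
              [0, 1, 0, 1]], dtype=float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the 3 strongest dimensions.
k = 3
word_coords = U[:, :k]       # one row per word, one column per dimension
title_coords = Vt[:k, :].T   # one row per title, one column per dimension

# Throw out the first dimension and keep the second and third for graphing.
word_xy = word_coords[:, 1:3]
title_xy = title_coords[:, 1:3]

# Centering subtracts each column's mean from that column. Every zero entry
# becomes nonzero, which is why a sparse matrix would turn dense.
centered = A - A.mean(axis=0)
nonzero_before = np.count_nonzero(A)
nonzero_after = np.count_nonzero(centered)
print(f"nonzero entries before centering: {nonzero_before}")
print(f"nonzero entries after centering:  {nonzero_after}")
```

Here `word_xy` and `title_xy` hold the two coordinates per word and per title that would be graphed, and the nonzero counts show how centering fills in the zeros of the count matrix.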



