Topic analysis of the 100 files medline12n06xx by SAGE
Tomonari Masada (Nov. 1, 2012)
- Method:
- Use the 100 files named medline12n06xx from the MEDLINE dataset.
- Remove stop words. Then remove the 10 most frequent words and every word whose frequency is less than 50. Apply neither stemming nor lower-casing.
- Run collapsed Gibbs sampling (CGS) for latent Dirichlet allocation (LDA) with 100 topics.
- For a dataset this large, the number of topics should be larger. This is just a test run of my implementation of VB for SAGE.
- Initialize the parameters of SAGE from the result of CGS, then run variational Bayesian (VB) inference for SAGE.
- Count the number of topic assignments for each pair of topic and word.
- Remove the two most popular topics.
- For each topic, remove every word whose topic assignment frequency is less than 500.
- Visualize the remaining counts as a D3 bubble chart.
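
The vocabulary filtering step above can be sketched as follows. This is only a minimal illustration, not the code used for this run: the function name `build_vocabulary` and its parameters are hypothetical, "frequency" is assumed to mean the corpus-wide token count, and the 10 most frequent words are assumed to be picked after the stop words are removed.

```python
from collections import Counter

def build_vocabulary(docs, stop_words, top_k=10, min_freq=50):
    """Return the set of words kept after filtering (hypothetical helper).

    docs: iterable of token lists, one list per document.
    stop_words: set of words to discard outright.
    """
    # Count corpus-wide token frequencies. Tokens are used exactly as given:
    # no stemming and no lower-casing, so "Cell" and "cell" stay distinct.
    freq = Counter(w for doc in docs for w in doc if w not in stop_words)
    # Assumption: the most frequent words are chosen among the non-stop words.
    most_frequent = {w for w, _ in freq.most_common(top_k)}
    # Keep words that are frequent enough but not among the top_k.
    return {w for w, c in freq.items() if c >= min_freq and w not in most_frequent}
```

With the settings of this run, the call would be `build_vocabulary(docs, stop_words, top_k=10, min_freq=50)`.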
- Interpretation:
- Different colors of circles correspond to different topics.
- Words of the same color are often used together to express the same topic.
- The same word may be used to express more than one topic.
- A larger circle means that the word is used more often to express the topic indicated by the circle's color.
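
The numbers behind the circles (count per topic-word pair, drop the two most popular topics, drop pairs below 500) can be sketched as follows. This is a rough illustration under assumptions, not the actual implementation: the function name `bubble_records` and the record keys are hypothetical, each token is assumed to carry a single hard topic assignment, and topic popularity is assumed to be the total number of tokens assigned to the topic.

```python
from collections import Counter

def bubble_records(assignments, n_drop_topics=2, min_count=500):
    """Turn per-token (topic, word) pairs into bubble chart records (hypothetical helper)."""
    # Number of topic assignments for each (topic, word) pair.
    pair_counts = Counter(assignments)
    # Assumption: a topic's popularity is its total number of assigned tokens.
    topic_totals = Counter()
    for (topic, _word), count in pair_counts.items():
        topic_totals[topic] += count
    dropped = {t for t, _ in topic_totals.most_common(n_drop_topics)}
    # One record per remaining pair: topic gives the color, count gives the size.
    return [
        {"topic": topic, "word": word, "count": count}
        for (topic, word), count in pair_counts.items()
        if topic not in dropped and count >= min_count
    ]
```

Each returned record would correspond to one circle: `topic` selects the color and `count` determines the size, so the same word can appear in several records with different topics.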