Of the many interesting things in Matthew Jockers’s Macroanalysis, I was most intrigued by his discussion of interpreting the topics in topic models. Interpretation is what literary scholars are trained for and tend to excel at, and I’m somewhat skeptical of the notion of an “uninterpretable” topic. I prefer to think of it as a topic that hasn’t yet met its match, hermeneutically speaking. In my experience building topic models of scholarly journals, I have found clear examples of lumping and splitting—terms that are either separated from their natural place or agglomerated into an unhappy mass. The ‘right’ number of topics for a given corpus is generally the one which has the lowest visible proportion of lumped and split topics. But there are other issues in topic-interpretation that can’t easily be resolved this way.
A problem I’ve found in modeling scholarship is how “evidence/argument words” are always highly represented in any given corpus. If you use hyperparameter optimization, which weighs topics according to the relative proportion in the corpus, words like “fact evidence argue make” tend to compose the most representative topics. Options include simply eliminating the topic from the browser, which seems to eliminate a large number of documents that would be otherwise classified, or trying to add all of the evidence words to a stop list. The aggressive pursuit of stop-words degrades the model, though this observation is more of an intuition than anything I can now document.
I thought it might be helpful to others who are interested in working with topic models to create several models of the same corpus and look at the effects created by small changes in the parameters (number of topics, lemmatization of corpus, and stop-words). The journal that I chose to use for this example is the Journal of Law and Economics, for both its ideological interest and methodological consistency. The law-and-economics movement is about as far away from literary studies as it’s possible to be while still engaging in a type of discourse analysis, I think, and I find this contrast both amusing and potentially illuminating. That the field of law-and-economics is perhaps the most well-known (even infamous) example of quantified reasoning used in support of what many view as a distinct political agenda is what led me to choose it to begin to explore the potential critical usefulness of another quantitative method of textual analysis.
I began by downloading all of the research articles published in the journal from JSTOR’s Data for Research. There were 1281 articles. I then converted the word-frequency lists to bags-of-words and created a 70-topic model using MALLET.* The browsable model is here. The first topic is the most general of academic evidence/argument words: “made, make, case, part, view, difficult. . .” I was intrigued by the high-ranking presence of articles by Milton Friedman and R. H. Coase in this topic; it would be suggestive if highly cited or otherwise important articles were most strongly associated with the corpus’s “evidence” terms, but I can’t say that this is anything other than coincidence. The next topic shows the influence of the journal’s title: “law, economics, economic, system, problem, individual.” The duplication of the adjective and noun form of “economics” can be eliminated with stemming or lemmatizing the corpus, though it is not clear if this increases the overall clarity of the model. I noticed that articles “revisiting” topics such as “social cost” and “public goods” are prominent in this topic, which is perhaps explainable by an unusually high proportion of intra-journal citations. (I want to bemoan, for the thousandth time, the loss of JSTOR’s citation data from its API.)
The next two topics are devoted to methodology. Econometric techniques dominate the content of the Journal of Law and Economics, so there’s no surprise that topics featuring those terms would be this widely distributed. Of the next three topics, one seems spuriously related to citations and the other two are also devoted to statistical methodology. It is only the eighth topic that is unambiguously associated with a recognizable subject in the journal: market efficiency. Is this apparent overemphasis on evidence/methodology a problem? And if so, what do you do about it? One approach would be to add many of the evidence-related words to a stop-list. Another would be to label all the topics and let the browser decide which are valuable. Here is a rough attempt at labeling the seventy-topic model.
The number of topics generated is the most obvious and effective parameter to adjust. Though I ended up labeling several of the topics the same way, I’m not sure that I would define those as split topics. The early evidence/methodology related topics do have slightly distinct frames of reference. The topics labeled “Pricing” also refer to different aspects of price theory, which I could have specified. The only obviously lumped-together topic was the final one, with its mixture of sex-worker and file-sharing economics. If there is evidence of both lumping and splitting, then simply adjusting the number of topics is unlikely to solve both problems.
An alternative to aggressive stop-wording is lemmatization. The Natural Language Toolkit has a lemmatizer that calls on the WordNet database. Implementation is simple in python, though slow to execute. A seventy-topic model generated with the lemmatized corpus has continuities with the non-lemmatized model. The browser shows that there are fewer evidence-related topics. Since the default stop-word list does not include the lemmatized forms “ha,” “doe,” “wa,” or “le,” it aggregates those in topics that are more strongly representative than the similar topics in the non-lemmatized model. Comparing the labeled topics with the non-lemmatized model show that there are many direct correspondences. The two insurance-related topics, for instance, have very similar lists of articles. The trend lines do not always match very well, which I believe is caused by the much higher weighting of the first “argument words” topic in the lemmatized corpus (plus also issues about the reliability of graphing these very small changes).
Labeling is inherently subjective, and my adopted labels for the lemmatized corpus were both whimsical in places and also influenced by the first labels that I had chosen. As I mentioned in my comments on Matthew Jockers’s Macroanalysis, computer scientists have developed automatic labeling techniques for topic models. While labor-intensive, doing it by hand forces you to consider each topic’s coherence and reliability in a way that might be easy to miss otherwise. The browser format that shows the articles most closely associated with each topic helps label them as well, I find. It might not be a bad idea for a topic model of journal articles to label each topic based on the title of the article most closely associated with it; this technique would only mislead on deeply divided or clustered topics, or on those which have only one article strongly associated with it (a sign of too many topics in my experience).
(UPDATE: My initial labeling of the tables below was in error because of an indexing error with the topic numbers. The correlations below make much more sense in terms of the topics’ relative weights, and I’m embarrassed that I didn’t notice the problem earlier.)
The topics were not strongly correlated with each other in either direction. In the non-lemmatized model, the only topics with a Pearson correlation above .4 were
The negative correlations below -.4 were
Ted Underwood and Andrew Goldstone’s PMLA topic-modeling post used network graphs to visualize their models and produce identifiable clusters. I suspect this particular model could be graphed in the same way, but the relatively low correlations between topics makes me a little leery of trying it. I generated a few network graphs for John Laudun’s and my folklore project, but we didn’t end up using them for the first article. They weren’t as snazzy as the Underwood and Goldstone graphs, as my gephi patience often runs very thin. (Gephi also has problems with the latest java update, as Ian Milligan pointed out to me on twitter. I intend to update this post before too long with a D3 network graph of the topic correlations.)
The most strongly correlated topics in the lemmatized corpus were
Here is a simple network graph of the positively correlated topics above .2 (thicker lines indicate stronger correlation):
My goal is to integrate a D3.js version of these network graphs into the browsers, so that the nodes link to the topics and that the layout is adjustable. I haven’t yet learned the software well enough to do this however. The simple graph above was made using the R igraph package. [UDPATE: See here for a simple D3.js browser.]
And the negative correlations:
The fact that some topics appear at the top of both the negative and positive correlations in both of the models suggests to me that there is some artifact of the hyperparameter optimization process responsible for this in a way that I don’t quite grasp (though I am aware, sadly enough, that the explanation could be very simple). The .4 threshold I chose is arbitrary, and the correlations follow a consistent and smooth pattern in both models. The related articles section of these browsers is based on Kullback-Leibler divergence, a metric apparently more useful than Manhattan distance. It seems to me that the articles listed under each topic are much more likely to be related to one another than any metric I’ve used to compare the overall weighting of topics.
Another way of assessing the models and label-interpretations is to check where they place highly cited articles. According to google scholar, the most highly cited article** in Journal of Law and Economics is Fama and Jensen’s “Separation of Ownership and Control.” In the non-lemmatized model, it is associated with the AGENTS AND ORGANIZATIONS topic. It appears in the topic I labeled INVESTORS in the lemmatized corpus, but further reflection shows that these terms are closer than I first thought. My intuition, as I have mentioned before in this discussion of Pierre Nora’s “Between Memory and History,” is that highly cited articles are somehow more central to the corpus because they affect the subsequent distribution of terms. The next-most cited article, Oliver Williamson’s “Transaction-cost Economics: The Governance of Contractual Relations” appears, suitably enough, in the topics devoted to contracts in both browsers. And R. H. Coase’s “The Federal Communications Commission” is in the COMMUNICATIONS REGULATION topic in both browsers, a topic whose continuing theoretical interest to the journal was established by Coase’s early article.
As I mentioned in the beginning, I chose the Journal of Law and Economics for this project in interpreting topics in part because of its ideological interest. I have little sympathy for Chicago-style economics and its dire public policy recommendations, but I only expressed that in this project through some sarcastic topic-labeling. Does the classification and sorted browsing enabled by topic modeling affect how a reader perceives antagonistic material? Labeling can be an aggressive activity; would automated labeling of topics alleviate this tendency or reinforce it? I don’t know if this subject has been addressed in informational-retrieval research, but I’d like to find out.
*I am leaving out some steps here. My code that processes the MALLET output into a browser uses scripts in perl and R to link the metadata to the files and create graphs of each topic. Andrew Goldstone’s code performs much the same functions and is much more structurally sound than what I created, which is why I haven’t shared my code. For creating browsers, Allison Chaney’s topic-modeling visualization engine is what I recommend, though I was unsure how to convert MALLET’s output to the lda-c output that it expects (though doing so would doubtlessly be much simpler than writing on your own as I did).
**That is the most highly cited article anywhere that google’s bots have found, not just in the journal itself. I am aware of the assumption inherent in claiming that a highly cited article would necessarily be influential to that particular journal’s development, since disciplinary and discourse boundaries would have to be taken into account. All highly cited articles are cited in multiple disciplines, I believe, and that applies even to a journal carving out new territory in two well-established ones like law and economics.