I’ve been thinking a lot recently about a simple question: can machine learning detect patterns of disciplinary change that are at odds with received understanding? The techniques I’ve been using to test this, LDA and its dynamic variant, do a very good job of picking up the patterns you would expect to find in, say, a large corpus of literary journals. The model I built of several theoretically oriented journals in JSTOR, for example, shows much the same trends that anyone familiar with the broad contours of literary theory would expect. The relative absence of historicism as a topic of self-reflective inquiry is also explainable: it follows from which journals are represented and from historicism’s comparatively low incidence of distinctive keywords and rote citations.
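For readers who haven’t looked under the hood, the document-topic proportions these models produce can be illustrated with a toy collapsed Gibbs sampler. This is only a sketch of the underlying idea, not the heavily optimized sampler that a tool like MALLET actually uses, and the tiny two-theme corpus in the test is invented:

```python
import random
from collections import defaultdict

def gibbs_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """A toy collapsed Gibbs sampler for LDA (illustration only)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # z[d][i]: topic currently assigned to the i-th token of document d
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # remove this token's assignment, resample, put it back
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # smoothed document-topic proportions (each row sums to 1)
    theta = [[(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha)
              for t in range(n_topics)] for d, doc in enumerate(docs)]
    return theta, nkw
```

The `theta` rows are what the trend lines in a dynamic model are ultimately aggregated from: each document’s estimated mixture over topics.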
I’ve heard from people on Twitter that it’s a widely held belief that machine-learning techniques (and, by extension, all quantitative methods) can only tell us what we already know about the texts. I admit some initial skepticism about the prevalence of this claim, but I’ve now seen more evidence of it in the wild, so to speak, and I think I understand where some of this overly categorical skepticism comes from. One test of the validity of topic modeling, for example, is whether it produces a coherent model of a well-known corpus; if it does, it is likely to do the same for an unknown or unread group of texts. The corpora of scholarly literature from JSTOR that I have modeled, I can see, are thought by some of the people who’ve seen the models to be well understood. If the models reflect the general topics and trends that people know from their knowledge of the field, then that’s great as far as it goes, but we’ll have to reserve judgment on the great unread.
One issue here is that I don’t think the disciplinary history of any field is well understood. Topic modeling’s disinterested aggregations have the potential to show an unrecognized formation, or the persistence of a trend long thought dormant. Clancy found a cluster of articles in rhetoric journals associated with a topic she would initially have labeled “expressivist,” dating from several decades before she would have expected it. Part of this has to do with the eclectic nature of what’s published in College English, of course, and part with the parallels between creative writing and expressivist pedagogy. But it’s the type of specific connection that someone following established histories is not likely to find.
Ben Schmidt noted that topic modeling was designed and marketed, to some degree, as a replacement for keyword search. Schmidt is more skeptical than I am of the usefulness of this higher level of abstraction for general scholarly research. I know enough about anthropology to have my eyebrows raised by this Nicholas Wade essay on Napoleon Chagnon, for example, and I still find this browser of American Anthropologist to be a quicker way of finding articles than JSTOR’s interface. I created the browser to compare with the folklore browser* of the corpus that John Laudun and I have been working with. We wanted to see if topic models would reflect our intuition that the cultural/linguistic turn diffused through the two disciplines’ scholarly journals at the expected times, earlier in anthropology than in folklore. (The folklore corpus contains the journal most analogous to American Anthropologist, The Journal of American Folklore, along with several other folklore journals.)
A very promising way, to my mind, of correlating topic models of journals is with networks of citations. I’ve done enough network graphs of scholarly citations to know that, unless you heavily prune and categorize the citations, the results are going to be hard to visualize in any meaningful way. (One of the first network graphs I created, of all the citations in thirty years of JAF, required zooming in to something like 1000x magnification to make out individual nodes. I’m far from an expert at creating efficient network visualizations, needless to say.) JSTOR once provided citation data through its Data for Research interface; as far as I know, it no longer does. This has been somewhat frustrating.
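The pruning step itself is simple to sketch, whatever visualization tool sits downstream: count how often each cited source appears across the whole corpus and drop everything below a threshold before drawing any edges. A minimal sketch (the article ids and cited names here are invented):

```python
from collections import Counter

def build_citation_graph(articles, min_citations=5):
    """Prune a citation network by total citation count.

    `articles` maps an article id to a list of cited-source names.
    Returns the corpus-wide citation totals and, per article, only the
    edges to sources cited at least `min_citations` times overall."""
    totals = Counter(c for cites in articles.values() for c in cites)
    kept = {c for c, n in totals.items() if n >= min_citations}
    edges = {art: sorted(set(cites) & kept) for art, cites in articles.items()}
    return totals, edges
```

With a threshold like this, the hairball of one-off citations falls away and only the shared touchstones remain as nodes worth plotting.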
If we had citation data, we could take two topics that both seem reflective of a general cultural/linguistic/poststructuralist influence, such as this folklore topic and this anthropological one, and compare their citation networks to see if the concomitant rise in topic proportion was reflected in references to shared sources. (Lévi-Strauss, for example, I know to be one of the most cited authors in the folklore corpus.) I would also like to explore the method described in this paper that uses a related form of posterior inference to discover the most influential documents in a corpus.**
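Even before any citation work, the “concomitant rise” can be checked numerically: tabulate each topic’s proportion by year and compute a plain Pearson correlation between the two series. A stdlib-only sketch (the yearly proportions below are invented for illustration):

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical mean yearly proportions of the two topics, oldest year first.
folklore_topic = [0.01, 0.02, 0.04, 0.07, 0.09]
anthro_topic = [0.03, 0.05, 0.06, 0.09, 0.12]
r = pearson(folklore_topic, anthro_topic)
```

A strong positive `r` would support the intuition of a shared trend, though lagged correlations (shifting one series by a few years) would be the more interesting test of diffusion from anthropology into folklore.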
This type of comparative exploration, while presenting an interesting technical challenge to implement (to me, that is; I fully recognize the incommensurable gulf between using these algorithms and creating and refining them), can’t (yet) be mistaken for discovery. You can’t go from this to an a priori proof of non-discovery, however. Maybe no one is actually arguing this position, and I’m fabricating a straw argument out of supercilious tweets and decontextualized, half-remembered blog posts.
A more serious personal intellectual problem for me is that I find the dispute between Peter Norvig and Noam Chomsky to be either a case of mutual misunderstanding or one in which Chomsky has by far the more persuasive case. If I’m being consistent, then, I’d have to reject at least some of the methodological premises behind topic modeling and related techniques. Perhaps “practical value” and “exploration/discovery” can share a peaceful coexistence.
*These browsers work by showing an index page with the first four words of each topic. You can then click on any topic to see the full list of words associated with it, together with a list of articles sorted by how strongly they represent that topic. Clicking on an individual article takes you to a page that shows the other topics most associated with that article, also clickable, and a link to the article’s JSTOR page.
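The two sorting operations behind such a browser are straightforward once the model output is in hand. A sketch of the logic, not of my actual scripts, with hypothetical topic-word weights and document-topic proportions:

```python
def topic_label(topic_words, n=4):
    """The index-page label: a topic's top n words, ranked by weight."""
    ranked = sorted(topic_words, key=topic_words.get, reverse=True)
    return " ".join(ranked[:n])

def articles_for_topic(doc_topics, topic, n=10):
    """Article ids sorted by how strongly each represents `topic`.

    `doc_topics` maps an article id to its list of per-topic proportions."""
    ranked = sorted(doc_topics, key=lambda a: doc_topics[a][topic], reverse=True)
    return ranked[:n]
```

Everything else in the browser is just rendering these two rankings as linked pages.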
**The note about the model taking more than ten hours to run fills me with foreboding, however. My (doubtless inefficient) browser-creating scripts can take more than an hour to run on a corpus of 10K documents, plus another hour or more with MALLET and R; it really grinds down a person conditioned to expect instant results in today’s attention economy.