The Awakening of My Interest in Annular Systems

Fri, Feb 22, 2013

I’ve been thinking a lot recently about a simple question: can machine learning detect patterns of disciplinary change that are at odds with received understanding? The forms of machine learning that I’ve been using to try to test this (LDA and the dynamic LDA variant) do a very good job of picking up the patterns that you would expect to find in, say, a large corpus of literary journals. The model I built of several theoretically oriented journals in JSTOR, for example, shows much the same trends that anyone familiar with the broad contours of literary theory would expect to find. The relative absence of historicism as a topic of self-reflective inquiry is also explainable by the journals represented and historicism’s comparatively low incidence of keywords and rote-citations.

I’ve heard from people on Twitter that it’s a widely held belief that machine-learning techniques (and, by extension, all quantitative methods) can only tell us what we already know about the texts. I admit some initial skepticism about the prevalence of this claim, but I’ve now seen more evidence of it in the wild, so to speak, and I think I understand where some of this overly categorical skepticism comes from. One test of the validity of topic modeling, for example, is whether it produces a coherent model of a well-known corpus. If it does, then it is likely to do the same for an unknown or unread group of texts. The scholarly literature from JSTOR that I have modeled is, I can see, thought by some of the people who’ve seen the models to be a well-understood corpus. If the models reflect the general topics and trends that people know from their knowledge of the field, then that’s great as far as it goes, but we’ll have to reserve judgment on the great unread.

One issue here is that I don’t think the disciplinary history of any field is well understood. Topic modeling’s disinterested aggregations have the potential to show an unrecognized formation or the persistence of a trend long-thought dormant. Clancy found some clustering of articles in rhetoric journals associated with a topic that she initially would have labeled as “expressivist” from several decades before she would expect. Part of this has to do with the eclectic nature of what’s published in College English, of course, and part has to do with the parallels between creative writing and expressivist pedagogy. But it’s the type of specific connection that someone following established histories is not likely to find.

Ben Schmidt noted that topic modeling was designed and marketed, to some degree, as a replacement for keyword search. Schmidt is more skeptical than I am of the usefulness of this higher level of abstraction for general scholarly research. I know enough about anthropology to have my eyebrows raised by this Nicholas Wade essay on Napoleon Chagnon, for example, and I still find this browser of American Anthropologist to be a quicker way of finding articles than JSTOR’s interface. I created this browser to compare with the folklore browser* of the corpus that John Laudun and I have been working with. We wanted to see if topic models would reflect our intuition that the cultural/linguistic turn in anthropology and folklore diffused through their respective disciplines’ scholarly journals (the folklore corpus contains the journal most analogous to American Anthropologist, The Journal of American Folklore, but it also has other folklore journals as well) at the expected time (earlier in anthropology than folklore).

A very promising approach, to my mind, is to correlate topic models of journals with networks of citations. I’ve done enough network graphs of scholarly citations to know that, unless you heavily prune and categorize the citations, the results are going to be hard to visualize in any meaningful way. (One of the first network graphs I created, of all of the citations in thirty years of JAF, required zooming in to something like 1000x magnification to make out individual nodes. I’m far from an expert at creating efficient network visualizations, needless to say.) JSTOR once provided citation data through its Data for Research interface; as far as I know, it no longer does. This has been somewhat frustrating.

If we had citation data, we could take two topics that both seem reflective of a general cultural/linguistic/poststructuralist influence, such as this folklore topic and this anthropological one, and compare their citation networks to see if the concomitant rise in proportion was reflected in references to shared sources. (Levi-Strauss, for example, I know to be one of the most cited authors in the folklore corpus.) I would also like to explore the method described in this paper that uses a related form of posterior inference to discover the most influential documents in a corpus.**
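To make the shared-sources comparison concrete, here is a minimal sketch of one way it might be computed if citation data were available. Everything in it is an assumption for illustration: the article ids, the citation lists, and the function names `cited_authors` and `shared_sources` are invented, and Jaccard overlap is only one of several plausible measures of how much two topics draw on the same sources.

```python
from collections import Counter

def cited_authors(citations, article_ids):
    """Count how often each author is cited across the given articles."""
    counts = Counter()
    for article_id in article_ids:
        counts.update(citations.get(article_id, []))
    return counts

def shared_sources(counts_a, counts_b):
    """Return the authors cited in both sets and the Jaccard overlap."""
    authors_a, authors_b = set(counts_a), set(counts_b)
    shared = authors_a & authors_b
    union = authors_a | authors_b
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard

# Invented toy data: article id -> authors cited (not real citation records).
citations = {
    "jaf-1": ["Levi-Strauss", "Propp"],
    "jaf-2": ["Levi-Strauss", "Dundes"],
    "aa-1": ["Levi-Strauss", "Geertz"],
    "aa-2": ["Geertz", "Turner"],
}

# Hypothetical: the articles most strongly associated with each topic.
folklore_counts = cited_authors(citations, ["jaf-1", "jaf-2"])
anthropology_counts = cited_authors(citations, ["aa-1", "aa-2"])

names, overlap = shared_sources(folklore_counts, anthropology_counts)
print(names, overlap)
```

In practice the article lists would come from the topic model (the articles with the highest proportion of each topic), and the overlap could be tracked by decade to see whether it rises along with the topics themselves.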

This type of comparative exploration, while presenting an interesting technical challenge to implement (to me, that is, and I fully recognize the incommensurable gulf between using these algorithms and creating and refining them) can’t (yet) be mistaken for discovery. You can’t go from this to an a priori proof of non-discovery, however. Maybe no one is actually arguing this position, and I’m fabricating this straw argument out of supercilious tweets and decontextualized and half-remembered blog posts.

A more serious personal intellectual problem for me is that I find the dispute between Peter Norvig and Noam Chomsky to be either a case of mutual misunderstanding or one where Chomsky has by far the more persuasive case. If I’m being consistent then, I’d have to reject at least some of the methodological premises behind topic-modeling and related techniques. Perhaps “practical value” and “exploration/discovery” can share a peaceful co-existence.

*These browsers work by showing an index page with the first four words of each topic. You can then click on any one of the topics to see the full list of words associated with it, together with a list of articles sorted by how strongly they represent that topic. Clicking then on an individual article takes you to a page that shows the other topics most associated with that article, also clickable, and a link to the JSTOR page of the article itself.
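The three views described in this note can be sketched roughly as follows. This is not the actual browser-creating script, only an illustration of the logic under stated assumptions: the function names, the toy topic weights, and the toy article proportions are all invented, and a real version would read these values from the topic model's output files.

```python
from typing import Dict, List, Tuple

def topic_label(word_weights: Dict[str, float], n: int = 4) -> str:
    """Index-page label: the n highest-weighted words of a topic."""
    top = sorted(word_weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    return " ".join(word for word, _ in top)

def articles_for_topic(doc_topics: Dict[str, List[float]],
                       topic: int) -> List[Tuple[str, float]]:
    """Topic page: articles sorted by how strongly they represent the topic."""
    pairs = ((doc, props[topic]) for doc, props in doc_topics.items())
    return sorted(pairs, key=lambda p: (-p[1], p[0]))

def topics_for_article(props: List[float], k: int = 3) -> List[int]:
    """Article page: indices of the k topics most associated with an article."""
    return sorted(range(len(props)), key=lambda t: (-props[t], t))[:k]

# Invented toy model: two topics and three articles.
topics = [
    {"myth": 0.09, "structure": 0.07, "narrative": 0.05,
     "symbol": 0.04, "ritual": 0.02},
    {"field": 0.08, "kinship": 0.06, "village": 0.05,
     "household": 0.03, "land": 0.01},
]
doc_topics = {
    "article-a": [0.7, 0.3],
    "article-b": [0.2, 0.8],
    "article-c": [0.55, 0.45],
}

index_page = [topic_label(t) for t in topics]
topic_page = articles_for_topic(doc_topics, 0)
article_page = topics_for_article(doc_topics["article-b"], k=1)
print(index_page, topic_page, article_page)
```

Each function corresponds to one page type, so generating the browser amounts to looping over topics and articles and writing the results out as linked pages.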

**The note about the model taking more than ten hours to run fills me with foreboding, however. My (doubtlessly inefficient) browser-creating scripts can take more than an hour to run on a corpus of 10K documents, combined with another hour or more with MALLET and R. It really grinds down a person conditioned to expect instant results in today’s attention economy.

Comments

There were some useful comments on this post, which I’ve reproduced below:

Brian Lennon 2/22/2013 at 12:42 AM

I don’t know if this will clarify anything; I hope it will.

For me, your project here, with the category of projects it identifies itself with, is bounded by what I would call “the great unwritten” as the non-scale, so to speak, that the scale of the “great unread” merely limits.

I won’t elaborate that point here, since if the point is worth pursuing, it can be pursued all through my last book. But for me, the (1) stories scholars tell themselves about disciplinary history and (2) whatever versions of those stories might be revealed by machine learning are never going to be that different, even where they do differ, only because they are working within the same frame — the frame of an archive, a corpus, or [insert other term from other lexicon of one’s choice, here]: a body of evidence that is, in fact, always there to be reviewed (only more so with digitization and its consolidations), and which serves for me, at least, not only (not even principally) as evidence of our activities as creators of intellectual history, but as non-evidence of what and who we have excluded, and often enough completely destroyed, in creating that history. That violence is the non-object of my own interest; and that interest is the basis for my own, certainly tendentially harsh assessment of many projects of this type, which I think are far too content to take the world as they find it, despite advertising something new.

Something like that, perhaps, is one version of what others might be trying to say, more (or less!) crudely, when they reject either “disinterest” or “surprise,” in a disaggregation of aggregated evidence. It’s true that both tweeting and blogging encourage rushed, reactive thinking and expression, on all sides of any debate — but I think many (if not all) of my colleagues at large who are pursuing these kinds of projects are really failing to grapple with the range of sources and motives for the criticism sometimes directed at them, just as many (if not all) of those offering such criticism aren’t grappling with either the intellectual motives or the mathematical and more literally technical domains attached to such projects.

Jonathan 2/22/2013 at 9:53 AM

One element of the “disinterested aggregation” that I find interesting is the potential disconnect it would reveal between what are essentially oral traditions of disciplinary history—those passed down in seminars, repeated in conference papers, and then finding themselves codified in articles and books—and an insensate algorithm’s classification of that same disciplinary history.

Your point is that the texts themselves—what the algorithm has to work with—contain purposeful and violent omissions. The point is indisputable. But, over a long enough period of time, many ideas once repressed will emerge, often in direct proportion to the violence of their repression. I haven’t yet read your book, so I don’t know the specific examples you’re working with there (or the details of your engagement with Moretti), but I look forward to it.

Brian Lennon 2/22/2013 at 11:45 AM

Thanks, Jonathan. Some further (equally compressed) thoughts here: http://www.personal.psu.edu/bul5/chronodocket/2013-archive/2013-archive-notes/20130221-N-TheGreatUnwritten.html

Jonathan 2/23/2013 at 10:32 AM

I found your comparison of historicism and text-mining to be very provocative. Some of my work has been historicist in orientation, and I’ve also done (and enjoyed doing) archival research in support of it. Nothing could be more counterintuitive to the proponents of vinegar-smells and marginal doodles than your proposed equivalence, which I guess is its point.

I agree that there’s some symbolic violence in the angry peer review. Rejecting a paper for failing to cite the reviewer’s work (while never overtly acknowledging this) is common enough to have long passed into the profession’s lore, and it’s a practice that encapsulates much of what is absurd and infantilizing about academic work. It’s one of the main reasons I take seriously the utopian claims about open-access and alternative forms of scholarly publication, even as they would introduce (and have already) another form of violence-by-exclusion.

I understand your point to be that the actual destruction of archives, combined with the small percentage of surviving literature in the global context, means that any attempt to use ‘distant reading’ techniques on world literature must take into account the inherent bias of the sample. And that attempts up to this point haven’t adequately done so? I haven’t read your book, as I mentioned, so I’m just trying to reconstruct what the argument would be based on your comment. (And I can see that it would engage with Moretti, Casanova, etc.)

But I admit with this work on scholarly articles to have a much more limited focus and understanding of the discipline. An association journal such as PMLA, American Anthropologist, or the Journal of American Folklore does provide a good overview of how a discipline constitutes itself in one national context (the most blatant example I’ve ever seen of “this isn’t good because it doesn’t cite my [irrelevant] scholarship” came from one of those journals, so I do take your point about what’s excluded.) These might not be the ideal corpora to study, but they are nonetheless interesting.

And, finally, re Wallace, his archive may not be the most popular thing at the Harry Ransom Center, but, from what I’ve witnessed, it has to be close. I suspect this confirms the point you’re making in some way; but Wallace (and Joyce), along with other writers who support an industry, do so for a reason. Is it a contingent reason? What larger purposes does it serve? Is it sinister? Aren’t these historicist questions?

Brian Lennon 2/23/2013 at 11:44 AM

It’s a briefer answer than the question deserves, but I’d say there’s just no accounting for (if that means integrating, compensating for, or eliminating, or otherwise successfully rejecting criticism of) the bias of any sample. The bias is in the sampling, not the sample.

This is the predicament of all scholarship, computer-assisted or not — scholarship being that which takes its objects as given and addresses itself to them, in a division of labor that in U.S. literary and cultural studies, for example, separates the scholar from the “creative writer,” who provides the given object, and also from the “critic,” who occupies an intermediate position, addressing the object but trying to do so in real time, often rejecting the empirically minded historicism of the scholar, often explicitly engaging the political question of who gets to write for publication and who does not, which is a question that any scholar must eventually let go.

As for limiting one’s focus. We’re all entitled to both our finitude and the recognition of that finitude; in personal life, I think that is health itself. In work life (intellectual life), on the other hand, I think limiting one’s goals is less healthy. Our activities as knowledge workers are massively overdetermined, and the truth is that we all have the energy to spare to investigate that.