Some Notes on the MLA Job Information List

I don’t remember exactly when the MLA digitized all of the issues of the Job Information List, but I was excited about what these documents could tell us about institutional history, the job market, salary trends, and many other things. The PDFs hosted by MLA are image scans, however, which are not immediately searchable as plain text. A variety of OCR solutions are available, but I personally was too lazy to attempt to use any of them.

Jim Ridolfo, not suffering from this despicable sloth, used OCRkit to create searchable versions of the JIL. He then generously made them available. Once you’ve extracted the files from the tar archive, there are several ways to work with them: you can search them with your machine’s built-in indexer (I’ve only tried this with the Mac OS); you can convert them to text with pdftotext or a similar tool and then use the regular command-line utilities; or you can skip conversion and use grep with the appropriate flags for binary files (I found this too annoying to deal with personally). Note that converting each PDF with pdftotext requires find, xargs, or a simple shell script, as globbing of the form pdftotext *.pdf will not work.
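The simple script mentioned above can be a few lines of Python rather than shell; the directory name below is a placeholder, and you’ll need poppler’s pdftotext on your PATH. (find and xargs will do the same job in one line.)

```python
# A minimal sketch of the batch conversion: pdftotext takes one file at a
# time, so loop over the PDFs instead of globbing. The directory name is
# hypothetical; point the glob at wherever you extracted the tar archive.
import glob
import os
import subprocess

def conversion_commands(pdf_dir):
    """Build one pdftotext command per PDF, writing a .txt file alongside each."""
    commands = []
    for pdf in sorted(glob.glob(os.path.join(pdf_dir, "*.pdf"))):
        txt = pdf[:-len(".pdf")] + ".txt"
        commands.append(["pdftotext", pdf, txt])
    return commands

if __name__ == "__main__":
    for cmd in conversion_commands("jil-pdfs"):
        subprocess.run(cmd, check=True)  # requires poppler's pdftotext on PATH
```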

Following Ridolfo’s example, I first searched for the earliest occurrences of various terms. The first mention of “post-colonial,” for example, was in an ad from Williams in 1983; the first of “science fiction” was in Clemson’s ad from 1974. I also discovered some brazen evidence of the gender dynamics of the profession from the mid-to-late 60s:

[Screenshot of the ad]

And another disturbing thing about that ad is the apparent evidence that salaries have not kept pace with inflation. The Bureau of Labor Statistics’ inflation calculator reveals $15.5K then to be more than $98K today, which I suspect exceeds an Asst./Assoc. salary at the same institution now.

[UPDATE: Jack Lynch on facebook (one of the few facebook comments I was able to see, so I apologize if someone else pointed this out as well) noted that the St. Cloud State ad actually lists this as a range for Full Professor, which is a reasonable figure adjusted for inflation.]

[Screenshot]

Using a bit more automation, I used grep, sed, and uniq to create a csv of the frequency of search terms for each year, which I then imported into R and plotted. Here, for example, is a graph of the occurrences of the phrase “literary theory”:

[UPDATE: The figures are not normalized for the number of jobs (or words in the job ads), so keep that in mind. 2007 had many more ads than 2008, for example. Again, I saw that Jack Lynch pointed this out on facebook.]


Even though this is a rough count (OCR imperfections, overattention to verbose ads, and the fact that I counted the phrase itself rather than jobs specifically asking for it), I still think it is a useful measure of the institutionalization of the concept. I also charted “Shakespeare”:

[UPDATE: Here is a graph to compare with the one above on "Shakespeare" that is normalized for the percentage of total words in all of the ads for that year. This is not the ideal way of counting its relative frequency. It's quite possible that the OCR does a better job with the more modern typesetting, and I haven't investigated this thoroughly.]

[Screenshot of the graph]
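For the curious, here is a rough Python equivalent of the grep/sed/uniq counting step, with the normalization the updates call for folded in. The one-text-file-per-year layout and filenames are assumptions for illustration, not my actual setup:

```python
# A rough Python equivalent of the grep | sed | uniq counting step, plus
# normalization by each year's total word count. One text file per year
# (e.g. "1974.txt") is an assumed layout.
import glob
import os
import re

def phrase_counts(text_dir, phrase):
    """Return {year: (phrase_count, total_words)} for each YYYY.txt file."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    counts = {}
    for path in sorted(glob.glob(os.path.join(text_dir, "[0-9][0-9][0-9][0-9].txt"))):
        year = os.path.basename(path)[:4]
        with open(path, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        counts[year] = (len(pattern.findall(text)), len(text.split()))
    return counts

def normalized(counts, per=1_000_000):
    """Convert raw counts to occurrences per `per` words."""
    return {year: (n / total) * per if total else 0.0
            for year, (n, total) in counts.items()}
```

Writing the year/count pairs out as CSV and reading them into R with read.csv reproduces the plotting step.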



“Post-Colonial” and “Postcolonial”




These graphs are just a simple and preliminary indication of what can be done with this data. With more attention, one could build a queryable database that generates custom graphs of changes in sub-fields over time. Slurping all of the salary ranges out of these ads and charting their growth (or lack thereof) relative to inflation could give us more insight into the economic realities of the job market.

The Distribution of PhDs in Modernism/modernity

Modernism/modernity is an important and relatively new journal (1994-) that publishes interdisciplinary work in modernist studies. Though I’ve never submitted an article to it (I did publish a book review there), I’ve long heard that it is very difficult to publish in. The last time I checked, the journal did not submit acceptance statistics to the MLA Directory of Periodicals (these statistics make for interesting reading if you’ve never looked at them, by the way).

I thought it would be interesting and sociologically revealing to chart the PhD institutions of the journal’s authors. I decided to check only those who had published research articles—not review-essays, shorter symposium pieces, questionnaires, or book reviews. There were 358 unique authors, along with twenty-two whose PhD institutions I could not track down. (These were mostly UK-based academics. Many UK departments do not list the PhD institution of their academics, while virtually all US ones do.) There were also approximately ten authors who had published more than one article there (six times was the record).

I had hoped that there would be a way to automate this tedious procedure, involving a web crawler and perhaps automatically querying the Proquest dissertation database; but it quickly became evident that any automation I was capable of devising would be error-ridden and require as much time to check as doing it all by hand. Out of resignation, stubbornness, and a deeply misplaced set of priorities, I checked them all by hand and graphed the results:


The above image displays all institutions with at least three authors. Those schools with two authors each were: Boston U, Brandeis, Essex, Exeter, London, Manchester, McGill, McMaster, Monash, Oregon, Sheffield, SUNY-Buffalo, Temple, Texas, Tufts, Uppsala, and WUSTL. There were 49 universities with one author each.

From a statistical perspective, I don’t think there’s anything too unusual about this distribution. It is more tightly clustered among elite institutions than I would have guessed beforehand, however. Clancy told me that this project seemed like a public shaming, which is not my intention at all. I do think that a comparison with another literary journal that publishes in modernist studies would reveal a broader distribution, but that might be as easily explained by Modernism/modernity’s interdisciplinary focus as by its elitism.

UPDATE (1/3/14):

I had a suggestion to divide the data chronologically. Here is the first half (minimum=2):
[Screenshot of the graph]

And the second:
[Screenshot of the graph]

Decluttering Network Graphs

A problem that many of the co-citation graphs I discussed in the last post share is that they are too dense to be easily readable. I created the sliders as a way of alleviating this problem, but some of the data sets are too dense at any citation-threshold. Being able to view only one of the communities at a time seemed like a plausible solution, but I was far from sure how to implement it using d3.js. Solutions that involved pre-processing the data the way that I did for the sliders didn’t seem to be very useful for this problem.

I realized two months ago that I wouldn’t have time to learn d3.js (and javascript in general) well enough to solve this problem this semester, as I’m working very hard on teaching, research, and service. A few moments of lazy, idle scheming today, however, led me to this intriguing gist. The hard work of identifying centroid nodes of different communities and only generating their force-directed graphs when selected has already been done here. I wanted to add text labels to the nodes, to make them more like my other graphs. (The tooltip mouseovers just don’t seem information-rich enough for this purpose, though the graphs are certainly tidier without visible labels.)

As I should have anticipated, making this minor adjustment was far from easy. I eventually realized that I had to change the selection from the DOM element “circle.node” to the general grouping element “g”. With a few other tweaks to the force-layout settings, I tested it out with one citation graph that wasn’t too cluttered (compare here). The graph that suffers most from illegibility is the philosophy of science one (see also here for an earlier attempt to make it more legible by adding a chronological slider).

Despite my best efforts, these floating balloons of philosophy of science aren’t a great improvement. Labeling the initial beach balls with the centroid node is probably a good idea, along with just some explanatory text. I do think a similar approach is the way forward with this particular technology for visualizing citation graphs. D3.js is very flexible and powerful, particularly in its ability to create and record illustrative animations. I hope to be able to do some more detailed work with it on citation graphs after the semester ends.

Some Thoughts on Twitter

Ted Underwood made the following comment on Scott Weingart’s post about a recent controversy with the Journal of Digital Humanities:

I can also imagine framing the issue, for instance, as a question about the way power tends to be exercised in a one-to-many social medium. I don’t know many academic fields that rely on Twitter as heavily as DH does. It certainly has as much power in the field as JDH (which, frankly, is not a high-profile journal). Right now digital humanists seem to be dividing into camps of Beliebers and nonBeliebers, and I’m less inclined to blame any of the people involved — or any structure of ideas, or system of peer review — than I am to suspect that the logic of Twitter itself encourages the formation of “teams.”

I like twitter more than facebook, probably because I choose to interact with people there with whom I share more intellectual interests. The follower/following relation leads to all sorts of status-seeking and rank-anxiety behavior, however. While it doesn’t surprise me that PR workers and journalists would buy twitter followers (available for about $1 per 1K fake accounts, apparently), I’m reasonably sure that I’ve seen academics do so. Choosing not to reciprocate a follow request is reasonable for any number of reasons (attention economy, missing a notification), but the spurned follower is not privy to any of this decision-making and may very well feel jilted and resentful. Next comes the inevitable embarrassment of even noticing such a triviality, and this can cascade into what Wilhelm Fliess referred to as a “shame spiral.”

The logic of twitter compels its users to pay attention not only to their number of followers but also to the ratio of their followers to followings. Celebrity accounts (the standard of measure) are generally on the order of 1M to 1, so this is what the demotic tweeter aspires to. Influence algorithms such as Klout use this ratio as one way to assess the importance of an account, and I suspect it is also used by automated discovery services such as (named with impeccable timing) Prismatic. Furthermore, links on twitter seem to have a clickthrough rate of about 1% (so I’ve observed personally), and I suspect this percentage decreases as an account’s followers increase. For a link to spread efficiently, it has to be retweeted by many people, and the more followers an account has, the more likely something posted from it will be retweeted. Underwood’s comment above references “Beliebers,” and anything that Justin Bieber (I’m not entirely sure who that is, perhaps a young soccer player or innovative badmintonist, but he has many followers on twitter) posts, no matter how trivial, will get many retweets.

What is to be done? Community formation on twitter seems like a fascinating area of research. The groupuscule of quasi-surrealists sometimes known as ‘weird twitter’ is apparently already the subject of a dissertation or two, and I could imagine very interesting work being done on the digital humanities community on twitter: network interactions, status effects, and the etiology and epidemiology of controversies…all sorts of wonders. I would be inclined to try some of this myself, but I find the twitter API somewhat cumbersome to use, and the amount of data involved is overwhelming.

The End of Breaking Bad

I wrote a couple of Breaking Bad commentaries last year after the end of the first part of the fifth season. There are now only four episodes left, and I’m not entirely sure if we’ll see anything else about Gustavo Fring’s past. I can see how the Lydia-plot could have a flashback with Fring, but I don’t see how it could get all the way back to Chile. And that’s a shame if true, because I think there are some really useful political comparisons to be made between Walter White’s and Fring’s respective formative circumstances and economic policies.

Predicting the plot of a show that relies so strongly on flouting the probable is foolish, I suppose, but I would guess that the final four episodes will show Jesse attempting to lure White back into the meth production business. From the flash-forwards, we can see that his identity is known to the community at large, and that he also presumably availed himself of the esoteric vacuum-repairman’s new identity. I would guess that Hank allows Jesse to start cooking, or at least pretend to cook, Walter’s recipe again. Walter’s pride tempts him into an incriminating response, and he narrowly escapes arrest.

The problem with this scenario is that it doesn’t address the dynamic among Todd, his family, and Lydia. Walter has already solicited their help in what is strongly hinted to be Jesse’s murder, but he doesn’t yet know about the industrial unrest caused by Declan’s sub-par production facilities and staff (Declan reminded me unpleasantly of Don Henley, but I’m aware that may be a wholly personal and private issue). Todd’s white supremacist relatives seem to want to establish a meth empire of their own, and Todd would seem to welcome Jesse, who knows Walter’s process much better than Todd ever learned it, as a capable and diligent partner.

Normally I wouldn’t imagine that Hank would go along with such a plan, especially now that Gomez knows about everything, but it’s clear to me that the early episodes have positioned him as an increasingly desperate figure willing to do anything to get revenge against Walter. So, to summarize: Hank allows Jesse to contact Todd. Jesse learns of the fortuitous slaughter and offers his services. The gang neglects to kill Jesse after seeing his newfound value. Walter dislikes Jesse’s initiative and attempts to intervene. Something happens, and Walter is forced into exile without being arrested or murdered. Perhaps Hank or Marie then talks to the media, and Walter eventually comes back for revenge (most likely against the meth operation in some form or another).

The only issue that I can’t quite resolve here is Lydia’s motivation. She was willing to murder all of Mike’s staff last season, but only because they were a threat to her. She’s now motivated by greed alone, it would seem, which is perhaps unlikely for someone with her nervous disposition (and obvious accumulated wealth). Perhaps the next episode will explain her circumstances in greater detail. (Maybe Gomez was on Fring’s—and now Lydia’s—payroll the whole time, as I recall many people proposing…)

Citations to Women in Theory

After reading Kieran Healy’s latest post about women and citation patterns in philosophy, I wanted to revisit the co-citation graph I had made of five journals in literary and cultural theory. As I noted, one of these journals is Signs, which is devoted specifically to feminist theory. I didn’t think that its presence would skew the results too much, but I wanted to test it. Here are the top thirty citations in those five journals:

Butler J 1990 117
Jameson F 1981 90
Butler J 1993 72
Lacan J 1977 71
Derrida J 1978 64
Foucault M 1977 61
Chodorow Nancy 1978 60
Gilligan C 1982 60
Fish Stanley 1980 57
Foucault M 1978 56
Spivak G C 1988 54
Bhabha H K 1994 54
Derrida Jacques 1976 53
Benjamin W Illuminations 53
Foucault M 1980 52
Althusser L 1971 51
Said Edward W 1978 51
DE Man P 1979 50
Foucault M 1979 49
Laclau Enesto 1985 48
Hardt M 2000 48
Zizek Slavoj 1989 47
Derrida Jacques 1994 46
Benjamin Walter 1969 45
Lyotard J-f 1984 44
Foucault Michel 1980 44
Anderson B 1983 44
Williams Raymond 1977 42
Frye Northrop 1957 41
Fuss D 1989 40
Irigaray L 1985 40

There are eight women (I’m counting Chantal Mouffe) in the top thirty, and Judith Butler is the most-cited author. To test my intuition that literary theory journals cite female authors more than analytic philosophy journals do, I decided to replace Signs with College Literature. (Here is the co-citation network. Again, these work best with Safari and Chrome.)

Here are the top thirty most cited authors in that corpus:

Jameson F 1981 100
Lacan J 1977 75
Fish Stanley 1980 66
Derrida J 1978 65
Bhabha H K 1994 60
Benjamin W Illuminations 59
Butler J 1990 57
Derrida Jacques 1976 57
Althusser L 1971 56
Bakhtin M M 1981 56
Foucault M 1977 56
DE Man P 1979 52
Lyotard J-f 1984 49
Zizek Slavoj 1989 48
Frye Northrop 1957 48
Derrida Jacques 1994 48
Foucault M 1979 48
Benjamin Walter 1969 48
Hardt M 2000 46
Anderson B 1983 44
Laclau Enesto 1985 43
Marx K Capital 43
Said Edward W 1978 42
Gilroy P 1993 41
Barthes Roland 1977 41
Williams Raymond 1977 40
Freud S Interpretation Dream 40
Jameson Fredric 1991 40
Culler Jonathan 1975 40
Bass Alan 1982 40
Derrida J 1981 39

Butler and Mouffe (whose name doesn’t appear because of the way the citation data is formatted) are the only women in the top thirty (unless I missed something!).

I don’t want to draw any major conclusions from this data, but I’m a bit surprised. Neither of these citation corpora has been cleaned up as much as Healy’s, for instance, and the choice of journals clearly affects the outcome. The journals I chose were ones that I happened to think might be representative of literary theory and that also happened to be in the Web of Science database; many obvious candidates were not.

Citational Network Graph of Literary Theory Journals

I’ve been interested in humanities citation analysis for some time now, though I had been somewhat frustrated in that work by JSTOR pulling its citation data from its DfR portal a year or so ago. It was only a day or two ago with Kieran Healy’s fascinating post on philosophy citation networks that I noticed that the Web of Science database has this information in a relatively accessible format. Healy used Neal Caren’s work on sociology journals as a model. Caren generously supplied his python code in that post, and it’s relatively straightforward to set up and use yourself.*

My first experiments with Caren’s method were on the Journal of American Folklore, as a meta-analysis of that journal is the subject of an article that John Laudun and I have coming out in a few months, and John has been interested in folklore’s citation patterns for some time now. Here is the network graph** of the co-citations in that journal from 1973 to the present. (Web of Science’s data generally begins around this time; JSTOR’s did not, though my impression is that the WoS data is a bit cleaner.) Co-citation analysis and the community-detection algorithm produce much better results than my earlier efforts at citational network analysis. (Healy’s post does a very good job of explaining what co-citation is and why it’s a useful way of constructing the network relations.) I then built two models of PMLA: sparse and larger. Even the sparse graph used only half the threshold of Caren’s original code, which worked on several journals rather than just one. So I decided that I needed more data to get better results.
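Co-citation counting itself is simple enough to sketch: two works are linked whenever a single article cites them both, and a threshold keeps only the pairs that recur. This is the idea rather than Caren’s actual code, and the input format (a list of per-article reference lists) is an assumption:

```python
# A sketch of co-citation counting: each article contributes one edge per
# pair of works in its reference list, and pairs cited together often
# enough survive the threshold.
from collections import Counter
from itertools import combinations

def cocitation_edges(reference_lists, threshold=3):
    """Return {(work_a, work_b): count} for pairs co-cited >= threshold times."""
    edges = Counter()
    for refs in reference_lists:
        for a, b in combinations(sorted(set(refs)), 2):
            edges[(a, b)] += 1
    return {pair: n for pair, n in edges.items() if n >= threshold}
```

The surviving pairs become the weighted edges of the network, and community detection runs on that graph.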

Several months ago I built a topic model of six journals oriented towards literary theory. Correlating that topic model with the journals’ citation network is something I’ve been interested in for some time, and the first step would be actually building the citation network of those journals. Unfortunately, boundary 2 and Social Text are not in the Web of Science database. To compensate, I added the journal of feminist theory Signs, which I had also topic-modeled, though the results will not be directly comparable to the theory-corpus topic model.

This corpus ended up being larger than Healy’s or Caren’s, so I had to adjust the threshold up to 11 to make it manageable. A drawback of D3.js is that it’s very processor-intensive. Here is an image of the network of the five journals:


And here is the draggable network graph. The central nodes identified by the algorithm are Judith Butler’s Gender Trouble (1990) [red]; Gayatri Spivak’s “Can the Subaltern Speak?” (1988) and Edward Said’s Orientalism (1978) [light orange]; Jacques Derrida’s Writing and Difference (1978) and Positions (1981) [light purple]; Michel Foucault’s The Archaeology of Knowledge (1972) and Stanley Fish’s Is There a Text in This Class? [blue]; Fredric Jameson’s The Political Unconscious (1981), plus Althusser’s Lenin and Philosophy (1971) [salmon pink]; Carol Gilligan’s In a Different Voice (1982) and Nancy Chodorow’s The Reproduction of Mothering (1978) [orange]; Pierre Bourdieu’s Distinction (1984), Michael Hardt and Antonio Negri’s Empire (2000), and Giorgio Agamben’s State of Exception (2005) [purple]; and Jacques Lacan’s Ecrits (1977) [brown]. There is also a green Paul de Man node. Outliers include Hegel, Caruth, Clifford, Cavell, Wordsworth and Coleridge, and an interesting Latour-Bakhtin-Shapin nexus.

I would have liked to have explored this graph in D3 with a lower threshold, but my machine doesn’t have the processing power to handle that many nodes. I have been very happy using gephi in the past, but a java update seemed to make it stop working on my system. More interesting and perhaps unexpected results would appear at lower thresholds, I suspect, but I’m going to have to use another tool to visualize them. The results at this threshold meet my standard of face validity about the history of literary theory since the early 70s, though others might hold different opinions (it’s a contentious subject!).

UPDATE (6/23/13): I made a version of the dynamic graph that allows you to adjust the citation-threshold. There are also versions of a modernist-studies journals citation graph and one for journals in rhetoric and composition. And here is a post explaining the technical details.

*It relies on a couple of modules that are not installed by default on most people’s machines, I believe. First you need to clone Drew Conway’s fork (at the command line, git clone git:// will do it). Then you need to download this implementation of the Louvain network community-detection algorithm. All of these files need to be in the same directory as Caren’s script. I was unable to install the networkx fork on my Mac OS machine with pip, easy_install, or anything else, but the local import worked fine. Once you have set this up, you’ll need to modify the filename in the script by hand to point to your results. You can also change the thresholds by editing the constants in this line: if edge_dict[edge]>3 and cite_dict[edge[0]]>=8 and cite_dict[edge[1]]>=8 :. Web of Science will only allow you to download 500 records at a time; you can either write a script to concatenate your results or do it by hand.
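The concatenation script can be very short; something along these lines should work, assuming the tab-delimited export in which every downloaded file begins with the same single header row (check your own export before trusting this):

```python
# A sketch of concatenating the 500-record Web of Science downloads,
# assuming each file starts with the same single header row: keep the
# first header and skip the repeats.
def concatenate_records(file_contents):
    """file_contents: list of strings (one per download); returns merged text."""
    merged = []
    for i, text in enumerate(file_contents):
        lines = text.splitlines()
        merged.extend(lines if i == 0 else lines[1:])  # drop repeated headers
    return "\n".join(merged) + "\n"
```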

**All of these graphs use the D3.js library, which is very well designed and aesthetically pleasing. It renders very slowly on Firefox, however. Chrome and Safari give a much better viewing experience. (I have no idea about Internet Explorer.)

Interpreting Topics in Law and Economics

Of the many interesting things in Matthew Jockers’s Macroanalysis, I was most intrigued by his discussion of interpreting the topics in topic models. Interpretation is what literary scholars are trained for and tend to excel at, and I’m somewhat skeptical of the notion of an “uninterpretable” topic. I prefer to think of it as a topic that hasn’t yet met its match, hermeneutically speaking. In my experience building topic models of scholarly journals, I have found clear examples of lumping and splitting—terms that are either separated from their natural place or agglomerated into an unhappy mass. The ‘right’ number of topics for a given corpus is generally the one which has the lowest visible proportion of lumped and split topics. But there are other issues in topic-interpretation that can’t easily be resolved this way.

A problem I’ve found in modeling scholarship is that “evidence/argument words” are always highly represented in any given corpus. If you use hyperparameter optimization, which weights topics according to their relative proportion in the corpus, words like “fact evidence argue make” tend to compose the most representative topics. The options include simply eliminating the topic from the browser, which seems to eliminate a large number of documents that would otherwise be classified, or trying to add all of the evidence words to a stop list. The aggressive pursuit of stop-words degrades the model, though this observation is more of an intuition than anything I can now document.
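The stop-listing option amounts to nothing more than filtering tokens before the corpus is imported; the word list below is illustrative, not the stop list I would actually use:

```python
# A sketch of the stop-listing option: filter evidence/argument words out
# of a document's tokens before handing them to the modeler. The word
# set here is illustrative only.
EVIDENCE_WORDS = {"fact", "evidence", "argue", "make", "case", "view"}

def apply_stop_list(tokens, stop_words=EVIDENCE_WORDS):
    """Drop stop-listed tokens, case-insensitively."""
    return [t for t in tokens if t.lower() not in stop_words]
```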

I thought it might be helpful to others who are interested in working with topic models to create several models of the same corpus and look at the effects created by small changes in the parameters (number of topics, lemmatization of corpus, and stop-words). The journal that I chose to use for this example is the Journal of Law and Economics, for both its ideological interest and methodological consistency. The law-and-economics movement is about as far away from literary studies as it’s possible to be while still engaging in a type of discourse analysis, I think, and I find this contrast both amusing and potentially illuminating. That the field of law-and-economics is perhaps the most well-known (even infamous) example of quantified reasoning used in support of what many view as a distinct political agenda is what led me to choose it to begin to explore the potential critical usefulness of another quantitative method of textual analysis.

I began by downloading all of the research articles published in the journal from JSTOR’s Data for Research. There were 1281 articles. I then converted the word-frequency lists to bags-of-words and created a 70-topic model using MALLET.* The browsable model is here. The first topic is the most general of academic evidence/argument words: “made, make, case, part, view, difficult. . .” I was intrigued by the high-ranking presence of articles by Milton Friedman and R. H. Coase in this topic; it would be suggestive if highly cited or otherwise important articles were most strongly associated with the corpus’s “evidence” terms, but I can’t say that this is anything other than coincidence. The next topic shows the influence of the journal’s title: “law, economics, economic, system, problem, individual.” The duplication of the adjective and noun form of “economics” can be eliminated with stemming or lemmatizing the corpus, though it is not clear if this increases the overall clarity of the model. I noticed that articles “revisiting” topics such as “social cost” and “public goods” are prominent in this topic, which is perhaps explainable by an unusually high proportion of intra-journal citations. (I want to bemoan, for the thousandth time, the loss of JSTOR’s citation data from its API.)
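The bag-of-words conversion is just a matter of repeating each word by its count so that MALLET’s plain-text importer can read the result. A minimal sketch, assuming (word, count) pairs as input (the DfR export format itself varies):

```python
# One way to turn a word-frequency list into a bag of words that a
# plain-text importer can read: repeat each word count times. The
# (word, count) pair format is an assumption about the export.
def bag_of_words(word_counts):
    """word_counts: iterable of (word, count); returns one long string."""
    words = []
    for word, count in word_counts:
        words.extend([word] * int(count))
    return " ".join(words)
```

Word order is lost, but topic modeling ignores order anyway, so nothing the model uses is discarded.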

The next two topics are devoted to methodology. Econometric techniques dominate the content of the Journal of Law and Economics, so it’s no surprise that topics featuring those terms would be this widely distributed. Of the next three topics, one seems spuriously related to citations and the other two are also devoted to statistical methodology. It is only the eighth topic that is unambiguously associated with a recognizable subject in the journal: market efficiency. Is this apparent overemphasis on evidence/methodology a problem? And if so, what do you do about it? One approach would be to add many of the evidence-related words to a stop-list. Another would be to label all the topics and let the browser decide which are valuable. Here is a rough attempt at labeling the seventy-topic model.

The number of topics generated is the most obvious and effective parameter to adjust. Though I ended up labeling several of the topics the same way, I’m not sure that I would define those as split topics. The early evidence/methodology related topics do have slightly distinct frames of reference. The topics labeled “Pricing” also refer to different aspects of price theory, which I could have specified. The only obviously lumped-together topic was the final one, with its mixture of sex-worker and file-sharing economics. If there is evidence of both lumping and splitting, then simply adjusting the number of topics is unlikely to solve both problems.

An alternative to aggressive stop-wording is lemmatization. The Natural Language Toolkit has a lemmatizer that calls on the WordNet database. Implementation is simple in python, though slow to execute. A seventy-topic model generated with the lemmatized corpus has continuities with the non-lemmatized model. The browser shows that there are fewer evidence-related topics. Since the default stop-word list does not include the lemmatized forms “ha,” “doe,” “wa,” or “le,” the model aggregates those in topics that are more strongly representative than the similar topics in the non-lemmatized model. Comparing the labeled topics with the non-lemmatized model shows that there are many direct correspondences. The two insurance-related topics, for instance, have very similar lists of articles. The trend lines do not always match very well, which I believe is caused by the much higher weighting of the first “argument words” topic in the lemmatized corpus (plus some issues with the reliability of graphing these very small changes).
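The lemmatization step amounts to merging each document’s counts under their lemmas. In the sketch below the lemmatizer is passed in as a callable; in practice it would be NLTK’s WordNetLemmatizer().lemmatize, which is exactly what produces artifacts like “has” becoming “ha” (treated as a plural noun):

```python
# Merging a document's word counts under their lemmas. The lemmatizer is
# a callable so the merging logic is independent of NLTK; with NLTK's
# WordNetLemmatizer().lemmatize plugged in, forms like "has" -> "ha"
# appear because the default part of speech is noun.
from collections import Counter

def lemmatize_counts(word_counts, lemmatize):
    """word_counts: dict word -> count; returns dict lemma -> summed count."""
    merged = Counter()
    for word, count in word_counts.items():
        merged[lemmatize(word)] += count
    return dict(merged)
```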

Labeling is inherently subjective, and my adopted labels for the lemmatized corpus were both whimsical in places and influenced by the first labels that I had chosen. As I mentioned in my comments on Matthew Jockers’s Macroanalysis, computer scientists have developed automatic labeling techniques for topic models. While labor-intensive, doing it by hand forces you to consider each topic’s coherence and reliability in a way that might be easy to miss otherwise. The browser format that shows the articles most closely associated with each topic helps with labeling as well, I find. It might not be a bad idea for a topic model of journal articles to label each topic with the title of the article most closely associated with it; this technique would only mislead on deeply divided or clustered topics, or on those that have only one article strongly associated with them (a sign of too many topics, in my experience).

(UPDATE: My initial labeling of the tables below was in error because of an indexing error with the topic numbers. The correlations below make much more sense in terms of the topics’ relative weights, and I’m embarrassed that I didn’t notice the problem earlier.)

The topics were not strongly correlated with each other in either direction. In the non-lemmatized model, the only topics with a Pearson correlation above .4 were


The negative correlations below -.4 were


Ted Underwood and Andrew Goldstone’s PMLA topic-modeling post used network graphs to visualize their models and produce identifiable clusters. I suspect this particular model could be graphed in the same way, but the relatively low correlations between topics makes me a little leery of trying it. I generated a few network graphs for John Laudun’s and my folklore project, but we didn’t end up using them for the first article. They weren’t as snazzy as the Underwood and Goldstone graphs, as my gephi patience often runs very thin. (Gephi also has problems with the latest java update, as Ian Milligan pointed out to me on twitter. I intend to update this post before too long with a D3 network graph of the topic correlations.)
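The correlations in these tables come from treating each topic’s per-document weights as a variable and computing Pearson’s r pairwise. A plain-Python sketch (the matrix in the test is a toy, not my data):

```python
# A sketch of the topic-correlation step: Pearson's r computed pairwise
# over the columns of a document-by-topic weight matrix, keeping pairs
# whose correlation exceeds a threshold in either direction. Constant
# columns (zero variance) would divide by zero and are not handled here.
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlated_pairs(doc_topic, threshold=0.4):
    """doc_topic: list of rows (one per document); returns (i, j, r) pairs with |r| > threshold."""
    cols = list(zip(*doc_topic))  # one sequence of weights per topic
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = pearson(cols[i], cols[j])
            if abs(r) > threshold:
                pairs.append((i, j, r))
    return pairs
```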

[UPDATE: 5/16/13. After some efforts at understanding javascript's object syntax, I've made a clickable network graph of correlations between topics in the lemmatized browser: network graph. The darker the edge, the stronger the correlation.]

The most strongly correlated topics in the lemmatized corpus were


Here is a simple network graph of the positively correlated topics above .2 (thicker lines indicate stronger correlation):


My goal is to integrate a D3.js version of these network graphs into the browsers, so that the nodes link to the topics and the layout is adjustable. I haven’t yet learned the software well enough to do this, however. The simple graph above was made using the R igraph package. [UPDATE: See here for a simple D3.js browser.]
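The edge-building step behind a graph like the one above can be sketched independently of the plotting library: keep only topic pairs whose correlation exceeds the threshold, and scale edge width by correlation strength. The correlation matrix, labels, and width formula below are made-up illustrations; the actual drawing would be done in igraph, Gephi, or D3.

```python
# Hypothetical edge list for a correlation network (threshold and widths assumed).

def correlation_edges(corr, labels, threshold=0.2):
    """Return (label_i, label_j, width) for each pair correlated above threshold."""
    edges = []
    n = len(corr)
    for i in range(n):
        for j in range(i + 1, n):
            if corr[i][j] > threshold:
                width = 1 + 10 * corr[i][j]   # thicker line = stronger correlation
                edges.append((labels[i], labels[j], round(width, 1)))
    return edges

corr = [
    [1.00, 0.25, 0.05],
    [0.25, 1.00, 0.31],
    [0.05, 0.31, 1.00],
]
labels = ["MEMORY", "HISTORY", "FORM"]
print(correlation_edges(corr, labels))
# [('MEMORY', 'HISTORY', 3.5), ('HISTORY', 'FORM', 4.1)]
```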

And the negative correlations:


The fact that some topics appear at the top of both the negative and positive correlation lists in both models suggests to me that some artifact of the hyperparameter optimization process is responsible, in a way that I don’t quite grasp (though I am aware, sadly enough, that the explanation could be very simple). The .4 threshold I chose is arbitrary, and the correlations follow a consistent and smooth pattern in both models. The related-articles section of these browsers is based on Kullback-Leibler divergence, a metric apparently more useful for this purpose than Manhattan distance. The articles listed under each topic seem to me much more likely to be related to one another than those grouped by any metric I’ve used to compare the overall weighting of topics.

Another way of assessing the models and label-interpretations is to check where they place highly cited articles. According to google scholar, the most highly cited article** in Journal of Law and Economics is Fama and Jensen’s “Separation of Ownership and Control.” In the non-lemmatized model, it is associated with the AGENTS AND ORGANIZATIONS topic. It appears in the topic I labeled INVESTORS in the lemmatized corpus, but on further reflection the two topics are closer than I first thought. My intuition, as I have mentioned before in this discussion of Pierre Nora’s “Between Memory and History,” is that highly cited articles are somehow more central to the corpus because they affect the subsequent distribution of terms. The next-most cited article, Oliver Williamson’s “Transaction-cost Economics: The Governance of Contractual Relations,” appears, suitably enough, in the topics devoted to contracts in both browsers. And R. H. Coase’s “The Federal Communications Commission” is in the COMMUNICATIONS REGULATION topic in both browsers, a topic whose continuing theoretical interest to the journal was established by Coase’s early article.

As I mentioned in the beginning, I chose the Journal of Law and Economics for this project in interpreting topics in part because of its ideological interest. I have little sympathy for Chicago-style economics and its dire public policy recommendations, but I expressed that in this project only through some sarcastic topic-labeling. Does the classification and sorted browsing enabled by topic modeling affect how a reader perceives antagonistic material? Labeling can be an aggressive activity; would automated labeling of topics alleviate this tendency or reinforce it? I don’t know if this subject has been addressed in information-retrieval research, but I’d like to find out.

*I am leaving out some steps here. My code that processes the MALLET output into a browser uses scripts in perl and R to link the metadata to the files and create graphs of each topic. Andrew Goldstone’s code performs much the same functions and is much more structurally sound than what I created, which is why I haven’t shared my code. For creating browsers, Allison Chaney’s topic-modeling visualization engine is what I recommend, though I was unsure how to convert MALLET’s output to the lda-c output that it expects (though doing so would doubtless be much simpler than writing your own as I did).

**That is the most highly cited article anywhere that google’s bots have found, not just in the journal itself. I am aware of the assumption inherent in claiming that a highly cited article would necessarily be influential to that particular journal’s development, since disciplinary and discourse boundaries would have to be taken into account. All highly cited articles are cited in multiple disciplines, I believe, and that applies even to a journal carving out new territory in two well-established ones like law and economics.

Recent Developments in Humanities Topic Modeling: Matthew Jockers’s Macroanalysis and the Journal of Digital Humanities

1. Ongoing Concerns
Matthew Jockers’s Macroanalysis: Digital Methods & Literary History arrived in the mail yesterday, and I finished reading it just a short while ago. Between it and the recent Journal of Digital Humanities issue on the “Digital Humanities Contribution to Topic Modeling,” I’ve had quite a lot to read and think about. John Laudun and I also finished editing our forthcoming article in The Journal of American Folklore on using topic models to map disciplinary change. Our article takes a strongly interpretive and qualitative approach, and I want to review what Jockers and some of the contributors to the JDH volume have to say about the interpretation of topic models.

Before I get to that, however, I want to talk about the Representations project’s status, as it was based on viewing the same corpus through a number of different topic-sizes. I had an intuition that documents that were highly cited outside of the journal, such as Pierre Nora’s “Between Memory and History,” might tend to be more reflective of the journal’s overall thematic structure than those less-cited. The fact that citation-count is (to some degree) correlated with publication date complicates this, of course, and I also began to doubt the premise. The opposite, in fact, might be as likely to be true, with articles that have an inverted correlation to the overall thematic structure possibly having more notability than “normal science.” The mathematical naivety of my approach compared to the existing work on topic-modeling and document influence, such as the Gerrish and Blei paper I linked to in the original post, also concerned me.

One important and useful feature missing from the browsers I had built was the display of related documents for each article. After spending one morning reading through early issues of Computers and the Humanities, I built a browser of it and then began working on computing similarity scores for individual articles. I used what seemed to be the simplest and most intuitive measure–the sum of absolute differences of topic assignments (this is known as Manhattan distance). Travis Brown pointed out to me on twitter that Kullback-Leibler divergence would likely give better results.* (Sure enough, in the original LDA paper, KL divergence is recommended.) The Computers and the Humanities browser currently uses the simpler distance measure, and the results are not very good. (This browser also did not filter for research articles only, and I only used the default stop-words list, which means that it is far from as useful as it could be.)
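The two measures mentioned above can be sketched for a pair of topic-proportion vectors. KL divergence is asymmetric and undefined when the second distribution has zeros where the first does not, so a small smoothing constant is added here; that is one common workaround, an assumption of this sketch rather than what any particular browser does.

```python
import math

def manhattan(p, q):
    """Sum of absolute differences of topic proportions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q), with eps smoothing to avoid log-of-zero."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)

doc_a = [0.70, 0.20, 0.10]
doc_b = [0.60, 0.30, 0.10]
doc_c = [0.05, 0.05, 0.90]

# doc_a is closer to doc_b than to doc_c under both measures
print(round(manhattan(doc_a, doc_b), 2), round(manhattan(doc_a, doc_c), 2))  # 0.2 1.6
print(round(kl_divergence(doc_a, doc_b), 3), round(kl_divergence(doc_a, doc_c), 3))
```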

While the KL-divergence is not hard to calculate, I didn’t have time at the beginning of the end of the semester to rewrite the similarity score script to use it.** And since I wanted the next iteration of the browsers to use the presumably more accurate document-similarity scores, I’ve decided to postpone that project for a month or so. Having a javascript interface that allows you to instantly switch views between pre-generated models of varying numbers of topics also seemed like a useful idea; I haven’t seen anyone do that with different numbers of topics in each model yet (please let me know if there are existing examples of something like this).

2. Interpretation

I’m only going to write about a small section of Macroanalysis here. A full review might come in the future. I think that the rhetorical strategies of Jockers’s book (and also of Stephen Ramsay’s Reading Machines, an earlier volume in the Topics in the Digital Humanities series published by the University of Illinois Press) contrast interestingly with other scholarly monographs in literary studies and that this rhetoric is worth examining in the context of the current crisis in the humanities, and the salvific role of computational methods therein. But what I’m going to discuss here is Jockers’s take on labeling and interpreting the topics generated by LDA.

In our interpretation of the folklore-journals corpus, John and I did de facto labeling or clustering of the topics. We were particularly interested in a cluster of topics related to the performative turn in folklore. Several of these topics did match our expectations in their related terms and chronological trends. (Ben Schmidt’s cautions about graphing trends in topics chronologically are persuasive, though I’m more optimistic than he is about the use of dynamic topic modeling for secondary literature.) The documents associated with these apparently performance-related topics accorded with our expectations, and we took this as evidence that the co-occurrence and relative-frequency assignments of the algorithm were working as expected. If that were all, then the results would be only another affirmation of the long-attested usefulness of LDA in classification and information retrieval. And this goes a long way: if it works for things we know, then it works for things we don’t. And there are many texts we don’t know much about.

The real interest with using topic modeling to examine scholarship is when the results contrast with received understanding. When they mostly accord with what someone would expect to find, but there are oddities and discrepancies, we must interpret the results to determine if the fault lies in the algorithm’s classification or in the discipline’s received understanding of its history. By definition, this received understanding is based more on generalization and oral lore than on analytic scrutiny and revision (which obviously drives much inquiry, but is almost always selective in its target), so there will always be discrepancies. Bibliometric approaches to humanities scholarship lag far behind those of the sciences, as I understand it, and I think they are of intrinsic interest independent of their contribution to disciplinary history.

Jockers describes efforts to label topics algorithmically in Macroanalysis (135, fn1). He mentions that his own work in successively revising the labels of his topic model of 19th-century novels is being used by David Mimno to train a classifying algorithm. He also cites “Automatic Labeling of Topic Models” and “Best Topic Word Selection for Topic Labelling” by Jey Han Lau and co-authors. Both of these papers explore automatically assigning labels to topics from either the terms themselves or from querying an external source, such as wikipedia, to correlate with the terms. My browsers just use the first four terms of a topic as the label, but I can see how a human-assigned label would make them more consistently understandable. Of course, with many models and large numbers of topics, this process becomes laborious; hence the interest in automatic assignment.

But some topics cannot be interpreted. (These are described as “uninterruptable” topics in Macroanalysis [129], in what I assume is a spell-check mistake for “uninterpretable.”) Ignoring ambiguous topics is “a legitimate use of the data and should not be viewed with suspicion by those who may be wary of the ‘black box’” (130). I agree with Jockers here. In my experience modeling JSTOR data, there are always “evidence/argument” topics that are highly represented in a hyperparametrized model, and these topics are so general as to be useless for analytic purposes. There are also “OCR error” topics and “bibliography” topics. I wouldn’t describe these latter ones as ambiguous so much as useless, but the point is that you don’t have to account for the entire model to interpret some of the topics. Topics near the bottom of a hyperparametrized model tend not to be widely represented in a corpus and thus are not of very high quality: this “dewey ek chomsky” topic from the browser I created out of five theory-oriented journals is a good example.

I was particularly intrigued by Jockers’s description of combining topic-model and stylometric classifications into a similarity matrix. I would be bewildered and intimidated by the underlying statistical difficulties of combining these two types of classifications, but the results are certainly intriguing. The False Heir, by the immortal George Payne Rainsford James, was classified as the closest non-Dickens novel to A Tale of Two Cities, for example (161).

3. The JDH Issue

Scott Weingart and Elijah Meeks, as I noted above, co-edited a recent issue of JDH devoted to topic modeling in the humanities. Many of the articles are versions of widely circulated posts of the last few months, such as the aforementioned Ben Schmidt article and Andrew Goldstone and Ted Underwood’s piece on topic-modeling PMLA. (Before I got distracted by topic-browsers, I created some network visualizations of topics similar to those in the Underwood and Goldstone piece. I get frustrated easily with Gephi for some reason, but the network visualization packages in R don’t generally produce graphs as handsome as Gephi’s.) There is a shortened version of David Blei’s “Probabilistic Topic Models” review article, along with the slides from David Mimno’s very informative presentation at November’s topic-modeling workshop at the University of Maryland. Megan R. Brett does a good job of explaining what’s interesting about the process to a non-specialist audience. I’ve tried this myself two or three times, and it’s much more difficult than I expected it would be. The slightly decontextualized meanings of “topic,” “theme,” “document,” and possibly even “word” that are used to describe the process cause confusion, from what I’ve observed, and it’s also quite difficult to grasp why the “bag of words” approach can produce coherent results if you’re unaccustomed to thinking about the statistical properties of language. Formalist training and methods are hard to reconcile with frequency-based analysis.

Lisa Rhody’s article describes using LDA to model ekphrastic poetry. I was impressed with Rhody’s discussion of interpretation here, as poetry presents a different level of abstraction from secondary texts and even other forms of creative writing. I had noticed in the rhetoric browser I created out of College English, jac, Rhetoric Review, Rhetoric Society Quarterly, and CCC that the poems often published in College English consistently clustered together (and that topic would likely have held together even had I stop-worded “poems,” which I probably should have done). Rhody’s article is the longest of the contributions, I believe, and it has a number of observations about the interpretation of topics that I want to think about more carefully.

Finally, the overview of tools available for topic modeling was very helpful. I’ve never used Paper Machines on my zotero collections, but I look forward to trying this out in the near future. A tutorial on using the R lda package might have been a useful addition, though perhaps its target audience would be too small to bother. I think I might be one of the few humanists to experiment with dynamic topic models, which I think is a useful and productive—if daunting—LDA variant. (MALLET has a built-in hierarchical LDA model, but I haven’t yet experimented with it.)

*Here is an informative storified conversation about distance measurements for topic models that Brown showed me.

**Possibly interesting detail: at no point do any of my browser-creation programs use objects or any more complicated data-structure than a hash. If you’re familiar with the types of data manipulation necessary to create one of these, that probably sounds somewhat crazy—hence my reluctance to share the code on github or similar. I know enough to know that it’s not the best way to solve the problem, but it also works, and I don’t feel the need to rewrite it for legibility and some imagined community’s approval. I’m fascinated by the ethos of code-sharing, and I might write something longer about this later.

***I disagree with the University of Illinois Press’s decision to use sigils instead of numbered notes in this book. As a reader, I prefer endnotes, though I know how hard they are to typeset, but Jockers’s book has enough of them that they should be numbered.

Topic Models and Highly Cited Articles: Pierre Nora’s “Between Memory and History” in Representations

I have been interested in bibliometrics for some time now. Humanities citation data has always been harder to come by than that of the sciences, largely because the importance of citation-count as a metric has never much caught on there. Another important reason is a generalized distrust and suspicion of quantification in the humanities. And there are very good reasons to be suspicious of assigning too much significance to citation-counts in any discipline.

I used google scholar to search for most-cited articles in several journals in literary studies and allied fields. (Its default search behavior is to return the most-cited article in its database, which, while having a very broad reach, is far from comprehensive or error-free.) By far the most-cited article I found in any of the journals I looked at was Pierre Nora’s “Between Memory and History: Les Lieux de Mémoire.” A key to success in citation-gathering is multidisciplinary appeal, and Nora’s article has it. It is cited in history, literary studies, anthropology, sociology, and several other fields. (It would be interesting to consider Nora’s argument about the ever-multiplying sites of memory in an era of mass quantification, but I’ll have to save that for another time.)

The next question that came to mind was where Nora’s article would be classified in a topic model of all of the journal’s articles. Representations was first published in 1983. The entire archive in JSTOR contains 1036 documents. For most of my other topic-modeling work with journals, I have used only what JSTOR classifies as research articles. Here, because of the relatively small size of the sample (and also because I wanted to see how the algorithm would classify front matter, back matter, and the other paraphernalia), I used everything. In order to track “Between Memory and History,” I created several different models. Matching the number of topics to the size and density of a given corpus is always a heuristic process. Normally, I would have guessed that somewhere between 30 and 50 topics would have been enough to catch most of the distinct topics while minimizing the lumping together of unrelated ones.

For this project, however, I decided to create six separate models with an incrementally increasing number of topics: 10, 30, 60, 90, 120, and 150. I have also created browsers for each model. The index page of each browser shows the first four words of each topic for that model. The topics are sorted in descending order of their proportion in the model. Clicking on one of the topics takes you to a page which shows the full list of terms associated with that topic, the articles most closely associated with that topic (also sorted in descending order—the threshold is .05), and a graph that shows the annual mean of that topic over time. Clicking on any given journal article will take you to a page showing that article’s bibliographic information, along with a link to JSTOR. The four topics most closely associated with that article are also listed there.
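The sorting and thresholding described above can be sketched as follows: for each topic, collect the documents whose proportion for that topic meets the .05 threshold and sort them in descending order. The data structures and document ids are illustrative assumptions; the real browsers are built from MALLET output with perl and R scripts.

```python
# Hypothetical sketch of the browser's per-topic page logic.

THRESHOLD = 0.05

def topic_pages(doc_topics, doc_ids):
    """doc_topics[d][k] = proportion of topic k in document d (toy format)."""
    n_topics = len(doc_topics[0])
    pages = {}
    for k in range(n_topics):
        assigned = [(doc_ids[d], doc_topics[d][k])
                    for d in range(len(doc_topics))
                    if doc_topics[d][k] >= THRESHOLD]
        # articles most closely associated with the topic come first
        pages[k] = sorted(assigned, key=lambda t: t[1], reverse=True)
    return pages

doc_topics = [[0.80, 0.02], [0.30, 0.65], [0.04, 0.90]]
doc_ids = ["nora1989", "doc2", "doc3"]
pages = topic_pages(doc_topics, doc_ids)
print(pages[0])  # [('nora1989', 0.8), ('doc2', 0.3)]
```

The same structure also makes it easy to ask the question raised below of how many topics a given article is assigned to: count the topics whose page includes it.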

In the ten-topic browser, whose presence here is intended to demonstrate my suspicion that ten topics would not be nearly enough to capture the range of discourse in Representations, Nora’s article is in the ‘French’ topic, a lumped-together race/memory topic, a generalized social/history topic, and the suggestive “time, death, narrative” topic. With a .05 threshold, 32% of the documents in the corpus appear in the ten-topic browser. [UPDATE: 3/16, this figure turned out to be based on a bug in the browser-building program.] None of these classifications is particularly surprising or revealing, given how broad the topics have to be at this level of detail; but one idea that I want to return to is the ability of topic models to identify influential documents in a given corpus. Nora’s article has clearly been very influential, but are there any detectable traces of this influence in a model of the journal in which it appeared?

Sean M. Gerrish and David Blei’s article “A Language-based Approach to Measuring Scholarly Impact” uses dynamic topic models to infer which documents are (or will be) most influential in a given collection. What I have done with these Representations models is not dynamic topic modeling but regular LDA. I have experimented with dynamic topic models in the past, and I would like to apply the particular techniques described in their article once I understand them better.

Here is how Nora’s article is classified in each of the topic models (sorted vertically from most to least representative):

10-topics 30-topics 60-topics 90-topics 120-topics 150-topics
{social political work} {history historical cultural} {history historical past} {historical history memory} {memory past history} {memory past collective}
{war american black} {form text relation} {memory jewish holocaust} {form human order} {human form individual} {history historical past}
{time death narrative} {memory jewish jews} {made work ways} {fact make point} {history historical modern} {form relation terms}
{de la le} {time death life} {world human life} {early modern history} {relation difference object} {sense kind fact}
N/A {political social power} {early modern great} {power terms suggests} {de la french} {individual system theory}
N/A {de la le} {make fact question} N/A {fact order present} N/A
N/A N/A {body figure space} N/A {forms figure form} N/A
N/A N/A {makes man relation} N/A N/A N/A
N/A N/A {national history public} N/A N/A N/A

There is a notable consistency among the topics the article is assigned to, no matter how many there are to choose from. A logical question to ask is whether Nora’s article is assigned to more or fewer topics than the average article across these six models. The percentage of all articles that are assigned to a topic with a proportional threshold >= .05 ranges from 32% in the ten-topic model to 52% in the 150-topic model.

In my next post, I am going to describe the relative frequency of the average article in the different models and try to identify which ones (including Nora’s, if it turns out to be) are disproportionately represented in the topics. I will also begin interpreting these results in light of what I felt was historicism’s relative absence in the theory-journals corpus I created earlier.

[UPDATE: 3/16. I corrected a bug in the browser-building program and generated a new table above with the correct topics linked for Nora's article. The previous table had omitted a few.]

Learning to Code

One of my secret vices is reading polemics about whether or not some group of people, usually humanists or librarians, should learn how to code. What’s meant by “to code” in these discussions varies quite a lot. Sometimes it’s a markup language. More frequently it’s an interpreted language (usually python or ruby). I have yet to come across an argument for why a humanist should learn how to allocate memory and keep track of pointers in C, or master the algorithms and data structures in this typical introductory computer science textbook; but I’m sure they’re out there.

I could easily imagine someone in game studies wanting to learn how to program games in their original environment, such as 6502 assembly, for example. A good materialist impulse, such as learning how to work a printing press or bind a book, should never be discouraged. But what about scholars who have an interest in digital media, electronic editing, or text mining? The skeptical argument here points out that there are existing tools for all of these activities, and the wise and conscientious scholar will seek those out rather than wasting time reinventing an inferior product.

This argument is very persuasive, but it doesn’t survive contact with the realities of today’s text-mining and machine-learning environment. I developed a strong interest in these areas several months ago (and have posted about little else since, sadly enough), even to the point where I went to an NEH seminar on topic modeling hosted by the fine folks at MITH. One of the informative lectures recommended that anyone serious about pursuing topic modeling projects learn the statistical programming language R and a scripting language such as python. This came as little surprise to me, about as little as being reassured later in the evening by a dinner companion that Southerners were of course discriminated against in academia. I had begun working with topic-modeling packages in R, and a great deal of text-munging was required to assemble the topic output in a legible format. MALLET makes this easier, but there’s no existing GUI solution* for visualizing the topics (or creating browsers of them, which some feel is more useful**).
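A small example of the kind of text-munging in question: turning MALLET's doc-topics output into per-document topic lists. Older MALLET versions wrote rows as an index, a filename, and then alternating topic/proportion pairs; newer ones write one column per topic. The parser below assumes the older pair format, and the sample lines are fabricated, so treat this as an illustration rather than a general-purpose reader.

```python
# Assumes the older MALLET doc-topics format: "index name topic prop topic prop ..."

def parse_doc_topics(lines):
    docs = {}
    for line in lines:
        if line.startswith("#"):        # skip the header comment line
            continue
        fields = line.split()
        name = fields[1]
        pairs = fields[2:]
        # pair up alternating topic ids and proportions
        docs[name] = [(int(t), float(p)) for t, p in zip(pairs[::2], pairs[1::2])]
    return docs

sample = [
    "#doc name topic proportion ...",
    "0 file:/corpus/nora1989.txt 7 0.41 2 0.22 13 0.09",
    "1 file:/corpus/doc2.txt 2 0.55 7 0.18",
]
docs = parse_doc_topics(sample)
print(docs["file:/corpus/nora1989.txt"][0])  # (7, 0.41): top topic and its weight
```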

Whatever flexibility being able to dispense with existing solutions might offer you is more than counterbalanced by the unforgiving exactitude and provincial scrupulousness of programming languages, which manifestly avoid all but the most literal interpretations and cause limitless suffering for those foolish or masochistic enough to use them. These countless frustrations inevitably lead to undue pride in overcoming them, which leads people (or at least me) to replace a more rational regret over lost time with the temporary confidence of (almost always Pyrrhic) victory.

An optimistic assessment of the future of computation is that interfaces will become sophisticated enough to eliminate the need for almost anyone other than hobbyists to program a computer. Much research in artificial intelligence (including many of the most promising results, as I understand them) has gone into training computers to program themselves. Functional programming languages, to my untutored eye and heavily imperative mindset, already seem to train their programmers to think in a certain way. The correct syntax is the correct solution, in other words; and how far can it be from that notable efficiency to having the computer synthesize the necessary solutions to any technical difficulty or algorithmic refinement itself? (These last comments are somewhat facetious, though the promise of autoevolution was at the heart of cybernetics and related computational enthusiasms—the recent English translation of Lem’s Summa Technologiae is an interesting source here, as is Lem’s “Golem XIV.”)

I can’t help but note that several of the arguments I’ve read advising people not to learn to code (and not to spend time teaching other people how to, if you happen to be unlucky enough to be in a position to do so) are written by people who make it clear that they themselves know how. (I’m thinking here in particular of Brian Lennon, with whom I’ve had several discussions about these matters on twitter, and also of David Golumbia.) Though I don’t think this myself, I could see how someone might describe this stance as obscurantist. (It’s probably a matter of ethos and also perhaps a dislike of people who exaggerate their technical accomplishments and abilities in front of audiences who don’t know any better—if you could concede that such things could exist in the DH community.)

*Paper Machines, though I haven’t tried it out, can now import and work with DfR requests. This may include topic modeling functionality as well.

**I have to admit that casual analysis (or, exacting scrutiny) of my server logs reveals that absolutely no one finds these topic browsers worth more than a few seconds’ interest. I haven’t yet figured out if this is because they are objectively uninteresting or if users miss the links because of the style sheet. (Or both.)

The Awakening of My Interest in Annular Systems

I’ve been thinking a lot recently about a simple question: can machine learning detect patterns of disciplinary change that are at odds with received understanding? The forms of machine learning that I’ve been using to try to test this—LDA and the dynamic LDA variant—do a very good job of picking up the patterns that you would suspect to find in, say, a large corpus of literary journals. The model I built of several theoretically oriented journals in JSTOR, for example, shows much the same trends that anyone familiar with the broad contours of literary theory would expect to find. The relative absence of historicism as a topic of self-reflective inquiry is also explainable by the journals represented and historicism’s comparatively low incidence of keywords and rote-citations.

I’ve heard from people on twitter that it’s a widely held belief that machine-learning techniques (and, by extension, all quantitative methods) can only tell us what we already know about the texts. I admit some initial skepticism about the prevalence of this claim, but I’ve now seen more evidence of it in the wild, so to speak, and I think I understand where some of this overly categorical skepticism comes from. A test of the validity of topic modeling, for example, would be if it produces a coherent model of a well-known corpus. If it does, then it is likely that it will do the same for an unknown or unread group of texts. The models that I have built of scholarly literature from JSTOR, I can see, are thought by some of the people who’ve seen them to be well-understood corpora. If the models reflect the general topics and trends that people know from their knowledge of the field, then that’s great as far as it goes, but we’ll have to reserve judgment on the great unread.

One issue here is that I don’t think the disciplinary history of any field is well understood. Topic modeling’s disinterested aggregations have the potential to show an unrecognized formation or the persistence of a trend long thought dormant. Clancy found some clustering of articles in rhetoric journals associated with a topic that she initially would have labeled as “expressivist” from several decades before she would have expected. Part of this has to do with the eclectic nature of what’s published in College English, of course, and part has to do with the parallels between creative writing and expressivist pedagogy. But it’s the type of specific connection that someone following established histories is not likely to find.

Ben Schmidt noted that topic modeling was designed and marketed, to some degree, as a replacement for keyword search. Schmidt is more skeptical than I am of the usefulness of this higher-level of abstraction for general scholarly research. I know enough about anthropology to have my eyebrows raised by this Nicholas Wade essay on Napoleon Chagnon, for example, and I still find this browser of American Anthropologist to be a quicker way of finding articles than JSTOR’s interface. I created this browser to compare with the folklore browser* of the corpus that John Laudun and I have been working with. We wanted to see if topic models would reflect our intuition that the cultural/linguistic turn in anthropology and folklore diffused through their respective disciplines’ scholarly journals (the folklore corpus contains the journal most analogous to American Anthropologist, The Journal of American Folklore, but it also has other folklore journals as well) at the expected time (earlier in anthropology than folklore).

A very promising way, to my mind, of correlating topic models of journals is with networks of citations. I’ve done enough network graphs of scholarly citations to know that, unless you heavily prune and categorize the citations, the results are going to be hard to visualize in any meaningful way. (One of the first network graphs I created, of all of the citations in thirty years of JAF, required zooming in to something like 1000x magnification to make out individual nodes. I’m far from an expert at creating efficient network visualizations, needless to say.) JSTOR once provided citation data through its Data for Research interface; as far as I know, it no longer does. This has been somewhat frustrating.

If we had citation data, taking two topics that both seem reflective of a general cultural/linguistic/poststructuralist influence, such as this folklore topic and this anthropological one, would allow us to compare the citation networks to see if the concomitant rise in proportion was reflected in references to shared sources (Lévi-Strauss, for example, I know to be one of the most cited authors in the folklore corpus). I would also like to explore the method described in this paper that uses a related form of posterior inference to discover the most influential documents in a corpus.**
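One simple way to operationalize “references to shared sources,” assuming we had works-cited lists for the articles most associated with each topic, would be a set-overlap (Jaccard) measure; the author names below are illustrative, not measured from the corpora:

```python
def citation_overlap(cites_a, cites_b):
    """Jaccard overlap between the sets of sources cited by the
    articles most associated with each of two topics."""
    a, b = set(cites_a), set(cites_b)
    return len(a & b) / len(a | b) if a | b else 0.0

folklore_cites = {"Levi-Strauss", "Foucault", "Geertz"}
anthro_cites = {"Levi-Strauss", "Geertz", "Turner", "Derrida"}
print(citation_overlap(folklore_cites, anthro_cites))  # 0.4
```

A rising overlap score across matched time-slices of the two corpora would be one piece of evidence that the parallel rise in topic proportion reflects shared influences.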

This type of comparative exploration, while presenting an interesting technical challenge to implement (to me, that is; I fully recognize the incommensurable gulf between using these algorithms and creating and refining them), can’t (yet) be mistaken for discovery. You can’t go from this to an a priori proof of non-discovery, however. Maybe no one is actually arguing this position, and I’m fabricating this straw argument out of supercilious tweets and decontextualized, half-remembered blog posts.

A more serious personal intellectual problem for me is that I find the dispute between Peter Norvig and Noam Chomsky to be either a case of mutual misunderstanding or one where Chomsky has by far the more persuasive case. If I’m being consistent, then, I’d have to reject at least some of the methodological premises behind topic modeling and related techniques. Perhaps “practical value” and “exploration/discovery” can coexist peacefully.

*These browsers work by showing an index page with the first four words of each topic. You can then click on any one of the topics to see the full list of words associated with it, together with a list of articles sorted by how strongly they represent that topic. Clicking on an individual article then takes you to a page that shows the other topics most associated with that article, also clickable, and a link to the JSTOR page of the article itself.
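The browser logic described in this note is simple enough to sketch. Assuming the model’s output reduced to topic key-word lists and per-article topic proportions (the words and proportions below are made up, not my actual model’s), the index labels and per-topic article rankings look like this:

```python
def topic_label(words, n=4):
    """Index-page label: the first n words of a topic's key list."""
    return " ".join(words[:n])

def articles_for_topic(doc_topics, topic, top_n=3):
    """Articles sorted by how strongly they represent the topic."""
    ranked = sorted(doc_topics.items(),
                    key=lambda kv: kv[1][topic], reverse=True)
    return [doc for doc, props in ranked[:top_n]]

# Hypothetical topic key words and per-article topic proportions
topics = {0: ["gay", "sexual", "queer", "sex", "lesbian"],
          1: ["social", "class", "theory", "ideology", "political"]}
doc_topics = {"a": [0.7, 0.3], "b": [0.2, 0.8], "c": [0.5, 0.5]}

print(topic_label(topics[0]))             # gay sexual queer sex
print(articles_for_topic(doc_topics, 0))  # ['a', 'c', 'b']
```

The per-article pages are just the same sort, run the other way: sort a single article’s proportions to find its most associated topics.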

**The note about the model taking more than ten hours to run fills me with foreboding, however. My (doubtless inefficient) browser-creating scripts can take more than an hour to run on a corpus of 10K documents, combined with another hour or more with MALLET and R; it really grinds down a person conditioned to expect instant results in today’s attention economy.

Two Topic Browsers

Ben Schmidt, in a detailed and very useful post about some potential problems with using topic models for humanities research, wondered why people didn’t commonly build browsers for their models. For me, the answer was quite simple: I couldn’t figure out how to get the necessary output files from MALLET to use with Allison Chaney’s topic modeling visualization engine. I’m sure that MALLET’s output can be converted to the right format, and I’ve built the dynamic-topic-modeling code, which does produce the same type of files as lda-c, but I hadn’t actually used lda-c (except through an R package front-end) for my own models.

It occurred to me that a simple browser wouldn’t be that hard to build myself, so I made one for Clancy’s explorations of the rhetoric/composition journals in JSTOR and another for the theory corpus. (I did use Chaney’s CSS file.) I used my old graphs without the scatterplot layer for the theory browser, as I didn’t want to take the time to regenerate those yet. And I’m not quite sure what’s going on with unicode/non-ASCII characters; theoretically the code I wrote should convert those properly. [UPDATE: Thanks to a pointer from Andrew Goldstone on twitter, I fixed the encoding issue. Setting binmode($fh, ":utf8") on all filehandles is the answer, in perl at least.]

The articles shown for each topic are those that have that topic as their single strongest association. It’s quite possible for other articles to have higher proportions of a given topic but to be associated even more strongly with another. I should also rewrite the code so that it grabs all articles above a certain threshold of proportion.
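The difference between the two selection rules can be sketched with made-up doc-topic proportions; `threshold_articles` here is the contemplated rewrite, not code from the actual browser:

```python
def strongest_topic_articles(doc_topics, topic):
    """Current behavior: only articles whose single strongest
    topic is this one."""
    return [d for d, p in doc_topics.items()
            if max(range(len(p)), key=p.__getitem__) == topic]

def threshold_articles(doc_topics, topic, threshold=0.1):
    """Proposed behavior: every article whose proportion for this
    topic meets the threshold, even if another topic is stronger."""
    return [d for d, p in doc_topics.items() if p[topic] >= threshold]

# Hypothetical article -> topic-proportion lists
doc_topics = {"a": [0.6, 0.4], "b": [0.3, 0.7], "c": [0.05, 0.95]}
print(strongest_topic_articles(doc_topics, 0))  # ['a']
print(threshold_articles(doc_topics, 0, 0.25))  # ['a', 'b']
```

Under the argmax rule, article “b” disappears from topic 0’s page despite having a substantial proportion of it; the threshold rule recovers it.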

The Stronghold of Bioinformatics

No one likes gamification or MOOCs, as far as I can tell. What I should say is that anyone trained in the hermeneutics of suspicion might even find it hard to accept their existence. It’s hard to come up with a hypothetical concept that would cry more piteously to the heavens for critique, for example. True to form, until a few weeks ago I had never earned a badge in my life and would have regarded the prospect of doing so with contempt and a touch of pity for whoever was naive enough to suggest it.

Then there was this Metafilter post. Things I’ve discovered via Metafilter have taken away many months of work-time over the years, so the sensible thing to do would be to quit reading it. But that’s unlikely. In any case, Project Rosalind is a series of programming problems related to bioinformatics, with the gamified features of “levels,” “badges,” “achievements,” and even, God help me, “xp.” The problems cover string processing, probability, and other topics. They have a tree-like structure, and you have to solve precursor problems before getting access to later ones. Solving a problem involves downloading a dataset and submitting a solution within five minutes. After you’ve solved a problem, you can see the code that others have posted to solve it.

This last feature is particularly interesting to me, as I have never really learned functional programming, so when I see solutions to problems that I have solved in perl in languages such as Haskell, Clojure, or Scala, it’s a bit easier to understand how they were put together. (Rosetta Code is another place to see programming problems solved in multiple languages.) You are allowed unlimited attempts to get the right answer, and you can see forum questions about the problem after two unsuccessful tries. (I once posted a question, a rather idiotic one in retrospect, and I received a correspondingly withering response, whose impact I mitigated somewhat by imagining it spoken in the Comic Book Guy’s voice.)
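To give a sense of the trivial early problems: one of the first, if I remember it right, asks for little more than a count of the nucleotides in a DNA string. A sketch of my own (not a solution copied from the site):

```python
from collections import Counter

def count_nucleotides(dna):
    """Return the counts of A, C, G, and T in a DNA string."""
    counts = Counter(dna)
    return counts["A"], counts["C"], counts["G"], counts["T"]

print(count_nucleotides("AGCTTTTCATTCTGACTG"))  # (3, 4, 3, 8)
```

The real interest of problems like this is less the counting than getting comfortable reading a dataset, computing, and formatting an answer inside the five-minute window.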

I have, at this point, solved twenty-two of the ninety-three problems. The early ones are trivial, but I’m finding that the difficulty scales up quite a bit. I’ve used some algorithms I had never worked with before, such as suffix trees and shortest common superstrings. I’ve also used arbitrarily nested loops in perl (with Algorithm::Loops) and contemplated the theoretical limits of what a regular expression can match more than I’ve ever had to before. It’s also quite interesting to see what the total numbers of problems solved reveal about people’s background knowledge. Two of the problems involving Mendelian inheritance and probability have been solved proportionally far fewer times than (more difficult) string-processing problems. (I don’t mean to be a hypocrite in saying this, as I got tired of the Punnett squares required in the second one of those and haven’t solved it myself.)
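A problem in the Mendelian-inheritance vein looks something like this: given k homozygous dominant, m heterozygous, and n homozygous recessive organisms, find the probability that two randomly chosen mates produce offspring showing the dominant phenotype. A sketch from my own derivation, not code from the site:

```python
def dominant_phenotype_prob(k, m, n):
    """Probability that two randomly chosen mates (without
    replacement) produce an offspring with a dominant phenotype."""
    t = k + m + n
    # Probability of a recessive (aa) offspring, summed by pair type
    # over ordered pairs; any pair including a homozygous dominant
    # parent contributes nothing:
    recessive = (m * (m - 1) * 0.25   # Aa x Aa -> 1/4 recessive
                 + 2 * m * n * 0.5    # Aa x aa -> 1/2 recessive
                 + n * (n - 1))       # aa x aa -> always recessive
    return 1 - recessive / (t * (t - 1))

print(round(dominant_phenotype_prob(2, 2, 2), 5))  # 0.78333
```

The Punnett-square fractions (1/4, 1/2, 1) are exactly the part that becomes tedious when later problems compound several generations of them.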

Some of the gamified features of the site I regard as silly (levels, xp, badges, achievements), but I admit that I can’t help being motivated by the statistical information about how many people have solved which problems. It triggers my instinctual competitiveness, somehow. The site even seems to encourage people to post their country of origin, introducing nationalism into the competitive mix. As a learning tool, I’m not sure how effective it is. It’s quite possible to solve many of the problems while retaining only the barest minimum about the underlying molecular biology, and problems which require a bit more conceptual understanding than that (see the Mendelian inheritance ones above) are comparatively ignored.

The programmatic checking of solutions is also somewhat finicky. For at least some of the problems, a trailing end-of-line character at the end of the file will cause an otherwise correct solution to fail, for example. But all in all, I’m very impressed with this site and think it has a lot of potential for teaching people (humanists, for example) how to program. It would be nice to be able to reuse the code with different problem sets, if they ever decide to release the source in the future.
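The fix on the solver’s side is simply to strip trailing end-of-line characters before submitting; the filename in the comment below is hypothetical:

```python
def clean(text):
    """Strip trailing end-of-line characters: a final newline can
    make an otherwise correct answer fail the automatic checker."""
    return text.rstrip("\r\n")

# e.g. answer = clean(open("rosalind_dataset.txt").read())
print(clean("AGCT\n"))  # AGCT
```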


[UPDATE: I corrected a few mistakes above (I gave myself credit for an extra problem, for instance), and I also wanted to mention an important precursor: Project Euler. That site has mathematics problems, and it also seems a bit more streamlined. I haven’t actually used it yet, though.]

Topics in Theory

After experimenting with topic models of Critical Inquiry, I thought it would be interesting to collect several of the theoretical journals that JSTOR has in their collection and run the model on a bigger collection with more topics to see how the algorithm would chart developments in theory.

I downloaded all of the articles (word-frequency data for each article, that is) in New Literary History, Critical Inquiry, boundary 2, Diacritics, Cultural Critique, and Social Text. I then fit a model with one hundred topics. I had to adjust the stop-word list to account for common words and, unsuccessfully, for words in other languages. What I should have done was use the supplied stop-word lists for those languages as well. At least this way there is a chance that interesting words in those languages will cluster together.
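The idea of folding other languages’ stop-word lists into the filter can be sketched like this; the tiny word lists here are illustrative stand-ins, not the lists MALLET actually ships:

```python
# Illustrative stand-in stop-word lists for three languages
english_stops = {"the", "of", "and", "a", "in"}
french_stops = {"le", "la", "les", "de", "et"}
german_stops = {"der", "die", "das", "und", "ein"}

# Merge them into a single stoplist before filtering tokens
stoplist = english_stops | french_stops | german_stops

def filter_tokens(tokens, stops=stoplist):
    """Drop any token found in the merged stoplist."""
    return [t for t in tokens if t.lower() not in stops]

print(filter_tokens(["the", "écriture", "de", "la", "différance"]))
# ['écriture', 'différance']
```

With the function words gone, the remaining foreign-language tokens are content words, which is what gives them a chance to cluster into a coherent topic.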

The topics themselves looked good, I thought. One hundred was about the right number, as I didn’t see much evidence of merging or splitting; I should say, rather, that I saw an acceptable level, or the usual level. This topic, for example, shows what I mean: “aboriginal rap[?] women australian climate weather movement work warming time australia housework change social power oroonoko[?] make wages years.” I also didn’t lemmatize this corpus, although I know how to. Lemmatizing takes a lot of time the way I’m doing it (using the WordNet lemmatizer from the python Natural Language Toolkit), and I frankly haven’t been that impressed with the specificity of the lemmatized models that I have run.

Visualizing changes in topics over time is quite difficult. Each year will have thousands of observations per topic, and taking the mean of each topic per year doesn’t always produce very readable results. Benjamin Schmidt suggested trying the geom_smooth function of ggplot2, which I had never had much luck with. The main reason I couldn’t get it to work very well, I found, was that I was trying to create a composite graph of every topic using facet_wrap. Each topic graphed by itself with geom_smooth produced better results.

Here, for example, is the graph for this coherent topic—”gay sexual queer sex lesbian aids sexuality homosexual men homosexuality identity heterosexual male gender desire social lesbians drag butler”:
Graph of Change over Time in "Queer Theory" Topic from Theory Journals

The chronology you see above does approximately track the rise of queer theory, though the smoothing algorithm is full of mystery and error. A scatter-plot of the same data would be far noisier and would not reveal much in the way of change over time. This topic should also correspond roughly to postcolonial theory (“indian india hindu colonial postcolonial subaltern british indians nationalist gandhi english bengali religious caste nationalism sanskrit maori bengal west”):
Postcolonial Topics over Time in Theory Journals

I’m suspicious of this linear increase, needless to say. The underlying data is messier. Would Marxist theory show any decline around the predictable historical period? (Terms: “social class theory ideology political production ideological historical marxist marx bourgeois capitalist society capitalism marxism economic labor relations capital”)

Topics in Marxist Theory over Time in Theory Journals

That is roughly what I was expecting. But compare “soviet party revolutionary socialist revolution socialism communist political national left union struggle europe russian fascism war central movement european”:

Communist Theory Topics over Time in Theory Journals

I have hopes for the exploratory potential of topic-modeling disciplinary change this way. Another interesting topic showing a linear-seeming increase is this one (“muslim islamic islam religious arab muslims secular arabic algerian orientalism rushdie religion iranian iran western turkish ibn secularism algeria”):
Islamic Topics over Time in Theory Journals

To show what the data looks like with different visualizations, I’m going to cycle through several types of graphs of the above topic. The first is a line graph:
Line graph

Next is a scatter-plot:


Now a scatter-plot with the scale_y_log10 function applied:
Point (Log10)

And a yearly mean:
Yearly mean

Finally, a five-year mean:
Five-year mean

All of the graphs reveal a general upward trend, I think, though not as pronounced a one as the smoothing function suggests. I would be delighted to hear any ideas anyone has about better ways to graph these. I’ve not found any improvement in grouping by document rather than by year.
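For what it’s worth, the yearly and five-year means above amount to the following aggregation, sketched here with made-up (year, proportion) observations rather than the actual model output; I’m reading “five-year mean” as a trailing moving average over the yearly means:

```python
from collections import defaultdict

def yearly_means(observations):
    """Mean topic proportion per year from (year, proportion) pairs."""
    sums = defaultdict(lambda: [0.0, 0])
    for year, prop in observations:
        sums[year][0] += prop
        sums[year][1] += 1
    return {y: s / n for y, (s, n) in sorted(sums.items())}

def moving_average(means, window=5):
    """Smooth the sorted yearly means with a trailing window."""
    vals = list(means.values())
    out = []
    for i in range(len(vals)):
        chunk = vals[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical per-document observations of one topic's proportion
obs = [(1980, 0.1), (1980, 0.3), (1981, 0.2), (1982, 0.4)]
means = yearly_means(obs)
print({y: round(v, 3) for y, v in means.items()})
print([round(v, 3) for v in moving_average(means, window=2)])
```

Widening the window trades responsiveness for smoothness, which is the same trade-off geom_smooth makes, only with the mystery removed.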

There’s more I plan to do with this data set, including coming up with better ways to visualize it (more precision, efficient ways of seeing many topics at once, etc.). I am including the full list of topics after the fold for reference. Some reveal OCR errors; others are publishing artifacts that my first rounds of stop-word filtering didn’t remove.

Update (2/14/12): I created a browser of this model that shows the articles most closely associated with each topic.
