Org-mode LaTeX Export Issue

I realize this advice may only be relevant to a few souls in this degraded world, but I wanted to document an issue I had when using the org-mode LaTeX exporter with biblatex and James Clawson’s MLA style package.

My template automatically adds babel to the LaTeX headers when exporting from org. If the default language is set as “en,” the org exporter will append “,english” as a babel option. This option causes Clawson’s package to place ending punctuation outside quotation marks, among other possible effects.

I didn’t want to change the variable globally, but you can set “#+LANGUAGE: nil” in your header, and “english” will no longer be appended. There will still be a comma after your language setting in the babel options, but, while unsightly, it doesn’t break anything.

I’m sure there’s a more elegant solution to this problem, but I couldn’t find it. I also ran into sectioning problems if I tried to import multiple files into an org master file for eventual LaTeX export. Having the entire book MS in one file solves these problems. Org-mode’s folding features make it more feasible than it might seem to have a 100K MS in a single file.

Another indispensable innovation I’ve discovered for working with reftex and zotero is Robin Wilson’s AutoZotBib, which will automatically synchronize a *.bib file with your zotero library.

National Security

[I originally didn't quite have the kinks worked out of my org-mode HTML export process that produced the document below, but I have updated the post. There is also a pdf of these remarks about Brian Lennon's "The Digital Humanities and National Security."]

“The Real Nature of Control”

The last text I assigned in my recent “Modernism, Fascism, and Sexuality” seminar was Gravity’s Rainbow.1 Among its many oddities is a scene where the spirit of Walther Rathenau is summoned through a medium for the entertainment and mockery of an elite “corporate Nazi crowd”:

These signs are real. They are also the symptoms of a process. The process follows the same form, the same structure. To apprehend it you will follow the signs. All talk of cause and effect is secular history, and secular history is a diversionary tactic. Useful to you, gentlemen, but no longer so to us here. If you want the truth—I know I presume—you must look into the technology of these matters. Even into the hearts of certain molecules—it is they after all which dictate temperatures, pressures, rates of flow, costs, profits, the shapes of towers…2

Rathenau, or, as we must rationally conclude, the inventive medium Peter Sachsa, ends with these somewhat ominous questions: “what is the real nature of synthesis?” and “what is the real nature of control?” It was many years ago that I wrote an overview of the critical reception of Gravity’s Rainbow for an undergraduate thesis, and it was lonely work for a student with a VAX account. Even then it was clear that this scene had engaged many critics’ attention for its seeming encapsulation of the novel’s many threads.

Among those threads is cybernetics. The idea of an automated and self-propagating technology of control, independent of its computational substrate, lurks in the margins of Gravity’s Rainbow. It’s also interesting that a statistician becomes a member of the “counterforce” in the text because of his distrust of an orthodox behaviorist.3 I found myself wondering whether Pynchon was aware of the then-recent intellectual history of behaviorism and of how the linguist more responsible than anyone else for its demise was also notable for a distrust of empirical data-gathering and statistics himself.

I find it hard to imagine that Pynchon had not read Chomsky’s “The Responsibility of Intellectuals” and other writings in which he described the university’s involvement in counter-insurgency and propaganda for the U. S. government. He may well have known that Chomsky’s work in devising a computational and rationalist theory of linguistics had also been funded in part by the military-scientific research industry they both so forcefully criticized.

DH’s Secular History

I was thinking about the real nature of synthesis and control while reading through the recent issue of differences devoted to the digital humanities. Or, rather, to the shadows of the digital humanities.4 Some of the papers were derived from those presented at the “Dark Side of the Digital Humanities” conference held in Milwaukee in 2013. I was able to read David Golumbia’s article in draft5, and I hope to have some more to say about it in another post. Brian Lennon’s article, “The Digital Humanities and National Security,” argues that there is a dialectic in philology between historical humanism and rationalism, which he finds in Raymond Llull’s “combinatorial unilingualism.”6 The rationalist strain of philology inspired cryptological and quantitative approaches to literary study in the late 19th C. Prominent among these disreputables were Baconian cipherists, and Lennon suggests that the conservative temperament of these anti-Stratfordians (along with their rationalist dispositions) affected the development of the intelligence services in the 20th C.

In particular, the utility of literary studies for the national security apparatus grew directly from a mindset favoring “aggregation and documentation” over interpretation and critical debate.7 Lennon quotes from an MLA presidential address by John M. Manly in 1927, and it’s a good find. One of the things that JSTOR and other databases have helped with, often in unacknowledged ways, is eliminating the prejudice against the dusty and unindexed tome. I also like to think that the profession is less Whiggish than it was even when I was an undergraduate, and the database deserves much of the credit here. Lennon next follows Robin Winks’s Cloak and Gown8 and some other sources on academic complicity with the OSS and later the CIA through the first post-war decade. Vietnam, as I alluded to earlier through the example of Chomsky, prompted some pushback against this widespread integration of academic research with the state security apparatus; but Lennon follows Winks in claiming that this resistance had largely subsided by the 1980s.9

In the third section, Lennon turns to the puzzling question of why digital humanists have been reluctant to historicize their research. The question is less puzzling when you consider that almost no digital humanist or humanities computing practitioner would recognize or accept Baconian cipherists10 as legitimate ancestors, though I could see Raymond Llull (and Leibniz) being more congenial to those inveterate combinatorialists who haunt us even still. Fewer would consider themselves cyberlibertarians, though this identification is much closer in time and conceptual space. The guiding model for some digital humanities research, as I see it, is gloating about catching teachers cheating on standardized tests through statistical analysis and then concluding that more standardized testing is called for.11

“The Shapes of Towers”

There are no proverbs for paranoids sufficient to navigate the surveillance state we now inhabit. A ready example: a sociologist and internet researcher recently documented her efforts at hiding her pregnancy from the internet, which required the use of anonymous browsing technology (Tor) known to be penetrated by intelligence services (if not yet by the marketers) and eventually resulted in a report of suspicious economic activity.12 We fill pre-existing internet forms, as the saying goes, and when we fill them we change them and are changed.13 This filling depends on machine learning and textual analysis techniques whose methodological assumptions are often at odds with, if not outright inimical to, those of literary analysis or cultural studies.

So, what happens when literary scholars begin to experiment with these technologies?

Can a ballyhooed turn in the humanities, especially in literary studies, that promotes a putatively novel computational textual analytics including textual and other data “visualization” possibly be or remain isolated from the cultural-analytic and specifically textual-analytic activities of the security and military intelligence organizations that are the university’s neighbors—especially when such a turn is represented as a historic opportunity made possible by historic advances in information technology?14

Lennon’s answer is “It seems unlikely.” I have myself dabbled in computational textual analytics, even going so far as to answer a CS graduate student’s question about a variant of the LDA algorithm.15 Though I toil in deserved obscurity, I follow more visible efforts with attention and compassion. I understand the “historic opportunity” rhetoric of mass digitization, and I’m legitimately excited by the potential for disciplinary self-study and generic evolution afforded by the digital archive. Citation analysis is transparently invoked as evidence of the inconsequence of humanities scholarship by those who seem to despise it, and yet I have spent a great deal of time creating co-citation network visualizations.16 Network analysis works wonders for counterinsurgency and population control. Kieran Healy’s humorous post on Paul Revere17 illustrates this well.

I am tempted to answer Lennon’s challenge with an appeal to technological neutrality.18 Topic-modeling is an information-retrieval technology that offers a higher order of search capability. No one needs this more than the intelligence agencies, just as they would benefit immensely from automated translation. The ominous role of machine translation and the way that it interacts with the discipline of comparative literature in particular is the subject of much of Lennon’s research, and I can see how the argument would carry over. A sustained analysis of a “Digital Humanities Questions and Answers” post about the ethics of accepting military research funding follows in Lennon’s argument. He notes that a forum respondent mentioned a statement by anthropologists concerned over their research being used in counter-insurgency contexts. The relative lack of concern among digital humanists, compared to the anthropologists, is taken as further evidence of the complicity of DH as a research practice with the larger national security state. It’s not clear to me that the situations are directly comparable, however, largely because DH research is mediated through the more technical fields that receive massive amounts of funding for these purposes.

A Nice Future

In his final section, Lennon makes the piquant claim that the “inability to define ‘digital humanities’ means that anyone willing to be sufficiently cheerful in the act can don the digital humanities hat at will.”19 This is a dig at the legendary “niceness” of the DH community, which is sometimes perceived as passive-aggression by those who do not feel wholly a part of it. At least other facets of the humanities are forthright about their elitism, perhaps someone has thought at one time or another. He counsels DH “enthusiasts” to turn to comparative literature and its debates in recent years as a guide to understanding their present historical dilemma.

Though it is not mentioned in this essay, Lennon has sometimes referred to a Harvard Business Review feature20 that documents decreasing job prospects for information sector workers, defined broadly. I think that Lennon attributes this to progressive (perhaps “recursive”) automation in the industry, which will gradually reduce labor demand in this area. I suspect that outsourcing is more immediately relevant as an explanation for this trend, but the overall point is worth contemplating. It is relatively easy to imagine a turn away from computation as an engine of cultural and economic activity, though the biological turn, which is the most easily imaginable substitute, would be derived from a heavily computational science itself. What does the post-computational future look like? Does philology (or any other branch of cultural analysis associated with the humanities) exist in it?

These questions may seem a bit ridiculous, but I remember hearing Katherine Hayles predict that the traditional English Department would go the way of Classics. That was fourteen years ago, and I admit that I found it an implausible claim at the time. But perhaps the process was already underway.


  1. Though Gravity’s Rainbow is not traditionally thought of as a modernist novel, I thought that it anticipates, through its polymorphous, dope-addled antics, much of the early cultural criticism on fascism and sexuality.

  2. Thomas Pynchon, Gravity’s Rainbow (New York: Viking Press, 1973), 167.

  3. That Pointsman is also obviously insane probably figured into his decision-matrix a bit.

  4. http://differences.dukejournals.org/content/25/1.toc

  5. David Golumbia, “Death of a Discipline,” differences 25, no. 1 (2014): 156–76, doi:10.1215/10407391-2420033.

  6. Brian Lennon, “The Digital Humanities and National Security,” differences 25, no. 1 (2014): 134, doi:10.1215/10407391-2420027.

  7. Ibid., 135.

  8. Robin W. Winks, Cloak & Gown: Scholars in the Secret War, 1939–1961 (New Haven: Yale University Press, 1996).

  9. Lennon, “The Digital Humanities and National Security,” 138.

  10. I do not think that the Oxfordians relied on ciphers to make their case. A sociologically relevant fact?

  11. The example comes from Steven D. Levitt and Stephen J. Dubner, Freakonomics: A Rogue Economist Explores the Hidden Side of Everything (HarperCollins, 2011), of course, though I see its influence as more subtle and aspirational in many digital humanities projects.

  12. Janet Vertesi’s presentation at the “Theorizing the Web” conference is described here: http://mashable.com/2014/04/26/big-data-pregnancy/. About ten years ago at a conference, I met an academic affiliated with the same institution who had her Social Security number prominently displayed on her web-hosted CV.

  13. I refer to a poem by Frank Bidart, “Borges and I,” which is taken somewhat out of context for the epigraph to David Foster Wallace’s The Pale King. Lennon has called Wallace’s work and its reception, I should note in full disclosure, a “very suitable foil.” See http://www.personal.psu.edu/bul5/blog/2013/02/22/n-thegreatunwritten/

  14. Lennon, “The Digital Humanities and National Security,” 142–143.

  15. See http://www.jgoodwin.net/?p=1043.

  16. I wrote about this tension here, http://jgoodwin.net/?p=1329, after reading a claim that “82%” of humanities scholarship goes uncited.

  17. “Using Metadata to find Paul Revere,” http://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/.

  18. Chomsky, in a relevant and interesting context, remarks, “A hammer can be used to smash someone’s skull in, or to build a house. The hammer doesn’t care. Technology is typically neutral; social institutions are not.” See http://zcomm.org/wp-content/uploads/ScienceWars/forumchom.htm. Torture instruments seem like a relevant counterexample here, but I suppose there’s always surgery.

  19. Lennon, “The Digital Humanities and National Security,” 146.

  20. http://hbr.org/2013/11/americas-incredible-shrinking-information-sector

Citation Metrics

Two stories caught my attention yesterday. The first was a review of some recent studies of citation practices by field, broadly considered. The claim that alarmed a number of people on twitter was that “82%” of humanities scholarship was never cited. I pointed out that it was a mistake to assume that “never cited” means “never read.” That someone would even make this inference is quite mysterious to me.

Let me explain: this semester, I have been teaching, for the first time, a course on the Victorian novel. I am teaching this class because our department’s primary Victorianist has recently become the director of our graduate program and thus was unable to teach a course in her normal rotation. The texts that I assigned were Villette, Bleak House, Lady Audley’s Secret, Daniel Deronda, Jude the Obscure, and Dracula. (That’s about 3500 pp. of reading, which I’m now thinking might have been a bit much.) Since I have never taught any of these texts before, I have read as much scholarship on them as possible in preparation. I estimate that I’ve read at least twenty articles or book chapters per book. Nothing I have encountered in my seventeen years in the profession has led me to believe that there’s anything unusual about this. Professors routinely consult scholarship in preparation for their teaching, including many sources they will never cite in their own scholarship.

There are several reasons for this: 1) most people who teach in humanities departments do not publish very much in absolute terms, so they will not be citation-providers; 2) people who do publish scholarship have, most of the time, to teach a wide variety of things that have nothing to do with their scholarship, yet they read scholarship to prepare (I’m aware that there are a small number of professors who have not read anything new in x amount of years, but this is mostly a stereotype rarely met in sublunary lands); and 3) scholars read many things in their research that inform their understanding of their subject but that they never cite.

I know this last point might be the most questionable. Some journals seem to encourage the footnote that mentions a broad range of background sources, but this practice is far from universal. Citations are often given in a rote or formulaic sense, with a ritualistic nod to some authority that may have little bearing on the subject at hand. There is much more to say about this point, but I now want to consider the question of measuring access. It would be normal to wonder if database-access metrics could provide librarians, scholars, and other interested parties with the necessary information to determine whether a source is being used even if it is not being cited. Anyone who has examined their own server logs knows, however, that determining whether a machine or a human is at the other end of an HTTP request is more difficult than it seems. Paywalls mitigate some but far from all robotic access. I have no idea whether the access statistics that JSTOR provides, for example, make any attempt to separate human requests from robotic ones. A casual examination of the most-accessed lists of various journals will show significant variance from their most-cited lists, which supports my broader point.
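
As an aside, the naive approach is to sort requests by user-agent string, which catches only the robots polite enough to identify themselves. Here is a minimal sketch in Python; the log format and the list of bot markers are illustrative assumptions, not a description of how JSTOR or anyone else actually does this:

```python
import re
from collections import Counter

# Combined-log-format pattern (an assumption about the log layout).
LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"')

# A tiny, obviously incomplete list of self-identifying crawlers.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp", "curl", "wget")

def classify(line):
    """Label one access-log line as 'robot', 'human', or 'unparsed'."""
    m = LINE.match(line)
    if not m:
        return "unparsed"
    agent = m.group("agent").lower()
    return "robot" if any(marker in agent for marker in BOT_MARKERS) else "human"

with open("access.log") as log:
    print(Counter(classify(line) for line in log))
# Scrapers that spoof ordinary browser user-agents all land in the
# "human" bucket, which is exactly the difficulty described above.
```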

The second outrage came from the University of New Hampshire, where thousands of library books were found in a dumpster. I recall being outraged by the discovery of some Nation magazines from the 1930s in the dumpster at the University of Florida when I was a graduate student, but thousands of books is on a different scale. A librarian is quoted in the article as saying that the books in question had not been checked out for a very long time and thus, the implication goes, were not needed. There are many disturbing aspects to this story, but the one that most immediately came to my mind is the difference between a book being checked out and a book being read in the library. I frequently consult and re-shelve books in libraries. I’m not the only one. When I was an undergraduate, I worked in the UNC-Wilmington library, and I know that statistics were collected on books that had to be re-shelved. Even if UNH did this, which is far from clear, it would not count patron re-shelving. (I could be wrong about this, but my impression is that it’s not common for libraries to collect re-shelving data, and I’m not sure to what use it was put at UNCW.)

I understand that space in a library is finite, though why you would throw books away rather than having a community sale/giveaway is beyond my personal comprehension. I understand that Georgia Tech’s library is moving away from a books-included model, and they had conspicuously few to begin with when I taught there several years ago. In the spirit of self-criticism, I will now ask myself whether my interests in visualizing co-citation graphs and the use of quantitative methods for disciplinary history are at odds with my belief that libraries should have books in them and that scholarship is valuable in ways that citation metrics cannot measure.

They are not.

Metadata and Co-Citation Graphs

I will conclude this missive with an overview of some recent advances I’ve made in improving the d3.js co-citation graphs that I first wrote about here. I have experimented with various measures to increase the utility and visibility of these graphs: creating a threshold slider (based on in-degree of co-citation nodes), adding a chronological dimension, adding expandable and contractible hulls around the communities, analyzing the composition of the communities at different threshold levels, and, most recently, adding animations to show how the network grows on a year-by-year basis. (“Animation” is an aspirational term, as I have not yet been able to adjust the underlying data structures in such a way as to use d3.js’s nifty smooth transitions. Elijah Meeks was kind enough at the recent Texas DH conference to show me this intriguing code that performs many of the stupendous operations I have in mind. Javascript is far from my best symbolic instruction code, however.)

My most recent advance was adding the (generally cryptic and incomplete) metadata that Web of Science offers about citations to the graphs. I have now created three co-citation graphs that will show metadata on mouseover:

If Web of Science supplied a DOI, you should have a direct link to the article. If not, I provided links to search Worldcat and Google Scholar, though I should probably munge the search string to increase the likelihood of success. Though many famous articles and books can be recognized from author and date, not all can. (And of course this varies depending on your knowledge of the field.) I hope this improves the usefulness of these as an exploratory tool. What conclusions about a discipline’s history and formation can be responsibly drawn from this data? I’ve been thinking about this question and the somewhat related question of how citation metrics can be correlated with topic models for several months now. I don’t have any conclusions interesting enough to share at this moment, but I’m optimistic about the heuristic value of these search-and-visualization tools for pedagogy.
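
For the curious, the “munging” amounts to building a cleaned-up query out of whatever author, year, and source fields a record supplies when there is no DOI. A rough sketch in Python, with the URL patterns and field handling as illustrative assumptions rather than a description of the code behind the graphs:

```python
from urllib.parse import quote_plus

def fallback_links(author, year, source):
    """Build search links for a citation that lacks a DOI (illustrative only)."""
    # Strip the stray punctuation Web of Science sometimes leaves in its fields.
    query = " ".join(str(part) for part in (author, year, source) if part)
    query = " ".join(query.replace(",", " ").replace(".", " ").split())
    encoded = quote_plus(query)
    return {
        "worldcat": "https://www.worldcat.org/search?q=" + encoded,
        "scholar": "https://scholar.google.com/scholar?q=" + encoded,
    }

print(fallback_links("BUTLER J", 1990, "GENDER TROUBLE"))
```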

Some Notes on the MLA Job Information List

I don’t remember exactly when the MLA digitized all of the issues of the Job Information List, but I was excited about what these documents could tell us about institutional history, the job market, salary trends, and many other things. The PDFs hosted by MLA are image scans, however, which are not immediately searchable as plain text. A variety of OCR solutions are available, but I personally was too lazy to attempt to use any of them.

Jim Ridolfo, not suffering from this despicable sloth, used OCRkit to create searchable versions of the JIL. He then generously made them available. There are several ways to work with these documents after you’ve extracted them from the tar archive: you can search them with your machine’s built-in indexer (I’ve only tried this with the Mac OS); you can convert them to text documents with pdftotext or a similar tool and then use the regular command-line utilities; or you can skip the conversion and use grep with the appropriate flags to handle binary files, which I found too annoying to deal with personally. Converting each PDF to text with pdftotext requires the use of find, xargs, or a simple shell script, as globbing of the form pdftotext *.pdf will not work.
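
For anyone who would rather avoid find and xargs, a short Python loop accomplishes the same conversion. This is only a sketch, and it assumes that pdftotext is installed and on your PATH:

```python
import subprocess
from pathlib import Path

# Convert every OCRed JIL PDF in the current directory to plain text,
# skipping files that have already been converted.
for pdf in sorted(Path(".").glob("*.pdf")):
    txt = pdf.with_suffix(".txt")
    if txt.exists():
        continue
    # Equivalent to running, e.g., pdftotext some_issue.pdf some_issue.txt
    subprocess.run(["pdftotext", str(pdf), str(txt)], check=True)
```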

Following Ridolfo’s example, I first looked for the earliest appearances of various terms. The first mention of “post-colonial” was in an ad from Williams in 1983, and the first “science fiction” was in Clemson’s ad from 1974, for example. I discovered some brazen evidence of the gender dynamics of the profession from the mid-late 60s:

[Screenshot of the ad]

And another disturbing thing from that ad is the apparent evidence that salaries have not kept pace with inflation. The Bureau of Labor Statistics’ inflation calculator reveals $15.5K to be more than $98K today, for example, which I suspect is not an Asst./Assoc. salary at the same institution today.

[UPDATE: Jack Lynch on facebook (one of the few facebook comments I was able to see, so I apologize if someone else pointed this out as well), noted that the St. Cloud State ad actually mentions this as a range for Full Professor, which is a reasonable figure adjusted for inflation.]

[Screenshot of the ad’s salary listing]

With a bit more automation, I used grep, sed, and uniq to create a csv of the frequency of each search term by year, which I then imported into R and plotted. Here, for example, is a graph of the occurrences of the phrase “literary theory”:

[UPDATE: The figures are not normalized for the number of jobs (or words in the job ads), so keep that in mind. 2007 had many more ads than 2008, for example. Again, I saw that Jack Lynch pointed this out on facebook.]

[Graph: occurrences of “literary theory” in the JIL by year]

Even though this is a rough count because of OCR imperfections, overattention to verbose ads, and only counting the phrase itself, not jobs specifically asking for it alone, I still think this is a useful measure of the institutionalization of the concept. I also charted “Shakespeare”:

[UPDATE: Here is a graph to compare with the one above on "Shakespeare" that is normalized for the percentage of total words in all of the ads for that year. This is not the ideal way of counting its relative frequency. It's quite possible that the OCR does a better job with the more modern typesetting, and I haven't investigated this thoroughly.]

[Graph: “Shakespeare” in the JIL]

“Medieval”

[Graph: “Medieval” in the JIL]

“Post-Colonial” and “Postcolonial”

[Graph: “Post-Colonial” and “Postcolonial” in the JIL]

“Feminist”

[Graph: “Feminist” in the JIL]

These graphs are just a simple and preliminary indication of what can be done with this data. With more attention, one could build a queryable database that generates custom graphs of changes in sub-fields over time. Slurping all of the salary ranges out of these ads and charting their growth (or lack thereof) relative to inflation could give us some more insight into the economic realities of the job market.
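
For the record, here is roughly what the counting step looks like in Python rather than grep, sed, and uniq. It assumes the converted text files have the year somewhere in their filenames, which may not match how Ridolfo’s archive is actually organized:

```python
import csv
import re
from collections import defaultdict
from pathlib import Path

TERMS = ["literary theory", "shakespeare", "medieval",
         "post-colonial", "postcolonial", "feminist"]

def empty_counts():
    counts = {term: 0 for term in TERMS}
    counts["total_words"] = 0
    return counts

by_year = defaultdict(empty_counts)

for txt in sorted(Path(".").glob("*.txt")):
    match = re.search(r"(19|20)\d{2}", txt.name)  # pull a year out of the filename
    if not match:
        continue
    year = int(match.group(0))
    text = txt.read_text(errors="ignore").lower()
    for term in TERMS:
        by_year[year][term] += text.count(term)
    by_year[year]["total_words"] += len(text.split())  # for normalizing, per the update above

with open("jil_term_counts.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=["year", "total_words"] + TERMS)
    writer.writeheader()
    for year in sorted(by_year):
        writer.writerow({"year": year, **by_year[year]})
```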

The Distribution of PhDs in Modernism/modernity

Modernism/modernity is an important and relatively new journal (1994-) that publishes interdisciplinary work in modernist studies. Though I’ve never submitted an article to it (I did publish a book review there), I’ve long heard that it is very difficult to publish in. The last time I checked, the journal did not submit acceptance statistics to the MLA Directory of Periodicals (these statistics make for interesting reading if you’ve never looked at them, by the way).

I thought it would be interesting and sociologically revealing to chart the PhD institutions of the journal’s authors. I decided to check only those who had published research articles—not review-essays, shorter symposium pieces, questionnaires, or book reviews. There were 358 unique authors, along with twenty-two whose PhD institutions I could not track down. (These were mostly UK-based academics. Many UK departments do not list the PhD institution of their academics, while virtually all US ones do.) There were also approximately ten authors who had published more than one article there (six times was the record).

I had hoped that there would be a way to automate this tedious procedure, involving a web crawler and perhaps automatically querying the Proquest dissertation database; but it quickly became evident that any automation I was capable of devising would be error-ridden and require as much time to check as doing it all by hand. Out of resignation, stubbornness, and a deeply misplaced set of priorities, I checked them all by hand and graphed the results:

[Chart: PhD institutions of Modernism/modernity authors, minimum of three authors each]

The above image displays all institutions with at least three authors. Those schools with two authors each were: Boston U, Brandeis, Essex, Exeter, London, Manchester, McGill, McMaster, Monash, Oregon, Sheffield, SUNY-Buffalo, Temple, Texas, Tufts, Uppsala, and WUSTL. There were 49 universities with one author each.
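
If the hand-collected data were kept in a two-column CSV of author and PhD institution (a hypothetical layout, not a description of my own working file), the tallying and filtering behind the chart would be trivial:

```python
import csv
from collections import Counter

# Hypothetical file: one row per research-article author, "author,phd_institution".
with open("mm_authors.csv", newline="") as f:
    institutions = [row["phd_institution"].strip()
                    for row in csv.DictReader(f) if row["phd_institution"].strip()]

counts = Counter(institutions)

# Institutions with at least three authors, as in the chart above.
for school, n in counts.most_common():
    if n >= 3:
        print(f"{school}\t{n}")
```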

From a statistical perspective, I don’t think there’s anything too unusual about this distribution. It is more tightly clustered among elite institutions than I would have guessed beforehand, however. Clancy told me that this project seemed to be a public shaming, which is not my intention at all. I do think that a comparison with another literary journal that publishes in modernist studies would reveal a broader distribution, but I think that this might be as easily explainable by Modernism/modernity‘s interdisciplinary focus as its elitism.

UPDATE (1/3/14):

I had a suggestion to divide the data chronologically. Here is the first half (minimum=2):
[Chart: first half of the data]

And the second:
[Chart: second half of the data]

Decluttering Network Graphs

A problem that many of the co-citation graphs I discussed in the last post share is that they are too dense to be easily readable. I created the sliders as a way of alleviating this problem, but some of the data sets are too dense at any citation-threshold. Being able to view only one of the communities at a time seemed like a plausible solution, but I was far from sure how to implement it using d3.js. Solutions that involved pre-processing the data the way that I did for the sliders didn’t seem to be very useful for this problem.

I realized two months ago that I wouldn’t have time to learn d3.js (and javascript in general) well enough to solve this problem this semester, as I’m working very hard on teaching, research, and service. A few moments of lazy, idle scheming today, however, led me to this intriguing gist. The hard work of identifying centroid nodes of different communities and only generating their force-directed graphs when selected has already been done here. I wanted to add text labels to the nodes, to make them more like my other graphs. (The tooltip mouseovers just don’t seem information-rich enough for this purpose, though the graphs are certainly tidier without visible labels.)

As I should have realized, making this minor adjustment was far from easy. I eventually figured out that I had to change the selection code from the DOM element “circle.node” to just the general grouping element “g.” With a few other tweaks to the force-layout settings, I tested it out with one citation graph that wasn’t too cluttered (compare here). By far the worst graph I’ve created, in terms of legibility, has been the philosophy of science one (see also here for an earlier attempt to make it more legible by adding a chronological slider).

Despite my best efforts, these floating balloons of philosophy of science aren’t a great improvement. Labeling the initial beach balls with the centroid node is probably a good idea, along with just some explanatory text. I do think a similar approach is the way forward with this particular technology for visualizing citation graphs. D3.js is very flexible and powerful, particularly in its ability to create and record illustrative animations. I hope to be able to do some more detailed work with it on citation graphs after the semester ends.

Some Thoughts on Twitter

Ted Underwood made the following comment on Scott Weingart’s post about a recent controversy with the Journal of Digital Humanities:

I can also imagine framing the issue, for instance, as a question about the way power tends to be exercised in a one-to-many social medium. I don’t know many academic fields that rely on Twitter as heavily as DH does. It certainly has as much power in the field as JDH (which, frankly, is not a high-profile journal). Right now digital humanists seem to be dividing into camps of Beliebers and nonBeliebers, and I’m less inclined to blame any of the people involved — or any structure of ideas, or system of peer review — than I am to suspect that the logic of Twitter itself encourages the formation of “teams.”

I like twitter more than facebook, probably because I choose to interact with people there with whom I share more intellectual interests. The follower/following relation leads to all sorts of status-seeking and rank-anxiety behavior, however. While it doesn’t surprise me that PR workers and journalists would buy twitter followers (available for about $1 per 1K fake accounts, apparently), I’m reasonably sure that I’ve seen academics do so. Choosing not to reciprocate a follow request is reasonable for any number of reasons—attention economy, missing a notification—but the spurned follower is not privy to any of this decision-making and may very well feel jilted and resentful. Next comes inevitable embarrassment for even noticing such a triviality, and this can cascade into what Wilhelm Fliess referred to as a “shame spiral.”

The logic of twitter compels its users to pay attention not only to their number of followers but also to the ratio of their followers to followings. Celebrity accounts–the standard of measure–are generally on the order of 1m to 1, so this is what the demotic tweeter aspires to. Influence algorithms, such as Klout, use this ratio as one way to assess the importance of an account; and I suspect that it is also used by automated discovery services such as (named with impeccable timing) Prismatic. Furthermore, links on twitter seem to have a clickthrough rate of about 1% (so I’ve observed personally), and I suspect this percentage decreases as the number of followers grows. In order for a link to spread efficiently, it has to be retweeted by many people. The more followers an account has, the more likely something posted from it will be retweeted. Underwood’s comment above references “Beliebers,” and anything that Justin Bieber (I’m not entirely sure who that is–perhaps a young soccer player or innovative badmintonist–but he has many followers on twitter) posts, no matter how trivial, will get many retweets.

What is to be done? Community formation on twitter seems like a fascinating area of research. The groupuscule of quasi-surrealists sometimes known as ‘weird twitter’ is apparently already the subject of a dissertation or two, and I could imagine very interesting work being done on the digital humanities community on twitter: network interactions, status effects, and the etiology and epidemiology of controversies…all sorts of wonders. I would be inclined to try some of this myself, but I find the twitter API somewhat cumbersome to use, and the amount of data involved is overwhelming.

The End of Breaking Bad

I wrote a couple of Breaking Bad commentaries last year after the end of the first part of the fifth season. There are now only four episodes left, and I’m not entirely sure if we’ll see anything else about Gustavo Fring’s past. I can see how the Lydia plot could have a flashback with Fring, but I don’t see how it could get all the way back to Chile. And that’s a shame if true, because I think there are some really useful political comparisons to be made between Walter White’s and Fring’s respective formative circumstances and economic policies.

Predicting the plot of a show that relies so strongly on flouting the probable is foolish, I suppose, but I would guess that the final four episodes will show Jesse attempting to lure White back into the meth production business. From the flash-forwards, we can see that his identity is known to the community at large, and that he also presumably availed himself of the esoteric vacuum-repairman’s new identity. I would guess that Hank allows Jesse to start cooking, or at least pretending to cook, Walter’s recipe again. Walter’s pride tempts him into an incriminating response, and he narrowly escapes arrest.

The problem with this scenario is that it doesn’t address the dynamic among Todd, his family, and Lydia. Walter’s already solicited their help in what is strongly hinted to be Jesse’s murder, but he doesn’t yet know about the industrial unrest caused by Declan’s sub-par production facilities and staff (Declan reminded me unpleasantly of Don Henley, but I’m aware that may be a wholly personal and private issue). Todd’s white supremacist relatives seem to want to establish a meth empire of their own, and Todd would seem to welcome Jesse, who knows Walter’s process much better than Todd ever learned it, as a capable and diligent partner.

Normally I wouldn’t imagine that Hank would go along with such a plan, especially now that Gomez knows about everything, but it’s clear to me that the early episodes have positioned him as an increasingly desperate figure willing to do anything to get revenge against Walter. So, to summarize: Hank allows Jesse to contact Todd. Jesse learns of the fortuitous slaughter and offers his services. The gang neglects to kill Jesse after seeing his newfound value. Walter dislikes Jesse’s initiative and attempts to intervene. Something happens, and Walter is forced into exile without being arrested or murdered. Perhaps Hank or Marie then talks to the media, and Walter eventually comes back for revenge (most likely against the meth-operation in some form or another).

The only issue that I can’t quite resolve here is Lydia’s motivation. She was willing to murder all of Mike’s staff last season, but only because they were a threat to her. She’s now motivated by greed alone, it would seem, which is perhaps unlikely for someone with her nervous disposition (and obvious accumulated wealth). Perhaps the next episode will explain her circumstances in greater detail. (Maybe Gomez was on Fring’s—and now Lydia’s—payroll the whole time, as I recall many people proposing…)

Citations to Women in Theory

After reading Kieran Healy’s latest post about women and citation patterns in philosophy, I wanted to revisit the co-citation graph I had made of five journals in literary and cultural theory. As I noted, one of these journals is Signs, which is devoted specifically to feminist theory. I didn’t think that its presence would skew the results too much, but I wanted to test it. Here are the top thirty citations in those five journals:

Butler J 1990 117
Jameson F 1981 90
Butler J 1993 72
Lacan J 1977 71
Derrida J 1978 64
Foucault M 1977 61
Chodorow Nancy 1978 60
Gilligan C 1982 60
Fish Stanley 1980 57
Foucault M 1978 56
Spivak G C 1988 54
Bhabha H K 1994 54
Derrida Jacques 1976 53
Benjamin W Illuminations 53
Foucault M 1980 52
Althusser L 1971 51
Said Edward W 1978 51
DE Man P 1979 50
Foucault M 1979 49
Laclau Enesto 1985 48
Hardt M 2000 48
Zizek Slavoj 1989 47
Derrida Jacques 1994 46
Benjamin Walter 1969 45
Lyotard J-f 1984 44
Foucault Michel 1980 44
Anderson B 1983 44
Williams Raymond 1977 42
Frye Northrop 1957 41
Fuss D 1989 40
Irigaray L 1985 40

There are eight women (I’m counting Chantal Mouffe) in the top thirty, and Judith Butler is the most-cited author. To test my intuition that literary theory journals cite female authors more than analytic philosophy journals do, I decided to replace Signs with College Literature. (Here is the co-citation network. Again, these work best with Safari and Chrome.)

Here are the top thirty most cited authors in that corpus:

Jameson F 1981 100
Lacan J 1977 75
Fish Stanley 1980 66
Derrida J 1978 65
Bhabha H K 1994 60
Benjamin W Illuminations 59
Butler J 1990 57
Derrida Jacques 1976 57
Althusser L 1971 56
Bakhtin M M 1981 56
Foucault M 1977 56
DE Man P 1979 52
Lyotard J-f 1984 49
Zizek Slavoj 1989 48
Frye Northrop 1957 48
Derrida Jacques 1994 48
Foucault M 1979 48
Benjamin Walter 1969 48
Hardt M 2000 46
Anderson B 1983 44
Laclau Enesto 1985 43
Marx K Capital 43
Said Edward W 1978 42
Gilroy P 1993 41
Barthes Roland 1977 41
Williams Raymond 1977 40
Freud S Interpretation Dream 40
Jameson Fredric 1991 40
Culler Jonathan 1975 40
Bass Alan 1982 40
Derrida J 1981 39

Butler and Mouffe (whose name doesn’t appear because of the way the citation data is formatted) are the only women in the top thirty (unless I missed something!).

I don’t want to draw any major conclusions from this data, but I’m a bit surprised. Neither of these citation corpora has been cleaned up as much as Healy’s, for instance, and the choice of journals clearly affects the outcome. The journals I chose were ones that I happened to think might be representative of literary theory and also happened to be in the Web of Science database; many obvious candidates were not.
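
For context, the lists above are simple tallies of the cited-reference field in the Web of Science records. A minimal sketch, assuming a tab-delimited export with a “CR” column of semicolon-separated cited references; the export format varies, so treat this as illustrative:

```python
import csv
from collections import Counter

counter = Counter()
with open("savedrecs.txt", encoding="utf-8-sig", newline="") as f:
    for record in csv.DictReader(f, delimiter="\t"):
        # CR holds the record's cited references, separated by semicolons.
        for ref in (record.get("CR") or "").split(";"):
            ref = ref.strip()
            if ref:
                counter[ref] += 1

# Raw CR strings usually carry volume, page, and DOI information as well,
# so some trimming is needed to reduce them to entries like "Butler J 1990".
for ref, count in counter.most_common(30):
    print(count, ref)
```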

Citational Network Graph of Literary Theory Journals

I’ve been interested in humanities citation analysis for some time now, though I had been somewhat frustrated in that work by JSTOR pulling its citation data from its DfR portal a year or so ago. It was only a day or two ago with Kieran Healy’s fascinating post on philosophy citation networks that I noticed that the Web of Science database has this information in a relatively accessible format. Healy used Neal Caren’s work on sociology journals as a model. Caren generously supplied his python code in that post, and it’s relatively straightforward to set up and use yourself.*

My first experiments with using Caren’s method were on the Journal of American Folklore, as a meta-analysis of that journal is the subject of an article that John Laudun and I have coming out in a few months, and John has been interested in folklore’s citation patterns for some time now. Here is the network graph** of the co-citations in that journal from 1973 to the present. (Web of Science’s data generally ends around this time; JSTOR’s did not, though my impression is that the WoS data is a bit cleaner.) Co-citational analysis and the community-detection algorithm produce much better results than my earlier efforts at citational network analysis. (Healy’s post does a very good job of explaining what co-citation is and why it’s a useful way of constructing the network relations.) I then built two models of PMLA: sparse and larger. Even the sparse graph had only half the threshold of Caren’s original code, which worked on several journals rather than just one. So I decided that I needed more data to get better results.

Several months ago I built a topic model of six journals oriented towards literary theory. Somehow correlating a topic model with the journals’ citation network is something I’ve been interested in for some time. The first step here would be actually building the citation network of those journals. Unfortunately, boundary 2 and Social Text are not in the Web of Science database. To compensate, I added Signs, a journal of feminist theory that I had also topic-modeled, though the results are not going to be directly comparable to the theory-corpus topic model.

This corpus ended up being larger than Healy’s or Caren’s, so I had to adjust the threshold up to 11 to make it manageable. A drawback of D3.js is that it’s very processor-intensive. Here is an image of the network of the five journals:

[Figure: co-citation network of the five theory journals]

And here is the draggable network graph. The central nodes identified by the algorithm are Judith Butler’s Gender Trouble (1990) [red], Gayatri Spivak’s Can the Subaltern Speak? (1988) & Edward Said’s Orientalism (1978) [light orange], Jacques Derrida’s Writing and Difference (1978) and Positions (1981) [light purple], Michel Foucault’s The Archaeology of Knowledge (1972) & Stanley Fish’s Is There A Text in This Class? [blue], Fredric Jameson’s The Political Unconscious (1981) (plus Althusser’s Lenin and Philosophy [1971]) [salmon pink], Carol Gilligan’s In A Different Voice (1982) & Nancy Chodorow’s The Reproduction of Mothering (1978) [orange], Pierre Bourdieu’s Distinction (1984), Michael Hardt and Antonio Negri’s Empire (2000) & Giorgio Agamben’s State of Exception (2005) [purple], and Jacques Lacan’s Ecrits (1977) [brown]. There is also a green Paul de Man node. Outliers include Hegel, Caruth, Clifford, Cavell, Wordsworth & Coleridge, and an interesting Latour-Bakhtin-Shapin nexus.

I would have liked to have explored this graph in D3 with a lower threshold, but my machine doesn’t have the processing power to handle that many nodes. I have been very happy using gephi in the past, but a java update seemed to make it stop working on my system. More interesting and perhaps unexpected results would appear at lower thresholds, I suspect, but I’m going to have to use another tool to visualize them. The results at this threshold meet my standard of face validity about the history of literary theory since the early 70s, though others might hold different opinions (it’s a contentious subject!).

UPDATE (6/23/13): I made a version of the dynamic graph that allows you to adjust the citation-threshold. There are also versions of a modernist-studies journals citation graph and one for journals in rhetoric and composition. And here is a post explaining the technical details.

*It relies on a couple of modules that are not installed by default on most people’s machines, I believe. First you need to clone Drew Conway’s fork (at the command line, git clone git://github.com/drewconway/networkx will do it). Then you need to download this implementation of the Louvain network community detection algorithm. All of these files need to be in the same directory as Caren’s script. I was unable to install the networkx fork on my Mac OS machine with pip, easy_install, or anything else; but the local import worked fine. Once you have set this up, you’ll need to modify the filename in the script by hand to point to your results. You can also change the threshold by changing the constants in this line: if edge_dict[edge]>3 and cite_dict[edge[0]]>=8 and cite_dict[edge[1]]>=8:. Web of Science will only allow you to download 500 records at a time; you can either write a script to concatenate your results or do it by hand.
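
Since the footnote only quotes the thresholding line, here is a compressed sketch of the counting that precedes it, building per-reference citation counts and co-citation pair counts from a tab-delimited Web of Science export. The field names and delimiters are assumptions about the export format, and this is a simplification of Caren’s script rather than a copy of it:

```python
import csv
from collections import Counter
from itertools import combinations

cite_dict = Counter()   # how often each reference is cited
edge_dict = Counter()   # how often each pair of references is cited together

with open("savedrecs.txt", encoding="utf-8-sig", newline="") as f:
    for record in csv.DictReader(f, delimiter="\t"):
        refs = sorted({r.strip() for r in (record.get("CR") or "").split(";") if r.strip()})
        cite_dict.update(refs)
        edge_dict.update(combinations(refs, 2))

# The same filter quoted above: keep a pair only if it co-occurs more than
# 3 times and each member is cited at least 8 times overall.
edges = [pair for pair in edge_dict
         if edge_dict[pair] > 3 and cite_dict[pair[0]] >= 8 and cite_dict[pair[1]] >= 8]
print(len(edges), "edges pass the threshold")
```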

**All of these graphs use the D3.js library, which is very well designed and aesthetically pleasing. It renders very slowly on Firefox, however. Chrome and Safari give a much better viewing experience. (I have no idea about Internet Explorer.)

Interpreting Topics in Law and Economics

Of the many interesting things in Matthew Jockers’s Macroanalysis, I was most intrigued by his discussion of interpreting the topics in topic models. Interpretation is what literary scholars are trained for and tend to excel at, and I’m somewhat skeptical of the notion of an “uninterpretable” topic. I prefer to think of it as a topic that hasn’t yet met its match, hermeneutically speaking. In my experience building topic models of scholarly journals, I have found clear examples of lumping and splitting—terms that are either separated from their natural place or agglomerated into an unhappy mass. The ‘right’ number of topics for a given corpus is generally the one which has the lowest visible proportion of lumped and split topics. But there are other issues in topic-interpretation that can’t easily be resolved this way.

A problem I’ve found in modeling scholarship is how “evidence/argument words” are always highly represented in any given corpus. If you use hyperparameter optimization, which weights topics according to their relative proportion in the corpus, words like “fact evidence argue make” tend to compose the most representative topics. Options include simply eliminating the topic from the browser, which removes a large number of documents that would otherwise be classified under it, or trying to add all of the evidence words to a stop list. The aggressive pursuit of stop-words degrades the model, though this observation is more of an intuition than anything I can now document.

I thought it might be helpful to others who are interested in working with topic models to create several models of the same corpus and look at the effects created by small changes in the parameters (number of topics, lemmatization of corpus, and stop-words). The journal that I chose to use for this example is the Journal of Law and Economics, for both its ideological interest and methodological consistency. The law-and-economics movement is about as far away from literary studies as it’s possible to be while still engaging in a type of discourse analysis, I think, and I find this contrast both amusing and potentially illuminating. That the field of law-and-economics is perhaps the most well-known (even infamous) example of quantified reasoning used in support of what many view as a distinct political agenda is what led me to choose it to begin to explore the potential critical usefulness of another quantitative method of textual analysis.

I began by downloading all of the research articles published in the journal from JSTOR’s Data for Research. There were 1281 articles. I then converted the word-frequency lists to bags-of-words and created a 70-topic model using MALLET.* The browsable model is here. The first topic is the most general of academic evidence/argument words: “made, make, case, part, view, difficult. . .” I was intrigued by the high-ranking presence of articles by Milton Friedman and R. H. Coase in this topic; it would be suggestive if highly cited or otherwise important articles were most strongly associated with the corpus’s “evidence” terms, but I can’t say that this is anything other than coincidence. The next topic shows the influence of the journal’s title: “law, economics, economic, system, problem, individual.” The duplication of the adjective and noun form of “economics” can be eliminated with stemming or lemmatizing the corpus, though it is not clear if this increases the overall clarity of the model. I noticed that articles “revisiting” topics such as “social cost” and “public goods” are prominent in this topic, which is perhaps explainable by an unusually high proportion of intra-journal citations. (I want to bemoan, for the thousandth time, the loss of JSTOR’s citation data from its API.)
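
As an aside on the mechanics, here is a sketch of the conversion from DfR word-frequency lists to bag-of-words documents that MALLET can import. The wordcount files are assumed to be per-article CSVs of word/count pairs (check the column names in your own download), and the MALLET commands in the comment are the standard ones rather than my exact invocation:

```python
import csv
from pathlib import Path

src = Path("wordcounts")   # per-article CSVs from Data for Research
dst = Path("bags")         # one plain-text "bag of words" per article
dst.mkdir(exist_ok=True)

for f in src.glob("*.csv"):
    words = []
    with open(f, newline="") as fh:
        reader = csv.reader(fh)
        next(reader, None)                     # skip the header row
        for row in reader:
            word, count = row[0], int(row[1])  # assumes word, count columns
            words.extend([word] * count)       # repeat each word by its count
    (dst / (f.stem + ".txt")).write_text(" ".join(words))

# Then, roughly:
#   mallet import-dir --input bags --output jle.mallet \
#          --keep-sequence --remove-stopwords
#   mallet train-topics --input jle.mallet --num-topics 70 \
#          --optimize-interval 10 --output-doc-topics doc_topics.txt \
#          --output-topic-keys topic_keys.txt
```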

The next two topics are devoted to methodology. Econometric techniques dominate the content of the Journal of Law and Economics, so there’s no surprise that topics featuring those terms would be this widely distributed. Of the next three topics, one seems spuriously related to citations and the other two are also devoted to statistical methodology. It is only the eighth topic that is unambiguously associated with a recognizable subject in the journal: market efficiency. Is this apparent overemphasis on evidence/methodology a problem? And if so, what do you do about it? One approach would be to add many of the evidence-related words to a stop-list. Another would be to label all the topics and let the browser decide which are valuable. Here is a rough attempt at labeling the seventy-topic model.

The number of topics generated is the most obvious and effective parameter to adjust. Though I ended up labeling several of the topics the same way, I’m not sure that I would define those as split topics. The early evidence/methodology related topics do have slightly distinct frames of reference. The topics labeled “Pricing” also refer to different aspects of price theory, which I could have specified. The only obviously lumped-together topic was the final one, with its mixture of sex-worker and file-sharing economics. If there is evidence of both lumping and splitting, then simply adjusting the number of topics is unlikely to solve both problems.

An alternative to aggressive stop-wording is lemmatization. The Natural Language Toolkit has a lemmatizer that calls on the WordNet database. Implementation is simple in python, though slow to execute. A seventy-topic model generated with the lemmatized corpus has continuities with the non-lemmatized model. The browser shows that there are fewer evidence-related topics. Since the default stop-word list does not include the lemmatized forms “ha,” “doe,” “wa,” or “le,” it aggregates those in topics that are more strongly representative than the similar topics in the non-lemmatized model. Comparing the labeled topics with the non-lemmatized model show that there are many direct correspondences. The two insurance-related topics, for instance, have very similar lists of articles. The trend lines do not always match very well, which I believe is caused by the much higher weighting of the first “argument words” topic in the lemmatized corpus (plus also issues about the reliability of graphing these very small changes).
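
The lemmatization step itself is only a few lines. This sketch uses NLTK’s WordNet lemmatizer with its default (noun) part of speech, which is presumably what yields stray forms like “ha,” “doe,” and “wa” in the first place:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # the lemmatizer needs the WordNet data

wnl = WordNetLemmatizer()

def lemmatize_bag(text):
    """Lemmatize a whitespace-tokenized bag of words with the default (noun) POS."""
    return " ".join(wnl.lemmatize(token) for token in text.lower().split())

print(lemmatize_bag("Prices costs markets firms contracts"))
```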

Labeling is inherently subjective, and my adopted labels for the lemmatized corpus were both whimsical in places and also influenced by the first labels that I had chosen. As I mentioned in my comments on Matthew Jockers’s Macroanalysis, computer scientists have developed automatic labeling techniques for topic models. While labor-intensive, doing it by hand forces you to consider each topic’s coherence and reliability in a way that might be easy to miss otherwise. The browser format that shows the articles most closely associated with each topic helps label them as well, I find. It might not be a bad idea for a topic model of journal articles to label each topic based on the title of the article most closely associated with it; this technique would only mislead on deeply divided or clustered topics, or on those which have only one article strongly associated with it (a sign of too many topics in my experience).
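
Here is a sketch of that automatic-labeling shortcut. It assumes the document-topic proportions have already been parsed out of MALLET’s --output-doc-topics file into a plain matrix, and that a list of article titles is available in the same row order; MALLET’s output format varies between versions, so the parsing is left out:

```python
import numpy as np

def label_topics_by_top_document(doc_topic, doc_titles):
    """Label each topic with the title of the document that loads on it most heavily.

    doc_topic  : array of shape (n_docs, n_topics), rows summing to one
    doc_titles : list of n_docs article titles, in the same row order
    """
    doc_topic = np.asarray(doc_topic, dtype=float)
    return {topic: doc_titles[int(np.argmax(doc_topic[:, topic]))]
            for topic in range(doc_topic.shape[1])}

# Toy example with three documents and two topics.
proportions = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
titles = ["The Problem of Social Cost",
          "Separation of Ownership and Control",
          "The Federal Communications Commission"]
print(label_topics_by_top_document(proportions, titles))
```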

(UPDATE: My initial labeling of the tables below was in error because of an indexing error with the topic numbers. The correlations below make much more sense in terms of the topics’ relative weights, and I’m embarrassed that I didn’t notice the problem earlier.)

The topics were not strongly correlated with each other in either direction. In the non-lemmatized model, the only topics with a Pearson correlation above .4 were

EVIDENCE JOURNAL
ECONOMIC IDEOLOGY EVIDENCE
MODELING METHODOLOGY

The negative correlations below -.4 were

MODELING EVIDENCE
JOURNAL METHODOLOGY
MODELING JOURNAL
EVIDENCE METHODOLOGY
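
These correlations come from comparing topic proportions across documents. A sketch of the computation, again assuming the doc-topic matrix and the hand-made labels are already in hand:

```python
import numpy as np

def correlated_topic_pairs(doc_topic, labels, threshold=0.4):
    """Return topic pairs whose Pearson correlation across documents exceeds +/- threshold."""
    corr = np.corrcoef(np.asarray(doc_topic, dtype=float).T)  # topic-by-topic matrix
    positive, negative = [], []
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[0]):
            if corr[i, j] > threshold:
                positive.append((labels[i], labels[j], round(float(corr[i, j]), 3)))
            elif corr[i, j] < -threshold:
                negative.append((labels[i], labels[j], round(float(corr[i, j]), 3)))
    return positive, negative
```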

Ted Underwood and Andrew Goldstone’s PMLA topic-modeling post used network graphs to visualize their models and produce identifiable clusters. I suspect this particular model could be graphed in the same way, but the relatively low correlations between topics makes me a little leery of trying it. I generated a few network graphs for John Laudun’s and my folklore project, but we didn’t end up using them for the first article. They weren’t as snazzy as the Underwood and Goldstone graphs, as my gephi patience often runs very thin. (Gephi also has problems with the latest java update, as Ian Milligan pointed out to me on twitter. I intend to update this post before too long with a D3 network graph of the topic correlations.)

[UPDATE: 5/16/13. After some efforts at understanding javascript's object syntax, I've made a clickable network graph of correlations between topics in the lemmatized browser: network graph. The darker the edge, the stronger the correlation.]

The most strongly correlated topics in the lemmatized corpus were

METHODOLOGY MODELING
ARGUMENT WORDS PUBLIC GOODS
ARGUMENT WORDS ECONOMIC IDEOLOGY

Here is a simple network graph of the positively correlated topics above .2 (thicker lines indicate stronger correlation):

[Figure: network graph of positively correlated topics in the lemmatized model]

My goal is to integrate a D3.js version of these network graphs into the browsers, so that the nodes link to the topics and the layout is adjustable. I haven’t yet learned the software well enough to do this, however. The simple graph above was made using the R igraph package. [UPDATE: See here for a simple D3.js browser.]

And the negative correlations:

METHODOLOGY ARGUMENT WORDS
ARGUMENT WORDS MODELING
MODELING AMERICA?

The fact that some topics appear at the top of both the negative and positive correlations in both of the models suggests to me that some artifact of the hyperparameter optimization process is responsible in a way that I don’t quite grasp (though I am aware, sadly enough, that the explanation could be very simple). The .4 threshold I chose is arbitrary, and the correlations follow a consistent and smooth pattern in both models. The related articles section of these browsers is based on Kullback-Leibler divergence, a metric apparently more useful than Manhattan distance. It seems to me that the articles listed under each topic are much more likely to be related to one another than those grouped by any metric I’ve used to compare the overall weighting of topics.
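
For the record, the “related articles” ranking amounts to computing the divergence between two documents’ topic distributions and sorting. A minimal sketch, not the code the browsers actually use; note that KL divergence is asymmetric and needs a little smoothing to avoid division by zero:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between two topic distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def most_related(doc_topic, doc_index, n=5):
    """Indices of the n documents whose topic mixtures are closest to doc_index's."""
    doc_topic = np.asarray(doc_topic, dtype=float)
    target = doc_topic[doc_index]
    scores = [(kl_divergence(target, row), i)
              for i, row in enumerate(doc_topic) if i != doc_index]
    return [i for _, i in sorted(scores)[:n]]
```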

Another way of assessing the models and label-interpretations is to check where they place highly cited articles. According to google scholar, the most highly cited article** in Journal of Law and Economics is Fama and Jensen’s “Separation of Ownership and Control.” In the non-lemmatized model, it is associated with the AGENTS AND ORGANIZATIONS topic. It appears in the topic I labeled INVESTORS in the lemmatized corpus, but further reflection shows that these terms are closer than I first thought. My intuition, as I have mentioned before in this discussion of Pierre Nora’s “Between Memory and History,” is that highly cited articles are somehow more central to the corpus because they affect the subsequent distribution of terms. The next-most cited article, Oliver Williamson’s “Transaction-cost Economics: The Governance of Contractual Relations” appears, suitably enough, in the topics devoted to contracts in both browsers. And R. H. Coase’s “The Federal Communications Commission” is in the COMMUNICATIONS REGULATION topic in both browsers, a topic whose continuing theoretical interest to the journal was established by Coase’s early article.

As I mentioned in the beginning, I chose the Journal of Law and Economics for this project in interpreting topics in part because of its ideological interest. I have little sympathy for Chicago-style economics and its dire public policy recommendations, but I only expressed that in this project through some sarcastic topic-labeling. Does the classification and sorted browsing enabled by topic modeling affect how a reader perceives antagonistic material? Labeling can be an aggressive activity; would automated labeling of topics alleviate this tendency or reinforce it? I don’t know if this subject has been addressed in information-retrieval research, but I’d like to find out.

*I am leaving out some steps here. My code that processes the MALLET output into a browser uses scripts in perl and R to link the metadata to the files and create graphs of each topic. Andrew Goldstone’s code performs much the same functions and is much more structurally sound than what I created, which is why I haven’t shared my code. For creating browsers, Allison Chaney’s topic-modeling visualization engine is what I recommend, though I was unsure how to convert MALLET’s output to the lda-c output that it expects (though doing so would doubtless be much simpler than writing your own as I did).

**That is, the most highly cited article as counted by google’s bots anywhere, not just within the journal itself. I am aware of the assumption inherent in claiming that a highly cited article would necessarily be influential on that particular journal’s development, since disciplinary and discourse boundaries would have to be taken into account. All highly cited articles are cited in multiple disciplines, I believe, and that applies even to a journal carving out new territory in two well-established ones like law and economics.

Recent Developments in Humanities Topic Modeling: Matthew Jockers’s Macroanalysis and the Journal of Digital Humanities

1. Ongoing Concerns

Matthew Jockers’s Macroanalysis: Digital Methods & Literary History arrived in the mail yesterday, and I finished reading it just a short while ago. Between it and the recent Journal of Digital Humanities issue on the "Digital Humanities Contribution to Topic Modeling," I’ve had quite a lot to read and think about. John Laudun and I also finished editing our forthcoming article in The Journal of American Folklore on using topic models to map disciplinary change. Our article takes a strongly interpretive and qualitative approach, and I want to review what Jockers and some of the contributors to the JDH volume have to say about the interpretation of topic models.

Before I get to that, however, I want to talk about the status of the Representations project, which was based on viewing the same corpus through a number of different topic counts. I had an intuition that documents that were highly cited outside of the journal, such as Pierre Nora’s "Between Memory and History," might tend to be more reflective of the journal’s overall thematic structure than less-cited ones. The fact that citation-count is (to some degree) correlated with publication date complicates this, of course, and I also began to doubt the premise. The opposite, in fact, might be as likely to be true, with articles that have an inverted correlation to the overall thematic structure possibly having more notability than "normal science." The mathematical naivety of my approach compared to the existing work on topic modeling and document influence, such as the Gerrish and Blei paper I linked to in the original post, also concerned me.

One important and useful feature missing from the browsers I had built was the display of related documents for each article. After spending one morning reading through early issues of Computers and the Humanities, I built a browser of it and then began working on computing similarity scores for individual articles. I used what seemed to be the simplest and most intuitive measure: the sum of absolute differences between two documents’ topic proportions (this is known as Manhattan distance). Travis Brown pointed out to me on twitter that Kullback-Leibler divergence would likely give better results.* (Sure enough, in the original LDA paper, KL divergence is recommended.) The Computers and the Humanities browser currently uses the simpler distance measure, and the results are not very good. (This browser also did not filter for research articles only, and I only used the default stop-words list, which means that it is far from as useful as it could be.)
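
For the curious, both measures can be written down in a few lines. Here is a minimal sketch, assuming a hypothetical doc.topics matrix of per-document topic proportions (each row summing to one); I have symmetrized the KL divergence and added a small epsilon to avoid dividing by zero, which is one choice among several:

    # Assumed input (hypothetical name): doc.topics, a documents-by-topics
    # matrix of proportions with each row summing to 1.
    manhattan <- function(p, q) sum(abs(p - q))

    # Symmetrized KL divergence; the epsilon smoothing avoids log(0)
    sym.kl <- function(p, q, eps = 1e-10) {
      p <- (p + eps) / sum(p + eps)
      q <- (q + eps) / sum(q + eps)
      sum(p * log(p / q)) + sum(q * log(q / p))
    }

    # The n documents most similar to document i under a given measure
    most.similar <- function(i, measure, n = 5) {
      d <- apply(doc.topics, 1, function(row) measure(doc.topics[i, ], row))
      order(d)[2:(n + 1)]   # skip the document itself (distance 0)
    }

    most.similar(1, manhattan)
    most.similar(1, sym.kl)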

While the KL divergence is not hard to calculate, I didn’t have time at the beginning of the end of the semester to rewrite the similarity-score script to use it.** And since I wanted the next iteration of the browsers to use the presumably more accurate document-similarity scores, I’ve decided to postpone that project for a month or so. Having a javascript interface that allows you to switch views instantly between pre-generated models with varying numbers of topics also seemed like a useful idea; I haven’t seen anyone do that yet (please let me know if there are existing examples of something like this).

2. Interpretation

I’m only going to write about a small section of Macroanalysis here. A full review might come in the future. I think that the rhetorical strategies of Jockers’s book (and also of Stephen Ramsay’s Reading Machines, an earlier volume in the Topics in the Digital Humanities series published by the University of Illinois Press) contrast interestingly with those of other scholarly monographs in literary studies, and that this rhetoric is worth examining in the context of the current crisis in the humanities and the salvific role of computational methods therein. But what I’m going to discuss here is Jockers’s take on labeling and interpreting the topics generated by LDA.

In our interpretation of the folklore-journals corpus, John and I did do de facto labeling or clustering of the topics. We were particularly interested in a cluster of topics related to the performative turn in folklore. Several of these topics did match our expectations in their related terms and chronological trends. (Ben Schmidt’s cautions about graphing trends in topics chronologically are persuasive, though I’m more optimistic than he is about the use of dynamic topic modeling for secondary literature.) The documents associated with these apparently performance-related topics accorded with our expectations, and we took this as evidence that the co-occurrence and relative-frequency assignments of the algorithm were working as expected. If that were all, then the results would be only another affirmation of the long-attested usefulness of LDA in classification or information retrieval. And this goes a long way. If it works for things we know, then it works for things we don’t. And there are many texts we don’t know much about.

The real interest in using topic modeling to examine scholarship comes when the results contrast with received understanding. When they mostly accord with what someone would expect to find, but there are oddities and discrepancies, we must interpret the results to determine whether the fault lies in the algorithm’s classification or in the discipline’s received understanding of its history. By definition, this received understanding is based more on generalization and oral lore than on analytic scrutiny and revision (which obviously drives much inquiry, but is almost always selective in its target), so there will always be discrepancies. Bibliometric approaches to humanities scholarship lag far behind those of the sciences, as I understand it, and I think they are of intrinsic interest independent of their contribution to disciplinary history.

Jockers describes efforts to label topics algorithmically in Macroanalysis (135, fn1). He mentions that his own work in successively revising the labels of his 19th-century novels topic model is being used by David Mimno to train a classifying algorithm. He also cites "Automatic Labeling of Topic Models" and "Best Topic Word Selection for Topic Labelling" by Jey Han Lau and co-authors. Both of these papers explore automatically assigning labels to topics, either from the terms themselves or by querying an external source, such as wikipedia, to correlate with the terms. My browsers just use the first four terms of a topic as the label, but I can see how a human-assigned label would make them more consistently understandable. Of course, with many models and large numbers of topics, this process becomes laborious; hence the interest in automatic assignment.
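
Generating those default labels is trivial. Something like the following sketch, with a hypothetical topic.words matrix whose columns are the vocabulary, is roughly all that is involved:

    # Assumed input (hypothetical name): topic.words, a topics-by-vocabulary
    # matrix of word weights, with the vocabulary in colnames(topic.words).
    vocab <- colnames(topic.words)
    topic.labels <- apply(topic.words, 1, function(w) {
      paste(vocab[order(w, decreasing = TRUE)][1:4], collapse = " ")
    })
    head(topic.labels)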

But some topics cannot be interpreted. (These are described as "uninterruptable" topics in Macroanalysis [129] in what I assume is a spell-check mistake.) Ignoring ambiguous topics is "a legitimate use of the data and should not be viewed with suspicion by those who may be wary of the ‘black box'" (130). I agree with Jockers here. In my experience modeling JSTOR data, there are always "evidence/argument" related topics that are highly represented in a hyperparametrized model, and these topics are so general as to be useless for analytic purposes. There are also "OCR error" topics and "bibliography" topics. I wouldn’t describe these latter ones as ambiguous so much as useless, but the point is that you don’t have to account for the entire model to interpret some of the topics. Topics near the bottom of a hyperparametrized model tend not to be widely represented in a corpus and thus are not of very high quality: this "dewey ek chomsky" topic from the browser I created out of five theory-oriented journals is a good example.
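
One quick way to see which topics sit at the top and the bottom of a hyperparametrized model is to rank them by their overall share of the corpus. A sketch, again with hypothetical doc.topics and topic.labels objects:

    # Assumed inputs (hypothetical names): doc.topics, a documents-by-topics
    # matrix of proportions, and the topic.labels vector from the sketch above.
    share <- colMeans(doc.topics)      # each topic's overall share of the corpus
    ranked <- data.frame(label = topic.labels, share = round(share, 4))
    ranked[order(ranked$share, decreasing = TRUE), ]   # general topics first, junk topics last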

I was particularly intrigued by Jockers’s description of combining topic-model and stylometric classifications into a similarity matrix. I am bewildered and intimidated by the underlying statistical difficulties of combining these two types of classifications, but the results are certainly interesting. The False Heir, by the immortal George Payne Rainsford James, was classified as the closest non-Dickens novel to A Tale of Two Cities, for example (161).

3. The JDH Issue

Scott Weingart and Elijah Meeks, as I noted above, co-edited a recent issue of JDH devoted to topic modeling in the humanities. Many of the articles are versions of widely circulated posts of the last few months, such as the aforementioned Ben Schmidt article and Andrew Goldstone’s and Ted Underwood’s piece on topic-modeling PMLA. (Before I got distracted by topic-browsers, I created some network visualizations of topics similar to those in the Underwood and Goldstone piece. I get frustrated easily with Gephi for some reason, but the network visualization packages in R don’t generally produce graphs as handsome as Gephi’s.) There is a shortened version of David Blei’s “Probabilistic Topic Models” review article, and the slides from David Mimno’s very informative presentation from November’s Topic-Modeling workshop at the University of Maryland. Megan R. Brett does a good job of explaining what’s interesting about the process to a non-specialist audience. I’ve tried this myself two or three times, and it’s much more difficult than I expected it would be. The slightly decontextualized meanings of “topic,” “theme,” “document,” and possibly even “word” that are used to describe the process cause confusion, from what I’ve observed, and it’s also quite difficult to grasp why the “bag of words” approach can produce coherent results if you’re unaccustomed to thinking about the statistical properties of language. Formalist training and methods are hard to reconcile with frequency-based analysis.

Lisa Rhody’s article describes using LDA to model ekphrastic poetry. I was impressed with Rhody’s discussion of interpretation here, as poetry presents a different level of abstraction from secondary texts and even other forms of creative writing. I had noticed in the rhetoric browser I created out of College English, jac, Rhetoric Review, Rhetoric Society Quarterly, and CCC that the poems often published in College English consistently clustered together (and they would likely still have clustered together had I stop-worded "poems," which I probably should have done). Rhody’s article is the longest of the contributions, I believe, and it has a number of observations about the interpretation of topics that I want to think about more carefully.

Finally, the overview of tools available for topic modeling was very helpful. I’ve never used Paper Machines on my zotero collections, but I look forward to trying it out in the near future. A tutorial on using the R lda package might have been a useful addition, though perhaps its target audience would be too small to bother. I think I might be one of the few humanists to experiment with dynamic topic models, which I find a useful and productive—if daunting—LDA variant. (MALLET has a built-in hierarchical LDA model, but I haven’t yet experimented with it.)

*Here is an informative storified conversation about distance measurements for topic models that Brown showed me.

**Possibly interesting detail: at no point do any of my browser-creation programs use objects or any data structure more complicated than a hash. If you’re familiar with the types of data manipulation necessary to create one of these, that probably sounds somewhat crazy—hence my reluctance to share the code on github or similar. I know enough to know that it’s not the best way to solve the problem, but it also works, and I don’t feel the need to rewrite it for legibility and some imagined community’s approval. I’m fascinated by the ethos of code-sharing, and I might write something longer about this later.

***I disagree with the University of Illinois Press’s decision to use sigils instead of numbered notes in this book. As a reader, I prefer endnotes, though I know how hard they are to typeset, but Jockers’s book has enough of them that they should be numbered.

Topic Models and Highly Cited Articles: Pierre Nora’s “Between Memory and History” in Representations

I have been interested in bibliometrics for some time now. Humanities citation data has always been harder to come by than that of the sciences, largely because the importance of citation-count as a metric has never much caught on in humanities fields. Another important reason is a generalized distrust and suspicion of quantification in the humanities. And there are very good reasons to be suspicious of assigning too much significance to citation-counts in any discipline.

I used google scholar to search for the most-cited articles in several journals in literary studies and allied fields. (Its default search behavior is to return the most-cited articles in its database first, and that database, while it has a very broad reach, is far from comprehensive or error-free.) By far the most-cited article I found in any of the journals I looked at was Pierre Nora’s "Between Memory and History: Les Lieux de Mémoire." A key to success in citation-gathering is multidisciplinary appeal, and Nora’s article has it. It is cited in history, literary studies, anthropology, sociology, and several other fields. (It would be interesting to consider Nora’s argument about the ever-multiplying sites of memory in an era of mass quantification, but I’ll have to save that for another time.)

The next question that came to mind was where Nora’s article would be classified in a topic model of all of the journal’s articles. Representations was first published in 1983. The entire archive in JSTOR contains 1036 documents. For most of my other topic-modeling work with journals, I have used only what JSTOR classifies as research articles. Here, because of the relatively small size of the sample (and also because I wanted to see how the algorithm would classify front matter, back matter, and the other paraphernalia), I used everything. In order to track "Between Memory and History," I created several different models. It is always a heuristic process to match the number of topics with the size and density of a given corpus. Normally, I would have guessed that somewhere between 30 and 50 topics would have been enough to catch most of the distinct topics while minimizing the lumping together of unrelated ones.

For this project, however, I decided to create six separate models with an incrementally increasing number of topics: 10, 30, 60, 90, 120, and 150. I have also created browsers for each model. The index page of each browser shows the first four words of each topic for that model. The topics are sorted in descending order of their proportion in the model. Clicking on one of the topics takes you to a page which shows the full list of terms associated with that topic, the articles most closely associated with that topic (also sorted in descending order—the threshold is .05), and a graph that shows the annual mean of that topic over time. Clicking on any given journal article will take you to a page showing that article’s bibliographic information, along with a link to JSTOR. The four topics most closely associated with that article are also listed there.
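
To give a sense of what goes into a topic page, here is a rough sketch in R, assuming hypothetical doc.topics and meta objects (the latter a data frame with title and year columns in the same row order) and a topic index k; my actual scripts are messier than this:

    # Assumed inputs (hypothetical names): doc.topics, meta, and a topic index k.
    k <- 12
    threshold <- 0.05

    # Articles most closely associated with topic k, in descending order
    idx <- which(doc.topics[, k] >= threshold)
    idx <- idx[order(doc.topics[idx, k], decreasing = TRUE)]
    meta$title[idx]

    # Annual mean of topic k over the journal's run
    annual <- tapply(doc.topics[, k], meta$year, mean)
    plot(as.numeric(names(annual)), annual, type = "l",
         xlab = "year", ylab = paste("mean proportion, topic", k))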

In the ten-topic browser, whose presence here is intended to demonstrate my suspicion that ten topics would not be nearly enough to capture the range of discourse in Representations, Nora’s article is in the ‘French’ topic, a lumped-together race/memory topic, a generalized social/history topic, and the suggestive "time, death, narrative" topic. With a .05 threshold, 32% of the documents in the corpus appear in the ten-topic browser. [UPDATE: 3/16, this figure turned out to be based on a bug in the browser-building program.] None of these classifications are particularly surprising or revealing, given how broad the topics have to be at this level of detail; but one idea that I want to return to is the ability of topic models to identify influential documents in a given corpus. Nora’s article has clearly been very influential, but are there any detectable traces of this influence in a model of the journal in which it appeared?

Sean M. Gerrish and David Blei’s article "Language-based Approach to Measuring Scholarly Impact" uses dynamic topic models to infer which documents are (or will be) most influential in a given collection. What I have done with these Representations models is not dynamic topic modeling but regular LDA. I have experimented with dynamic topic models in the past, and I would like to apply the particular techniques described in their article once I understand them better.

Here is how Nora’s article is classified in each of the topic models (within each model, the topics are listed from most to least representative):

10-topic model: {social political work}, {war american black}, {time death narrative}, {de la le}
30-topic model: {history historical cultural}, {form text relation}, {memory jewish jews}, {time death life}, {political social power}, {de la le}
60-topic model: {history historical past}, {memory jewish holocaust}, {made work ways}, {world human life}, {early modern great}, {make fact question}, {body figure space}, {makes man relation}, {national history public}
90-topic model: {historical history memory}, {form human order}, {fact make point}, {early modern history}, {power terms suggests}
120-topic model: {memory past history}, {human form individual}, {history historical modern}, {relation difference object}, {de la french}, {fact order present}, {forms figure form}
150-topic model: {memory past collective}, {history historical past}, {form relation terms}, {sense kind fact}, {individual system theory}

There is a notable consistency among the topics the article is assigned to, no matter how many there are to choose from. A logical question to ask is whether Nora’s article is assigned to more or fewer topics than the average article across these six models. The percentage of all articles that are assigned to a topic at a proportional threshold >= .05 ranges from 32% in the ten-topic model to 52% in the 150-topic model.
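
That percentage is easy enough to compute. A minimal sketch, assuming a hypothetical list models of documents-by-topics proportion matrices, one for each topic count:

    # Assumed input (hypothetical name): models, a named list of
    # documents-by-topics proportion matrices, one per topic count.
    coverage <- sapply(models, function(dt) {
      100 * mean(apply(dt, 1, function(row) any(row >= 0.05)))
    })
    round(coverage, 1)   # percentage of documents above the .05 threshold, per model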

In my next post, I am going to describe the relative frequency of the average article in the different models and try to identify which articles (including Nora’s, if it turns out to be one) are disproportionately represented in the topics. I will also begin interpreting these results in light of what I felt was historicism’s relative absence in the theory-journals corpus I created earlier.

[UPDATE: 3/16. I corrected a bug in the browser-building program and generated a new table above with the correct topics linked for Nora's article. The previous table had omitted a few.]

Learning to Code

One of my secret vices is reading polemics about whether or not some group of people, usually humanists or librarians, should learn how to code. What’s meant by “to code” in these discussions varies quite a lot. Sometimes it’s a markup language. More frequently it’s an interpreted language (usually python or ruby). I have yet to come across an argument for why a humanist should learn how to allocate memory and keep track of pointers in C, or master the algorithms and data structures in this typical introductory computer science textbook; but I’m sure they’re out there.

I could easily imagine someone in game studies wanting to learn how to program games in their original environment, such as 6502 assembly. A good materialist impulse, such as learning how to work a printing press or bind a book, should never be discouraged. But what about scholars who have an interest in digital media, electronic editing, or text mining? The skeptical argument here points out that there are existing tools for all of these activities, and the wise and conscientious scholar will seek those out rather than wasting time reinventing an inferior product.

This argument is very persuasive, but it doesn’t survive contact with the realities of today’s text-mining and machine-learning environment. I developed a strong interest in these areas several months ago (and have posted about little else since, sadly enough), even to the point where I went to an NEH seminar on topic modeling hosted by the fine folks at MITH. One of the informative lectures recommended that anyone serious about pursuing topic modeling projects learn the statistical programming language R and a scripting language such as python. This came as little surprise to me, about as much as being reassured later in the evening by a dinner companion that Southerners were of course discriminated against in academia. I had begun working with topic-modeling packages in R, and a great deal of text-munging was required to assemble the topic output in a legible format. MALLET makes this easier, but there’s no existing GUI solution* for visualizing the topics (or creating browsers of them, which some feel is more useful**).
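
One way to cut down on the text-munging, for what it’s worth, is the mallet wrapper package for R, which hands back matrices rather than the text files the command-line tool writes out. A minimal sketch, assuming a data frame docs with id and text columns and a stop-word file (all hypothetical names):

    library(mallet)

    # Assumed inputs (hypothetical names): a data frame docs with id and text
    # columns, and a stop-word file "stoplist.txt".
    instances <- mallet.import(docs$id, docs$text, "stoplist.txt")

    topic.model <- MalletLDA(num.topics = 60)
    topic.model$loadDocuments(instances)
    topic.model$setAlphaOptimization(20, 50)   # turn on hyperparameter optimization
    topic.model$train(400)

    # Per-document topic proportions and per-topic word weights as matrices
    doc.topics  <- mallet.doc.topics(topic.model, smoothed = TRUE, normalized = TRUE)
    topic.words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)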

Whatever flexibility being able to dispense with existing solutions might offer you is more than counterbalanced by the unforgiving exactitude and provincial scrupulousness of programming languages, which manifestly avoid all but the most literal interpretations and cause limitless suffering for those foolish or masochistic enough to use them. These countless frustrations inevitably lead to undue pride in overcoming them, which leads people (or at least me) to replace a more rational regret over lost time with the temporary confidence of (almost always Pyrrhic) victory.

An optimistic assessment of the future of computation is that interfaces will become sophisticated enough to eliminate the need for almost anyone other than hobbyists to program a computer. Much research in artificial intelligence (and many of the most promising results, as I understand them) has been in training computers to program themselves. Functional programming languages, to my untutored eye and heavily imperative mindset, already seem to train their programmers to think in a certain way. The correct syntax is the correct solution, in other words; and how far can it be from that notable efficiency to having the computer synthesize the necessary solutions to any technical difficulty or algorithmic refinement itself? (These last comments are somewhat facetious, though the promise of autoevolution was at the heart of cybernetics and related computational enthusiasms—the recent English translation of Lem’s Summa Technologiae is an interesting source here, as is Lem’s "Golem XIV.")

I can’t help but note that several of the arguments I’ve read advising people not to learn to code, and not to spend time teaching other people how to if you happen to be unlucky enough to be in a position to do so, are written by people who make it clear that they themselves know how. (I’m thinking here in particular of Brian Lennon, with whom I’ve had several discussions about these matters on twitter, and also of David Golumbia.) Though I don’t think this myself, I could see how someone might describe this stance as obscurantist. (It’s probably a matter of ethos and also perhaps a dislike of people who exaggerate their technical accomplishments and abilities in front of audiences who don’t know any better—if you could concede that such things could exist in the DH community.)

*Paper Machines, though I haven’t tried it out, can now import and work with DfR requests. This may include topic modeling functionality as well.

**I have to admit that casual analysis (or, rather, exacting scrutiny) of my server logs reveals that absolutely no one finds these topic browsers worth more than a few seconds’ interest. I haven’t yet figured out if this is because they are objectively uninteresting or if users miss the links because of the style sheet. (Or both.)

The Awakening of My Interest in Annular Systems

I’ve been thinking a lot recently about a simple question: can machine learning detect patterns of disciplinary change that are at odds with received understanding? The forms of machine learning that I’ve been using to try to test this—LDA and the dynamic LDA variant—do a very good job of picking up the patterns that you would expect to find in, say, a large corpus of literary journals. The model I built of several theoretically oriented journals in JSTOR, for example, shows much the same trends that anyone familiar with the broad contours of literary theory would expect. The relative absence of historicism as a topic of self-reflective inquiry is also explainable by the journals represented and historicism’s comparatively low incidence of keywords and rote citations.

I’ve heard from people on twitter that it’s a widely held belief that machine-learning techniques (and, by extension, all quantitative methods) can only tell us what we already know about the texts. I admit some initial skepticism about the prevalence of this claim, but I’ve now seen more evidence of it in the wild, so to speak, and I think I understand where some of this overly categorical skepticism comes from. A test of the validity of topic modeling, for example, would be whether it produces a coherent model of a well-known corpus. If it does, then it is likely that it will do the same for an unknown or unread group of texts. The corpora of scholarly literature from JSTOR that I have modeled are, I can see, thought by some of the people who’ve seen the results to be well understood already. If the models reflect the general topics and trends that people know from their knowledge of the field, then that’s great as far as it goes, but we’ll have to reserve judgment on the great unread.

One issue here is that I don’t think the disciplinary history of any field is well understood. Topic modeling’s disinterested aggregations have the potential to show an unrecognized formation or the persistence of a trend long thought dormant. Clancy found some clustering of articles in rhetoric journals associated with a topic that she initially would have labeled as "expressivist" dating from several decades earlier than she would have expected. Part of this has to do with the eclectic nature of what’s published in College English, of course, and part has to do with the parallels between creative writing and expressivist pedagogy. But it’s the type of specific connection that someone following established histories is not likely to find.

Ben Schmidt noted that topic modeling was designed and marketed, to some degree, as a replacement for keyword search. Schmidt is more skeptical than I am of the usefulness of this higher level of abstraction for general scholarly research. I know enough about anthropology to have my eyebrows raised by this Nicholas Wade essay on Napoleon Chagnon, for example, and I still find this browser of American Anthropologist to be a quicker way of finding articles than JSTOR’s interface. I created this browser to compare with the folklore browser* of the corpus that John Laudun and I have been working with. We wanted to see whether topic models would reflect our intuition that the cultural/linguistic turn in anthropology and folklore diffused through their respective disciplines’ scholarly journals (the folklore corpus contains the journal most analogous to American Anthropologist, The Journal of American Folklore, but it also includes other folklore journals) at the expected time (earlier in anthropology than in folklore).

A very promising way, to my mind, of correlating topic models of journals is with networks of citations. I’ve done enough network graphs of scholarly citations to know that, unless you heavily prune and categorize the citations, the results are going to be hard to visualize in any meaningful way. (One of the first network graphs I created, of all of the citations in thirty years of JAF, required zooming in to something like 1000x magnification to make out individual nodes. I’m far from an expert at creating efficient network visualizations, needless to say.) JSTOR once provided citation data through its Data for Research interface; it no longer does, as far as I know. This has been somewhat frustrating.

If we had citation data, taking two topics that both seem reflective of a general cultural/linguistic/poststructuralist influence, such as this folklore topic and this anthropological one, would allow us to compare their citation networks to see whether the concomitant rise in proportion was reflected in references to shared sources (Lévi-Strauss, for example, I know to be one of the most cited authors in the folklore corpus). I would also like to explore the method described in this paper, which uses a related form of posterior inference to discover the most influential documents in a corpus.**

This type of comparative exploration, while presenting an interesting technical challenge to implement (to me, that is, and I fully recognize the incommensurable gulf between using these algorithms and creating and refining them), can’t (yet) be mistaken for discovery. You can’t go from this to an a priori proof of non-discovery, however. Maybe no one is actually arguing this position, and I’m fabricating this straw argument out of supercilious tweets and decontextualized and half-remembered blog posts.

A more serious personal intellectual problem for me is that I find the dispute between Peter Norvig and Noam Chomsky to be either a case of mutual misunderstanding or one where Chomsky has by far the more persuasive case. If I’m being consistent, then, I’d have to reject at least some of the methodological premises behind topic modeling and related techniques. Perhaps "practical value" and "exploration/discovery" can share a peaceful coexistence.

*These browsers work by showing an index page with the first four words of each topic. You can then click on any one of the topics to see the full list of words associated with it, together with a list of articles sorted by how strongly they represent that topic. Clicking on an individual article takes you to a page that shows the other topics most associated with that article, also clickable, and a link to the JSTOR page of the article itself.

**The note about the model taking more than ten hours to run fills me with foreboding, however. My (doubtless inefficient) browser-creating scripts can take more than an hour to run on a corpus of 10K documents, combined with another hour or more with MALLET and R; it really grinds down a person conditioned to expect instant results in today’s attention economy.