Interpreting Topics in Law and Economics

Of the many interesting things in Matthew Jockers’s Macroanalysis, I was most intrigued by his discussion of interpreting the topics in topic models. Interpretation is what literary scholars are trained for and tend to excel at, and I’m somewhat skeptical of the notion of an “uninterpretable” topic. I prefer to think of it as a topic that hasn’t yet met its match, hermeneutically speaking. In my experience building topic models of scholarly journals, I have found clear examples of lumping and splitting—terms that are either separated from their natural place or agglomerated into an unhappy mass. The ‘right’ number of topics for a given corpus is generally the one which has the lowest visible proportion of lumped and split topics. But there are other issues in topic-interpretation that can’t easily be resolved this way.

A problem I’ve found in modeling scholarship is how “evidence/argument words” are always highly represented in any given corpus. If you use hyperparameter optimization, which weighs topics according to the relative proportion in the corpus, words like “fact evidence argue make” tend to compose the most representative topics. Options include simply eliminating the topic from the browser, which seems to eliminate a large number of documents that would be otherwise classified, or trying to add all of the evidence words to a stop list. The aggressive pursuit of stop-words degrades the model, though this observation is more of an intuition than anything I can now document.

I thought it might be helpful to others who are interested in working with topic models to create several models of the same corpus and look at the effects created by small changes in the parameters (number of topics, lemmatization of corpus, and stop-words). The journal that I chose to use for this example is the Journal of Law and Economics, for both its ideological interest and methodological consistency. The law-and-economics movement is about as far away from literary studies as it’s possible to be while still engaging in a type of discourse analysis, I think, and I find this contrast both amusing and potentially illuminating. That the field of law-and-economics is perhaps the most well-known (even infamous) example of quantified reasoning used in support of what many view as a distinct political agenda is what led me to choose it to begin to explore the potential critical usefulness of another quantitative method of textual analysis.

I began by downloading all of the research articles published in the journal from JSTOR’s Data for Research. There were 1281 articles. I then converted the word-frequency lists to bags-of-words and created a 70-topic model using MALLET.* The browsable model is here. The first topic is the most general of academic evidence/argument words: “made, make, case, part, view, difficult. . .” I was intrigued by the high-ranking presence of articles by Milton Friedman and R. H. Coase in this topic; it would be suggestive if highly cited or otherwise important articles were most strongly associated with the corpus’s “evidence” terms, but I can’t say that this is anything other than coincidence. The next topic shows the influence of the journal’s title: “law, economics, economic, system, problem, individual.” The duplication of the adjective and noun form of “economics” can be eliminated with stemming or lemmatizing the corpus, though it is not clear if this increases the overall clarity of the model. I noticed that articles “revisiting” topics such as “social cost” and “public goods” are prominent in this topic, which is perhaps explainable by an unusually high proportion of intra-journal citations. (I want to bemoan, for the thousandth time, the loss of JSTOR’s citation data from its API.)
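The conversion from word-frequency lists to bags of words can be sketched roughly as follows. This is a minimal Python illustration, not the scripts actually used for this post: it assumes each article's frequency list is a CSV with one header row followed by `word,count` rows, and the function name and sample data are hypothetical.

```python
import csv
import io


def counts_to_bag(csv_text):
    """Expand word,count rows into a space-separated bag of words.

    Assumes (for illustration) that the first row is a header and that
    every later row has the form `word,count`.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the assumed header row
    words = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        word, count = row[0].strip(), int(row[1])
        words.extend([word] * count)  # repeat each word `count` times
    return " ".join(words)


# Hypothetical sample in the assumed layout.
sample = "WORDCOUNTS,WEIGHT\nprice,3\nmarket,2\nlaw,1\n"
bag = counts_to_bag(sample)
print(bag)
```

Word order is lost in the frequency lists to begin with, which does not matter for LDA, since the model only looks at co-occurrence within a document.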

The next two topics are devoted to methodology. Econometric techniques dominate the content of the Journal of Law and Economics, so there’s no surprise that topics featuring those terms would be this widely distributed. Of the next three topics, one seems spuriously related to citations and the other two are also devoted to statistical methodology. It is only the eighth topic that is unambiguously associated with a recognizable subject in the journal: market efficiency. Is this apparent overemphasis on evidence/methodology a problem? And if so, what do you do about it? One approach would be to add many of the evidence-related words to a stop-list. Another would be to label all the topics and let the browser decide which are valuable. Here is a rough attempt at labeling the seventy-topic model.

The number of topics generated is the most obvious and effective parameter to adjust. Though I ended up labeling several of the topics the same way, I’m not sure that I would define those as split topics. The early evidence/methodology related topics do have slightly distinct frames of reference. The topics labeled “Pricing” also refer to different aspects of price theory, which I could have specified. The only obviously lumped-together topic was the final one, with its mixture of sex-worker and file-sharing economics. If there is evidence of both lumping and splitting, then simply adjusting the number of topics is unlikely to solve both problems.

An alternative to aggressive stop-wording is lemmatization. The Natural Language Toolkit has a lemmatizer that calls on the WordNet database. Implementation is simple in Python, though slow to execute. A seventy-topic model generated with the lemmatized corpus has continuities with the non-lemmatized model. The browser shows that there are fewer evidence-related topics. Since the default stop-word list does not include the lemmatized forms “ha,” “doe,” “wa,” or “le,” the model aggregates those in topics that are more strongly representative than the similar topics in the non-lemmatized model. Comparing the labeled topics with the non-lemmatized model shows that there are many direct correspondences. The two insurance-related topics, for instance, have very similar lists of articles. The trend lines do not always match very well, which I believe is caused by the much higher weighting of the first “argument words” topic in the lemmatized corpus (as well as by issues with the reliability of graphing these very small changes).
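One way to keep those lemmatized fragments out of the model is to extend the stop list after lemmatizing. The sketch below is a hedged illustration rather than the code used here: the lemmatizer is passed in as a function, `toy_lemmatize` is a deliberately crude stand-in, and all names are hypothetical.

```python
# Lemmatized fragments mentioned above that slip past the default stop list.
LEMMA_ARTIFACTS = {"ha", "doe", "wa", "le"}


def lemmatize_and_filter(tokens, lemmatize, stopwords):
    """Lemmatize each token, then drop stop words and lemma artifacts."""
    blocked = set(stopwords) | LEMMA_ARTIFACTS
    lemmas = (lemmatize(token.lower()) for token in tokens)
    return [lemma for lemma in lemmas if lemma not in blocked]


def toy_lemmatize(word):
    """Stand-in lemmatizer (illustration only): strips one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 2 else word


tokens = ["The", "court", "has", "costs", "and", "contracts"]
cleaned = lemmatize_and_filter(tokens, toy_lemmatize, {"the", "and"})
print(cleaned)
```

With NLTK installed and its WordNet data downloaded, `nltk.stem.WordNetLemmatizer().lemmatize` could be passed in place of `toy_lemmatize`.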

Labeling is inherently subjective, and my adopted labels for the lemmatized corpus were both whimsical in places and also influenced by the first labels that I had chosen. As I mentioned in my comments on Matthew Jockers’s Macroanalysis, computer scientists have developed automatic labeling techniques for topic models. While labor-intensive, doing it by hand forces you to consider each topic’s coherence and reliability in a way that might be easy to miss otherwise. The browser format that shows the articles most closely associated with each topic helps label them as well, I find. It might not be a bad idea for a topic model of journal articles to label each topic based on the title of the article most closely associated with it; this technique would only mislead on deeply divided or clustered topics, or on those which have only one article strongly associated with it (a sign of too many topics in my experience).
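The title-based labeling idea could be sketched like this. It is only an illustration under stated assumptions: the document-topic proportions are taken to be a list of rows (one per article), the `threshold` for “strongly associated” is an arbitrary choice, and the titles and numbers are made up.

```python
def label_topics_by_top_article(doc_topics, titles, threshold=0.5):
    """Label each topic with the title of its highest-weighted article.

    doc_topics: one list of topic proportions per document.
    titles: document titles, in the same order as doc_topics.
    Returns {topic_index: (title, number_of_strongly_associated_docs)}.
    """
    n_topics = len(doc_topics[0])
    labels = {}
    for k in range(n_topics):
        weights = [row[k] for row in doc_topics]
        best = max(range(len(weights)), key=weights.__getitem__)
        support = sum(1 for w in weights if w >= threshold)
        labels[k] = (titles[best], support)
    return labels


# Hypothetical proportions and titles.
doc_topics = [[0.7, 0.3], [0.2, 0.8], [0.6, 0.4]]
titles = ["Article A", "Article B", "Article C"]
labels = label_topics_by_top_article(doc_topics, titles)
print(labels)
```

A support count of one flags the case described above, where a single article dominates a topic and its title may be a misleading label.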

(UPDATE: My initial labeling of the tables below was in error because of an indexing error with the topic numbers. The correlations below make much more sense in terms of the topics’ relative weights, and I’m embarrassed that I didn’t notice the problem earlier.)

The topics were not strongly correlated with each other in either direction. In the non-lemmatized model, the only topics with a Pearson correlation above .4 were

EVIDENCE / JOURNAL
ECONOMIC IDEOLOGY / EVIDENCE
MODELING / METHODOLOGY

The negative correlations below -.4 were

MODELING / EVIDENCE
JOURNAL / METHODOLOGY
MODELING / JOURNAL
EVIDENCE / METHODOLOGY
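For readers who want to reproduce this kind of list, the correlations can be computed from the document-topic proportions by treating each topic as a column. The following is a minimal sketch in plain Python, not the R code used for this post; the labels and numbers are invented, and the .4 cutoff is simply the threshold mentioned above.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant column; treated as 0 here
    return cov / sqrt(var_x * var_y)


def correlated_pairs(doc_topics, labels, threshold=0.4):
    """Return (label_i, label_j, r) for topic pairs with |r| > threshold."""
    columns = list(zip(*doc_topics))  # one tuple of proportions per topic
    pairs = []
    for i, j in combinations(range(len(columns)), 2):
        r = pearson(columns[i], columns[j])
        if abs(r) > threshold:
            pairs.append((labels[i], labels[j], r))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical proportions for four documents and three topics.
doc_topics = [[0.6, 0.3, 0.1], [0.2, 0.1, 0.7],
              [0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]
pairs = correlated_pairs(doc_topics, ["TOPIC A", "TOPIC B", "TOPIC C"])
for a, b, r in pairs:
    print(f"{a} / {b}: {r:+.2f}")
```

Because each document's proportions sum to one, a heavily weighted topic pushes the others down. Some negative correlation is therefore built into the data.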

Ted Underwood and Andrew Goldstone’s PMLA topic-modeling post used network graphs to visualize their models and produce identifiable clusters. I suspect this particular model could be graphed in the same way, but the relatively low correlations between topics make me a little leery of trying it. I generated a few network graphs for John Laudun’s and my folklore project, but we didn’t end up using them for the first article. They weren’t as snazzy as the Underwood and Goldstone graphs, as my Gephi patience often runs very thin. (Gephi also has problems with the latest Java update, as Ian Milligan pointed out to me on Twitter. I intend to update this post before too long with a D3 network graph of the topic correlations.)

[UPDATE: 5/16/13. After some efforts at understanding javascript's object syntax, I've made a clickable network graph of correlations between topics in the lemmatized browser: network graph. The darker the edge, the stronger the correlation.]

The most strongly correlated topics in the lemmatized corpus were

METHODOLOGY / MODELING
ARGUMENT WORDS / PUBLIC GOODS
ARGUMENT WORDS / ECONOMIC IDEOLOGY

Here is a simple network graph of the positively correlated topics above .2 (thicker lines indicate stronger correlation):

[Figure: network graph of positively correlated topics in the lemmatized model]

My goal is to integrate a D3.js version of these network graphs into the browsers, so that the nodes link to the topics and the layout is adjustable. I haven’t yet learned the software well enough to do this, however. The simple graph above was made using the R igraph package. [UPDATE: See here for a simple D3.js browser.]

And the negative correlations:

METHODOLOGY / ARGUMENT WORDS
ARGUMENT WORDS / MODELING
MODELING / AMERICA?

The fact that some topics appear at the top of both the negative and positive correlations in both of the models suggests to me that there is some artifact of the hyperparameter optimization process responsible for this in a way that I don’t quite grasp (though I am aware, sadly enough, that the explanation could be very simple). The .4 threshold I chose is arbitrary, and the correlations follow a consistent and smooth pattern in both models. The related articles section of these browsers is based on Kullback-Leibler divergence, a metric apparently more useful than Manhattan distance. It seems to me that the articles listed under each topic are much more likely to be related to one another than any metric I’ve used to compare the overall weighting of topics.

Another way of assessing the models and label-interpretations is to check where they place highly cited articles. According to google scholar, the most highly cited article** in Journal of Law and Economics is Fama and Jensen’s “Separation of Ownership and Control.” In the non-lemmatized model, it is associated with the AGENTS AND ORGANIZATIONS topic. It appears in the topic I labeled INVESTORS in the lemmatized corpus, but further reflection shows that these terms are closer than I first thought. My intuition, as I have mentioned before in this discussion of Pierre Nora’s “Between Memory and History,” is that highly cited articles are somehow more central to the corpus because they affect the subsequent distribution of terms. The next-most cited article, Oliver Williamson’s “Transaction-cost Economics: The Governance of Contractual Relations” appears, suitably enough, in the topics devoted to contracts in both browsers. And R. H. Coase’s “The Federal Communications Commission” is in the COMMUNICATIONS REGULATION topic in both browsers, a topic whose continuing theoretical interest to the journal was established by Coase’s early article.

As I mentioned in the beginning, I chose the Journal of Law and Economics for this project in interpreting topics in part because of its ideological interest. I have little sympathy for Chicago-style economics and its dire public policy recommendations, but I only expressed that in this project through some sarcastic topic-labeling. Does the classification and sorted browsing enabled by topic modeling affect how a reader perceives antagonistic material? Labeling can be an aggressive activity; would automated labeling of topics alleviate this tendency or reinforce it? I don’t know if this subject has been addressed in information-retrieval research, but I’d like to find out.

*I am leaving out some steps here. My code that processes the MALLET output into a browser uses scripts in perl and R to link the metadata to the files and create graphs of each topic. Andrew Goldstone’s code performs much the same functions and is much more structurally sound than what I created, which is why I haven’t shared my code. For creating browsers, Allison Chaney’s topic-modeling visualization engine is what I recommend, though I was unsure how to convert MALLET’s output to the lda-c output that it expects (though doing so would doubtlessly be much simpler than writing your own as I did).

**That is the most highly cited article anywhere that google’s bots have found, not just in the journal itself. I am aware of the assumption inherent in claiming that a highly cited article would necessarily be influential to that particular journal’s development, since disciplinary and discourse boundaries would have to be taken into account. All highly cited articles are cited in multiple disciplines, I believe, and that applies even to a journal carving out new territory in two well-established ones like law and economics.

Recent Developments in Humanities Topic Modeling: Matthew Jockers’s Macroanalysis and the Journal of Digital Humanities

1. Ongoing Concerns
Matthew Jockers’s Macroanalysis: Digital Methods & Literary History arrived in the mail yesterday, and I finished reading just a short while ago. Between it and the recent Journal of Digital Humanities issue on the “Digital Humanities Contribution to Topic Modeling,” I’ve had quite a lot to read and think about. John Laudun and I also finished editing our forthcoming article in The Journal of American Folklore on using topic-models to map disciplinary change. Our article takes a strongly interpretive and qualitative approach, and I want to review what Jockers and some of the contributors to the JDH volume have to say about the interpretation of topic models.

Before I get to that, however, I want to talk about the Representations project’s status, as it was based on viewing the same corpus through a number of different topic-sizes. I had an intuition that documents that were highly cited outside of the journal, such as Pierre Nora’s “Between Memory and History,” might tend to be more reflective of the journal’s overall thematic structure than those less-cited. The fact that citation-count is (to some degree) correlated with publication date complicates this, of course, and I also began to doubt the premise. The opposite, in fact, might be as likely to be true, with articles that have an inverted correlation to the overall thematic structure possibly having more notability than “normal science.” The mathematical naivety of my approach compared to the existing work on topic-modeling and document influence, such as the Gerrish and Blei paper I linked to in the original post, also concerned me.

One important and useful feature missing from the browsers I had built was the display of related documents for each article. After spending one morning reading through early issues of Computers and the Humanities, I built a browser of it and then began working on computing similarity scores for individual articles. I used what seemed to be the simplest and most intuitive measure–the sum of absolute differences of topic assignments (this is known as Manhattan distance). Travis Brown pointed out to me on twitter that Kullback-Leibler divergence would likely give better results.* (Sure enough, in the original LDA paper, KL divergence is recommended.) The Computers and the Humanities browser currently uses the simpler distance measure, and the results are not very good. (This browser also did not filter for research articles only, and I only used the default stop-words list, which means that it is far from as useful as it could be.)
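The two measures are easy to compare side by side. The sketch below is a hedged illustration, not the similarity script behind these browsers: it assumes each article is represented as a list of topic proportions, adds a small `epsilon` to avoid taking the log of zero, and averages the two directions of the divergence because KL is not symmetric. The smoothing and the symmetrizing are both assumptions of this example.

```python
from math import log


def manhattan(p, q):
    """Sum of absolute differences between two topic distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q, epsilon=1e-10):
    """KL(p || q) in nats; epsilon smoothing is an assumption of this sketch."""
    total = 0.0
    for a, b in zip(p, q):
        a, b = a + epsilon, b + epsilon
        total += a * log(a / b)
    return total


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p), so that order does not matter."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Hypothetical topic proportions for three articles.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

print(manhattan(doc_a, doc_b), manhattan(doc_a, doc_c))
print(symmetric_kl(doc_a, doc_b), symmetric_kl(doc_a, doc_c))
```

Both measures rank `doc_b` as closer to `doc_a` than `doc_c` is. They differ in how heavily they penalize a topic that is prominent in one article and nearly absent from the other, since the log ratio in KL divergence grows quickly in that case.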

While the KL-divergence is not hard to calculate, I didn’t have time at the beginning of the end of the semester to rewrite the similarity score script to use it.** And since I wanted the next iteration of the browsers to use the presumably more accurate document-similarity scores, I’ve decided to postpone that project for a month or so. Having a javascript interface that allows you to instantly switch views between pre-generated models of varying numbers of topics also seemed like a useful idea; I haven’t seen anyone do that with different numbers of topics in each model yet (please let me know if there are existing examples of something like this).

2. Interpretation

I’m only going to write about a small section of Macroanalysis here. A full review might come in the future. I think that the rhetorical strategies of Jockers’s book (and also of Stephen Ramsay’s Reading Machines, an earlier volume in the Topics in the Digital Humanities series published by the University of Illinois Press) contrast interestingly with other scholarly monographs in literary studies and that this rhetoric is worth examining in the context of the current crisis in the humanities, and the salvific role of computational methods therein. But what I’m going to discuss here is Jockers’s take on labeling and interpreting the topics generated by LDA.

In our interpretation of the folklore-journals corpus John and I did do de-facto labeling or clustering of the topics. We were particularly interested in a cluster of topics related to the performative turn in folklore. Several of these topics did match our expectations in related terms and chronological trends. (Ben Schmidt’s cautions about graphing trends in topics chronologically are persuasive, though I’m more optimistic than he is about the use of dynamic topic modeling for secondary literature.) The documents associated with these apparently performance-related topics accorded with our expectations, and we took this as evidence that the co-occurrence and relative frequency assignments of the algorithm were working as expected. If that were all, then the results would be only another affirmation of the long-attested usefulness of LDA in classification or information-retrieval. And this goes a long way. If it works for things we know, then it works for things we don’t. And there are many texts we don’t know much about.

The real interest in using topic modeling to examine scholarship is when the results contrast with received understanding. When they mostly accord with what someone would expect to find, but there are oddities and discrepancies, we must interpret the results to determine if the fault lies in the algorithm’s classification or in the discipline’s received understanding of its history. By definition, this received understanding is based more on generalization and oral lore than on analytic scrutiny and revision (which obviously drives much inquiry, but is almost always selective in its target), so there will always be discrepancies. Bibliometric approaches to humanities scholarship lag far behind those of the sciences, as I understand it, and I think they are of intrinsic interest independent of their contribution to disciplinary history.

Jockers describes efforts to label topics algorithmically in Macroanalysis (135, fn1). He mentions that his own work in successively revising the labels of his 19th century novels topic model is being used by David Mimno to train a classifying algorithm. He also cites “Automatic Labeling of Topic Models” and “Best Topic Word Selection for Topic Labelling” by Jey Han Lau and co-authors. Both of these papers explore automatically assigning labels to topics from either the terms themselves or from querying an external source, such as wikipedia, to correlate with the terms. My browsers just use the first four terms of a topic as the label, but I can see how a human-assigned label would make them more consistently understandable. Of course, with many models and large numbers of topics, this process becomes laborious, thus the interest in automatic assignment.

But some topics cannot be interpreted. (These are described as “uninterruptable” topics in Macroanalysis [129] in what I assume is a spell-check mistake.) Ignoring ambiguous topics is “a legitimate use of the data and should not be viewed with suspicion by those who may be wary of the ‘black box’” (130). I agree with Jockers here. In my experience modeling JSTOR data, there are always “evidence/argument” related topics that are highly represented in a hyperparametrized model, and these topics are so general as to be useless for analytic purposes. There are also “OCR error” topics and “bibliography” topics. I wouldn’t describe these latter ones as ambiguous so much as useless, but the point is that you don’t have to account for the entire model to interpret some of the topics. Topics near the bottom of a hyperparametrized model tend not to be widely represented in a corpus and thus are not of very high quality: this “dewey ek chomsky” topic from the browser I created out of five theory-oriented journals is a good example.

I was particularly intrigued by Jockers’s description of combining topic-model and stylometric classifications into a similarity matrix. I would be bewildered and intimidated by the underlying statistical difficulties of combining these two types of classifications, but the results are certainly intriguing. The immortal George Payne Rainsford James and his The False Heir was classified as the closest non-Dickens novel to A Tale of Two Cities, for example (161).

3. The JDH Issue

Scott Weingart and Elijah Meeks, as I noted above, co-edited a recent issue of JDH devoted to topic modeling in the humanities. Many of the articles are versions of widely circulated posts of the last few months, such as the aforementioned Ben Schmidt article and Andrew Goldstone’s and Ted Underwood’s piece on topic-modeling PMLA. (Before I got distracted by topic-browsers, I created some network visualizations of topics similar to those in the Underwood and Goldstone piece. I get frustrated easily with Gephi for some reason, but the network visualization packages in R don’t generally produce graphs as handsome as Gephi’s.) There is a shortened version of David Blei’s “Probabilistic Topic Models” review article, and the slides from David Mimno’s very informative presentation from November’s Topic-Modeling workshop at the University of Maryland. Megan R. Brett does a good job of explaining what’s interesting about the process to a non-specialist audience. I’ve tried this myself two or three times, and it’s much more difficult than I expected it would be. The slightly decontextualized meanings of “topic,” “theme,” “document,” and possibly even “word” that are used to describe the process cause confusion, from what I’ve observed, and it’s also quite difficult to grasp why the “bag of words” approach can produce coherent results if you’re unaccustomed to thinking about the statistical properties of language. Formalist training and methods are hard to reconcile with frequency-based analysis.

Lisa Rhody’s article describes using LDA to model ekphrastic poetry. I was impressed with Rhody’s discussion of interpretation here, as poetry presents a different level of abstraction from secondary texts and even other forms of creative writing. I had noticed in the rhetoric browser I created out of College English, jac, Rhetoric Review, Rhetoric Society Quarterly, and CCC, that the poems often published in College English consistently clustered together (and that topic would have been clustered together had I stop-worded “poems,” which I probably should have done.) Rhody’s article is the longest of the contributions, I believe, and it has a number of observations about the interpretation of topics that I want to think about more carefully.

Finally, the overview of tools available for topic modeling was very helpful. I’ve never used Paper Machines on my zotero collections, but I look forward to trying this out in the near future. A tutorial on using the R lda package might have been a useful addition, though perhaps its target audience would be too small to bother. I think I might be one of the few humanists to experiment with dynamic topic models, which I think is a useful and productive—if daunting—LDA variant. (MALLET has a built-in hierarchical LDA model, but I haven’t yet experimented with it.)

*Here is an informative storified conversation about distance measurements for topic models that Brown showed me.

**Possibly interesting detail: at no point do any of my browser-creation programs use objects or any more complicated data-structure than a hash. If you’re familiar with the types of data manipulation necessary to create one of these, that probably sounds somewhat crazy—hence my reluctance to share the code on github or similar. I know enough to know that it’s not the best way to solve the problem, but it also works, and I don’t feel the need to rewrite it for legibility and some imagined community’s approval. I’m fascinated by the ethos of code-sharing, and I might write something longer about this later.

***I disagree with the University of Illinois Press’s decision to use sigils instead of numbered notes in this book. As a reader, I prefer endnotes, though I know how hard they are to typeset, but Jockers’s book has enough of them that they should be numbered.

Topic Models and Highly Cited Articles: Pierre Nora’s “Between Memory and History” in Representations

I have been interested in bibliometrics for some time now. Humanities citation data has always been harder to come by than that of the sciences, largely because the importance of citation-count as a metric has never much caught on there. Another important reason is a generalized distrust and suspicion of quantification in the humanities. And there are very good reasons to be suspicious of assigning too much significance to citation-counts in any discipline.

I used Google Scholar to search for most-cited articles in several journals in literary studies and allied fields. (Its default search behavior is to return the most-cited article in its database, which, while having a very broad reach, is far from comprehensive or error-free.) By far the most-cited article I found in any of the journals I looked at was Pierre Nora’s “Between Memory and History: Les Lieux de Mémoire.” A key to success in citation-gathering is multidisciplinary appeal, and Nora’s article has it. It is cited in history, literary studies, anthropology, sociology, and several other fields. (It would be interesting to consider Nora’s argument about the ever-multiplying sites of memory in an era of mass quantification, but I’ll have to save that for another time.)

The next question that came to mind was where Nora’s article would be classified in a topic model of all of the journal’s articles. Representations was first published in 1983. The entire archive in JSTOR contains 1036 documents. For much of my other topic-modeling work with journals, I have only used what JSTOR classifies as research articles. Here, because of the relatively small size of the sample (and also because I wanted to see how the algorithm would classify front matter, back matter, and the other paraphernalia), I used everything. In order to track “Between Memory and History,” I created several different models. It is always a heuristic process to match the number of topics with the size and density of a given corpus. Normally, I would have guessed that somewhere between 30 and 50 would have been good enough to catch most of the distinct topics while minimizing the lumping together of unrelated ones.

For this project, however, I decided to create six separate models with an incrementally increasing number of topics. The number of topics in each is 10, 30, 60, 90, 120, and 150. I have also created browsers for each model. The index page of each browser shows the first four words of each topic for that model. The topics are sorted in descending order of their proportion in the model. Clicking on one of the topics takes you to a page which shows the full list of terms associated with that topic, the articles most closely associated with that topic (also sorted in descending order—the threshold is .05), and a graph that shows the annual mean of that topic over time. Clicking on any given journal article will take you to a page showing that journal’s bibliographic information, along with a link to JSTOR. The four topics most closely associated with that article are also listed there.
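The two computations behind each topic page, the ranked article list and the annual mean, can be sketched as follows. This is an illustrative Python version rather than the perl and R scripts that build these browsers: it assumes the document-topic proportions are available as rows alongside each document's publication year, and the function names and sample values are hypothetical.

```python
from collections import defaultdict


def top_documents(doc_topics, topic, threshold=0.05):
    """Indices of documents at or above the threshold, highest weight first."""
    ranked = [(row[topic], i) for i, row in enumerate(doc_topics)
              if row[topic] >= threshold]
    ranked.sort(reverse=True)
    return [i for _, i in ranked]


def annual_topic_means(years, doc_topics, topic):
    """Mean proportion of one topic per publication year."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, row in zip(years, doc_topics):
        totals[year] += row[topic]
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}


# Hypothetical data: three documents, two topics.
years = [1983, 1983, 1984]
doc_topics = [[0.5, 0.5], [0.1, 0.9], [0.02, 0.98]]

ranked = top_documents(doc_topics, topic=0)
means = annual_topic_means(years, doc_topics, topic=0)
print(ranked)
print(means)
```

The third document falls below the .05 threshold for the first topic, so it is left off that topic's article list even though it still contributes to the 1984 mean.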

In the ten-topic browser, whose presence here is intended to demonstrate my suspicion that ten topics would not be nearly enough to capture the range of discourse in Representations, Nora’s article is in the ‘French’ topic, a lumped-together race/memory topic, a generalized social/history topic, and the suggestive “time, death, narrative” topic. With a .05 threshold, 32% of the documents in the corpus appear in the ten-topic browser. [UPDATE: 3/16, this figure turned out to be based on a bug in the browser-building program.] None of these classifications are particularly surprising or revealing, given how broad the topics have to be at this level of detail; but one idea that I want to return to is the ability of topic-models to identify influential documents in a given corpus. Nora’s article has clearly been very influential, but are there any detectable traces of this influence in a model of the journal in which it appeared?

Sean M. Gerrish and David Blei’s article “Language-based Approach to Measuring Scholarly Impact” uses dynamic topic models to infer which documents are (or will be) most influential in a given collection. What I have done with these Representations models is not dynamic topic modeling but the regular LDA model. I have experimented with dynamic topic models in the past, and I would like to apply the particular techniques described in their article once I can understand them better.

Here is how Nora’s article is classified in each of the topic models (sorted vertically from most to least representative):

10-topics | 30-topics | 60-topics | 90-topics | 120-topics | 150-topics
{social political work} | {history historical cultural} | {history historical past} | {historical history memory} | {memory past history} | {memory past collective}
{war american black} | {form text relation} | {memory jewish holocaust} | {form human order} | {human form individual} | {history historical past}
{time death narrative} | {memory jewish jews} | {made work ways} | {fact make point} | {history historical modern} | {form relation terms}
{de la le} | {time death life} | {world human life} | {early modern history} | {relation difference object} | {sense kind fact}
N/A | {political social power} | {early modern great} | {power terms suggests} | {de la french} | {individual system theory}
N/A | {de la le} | {make fact question} | N/A | {fact order present} | N/A
N/A | N/A | {body figure space} | N/A | {forms figure form} | N/A
N/A | N/A | {makes man relation} | N/A | N/A | N/A
N/A | N/A | {national history public} | N/A | N/A | N/A

There is a notable consistency between the topics the article is assigned to no matter how many there are to choose from. A logical question to ask is whether Nora’s article is assigned to more or fewer topics than the average article across these six models. The percentage of all articles that are assigned to a topic with a proportional threshold >= .05 ranges from 32% with the ten-topic model to 52% in the 150-topic.
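The comparison between one article and the average can be made concrete with a short sketch. It is an illustration only: the proportions are made up, the function names are hypothetical, and the .05 threshold is the one used in the browsers.

```python
def topics_per_document(doc_topics, threshold=0.05):
    """Number of topics at or above the threshold for each document."""
    return [sum(1 for w in row if w >= threshold) for row in doc_topics]


def compare_to_average(doc_topics, doc_index, threshold=0.05):
    """Return (count for one document, mean count over all documents)."""
    counts = topics_per_document(doc_topics, threshold)
    return counts[doc_index], sum(counts) / len(counts)


# Hypothetical proportions for three documents and three topics.
doc_topics = [[0.90, 0.06, 0.04],
              [0.50, 0.50, 0.00],
              [0.34, 0.33, 0.33]]

counts = topics_per_document(doc_topics)
own, average = compare_to_average(doc_topics, doc_index=2)
print(counts, own, round(average, 2))
```

Running the same comparison on each of the six models would show whether an article such as Nora’s is spread across more topics than is typical as the number of topics grows.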

In my next post, I am going to describe the relative frequency of the average article in the different models and try to identify which ones (including Nora’s, if it turns out to be) are disproportionately represented in the topics. I will also begin interpreting these results in light of what I felt was historicism’s relative absence in the theory-journals corpus I created earlier.

[UPDATE: 3/16. I corrected a bug in the browser-building program and generated a new table above with the correct topics linked for Nora's article. The previous table had omitted a few.]

Learning to Code

One of my secret vices is reading polemics about whether or not some group of people, usually humanists or librarians, should learn how to code. What’s meant by “to code” in these discussions varies quite a lot. Sometimes it’s a markup language. More frequently it’s an interpreted language (usually python or ruby). I have yet to come across an argument for why a humanist should learn how to allocate memory and keep track of pointers in C, or master the algorithms and data structures in this typical introductory computer science textbook; but I’m sure they’re out there.

I could easily imagine someone in game studies wanting to learn how to program games in their original environment, such as 6502 assembly, for example. A good materialist impulse, such as learning how to work a printing press or bind a book, should never be discouraged. But what about scholars who have an interest in digital media, electronic editing, or text mining? The skeptical argument here points out that there are existing tools for all of these activities, and the wise and conscientious scholar will seek those out rather than wasting time reinventing an inferior product.

This argument is very persuasive, but it doesn’t survive contact with the realities of today’s text-mining and machine-learning environment. I developed a strong interest in these areas several months ago (and have posted about little else since, sadly enough), even to the point where I went to an NEH seminar on topic modeling hosted by the fine folks at MITH. One of the informative lectures recommended that anyone serious about pursuing topic modeling projects learn the statistical programming language R and a scripting language such as python. This surprised me about as little as being reassured later in the evening by a dinner companion that Southerners were of course discriminated against in academia. I had begun working with topic-modeling in R packages, and a great deal of text-munging was required to assemble the topic output in a legible format. MALLET makes this easier, but there’s no existing GUI solution* for visualizing the topics (or creating browsers of them, which some feel is more useful**).

Whatever flexibility that being able to dispense with existing solutions might offer you is more than counterbalanced by the unforgiving exactitude and provincial scrupulousness of programming languages, which manifestly avoid all but the most literal interpretations and cause limitless suffering for those foolish or masochistic enough to use them. These countless frustrations inevitably lead to undue pride in overcoming them, which leads people (or at least me) to replace a more rational regret over lost time with the temporary confidence of (almost always Pyrrhic) victory.

An optimistic assessment of the future of computation is that interfaces will become sophisticated enough to eliminate the need for almost anyone other than hobbyists to program a computer. Much research in artificial intelligence (and much of the most promising results as I understand them) has been in training computers to program themselves. Functional programming languages, to my untutored eye and heavily imperative mindset, already seem to train their programmers to think in a certain way. The correct syntax is the correct solution, in other words; and how far can it be from that notable efficiency to having the computer synthesize the necessary solutions to any technical difficulty or algorithmic refinement itself? (These last comments are somewhat facetious, though the promise of autoevolution was at the heart of cybernetics and related computational enthusiasms—the recent English translation of Lem’s Summa Technologiae is an interesting source here as is Lem’s “Golem XIV.”)

I can’t help but note that several of the arguments I’ve read that advise people not to learn to code and not to spend time teaching other people how to if you happen to be unlucky enough to be in a position to do so are written by people who make it clear that they themselves know how. (I’m thinking here in particular of Brian Lennon, with whom I’ve had several discussions about these matters on twitter and also David Golumbia.) Though I don’t think this myself, I could see how someone might describe this stance as obscurantist. (It’s probably a matter of ethos and also perhaps a dislike of people who exaggerate their technical accomplishments and abilities in front of audiences who don’t know any better—if you could concede that such things could exist in the DH community.)

*Paper Machines, though I haven’t tried it out, can now import and work with DfR requests. This may include topic modeling functionality as well.

**I have to admit that casual analysis (or, exacting scrutiny) of my server logs reveals that absolutely no one finds these topic browsers worth more than a few seconds’ interest. I haven’t yet figured out if this is because they are objectively uninteresting or if users miss the links because of the style sheet. (Or both.)

The Awakening of My Interest in Annular Systems

I’ve been thinking a lot recently about a simple question: can machine learning detect patterns of disciplinary change that are at odds with received understanding? The forms of machine learning that I’ve been using to try to test this—LDA and the dynamic LDA variant—do a very good job of picking up the patterns that you would suspect to find in, say, a large corpus of literary journals. The model I built of several theoretically oriented journals in JSTOR, for example, shows much the same trends that anyone familiar with the broad contours of literary theory would expect to find. The relative absence of historicism as a topic of self-reflective inquiry is also explainable by the journals represented and historicism’s comparatively low incidence of keywords and rote-citations.

I’ve heard from people on twitter that it’s a widely held belief that machine-learning techniques (and, by extension, all quantitative methods) can only tell us what we already know about the texts. I admit some initial skepticism about the prevalence of this claim, but I’ve now seen more evidence of it in the wild, so to speak, and I think I understand where some of this overly categorical skepticism comes from. A test of the validity of topic modeling, for example, would be if it produces a coherent model of a well-known corpus. If it does, then it is likely that it will do the same for an unknown or unread group of texts. The models that I have built of scholarly literature from JSTOR, I can see, are thought by some of the people who’ve seen them to be well-understood corpora. If the models reflect the general topics and trends that people know from their knowledge of the field, then that’s great as far as it goes, but we’ll have to reserve judgment on the great unread.

One issue here is that I don’t think the disciplinary history of any field is well understood. Topic modeling’s disinterested aggregations have the potential to show an unrecognized formation or the persistence of a trend long-thought dormant. Clancy found some clustering of articles in rhetoric journals associated with a topic that she initially would have labeled as “expressivist” from several decades before she would expect. Part of this has to do with the eclectic nature of what’s published in College English, of course, and part has to do with the parallels between creative writing and expressivist pedagogy. But it’s the type of specific connection that someone following established histories is not likely to find.

Ben Schmidt noted that topic modeling was designed and marketed, to some degree, as a replacement for keyword search. Schmidt is more skeptical than I am of the usefulness of this higher-level of abstraction for general scholarly research. I know enough about anthropology to have my eyebrows raised by this Nicholas Wade essay on Napoleon Chagnon, for example, and I still find this browser of American Anthropologist to be a quicker way of finding articles than JSTOR’s interface. I created this browser to compare with the folklore browser* of the corpus that John Laudun and I have been working with. We wanted to see if topic models would reflect our intuition that the cultural/linguistic turn in anthropology and folklore diffused through their respective disciplines’ scholarly journals (the folklore corpus contains the journal most analogous to American Anthropologist, The Journal of American Folklore, but it also has other folklore journals as well) at the expected time (earlier in anthropology than folklore).

A very promising, to my mind, way of correlating topic models of journals is with networks of citations. I’ve done enough network graphs of scholarly citations to know that, unless you heavily prune and categorize the citations, the results are going to be hard to visualize in any meaningful way. (One of the first network graphs I created, of all of the citations in thirty years of JAF, required zooming in to something like 1000x magnification to make out individual nodes. I’m far from an expert at creating efficient network visualizations, needless to say.) JSTOR once provided citation data through its Data for Research interface; it does not any longer as far as I know. This has been somewhat frustrating.

If we had citation data, taking two topics that both seem reflective of a general cultural/linguistic/poststructuralist influence, such as this folklore topic and this anthropological one, would allow us to compare the citation networks to see if the concomitant rise in proportion was reflected in references to shared sources. (Lévi-Strauss, for example, I know to be one of the most cited authors in the folklore corpus.) I would also like to explore the method described in this paper that uses a related form of posterior inference to discover the most influential documents in a corpus.**

This type of comparative exploration, while presenting an interesting technical challenge to implement (to me, that is, and I fully recognize the incommensurable gulf between using these algorithms and creating and refining them) can’t (yet) be mistaken for discovery. You can’t go from this to an a priori proof of non-discovery, however. Maybe no one is actually arguing this position, and I’m fabricating this straw argument out of supercilious tweets and decontextualized and half-remembered blog posts.

A more serious personal intellectual problem for me is that I find the dispute between Peter Norvig and Noam Chomsky to be either a case of mutual misunderstanding or one where Chomsky has by far the more persuasive case. If I’m being consistent then, I’d have to reject at least some of the methodological premises behind topic-modeling and related techniques. Perhaps “practical value” and “exploration/discovery” can share a peaceful co-existence.

*These browsers work by showing an index page with the first four words of each topic. You can then click on any one of the topics to see the full list of words associated with it, together with a list of articles sorted by how strongly they represent that topic. Clicking then on an individual article takes you to a page that shows the other topics most associated with that article, also clickable, and a link to the JSTOR page of the article itself.
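As a rough illustration of the index-page step, here is a minimal Python sketch (the browsers themselves were built with perl scripts, so this is a stand-in, not the actual code). It assumes each topic arrives as a line of the form "ID word word ...", like the ELH topic list printed elsewhere on this blog; a real MALLET keys file may carry an extra column that would need to be skipped. The function names are hypothetical.

```python
def parse_topic_line(line):
    """Split 'ID word word ...' into (topic_id, [words])."""
    topic_id, *words = line.split()
    return int(topic_id), words


def index_label(words, n=4):
    """Label a topic by its first n words, as on the browser's index page."""
    return " ".join(words[:n])


# Two lines abbreviated from the ELH topic list.
sample = [
    "0 love marriage lady lover woman desire",
    "45 milton paradise adam god satan lost",
]

index = {tid: index_label(words) for tid, words in map(parse_topic_line, sample)}
for tid, label in index.items():
    print(tid, label)
```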

**The note about the model taking more than ten hours to run fills me with foreboding, however. My (doubtlessly inefficient) browser-creating scripts can take more than an hour to run on a corpus of 10K documents, combined with another hour or more w/ MALLET and R. It really grinds down a person conditioned to expect instant results in today’s attention economy.

Two Topic Browsers

Ben Schmidt, in a detailed and very useful post about some potential problems with using topic models for humanities research, wondered why people didn’t commonly build browsers for their models. For me, the answer was quite simple: I couldn’t figure out how to get the necessary output files from MALLET to use Allison Chaney’s topic modeling visualization engine. I’m sure that the output can be configured to do so, and I’ve built the dynamic-topic-modeling code, which does produce the same type of files as lda-c, but I hadn’t actually used lda-c (except through an R package front-end) for my own models.

It occurred to me that a simple browser wouldn’t be that hard to build myself, so I made one for Clancy’s explorations of the rhetoric/composition journals in JSTOR and another for the theory corpus. (I did use Chaney’s CSS file.) I used my old graphs without the scatterplots layer for the theory-browser, as I didn’t want to take the time to regenerate those yet. And I’m not sure quite what’s going on with unicode/non-ASCII characters; theoretically the code I wrote should convert those properly. [UPDATE: Thanks to a pointer from Andrew Goldstone on twitter, I fixed the encoding issue. binmode, ":utf8" on all filehandles is the answer in perl at least.]

The articles shown for each topic are those that have that topic most strongly associated with them. It’s quite possible that other articles could have higher proportions of a topic but have another topic even more strongly associated with them. I should also rewrite the code so that it grabs all articles above a certain threshold of significance.
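The difference between the current rule and a threshold-based one is easy to see on toy data. What follows is a minimal Python sketch under stated assumptions: the article names and proportions are invented, the 0.3 cutoff is arbitrary, and neither function is the browser's actual code.

```python
def by_top_topic(doc_topics):
    """List each document only under its single strongest topic."""
    groups = {}
    for doc, props in doc_topics.items():
        top = max(range(len(props)), key=props.__getitem__)
        groups.setdefault(top, []).append((doc, props[top]))
    for docs in groups.values():
        docs.sort(key=lambda pair: pair[1], reverse=True)
    return groups


def by_threshold(doc_topics, threshold):
    """List each document under every topic whose proportion clears the cutoff."""
    groups = {}
    for doc, props in doc_topics.items():
        for topic, p in enumerate(props):
            if p >= threshold:
                groups.setdefault(topic, []).append((doc, p))
    for docs in groups.values():
        docs.sort(key=lambda pair: pair[1], reverse=True)
    return groups


# Hypothetical proportions for three articles over three topics.
doc_topics = {
    "article-a": [0.50, 0.45, 0.05],
    "article-b": [0.10, 0.40, 0.50],
    "article-c": [0.20, 0.70, 0.10],
}

print(by_top_topic(doc_topics)[1])       # only article-c appears under topic 1
print(by_threshold(doc_topics, 0.3)[1])  # article-a and article-b now appear too
```

Under the first rule, article-a never shows up under topic 1 even though nearly half of it belongs there; the threshold rule recovers it at the cost of longer lists.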

The Stronghold of Bioinformatics

No one likes gamification or MOOCs, as far as I can tell. What I should say is that anyone trained in the hermeneutics of suspicion might even find it hard to accept their existence. It’s hard to come up with a hypothetical concept that would cry more piteously to the heavens for critique, for example. True to form, until a few weeks ago I had never earned a badge in my life and would have regarded the prospect of doing so with contempt and a touch of pity for whoever was naive enough to suggest it.

Then, there was this Metafilter post. Things I’ve discovered via Metafilter have taken away many months of work-time over the years, so the sensible thing to do would be to quit reading it. But that’s unlikely. In any case, Project Rosalind is a series of programming problems related to bioinformatics. It has the gamified features of “levels,” “badges,” “achievements,” and even, God help me, “xp.” There is a series of problems related to string processing, probability, and other topics. They have a tree-like structure, and you have to solve precursor problems before getting access to the later ones. Solving a problem involves downloading a dataset and submitting a solution within five minutes. After you’ve solved the problem, you can see the code that others have posted to solve the problem.

This feature is particularly interesting to me, as I have never really learned functional programming, so when I see solutions to problems that I have solved in perl in languages such as Haskell, Clojure, or Scala, it’s a bit easier to understand how they were put together. (Rosetta Code is another place to see programming problems solved in multiple languages.) You are allowed unlimited attempts to get the right answer, and you can see forum questions about the problem after two unsuccessful tries. (I have posted a question once—a rather idiotic question in retrospect—and I received a correspondingly withering response, whose impact I mitigated somewhat by imagining it spoken in the Comic Book Guy’s voice.)

I have, at this point, solved twenty-two of the ninety-three problems. The early ones are trivial, but I’m finding the difficulty to be scaling up quite a bit. I’ve used some algorithms I had never worked with before, such as suffix trees and shortest-superstring. I’ve also used arbitrarily nested loops in perl (with Algorithm::Loops) and contemplated the theoretical limits of what a regular expression can match more than I’ve had to before. It’s also quite interesting to see what the total numbers of problems solved reveal about people’s background knowledge. Two of the problems involving Mendelian inheritance and probability have been solved far fewer times, proportionally, than (more difficult) string-processing problems. (I don’t mean to be a hypocrite in saying this, as I got tired of the Punnett-squares required in the second one of those and haven’t solved it myself.)

Some of the gamified features of the site I regard as silly (levels, xp, badges, achievements), but I admit that I can’t help but be motivated by the statistical information about how many people have solved which problems. It triggers my instinctual competitiveness, somehow. They even seem to encourage people to post their country of origin to introduce nationalism into the competitive mix here. As a learning tool, I’m not sure how effective it is. It’s quite possible to solve many of the problems while retaining only the barest minimum about the underlying molecular biology, and problems which require a bit more conceptual understanding than that (see the Mendelian inheritance ones above) are comparatively ignored.

The programmatic checking of solutions is also somewhat finicky. An end-of-line character at the end of the file will cause an otherwise correct solution to fail for at least some of the problems, for example. But all in all, I’m very impressed with this site and think it has a lot of potential in teaching people (humanists, for example) how to program. It would be nice to be able to reuse the code with different problem sets, if they ever decide to release the source in the future.

UPDATE:

I corrected a few mistakes (I gave myself an extra problem, for instance), and I also wanted to mention an important precursor: Project Euler. This site has mathematics problems, and it also seems a bit more streamlined. I haven’t actually used it yet, though.

Topics in Theory

After experimenting with topic models of Critical Inquiry, I thought it would be interesting to collect several of the theoretical journals that JSTOR has in their collection and run the model on a bigger collection with more topics to see how the algorithm would chart developments in theory.

I downloaded all of the articles (word-frequency data for each article, that is) in New Literary History, Critical Inquiry, boundary 2, Diacritics, Cultural Critique, and Social Text. I then ran a model fitted to one-hundred topics. I had to adjust the stop-word list to account for common words and, unsuccessfully, for words in other languages. What I should have done was use the supplied stop-word lists in those languages as well. At least this way there is a chance that interesting words in those languages will cluster together.

The topics themselves looked good, I thought. One hundred was about the right number, as I didn’t see much evidence of merging or splitting. I should say rather that I saw an acceptable level, or the usual level. This topic, for example, shows what I mean: “aboriginal rap[?] women australian climate weather movement work warming time australia housework change social power oroonoko[?] make wages years.” I also didn’t lemmatize this corpus, although I know how to. Lemmatizing takes a lot of time the way I’m doing it (using the WordNet plugin of the python Natural Language Toolkit). And I frankly haven’t been that impressed with the specificity of the lemmatized models that I have run.

Visualizing changes in topics over time is quite difficult. Each year will have thousands of observations per topic and taking the mean of each topic per year doesn’t always produce very readable results. Benjamin Schmidt suggested trying the geom_smooth function of ggplot2, which I never had much luck with. The main reason, I found, that I couldn’t get it to work very well is that I was trying to create a composite graph of every topic using facet_wrap. Each topic graphed by itself with geom_smooth produced better results.

Here, for example, is the graph for this coherent topic, “gay sexual queer sex lesbian aids sexuality homosexual men homosexuality identity heterosexual male gender desire social lesbians drag butler”:
Graph of Change over Time in "Queer Theory" Topic from Theory Journals

The chronology you see above does approximately track the rise of queer theory, though the smoothing algorithm is full of mystery and error. A scatter-plot of the same graph would be far noisier and also not reveal much in the way of change over time. This topic should also correlate somewhat roughly to postcolonial theory: “indian india hindu colonial postcolonial subaltern british indians nationalist gandhi english bengali religious caste nationalism sanskrit maori bengal west”:
Postcolonial Topics over Time in Theory Journals

I’m suspicious of this linear increase, needless to say. The underlying data is messier. Would Marxist theory show any decline around the predictable historical period? (Terms: “social class theory ideology political production ideological historical marxist marx bourgeois capitalist society capitalism marxism economic labor relations capital”)

Topics in Marxist Theory over Time in Theory Journals

That is roughly what I was expecting. But compare “soviet party revolutionary socialist revolution socialism communist political national left union struggle europe russian fascism war central movement european”:

Communist Theory Topics over Time in Theory Journals

I have hopes for the exploratory potential of topic-modeling disciplinary change this way. Another interesting topic that shows a linear-seeming increase (“muslim islamic islam religious arab muslims secular arabic algerian orientalism rushdie religion iranian iran western turkish ibn secularism algeria”):
Islamic Topics over Time in Theory Journals

To show what the data looks like with different visualizations, I’m going to cycle through several types of graphs of the above topic. The first is a line graph:
Line graph

Next is a scatter-plot:

Point-graph

Now a scatter-plot with the scale_y_log10 function applied:
Point (Log10)

And a yearly mean:
Yearly mean

Finally, a five-year mean:
Five-year mean

All of the graphs reveal a general upward trend, I think, though not as much as the smoothing function does. I would be delighted to hear any ideas anyone has about better ways to graph these. I’ve not found any improvements in grouping by document rather than year.

There’s more I plan to do with this data set, including coming up with better ways to visualize it (more precision, efficient ways of seeing many at once, etc.). I am including the full list of topics after the fold for reference. Some reveal OCR errors; others are publishing artifacts that my first rounds of stopping didn’t yet remove.

Update (2/14/12): I created a browser of this model that shows the articles most closely associated with each topic.


Same Stuff, Different Graph

When I started experimenting with graphing changes in topic-proportions over time, I didn’t pay much attention to the design of the graph. I could see that it was far too busy, but I assumed that this would be relatively easy to adjust using ggplot2‘s many parameters.

It wasn’t. It didn’t take me too long to figure out that I needed to change the data from discrete to continuous in order to see anything like a sparkline, but it was also apparent from the other data sets I was working with that taking the mean at intervals was the only way to make a reasonably clean graph. I ended up using the aggregate function to create the n-year averages, though I read some intriguing descriptions of the power of data.tables in R. (I refuse to ask for help on stackoverflow, even though it would have saved many hours’ worth of work. Character flaw.)
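The idea of taking the mean at intervals can be sketched in a few lines. This is a plain-Python stand-in for what the post did with R's aggregate function, not a translation of that code; the function name and the sample observations are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def n_year_means(observations, n=5):
    """Average (year, proportion) observations within n-year bins.

    Bins are labeled by their first year; n=1 gives plain yearly means.
    """
    bins = defaultdict(list)
    for year, proportion in observations:
        bins[year - year % n].append(proportion)
    return {start: mean(values) for start, values in sorted(bins.items())}


# Hypothetical proportions of one topic in five documents.
observations = [(1990, 0.10), (1991, 0.20), (1994, 0.30), (1995, 0.40), (1999, 0.60)]

five_year = n_year_means(observations, n=5)
for start, value in five_year.items():
    print(start, round(value, 3))
```

Wider bins trade detail for legibility: the same five points collapse into two values here, which is the kind of smoothing that makes a many-topic graph readable.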

I now need to learn how to use the reshape package, with its wonderfully named ‘melt’ and ‘cast’ features, to rewrite the code I’m using to change rows to columns. A simple for-loop iteration over a data-frame in R can take hours, I’ve learned; and I expect that this other solution would finish the job in seconds.
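What the rows-to-columns step amounts to can be shown without R at all. The sketch below is in Python and only mimics the idea behind 'melt' and 'cast'; it is not the reshape package's API, and the function names and sample rows are hypothetical.

```python
def cast_wide(long_rows, n_topics):
    """Turn (doc, topic, proportion) rows into one row per doc, one column per topic."""
    wide = {}
    for doc, topic, proportion in long_rows:
        # Topics missing for a document default to 0.0.
        wide.setdefault(doc, [0.0] * n_topics)[topic] = proportion
    return wide


def melt_long(wide):
    """Inverse step: one (doc, topic, proportion) row per cell of the wide table."""
    return [(doc, topic, p) for doc, row in wide.items() for topic, p in enumerate(row)]


# Hypothetical long-format rows.
long_rows = [("doc-a", 0, 0.7), ("doc-a", 1, 0.3), ("doc-b", 1, 1.0)]

wide = cast_wide(long_rows, n_topics=2)
print(wide)
print(melt_long(wide))
```

Each input row is visited once, which is why a single pass of this kind tends to be much faster than repeatedly growing a data frame inside a loop.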

Anyway, here’s the revised graph of ELH with annual means of topic-proportions:
Graph of Topics in ELH

The full list of topics can be found in my previous post.

Visualizing Topics in ELH

I was impressed with Ian Milligan’s visualizations of Canadian parliamentary debates, and I wanted to try to visualize some of the topic models I’ve been creating from JSTOR’s Data for Research.

ELH I thought would be an interesting journal to try, as it publishes articles in each issue on quite a range of literary periods, often ranging from medieval to twentieth-century material. I assumed that LDA would be likely to identify each of these periods as a topic. To test this, I downloaded the entire set of articles from JSTOR and created a fifty-topic model. From there, I wanted to chart the proportion of each topic in each document. I was able to import the data in R and use ggplot2 to create the following graph:

ELH-graph

As you can see, many of these topics are identifiable from even two-word samples. Others show a need of lemmatizing (a slow process using the python NLTK, though effective), or of further splitting. Perhaps fifty topics is not quite enough.

The code has to transform row data to column-form in order to be efficiently sorted. It then uses the ggplot2 facet_wrap feature to create the graph. I’d be happy to share it, if anyone’s interested, though it uses a for-loop, which I understand to be bad R. You also have to pre-process the JSTOR files to associate dates with the files themselves. I have a perl script for this.

For reference, here is the complete list of topics generated by MALLET:
0 love marriage lady lover woman desire young lovers passion wife sexual beauty friendship husband heart story loves relationship world
1 women female sexual male woman gender men sexuality desire feminine masculine mary sex mother lady patriarchal early domestic feminist
2 place house back scene light great description eyes passage water space sea makes night earth city landscape man day
3 body human bodies medical scientific science physical natural bodily disease nature medicine mental health physiological early yellow james john
4 renaissance english modern book bacon early thomas latin humanist elizabethan classical utopia richard sixteenth cambridge england tudor erasmus knowledge
5 world life experience human sense reality mind personal feeling consciousness imagination real vision man modern individual emotional felt feel
6 medieval english middle arthur piers green poem gawain knight play poet lancelot bat st late courtly sir hym plowman
7 american war whitman poe america political conrad public literature jim world secret adams york marlow walt united leaves german
8 moral man virtue social human character fielding nature good natural society characters sentimental sympathy hero morality tom irony action
9 yeats keats tennyson marvell poem garden art flowers herrick victorian andrew poet nymph stanza beauty swinburne idylls green myth
10 wordsworth coleridge romantic byron blake poem poetry poetic william lyrical poet romanticism prelude lines nature imagination mind book wordsworthian
11 shakespeare play hamlet scene king dramatic tragedy richard othello plays macbeth action audience act shakespearean speech tragic drama measure
12 shelley political burke revolution french mary sublime caleb revolutionary radical rousseau godwin romantic historical wollstonecraft falkland reform prometheus frankenstein
13 chaucer tale troilus medieval tales canterbury prologue wife criseyde man book fortune courtly nat pardoner story knight ye lydgate
14 social literary cultural history historical culture political text form modern discourse literature work forms individual critique critical texts reading
15 johnson pope swift dryden satire addison gulliver satiric augustan wit boswell samuel essay restoration eighteenth lines spectator satirist poem
16 narrative story narrator fiction reader history characters plot tale book romance events novels readers character stories truth fictional text
17 irish scott historical national ireland gothic scottish history english british nation waverley scotland past castle novels ancient family antiquarian
18 church religious catholic protestant religion puritan john england english reformation bishop body roman anglican ecclesiastical christian st real argument
19 language words word speech meaning text reading reader writing rhetorical linguistic style voice verbal rhetoric read speak discourse sense
20 law legal family clarissa father pamela marriage property richardson lovelace child incest daughter letter rape contract lady criminal miss
21 spenser faerie book queene allegory pastoral canto knight allegorical guyon poem colin arthur red britomart venus poet books nature
22 death life time past dead nature man memory present loss child living world natural end soul mother back voice
23 sonnet sonnets line english music lines verse song musical form lyric sound rhyme sequence italian songs stanza lyrics opera
24 woolf public lamb virginia forster society burney social room miss goldsmith evelina elia bloomsbury lily young sheridan peter house
25 english british colonial european england national crusoe imperial cultural island empire indian early foreign spanish east trade west india
26 nature human man mind natural reason world things theory thought truth ideas knowledge philosophy idea philosophical object form imagination
27 jane dickens victorian lucy austen novels david charlotte bleak miss pip sir trollope catherine wuthering emma bronte fanny lady
28 literary english century literature criticism critical history works poetry critics great writers essay art modern work eighteenth age influence
29 make good made man great life men end give put find true time left found long mind thought things
30 black white american slave hawthorne racial slavery race african melville slaves baldwin negro scarlet identity ahab hester southern sentimental
31 social economic class money society labor economy public market trade commercial poor exchange domestic wealth system private property city
32 part point view made general important time fact work kind present earlier sense passage effect form make similar found
33 joyce stephen hardy wilde bloom ulysses james tess molly artist portrait young finnegans wake ford jude chapter dorian father
34 poem poetry poet poems poetic speaker lines poets line verse stanza stevens lyric work reader thy song williams elegy
35 letter book published letters edition writing written years john printed text literary early time william wrote books work author
36 eliot george pater jewish james victorian daniel jews henry gwendolen deronda marius life jew dorothea social adam maggie middlemarch
37 dracula sterne animal tristram beckett stein shandy animals stoker yorick henry horses lucy smart murphy journey dogs mechanical uncle
38 marlowe faustus ovid epic classical virgil tamburlaine dido chapman aeneas ovidian roman hercules gods georgic hero myth aeneid georgics
39 makes power place question order suggests simply means act response relationship fact terms sense role identity claim precisely critics
40 sidney elizabethan sir pastoral lady queen elizabeth essex beowulf stella court philip arcadia ralegh earl countess sonnet poet lord
41 god christian christ spiritual religious divine man grace st holy biblical faith john bible soul sin church word doctrine
42 desire figure body object subject image text violence power representation narrative scene form relation moment pleasure trans fantasy gaze
43 political king james english royal power history john state england henry charles government politics civil court war lord public
44 image world time form vision images order structure pattern symbolic meaning movement symbol figure imagery final process physical metaphor
45 milton paradise adam god satan lost eve samson book fall poem heaven epic son evil hell divine fallen sin
46 art pound aesthetic browning painting work thoreau visual ruskin blake artist cantos ezra plate aesthetics arts museum canto paintings
47 donne hath thy doth thou john doe good elizabethan henry haue owne made sir man renaissance bee thomas world
48 play plays stage jonson drama theater audience theatrical dramatic performance comedy masque sir restoration comic theatre scene actors ben
49 translation french latin il hebrew und se di ne cf ut english dans die version sed renaissance par quod

Experimenting with Dynamic Topic Models

When I first began reading about topic modeling, I very much wanted to experiment with “dynamic” topic modeling, or the tracking of changes in topics over time. David Blei and John Lafferty describe their algorithm in this paper. They also have made a dynamic topic model browser of Science available. I was very impressed with this project and wanted to apply the technique to journals in the humanities using JSTOR’s Data for Research (DfR).

Thankfully, the source code for creating dynamic topic models is also available. (Be sure to apply the patch listed under “Issues,” or you may get segmentation faults, as I did on my computer.) Building this code into an executable program may require several steps, depending on your set-up. If you use Linux, it’s just a matter of installing the GNU Scientific Library (GSL). For a Mac, you have to download the Xcode command-line tools from Apple, and then install GSL. For Windows, I’d guess that Cygwin would be the best way to go, though I used to compile things with Visual C++ that were written for Unix systems without too much trouble.

The code itself will require a matrix of word frequencies in the ldac format (I use text2ldac to generate these), and a file of time-sequences. The matrix can be a bit tricky to generate from the DfR data. You first have to decide what kind of time-slices you want, and then you have to rename all of the files to reflect their date (the DOIs will not automatically correlate to date of publication). I wrote a perl script to do this. It is sadly far too embarrassing to share with anyone at this point, but I’ll see what I can do if there’s any interest. You then need to count how many documents are in each time-slice. I also used a perl script to do this. Once you have that information, the “-seq.dat” file will have the total number of time-slices on the first line, followed by the number of documents in each time-slice on the following lines.
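
For anyone who wants a starting point, here is a minimal Python sketch of the counting step just described. It is not the Perl script mentioned above. The function name, the fixed-width slices, and the choice to write 0 for an empty slice are all assumptions made for illustration. The post does not say whether the dtm code tolerates an empty slice.

```python
from collections import Counter

def seq_dat_lines(years, start, width):
    """Return the lines of a '-seq.dat' file: the number of time-slices,
    then the number of documents in each slice (hypothetical helper)."""
    if not years:
        raise ValueError("need at least one publication year")
    # Map each publication year to a zero-based time-slice index.
    counts = Counter((year - start) // width for year in years)
    n_slices = max(counts) + 1
    # Assumption: a slice with no documents is written as 0.
    return [str(n_slices)] + [str(counts.get(i, 0)) for i in range(n_slices)]

# Four documents, sliced by decade starting in 1900.
print("\n".join(seq_dat_lines([1900, 1905, 1912, 1931], start=1900, width=10)))
```

The counts only make sense if the documents in the ldac matrix are in the same chronological order, which is why the files are renamed by date first.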

Now you can execute the dtm code. It has many options, and there’s a shell script in the main directory which outlines several of them for you. I would just copy that code and make the necessary changes. Once it has finished running, it will generate a series of probability matrices for each term in each topic at each time-slice in the “lda-seq” directory of the output folder.

I used R to read these and correlate them with the “.vocab” file, or word-list, that text2ldac generates. You have to create a matrix with the correct number of time-slices, import the word list, and then sort by each column to get the topics for each time. I wrote an R function to do this as well, but it’s, if anything, more embarrassing than the perl.
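
The sorting step can be sketched in Python as well. This is not the R function mentioned above. It assumes the values for one topic have already been read into a flat list laid out with one row per vocabulary term and one column per time-slice. Check that layout against your own output before trusting the result. The function name and the toy numbers are made up for illustration.

```python
def top_terms_by_slice(values, vocab, n_slices, k=10):
    """Return the k highest-weighted terms for each time-slice.

    values: flat list of weights, assumed to be ordered term by term,
            with n_slices consecutive numbers per term.
    vocab:  list of words, in the same order as the rows of the matrix.
    """
    if len(values) != len(vocab) * n_slices:
        raise ValueError("values must hold len(vocab) * n_slices numbers")
    result = []
    for t in range(n_slices):
        # Column t holds the weight of every term at time-slice t.
        column = [(values[i * n_slices + t], word) for i, word in enumerate(vocab)]
        column.sort(key=lambda pair: pair[0], reverse=True)
        result.append([word for _, word in column[:k]])
    return result

# Toy example: three terms, two time-slices, weights as log-probabilities.
vocab = ["french", "german", "mardi"]
values = [-1.0, -2.0,   # french
          -2.0, -1.5,   # german
          -3.0, -0.5]   # mardi
for terms in top_terms_by_slice(values, vocab, n_slices=2, k=2):
    print(terms)
```

Sorting works the same way whether the numbers are probabilities or log-probabilities, since the logarithm preserves order.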

My colleagues John Laudun and Clai Rice have been working on citation networks in The Journal of American Folklore. We are wondering if correlating regular LDA topic models or dynamic ones with changes in citation frequencies can reveal anything about disciplinary changes. To test the dynamic topic model, I ran it on the research articles in this journal from 1900-2010. I generated 50 topics, which I won’t reproduce all of here. This topic is of local interest, however:

[1900-1909] “french” “dream” “german” “dreams” “yuma” “mohave” “ethnic” “local” “louisiana” “american”
[1910-1919] “french” “dream” “german” “dreams” “yuma” “mohave” “ethnic” “local” “louisiana” “american”
[1920-1929] “french” “dream” “german” “dreams” “yuma” “mohave” “ethnic” “local” “louisiana” “american”
[1930-1939] “french” “german” “dream” “dreams” “ethnic” “yuma” “mohave” “local” “louisiana” “american”
[1940-1949] “french” “german” “dream” “dreams” “ethnic” “local” “louisiana” “mohave” “yuma” “identity”
[1950-1959] “french” “german” “dream” “ethnic” “dreams” “local” “identity” “louisiana” “american” “group”
[1960-1969] “german” “french” “ethnic” “dream” “local” “identity” “dreams” “group” “american” “louisiana”
[1970-1979] “german” “ethnic” “french” “identity” “local” “dream” “mardi” “group” “gras” “american”
[1980-1989] “german” “ethnic” “french” “identity” “mardi” “local” “gras” “social” “american” “group”
[1990-1999] “mardi” “gras” “ethnic” “identity” “french” “german” “local” “american” “people” “louisiana”
[2000-2010] “mardi” “gras” “ethnic” “identity” “french” “american” “canadian” “people” “local” “cultural”

I am not ready to make any interpretive claims yet. Another thing I’ve been working on is using the WordNet lemmatizer in Python’s Natural Language Toolkit to lemmatize the corpus before topic-modeling it. It works, but it is also very slow.

Creating Topic Models with JSTOR’s Data for Research (DfR)

Here are some instructions for creating the same types of topic models of JSTOR’s journals that I did with Critical Inquiry and Signs.

These instructions are designed for someone using a Mac or Linux platform. (The differences below between using Linux and a Mac should be apparent to anyone who uses Linux, so I’m not going to indicate them here; it’s mainly where files are stored.) All of this should work on Windows, but you’ll need to install Cygwin or use alternate shell commands. MALLET has slightly different installation instructions for the Windows platform as well, I believe.

  1. Download and install MALLET.
    • Download the file (I assume it will be in /Users/yourusername/Downloads)
    • open Terminal/shell (this is in the Applications/Utilities folder on a Mac)
    • cd to the directory where you downloaded it (something like:
      cd /Users/yourusername/Downloads
      if you use Downloads as default directory.)
    • Now enter these commands:
      tar -xzvf mallet-2.0.7.tar.gz
      cd mallet-2.0.7
  2. Download and unzip DfR data.
    • Create a DfR account. Log in.
    • Find the journal you want. Make sure the total number of issues is less than 1000 or be content with a random sample. (You can request a higher limit with an explanation of why you need it, or you can download multiple files.)
    • Go to the “Submit New Request” tab.
    • Select Citations Only AND Word Counts.
    • Select CSV for Output Format.
    • Give a job title.
    • Click “Submit Job”
    • Wait for notification email.
    • When you get it, go to “My Requests” page,
    • Use your browser’s “Save As” feature to download the “Full Dataset” file to the MALLET directory (I’m assuming it’s /Users/yourusername/Downloads/mallet-2.0.7).
    • Go to terminal. Type
      unzip 2012..[bunch of numbers]..zip
  3. Pre-processing the JSTOR data.
    • Download Andrew Goldstone’s count2txt. Save it in the same directory you’ve been using.
    • Type
      perl -v
    • Unless the output of that command says “perl 5, version 14,” open count2txt in a text editor.
    • Find the line of the code that reads use v5.14;.
    • Delete the line, or add a # before it. (You could also change “14” to the version of perl that you use.)
    • Enter
      perl count2txt --multifile wordcounts/*.CSV
    • Create a txt-files only directory for MALLET to work on:
      mkdir text
      cp wordcounts/*.txt text
  4. Run MALLET
    • Enter (note that this—and all other commands here—should all be on the same line):
      bin/mallet import-dir --input text --output topic-input.mallet --keep-sequence --remove-stopwords
    • Now run the topic-modeler:
      bin/mallet train-topics --input topic-input.mallet --num-topics 10 --output-topic-keys jstor.model.txt
    • Now look at file jstor.model.txt for your results.

    You probably will want to add more topics than 10, but this shouldn’t take too long for a first experiment. MALLET also has a lot of parameters you can experiment with. You’ll probably want to add your own stop words. In the same directory, you can create a list with your own stop words in a text editor and save it as “stop.txt”. Then, try this command to re-create the MALLET input file:
    bin/mallet import-dir --input text --output topic-input.mallet --keep-sequence --remove-stopwords --extra-stopwords stop.txt
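
If you would rather inspect jstor.model.txt programmatically than in a text editor, a small Python sketch like the following might help. It assumes each line of the topic-keys file holds a topic number, a weight, and a space-separated word list, with tabs between the three fields. Confirm that against your own file, since the layout may differ between MALLET versions. The function name is hypothetical.

```python
def parse_topic_keys(text):
    """Parse topic-keys text into {topic_id: (weight, [words])}.

    Assumes three tab-separated fields per line: id, weight, words.
    """
    topics = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics[int(topic_id)] = (float(weight), words.split())
    return topics

# Made-up sample in the assumed layout.
sample = "0\t0.5\tblack music jazz\n1\t0.5\tmeaning theory\n"
for topic_id, (weight, words) in parse_topic_keys(sample).items():
    print(topic_id, weight, " ".join(words))
```

With a real file, read its contents with open("jstor.model.txt", encoding="utf-8").read() and pass the result to the function.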

Two Critical Inquiry Topic Models

Here are two topic models of Critical Inquiry, generated with the same algorithm but different implementations: MALLET and the R topicmodels package. The two runs used slightly different stopword lists, and the latter was also generated with a minimum word frequency of seven:

0 black music musical white african jazz sound performance american racial negro song cultural sounds race rap cage singer composer
1 meaning theory interpretation question philosophy language point claim philosophical sense truth fact argument knowledge intention metaphor text account speech
2 american duke james john trans culture life william things modern cambridge michael david robert soviet shame henry york objects
3 trans time question subject derrida language place order object relation word thing reading moment longer things work thought writing
4 history historical narrative discourse account contemporary terms status context social ways relation discussion essay sense form representation specific position
5 god christian history religious greek ancient modern tradition divine century body early philosophy latin nature religion medieval church soul
6 film cinema films camera screen images frame image movie theater shot early visual cinematic narrative kiss hollywood scene documentary
7 science scientific human knowledge media theory sciences natural studies life social history technology communication machine humanities disciplines system psychology
8 body time game process space affect play form motion hand ways level attention bodies turn making figure physical parts
9 political social politics cultural power culture theory society ideology critique intellectual ideological state economic class liberal struggle revolution marx
10 law legal public case justice trial political war court state violence rights states moral crime speech abuse slave united
11 art painting work visual image picture artist images fig paintings works artists artistic photography aesthetic museum photograph objects object
12 form work time terms art nature individual structure order reality analysis general style works experience concept process elements theory
13 literary literature criticism text reading book work critics texts writing reader fiction language english author readers works read critic
14 life human moral good man sense great experience work fact kind find personal idea mind character people view social
15 time years people day great long house young called read book wrote man word times year men left english
16 story love death man dead face life eyes point scene narrative moment long real james stories heart narrator characters
17 italian di del fig della il fascist spanish inca ii italy autumn giovanni st saint che text building verdi
18 german history benjamin trans historical freud von germany art modern das memory panofsky essay berlin early hegel walter war
19 public war time national city american education work social economic space people urban culture corporate building united market business
20 poetry poem poet language poems poetic poets english lines literary lyric word romantic text verse poetics prose pound milton
21 women sexual female woman male feminist desire sex men sexuality mother gender freud identity body psychoanalytic child psychoanalysis feminine
22 jewish jews israel israeli palestinian state arab jew religious land people political religion identity palestinians muslim islamic rabbi al
23 french en france title sur qui dans paris une paul text ne foucault letter est man jean derrida au
24 cultural european culture colonial western american national chinese african indian native english identity white british racial race south africa

And

[0] “city” “public” “space” “museum” “memorial” “people” “place” “building” “time” “national” “american” “work” “war” “aboriginal” “landscape” “architecture” “land” “fig” “vietnam”

[1] “political” “social” “power” “theory” “politics” “historical” “economic” “class” “history” “work” “form” “foucault” “point” “society” “ideological” “critique” “notion” “production” “order”

[2] “music” “musical” “style” “art” “work” “time” “form” “history” “works” “analysis” “theory” “spatial” “historical” “panofsky” “structure” “individual” “temporal” “concept” “specific”

[3] “black” “white” “american” “cultural” “culture” “african” “racial” “western” “european” “chinese” “indian” “identity” “race” “duke” “social” “colonial” “political” “national” “history”

[4] “women” “sexual” “female” “woman” “feminist” “male” “desire” “freud” “men” “love” “sex” “mother” “psychoanalysis” “sexuality” “man” “gender” “body” “father” “power”

[5] “life” “human” “things” “natural” “nature” “history” “social” “man” “sense” “people” “objects” “time” “thing” “kind” “society” “culture” “cultural” “modern” “object”

[6] “science” “knowledge” “scientific” “studies” “cultural” “social” “history” “academic” “sciences” “education” “american” “humanities” “culture” “work” “study” “disciplines” “students” “discipline” “political”

[7] “slave” “american” “slavery” “york” “time” “law” “shaw” “color” “legal” “story” “captain” “isabel” “political” “slaves” “case” “point” “man” “kafka” “margaret”

[8] “novel” “book” “fiction” “text” “reader” “reading” “work” “james” “read” “characters” “character” “writing” “author” “time” “case” “readers” “moral” “novels” “life”

[9] “war” “german” “memory” “history” “trial” “man” “time” “historical” “political” “life” “holocaust” “trauma” “years” “nazi” “story” “work” “people” “benjamin” “french”

[10] “painting” “picture” “image” “art” “point” “work” “figure” “fig” “mirror” “face” “portrait” “visual” “left” “pictures” “figures” “paintings” “hand” “man” “painter”

[11] “narrative” “story” “events” “historical” “time” “form” “order” “history” “sense” “text” “man” “structure” “discourse” “point” “autumn” “relation” “well” “myth” “human”

[12] “art” “work” “aesthetic” “painting” “works” “modern” “history” “object” “artistic” “artist” “form” “objects” “photography” “modernist” “artists” “abstract” “kind” “historical” “modernism”

[13] “jewish” “god” “religious” “christian” “israel” “palestinian” “israeli” “political” “jews” “history” “state” “religion” “arab” “people” “authority” “power” “palestinians” “benjamin” “jew”

[14] “law” “public” “political” “legal” “speech” “state” “social” “moral” “justice” “rights” “liberal” “child” “court” “case” “violence” “power” “society” “private” “government”

[15] “literary” “literature” “criticism” “work” “history” “writing” “critics” “language” “theory” “texts” “american” “works” “culture” “cultural” “essay” “social” “historical” “sense” “critic”

[16] “poetry” “poem” “poetic” “poet” “poems” “language” “poets” “english” “form” “sense” “lyric” “word” “milton” “lines” “voice” “well” “speaker” “poetics” “kind”

[17] “film” “image” “images” “cinema” “films” “camera” “time” “photograph” “visual” “photographs” “space” “screen” “frame” “photography” “early” “body” “shot” “kiss” “object”

[18] “media” “time” “human” “communication” “game” “machine” “system” “benjamin” “work” “technology” “theory” “modern” “form” “trans” “social” “war” “italian” “medium” “early”

[19] “philosophy” “derrida” “question” “philosophical” “man” “thought” “word” “language” “time” “truth” “sense” “heidegger” “order” “trans” “knowledge” “things” “idea” “work” “place”

[20] “pound” “action” “tragedy” “kind” “kane” “pluralism” “burke” “comic” “thought” “terms” “american” “booth” “nature” “play” “evidence” “drama” “form” “work” “story”

[21] “history” “french” “american” “time” “historical” “modern” “century” “latin” “soviet” “en” “published” “france” “qui” “work” “years” “medieval” “culture” “united” “early”

[22] “jazz” “joyce” “irish” “theater” “play” “ulysses” “modern” “blackface” “suicide” “singer” “bloom” “book” “english” “riddle” “stage” “great” “louis” “jewish” “body”

[23] “meaning” “language” “interpretation” “text” “theory” “metaphor” “reading” “point” “speech” “fact” “intention” “sense” “interpretive” “question” “context” “linguistic” “word” “argument” “literary”

[24] “body” “death” “person” “experience” “human” “life” “psychology” “dead” “shame” “psychological” “mind” “work” “case” “theory” “affect” “bodies” “time” “mental” “torture”

Critical Inquiry is a famous journal, and, while I’m familiar with its contents, I can’t quite decide how well these topics represent it. I’d be curious to hear any thoughts you might have. My ultimate goal is to work on correlating topic modeling with citational analysis (though not with this particular journal of course, because of “The Footnote, in Theory” by Anne H. Stevens and Jay Williams).

Topic Modeling Signs

Natalia Cecire tweeted during the topic-modeling workshop that she was momentarily excited by thinking that a presentation on the journal Science was on Signs: Journal of Women in Culture and Society. As it turns out, I have been experimenting with creating topic models from JSTOR’s Data for Research, and I decided to see what the Signs corpus would come up with.

I downloaded word-frequency data for all the issues of the journal. I then used a script to convert the CSV files into *.txt files with the word frequencies duplicated (basically the same approach described by Andrew Goldstone here. UPDATE: Andrew wrote some more extensive code for this task.) I then used text2ldac, a Python script, to convert the text files into a sparse matrix readable by the LDA algorithms.
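
To make the “word frequencies duplicated” step concrete, here is a minimal Python sketch. It is neither the script I used nor Andrew Goldstone’s code. It assumes each word-count CSV has a header row followed by word,count rows. The function name and sample data are made up for illustration.

```python
import csv
import io

def counts_to_text(csv_text):
    """Expand word,count rows into a bag of words, repeating each word
    as many times as its count (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # assumption: the first row is a header
    words = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        word, count = row[0], int(row[1])
        words.extend([word] * count)
    return " ".join(words)

sample = "WORDCOUNTS,WEIGHT\nwomen,3\nlabor,1\n"
print(counts_to_text(sample))
```

The original word order is not recoverable from counts, so the resulting text is only useful for bag-of-words methods such as LDA.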

To generate the topic models, I am currently using the R package topicmodels. (There are additional details about how to import the ldac files created by text2ldac into R, but I’ll skip those for the time being.) I ran a 25-topic model with Gibbs sampling on the Signs data, with a minimum frequency of words set at seven (text2ldac handles this as an option). I also used a stop-list of common English words, plus the name of the journal and a few French articles and prepositions.
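
For readers who have not seen the ldac format, the sketch below shows roughly what the conversion involves. Each document becomes one line giving the number of distinct terms, followed by term_id:count pairs. This is not text2ldac itself. The alphabetical vocabulary order and the reading of “minimum frequency” as a corpus-wide count are assumptions, and the function name is hypothetical.

```python
from collections import Counter

def to_ldac(docs, stopwords=frozenset(), min_freq=1):
    """Return (vocab, lines): a word list and one ldac-style line per document.

    docs: list of token lists.
    min_freq: assumed to mean total occurrences across the whole corpus.
    """
    total = Counter(w for doc in docs for w in doc if w not in stopwords)
    vocab = sorted(w for w, n in total.items() if n >= min_freq)
    index = {w: i for i, w in enumerate(vocab)}
    lines = []
    for doc in docs:
        # Count only the words that survived the stop-list and the cutoff.
        counts = Counter(index[w] for w in doc if w in index)
        pairs = " ".join(f"{i}:{n}" for i, n in sorted(counts.items()))
        lines.append(f"{len(counts)} {pairs}".rstrip())
    return vocab, lines

docs = [["women", "labor", "women"], ["labor", "the", "st"]]
vocab, lines = to_ldac(docs, stopwords={"the"}, min_freq=2)
print(vocab)
print("\n".join(lines))
```

The position of each word in the vocabulary list is its term id, which is why the word list has to be kept alongside the matrix.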

This is what resulted:

[1] “art” “pocahontas” “english” “woman” “women” “female” “sappho”

[2] “role” “human” “social” “models” “behavior” “women” “primate”

[3] “women” “feminist” “gender” “social” “men” “university” “work”

[4] “women” “marriage” “men” “hispanic” “honor” “village” “university”

[5] “women” “state” “rights” “law” “social” “public” “political”

[6] “women” “rape” “percent” “white” “bermuda” “sex” “men”

[7] “nuclear” “language” “feminist” “defense” “war” “weapons” “text”

[8] “feminist” “social” “language” “women” “relations” “gender” “theory”

[9] “women” “labor” “work” “family” “children” “economic” “men”

[10] “women” “psychology” “feminist” “female” “mother” “woolf” “male”

[11] “women” “jane” “michel” “life” “woman” “early” “louise”

[12] “moral” “women” “muslim” “care” “theory” “social” “morality”

[13] “women” “science” “studies” “education” “social” “students” “feminist”

[14] “women” “library” “men” “beauvoir” “language” “ddc” “classification”

[15] “medical” “martineau” “women” “harriet” “alice” “cases” “mother”

[16] “women” “club” “literacy” “printed” “press” “university” “print”

[17] “women” “body” “religious” “female” “st” “dance” “christian”

[18] “freud” “charcot” “field” “lucy” “women” “bronte” “social”

[19] “feminist” “incest” “women” “distance” “theory” “aerial” “sexual”

[20] “women” “role” “female” “woman” “control” “german” “social”

[21] “women” “feminist” “mizrahi” “political” “movement” “israeli” “social”

[22] “black” “women” “white” “race” “community” “work” “men”

[23] “sexual” “lesbian” “sexuality” “women” “sex” “love” “female”

[24] “lesbian” “work” “lesbians” “nature” “gender” “identity” “women”

[25] “women” “chinese” “suffrage” “white” “african” “china” “national”

There’s a lot of experimenting involved with determining a suitable number of topics. Some of these look like they might be split off from others, and others look like they belong in different topics. I also see some words to add to the stop-list (“ddc” and “st,” for example). (I couldn’t decide about “women/woman.” Too common, or would it distort the results to leave out?) And stemming would also help.

UPDATE:

I ran the model without “women” or “woman,” since those words were possibly over-represented in the terms above. Here is the result:

[1] “moral” “role” “care” “social” “art” “theory” “models”

[2] “black” “white” “race” “children” “group” “community” “racial”

[3] “men” “african” “kenya” “marriage” “groups” “university” “marry”

[4] “studies” “university” “american” “education” “history” “female” “students”

[5] “science” “social” “human” “sex” “differences” “field” “behavior”

[6] “work” “labor” “family” “social” “men” “children” “economic”

[7] “work” “gender” “martineau” “men” “hispanic” “feminization” “cultural”

[8] “rape” “percent” “bermuda” “white” “sex” “men” “raped”

[9] “religious” “male” “life” “female” “medieval” “early” “university”

[10] “law” “muslim” “legal” “court” “rights” “state” “islamic”

[11] “feminist” “nuclear” “language” “distance” “defense” “theory” “aerial”

[12] “freud” “pocahontas” “english” “charcot” “masque” “smith” “virginia”

[13] “michel” “death” “life” “female” “love” “louise” “political”

[14] “film” “body” “irigaray” “work” “university” “female” “desire”

[15] “chinese” “men” “indian” “china” “public” “library” “asian”

[16] “sexual” “sex” “gender” “men” “sexuality” “female” “social”

[17] “political” “feminist” “gender” “state” “movement” “rights” “social”

[18] “female” “love” “lesbian” “literary” “university” “life” “story”

[19] “nature” “control” “carson” “sea” “human” “natural” “life”

[20] “jane” “bronte” “lucy” “novel” “rochester” “eyre” “marriage”

[21] “club” “literacy” “printed” “press” “university” “print” “white”

[22] “incest” “feminist” “mother” “sexual” “psychology” “mothers” “men”

[23] “feminist” “social” “gender” “theory” “feminism” “political” “relations”

[24] “suffrage” “white” “national” “german” “data” “movement” “nigerian”

[25] “mizrahi” “lesbian” “feminist” “lesbians” “israeli” “identity” “israel”

Still a few oddities, but it looks reasonable.

Errol Morris’s A Wilderness of Error: The Trials of Jeffrey MacDonald

I am from North Carolina. I’m quite familiar with the eastern part of the state, having lived there off and on for almost a quarter-century. Nothing surprised me more in this unusual book than learning there was apparently a thriving “hippie” scene in Fayetteville in 1970. It seems unimaginable from what I experienced, but the dynamic created by soldiers returning from Southeast Asia, heroin, and the like was quite different from anything I remember. Anyway, while I was familiar with the broad outlines of the Jeffrey MacDonald case, I have never read any of the books about it (or seen the mini-series or any of the other documentaries). It’s an intrinsically fascinating story, and Errol Morris is in many ways an ideal author to explore it. Morris has been a philosophy of science student and private investigator in addition to being the documentary filmmaker responsible for freeing an innocent man from a Texas prison, among other provocations. In particular, Morris is fascinated with epistemology and what he describes as “postmodernist” attacks upon it. He has written amusingly about his encounters with Thomas Kuhn in this regard, and his interest in this case was certainly furthered by Janet Malcolm’s The Journalist and the Murderer, a book about the Joe McGinniss/Jeffrey MacDonald relationship that Morris thinks argues that the truth of the case is either essentially or practically unknowable. Morris rejects such an attitude with the entirety of his being, it seems. I find his objections either to be overstated or grounded in philosophical presuppositions that I don’t share, but it’s a witty and bracing attitude all the same.

Curiously enough, Morris’s remarks about Malcolm’s seeming skepticism about truth in the book reminded me of how Hal Incandenza attributes a similar skepticism about truth to his mother Avril in David Foster Wallace’s Infinite Jest. I made this connection because of the various revelations about how often Wallace distorted facts in his creative non-fiction and reportage. I have never thought much about this, considering how obvious these distortions were, but it does raise serious ethical questions in many kinds of reporting. And the truth of the MacDonald case is very hard to discern. Very early on a February morning in 1970, MacDonald’s wife and two children were murdered. He was seriously wounded but not killed. The crime scene was not well preserved. Army investigators developed a theory that MacDonald staged the scene and blamed intruders. The intruders MacDonald described seemed to imitate the Manson murders, which raised suspicions in the minds of investigators. A copy of Esquire devoted to the Manson murders and other instances of countercultural violence was in MacDonald’s living room. (Amazingly enough, the entire issue of this magazine was read in open court at the trial that convicted MacDonald in 1979.)

An MP saw a woman standing on the side of the road on the way to MacDonald’s house. His description matched MacDonald’s description of this woman, and investigators quickly determined that a drug-informant named Helena Stoeckley fit the description. Stoeckley would implicate herself in vague and somewhat contradictory ways in the murders many times before she died in 1983. A former FBI agent hired as a private investigator by MacDonald named Ted Gunderson, who was in the process of becoming quite the conspiracy theorist, strongly played up Stoeckley’s possible interest in occultism, and she claimed in a late interview to have been part of a cult who murdered the MacDonald family ritualistically. (Another, more prosaic and probable explanation she gave is that she knew MacDonald from working as a nurse’s assistant, and that she and her friends went to his house to intimidate him into giving them drugs. MacDonald never mentioned this to investigators, however.)

Morris convinces me that MacDonald’s trial was hopelessly corrupted. He dwells less on MacDonald’s strange behavior, which included an appearance on the Dick Cavett Show so self-involved that it alone apparently convinced his wife’s step-father that he had in fact committed the murders. He also told the same man that he had tracked down and killed one of the culprits—a lie so brazen as to be nearly unimaginable. He started a new life in California much more quickly than his in-laws and prevailing standards of decency might judge appropriate. He showed disastrous judgment in his choice of defense attorneys, but it’s hard to say if this was sufficiently foreseeable.

The genre of true-crime is morally reprehensible. It’s easy enough to deduce this from general principles, but I have also read a book in this genre that involved the family of someone I knew, and it was full of outright lies and meretricious conjectures. That’s what sells. Morris makes a very convincing case that McGinniss realized that his access to MacDonald had not given him a marketable story, and so he came up with the diet-pill psychosis hypothesis to lend closure. Morris comes close in his discussion of this point to the realization that stories themselves are always false and partial, though this is the very claim he repudiates so strongly in Malcolm’s account. And I would agree with him that the economics of publishing and personal ambitions of an author should not be confused with grand claims about the inherently illusory nature of narrative.

Another interesting facet of the book is how it reveals Morris’s primarily visual imagination. While over five-hundred pages in length, the book seems much shorter, primarily because it is divided into over sixty short chapters. Each of the chapters is about the length of one of Morris’s blog posts (excuse me, “short essays”) for the NYT. The book also has some sharply designed diagrams. At one point, Morris reveals that he wanted to make a documentary about the case; but he couldn’t get funding because everyone he talked to about it had already made up their minds about MacDonald being guilty. Even allowing for the density of visual information, it’s hard for me to imagine how all of this could have gotten into a two-hour documentary, but who knows. I would have watched it.

Homeland

I suffer from a very rare form of prosopagnosia in which every time I try to picture Mandy Patinkin, I imagine Steven Seagal instead. This has made watching Homeland a somewhat disconcerting experience. Or at least thinking about it, I should say. Though Claire Danes and Patinkin are well cast, I don’t think Damian Lewis or Morena Baccarin are altogether plausible in their parts, though Lewis is obviously a fine actor. (At various points, the script attempts to comment on Baccarin’s appearance: the daughter can’t believe these are her parents, etc.)

Emily Nussbaum at the New Yorker wrote that she wished that the economics of television permitted the more aesthetically satisfying conclusion of a one-season-only series, with the bomb actually going off in the last episode. I agree, though I would have preferred that it went off accidentally, after Brody had decided not to trigger it.

Nussbaum, in the article I linked to above, also mentions that Homeland is an “antidote” for the same production team’s 24. I think “apology” might be a more appropriate term, though I’m not sure that either is very accurate. The second season has begun with Israel bombing Iranian nuclear installations, and, while things are unsettled as a consequence, they seem much less so than I expect they would in reality. But what Nussbaum was mainly getting at there, I think, is the idea of the series’s apparent moral equivalence. The Vice-President who ordered the drone strike that killed Abu Nazir’s child (and dozens more) is at least as morally compromised according to the show’s moral perspective as anyone else.

24 itself was not immune to using far-right authority figures as enemies, and I always thought that this was an attempt to inoculate itself against its constant presentation of torture-apologetics. I’m not sure what’s equivalent in Homeland. It would also be interesting to speculate about what type of cultural authority is meant to be represented in Carrie Mathison’s love of jazz. An atmospheric decision only? I also love that Patinkin’s a blackmailer and also the show’s central moral authority.