<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Jonathan Goodwin</title>
	<atom:link href="http://www.jgoodwin.net/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://www.jgoodwin.net</link>
	<description></description>
	<lastBuildDate>Thu, 16 May 2013 20:40:43 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Interpreting Topics in Law and Economics</title>
		<link>http://www.jgoodwin.net/?p=1203</link>
		<comments>http://www.jgoodwin.net/?p=1203#comments</comments>
		<pubDate>Thu, 09 May 2013 05:40:55 +0000</pubDate>
		<dc:creator>Jonathan</dc:creator>
				<category><![CDATA[Aleatory Research]]></category>

		<guid isPermaLink="false">http://www.jgoodwin.net/?p=1203</guid>
		<description><![CDATA[Of the many interesting things in Matthew Jockers&#8217;s Macroanalysis, I was most intrigued by his discussion of interpreting the topics in topic models. Interpretation is what literary scholars are trained for and tend to excel at, and I&#8217;m somewhat skeptical &#8230; <a href="http://www.jgoodwin.net/?p=1203">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Of the many interesting things in Matthew Jockers&#8217;s <i>Macroanalysis</i>, I was most intrigued by his discussion of interpreting the topics in topic models. Interpretation is what literary scholars are trained for and tend to excel at, and I&#8217;m somewhat skeptical of the notion of an &#8220;uninterpretable&#8221; topic. I prefer to think of it as a topic that hasn&#8217;t yet met its match, hermeneutically speaking. In my experience building topic models of scholarly journals, I have found clear examples of lumping and splitting&#8212;terms that are either separated from their natural place or agglomerated into an unhappy mass. The &#8216;right&#8217; number of topics for a given corpus is generally the one which has the lowest visible proportion of lumped and split topics. But there are other issues in topic-interpretation that can&#8217;t easily be resolved this way.</p>
<p>A problem I&#8217;ve found in modeling scholarship is how &#8220;evidence/argument words&#8221; are always highly represented in any given corpus. If you use hyperparameter optimization, which weighs topics according to the relative proportion in the corpus, words like &#8220;fact evidence argue make&#8221; tend to compose the most representative topics. Options include simply eliminating the topic from the browser, which seems to eliminate a large number of documents that would be otherwise classified, or trying to add all of the evidence words to a stop list. The aggressive pursuit of stop-words degrades the model, though this observation is more of an intuition than anything I can now document.</p>
<p>I thought it might be helpful to others who are interested in working with topic models to create several models of the same corpus and look at the effects created by small changes in the parameters (number of topics, lemmatization of corpus, and stop-words). The journal that I chose to use for this example is the <i>Journal of Law and Economics</i>, for both its ideological interest and methodological consistency. The law-and-economics movement is about as far away from literary studies as it&#8217;s possible to be while still engaging in a type of discourse analysis, I think, and I find this contrast both amusing and potentially illuminating. That the field of law-and-economics is perhaps the most well-known (even infamous) example of quantified reasoning used in support of what many view as a distinct political agenda is what led me to choose it to begin to explore the potential critical usefulness of another quantitative method of textual analysis.</p>
<p>I began by downloading all of the research articles published in the journal from JSTOR&#8217;s <a href="http://dfr.jstor.org/">Data for Research</a>. There were 1281 articles. I then converted the word-frequency lists to bags-of-words and created a 70-topic model using MALLET.* The browsable model is <a href="http://jgoodwin.net/law-economics/70/">here</a>. The <a href="http://jgoodwin.net/law-economics/70/topic28.html">first topic</a> is the most general of academic evidence/argument words: &#8220;made, make, case, part, view, difficult. . .&#8221; I was intrigued by the high-ranking presence of articles by Milton Friedman and R. H. Coase in this topic; it would be suggestive if highly cited or otherwise important articles were most strongly associated with the corpus&#8217;s &#8220;evidence&#8221; terms, but I can&#8217;t say that this is anything other than coincidence. The next <a href="http://jgoodwin.net/law-economics/70/topic34.html">topic</a> shows the influence of the journal&#8217;s title: &#8220;law, economics, economic, system, problem, individual.&#8221; The duplication of the adjective and noun form of &#8220;economics&#8221; can be eliminated with stemming or lemmatizing the corpus, though it is not clear if this increases the overall clarity of the model. I noticed that articles &#8220;revisiting&#8221; topics such as &#8220;social cost&#8221; and &#8220;public goods&#8221; are prominent in this topic, which is perhaps explainable by an unusually high proportion of intra-journal citations. (I want to bemoan, for the thousandth time, the loss of JSTOR&#8217;s citation data from its API.)</p>
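<p>The conversion from word-frequency lists to bags of words can be sketched roughly as follows. This is a minimal illustration, not the scripts used for these models. It assumes each article arrives as a two-column CSV of word and count with a header row; the function name <code>counts_to_bag</code> and the sample data are hypothetical.</p>

```python
import csv
import io


def counts_to_bag(csv_text, stop_words=frozenset()):
    """Expand word,count rows into a flat list with each word repeated count times."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row (assumed to be present)
    bag = []
    for row in reader:
        if len(row) != 2:
            continue  # ignore malformed rows
        word, count = row[0].strip().lower(), int(row[1])
        if word and word not in stop_words:
            bag.extend([word] * count)
    return bag


# Hypothetical sample in the assumed word,count layout.
sample = "WORDCOUNTS,WEIGHT\nmarket,3\nthe,10\nprice,2\n"
print(" ".join(counts_to_bag(sample, stop_words={"the"})))
```

<p>The output has no meaningful word order, which does not matter here because LDA treats each document as an unordered bag of words.</p>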
<p>The <a href="http://jgoodwin.net/law-economics/70/topic41.html">next</a> <a href="http://jgoodwin.net/law-economics/70/topic4.html">two</a> topics are devoted to methodology. Econometric techniques dominate the content of the <i>Journal of Law and Economics</i>, so there&#8217;s no surprise that topics featuring those terms would be this widely distributed. Of the next three topics, one seems spuriously related to <a href="http://jgoodwin.net/law-economics/70/topic12.html">citations</a> and the <a href="http://jgoodwin.net/law-economics/70/topic42.html">other</a> <a href="http://jgoodwin.net/law-economics/70/topic27.html">two</a> are also devoted to statistical methodology. It is only the eighth topic that is unambiguously associated with a recognizable subject in the journal: <a href="http://jgoodwin.net/law-economics/70/topic61.html">market efficiency</a>. Is this apparent overemphasis on evidence/methodology a problem? And if so, what do you do about it? One approach would be to add many of the evidence-related words to a stop-list. Another would be to label all the topics and let the browser decide which are valuable. Here is a rough attempt at labeling the seventy-topic <a href="http://jgoodwin.net/law-economics/70/index-labeled.html">model</a>.</p>
<p>The number of topics generated is the most obvious and effective parameter to adjust. Though I ended up labeling several of the topics the same way, I&#8217;m not sure that I would define those as split topics. The early evidence/methodology-related topics do have slightly distinct frames of reference. The topics labeled &#8220;Pricing&#8221; also refer to different aspects of price theory, which I could have specified. The only obviously lumped-together topic was the final one, with its mixture of sex-worker and file-sharing economics. If there is evidence of both lumping and splitting, then simply adjusting the number of topics is unlikely to solve both problems.</p>
<p>An alternative to aggressive stop-wording is lemmatization. The Natural Language Toolkit has a lemmatizer that calls on the WordNet database. Implementation is simple in Python, though slow to execute. A seventy-topic model generated with the lemmatized corpus has continuities with the non-lemmatized model. The <a href="http://jgoodwin.net/law-economics/70-lemma/">browser</a> shows that there are fewer evidence-related topics. Since the default stop-word list does not include the lemmatized forms &#8220;ha,&#8221; &#8220;doe,&#8221; &#8220;wa,&#8221; or &#8220;le,&#8221; it aggregates those in topics that are more strongly representative than the similar topics in the non-lemmatized model. Comparing the <a href="http://jgoodwin.net/law-economics/70-lemma/index-labeled.html">labeled topics</a> with the <a href="http://jgoodwin.net/law-economics/70/index-labeled.html">non-lemmatized model</a> shows that there are many direct correspondences. The <a href="http://jgoodwin.net/law-economics/70/topic45.html">two</a> <a href="http://jgoodwin.net/law-economics/70-lemma/topic49.html">insurance-related topics</a>, for instance, have very similar lists of articles. The trend lines do not always match very well, which I believe is caused by the much higher weighting of the first &#8220;argument words&#8221; topic in the lemmatized corpus (and also by the unreliability of graphing these very small changes).</p>
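<p>The truncated forms &#8220;ha,&#8221; &#8220;doe,&#8221; &#8220;wa,&#8221; and &#8220;le&#8221; are presumably what the lemmatizer leaves behind for &#8220;has,&#8221; &#8220;does,&#8221; &#8220;was,&#8221; and &#8220;less.&#8221; One way to keep them out of the model is to extend the stop list after lemmatizing. The sketch below assumes the corpus is already lemmatized; the base stop list and the function names are hypothetical stand-ins, not MALLET&#8217;s defaults.</p>

```python
# Artifacts named in the post; the base stop list is a hypothetical stand-in.
LEMMA_ARTIFACTS = {"ha", "doe", "wa", "le"}
BASE_STOP_WORDS = {"the", "of", "and", "has", "does", "was", "less"}


def extend_stop_list(stop_words, extra=LEMMA_ARTIFACTS):
    """Return a new stop list that also covers the truncated lemma forms."""
    return set(stop_words) | set(extra)


def filter_tokens(tokens, stop_words):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]


# Hypothetical lemmatized tokens from one document.
lemmatized = ["court", "ha", "rule", "price", "wa", "le", "market", "doe"]
stop_words = extend_stop_list(BASE_STOP_WORDS)
print(filter_tokens(lemmatized, stop_words))
```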
<p>Labeling is inherently subjective, and my adopted labels for the lemmatized corpus were both whimsical in places and also influenced by the first labels that I had chosen. As I mentioned in my <a href="http://jgoodwin.net/?p=1184">comments on Matthew Jockers&#8217;s <i>Macroanalysis</i></a>, computer scientists have developed automatic labeling techniques for topic models. While labor-intensive, doing it by hand forces you to consider each topic&#8217;s coherence and reliability in a way that might be easy to miss otherwise. The browser format that shows the articles most closely associated with each topic also helps in labeling them, I find. It might not be a bad idea for a topic model of journal articles to label each topic based on the title of the article most closely associated with it; this technique would only mislead on deeply divided or clustered topics, or on those which have only one article strongly associated with them (a sign of too many topics in my experience).</p>
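<p>The title-based labeling suggested above can be sketched in a few lines. This is an illustration under stated assumptions, not code from the browsers: <code>doc_topics</code> is assumed to hold one row of topic proportions per article, and the <code>strong</code> cutoff of 0.5 is an arbitrary, hypothetical value for counting how many articles are strongly tied to a topic.</p>

```python
def label_by_top_article(doc_topics, titles, strong=0.5):
    """For each topic, return (title of its top article, count of strongly tied articles)."""
    labels = []
    for topic in range(len(doc_topics[0])):
        weights = [row[topic] for row in doc_topics]
        # Index of the article with the largest share of this topic.
        best = max(range(len(weights)), key=weights.__getitem__)
        # A count of 1 may signal a topic built around a single article.
        n_strong = sum(1 for w in weights if w >= strong)
        labels.append((titles[best], n_strong))
    return labels


# Hypothetical proportions for three articles and two topics.
doc_topics = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
titles = ["Article A", "Article B", "Article C"]
for label, n_strong in label_by_top_article(doc_topics, titles):
    print(label, n_strong)
```

<p>A topic whose count is 1 would be a candidate for the single-article problem described above.</p>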
<p>(UPDATE: My initial labeling of the tables below was in error because of an indexing error with the topic numbers. The correlations below make much more sense in terms of the topics&#8217; relative weights, and I&#8217;m embarrassed that I didn&#8217;t notice the problem earlier.)</p>
<p>The topics were not strongly correlated with each other in either direction. In the non-lemmatized model, the only topics with a Pearson correlation above .4 were</p>
<table>
<tr>
<td> <a href="http://jgoodwin.net/law-economics/70/topic28.html">EVIDENCE</a></td>
<td><a href="http://jgoodwin.net/law-economics/70/topic34.html">JOURNAL</a></td>
</tr>
<tr>
<td> <a href="http://jgoodwin.net/law-economics/70/topic20.html">ECONOMIC IDEOLOGY</a></td>
<td> <a href="http://jgoodwin.net/law-economics/70/topic28.html">EVIDENCE</a></td>
</tr>
<tr>
<td><a href="http://jgoodwin.net/law-economics/70/topic42.html">MODELING</a></td>
<td><a href="http://jgoodwin.net/law-economics/70/topic4.html">METHODOLOGY</a></td>
</tr>
</table>
<p>The negative correlations below -.4 were </p>
<table>
<tr>
<td><a href="http://jgoodwin.net/law-economics/70/topic42.html">MODELING</a></td>
<td><a href="http://jgoodwin.net/law-economics/70/topic28.html">EVIDENCE</a></td>
</tr>
<tr>
<td> <a href="http://jgoodwin.net/law-economics/70/topic34.html">JOURNAL</a></td>
<td><a href="http://jgoodwin.net/law-economics/70/topic4.html">METHODOLOGY</a></td>
</tr>
<tr>
<td> <a href="http://jgoodwin.net/law-economics/70/topic42.html">MODELING</a></td>
<td><a href="http://jgoodwin.net/law-economics/70/topic34.html">JOURNAL</a></td>
</tr>
<tr>
<td><a href="http://jgoodwin.net/law-economics/70/topic28.html">EVIDENCE</a></td>
<td><a href="http://jgoodwin.net/law-economics/70/topic4.html">METHODOLOGY</a></td>
</tr>
</table>
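<p>Correlation tables like the ones above can be produced from the document-topic matrix. The sketch below is one plausible way to do it, not the script behind these tables: it treats each topic as a column of per-article proportions, computes Pearson&#8217;s r for every pair of columns, and keeps the pairs beyond the threshold. The matrix values and function names are hypothetical.</p>

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strong_pairs(doc_topics, threshold=0.4):
    """Return (column_a, column_b, r) for topic pairs whose |r| exceeds the threshold."""
    columns = list(zip(*doc_topics))  # one tuple of proportions per topic
    pairs = []
    for a, b in combinations(range(len(columns)), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical proportions: four articles (rows), three topics (columns).
doc_topics = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
]
for a, b, r in strong_pairs(doc_topics):
    print(a, b, r)
```

<p>The indices returned are column positions in the matrix, so they have to be mapped back to the model&#8217;s own topic numbers with care before any labels are attached.</p>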
<p>Ted Underwood and Andrew Goldstone&#8217;s <a href="http://tedunderwood.com/2012/12/14/what-can-topic-models-of-pmla-teach-us-about-the-history-of-literary-scholarship/">PMLA topic-modeling post</a> used network graphs to visualize their models and produce identifiable clusters. I suspect this particular model could be graphed in the same way, but the relatively low correlations between topics make me a little leery of trying it. I generated a few network graphs for John Laudun&#8217;s and my folklore project, but we didn&#8217;t end up using them for the first article. They weren&#8217;t as snazzy as the Underwood and Goldstone graphs, as my Gephi patience often runs very thin. (Gephi also has problems with the latest Java update, as Ian Milligan pointed out to me on Twitter. I intend to update this post before too long with a D3 network graph of the topic correlations.)</p>
<p>[UPDATE: 5/16/13. After some efforts at understanding JavaScript's object syntax, I've made a clickable network graph of correlations between topics in the lemmatized browser: <a href="http://www.jgoodwin.net/law-economics/70-lemma/new-test.html">network graph</a>. The darker the edge, the stronger the correlation.]</p>
<p>The most strongly correlated topics in the lemmatized corpus were</p>
<table>
<tr>
<td><a href="http://jgoodwin.net/law-economics/70-lemma/topic26.html">METHODOLOGY</a></td>
<td> <a href="http://jgoodwin.net/law-economics/70-lemma/topic43.html">MODELING</a></td>
</tr>
<tr>
<td> <a href="http://jgoodwin.net/law-economics/70-lemma/topic51.html">ARGUMENT WORDS</a></td>
<td><a href="http://jgoodwin.net/law-economics/70-lemma/topic5.html">PUBLIC GOODS</a></td>
</tr>
<tr>
<td><a href="http://jgoodwin.net/law-economics/70-lemma/topic51.html">ARGUMENT WORDS</a></td>
<td><a href="http://jgoodwin.net/law-economics/70-lemma/topic65.html">ECONOMIC IDEOLOGY</a></td>
</tr>
</table>
<p>Here is a simple network graph of the positively correlated topics above .2 (thicker lines indicate stronger correlation):</p>
<p><a href="http://www.jgoodwin.net/wordpress/wp-content/uploads/2013/05/lemmatized-correlation.png"><img src="http://www.jgoodwin.net/wordpress/wp-content/uploads/2013/05/lemmatized-correlation.png" alt="lemmatized-correlation" width="480" height="480" class="aligncenter size-full wp-image-1218" /></a></p>
<p>My goal is to integrate a D3.js version of these network graphs into the browsers, so that the nodes link to the topics and the layout is adjustable. I haven&#8217;t yet learned the software well enough to do this, however. The simple graph above was made using the R igraph package. [UPDATE: See <a href="http://www.jgoodwin.net/law-economics/70-lemma/new-test.html">here</a> for a simple D3.js browser.]</p>
<p>And the negative correlations:</p>
<table>
<tr>
<td> <a href="http://jgoodwin.net/law-economics/70-lemma/topic26.html">METHODOLOGY</a></td>
<td> <a href="http://jgoodwin.net/law-economics/70-lemma/topic51.html">ARGUMENT WORDS</a></td>
</tr>
<tr>
<td> <a href="http://jgoodwin.net/law-economics/70-lemma/topic51.html">ARGUMENT WORDS</a></td>
<td> <a href="http://jgoodwin.net/law-economics/70-lemma/topic43.html">MODELING</a></td>
</tr>
<tr>
<td><a href="http://jgoodwin.net/law-economics/70-lemma/topic43.html">MODELING</a></td>
<td> <a href="http://jgoodwin.net/law-economics/70-lemma/topic40.html">AMERICA?</a></td>
</tr>
</table>
<p>The fact that some topics appear at the top of both the negative and positive correlations in both of the models suggests to me that there is some artifact of the hyperparameter optimization process responsible for this in a way that I don&#8217;t quite grasp (though I am aware, sadly enough, that the explanation could be very simple). The .4 threshold I chose is arbitrary, and the correlations follow a consistent and smooth pattern in both models. The related articles section of these browsers is based on Kullback-Leibler divergence, a metric apparently more useful than Manhattan distance. It seems to me that the articles listed under each topic are much more likely to be related to one another than any metric I&#8217;ve used to compare the overall weighting of topics.</p>
<p>Another way of assessing the models and label-interpretations is to check where they place highly cited articles. According to Google Scholar, the most highly cited article** in the <i>Journal of Law and Economics</i> is Fama and Jensen&#8217;s &#8220;Separation of Ownership and Control.&#8221; In the non-lemmatized model, it is associated with the <a href="http://jgoodwin.net/law-economics/70/725104.html">AGENTS AND ORGANIZATIONS</a> topic. It appears in the topic I labeled <a href="http://jgoodwin.net/law-economics/70-lemma/725104.html">INVESTORS</a> in the lemmatized corpus, but further reflection shows that these terms are closer than I first thought. My intuition, as I have mentioned before in this discussion of <a href="http://jgoodwin.net/?p=1142">Pierre Nora&#8217;s &#8220;Between Memory and History,&#8221;</a> is that highly cited articles are somehow more central to the corpus because they affect the subsequent distribution of terms. The next-most cited article, Oliver Williamson&#8217;s &#8220;Transaction-cost Economics: The Governance of Contractual Relations,&#8221; appears, suitably enough, in the topics devoted to contracts in both browsers. And R. H. Coase&#8217;s &#8220;The Federal Communications Commission&#8221; is in the COMMUNICATIONS REGULATION topic in both browsers, a topic whose continuing theoretical interest to the journal was established by Coase&#8217;s early article.</p>
<p>As I mentioned in the beginning, I chose the <i>Journal of Law and Economics</i> for this project in interpreting topics in part because of its ideological interest. I have little sympathy for Chicago-style economics and its dire public policy recommendations, but I only expressed that in this project through some sarcastic topic-labeling. Does the classification and sorted browsing enabled by topic modeling affect how a reader perceives antagonistic material? Labeling can be an aggressive activity; would automated labeling of topics alleviate this tendency or reinforce it? I don&#8217;t know if this subject has been addressed in information-retrieval research, but I&#8217;d like to find out.</p>
<p>*I am leaving out some steps here. My code that processes the MALLET output into a browser uses scripts in Perl and R to link the metadata to the files and create graphs of each topic. Andrew Goldstone&#8217;s <a href="https://github.com/agoldst/dfr-analysis">code</a> performs much the same functions and is much more structurally sound than what I created, which is why I haven&#8217;t shared my code. For creating browsers, Allison Chaney&#8217;s <a href="http://code.google.com/p/tmve/">topic-modeling visualization engine</a> is what I recommend, though I was unsure how to convert MALLET&#8217;s output to the lda-c output that it expects (though doing so would doubtless be much simpler than writing your own as I did).</p>
<p>**That is the most highly cited article anywhere that Google&#8217;s bots have found, not just in the journal itself. I am aware of the assumption inherent in claiming that a highly cited article would necessarily be influential to that particular journal&#8217;s development, since disciplinary and discourse boundaries would have to be taken into account. All highly cited articles are cited in multiple disciplines, I believe, and that applies even to a journal carving out new territory in two well-established ones like law and economics.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jgoodwin.net/?feed=rss2&#038;p=1203</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Recent Developments in Humanities Topic Modeling: Matthew Jockers&#8217;s Macroanalysis and the Journal of Digital Humanities</title>
		<link>http://www.jgoodwin.net/?p=1184</link>
		<comments>http://www.jgoodwin.net/?p=1184#comments</comments>
		<pubDate>Sat, 13 Apr 2013 21:33:05 +0000</pubDate>
		<dc:creator>Jonathan</dc:creator>
				<category><![CDATA[Aleatory Research]]></category>

		<guid isPermaLink="false">http://www.jgoodwin.net/?p=1184</guid>
		<description><![CDATA[1. Ongoing Concerns Matthew Jockers&#8217;s Macroanalysis: Digital Methods &#038; Literary History arrived in the mail yesterday, and I finished reading just a short while ago. Between it and the recent Journal of Digital Humanities issue on the &#8220;Digital Humanities Contribution &#8230; <a href="http://www.jgoodwin.net/?p=1184">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><strong>1. Ongoing Concerns</strong><br />
Matthew Jockers&#8217;s <i>Macroanalysis: Digital Methods &#038; Literary History</i> arrived in the mail yesterday, and I finished reading just a short while ago. Between it and the recent <a href="http://journalofdigitalhumanities.org/2-1/dh-contribution-to-topic-modeling/"><i>Journal of Digital Humanities</i> issue</a> on the &#8220;Digital Humanities Contribution to Topic Modeling,&#8221; I&#8217;ve had quite a lot to read and think about. <a href="http://johnlaudun.org">John Laudun</a> and I also finished editing our forthcoming article in <i>The Journal of American Folklore</i> on using topic-models to map disciplinary change. Our article takes a strongly <i>interpretive</i> and qualitative approach, and I want to review what Jockers and some of the contributors to the JDH volume have to say about the interpretation of topic models.</p>
<p>Before I get to that, however, I want to talk about the <a href="http://www.jgoodwin.net/?p=1142"><i>Representations</i></a> project&#8217;s status, as it was based on viewing the same corpus through a number of different topic-sizes. I had an intuition that documents that were highly cited outside of the journal, such as Pierre Nora&#8217;s &#8220;Between Memory and History,&#8221; might tend to be more reflective of the journal&#8217;s overall thematic structure than those less-cited. The fact that citation-count is (to some degree) correlated with publication date complicates this, of course, and I also began to doubt the premise. The opposite, in fact, might be as likely to be true, with articles that have an inverted correlation to the overall thematic structure possibly having more notability than &#8220;normal science.&#8221; The mathematical naivety of my approach compared to the existing work on topic-modeling and document influence, such as the <a href="http://www.cs.princeton.edu/~blei/papers/GerrishBlei2010.pdf">Gerrish and Blei</a> paper I linked to in the original post, also concerned me. </p>
<p>One important and useful feature missing from the browsers I had built was the display of related documents for each article. After spending one morning reading through early issues of <i>Computers and the Humanities</i>, I built a browser of it and then began working on computing similarity scores for individual articles. I used what seemed to be the simplest and most intuitive measure: the sum of absolute differences of topic assignments (this is known as Manhattan distance). Travis Brown pointed out to me on Twitter that <a href="http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler divergence</a> would likely give better results.* (Sure enough, in the original <a href="http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf">LDA paper</a>, KL divergence is recommended.) The <i>Computers and the Humanities</i> <a href="http://www.jgoodwin.net/ch-browser">browser</a> currently uses the simpler distance measure, and the results are not very good. (This browser also did not filter for research articles only, and I only used the default stop-words list, which means that it is far from as useful as it could be.)</p>
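<p>The two measures can be compared directly on a pair of topic distributions. The sketch below is illustrative only and is not the similarity script behind the browsers. It assumes each article is represented by a list of topic proportions that sums to 1; the <code>epsilon</code> floor is an added assumption to avoid taking the logarithm of zero.</p>

```python
from math import log


def manhattan(p, q):
    """Sum of absolute differences between two topic distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q, epsilon=1e-12):
    """D(p || q) in nats; epsilon guards against zero proportions."""
    total = 0.0
    for a, b in zip(p, q):
        a, b = max(a, epsilon), max(b, epsilon)
        total += a * log(a / b)
    return total


# Hypothetical topic proportions for two articles.
doc_a = [0.5, 0.5]
doc_b = [0.9, 0.1]
print(manhattan(doc_a, doc_b))
print(kl_divergence(doc_a, doc_b), kl_divergence(doc_b, doc_a))
```

<p>Unlike Manhattan distance, KL divergence is not symmetric, so a related-articles list has to fix a direction or combine the two directions in some way.</p>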
<p>While the KL-divergence is not hard to calculate, I didn&#8217;t have time at the beginning of the end of the semester to rewrite the similarity score script to use it.** And since I wanted the next iteration of the <a href="http://www.jgoodwin.net/representations/">browsers</a> to use the presumably more accurate document-similarity scores, I&#8217;ve decided to postpone that project for a month or so. Having a JavaScript interface that allows you to instantly switch views between pre-generated models of varying numbers of topics also seemed like a useful idea; I haven&#8217;t seen anyone do that with different numbers of topics in each model yet (please let me know if there are existing examples of something like this).</p>
<p><strong>2. Interpretation</strong></p>
<p>I&#8217;m only going to write about a small section of <i>Macroanalysis</i> here. A full review might come in the future. I think that the rhetorical strategies of Jockers&#8217;s book (and also of Stephen Ramsay&#8217;s <i>Reading Machines</i>, an earlier volume in the <i>Topics in the Digital Humanities</i> series published by the University of Illinois Press) contrast interestingly with other scholarly monographs in literary studies and that this rhetoric is worth examining in the context of the current crisis in the humanities, and the salvific role of computational methods therein. But what I&#8217;m going to discuss here is Jockers&#8217;s take on labeling and interpreting the topics generated by LDA.</p>
<p>In our interpretation of the folklore-journals <a href="http://www.jgoodwin.net/folklore-browser">corpus</a> John and I did do de-facto labeling or clustering of the topics. We were particularly interested in a cluster of topics related to the performative turn in folklore. Several of these topics did match our expectations in related terms and chronological trends. (Ben Schmidt&#8217;s <a href="http://journalofdigitalhumanities.org/2-1/words-alone-by-benjamin-m-schmidt/">cautions</a> about graphing trends in topics chronologically are persuasive, though I&#8217;m more optimistic than he is about the use of dynamic topic modeling for secondary literature.) The documents associated with these apparently performance-related topics accorded with our expectations, and we took this as evidence that the co-occurrence and relative frequency assignments of the algorithm were working as expected. If that were all, then the results would be only another affirmation of the long-attested usefulness of LDA in classification or information-retrieval. And this goes a long way. If it works for things we know, then it works for things we don&#8217;t. And there are many texts we don&#8217;t know much about.</p>
<p>The real interest in using topic modeling to examine scholarship is when the results contrast with received understanding. When they mostly accord with what someone would expect to find, but there are oddities and discrepancies, we must interpret the results to determine if the fault lies in the algorithm&#8217;s classification or in the discipline&#8217;s received understanding of its history. By definition, this received understanding is based more on generalization and oral lore than on analytic scrutiny and revision (which obviously drives much inquiry, but is almost always selective in its target), so there will always be discrepancies. Bibliometric approaches to humanities scholarship lag far behind those of the sciences, as I understand it, and I think they are of intrinsic interest independent of their contribution to disciplinary history.</p>
<p>Jockers describes efforts to label topics algorithmically in <i>Macroanalysis</i> (135, fn1). He mentions that his own work in successively revising the labels of his 19th century novels topic model is being used by David Mimno to train a classifying algorithm. He also cites <a href="http://www.aclweb.org/anthology-new/P/P11/P11-1154.pdf">&#8220;Automatic Labeling of Topic Models&#8221;</a> and <a href="http://www.aclweb.org/anthology-new/C/C10/C10-2069.pdf">&#8220;Best Topic Word Selection for Topic Labelling&#8221;</a> by Jey Han Lau and co-authors. Both of these papers explore automatically assigning labels to topics from either the terms themselves or from querying an external source, such as wikipedia, to correlate with the terms. My browsers just use the first four terms of a topic as the label, but I can see how a human-assigned label would make them more consistently understandable. Of course, with many models and large numbers of topics, this process becomes laborious, thus the interest in automatic assignment.</p>
<p>But some topics cannot be interpreted. (These are described as &#8220;uninterruptable&#8221; topics in <i>Macroanalysis</i> [129] in what I assume is a spell-check mistake.) Ignoring ambiguous topics is &#8220;a legitimate use of the data and should not be viewed with suspicion by those who may be wary of the &#8216;black box&#8217;&#8221; (130). I agree with Jockers here. In my experience modeling JSTOR data, there are always &#8220;evidence/argument&#8221; related topics that are highly represented in a hyperparametrized model, and these topics are so general as to be useless for analytic purposes. There are also &#8220;OCR error&#8221; topics and &#8220;bibliography&#8221; topics. I wouldn&#8217;t describe these latter ones as ambiguous so much as useless, but the point is that you don&#8217;t have to account for the entire model to interpret some of the topics. Topics near the bottom of a hyperparametrized model tend not to be widely represented in a corpus and thus are not of very high quality: this <a href="http://www.jgoodwin.net/theory-browser/topic35.html">&#8220;dewey ek chomsky&#8221;</a> topic from the <a href="http://www.jgoodwin.net/theory-browser/">browser</a> I created out of five theory-oriented journals is a good example.</p>
<p>I was particularly intrigued by Jockers&#8217;s description of combining topic-model and stylometric classifications into a similarity matrix. I would be bewildered and intimidated by the underlying statistical difficulties of combining these two types of classifications, but the results are certainly intriguing. The immortal George Payne Rainsford James and his <i>The False Heir</i> was classified as the closest non-Dickens novel to <i>A Tale of Two Cities</i>, for example (161).</p>
<p><strong>3. The JDH Issue</strong></p>
<p>Scott Weingart and Elijah Meeks, as I noted above, co-edited a recent issue of JDH devoted to topic modeling in the humanities. Many of the articles are versions of widely circulated posts of the last few months, such as the aforementioned Ben Schmidt article and Andrew Goldstone&#8217;s and Ted Underwood&#8217;s piece on topic-modeling <i>PMLA</i>. (Before I got distracted by topic-browsers, I created some network visualizations of topics similar to those in the Underwood and Goldstone piece. I get frustrated easily with Gephi for some reason, but the network visualization packages in R don&#8217;t generally produce graphs as handsome as Gephi&#8217;s.) There is a shortened version of David Blei&#8217;s &#8220;Probabilistic Topic Models&#8221; review article, and the slides from David Mimno&#8217;s very informative presentation from November&#8217;s Topic-Modeling workshop at the University of Maryland. Megan R. Brett does a <a href="http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/">good job</a> of explaining what&#8217;s interesting about the process to a non-specialist audience. I&#8217;ve tried this myself two or three times, and it&#8217;s much more difficult than I expected it would be. The slightly decontextualized meanings of &#8220;topic,&#8221; &#8220;theme,&#8221; &#8220;document,&#8221; and possibly even &#8220;word&#8221; that are used to describe the process cause confusion, from what I&#8217;ve observed, and it&#8217;s also quite difficult to grasp why the &#8220;bag of words&#8221; approach can produce coherent results if you&#8217;re unaccustomed to thinking about the statistical properties of language. Formalist training and methods are hard to reconcile with frequency-based analysis.</p>
<p>Lisa Rhody&#8217;s <a href="http://journalofdigitalhumanities.org/2-1/topic-modeling-and-figurative-language-by-lisa-m-rhody/">article</a> describes using LDA to model ekphrastic poetry. I was impressed with Rhody&#8217;s discussion of interpretation here, as poetry presents a different level of abstraction from secondary texts and even other forms of creative writing. I had noticed in the <a href="http://www.jgoodwin.net/rhet-browser">rhetoric browser</a> I created out of <i>College English</i>, <i>jac</i>, <i>Rhetoric Review</i>, <i>Rhetoric Society Quarterly</i>, and <i>CCC</i>, that the poems often published in <i>College English</i> consistently <a href="http://www.jgoodwin.net/rhet-browser/topic4.html">clustered</a> together (and that topic would have been clustered together had I stop-worded &#8220;poems,&#8221; which I probably should have done.) Rhody&#8217;s article is the longest of the contributions, I believe, and it has a number of observations about the interpretation of topics that I want to think about more carefully.</p>
<p>Finally, the overview of tools available for topic modeling was very helpful. I&#8217;ve never used Paper Machines on my zotero collections, but I look forward to trying this out in the near future. A tutorial on using the R lda package might have been a useful addition, though perhaps its target audience would be too small to bother. I think I might be one of the few humanists to experiment with <a href="http://www.jgoodwin.net/?p=1043">dynamic topic models</a>, which I think is a useful and productive&#8212;if daunting&#8212;LDA variant. (MALLET has a built-in hierarchical LDA model, but I haven&#8217;t yet experimented with it.)</p>
<p>*Here is an informative <a href="http://storify.com/travisbrown/distance-measures-for-topic-modeling">storified conversation</a> about distance measurements for topic models that Brown showed me.</p>
<p>**Possibly interesting detail: at no point do any of my browser-creation programs use objects or any more complicated data-structure than a hash. If you&#8217;re familiar with the types of data manipulation necessary to create one of these, that probably sounds somewhat crazy&#8212;hence my reluctance to share the code on github or similar. I know enough to know that it&#8217;s not the best way to solve the problem, but it also <i>works</i>, and I don&#8217;t feel the need to rewrite it for legibility and some imagined community&#8217;s approval. I&#8217;m fascinated by the ethos of code-sharing, and I might write something longer about this later.</p>
<p>***I disagree with the University of Illinois Press&#8217;s decision to use sigils instead of numbered notes in this book. As a reader, I prefer endnotes, though I know how hard they are to typeset, but Jockers&#8217;s book has enough of them that they should be numbered.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jgoodwin.net/?feed=rss2&#038;p=1184</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Topic Models and Highly Cited Articles: Pierre Nora&#8217;s &#8220;Between Memory and History&#8221; in Representations</title>
		<link>http://www.jgoodwin.net/?p=1142</link>
		<comments>http://www.jgoodwin.net/?p=1142#comments</comments>
		<pubDate>Thu, 14 Mar 2013 20:19:32 +0000</pubDate>
		<dc:creator>Jonathan</dc:creator>
				<category><![CDATA[Aleatory Research]]></category>

		<guid isPermaLink="false">http://www.jgoodwin.net/?p=1142</guid>
		<description><![CDATA[I have been interested in bibliometrics for some time now. Humanities citation data has always been harder to come by than that of the sciences, largely because the importance of citation-count as a metric has never much caught on there. &#8230; <a href="http://www.jgoodwin.net/?p=1142">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I have been interested in bibliometrics for some time now. Humanities citation data has always been harder to come by than that of the sciences, largely because the importance of citation-count as a metric has never much caught on there. Another important reason is a generalized distrust and suspicion of quantification in the humanities. And there are very good reasons to be suspicious of assigning too much significance to citation-counts in any discipline.</p>
<p>I used google scholar to search for the most-cited articles in several journals in literary studies and allied fields. (Its default search behavior is to return the most-cited article in its database, which, while having a very broad reach, is far from comprehensive or error-free.) By far the most-cited article I found in any of the journals I looked at was Pierre Nora&#8217;s <a href="http://jstor.org/stable/2928520">&#8220;Between Memory and History: <i>Les Lieux de M&eacute;moire</i>.&#8221;</a> A key to success in citation-gathering is multidisciplinary appeal, and Nora&#8217;s article has it. It is cited in history, literary studies, anthropology, sociology, and several other fields. (It would be interesting to consider Nora&#8217;s argument about the ever-multiplying sites of memory in an era of mass quantification, but I&#8217;ll have to save that for another time.)</p>
<p>The next question that came to mind was where Nora&#8217;s article would be classified in a topic model of all of the journal&#8217;s articles. <i>Representations</i> was first published in 1983. The entire archive in JSTOR contains 1036 documents. For much of my other topic-modeling work with journals, I have only used what JSTOR classifies as research articles. Here, because of the relatively small size of the sample (and also because I wanted to see how the algorithm would classify front matter, back matter, and the other paraphernalia), I used everything. In order to track &#8220;Between Memory and History,&#8221; I created several different models. It is always a heuristic process to match the number of topics with the size and density of a given corpus. Normally, I would have guessed that somewhere between 30 and 50 would have been good enough to catch most of the distinct topics while minimizing the lumping together of unrelated ones.</p>
<p>For this project, however, I decided to create six separate models with an incrementally increasing number of topics. The number of topics in each is 10, 30, 60, 90, 120, and 150. I have also created <a href="http://www.jgoodwin.net/representations/">browsers</a> for each model. The index page of each browser shows the first four words of each topic for that model. The topics are sorted in descending order of their proportion in the model. Clicking on one of the topics takes you to a page which shows the full list of terms associated with that topic, the articles most closely associated with that topic (also sorted in descending order&#8212;the threshold is .05), and a graph that shows the annual mean of that topic over time. Clicking on any given journal article will take you to a page showing that journal&#8217;s bibliographic information, along with a link to JSTOR. The four topics most closely associated with that article are also listed there.</p>
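<p>As a rough illustration of the selection logic just described (and not the code behind these browsers), here is a minimal Python sketch. The sample records, the function names, and the choice to average a topic over every document from a given year are all assumptions made for the example; only the .05 threshold comes from the description above.</p>
<pre><code class="language-python">from collections import defaultdict

THRESHOLD = 0.05  # the cutoff mentioned above

# Hypothetical input: one record per article, with its year and
# per-topic proportions (e.g. parsed from a model's document-topic output).
docs = [
    {"id": "a1", "year": 1989, "topics": {0: 0.40, 1: 0.03, 2: 0.57}},
    {"id": "a2", "year": 1989, "topics": {0: 0.10, 1: 0.88, 2: 0.02}},
    {"id": "a3", "year": 1990, "topics": {0: 0.04, 1: 0.46, 2: 0.50}},
]

def articles_for_topic(docs, topic, threshold=THRESHOLD):
    """Return (id, proportion) pairs at or above the threshold, highest first."""
    hits = [(d["id"], d["topics"].get(topic, 0.0)) for d in docs]
    hits = [h for h in hits if h[1] >= threshold]
    return sorted(hits, key=lambda h: h[1], reverse=True)

def annual_mean(docs, topic):
    """Mean proportion of a topic per year, over all documents from that year."""
    by_year = defaultdict(list)
    for d in docs:
        by_year[d["year"]].append(d["topics"].get(topic, 0.0))
    return {year: sum(v) / len(v) for year, v in sorted(by_year.items())}
</code></pre>
<p>Calling <code>articles_for_topic(docs, 0)</code> would give the list for one topic page, and <code>annual_mean(docs, 0)</code> the values for its graph.</p>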
<p>In the ten-topic browser, whose presence here is intended to demonstrate my suspicion that ten topics would not be nearly enough to capture the range of discourse in <i>Representations</i>, Nora&#8217;s article is in <a href="http://www.jgoodwin.net/representations/10/topic0.html">the &#8216;French&#8217; topic</a>, a lumped-together race/memory <a href="http://www.jgoodwin.net/representations/10/topic4.html">topic</a>, a generalized <a href="http://www.jgoodwin.net/representations/10/topic7.html">social/history</a> topic, and the suggestive <a href="http://www.jgoodwin.net/representations/10/topic9.html">&#8220;time, death, narrative&#8221; topic</a>. <del datetime="2013-03-16T18:50:48+00:00">With a .05 threshold, 32% of the documents in the corpus appear in the ten-topic browser.</del> [UPDATE: 3/16, this figure turned out to be based on a bug in the browser-building program.] None of these classifications are particularly surprising or revealing, given how broad the topics have to be at this level of detail; but one idea that I want to return to is the ability of topic-models to identify influential documents in a given corpus. Nora&#8217;s article has clearly been very influential, but are there any detectable traces of this influence in a model of the journal in which it appeared?</p>
<p>Sean M. Gerrish and David Blei&#8217;s article <a href="http://www.cs.princeton.edu/~blei/papers/GerrishBlei2010.pdf">&#8220;Language-based Approach to Measuring Scholarly Impact&#8221;</a> uses dynamic topic models to infer which documents are (or will be) most influential in a given collection. What I have done with these <i>Representations</i> models is not dynamic topic modeling but the regular LDA model. I have experimented with <a href="http://www.jgoodwin.net/?p=1043">dynamic topic models</a> in the past, and I would like to apply the particular techniques described in their article once I can understand them better.</p>
<p>Here is how Nora&#8217;s article is classified in each of the topic models (sorted vertically from most to least representative):</p>
<table>
<tr>
<th>10-topics</th>
<th>30-topics</th>
<th>60-topics</th>
<th>90-topics</th>
<th>120-topics</th>
<th>150-topics</th>
</tr>
<tr>
<td><a href="http://www.jgoodwin.net/representations/10/topic7.html">{social political work}</a></td>
<td><a href="http://www.jgoodwin.net/representations/30/topic0.html">{history historical cultural}</a></td>
<td><a href="http://www.jgoodwin.net/representations/60/topic14.html">{history historical past}</a></td>
<td><a href="http://www.jgoodwin.net/representations/90/topic59.html">{historical history memory}</a></td>
<td><a href="http://www.jgoodwin.net/representations/120/topic12.html">{memory past history}</a></td>
<td><a href="http://www.jgoodwin.net/representations/150/topic90.html">{memory past collective}</a></td>
</tr>
<tr>
<td><a href="http://www.jgoodwin.net/representations/10/topic4.html">{war american black}</a></td>
<td><a href="http://www.jgoodwin.net/representations/30/topic24.html">{form text relation}</a></td>
<td><a href="http://www.jgoodwin.net/representations/60/topic48.html">{memory jewish holocaust}</a></td>
<td><a href="http://www.jgoodwin.net/representations/90/topic82.html">{form human order}</a></td>
<td><a href="http://www.jgoodwin.net/representations/120/topic48.html">{human form individual}</a></td>
<td><a href="http://www.jgoodwin.net/representations/150/topic85.html">{history historical past}</a></td>
</tr>
<tr>
<td><a href="http://www.jgoodwin.net/representations/10/topic9.html">{time death narrative}</a></td>
<td><a href="http://www.jgoodwin.net/representations/30/topic13.html">{memory jewish jews}</a></td>
<td><a href="http://www.jgoodwin.net/representations/60/topic18.html">{made work ways}</a></td>
<td><a href="http://www.jgoodwin.net/representations/90/topic67.html">{fact make point}</a></td>
<td><a href="http://www.jgoodwin.net/representations/120/topic55.html">{history historical modern}</a></td>
<td><a href="http://www.jgoodwin.net/representations/150/topic19.html">{form relation terms}</a></td>
</tr>
<tr>
<td><a href="http://www.jgoodwin.net/representations/10/topic0.html">{de la le}</a></td>
<td><a href="http://www.jgoodwin.net/representations/30/topic9.html">{time death life}</a></td>
<td><a href="http://www.jgoodwin.net/representations/60/topic45.html">{world human life}</a></td>
<td><a href="http://www.jgoodwin.net/representations/90/topic42.html">{early modern history}</a></td>
<td><a href="http://www.jgoodwin.net/representations/120/topic105.html">{relation difference object}</a></td>
<td><a href="http://www.jgoodwin.net/representations/150/topic36.html">{sense kind fact}</a></td>
</tr>
<tr>
<td>N/A</td>
<td><a href="http://www.jgoodwin.net/representations/30/topic6.html">{political social power}</a></td>
<td><a href="http://www.jgoodwin.net/representations/60/topic35.html">{early modern great}</a></td>
<td><a href="http://www.jgoodwin.net/representations/90/topic74.html">{power terms suggests}</a></td>
<td><a href="http://www.jgoodwin.net/representations/120/topic72.html">{de la french}</a></td>
<td><a href="http://www.jgoodwin.net/representations/150/topic122.html">{individual system theory}</a></td>
</tr>
<tr>
<td>N/A</td>
<td><a href="http://www.jgoodwin.net/representations/30/topic1.html">{de la le}</a></td>
<td><a href="http://www.jgoodwin.net/representations/60/topic39.html">{make fact question}</a></td>
<td>N/A</td>
<td><a href="http://www.jgoodwin.net/representations/120/topic96.html">{fact order present}</a></td>
<td>N/A</td>
</tr>
<tr>
<td>N/A</td>
<td>N/A</td>
<td><a href="http://www.jgoodwin.net/representations/60/topic19.html">{body figure space}</a></td>
<td>N/A</td>
<td><a href="http://www.jgoodwin.net/representations/120/topic68.html">{forms figure form}</a></td>
<td>N/A</td>
</tr>
<tr>
<td>N/A</td>
<td>N/A</td>
<td><a href="http://www.jgoodwin.net/representations/60/topic16.html">{makes man relation}</a></td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
</tr>
<tr>
<td>N/A</td>
<td>N/A</td>
<td><a href="http://www.jgoodwin.net/representations/60/topic21.html">{national history public}</a></td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
</tr>
</table>
<p>There is a notable consistency between the topics the article is assigned to no matter how many there are to choose from. A logical question to ask is if Nora&#8217;s article is assigned to more or fewer topics than the average article across these six models. The percentage of all articles that are assigned to a topic with a proportional threshold >= .05 ranges from 32% with the ten-topic model to 52% in the 150-topic.</p>
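<p>One way to frame that comparison, sketched in Python under stated assumptions: count, for each article, the topics whose proportion meets the threshold, then compare a single article&#8217;s count with the mean. The article ids and proportions below are invented placeholders, not values from these models, and the function names are hypothetical.</p>
<pre><code class="language-python"># Hypothetical document-topic proportions: article id -> list of proportions
doc_topics = {
    "a1": [0.50, 0.30, 0.15, 0.05],
    "a2": [0.90, 0.04, 0.03, 0.03],
    "a3": [0.60, 0.36, 0.02, 0.02],
}

def topics_per_article(doc_topics, threshold=0.05):
    """Number of topics at or above the threshold, for each article."""
    return {
        doc: sum(1 for p in props if p >= threshold)
        for doc, props in doc_topics.items()
    }

def mean_topic_count(doc_topics, threshold=0.05):
    """Average number of above-threshold topics per article."""
    counts = topics_per_article(doc_topics, threshold)
    return sum(counts.values()) / len(counts)
</code></pre>
<p>An article&#8217;s entry in <code>topics_per_article(doc_topics)</code> can then be set against <code>mean_topic_count(doc_topics)</code> for each of the six models.</p>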
<p>In my next post, I am going to describe the relative frequency of the average article in the different models and try to identify which ones (including Nora&#8217;s, if it turns out to be) are disproportionately represented in the topics. I will also begin interpreting these results in light of what I felt was historicism&#8217;s relative absence in the <a href="http://www.jgoodwin.net/theory-browser">theory-journals corpus</a> I created earlier.</p>
<p>[UPDATE: 3/16. I corrected a bug in the browser-building program and generated a new table above with the correct topics linked for Nora's article. The previous table had omitted a few.]</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jgoodwin.net/?feed=rss2&#038;p=1142</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Learning to Code</title>
		<link>http://www.jgoodwin.net/?p=1131</link>
		<comments>http://www.jgoodwin.net/?p=1131#comments</comments>
		<pubDate>Sun, 10 Mar 2013 02:00:39 +0000</pubDate>
		<dc:creator>Jonathan</dc:creator>
				<category><![CDATA[Aleatory Research]]></category>

		<guid isPermaLink="false">http://www.jgoodwin.net/?p=1131</guid>
		<description><![CDATA[One of my secret vices is reading polemics about whether or not some group of people, usually humanists or librarians, should learn how to code. What&#8217;s meant by &#8220;to code&#8221; in these discussions varies quite a lot. Sometimes it&#8217;s a &#8230; <a href="http://www.jgoodwin.net/?p=1131">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>One of my secret vices is reading polemics about whether or not some group of people, usually <a href="http://stephenramsay.us/text/2011/01/11/on-building/">humanists</a> or <a href="http://blogs.princeton.edu/librarian/2013/03/why-i-ignore-gurus-sherpas-ninjas-mavens-and-other-sages/">librarians</a>, should learn how to code. What&#8217;s meant by &#8220;to code&#8221; in these discussions varies quite a lot. Sometimes it&#8217;s a markup language. More frequently it&#8217;s an interpreted language (usually python or ruby). I have yet to come across an argument for why a humanist should learn how to allocate memory and keep track of pointers in C, or master the algorithms and data structures in this typical <a href="http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-4.html#%_toc_start">introductory computer science textbook</a>; but I&#8217;m sure they&#8217;re out there.</p>
<p>I could easily imagine someone in game studies wanting to learn how to program games in their original environment, such as 6502 assembly, for example. A good materialist impulse, such as learning how to work a printing press or bind a book, should never be discouraged. But what about scholars who have an interest in digital media, electronic editing, or text mining? The skeptical argument here points out that there are existing tools for all of these activities, and the wise and conscientious scholar will seek those out rather than wasting time reinventing an inferior product.</p>
<p>This argument is very persuasive, but it doesn&#8217;t survive contact with the realities of today&#8217;s text-mining and machine-learning environment. I developed a strong interest in these areas several months ago (and have posted about little else since, sadly enough), even to the point where I went to an NEH seminar on topic modeling hosted by the fine folks at the <a href="http://mith.umd.edu/">MITH</a>. One of the informative lectures recommended that anyone serious about pursuing topic modeling projects learn the statistical programming language R and a scripting language such as python. This came as no more of a surprise to me than being reassured later in the evening by a dinner companion that Southerners were of course discriminated against in academia. I had begun working with topic-modeling in R packages, and a great deal of text-munging was required to assemble the topic output in a legible format. MALLET makes this easier, but there&#8217;s no existing GUI solution* for visualizing the topics (or creating <a href="http://jgoodwin.net/anthro-browser">browsers</a> of them, which some feel is more useful**).</p>
<p>Whatever flexibility that being able to dispense with existing solutions might offer you is more than counterbalanced by the unforgiving exactitude and provincial scrupulousness of programming languages, which manifestly avoid all but the most literal interpretations and cause limitless suffering for those foolish or masochistic enough to use them. These countless frustrations inevitably lead to undue pride in overcoming them, which leads people (or at least me) to replace a more rational regret over lost time with the temporary confidence of (almost always Pyrrhic) victory.</p>
<p>An optimistic assessment of the future of computation is that interfaces will become sophisticated enough to eliminate the need for almost anyone other than hobbyists to program a computer. Much research in artificial intelligence (and much of the most promising results as I understand them) has been in training computers to program themselves. Functional programming languages, to my untutored eye and heavily imperative mindset, already seem to train their programmers to think in a certain way. The correct syntax is the correct solution, in other words; and how far can it be from that notable efficiency to having the computer synthesize the necessary solutions to any technical difficulty or algorithmic refinement itself? (These last comments are somewhat facetious, though the promise of autoevolution was at the heart of cybernetics and related computational enthusiasms&#8212;the <a href="http://www.upress.umn.edu/book-division/books/summa-technologiae">recent English translation</a> of Lem&#8217;s <i>Summa Technologiae</i> is an interesting source here as is Lem&#8217;s &#8220;Golem XIV.&#8221;)</p>
<p>I can&#8217;t help but note that several of the arguments I&#8217;ve read that advise people not to learn to code and not to spend time teaching other people how to if you happen to be unlucky enough to be in a position to do so are written by people who make it clear that they themselves know how. (I&#8217;m thinking here in particular of <a href="http://www.personal.psu.edu/bul5/">Brian Lennon</a>, with whom I&#8217;ve had several discussions about these matters on twitter and also <a href="http://uncomputing.org">David Golumbia</a>.) Though I don&#8217;t think this myself, I could see how someone might describe this stance as obscurantist. (It&#8217;s probably a matter of ethos and also perhaps a dislike of people who exaggerate their technical accomplishments and abilities in front of audiences who don&#8217;t know any better&#8212;if you could concede that such things could exist in the DH community.)</p>
<p>*<a href="https://github.com/chrisjr/papermachines">Paper Machines</a>, though I haven&#8217;t tried it out, can now import and work with DfR requests. This may include topic modeling functionality as well.</p>
<p>**I have to admit that casual analysis (or, exacting scrutiny) of my server logs reveals that absolutely no one finds these topic browsers worth more than a few seconds&#8217; interest. I haven&#8217;t yet figured out if this is because they are objectively uninteresting or if users miss the links because of the style sheet. (Or both.)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jgoodwin.net/?feed=rss2&#038;p=1131</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>The Awakening of My Interest in Annular Systems</title>
		<link>http://www.jgoodwin.net/?p=1121</link>
		<comments>http://www.jgoodwin.net/?p=1121#comments</comments>
		<pubDate>Fri, 22 Feb 2013 00:43:57 +0000</pubDate>
		<dc:creator>Jonathan</dc:creator>
				<category><![CDATA[Aleatory Research]]></category>

		<guid isPermaLink="false">http://www.jgoodwin.net/?p=1121</guid>
		<description><![CDATA[I&#8217;ve been thinking a lot recently about a simple question: can machine learning detect patterns of disciplinary change that are at odds with received understanding? The forms of machine learning that I&#8217;ve been using to try to test this&#8212;LDA and &#8230; <a href="http://www.jgoodwin.net/?p=1121">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ve been thinking a lot recently about a simple question: can machine learning detect patterns of disciplinary change that are at odds with received understanding? The forms of machine learning that I&#8217;ve been using to try to test this&#8212;LDA and the dynamic LDA variant&#8212;do a very good job of picking up the patterns that you would suspect to find in, say, a large corpus of literary journals. The model I built of several theoretically oriented journals in JSTOR, for example, shows much the same trends that anyone familiar with the broad contours of literary theory would expect to find. The relative absence of historicism as a topic of self-reflective inquiry is also explainable by the journals represented and historicism&#8217;s comparatively low incidence of keywords and rote-citations.</p>
<p>I&#8217;ve heard from people on twitter that it&#8217;s a widely held belief that machine-learning techniques (and, by extension, all quantitative methods) can only tell us what we already know about the texts. I admit some initial skepticism about the prevalence of this claim, but I&#8217;ve now seen more evidence of it in the wild, so to speak, and I think I understand where some of this overly categorical skepticism comes from. A test of the validity of topic modeling, for example, would be if it produces a coherent model of a well-known corpus. If it does, then it is likely that it will do the same for an unknown or unread group of texts. The models that I have built of scholarly literature from JSTOR, I can see, are thought by some of the people who&#8217;ve seen them to be well-understood corpora. If the models reflect the general topics and trends that people know from their knowledge of the field, then that&#8217;s great as far as it goes, but we&#8217;ll have to reserve judgment on the great unread.</p>
<p>One issue here is that I don&#8217;t think the disciplinary history of any field is well understood. Topic modeling&#8217;s disinterested aggregations have the potential to show an unrecognized formation or the persistence of a trend long-thought dormant. <a href="http://culturecat.net/node/1564">Clancy</a> found some clustering of articles in rhetoric journals associated with a topic that she initially would have labeled as &#8220;expressivist&#8221; from several decades before she would expect. Part of this has to do with the eclectic nature of what&#8217;s published in <i>College English</i>, of course, and part has to do with the parallels between creative writing and expressivist pedagogy. But it&#8217;s the type of specific connection that someone following established histories is not likely to find.</p>
<p>Ben Schmidt noted that <a href="http://sappingattention.blogspot.com/2013/01/keeping-words-in-topic-models.html">topic modeling</a> was designed and marketed, to some degree, as a replacement for keyword search. Schmidt is more skeptical than I am of the usefulness of this higher-level of abstraction for general scholarly research. I know enough about anthropology to have my eyebrows raised by this <a href="http://www.nytimes.com/2013/02/19/science/napoleon-chagnons-war-stories-in-the-amazon-and-at-home.html">Nicholas Wade essay on Napoleon Chagnon</a>, for example, and I still find this <a href="http://jgoodwin.net/anthro-browser">browser</a> of <i>American Anthropologist</i> to be a quicker way of finding articles than JSTOR&#8217;s interface. I created this browser to compare with the <a href="http://jgoodwin.net/folklore-browser">folklore browser*</a> of the corpus that <a href="http://johnlaudun.org">John Laudun</a> and I have been working with. We wanted to see if topic models would reflect our intuition that the cultural/linguistic turn in anthropology and folklore diffused through their respective disciplines&#8217; scholarly journals (the folklore corpus contains the journal most analogous to <i>American Anthropologist</i>, <i>The Journal of American Folklore</i>, but it also has other folklore journals as well) at the expected time (earlier in anthropology than folklore).</p>
<p>A very promising, to my mind, way of correlating topic models of journals is with networks of citations. I&#8217;ve done enough network graphs of scholarly citations to know that, unless you heavily prune and categorize the citations, the results are going to be hard to visualize in any meaningful way. (One of the first network graphs I created, of all of the citations in thirty years of <i>JAF</i>, required zooming in to something like 1000x magnification to make out individual nodes. I&#8217;m far from an expert at creating efficient network visualizations, needless to say.) JSTOR once provided citation data through its Data for Research interface; it does not any longer as far as I know. This has been somewhat frustrating.</p>
<p>If we had citation data, taking two topics that both seem reflective of a general cultural/linguistic/poststructuralist influence, such as this <a href="http://www.jgoodwin.net/folklore-browser/topic26.html">folklore topic</a> and this <a href="http://www.jgoodwin.net/anthro-browser/topic3.html">anthropological one</a>, would allow us to compare the citation networks to see if the concomitant rise in proportion was reflected in references to shared sources. (L&eacute;vi-Strauss, for example, I know to be one of the most cited authors in the folklore corpus.) I would also like to explore the method described in this <a href="http://www.cs.princeton.edu/~blei/papers/GerrishBlei2010.pdf">paper</a> that uses a related form of posterior inference to discover the most influential documents in a corpus.**</p>
<p>This type of comparative exploration, while presenting an interesting technical challenge to implement (to me, that is, and I fully recognize the incommensurable gulf between using these algorithms and <i>creating</i> and refining them) can&#8217;t (yet) be mistaken for discovery. You can&#8217;t go from this to an a priori proof of non-discovery, however. Maybe no one is actually arguing this position, and I&#8217;m fabricating this straw argument out of supercilious tweets and decontextualized and half-remembered blog posts.</p>
<p>A more serious personal intellectual problem for me is that I find the dispute between <a href="http://norvig.com/chomsky.html">Peter Norvig</a> and <a href="http://www.theatlantic.com/technology/archive/12/11/noam-chomsky-on-where-artificial-intelligence-went-wrong/261637/?single_page=true">Noam Chomsky</a> to be either a case of mutual misunderstanding or one where Chomsky has by far the more persuasive case. If I&#8217;m being consistent then, I&#8217;d have to reject at least some of the methodological premises behind topic-modeling and related techniques. Perhaps &#8220;practical value&#8221; and &#8220;exploration/discovery&#8221; can share a peaceful co-existence.</p>
<p>*These browsers work by showing an index page with the first four words of each topic. You can then click on any one of the topics to see the full list of words associated with it, together with a list of articles sorted by how strongly they represent that topic. Clicking then on an individual article takes you to a page that shows the other topics most associated with that article, also clickable, and a link to the JSTOR page of the article itself.</p>
<p>**The note about the model taking more than ten hours to run fills me with foreboding, however. My (doubtlessly inefficient) browser-creating scripts can take more than an hour to run on a corpus of 10K documents, combined with another hour or more with MALLET and R; it really grinds down a person conditioned to expect instant results in today&#8217;s attention economy.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jgoodwin.net/?feed=rss2&#038;p=1121</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Two Topic Browsers</title>
		<link>http://www.jgoodwin.net/?p=1110</link>
		<comments>http://www.jgoodwin.net/?p=1110#comments</comments>
		<pubDate>Wed, 13 Feb 2013 15:44:05 +0000</pubDate>
		<dc:creator>Jonathan</dc:creator>
				<category><![CDATA[Aleatory Research]]></category>

		<guid isPermaLink="false">http://www.jgoodwin.net/?p=1110</guid>
		<description><![CDATA[Ben Schmidt, in a detailed and very useful post about some potential problems with using topic models for humanities research, wondered why people didn&#8217;t commonly build browsers for their models. For me, the answer was quite simple: I couldn&#8217;t figure &#8230; <a href="http://www.jgoodwin.net/?p=1110">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Ben Schmidt, <a href="http://sappingattention.blogspot.com/2013/01/keeping-words-in-topic-models.html">in a detailed and very useful post</a> about some potential problems with using topic models for humanities research, wondered why people didn&#8217;t commonly build browsers for their models. For me, the answer was quite simple: I couldn&#8217;t figure out how to get the necessary output files from MALLET to use Allison Chaney&#8217;s <a href="http://code.google.com/p/tmve/">topic modeling visualization engine</a>. I&#8217;m sure that the output can be configured to do so, and I&#8217;ve built the dynamic-topic-modeling code, which does produce the same type of files as lda-c, but I hadn&#8217;t actually used lda-c (except through an R package front-end) for my own models.</p>
<p>It occurred to me that a simple browser wouldn&#8217;t be that hard to build myself, so I made <a href="http://jgoodwin.net/rhet-browser/">one</a> for Clancy&#8217;s <a href="http://culturecat.net/node/1564">explorations</a> of the rhetoric/composition journals in JSTOR and <a href="http://jgoodwin.net/theory-browser/">another</a> for the <a href="http://jgoodwin.net/?p=1068">theory corpus</a>. (I did use Chaney&#8217;s CSS file.) I used my old graphs without the scatterplots layer for the theory-browser, as I didn&#8217;t want to take the time to regenerate those yet. And I&#8217;m not sure quite what&#8217;s going on with unicode/non-ASCII characters; theoretically the code I wrote should convert those properly. [UPDATE: Thanks to a pointer from <a href="http://andrewgoldstone.com/">Andrew Goldstone</a> on twitter, I fixed the encoding issue. <code>binmode, ":utf8"</code> on all filehandles is the answer in perl at least.)</p>
<p>The articles shown for each topic are those that have that topic most strongly associated with them. It&#8217;s quite possible that other articles have higher proportions of a given topic but have another topic even more strongly associated with them. I should also rewrite the code so that it grabs all articles above a certain threshold of significance. </p>
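<p>The difference between the two selection rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the browser&#8217;s actual code: the article names, the proportions, and the 0.4 cutoff are all made up.</p>

```python
# Hypothetical document-topic proportions: one row per article, one value per topic.
doc_topics = {
    "article_a": [0.70, 0.20, 0.10],
    "article_b": [0.45, 0.50, 0.05],
    "article_c": [0.10, 0.15, 0.75],
}

def by_top_topic(doc_topics, topic):
    """Articles whose single strongest topic is `topic`, strongest first."""
    hits = [
        (name, props[topic])
        for name, props in doc_topics.items()
        if max(range(len(props)), key=props.__getitem__) == topic
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

def by_threshold(doc_topics, topic, threshold=0.4):
    """Articles whose proportion for `topic` is at or above `threshold`."""
    hits = [
        (name, props[topic])
        for name, props in doc_topics.items()
        if props[topic] >= threshold
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

print(by_top_topic(doc_topics, 0))
print(by_threshold(doc_topics, 0))
```

<p>In this toy data, <code>article_b</code> is 45% topic 0 but is listed only under topic 1 by the first rule; the threshold rule lists it under both.</p>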
]]></content:encoded>
			<wfw:commentRss>http://www.jgoodwin.net/?feed=rss2&#038;p=1110</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Stronghold of Bioinformatics</title>
		<link>http://www.jgoodwin.net/?p=1098</link>
		<comments>http://www.jgoodwin.net/?p=1098#comments</comments>
		<pubDate>Sat, 05 Jan 2013 21:13:38 +0000</pubDate>
		<dc:creator>Jonathan</dc:creator>
				<category><![CDATA[Aleatory Research]]></category>

		<guid isPermaLink="false">http://www.jgoodwin.net/?p=1098</guid>
		<description><![CDATA[No one likes gamification or MOOCs, as far as I can tell. What I should say is that anyone trained in the hermeneutics of suspicion might even find it hard to accept their existence. It&#8217;s hard to come up with &#8230; <a href="http://www.jgoodwin.net/?p=1098">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>No one likes gamification or MOOCs, as far as I can tell. What I should say is that anyone trained in the hermeneutics of suspicion might even find it hard to accept their existence. It&#8217;s hard to come up with a hypothetical concept that would cry more piteously to the heavens for critique, for example. True to form, until a few weeks ago I had never earned a badge in my life and would have regarded the prospect of doing so with contempt and a touch of pity for whoever was naive enough to suggest it.</p>
<p>Then, there was this <a href="http://www.metafilter.com/122684/The-data-that-we-actually-used">Metafilter post</a>. Things I&#8217;ve discovered via Metafilter have taken away many months of work-time over the years, so the sensible thing to do would be to quit reading it. But that&#8217;s unlikely. In any case, <a href="http://rosalind.info">Project Rosalind</a> is a series of programming problems related to bioinformatics. It has the gamified features of &#8220;levels,&#8221; &#8220;badges,&#8221; &#8220;achievements,&#8221; and even, God help me, &#8220;xp.&#8221; The problems cover string processing, probability, and other topics. They have a <a href="http://rosalind.info/problems/tree-view/">tree-like</a> structure, and you have to solve precursor problems before getting access to the later ones. Solving a problem involves downloading a dataset and submitting a solution within five minutes. After you&#8217;ve solved the problem, you can see the code that others have posted to solve it.</p>
<p>This feature is particularly interesting to me, as I have never really learned functional programming, so when I see solutions to problems that I have solved in perl in languages such as Haskell, Clojure, or Scala, it&#8217;s a bit easier to understand how they were put together. (<a href="http://rosettacode.org/wiki/Rosetta_Code">Rosetta Code</a> is another place to see programming problems solved in multiple languages.) You are allowed unlimited attempts to get the right answer, and you can see forum questions about the problem after two unsuccessful tries. (I have posted a question once&#8212;a rather idiotic question in retrospect&#8212;and I received a correspondingly withering response, whose impact I mitigated somewhat by imagining it spoken in the Comic Book Guy&#8217;s voice.) </p>
<p>I have, at this point, <a href="http://rosalind.info/users/joncgoodwin/">solved</a> twenty-two of the ninety-three problems. The early ones are trivial, but I&#8217;m finding the difficulty to be scaling up quite a bit. I&#8217;ve used some algorithms I had never worked with before, such as suffix trees and shortest superstring. I&#8217;ve also used arbitrarily nested loops in perl (with Algorithm::Loops) and contemplated the theoretical limits of what a regular expression can match more than I&#8217;ve had to before. It&#8217;s also quite interesting to see what the total numbers of problems <a href="http://rosalind.info/problems/list-view/">solved</a> reveal about people&#8217;s background knowledge. <a href="http://rosalind.info/problems/iprb/">Two</a> of the <a href="http://rosalind.info/problems/lia/">problems</a> involving Mendelian inheritance and probability have been solved proportionally far fewer times than (more difficult) string-processing problems. (I don&#8217;t mean to be a hypocrite in saying this, as I got tired of the Punnett squares required in the second one of those and haven&#8217;t solved it myself.)</p>
<p>Some of the gamified features of the site I regard as silly (levels, xp, badges, achievements), but I admit that I can&#8217;t help but be motivated by the statistical information about how many people have solved which problems. It triggers my instinctual competitiveness, somehow. They even seem to encourage people to post their country of origin to introduce nationalism into the competitive mix here. As a learning tool, I&#8217;m not sure how effective it is. It&#8217;s quite possible to solve many of the problems while retaining only the barest minimum about the underlying molecular biology, and problems which require a bit more conceptual understanding than that (see the Mendelian inheritance ones above) are comparatively ignored.</p>
<p>The programmatic checking of solutions is also somewhat finicky. An end-of-line character at the end of the file will cause an otherwise correct solution to fail for at least some of the problems, for example. But all in all, I&#8217;m very impressed with this site and think it has a lot of potential for teaching people (humanists, for example) how to program. It would be nice to be able to reuse the code with different problem sets, if they ever decide to release the source.</p>
<p>UPDATE:</p>
<p>I corrected a few mistakes (I gave myself an extra problem, for instance), and I also wanted to mention an important precursor: <a href="http://projecteuler.net/">Project Euler</a>. This site has mathematics problems, and it also seems a bit more streamlined. I haven&#8217;t actually used it yet, though.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jgoodwin.net/?feed=rss2&#038;p=1098</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Topics in Theory</title>
		<link>http://www.jgoodwin.net/?p=1068</link>
		<comments>http://www.jgoodwin.net/?p=1068#comments</comments>
		<pubDate>Sat, 01 Dec 2012 23:48:41 +0000</pubDate>
		<dc:creator>Jonathan</dc:creator>
				<category><![CDATA[Aleatory Research]]></category>

		<guid isPermaLink="false">http://www.jgoodwin.net/?p=1068</guid>
		<description><![CDATA[After experimenting with topic models of Critical Inquiry, I thought it would be interesting to collect several of the theoretical journals that JSTOR has in their collection and run the model on a bigger collection with more topics to see &#8230; <a href="http://www.jgoodwin.net/?p=1068">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>After <a href="http://www.jgoodwin.net/?p=1027">experimenting with topic models</a> of <i>Critical Inquiry</i>, I thought it would be interesting to collect several of the theoretical journals that JSTOR has in their collection and run the model on a bigger collection with more topics to see how the algorithm would chart developments in theory.</p>
<p>I downloaded all of the articles (word-frequency data for each article, that is) in <i>New Literary History</i>, <i>Critical Inquiry</i>, <i>boundary 2</i>, <i>Diacritics</i>, <i>Cultural Critique</i>, and <i>Social Text</i>. I then ran a model fitted to one-hundred topics. I had to adjust the stop-word list to account for common words and, unsuccessfully, for words in other languages. What I should have done was use the supplied stop-word lists in those languages as well. At least this way there is a chance that interesting words in those languages will cluster together.</p>
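<p>The multilingual stop-word approach described above can be sketched roughly as follows. The tiny word lists and the sample counts are hypothetical stand-ins; real stop-word lists are far longer, and this is not the stopping procedure actually used for the model.</p>

```python
from collections import Counter

# Hypothetical stop-word lists; real lists would be much longer.
STOPWORDS = {
    "english": {"the", "of", "and", "in"},
    "german": {"das", "den", "ist", "ein"},
    "french": {"nous", "mais", "tout", "avec"},
}

def remove_stopwords(counts, languages):
    """Drop every word that appears in any of the selected stop-word lists."""
    stop = set().union(*(STOPWORDS[lang] for lang in languages))
    return Counter({w: n for w, n in counts.items() if w not in stop})

# Hypothetical word-frequency data for a single article.
counts = Counter({"the": 120, "kafka": 7, "das": 15, "nous": 9, "translation": 4})

english_only = remove_stopwords(counts, ["english"])
multilingual = remove_stopwords(counts, ["english", "german", "french"])
print(sorted(english_only))
print(sorted(multilingual))
```

<p>With only the English list applied, function words such as &#8220;das&#8221; and &#8220;nous&#8221; survive and are free to clump into language-specific topics; applying all three lists leaves only the content words.</p>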
<p>The topics themselves looked good, I thought. One hundred was about the right number, as I didn&#8217;t see much evidence of merging or splitting. I should say rather that I saw an acceptable level, or the usual level. This topic, for example, shows what I mean: &#8220;aboriginal <b>rap</b>[?] women australian climate weather movement work warming time australia housework change social power <b>oroonoko</b>[?] make wages years.&#8221; I also didn&#8217;t lemmatize this corpus, although I know how to. Lemmatizing takes a lot of time the way I&#8217;m doing it (using the WordNet plugin of the python Natural Language Toolkit). And I frankly haven&#8217;t been that impressed with the specificity of the lemmatized models that I have run.</p>
<p>Visualizing changes in topics over time is quite difficult. Each year will have thousands of observations per topic and taking the mean of each topic per year doesn&#8217;t always produce very readable results. <a href="http://sappingattention.blogspot.com">Benjamin Schmidt</a> <a href="http://www.jgoodwin.net/?p=1060&#038;cpage=1#comment-43602">suggested</a> trying the <code>geom_smooth</code> function of <code>ggplot2</code>, which I never had much luck with. The main reason, I found, that I couldn&#8217;t get it to work very well is that I was trying to create a composite graph of every topic using <code>facet_wrap</code>. Each topic graphed by itself with <code>geom_smooth</code> produced better results.</p>
<p>Here, for example, is the graph for this coherent topic&#8212;&#8221;gay sexual queer sex lesbian aids sexuality homosexual men homosexuality identity heterosexual male gender desire social lesbians drag butler&#8221;:<br />
<a href="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/gay_sexual_queer_.png"><img src="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/gay_sexual_queer_-1024x1024.png" alt="Graph of Change over Time in &quot;Queer Theory&quot; Topic from Theory Journals" title="Graph of Change over Time in &quot;Queer Theory&quot; Topic from Theory Journals" width="584" height="584" class="aligncenter size-large wp-image-1069" /></a></p>
<p>The chronology you see above does approximately track the rise of queer theory, though the smoothing algorithm is full of mystery and error. A scatter-plot of the same graph would be far noisier and also not reveal much in the way of change over time. This topic should also correlate somewhat roughly to postcolonial theory&#8211;&#8221;indian india hindu colonial postcolonial subaltern british indians nationalist gandhi english bengali religious caste nationalism sanskrit maori bengal west&#8221;:<br />
<a href="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/indian_india_hindu_.png"><img src="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/indian_india_hindu_-1024x1024.png" alt="Postcolonial Topics over Time in Theory Journals" title="Postcolonial Topics over Time in Theory Journals" width="584" height="584" class="aligncenter size-large wp-image-1071" /></a></p>
<p>I&#8217;m suspicious of this linear increase, needless to say. The underlying data is messier. Would Marxist theory show any decline around the predictable historical period? (Terms: &#8220;social class theory ideology political production ideological historical marxist marx bourgeois capitalist society capitalism marxism economic labor relations capital&#8221;)</p>
<p><a href="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/social_class_theory_.png"><img src="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/social_class_theory_-1024x1024.png" alt="Topics in Marxist Theory over Time in Theory Journals" title="Topics in Marxist Theory over Time in Theory Journals" width="584" height="584" class="aligncenter size-large wp-image-1073" /></a></p>
<p>That is roughly what I was expecting. But compare &#8220;soviet party revolutionary socialist revolution socialism communist political national left union struggle europe russian fascism war central movement european&#8221;:</p>
<p><a href="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/soviet_party_revolutionary_.png"><img src="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/soviet_party_revolutionary_-1024x1024.png" alt="Communist Theory Topics over Time in Theory Journals" title="Communist Theory Topics over Time in Theory Journals" width="584" height="584" class="aligncenter size-large wp-image-1075" /></a></p>
<p>I have hopes for the exploratory potential of topic-modeling disciplinary change this way. Another interesting topic that shows a linear-seeming increase (&#8220;muslim islamic islam religious arab muslims secular arabic algerian orientalism rushdie religion iranian iran western turkish ibn secularism algeria&#8221;):<br />
<a href="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/muslim_islamic_islam_.png"><img src="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/muslim_islamic_islam_-1024x1024.png" alt="Islamic Topics over Time in Theory Journals" title="Islamic Topics over Time in Theory Journals" width="584" height="584" class="aligncenter size-large wp-image-1077" /></a></p>
<p>To show what the data looks like with different visualizations, I&#8217;m going to cycle through several types of graphs of the above topic. The first is a line graph:<br />
<a href="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/line.png"><img src="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/line-1024x1024.png" alt="Line graph" title="Line graph" width="584" height="584" class="aligncenter size-large wp-image-1082" /></a></p>
<p>Next is a scatter-plot:</p>
<p><a href="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/point.png"><img src="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/point-1024x1024.png" alt="Point-graph" title="Point-graph" width="584" height="584" class="aligncenter size-large wp-image-1083" /></a></p>
<p>Now a scatter-plot with the <code>scale_y_log10</code> function applied:<br />
<a href="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/point_log10.png"><img src="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/point_log10-1024x1024.png" alt="Point (Log10)" title="point_log10" width="584" height="584" class="aligncenter size-large wp-image-1084" /></a></p>
<p>And a yearly mean:<br />
<a href="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/year-mean.png"><img src="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/year-mean-1024x1024.png" alt="Yearly mean" title="year-mean" width="584" height="584" class="aligncenter size-large wp-image-1085" /></a></p>
<p>Finally, a five-year mean:<br />
<a href="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/five-year.png"><img src="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/12/five-year-1024x1024.png" alt="Five-year mean" title="five-year" width="584" height="584" class="aligncenter size-large wp-image-1087" /></a></p>
<p>All of the graphs reveal a general upward trend, I think, though not as much as the smoothing function does. I would be delighted to hear any ideas anyone has about better ways to graph these. I&#8217;ve not found any improvements in grouping by document rather than year.</p>
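<p>For reference, the yearly and five-year means behind the last two graphs amount to a simple binning step. The sketch below is in Python rather than R, and its observations, the <code>binned_mean</code> helper, and the 1990 origin are invented for illustration; it is not the code that produced the graphs.</p>

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: (publication year, proportion of one topic in one article).
observations = [
    (1990, 0.01), (1990, 0.03),
    (1992, 0.02),
    (1996, 0.06), (1997, 0.04),
    (2001, 0.10), (2003, 0.08),
]

def binned_mean(observations, width=1, origin=1990):
    """Mean proportion per bin of `width` years, keyed by the first year of the bin."""
    bins = defaultdict(list)
    for year, proportion in observations:
        start = origin + ((year - origin) // width) * width
        bins[start].append(proportion)
    return {start: mean(values) for start, values in sorted(bins.items())}

yearly = binned_mean(observations, width=1)
five_year = binned_mean(observations, width=5)
print(yearly)
print(five_year)
```

<p>Wider bins trade resolution for readability: the yearly version keeps one point per year that has any articles, while the five-year version collapses the same data into three points.</p>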
<p>There&#8217;s more I plan to do with this data set, including coming up with better ways to visualize it (more precision, efficient ways of seeing many at once, etc.). I am including the full list of topics after the fold for reference. Some reveal OCR errors; others are publishing artifacts that my first rounds of stopping didn&#8217;t yet remove.</p>
<p>Update (2/14/12): I created a <a href="http://jgoodwin.net/theory-browser">browser</a> of this model that shows the articles most closely associated with each topic.</p>
<p><span id="more-1068"></span><br />
0       0.0321  american left radical political movement social economics black time years war power orwell decade began america back students books<br />
1       0.02376 chinese china western hong cultural kong wang boundary modern zhang west mao lu intellectual shanghai intellectuals japanese east liu<br />
2       0.09169 french france paris pierre jean barthes bataille flaubert work proust sartre jacques louis revolution text marcel madame histoire georges<br />
3       0.03662 movement movements left political radical american revolution cultural world aronowitz issue civil change issues society sexual freedom social history<br />
4       0.18127 language speech words word linguistic translation english voice meaning discourse speaking speaker act speak sentence utterance languages spoken verbal<br />
5       0.02654 woolf virginia jane beckett lentricchia gilbert austen lawrence moore richards eve forster adam room samson edna shaw bloomsbury stevens<br />
6       0.03491 asian american united ethnic pacific states immigrant immigrants immigration racial transnational border korean diaspora hawaiian mexican chicano identity diasporic<br />
7       0.68931 power suggests figure authority text version rhetoric appears makes irony offers terms force calls rhetorical cited earlier act ironic<br />
8       0.03294 latin spanish cuban don cuba puerto juan borges spain america mexico mexican brazilian rican american jose brazil maria garcia<br />
9       0.0539  medieval middle oral latin literary ages ancient tradition written auerbach century classical texts renaissance modern rhetoric augustine early philology<br />
10      0.12145 south national state political government people rights nation community local international official africa human african policy population land resistance<br />
11      0.04518 medical health body medicine disease aids drug illness mental patients treatment patient clinical healing hysteria madness addiction bodies coffee<br />
12      0.05613 german das den germany als berlin ist kafka benjamin ein karl eine trans ich mit dem ernst friedrich sich<br />
13      0.15228 social class theory ideology political production ideological historical marxist marx bourgeois capitalist society capitalism marxism economic labor relations capital<br />
14      0.18724 philosophy theory philosophical knowledge truth thought science scientific world wittgenstein epistemological human idea philosophers view language theoretical reason empirical<br />
15      0.21737 university trans john david duke cambridge boundary chicago michael robert harvard oxford richard paul modern james minnesota princeton peter<br />
16      0.0917  aesthetic art benjamin adorno sublime aesthetics work kant experience critique object beautiful form modern beauty concept judgment modernity trans<br />
17      0.12624 narrative story narrator narratives events stories narration time plot tale event voice literary telling structure discourse action account history<br />
18      0.20746 history historical past time present historians historian future modern events period histories century human historiography temporal study change historicism<br />
19      0.0124  aboriginal rap women australian climate weather movement work warming time australia housework change social power oroonoko make wages years<br />
20      0.02402 jewish palestinian arab israel israeli jews palestinians palestine zionist jew arabs state zionism middle west land political east hebrew<br />
21      0.56336 form work general structure individual forms elements analysis principle formal specific works terms single set style function features type<br />
22      0.16719 public york media television news national american times show president united march recent audience private april states people campaign<br />
23      0.11547 identity postmodern cultural politics postmodernism difference discourse social culture power dominant practices identities world resistance discourses history struggle language<br />
24      0.41333 critical studies work cultural theory critique political contemporary essay recent intellectual theoretical ways historical questions analysis question practice discussion<br />
25      0.06531 japanese money economic market economy exchange japan corporate commodity capital financial business capitalism consumption consumer commodities production economics wealth<br />
26      0.09897 sexual women male female woman men body gender sex feminine sexuality desire masculine power man difference masculinity pleasure bodies<br />
27      0.06501 women feminist feminism feminists gender female woman male men sexual patriarchal sex work politics political radical mary history movement<br />
28      0.66109 argument make position evidence response good view critics find simply claim general values arguments claims problem easily issue difficult<br />
29      0.07487 poem romantic poetry poet wordsworth poetic milton poems coleridge william mind nature yeats paradise poets blake thy shelley bloom<br />
30      0.01992 indian india hindu colonial postcolonial subaltern british indians nationalist gandhi english bengali religious caste nationalism sanskrit maori bengal west<br />
31      0.08238 fig photography photograph photographs figure photographic museum portrait picture objects visual pictures image images camera object medium display portraits<br />
32      0.1694  moral human ethical freedom life ethics individual action good morality values person social nature actions reason responsibility man judgment<br />
33      0.20092 literary literature criticism critical history critics theory critic works study language english art tradition texts modern aesthetic writers essays<br />
34      0.14431 writing book writer life writers write read reading written books work autobiography literary autobiographical literature personal wrote style reader<br />
35      0.01414 chomsky dewey ek war read goodman american politics marcuse movement work political state left social radical approach public life<br />
36      0.05149 literary history cited univ text cal notes ity dis ence human ness pro sion tional term form inter comparison<br />
37      0.06194 music musical sound jazz song dance performance sounds voice musicians songs listening play hear art playing singing radio recording<br />
38      0.1099  american america united states national americans war john world william james henry culture north cultural cold history canadian melville<br />
39      0.03316 gay sexual queer sex lesbian aids sexuality homosexual men homosexuality identity heterosexual male gender desire social lesbians drag butler<br />
40      0.17065 death body violence human dead life animal living bodies man fear pain kill blood horror murder animals scene violent<br />
41      0.09262 city space urban spatial building place spaces cities architecture site home center house landscape architectural public places built map<br />
42      0.02614 williams james brown tom fuller eliot maggie american john bishop read margaret people smithson act charlotte book bowl robert<br />
43      0.05497 foucault life power deleuze everyday michel modern sovereignty state political agamben trans sovereign body guattari human discipline politics disciplinary<br />
44      0.18385 love father family marriage woman life man young mother wife story home house husband daughter desire lady women scene<br />
45      0.69503 time order question place point longer end moment truth means present fact back precisely word speak give remains beginning<br />
46      0.30084 water earth land sea sun place green landscape tree river sky trees snow space light stone red white high<br />
47      0.10127 law legal rights state court justice property laws case act authority decision system rule states sovereign contract rules sovereignty<br />
48      0.49656 desire moment return loss form death lost presence absence condition remains sense figure passage identity past end crisis sign<br />
49      0.16624 system time systems theory information communication cognitive affect body processes affective human process temporal level space perception brain distinction<br />
50      0.42525 world experience life human reality consciousness nature process mind imagination vision sense language individual personal meaning man form act<br />
51      0.40925 book published years work text letter written early author title letters books publication read readers english number found wrote<br />
52      0.06248 technology information media computer technological digital technologies electronic machine communication human control world technical data virtual machines web internet<br />
53      0.06998 police crime trial violence criminal prison case murder legal evidence crimes political violent victim victims serial secret justice eichmann<br />
54      0.3546  subject discourse order relation space form discursive object difference practice subjectivity logic place symbolic subjects production position mode boundary<br />
55      0.02432 emerson ellison burke twain hawthorne trilling read invisible ralph huckleberry writers work finn jim black social literature women john<br />
56      0.03159 renaissance pastoral king court courtly queen literary english greenblatt prince sidney elizabethan good marie royal sir henry knight text<br />
57      0.09066 freud psychoanalysis psychoanalytic desire unconscious lacan theory object subject freudian ego psychic symbolic sexual pleasure psychological dream psychology fantasy<br />
58      0.57855 great modern century man life made time age men history years early thought intellectual long world found nineteenth period<br />
59      0.14516 labor economic workers work class social working economy state welfare system industrial market percent capital control poor government union<br />
60      0.1693  text reading literary interpretation meaning texts reader textual work author interpretive readers theory history intention read understanding act interpretations<br />
61      0.02364 muslim islamic islam religious arab muslims secular arabic algerian orientalism rushdie religion iranian iran western turkish ibn secularism algeria<br />
62      0.07184 play theater drama audience dramatic stage performance shakespeare plays theatrical action history hamlet tragedy characters comedy comic character actor<br />
63      0.04363 memory trauma past holocaust memories traumatic event jews nazi history experience truth jewish testimony auschwitz german victims witness war<br />
64      0.18222 image representation images visual space body gaze object vision presence mirror representations represented eye visible point perception represent picture<br />
65      0.06115 film films cinema camera cinematic hollywood movie screen shot documentary frame scene movies early spectator visual images time sequence<br />
66      0.13551 inquiry critical winter autumn abbreviated professor response spring account summer trans claim chicago fact theory point made convention essay<br />
67      0.07161 black white african racial race blacks slave negro racism slavery bois color racist whiteness whites people social class blackness<br />
68      0.11013 metaphor language meaning sign linguistic semiotic semantic signs system theory metaphors word metaphorical structure discourse level words literal semiotics<br />
69      0.04175 ou comme nous mais tout sans akhmatova avec cette sont baudelaire ii fait ses ces aux lydia meme leur<br />
70      0.04025 natural science mathematical scientific mathematics machine physics quantum hobbes human descartes nature mechanical machines body universe set history einstein<br />
71      0.10857 god religious christian religion divine church christianity secular faith theological christ theology jesus sacred spiritual holy biblical tradition jewish<br />
72      0.09419 war military vietnam world united states american nuclear enemy terror cold power peace soldiers torture wars army violence bush<br />
73      0.09257 art painting work artist artistic artists works arts paintings visual aesthetic modernist modernism painter modern museum abstract fried style<br />
74      0.06986 science human scientific nature natural scientists genetic species biological environmental research evolutionary sciences knowledge evolution biology humans life environment<br />
75      0.0255  gramsci italian italy che prison antonio fascist ii notebooks croce giovanni vita canto carlo dante verdi michelangelo rome trans<br />
76      0.21001 people time years good talk lot back things talking women wanted feel work make told interview thing men put<br />
77      0.19357 political politics state public social liberal power democratic society democracy civil freedom sphere people discourse rights private radical intellectuals<br />
78      0.10585 derrida text man reading deconstruction writing language texts jacques deconstructive miller play figure read difference question rhetoric paul essay<br />
79      0.05332 greek tragedy classical tragic ancient oedipus aristotle plato epic socrates homer roman greeks riddle greece city antigone gods athens<br />
80      0.07717 english british irish england london john eighteenth ireland britain victorian william century early thomas sir george late history charles<br />
81      0.09257 colonial european western native west culture world indian cultural africa african peoples imperial discourse europe indigenous anthropology people colonialism<br />
82      0.0366  genre bakhtin genres russian generic dialogue carnival dialogic literary mikhail literature rabelais poetics dostoevsky text speech work theory pushkin<br />
83      0.11295 myth ritual sacred myths king mythic symbolic story magic man traditional hero tale great stories power tales gods ancient<br />
84      0.01573 shame movement black social trip individual larkin youth political white culture lsd term anger person usage heavy cultural man<br />
85      0.10016 poetry poem poems poet poetic poets pound language poetics verse lyric line lines olson words word prose robert form<br />
86      0.72608 kind sense things make work part thing point find fact made world making makes call ways great called sort<br />
87      0.76598 point terms fact question notion problem sense concept part discussion view relation relationship important nature context makes idea simply<br />
88      0.08277 soviet party revolutionary socialist revolution socialism communist political national left union struggle europe russian fascism war central movement european<br />
89      0.10262 global world national cultural postcolonial nation international modernity globalization nationalism states united local transnational economic culture political western capital<br />
90      0.11165 fiction novels literary characters fictional character story reader author joyce narrator readers literature romance james genre narrative works fictions<br />
91      0.03936 game play games players sports playing player baseball sport chess world team rules ball leisure cricket played life living<br />
92      0.05541 diacritics trans derrida ofthe time community relation levinas ethical ethics event gift possibility jacques responsibility work logic blanchot future<br />
93      0.0573  children child family mother parents abuse birth mothers adult adoption childhood baby families maternal father kinship reproductive motherhood home<br />
94      0.12026 university students education academic faculty student research teaching graduate humanities higher knowledge professional school universities studies educational college english<br />
95      0.3797  social society cultural forms individual group ways relations role practices groups community people individuals power public important process life<br />
96      0.24791 young man room street small years day side big home food boy front days car back girl began middle<br />
97      0.09584 heidegger nietzsche thought hegel philosophy time world thinking truth trans spirit philosophical metaphysical essence sense phenomenology existence metaphysics man<br />
98      0.47688 back eyes time face night left hand head day man hands dark light dream words house inside world black<br />
99      0.16072 culture cultural popular mass rock high class contemporary youth production everyday industry dominant cultures american style traditional media consumption </p>
]]></content:encoded>
			<wfw:commentRss>http://www.jgoodwin.net/?feed=rss2&#038;p=1068</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Same Stuff, Different Graph</title>
		<link>http://www.jgoodwin.net/?p=1060</link>
		<comments>http://www.jgoodwin.net/?p=1060#comments</comments>
		<pubDate>Fri, 30 Nov 2012 01:46:40 +0000</pubDate>
		<dc:creator>Jonathan</dc:creator>
				<category><![CDATA[Aleatory Research]]></category>

		<guid isPermaLink="false">http://www.jgoodwin.net/?p=1060</guid>
		<description><![CDATA[When I started experimenting with graphing changes in topic-proportions over time, I didn&#8217;t pay much attention to the design of the graph. I could see that it was far too busy, but I assumed that this would be relatively easy &#8230; <a href="http://www.jgoodwin.net/?p=1060">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>When I started <a href="http://jgoodwin.net/?p=1049">experimenting</a> with graphing changes in topic-proportions over time, I didn&#8217;t pay much attention to the design of the graph. I could see that it was far too busy, but I assumed that this would be relatively easy to adjust using <code>ggplot2</code>&#8217;s many parameters.</p>
<p>It wasn&#8217;t. It didn&#8217;t take me too long to figure out that I needed to change the data from discrete to continuous in order to see anything like a sparkline, but it was also apparent from the other data sets I was working with that taking the mean at intervals was the only way to make a reasonably clean graph. I ended up using the <code>aggregate</code> function to create the n-year averages, though I read some intriguing descriptions of the power of <code>data.table</code> in R. (I refuse to ask for help on Stack Overflow, even though it would have saved many hours&#8217; worth of work. Character flaw.)</p>
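<p>As a rough illustration of the interval-averaging step, here is a minimal Python sketch. It is not the R code behind the graph: the <code>(year, topic, proportion)</code> tuple layout, the function name <code>interval_means</code>, and the sample numbers are all assumptions made for the example.</p>

```python
from collections import defaultdict
from statistics import mean


def interval_means(rows, n_years=5):
    """Average topic proportions within n-year bins.

    rows: iterable of (year, topic, proportion) tuples (hypothetical layout).
    Returns a dict mapping (bin_start_year, topic) to the mean proportion.
    """
    bins = defaultdict(list)
    for year, topic, proportion in rows:
        # With n_years=5, the year 1987 falls in the bin starting at 1985.
        bin_start = year - (year % n_years)
        bins[(bin_start, topic)].append(proportion)
    return {key: mean(values) for key, values in bins.items()}


# Made-up sample data, for illustration only.
rows = [
    (1985, "milton", 0.10),
    (1987, "milton", 0.30),
    (1991, "milton", 0.50),
    (1986, "chaucer", 0.20),
]
means = interval_means(rows, n_years=5)
```

<p>Grouping by bin start and topic plays the same role as grouping on two columns before taking the mean; setting <code>n_years=1</code> would give plain annual means.</p>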
<p>I now need to learn how to use the <a href="http://cran.r-project.org/web/packages/reshape/index.html">reshape</a> package, with its wonderfully named &#8216;melt&#8217; and &#8216;cast&#8217; features, to rewrite the code I&#8217;m using to change rows to columns. A simple for-loop iteration over a data frame in R can take hours, I&#8217;ve learned, and I expect that a melt-and-cast version would finish the job in seconds.</p>
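<p>The idea behind &#8216;melt&#8217; can be sketched in a few lines of plain Python. This is only an analogue of the concept, not the reshape package or its API: the function below, the <code>year</code> and <code>topic_0</code> column names, and the sample values are hypothetical.</p>

```python
def melt(wide_rows, id_key="year"):
    """Turn one-record-per-document rows into one record per (document, topic).

    wide_rows: list of dicts, each holding an id column plus one column per topic.
    Returns a list of dicts with the keys id_key, "topic", and "proportion".
    """
    long_rows = []
    for row in wide_rows:
        for column, value in row.items():
            if column == id_key:
                continue  # the id column is carried along, not melted
            long_rows.append(
                {id_key: row[id_key], "topic": column, "proportion": value}
            )
    return long_rows


# Made-up wide-format data: one row per document, one column per topic.
wide = [
    {"year": 1985, "topic_0": 0.7, "topic_1": 0.3},
    {"year": 1986, "topic_0": 0.4, "topic_1": 0.6},
]
long_form = melt(wide)
```

<p>The long form, with one row per document and topic, is the shape that faceted plotting generally expects, which is why the reshaping step precedes the graph.</p>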
<p>Anyway, here&#8217;s the revised graph of <i>ELH</i> with annual means of topic-proportions:<br />
<a href="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/11/elh_line.png"><img src="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/11/elh_line-1024x1024.png" alt="Graph of Topics in ELH" title="Graph of Topics in ELH" width="584" height="584" class="aligncenter size-large wp-image-1061" /></a></p>
<p>The full list of topics can be found in my previous <a href="http://jgoodwin.net/?p=1049">post</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jgoodwin.net/?feed=rss2&#038;p=1060</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Visualizing Topics in ELH</title>
		<link>http://www.jgoodwin.net/?p=1049</link>
		<comments>http://www.jgoodwin.net/?p=1049#comments</comments>
		<pubDate>Sun, 25 Nov 2012 17:54:25 +0000</pubDate>
		<dc:creator>Jonathan</dc:creator>
				<category><![CDATA[Aleatory Research]]></category>

		<guid isPermaLink="false">http://www.jgoodwin.net/?p=1049</guid>
		<description><![CDATA[I was impressed with Ian Milligan&#8217;s visualizations of Canadian parliamentary debates, and I wanted to try to visualize some of the topic models I&#8217;ve been creating from JSTOR&#8217;s Data for Research. ELH I thought would be an interesting journal to &#8230; <a href="http://www.jgoodwin.net/?p=1049">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I was impressed with Ian Milligan&#8217;s <a href="http://ianmilligan.ca/2012/11/09/cultural-trends-in-hansard/">visualizations</a> of <a href="http://ianmilli.files.wordpress.com/2012/11/hansard-topic-models.jpg">Canadian parliamentary debates</a>, and I wanted to try to visualize some of the topic models I&#8217;ve been creating from JSTOR&#8217;s Data for Research.</p>
<p><i>ELH</i> I thought would be an interesting journal to try, as each issue publishes articles on quite a range of literary periods, often from medieval to twentieth-century material. I assumed that LDA would be likely to identify each of these periods as a topic. To test this, I downloaded the entire set of articles from JSTOR and created a fifty-topic model. From there, I wanted to chart the proportion of each topic in each document. I was able to import the data into R and use <code>ggplot2</code> to create the following graph:</p>
<p><a href="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/11/elh.png"><img src="http://www.jgoodwin.net/wordpress/wp-content/uploads/2012/11/elh-1024x1024.png" alt="ELH-graph" title="elh" width="584" height="584" class="aligncenter size-large wp-image-1050" /></a></p>
<p>As you can see, many of these topics are identifiable from even two-word samples. Others show a need for lemmatizing (a slow but effective process using Python&#8217;s NLTK) or for further splitting. Perhaps fifty topics is not quite enough.</p>
<p>The code has to transform row data to column form in order to be efficiently sorted. It then uses the <code>ggplot2</code> <code>facet_wrap</code> feature to create the graph. I&#8217;d be happy to share it, if anyone&#8217;s interested, though it uses a for-loop, which I understand to be bad R. You also have to pre-process the JSTOR files to associate dates with the files themselves. I have a Perl script for this.</p>
<p>For reference, here is the complete list of topics generated by MALLET:<br />
0  love marriage lady lover woman desire young lovers passion wife sexual beauty friendship husband heart story loves relationship world<br />
1  women female sexual male woman gender men sexuality desire feminine masculine mary sex mother lady patriarchal early domestic feminist<br />
2  place house back scene light great description eyes passage water space sea makes night earth city landscape man day<br />
3  body human bodies medical scientific science physical natural bodily disease nature medicine mental health physiological early yellow james john<br />
4  renaissance english modern book bacon early thomas latin humanist elizabethan classical utopia richard sixteenth cambridge england tudor erasmus knowledge<br />
5  world life experience human sense reality mind personal feeling consciousness imagination real vision man modern individual emotional felt feel<br />
6  medieval english middle arthur piers green poem gawain knight play poet lancelot bat st late courtly sir hym plowman<br />
7  american war whitman poe america political conrad public literature jim world secret adams york marlow walt united leaves german<br />
8  moral man virtue social human character fielding nature good natural society characters sentimental sympathy hero morality tom irony action<br />
9  yeats keats tennyson marvell poem garden art flowers herrick victorian andrew poet nymph stanza beauty swinburne idylls green myth<br />
10 wordsworth coleridge romantic byron blake poem poetry poetic william lyrical poet romanticism prelude lines nature imagination mind book wordsworthian<br />
11 shakespeare play hamlet scene king dramatic tragedy richard othello plays macbeth action audience act shakespearean speech tragic drama measure<br />
12 shelley political burke revolution french mary sublime caleb revolutionary radical rousseau godwin romantic historical wollstonecraft falkland reform prometheus frankenstein<br />
13 chaucer tale troilus medieval tales canterbury prologue wife criseyde man book fortune courtly nat pardoner story knight ye lydgate<br />
14 social literary cultural history historical culture political text form modern discourse literature work forms individual critique critical texts reading<br />
15 johnson pope swift dryden satire addison gulliver satiric augustan wit boswell samuel essay restoration eighteenth lines spectator satirist poem<br />
16 narrative story narrator fiction reader history characters plot tale book romance events novels readers character stories truth fictional text<br />
17 irish scott historical national ireland gothic scottish history english british nation waverley scotland past castle novels ancient family antiquarian<br />
18 church religious catholic protestant religion puritan john england english reformation bishop body roman anglican ecclesiastical christian st real argument<br />
19 language words word speech meaning text reading reader writing rhetorical linguistic style voice verbal rhetoric read speak discourse sense<br />
20 law legal family clarissa father pamela marriage property richardson lovelace child incest daughter letter rape contract lady criminal miss<br />
21 spenser faerie book queene allegory pastoral canto knight allegorical guyon poem colin arthur red britomart venus poet books nature<br />
22 death life time past dead nature man memory present loss child living world natural end soul mother back voice<br />
23 sonnet sonnets line english music lines verse song musical form lyric sound rhyme sequence italian songs stanza lyrics opera<br />
24 woolf public lamb virginia forster society burney social room miss goldsmith evelina elia bloomsbury lily young sheridan peter house<br />
25 english british colonial european england national crusoe imperial cultural island empire indian early foreign spanish east trade west india<br />
26 nature human man mind natural reason world things theory thought truth ideas knowledge philosophy idea philosophical object form imagination<br />
27 jane dickens victorian lucy austen novels david charlotte bleak miss pip sir trollope catherine wuthering emma bronte fanny lady<br />
28 literary english century literature criticism critical history works poetry critics great writers essay art modern work eighteenth age influence<br />
29 make good made man great life men end give put find true time left found long mind thought things<br />
30 black white american slave hawthorne racial slavery race african melville slaves baldwin negro scarlet identity ahab hester southern sentimental<br />
31 social economic class money society labor economy public market trade commercial poor exchange domestic wealth system private property city<br />
32 part point view made general important time fact work kind present earlier sense passage effect form make similar found<br />
33 joyce stephen hardy wilde bloom ulysses james tess molly artist portrait young finnegans wake ford jude chapter dorian father<br />
34 poem poetry poet poems poetic speaker lines poets line verse stanza stevens lyric work reader thy song williams elegy<br />
35 letter book published letters edition writing written years john printed text literary early time william wrote books work author<br />
36 eliot george pater jewish james victorian daniel jews henry gwendolen deronda marius life jew dorothea social adam maggie middlemarch<br />
37 dracula sterne animal tristram beckett stein shandy animals stoker yorick henry horses lucy smart murphy journey dogs mechanical uncle<br />
38 marlowe faustus ovid epic classical virgil tamburlaine dido chapman aeneas ovidian roman hercules gods georgic hero myth aeneid georgics<br />
39 makes power place question order suggests simply means act response relationship fact terms sense role identity claim precisely critics<br />
40 sidney elizabethan sir pastoral lady queen elizabeth essex beowulf stella court philip arcadia ralegh earl countess sonnet poet lord<br />
41 god christian christ spiritual religious divine man grace st holy biblical faith john bible soul sin church word doctrine<br />
42 desire figure body object subject image text violence power representation narrative scene form relation moment pleasure trans fantasy gaze<br />
43 political king james english royal power history john state england henry charles government politics civil court war lord public<br />
44 image world time form vision images order structure pattern symbolic meaning movement symbol figure imagery final process physical metaphor<br />
45 milton paradise adam god satan lost eve samson book fall poem heaven epic son evil hell divine fallen sin<br />
46 art pound aesthetic browning painting work thoreau visual ruskin blake artist cantos ezra plate aesthetics arts museum canto paintings<br />
47 donne hath thy doth thou john doe good elizabethan henry haue owne made sir man renaissance bee thomas world<br />
48 play plays stage jonson drama theater audience theatrical dramatic performance comedy masque sir restoration comic theatre scene actors ben<br />
49 translation french latin il hebrew und se di ne cf ut english dans die version sed renaissance par quod </p>
]]></content:encoded>
			<wfw:commentRss>http://www.jgoodwin.net/?feed=rss2&#038;p=1049</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
