Topics in Theory

After experimenting with topic models of Critical Inquiry, I thought it would be interesting to collect several of the theoretical journals that JSTOR has in their collection and run the model on a bigger collection with more topics to see how the algorithm would chart developments in theory.

I downloaded all of the articles (that is, the word-frequency data for each article) in New Literary History, Critical Inquiry, boundary 2, Diacritics, Cultural Critique, and Social Text. I then ran a model fitted to one hundred topics. I had to adjust the stop-word list to account for common words and, unsuccessfully, for words in other languages. What I should have done was use the supplied stop-word lists in those languages as well. As things stand, at least there is a chance that interesting words in those languages will cluster together.
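For what it's worth, here is a minimal Python sketch of the approach I should have taken: combine stop-word lists for several languages and apply the union to each article's word-frequency table. The tiny word sets, the sample counts, and the function names are all made up for illustration; real lists would be read from whatever stop-word files the modeling toolkit supplies.

```python
from collections import Counter

# Illustrative stand-ins only: real stop-word lists would be read from files.
STOPWORDS_BY_LANGUAGE = {
    "english": {"the", "of", "and", "in", "to"},
    "french": {"le", "la", "les", "de", "et"},
    "german": {"der", "die", "das", "und", "ist"},
}


def combined_stoplist(lists_by_language):
    """Union the per-language stop-word sets into a single set."""
    combined = set()
    for words in lists_by_language.values():
        combined |= words
    return combined


def filter_counts(counts, stoplist):
    """Return a copy of one article's word counts with stop words removed."""
    return Counter({w: n for w, n in counts.items() if w.lower() not in stoplist})


# Hypothetical word-frequency data for a single article.
article = Counter({"the": 120, "das": 14, "et": 9, "subaltern": 7, "sprache": 3})
filtered = filter_counts(article, combined_stoplist(STOPWORDS_BY_LANGUAGE))
print(filtered)
```

Only the function words are dropped, so content words in other languages stay in the corpus and can still cluster into topics of their own.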

The topics themselves looked good, I thought. One hundred was about the right number, as I didn’t see much evidence of merging or splitting. I should say rather that I saw an acceptable level, or the usual level. This topic, for example, shows what I mean: “aboriginal rap[?] women australian climate weather movement work warming time australia housework change social power oroonoko[?] make wages years.” I also didn’t lemmatize this corpus, although I know how to. Lemmatizing takes a lot of time the way I’m doing it (using the WordNet lemmatizer in the Python Natural Language Toolkit). And I frankly haven’t been that impressed with the specificity of the lemmatized models that I have run.

Visualizing changes in topics over time is quite difficult. Each year will have thousands of observations per topic, and taking the mean of each topic per year doesn’t always produce very readable results. Benjamin Schmidt suggested trying the geom_smooth function of ggplot2, which I had never had much luck with. The main reason I couldn’t get it to work very well, I found, is that I was trying to create a composite graph of every topic using facet_wrap. Each topic graphed by itself with geom_smooth produced better results.
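To make the per-year averaging concrete, here is a small Python sketch of it (my graphs were made in R, so this is an illustration and not the code behind them). It assumes the model output has already been reshaped into (year, topic, proportion) records; the record layout, the sample values, and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def yearly_means(records):
    """Average topic proportions by year.

    records: iterable of (year, topic, proportion) tuples, one per
    article-topic observation. Returns {topic: {year: mean proportion}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for year, topic, proportion in records:
        grouped[topic][year].append(proportion)
    return {
        topic: {year: mean(values) for year, values in sorted(by_year.items())}
        for topic, by_year in grouped.items()
    }


# Made-up observations for a single topic (numbered 39 here).
sample = [(1990, 39, 0.0), (1990, 39, 0.5), (1991, 39, 0.25)]
means = yearly_means(sample)
print(means)
```

Because most articles in any year have almost none of a given topic, these means sit close to zero and jump around from year to year, which is part of why the plain yearly line is hard to read.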

Here, for example, is the graph for this coherent topic—“gay sexual queer sex lesbian aids sexuality homosexual men homosexuality identity heterosexual male gender desire social lesbians drag butler”:
Graph of Change over Time in "Queer Theory" Topic from Theory Journals

The chronology you see above does approximately track the rise of queer theory, though the smoothing algorithm is full of mystery and error. A scatter-plot of the same data would be far noisier and would not reveal much in the way of change over time. This topic should also correspond roughly to postcolonial theory–“indian india hindu colonial postcolonial subaltern british indians nationalist gandhi english bengali religious caste nationalism sanskrit maori bengal west”:
Postcolonial Topics over Time in Theory Journals

I’m suspicious of this linear increase, needless to say. The underlying data is messier. Would Marxist theory show any decline around the predictable historical period? (Terms: “social class theory ideology political production ideological historical marxist marx bourgeois capitalist society capitalism marxism economic labor relations capital”)

Topics in Marxist Theory over Time in Theory Journals

That is roughly what I was expecting. But compare “soviet party revolutionary socialist revolution socialism communist political national left union struggle europe russian fascism war central movement european”:

Communist Theory Topics over Time in Theory Journals

I have hopes for the exploratory potential of topic-modeling disciplinary change this way. Another interesting topic that shows a linear-seeming increase (“muslim islamic islam religious arab muslims secular arabic algerian orientalism rushdie religion iranian iran western turkish ibn secularism algeria”):
Islamic Topics over Time in Theory Journals

To show what the data looks like with different visualizations, I’m going to cycle through several types of graphs of the above topic. The first is a line graph:
Line graph

Next is a scatter-plot:


Now a scatter-plot with the scale_y_log10 function applied:
Point (Log10)

And a yearly mean:
Yearly mean

Finally, a five-year mean:
Five-year mean

All of the graphs reveal a general upward trend, I think, though not as much as the smoothing function does. I would be delighted to hear any ideas anyone has about better ways to graph these. I’ve not found any improvements in grouping by document rather than year.
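The five-year mean above is just the yearly mean with wider bins. A rough Python sketch of the binning, assuming (year, proportion) pairs for a single topic and aligning bins to multiples of five (that alignment is my choice here; other start years would work equally well). The function name and sample values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def binned_means(points, width=5):
    """Average proportions within fixed-width year bins.

    points: iterable of (year, proportion) pairs for one topic.
    Each bin is keyed by its first year, e.g. 1987 falls in the 1985 bin.
    """
    bins = defaultdict(list)
    for year, proportion in points:
        bins[year - year % width].append(proportion)
    return {start: mean(values) for start, values in sorted(bins.items())}


# Made-up observations for a single topic.
observations = [(1985, 0.0), (1989, 0.5), (1990, 0.75)]
five_year = binned_means(observations)
print(five_year)
```

Wider bins trade detail for readability: the line is calmer, but a change that happens inside one bin disappears.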

There’s more I plan to do with this data set, including coming up with better ways to visualize it (more precision, efficient ways of seeing many at once, etc.). I am including the full list of topics after the fold for reference. Some reveal OCR errors; others are publishing artifacts that my first rounds of stopping didn’t yet remove.

Update (2/14/12): I created a browser of this model that shows the articles most closely associated with each topic.

0 0.0321 american left radical political movement social economics black time years war power orwell decade began america back students books
1 0.02376 chinese china western hong cultural kong wang boundary modern zhang west mao lu intellectual shanghai intellectuals japanese east liu
2 0.09169 french france paris pierre jean barthes bataille flaubert work proust sartre jacques louis revolution text marcel madame histoire georges
3 0.03662 movement movements left political radical american revolution cultural world aronowitz issue civil change issues society sexual freedom social history
4 0.18127 language speech words word linguistic translation english voice meaning discourse speaking speaker act speak sentence utterance languages spoken verbal
5 0.02654 woolf virginia jane beckett lentricchia gilbert austen lawrence moore richards eve forster adam room samson edna shaw bloomsbury stevens
6 0.03491 asian american united ethnic pacific states immigrant immigrants immigration racial transnational border korean diaspora hawaiian mexican chicano identity diasporic
7 0.68931 power suggests figure authority text version rhetoric appears makes irony offers terms force calls rhetorical cited earlier act ironic
8 0.03294 latin spanish cuban don cuba puerto juan borges spain america mexico mexican brazilian rican american jose brazil maria garcia
9 0.0539 medieval middle oral latin literary ages ancient tradition written auerbach century classical texts renaissance modern rhetoric augustine early philology
10 0.12145 south national state political government people rights nation community local international official africa human african policy population land resistance
11 0.04518 medical health body medicine disease aids drug illness mental patients treatment patient clinical healing hysteria madness addiction bodies coffee
12 0.05613 german das den germany als berlin ist kafka benjamin ein karl eine trans ich mit dem ernst friedrich sich
13 0.15228 social class theory ideology political production ideological historical marxist marx bourgeois capitalist society capitalism marxism economic labor relations capital
14 0.18724 philosophy theory philosophical knowledge truth thought science scientific world wittgenstein epistemological human idea philosophers view language theoretical reason empirical
15 0.21737 university trans john david duke cambridge boundary chicago michael robert harvard oxford richard paul modern james minnesota princeton peter
16 0.0917 aesthetic art benjamin adorno sublime aesthetics work kant experience critique object beautiful form modern beauty concept judgment modernity trans
17 0.12624 narrative story narrator narratives events stories narration time plot tale event voice literary telling structure discourse action account history
18 0.20746 history historical past time present historians historian future modern events period histories century human historiography temporal study change historicism
19 0.0124 aboriginal rap women australian climate weather movement work warming time australia housework change social power oroonoko make wages years
20 0.02402 jewish palestinian arab israel israeli jews palestinians palestine zionist jew arabs state zionism middle west land political east hebrew
21 0.56336 form work general structure individual forms elements analysis principle formal specific works terms single set style function features type
22 0.16719 public york media television news national american times show president united march recent audience private april states people campaign
23 0.11547 identity postmodern cultural politics postmodernism difference discourse social culture power dominant practices identities world resistance discourses history struggle language
24 0.41333 critical studies work cultural theory critique political contemporary essay recent intellectual theoretical ways historical questions analysis question practice discussion
25 0.06531 japanese money economic market economy exchange japan corporate commodity capital financial business capitalism consumption consumer commodities production economics wealth
26 0.09897 sexual women male female woman men body gender sex feminine sexuality desire masculine power man difference masculinity pleasure bodies
27 0.06501 women feminist feminism feminists gender female woman male men sexual patriarchal sex work politics political radical mary history movement
28 0.66109 argument make position evidence response good view critics find simply claim general values arguments claims problem easily issue difficult
29 0.07487 poem romantic poetry poet wordsworth poetic milton poems coleridge william mind nature yeats paradise poets blake thy shelley bloom
30 0.01992 indian india hindu colonial postcolonial subaltern british indians nationalist gandhi english bengali religious caste nationalism sanskrit maori bengal west
31 0.08238 fig photography photograph photographs figure photographic museum portrait picture objects visual pictures image images camera object medium display portraits
32 0.1694 moral human ethical freedom life ethics individual action good morality values person social nature actions reason responsibility man judgment
33 0.20092 literary literature criticism critical history critics theory critic works study language english art tradition texts modern aesthetic writers essays
34 0.14431 writing book writer life writers write read reading written books work autobiography literary autobiographical literature personal wrote style reader
35 0.01414 chomsky dewey ek war read goodman american politics marcuse movement work political state left social radical approach public life
36 0.05149 literary history cited univ text cal notes ity dis ence human ness pro sion tional term form inter comparison
37 0.06194 music musical sound jazz song dance performance sounds voice musicians songs listening play hear art playing singing radio recording
38 0.1099 american america united states national americans war john world william james henry culture north cultural cold history canadian melville
39 0.03316 gay sexual queer sex lesbian aids sexuality homosexual men homosexuality identity heterosexual male gender desire social lesbians drag butler
40 0.17065 death body violence human dead life animal living bodies man fear pain kill blood horror murder animals scene violent
41 0.09262 city space urban spatial building place spaces cities architecture site home center house landscape architectural public places built map
42 0.02614 williams james brown tom fuller eliot maggie american john bishop read margaret people smithson act charlotte book bowl robert
43 0.05497 foucault life power deleuze everyday michel modern sovereignty state political agamben trans sovereign body guattari human discipline politics disciplinary
44 0.18385 love father family marriage woman life man young mother wife story home house husband daughter desire lady women scene
45 0.69503 time order question place point longer end moment truth means present fact back precisely word speak give remains beginning
46 0.30084 water earth land sea sun place green landscape tree river sky trees snow space light stone red white high
47 0.10127 law legal rights state court justice property laws case act authority decision system rule states sovereign contract rules sovereignty
48 0.49656 desire moment return loss form death lost presence absence condition remains sense figure passage identity past end crisis sign
49 0.16624 system time systems theory information communication cognitive affect body processes affective human process temporal level space perception brain distinction
50 0.42525 world experience life human reality consciousness nature process mind imagination vision sense language individual personal meaning man form act
51 0.40925 book published years work text letter written early author title letters books publication read readers english number found wrote
52 0.06248 technology information media computer technological digital technologies electronic machine communication human control world technical data virtual machines web internet
53 0.06998 police crime trial violence criminal prison case murder legal evidence crimes political violent victim victims serial secret justice eichmann
54 0.3546 subject discourse order relation space form discursive object difference practice subjectivity logic place symbolic subjects production position mode boundary
55 0.02432 emerson ellison burke twain hawthorne trilling read invisible ralph huckleberry writers work finn jim black social literature women john
56 0.03159 renaissance pastoral king court courtly queen literary english greenblatt prince sidney elizabethan good marie royal sir henry knight text
57 0.09066 freud psychoanalysis psychoanalytic desire unconscious lacan theory object subject freudian ego psychic symbolic sexual pleasure psychological dream psychology fantasy
58 0.57855 great modern century man life made time age men history years early thought intellectual long world found nineteenth period
59 0.14516 labor economic workers work class social working economy state welfare system industrial market percent capital control poor government union
60 0.1693 text reading literary interpretation meaning texts reader textual work author interpretive readers theory history intention read understanding act interpretations
61 0.02364 muslim islamic islam religious arab muslims secular arabic algerian orientalism rushdie religion iranian iran western turkish ibn secularism algeria
62 0.07184 play theater drama audience dramatic stage performance shakespeare plays theatrical action history hamlet tragedy characters comedy comic character actor
63 0.04363 memory trauma past holocaust memories traumatic event jews nazi history experience truth jewish testimony auschwitz german victims witness war
64 0.18222 image representation images visual space body gaze object vision presence mirror representations represented eye visible point perception represent picture
65 0.06115 film films cinema camera cinematic hollywood movie screen shot documentary frame scene movies early spectator visual images time sequence
66 0.13551 inquiry critical winter autumn abbreviated professor response spring account summer trans claim chicago fact theory point made convention essay
67 0.07161 black white african racial race blacks slave negro racism slavery bois color racist whiteness whites people social class blackness
68 0.11013 metaphor language meaning sign linguistic semiotic semantic signs system theory metaphors word metaphorical structure discourse level words literal semiotics
69 0.04175 ou comme nous mais tout sans akhmatova avec cette sont baudelaire ii fait ses ces aux lydia meme leur
70 0.04025 natural science mathematical scientific mathematics machine physics quantum hobbes human descartes nature mechanical machines body universe set history einstein
71 0.10857 god religious christian religion divine church christianity secular faith theological christ theology jesus sacred spiritual holy biblical tradition jewish
72 0.09419 war military vietnam world united states american nuclear enemy terror cold power peace soldiers torture wars army violence bush
73 0.09257 art painting work artist artistic artists works arts paintings visual aesthetic modernist modernism painter modern museum abstract fried style
74 0.06986 science human scientific nature natural scientists genetic species biological environmental research evolutionary sciences knowledge evolution biology humans life environment
75 0.0255 gramsci italian italy che prison antonio fascist ii notebooks croce giovanni vita canto carlo dante verdi michelangelo rome trans
76 0.21001 people time years good talk lot back things talking women wanted feel work make told interview thing men put
77 0.19357 political politics state public social liberal power democratic society democracy civil freedom sphere people discourse rights private radical intellectuals
78 0.10585 derrida text man reading deconstruction writing language texts jacques deconstructive miller play figure read difference question rhetoric paul essay
79 0.05332 greek tragedy classical tragic ancient oedipus aristotle plato epic socrates homer roman greeks riddle greece city antigone gods athens
80 0.07717 english british irish england london john eighteenth ireland britain victorian william century early thomas sir george late history charles
81 0.09257 colonial european western native west culture world indian cultural africa african peoples imperial discourse europe indigenous anthropology people colonialism
82 0.0366 genre bakhtin genres russian generic dialogue carnival dialogic literary mikhail literature rabelais poetics dostoevsky text speech work theory pushkin
83 0.11295 myth ritual sacred myths king mythic symbolic story magic man traditional hero tale great stories power tales gods ancient
84 0.01573 shame movement black social trip individual larkin youth political white culture lsd term anger person usage heavy cultural man
85 0.10016 poetry poem poems poet poetic poets pound language poetics verse lyric line lines olson words word prose robert form
86 0.72608 kind sense things make work part thing point find fact made world making makes call ways great called sort
87 0.76598 point terms fact question notion problem sense concept part discussion view relation relationship important nature context makes idea simply
88 0.08277 soviet party revolutionary socialist revolution socialism communist political national left union struggle europe russian fascism war central movement european
89 0.10262 global world national cultural postcolonial nation international modernity globalization nationalism states united local transnational economic culture political western capital
90 0.11165 fiction novels literary characters fictional character story reader author joyce narrator readers literature romance james genre narrative works fictions
91 0.03936 game play games players sports playing player baseball sport chess world team rules ball leisure cricket played life living
92 0.05541 diacritics trans derrida ofthe time community relation levinas ethical ethics event gift possibility jacques responsibility work logic blanchot future
93 0.0573 children child family mother parents abuse birth mothers adult adoption childhood baby families maternal father kinship reproductive motherhood home
94 0.12026 university students education academic faculty student research teaching graduate humanities higher knowledge professional school universities studies educational college english
95 0.3797 social society cultural forms individual group ways relations role practices groups community people individuals power public important process life
96 0.24791 young man room street small years day side big home food boy front days car back girl began middle
97 0.09584 heidegger nietzsche thought hegel philosophy time world thinking truth trans spirit philosophical metaphysical essence sense phenomenology existence metaphysics man
98 0.47688 back eyes time face night left hand head day man hands dark light dream words house inside world black
99 0.16072 culture cultural popular mass rock high class contemporary youth production everyday industry dominant cultures american style traditional media consumption

7 thoughts on “Topics in Theory”

  1. Great topics, as usual — and the idea of combining multiple journals is definitely the direction we should go with this. At least that’s what I think — but people who know me know I’m obsessed with getting a larger data set, so …

    Re: the visualization problem, I’ve been combining a scatterplot of yearly probabilities and a smoothing line.

    E.g., here

    I’m just using R scatter.smooth for this rather than ggplot, so it’s not pretty and there are no error bars. But I’m quite pleased with the scatterplot-of-yearly-probabilities approach. Doing a scatterplot of individual articles introduces too much irrelevance, because there are always going to be articles (in every year) that just aren’t on topic X.

  2. With the smoothing function, I didn’t see any significant difference between x-axis-as-documents and x-axis-as-years, but the former is definitely too busy to be readable otherwise.

    Are you adjusting your MALLET docs output from the row form (which I’m sure you can do, though I’ve masochistically refused to look into it): (0 10.302387/jstor.txt 13 .23 24 .04 . . .), or are you handling that in R? I’ve come up with a solution that uses iteration over the data frame to go wide-to-long, but it’s very slow. I can’t come up with a melt/cast solution or a base reshape one.

  3. I’m very, very masochistic, so I wrote my own topic-modeling code in Java, and am still using it instead of MALLET. I wanted to make sure I understood the process, and could modify it as needed, and also (frankly) wanted to be able to make snappy retorts whenever people made the facile observation that “we’re treating LDA too much as a black box.”

    But anyway, aside from that retort-y motivation, I don’t necessarily recommend rolling your own. MALLET is faster, and the ad-hoc character of my code means that I can’t easily “plug into” other people’s workflows. E.g., in this case, my code directly outputs a document-topic matrix rather than MALLET’s token-by-token representation, so I haven’t had to solve that problem. Andrew Goldstone probably has; he may be using PERL, though.

  4. I remember reading that you wrote your own LDA code. Did you follow the lda-c source code, or implement it from the algorithm? The most masochistic and nerdy thing I’ve yet done for this project is compile the dynamic topic-modeling code and then spend several hours in gdb trying to figure out why it was seg-faulting (fruitlessly, as it turned out), which brought me back to the days of using my Aztec C compiler on the Amiga 500.

    For me, and I know you’ve written about this too, the ‘black-box’ effect is more noticeable with the underlying statistical methods than the code that implements them. I particularly want to learn more about and experiment with context-preserving (word order, etc.) approaches.

  5. I implemented it from the algorithm. But you’re right, writing the code doesn’t totally negate the black-box effect, because the statistical mechanics remain a bit opaque. Right now I’m fiddling with hyperparameter settings because I noticed my topics weren’t quite as crisp as the ones Andrew was getting from MALLET on the same dataset. It turns out those pesky hyperparameters make a difference!

    Variant topic-modeling algorithms are also very seductive. I’m messing around with “Topics over Time,” but haven’t gone much deeper down the rabbit hole than that. It’s a deep rabbit hole, and I think David Mimno may be right when he observes that generic LDA is actually very hard to improve. But what fun …

  6. It’s great to see what’s possible with a coherent set of journals–this is just fascinating, and those topics are full of suggestion. Like Ted I always want more data! I’d love to know what happens if you throw a non-theory journal into this mix.

    I think both yearly averages and scatterplots of documents are helpful, since the nice smooth upward trend line in this post turns out to combine two different kinds of phenomena: a few articles dominated by the target topic and, later, a bunch of articles with a substantial but not dominating proportion of the topic. One possibility is that this is actually showing the dynamic of trend-setting article followed by burgeoning subfield of discourse. Ted and I are exploring this more in modeling PMLA. More on that soon, hope you’ll comment when I blog (and when Ted blogs).

    On the more prosaic matter of operating mallet and munging its output, I feel your pain. If it’s any consolation I’ve put my for-loop-ridden R code on github. It runs slowly. I have a faster perl script which for some reason I’d forgotten about, which I’ll push there too.

  7. Thanks for sharing your code, Andrew. You’ve taken a more systematic approach to handling the various tasks necessary to produce these graphs than I have, and I look forward to trying these out.

    I also agree with you about the scatter-plot-plus-line approach. I was so concerned with the difficulties of making one type of graph that overlapping another illustrative layer didn’t occur to me. (Particularly since the faceted graph couldn’t show much detail in any case.)

    Your nested for loops might very well be faster than the single for loop with ever-increasing vector approach that I’m using. A melt/dcast solution freezes my computer before I can even see if it would work, and I got too frustrated to try it with a small example dataset beforehand.

    The trend-setting interpretation seems to me to be a useful analytic approach, and I’m thinking of ways to correlate it with citation networks.
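
The wide-to-long reshaping of MALLET's document-topic output discussed in the comments above can also be sketched in Python instead of R. This is a sketch only, assuming the row form quoted in the thread (document index, file name, then alternating topic numbers and proportions), that any header line starts with "#", and that file names contain no spaces; the function name is hypothetical.

```python
def doc_topics_to_long(lines):
    """Turn wide doc-topics rows into (document, topic, proportion) tuples."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and an assumed "#"-prefixed header
        fields = line.split()
        name, pairs = fields[1], fields[2:]
        # pairs alternates topic number, proportion, topic number, ...
        for topic, proportion in zip(pairs[0::2], pairs[1::2]):
            rows.append((name, int(topic), float(proportion)))
    return rows


# The header text is a placeholder; the data row follows the form quoted above.
sample_lines = [
    "#header",
    "0 10.302387/jstor.txt 13 .23 24 .04",
]
long_rows = doc_topics_to_long(sample_lines)
print(long_rows)
```

Each row is handled in a single pass with no growing data frame, so this should avoid the slowdown described for the looped R versions, though I have not timed it against them.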
