Creating Topic Models with JSTOR's Data for Research (DfR)
Sat Nov 17, 2012
These instructions are designed for someone using a Mac or Linux platform. (The differences below between using Linux and a Mac should be apparent to anyone who uses Linux, so I’m not going to indicate them here; it’s mainly where files are stored.) All of this should work on Windows, but you’ll need to install Cygwin or use alternate shell commands. MALLET has slightly different installation instructions for the Windows platform as well, I believe.
- Download and install MALLET.
Download the file (I assume it will be in /Users/yourusername/Downloads).
Open Terminal/shell (this is in the Applications/Utilities folder on a Mac).
cd to the directory where you downloaded it. If Downloads is your default download directory, that will be something like:
cd /Users/yourusername/Downloads
Now enter these commands:
tar -xzvf mallet-2.0.7.tar.gz
cd mallet-2.0.7
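Before going further, it can help to confirm that the archive unpacked where you expect. The sketch below is only an illustration: check_mallet_dir is a hypothetical helper name, and it assumes the launcher script lives at bin/mallet inside the MALLET directory, which is the path the later commands in this post use.

```shell
# Hypothetical helper: confirm that a directory looks like an unpacked MALLET.
# Assumption: the launcher script sits at bin/mallet inside that directory.
check_mallet_dir() {
    dir="${1:-.}"
    if [ -f "$dir/bin/mallet" ]; then
        echo "ok: found $dir/bin/mallet"
    else
        echo "missing: $dir/bin/mallet" >&2
        return 1
    fi
}

# Usage, from inside the MALLET directory:
#   check_mallet_dir
# or, from the directory that holds it:
#   check_mallet_dir mallet-2.0.7
```

If the check fails, the most likely cause is that you are in a different directory from the one where tar unpacked the files.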
- Download and unzip DfR data.
Create a DfR account. Log in.
Find the journal you want. Make sure the total number of issues is less than 1000 or be content with a random sample. (You can request a higher limit with an explanation of why you need it, or you can download multiple files.)
Go to the “Submit New Request” tab.
Select Citations Only AND Word Counts.
Select CSV for Output Format.
Give a job title.
Click “Submit Job.”
Wait for notification email.
When you get it, go to the “My Requests” page.
Use your browser’s “Save As” feature to download the “Full Dataset” file to the MALLET directory (I’m assuming it’s /Users/yourusername/Downloads/mallet-2.0.7).
Go to the terminal and type:
unzip 2012..[bunch of numbers]..zip
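Once the archive is unpacked, a quick count of the word-count files shows how many documents you actually received, which is worth comparing against the 1000 limit mentioned above. This is a minimal sketch: count_wordcounts is a hypothetical helper name, and it assumes the files sit in a wordcounts directory with a .CSV extension, as the count2txt command later in this post implies.

```shell
# Hypothetical helper: count the DfR word-count files in a directory.
# Assumption: the files are named *.CSV and live directly in that directory.
count_wordcounts() {
    dir="${1:-wordcounts}"
    find "$dir" -maxdepth 1 -type f -name '*.CSV' | wc -l | tr -d ' '
}

# Usage, from the MALLET directory:
#   count_wordcounts
```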
- Pre-processing the JSTOR data.
Download Andrew Goldstone’s count2txt. Save it in the same directory you’ve been using.
Check which version of perl you have:
perl -v
Unless the output of that command says “perl 5, version 14,” open count2txt in a text editor.
Find the line of the code that requires perl version 14.
Delete the line, or add a # before it. (You could also change “14” to the version of perl that you use.)
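The version comparison can also be done in the shell rather than by reading the version banner. The sketch below is an illustration only: perl_at_least is a hypothetical helper name, and it assumes the script's requirement is a minimum version (which is how Perl's own "use VERSION" statement behaves), so a newer perl would pass. It relies on Perl's built-in $] variable, which holds the version in numeric form such as 5.014002.

```shell
# Hypothetical helper: succeed if the installed perl is at least version $1.
# $1 uses Perl's numeric form, e.g. 5.014 for "perl 5, version 14".
perl_at_least() {
    # $] is Perl's built-in numeric version, e.g. 5.014002
    perl -e 'exit($] >= $ARGV[0] ? 0 : 1)' "$1"
}

# Usage:
#   if perl_at_least 5.014; then echo "perl is new enough"; else echo "edit count2txt first"; fi
```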
Then run the script on the word-count files:
perl count2txt --multifile wordcounts/*.CSV
Create a directory containing only the txt files for MALLET to work on:
mkdir text
cp wordcounts/*.txt text
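If you expect to repeat this step (for example after downloading a second batch), a small function that can be re-run safely may be handy. This is a sketch under assumptions: collect_txt is a hypothetical helper name, and it assumes count2txt has already written its .txt files into the wordcounts directory. It creates the target directory if needed, copies the files, and prints how many .txt files the target now holds.

```shell
# Hypothetical helper: copy the .txt files into a directory of their own
# and report how many ended up there. Safe to run more than once.
collect_txt() {
    src="${1:-wordcounts}"
    dest="${2:-text}"
    mkdir -p "$dest" || return 1          # no error if the directory exists
    cp "$src"/*.txt "$dest"/ || return 1  # copy only the converted text files
    find "$dest" -maxdepth 1 -type f -name '*.txt' | wc -l | tr -d ' '
}

# Usage, from the MALLET directory:
#   collect_txt wordcounts text
```

The number printed should match the number of documents you downloaded; if it is lower, some files were probably not converted.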
- Run Mallet
Enter the following (note that this, like every other command here, should be on a single line):
bin/mallet import-dir --input text --output topic-input.mallet --keep-sequence --remove-stopwords
Now run the topic-modeler:
bin/mallet train-topics --input topic-input.mallet --num-topics 10 --output-topic-keys jstor.model.txt
Now look at file jstor.model.txt for your results.
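The topic-keys file can be hard to scan when each topic lists many words. The sketch below prints only the first few words of each topic. It is an illustration, not part of MALLET: top_words is a hypothetical helper name, and it assumes each line of the file has three tab-separated fields (topic number, a weight, then the space-separated words), so check your own file before relying on it.

```shell
# Hypothetical helper: print the first N words of each topic.
# Assumption: each line is "topic<TAB>weight<TAB>word word word ...".
top_words() {
    file="${1:-jstor.model.txt}"
    n="${2:-5}"
    awk -F '\t' -v n="$n" '{
        count = split($3, w, " ")          # words of this topic
        line = "topic " $1 ":"
        for (i = 1; i <= n && i <= count; i++) line = line " " w[i]
        print line
    }' "$file"
}

# Usage:
#   top_words jstor.model.txt 8
```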
You will probably want more than 10 topics, but 10 keeps the run short
for a first experiment. MALLET also has a lot of parameters you can
experiment with. You’ll probably want to add your own stop words. In
the same directory, create a list of your own stop words in a text
editor and save it as “stop.txt”. Then try this command to re-create
the MALLET input file:
bin/mallet import-dir --input text --output topic-input.mallet --keep-sequence --remove-stopwords --extra-stopwords stop.txt
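One way to find candidates for stop.txt is to look at the most frequent words across the whole corpus. This is a rough sketch, not a recommendation of which words to remove: frequent_words is a hypothetical helper name, it treats anything that is not a letter as a separator, and it does not know about the words MALLET's built-in list already removes. Review the output by hand before copying anything into stop.txt.

```shell
# Hypothetical helper: list the N most frequent words in a directory of
# .txt files, most frequent first, one word per line.
frequent_words() {
    dir="${1:-text}"
    n="${2:-20}"
    cat "$dir"/*.txt \
        | tr '[:upper:]' '[:lower:]' \
        | tr -cs '[:alpha:]' '\n' \
        | sed '/^$/d' \
        | sort | uniq -c | sort -rn \
        | head -n "$n" \
        | awk '{print $2}'
}

# Usage, from the MALLET directory:
#   frequent_words text 50 > candidates.txt
```

The output is one word per line, which is a layout MALLET should accept for the extra stop-words file since it separates entries by whitespace.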