Same Stuff, Different Graph

When I started experimenting with graphing changes in topic-proportions over time, I didn’t pay much attention to the design of the graph. I could see that it was far too busy, but I assumed that this would be relatively easy to adjust using ggplot2’s many parameters.

It wasn’t. It didn’t take me too long to figure out that I needed to change the data from discrete to continuous in order to see anything like a sparkline, but it was also apparent from the other data sets I was working with that taking the mean at intervals was the only way to make a reasonably clean graph. I ended up using the aggregate function to create the n-year averages, though I read some intriguing descriptions of the power of the data.table package in R. (I refuse to ask for help on Stack Overflow, even though it would have saved many hours’ worth of work. Character flaw.)
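
For anyone trying the same thing, here is a minimal sketch of the averaging step with base R’s aggregate. The data frame `docs` and its columns (`year`, `topic`, `proportion`) are made-up stand-ins rather than my actual data, and the 5-year bin is just one possible choice of n.

```r
# Hypothetical document-level data: one row per article, with its publication
# year and the proportion assigned to a single topic (all values invented).
set.seed(1)
docs <- data.frame(
  year       = sample(1950:1959, 200, replace = TRUE),
  topic      = "letter-book",
  proportion = runif(200)
)

# Annual means: one row per year and topic.
annual <- aggregate(proportion ~ year + topic, data = docs, FUN = mean)

# n-year means: floor each year to the start of its 5-year bin, then average.
docs$bin  <- 5 * (docs$year %/% 5)
five_year <- aggregate(proportion ~ bin + topic, data = docs, FUN = mean)
```

The formula interface groups by everything on the right-hand side, so more topics only mean more rows in `docs`, not more code.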

I now need to learn how to use the reshape package, with its wonderfully named ‘melt’ and ‘cast’ features, to rewrite the code I’m using to change rows to columns. A simple for-loop iteration over a data-frame in R can take hours, I’ve learned; and I expect that this other solution would finish the job in seconds.
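
A rough sketch of what that melt-and-cast rewrite might look like, assuming the reshape2 package (the successor to reshape, where the data-frame version of cast is called dcast). The table `wide` and its column names are hypothetical, not my real topic output.

```r
library(reshape2)

# Hypothetical wide table: one row per document, one column per topic.
wide <- data.frame(
  doc    = c("a", "b", "c"),
  year   = c(1950, 1950, 1951),
  topic1 = c(0.2, 0.5, 0.1),
  topic2 = c(0.8, 0.5, 0.9)
)

# melt: the topic columns become rows, one row per document-topic pair.
long <- melt(wide, id.vars = c("doc", "year"),
             variable.name = "topic", value.name = "proportion")

# dcast: back to one column per topic, averaging the documents in each year.
by_year <- dcast(long, year ~ topic, value.var = "proportion",
                 fun.aggregate = mean)
```

The long form produced by melt is also the shape ggplot2 generally wants, so the cast step may only be needed for tables meant to be read by eye.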

Anyway, here’s the revised graph of ELH with annual means of topic-proportions:
[Figure: Graph of Topics in ELH]

The full list of topics can be found in my previous post.

2 thoughts on “Same Stuff, Different Graph”

  1. reshape2 is great, and plyr is also super-useful. But you can also use geom_smooth in ggplot2 with a low ‘span’ (.15 or so) to get something pretty similar, even from thousands of document-level points.

    How come some of the curves look different on the topics? For example, ‘letter-book’ seems to have changed from U-shaped to ∩-shaped from the last post to this one, and image-world changed from ∩-shaped to U-shaped?

  2. geom_smooth never quite worked for me, though I didn’t put as much time as I should have into adjusting its various parameters.

    I read many stackoverflow responses that used plyr, but I also had trouble figuring out how to apply it to my particular problem. I’m going to keep investigating because the time problem is getting serious with my row-transformation.

    I hope the discrepancies are a result of a few outlier documents being averaged down. The first plot has every single document plotted, whereas this one is the mean of each year. I’ll recheck the data to see if I averaged it incorrectly, which, sadly enough, is entirely possible.
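
The geom_smooth suggestion in the first comment could be sketched roughly as follows. The data here are random placeholders, the year range is invented, and the loess method is an assumption; only the low span of about .15 comes from the comment itself.

```r
library(ggplot2)

# Hypothetical document-level points: 2000 articles with a publication year
# and a proportion for one topic (random values, for illustration only).
set.seed(1)
docs <- data.frame(
  year       = sample(1935:2005, 2000, replace = TRUE),
  proportion = runif(2000)
)

# A loess curve with a low span follows local changes closely; se = FALSE
# drops the confidence band so that only the line is drawn.
p <- ggplot(docs, aes(x = year, y = proportion)) +
  geom_smooth(method = "loess", span = 0.15, se = FALSE)
```

Printing `p` draws the plot. Because the smoothing is done on every document rather than on yearly means, a curve made this way can differ in shape from one built from aggregated data.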
