Plot textual differences in Shiny



Wordclouds such as Wordle are pretty rubbish, so I thought I'd try to make a better one, one that actually produces (statistically) meaningful results. I was so happy with the outcome I decided to make it interactive, so go on, have a play!

Compare any two texts (it turns out file uploading in Shiny is pretty experimental/dysfunctional) and graphically map the differences between them. The application will stem the text, remove stop words, and calculate statistical significance, all in a few clicks. Using the controls below you can also change the text size, the plot title, and the positioning of the terms (to avoid overlap), add transparency, and change the number of words plotted.
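For a rough idea of what such a comparison involves, here is a minimal Python sketch. It is not the app's R code. It assumes a two-proportion z-test as the significance measure and a tiny hand-picked stop list, and it skips stemming altogether; the names `STOP_WORDS`, `term_counts` and `z_scores` are made up for the illustration.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop list, not the one the app uses.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "on"}


def term_counts(text):
    """Lowercase the text, split it into letter-only tokens, drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def z_scores(text_a, text_b):
    """Two-proportion z score per term; positive means more typical of text_a."""
    a, b = term_counts(text_a), term_counts(text_b)
    n_a, n_b = sum(a.values()), sum(b.values())
    if n_a == 0 or n_b == 0:
        raise ValueError("both texts need at least one non-stop word")
    scores = {}
    for term in set(a) | set(b):
        p_a, p_b = a[term] / n_a, b[term] / n_b
        pooled = (a[term] + b[term]) / (n_a + n_b)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        # se is zero only when every token in both texts is the same term.
        scores[term] = (p_a - p_b) / se if se else 0.0
    return scores


scores = z_scores("the cat sat on the mat cat cat",
                  "the dog sat on the log dog dog")
for term, z in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{term:>4} {z:+.2f}")
```

Terms with a large positive or negative score would be plotted towards one side or the other, while terms scoring near zero (here "sat") belong in the middle.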

The sample image included to the left shows differences between my undergraduate thesis about Richard Pipes as a figure of ridicule in Russian media (on the left) and my MPhil thesis about Katyn in Polish and Russian media (on the right). I think the plot makes the differences in emphasis pretty obvious. The words in light blue in the middle are terms that feature strongly in both texts and are not significantly more present in one or the other.

I've presented the code and the logic behind the application elsewhere, so here I include only basic instructions: select two files to compare. Comparisons work best for medium-sized files: too small and there will be no differences, too large and processing time will become a bottleneck. If trying to do anything big I strongly recommend compiling the R script locally.

Any language should work, but you may need to find your own stoplist (and stem it!) to get meaningful results. My Russian stoplist may be downloaded from here. UPDATE: the Russian stoplist has been hardcoded into the app. Native support for English and, I think, German also exists, but for other languages you will need to recompile the programme with a custom-made stoplist.
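The reason for stemming the stoplist is presumably that the text is stemmed before stop words are removed, so an unstemmed stoplist no longer matches the tokens. A small Python sketch of the effect follows; `toy_stem` is a hypothetical stand-in for a real stemmer, and the stop list is invented for the example.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


stop_list = ["having", "because", "others"]

# Assumption: the text is stemmed first, then filtered against the stop list.
tokens = [toy_stem(w) for w in "others having doubts".split()]

# An unstemmed stop list misses the stemmed tokens entirely.
raw_filtered = [t for t in tokens if t not in set(stop_list)]

# Stemming the stop list the same way makes the entries match again.
stemmed_stops = {toy_stem(w) for w in stop_list}
stem_filtered = [t for t in tokens if t not in stemmed_stops]

print(raw_filtered)
print(stem_filtered)
```

Whatever stemmer is used, the point is only that the stoplist and the text must go through the same one.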

I've embedded the app below, but a more user-friendly version can be accessed here.

UPDATE: file upload is not working at the moment, so text needs to be pasted in. This will only work for small to medium-sized documents.

2 comments:

  1. This looks wonderful! I'm interested to see your R script, but it looks like the URL isn't quite right. Can you fix it? thanks.

    1. I don't think I ever uploaded it. Oops. Should be done now. I've identified a small bug in the way the z scores are calculated, so don't treat this as gospel! I'll try to update sometime this week. Best, R