Scaling up text processing and Shutting up R: Topic modelling and MALLET

In this post I show how a combination of MALLET, Python, and data.table means we can analyse quite Big data in R, even though R itself buckles when confronted by textual data. 

Topic modelling is great fun. Using topic modelling I have been able to separate articles about the 'Kremlin' as a) a building, b) an international actor, c) the adversary of the Russian opposition, and d) a political and ideological symbol. But for topic modelling to be really useful it needs to see a lot of text. I like to do my analysis in R, but R tends to disagree with large quantities of text. Reading digital humanities literature lately, I suspect I am not alone in confronting this four-step process, of which this post is a result:

1) We want more data to get better results (and really just because it's available).
2) More data makes our software crash. Sometimes spectacularly, and particularly if it's R.
3) Ergo we need new or more efficient methods.
4) We write a blog post about our achievements in reaching step 3.


Recently I have found that I can use MALLET for topic modelling of my Russian media dataset. Topic modelling has become popular of late, and there are numerous excellent descriptions of what it is and how it works - see for instance Ted Underwood's piece here. Personally I've not made extensive use of topic models, usually fitted using LDA, because I thought the Russian case structure would be a barrier to getting useful results, and because most tools struggled to handle the quantity of data I wanted to analyse. The R topicmodels package was perfectly OK, but was limited by its reliance on the tm package and R, both of which struggle when faced with thousands, let alone millions, of texts.

The obvious solution to these problems is using MALLET. MALLET runs on Java and is consequently very much more efficient than R. David Mimno has recently released a very nifty R wrapper (aptly named 'mallet'). The package is handy for getting used to working with MALLET, but as the texts to be analysed need to be loaded into R, and consequently into memory, this wasn't really an option either. Ben Marwick has released an excellent script that allows you to run the command line implementation of MALLET from R, as well as import the results afterwards, and this is probably the approach closest to what we need.

All these implementations suffer when you try to scale up the analysis - mainly due to R, rather than MALLET, though MALLET is also happier if it can keep all its input in memory.* I fed my data to the program through the command line, which worked fine after increasing the heap space. In the end MALLET spat out a 7GB file, a matrix just shy of 1 million rows by 1000 columns or so, which was much too large to read into R. Or rather, I could read in the file, but not do much with it afterwards.

I was able to read the file in using read.table, but my laptop died when I tried to convert the data.frame to a data.table. In the end I used fread from the data.table package to get the file straight into the data.table format.

The MALLET output is in the format [ID] [filename], followed by topic-proportion pairs ranked according to the quality of the match:
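To make that layout concrete, here is a minimal Python sketch that parses one such line. The sample line is made up for illustration (three topics instead of 500, and a hypothetical file name), and the sketch assumes the fields are tab-separated, as the read.table calls with sep="\t" in this post suggest.

```python
def parse_doc_topics_line(line):
    """Split one doc-topics line into (doc_id, filename, [(topic, proportion), ...])."""
    # Drop the newline and any empty fields left by a trailing tab.
    fields = [f for f in line.rstrip("\n").split("\t") if f]
    doc_id, filename, rest = fields[0], fields[1], fields[2:]
    # The remaining fields alternate: topic number, proportion, topic number, ...
    pairs = [(int(rest[i]), float(rest[i + 1])) for i in range(0, len(rest) - 1, 2)]
    return doc_id, filename, pairs


# Made-up sample line (illustrative only): topics listed by rank, not by number.
sample = "0\tfile:/data/article_0001.txt\t2\t0.61\t0\t0.27\t1\t0.12\n"
doc_id, filename, pairs = parse_doc_topics_line(sample)
print(doc_id, filename, pairs)
```

The first pair is always the best-matching topic, which is why the top subject of an article can be read straight off the line.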


And this is repeated for 500 topics. In short, it is ideal for identifying the top subject matter for individual articles - no reshaping at all is needed - but not so great for finding the distribution of topics across articles. To reorder the data by topic (as opposed to topic rank) would require the mother of all reshape operations. The obvious way to do this is using reshape and dcast, following (or: copy-pasting) Ben Marwick:

outputtopickeysresult <- read.table(outputtopickeys, header=F, sep="\t")
outputdoctopicsresult <- read.table(outputdoctopics, header=F, sep="\t")
# manipulate outputdoctopicsresult to be more useful 
dat <- outputdoctopicsresult
l_dat <- reshape(dat, idvar=1:2, varying=list(topics=colnames(dat[,seq(3, ncol(dat)-1, 2)]),
                                             props=colnames(dat[,seq(4, ncol(dat), 2)])),   direction="long")
library(reshape2)
w_dat <- dcast(l_dat, V2 ~ V3)
rm(l_dat) # because this is very big but no longer needed

Ehrm, yeah. R didn't like that very much. It wanted to allocate a twelve-digit memory vector (maybe I exaggerate), which was just not an option, definitely not on my laptop.

The best approach I found in a single piece of code was using the splitstackshape package:


merged.stack(dat, id.vars=c("Id","text"), var.stubs=c("topic", "proportion"), sep="topic|proportion")


This worked great for up to about 100 000 rows, but it still makes a copy of the data.table and struggled to deal with the whole dataset. I tried some other data.table options, but they all involved melting the data into a very long table, then casting it back into a short or wide form, and at no point was I able to process more than about 200 000 rows. I am absolutely certain it is *possible* to do this in data.table, but I'm just that bit too dense to find the solution. After crashing my system one time too many, I decided the crux was as follows: if I am going to have a copy of the data in memory, I won't have enough spare memory to do anything with it.

My workaround? Not to keep anything in memory and to use a bit of Python. Nice and slow: read one line, write one line, placing the proportions in order of topics rather than rank. None of this load-everything-into-memory-and-cross-my-fingers nonsense. Using a dictionary of dictionaries is a much faster approach, but here as an exercise I tried to keep as little data as possible in memory. This little script requires virtually no memory, and by halving the number of columns (and rounding to five decimal places) the output file was a third of the size of the input file - about 2GB.
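The script itself is not shown here, so what follows is only a minimal sketch of the read-one-line, write-one-line idea, not the original code. The function name, the column names (Id, text, topic0, topic1, ...) and the file names are assumptions for illustration, as is the guess that any line starting with '#' is a header comment to be skipped.

```python
import os
import tempfile


def reorder_doc_topics(in_path, out_path, n_topics, decimals=5):
    """Stream a rank-ordered doc-topics file into one column per topic."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        # Hypothetical header: one column per topic number.
        header = ["Id", "text"] + [f"topic{t}" for t in range(n_topics)]
        dst.write("\t".join(header) + "\n")
        for line in src:  # only one line is held in memory at a time
            if line.startswith("#"):
                continue  # assumed header comment line
            fields = [f for f in line.rstrip("\n").split("\t") if f]
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            doc_id, name, rest = fields[0], fields[1], fields[2:]
            # Topics missing from the line keep a proportion of zero.
            row = [f"{0:.{decimals}f}"] * n_topics
            for i in range(0, len(rest) - 1, 2):
                topic, proportion = int(rest[i]), float(rest[i + 1])
                row[topic] = f"{proportion:.{decimals}f}"
            dst.write("\t".join([doc_id, name] + row) + "\n")


# Tiny made-up input (three topics) to show the effect.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "doc_topics.txt")
out_path = os.path.join(tmp_dir, "doc_topics_by_topic.txt")
with open(in_path, "w", encoding="utf-8") as f:
    f.write("#doc name topic proportion ...\n")
    f.write("0\tfile:/a.txt\t2\t0.61\t0\t0.27\t1\t0.12\n")
    f.write("1\tfile:/b.txt\t1\t0.5\t2\t0.3\t0\t0.2\n")

reorder_doc_topics(in_path, out_path, n_topics=3)
with open(out_path, encoding="utf-8") as f:
    out_lines = f.read().splitlines()
print(out_lines)
```

Because each output row has one column per topic instead of a topic column plus a proportion column per rank, the number of data columns is halved, and the result is a plain tab-separated file that fread can take as is.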

A benchmark: using fread() R loaded the processed data in 3 and a half minutes:

  user  system elapsed 
 201.72    3.09  242.02 

The imported file occupied 1.6 GB of memory, and was much more manageable:

> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    534899   28.6     899071   48.1    818163   43.7
Vcells 219987339 1678.4  233753696 1783.4 220164956 1679.8

And much easier to work with too:

etc.

Using data.table's ability to conduct join operations, data in this format allows me to analyse how a particular topic varied over time, whether it was more or less present in one newspaper or another, or whether it was associated with a particular genre. I can also feed it to a machine learning test, or whatever really.
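As a rough illustration of that kind of join, here is the same idea sketched in plain Python rather than data.table: look up each document's metadata by id and average one topic's proportion per group. The function name, the toy rows and the year metadata are all made up for the example.

```python
from collections import defaultdict


def mean_topic_by_group(rows, metadata, topic_index):
    """Average one topic's proportion per metadata group (e.g. year or newspaper).

    rows: iterable of (doc_id, [proportions ordered by topic number])
    metadata: dict mapping doc_id to a group label
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for doc_id, proportions in rows:
        group = metadata.get(doc_id)
        if group is None:
            continue  # no metadata for this document, so it is left out
        totals[group] += proportions[topic_index]
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up data: three documents, three topics, publication year as metadata.
rows = [("0", [0.27, 0.12, 0.61]),
        ("1", [0.20, 0.50, 0.30]),
        ("2", [0.10, 0.10, 0.80])]
years = {"0": 2011, "1": 2011, "2": 2012}
result = mean_topic_by_group(rows, years, topic_index=2)
print(result)
```

Swapping the year dictionary for one keyed on newspaper or genre gives the other comparisons mentioned above.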

* Has anyone figured out a good way of working around this?

4 comments:

  1. Thanks for the hat-tip, I'm glad my gist was of some use. I've also been frustrated at memory limits with that reshape operation. So far my work-arounds have been a bit of Perl (thanks to Andrew Goldstone) and using a big cluster. I'll give your two methods a shot also. By the way, I loved your post on the Russian and English language Wikipedia pages, fascinating stuff!

  3. Very nice. I have encountered very similar R + text data woes and gone with similar solutions when handling MALLET results on large corpora. Let me recommend the bigmemory and bigtable packages, which get around R's desire to hold everything in memory. Also sparseMatrices were a huge win for my text data.

    My version of outboard processing of the MALLET sampling state in python is here: https://github.com/agoldst/tmhls/blob/master/python/simplify_state.py and here: https://github.com/agoldst/dfr-analysis/blob/master/topics_rmallet.R (see read_simplified_state() at line 821). I wrote that so I could plot the way words were being assigned to different topics over time (using the date metadata for my documents). Keep meaning to write that up in a blog post.

    --Andrew Goldstone

    Replies
    1. Thanks for the tips - ff and sparseMatrices have been on my to-learn list for some time, maybe this will inspire me to give it another go. Your topics_rmallet.R file is quite remarkable, thanks for the link. I sometimes feel we're all reinventing the wheel with this stuff. Do you find these packages make analysing text in R good, or just more bearable?

      I got a bit tired of workarounds in R, and decided to shove my text into large SQL databases. I found R solutions to text processing quite slow, so I access my texts through Python and only pipe summaries or meta-data to R, which tends to work quite well.
