In this post I show how a combination of MALLET, Python, and data.table lets us analyse quite big data in R, even though R itself buckles when confronted with large quantities of text.
Topic modelling is great fun. Using topic modelling I have been able to separate articles about the 'Kremlin' as a) a building, b) an international actor, c) the adversary of the Russian opposition, and d) a political and ideological symbol. But for topic modelling to be really useful it needs to see a lot of text. I like to do my analysis in R, but R tends to disagree with large quantities of text. Reading the digital humanities literature lately, I suspect I am not alone in confronting this four-step process, of which this post is a result:
1) We want more data to get better results (and really just because it's available)
2) More data makes our software crash. Sometimes spectacularly, and particularly if it's R.
3) Ergo we need new or more efficient methods.
4) We write a blog post about our achievements in reaching step 3.
The obvious solution to these problems is using MALLET. MALLET runs on Java and is consequently much more efficient than R. David Mimno has recently released a very nifty R wrapper (aptly named 'mallet'). The package is handy for getting used to working with MALLET, but as the texts to be analysed need to be loaded into R, and consequently into memory, this wasn't really an option either. Ben Marwick has released an excellent script that allows you to run the command-line implementation of MALLET from R, as well as import the results afterwards, and this is probably the approach closest to what we need.
All these implementations suffer when you try to scale up the analysis - mainly due to R rather than MALLET, though MALLET, too, is happier if it can keep all its input in memory.* I fed my data to the program through the command line, which worked fine after increasing the Java heap space. In the end MALLET spat out a 7GB file, a matrix just shy of 1 million rows by 1000 columns or so, which was much too large to read into R. Or rather, I could read in the file, but not do much with it afterwards.
I was able to read the file in using read.table, but my laptop died when I tried to convert the data.frame to a data.table. In the end I used fread from the data.table package to get the file straight into the data.table format.
The MALLET output is in the format [ID] [filename], followed by topic-proportion codes ranked according to the quality of the match:
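A single row of this output looks roughly like the following - the document ID, file path, topic numbers, and proportions here are all invented for illustration, with the topic-proportion pairs ranked best match first:

```
0   /data/articles/00001.txt   402 0.09123   17 0.04551   233 0.03908   ...
```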
And this repeated for 500 topics. In short, it is ideal for identifying the top subject matter of individual articles - no reshaping at all is needed - but not so great for finding the distribution of topics across articles. To reorder the data by topic (as opposed to topic rank) would require the mother of all reshape operations. The obvious way to do this is using reshape and dcast, following (or: copy-pasting) Ben Marwick:
library(reshape2) # for dcast
outputtopickeysresult <- read.table(outputtopickeys, header=F, sep="\t")
outputdoctopicsresult <- read.table(outputdoctopics, header=F, sep="\t")
# manipulate outputdoctopicsresult to be more useful
dat <- outputdoctopicsresult
l_dat <- reshape(dat, idvar=1:2,
                 varying=list(topics=colnames(dat[,seq(3, ncol(dat), 2)]),
                              props=colnames(dat[,seq(4, ncol(dat), 2)])),
                 direction="long")
w_dat <- dcast(l_dat, V2 ~ V3)
rm(l_dat) # because this is very big but no longer needed
Ehrm, yeah. R didn't like that very much. It wanted to allocate a twelve-digit memory vector (maybe I exaggerate), which was just not an option, definitely not on my laptop.
The best approach I found in a single piece of code was using the splitstackshape package:
library(splitstackshape)
merged.stack(dat, id.vars=c("Id","text"), var.stubs=c("topic", "proportion"), sep="topic|proportion")
This worked great for up to about 100 000 rows, but it still makes a copy of the data.table and struggled to deal with the whole dataset. I tried some other data.table options, but they all involved melting the data into a very long table, then casting it back into a short or wide form, and at no point was I able to process more than about 200 000 rows. I am absolutely certain it is *possible* to do this in data.table, but that I'm just that bit too dense to find the solution. After crashing my system one time too many, I decided the crux was as follows: if I am going to have a copy of the data in memory, I won't have enough spare memory to do anything with it.
My workaround? Not to keep anything in memory, and to use a bit of Python. Nice and slow: read one line, write one line, placing the proportions in order of topic rather than rank. None of this load-everything-into-memory-and-cross-my-fingers nonsense. Using a dictionary of dictionaries would be a much faster approach, but here, as an exercise, I tried to keep as little data as possible in memory. This little script requires virtually no memory, and by halving the number of columns (and rounding to five decimal points) the output file was a third of the size of the input file - about 2GB.
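A minimal sketch of the idea in Python - the file names, the 500-topic count, and the header-line handling are my assumptions, and MALLET's exact doc-topics layout varies a little between versions:

```python
import csv

NUM_TOPICS = 500  # assumption: matches the number of topics in the model

def reorder_line(fields, num_topics=NUM_TOPICS, ndigits=5):
    """Turn one ranked row [id, filename, topic, prop, topic, prop, ...]
    into [id, filename, p_0, p_1, ..., p_(k-1)], proportions in topic order."""
    props = [0.0] * num_topics
    pairs = fields[2:]
    for topic, prop in zip(pairs[0::2], pairs[1::2]):
        props[int(topic)] = round(float(prop), ndigits)
    return fields[:2] + props

def reorder_file(infile, outfile, num_topics=NUM_TOPICS):
    # Stream line by line: one row in, one row out, nothing accumulated.
    with open(infile) as fin, open(outfile, "w", newline="") as fout:
        writer = csv.writer(fout, delimiter="\t")
        for line in fin:
            if line.startswith("#"):  # some MALLET versions emit a header comment
                continue
            writer.writerow(reorder_line(line.split(), num_topics))
```

A dictionary keyed by document ID would let you batch the writes and run faster, but the version above never holds more than one row at a time.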
A benchmark: using fread(), R loaded the processed data in about three and a half minutes of CPU time (just over four minutes elapsed):
user system elapsed
201.72 3.09 242.02
The imported file occupied 1.6GB of memory, and was much more manageable:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 534899 28.6 899071 48.1 818163 43.7
Vcells 219987339 1678.4 233753696 1783.4 220164956 1679.8
And much easier to work with too. Using data.table's ability to conduct join operations, data in this format allows me to analyse how a particular topic varied over time, whether it was more or less present in one newspaper or another, or whether it was associated with a particular genre, or to feed it to a machine-learning test, or whatever really.
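For readers who work in Python rather than R, the same join-then-aggregate pattern can be sketched with pandas (this is an analogue, not the data.table code I actually used; the column names and values below are made up):

```python
import pandas as pd

# Hypothetical inputs: per-article proportions for one topic, plus metadata.
doc_topics = pd.DataFrame({
    "id": [0, 1, 2],
    "topic_7": [0.42, 0.05, 0.31],
})
metadata = pd.DataFrame({
    "id": [0, 1, 2],
    "newspaper": ["Izvestia", "Kommersant", "Izvestia"],
    "year": [2003, 2003, 2004],
})

# Join on the article ID, then ask: how prominent is topic 7 per newspaper?
merged = doc_topics.merge(metadata, on="id")
by_paper = merged.groupby("newspaper")["topic_7"].mean()
```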
* Has anyone figured out a good way of working around this?