Visualising Structure in Topic Models

How exactly should we visualise topic models to get an overview of how topics relate to each other? This post is a brief lit review of that debate - I realise the subject matter is sooo last year. I also present my chosen solution to the dilemma: I use dendrograms to position topics, and add a network visualisation using an arcplot to expose linkages between subjects that frequently co-occur without being correlated.

For details of what topic models are, read Ted Underwood's blog posts and Matthew Jockers' Macroanalysis. I wrote a little bit about it elsewhere, so I will go straight for the jugular:



Topic models are a gross simplification of often extremely diverse texts. We make further sweeping simplifications by visualising these topics.

1) Problems of selection: we select how many topics we want to analyse, which introduces a type of selection bias. Choosing a large number of topics means we zoom in on fine details; selecting a small number means we get the broadest possible overview. And selecting a particular number of topics over another means some topics will be grouped which might otherwise appear separate.

2) Consequently, we need to know how topics relate to each other. It is not worth over-analysing a topic that mysteriously disappears if it has merely been replaced by another very closely related topic. Or rather: in this case we should be analysing subtle changes in discourse rather than a binary presence versus absence.

3) Scott Weingart has been on a crusade against the willy-nilly use of network visualisations. Scott's objections are sound: topics aren't networks, and by forcing them into that particular format we chop away large parts of our data, generally edges. Often the resultant representations are misleading or plain wrong.


How can we visualise topic models?
Ted Underwood wrote a blog post about this a year ago to the day. The comments section to that text raises some interesting ideas, including arguments for and against PCA, network visualisations, and hierarchical clustering. Personally I have found PCA to be useful for exploration, but it's a bit hard to determine what each axis represents. So, moving swiftly on...

Ben Schmidt argued persuasively that hierarchical clustering maintains the full complexity of the data. I think hierarchical clustering, visualised using dendrograms, allows us to solve the problem of granularity and overlapping clusters. Andrew Goldstone and Ted Underwood described the problem as follows:
if you change the number of topics, you can get results that look substantially different. On the other hand, to say that two models “look substantially different” isn’t to say that they’re incompatible. A jigsaw puzzle cut into 100 pieces looks different from one with 150 pieces. If you examine them piece by piece, no two pieces are the same—but once you put them together you’re looking at the same picture. [source]
Dendrograms provide the solution here, by showing a tree of clusters - thus if we select 100, 200, or 1,000 clusters, the dendrogram will show how they can be grouped and subdivided. Consider a small part of my topic model, that relating to film. Cutting the plot and using ggdendrogram allows us to zoom in on how topics cluster:

The red box on the plot represents a grouping about film generally. If we are analysing texts about cinema festivals, we should at the very least test to see if our findings are confounded by any of the other topics within this cluster. The blue box shows a more fine-grained division - here are topics about the reception of films, while the lower three topics are about the actors and production teams involved in making films. We may well prefer not to split film into seven categories, and instead work with these two groupings.

More generally, we see that cinema features in close proximity to a set of topics about culture and lifestyle, which makes intuitive sense.
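The point that a single tree can be cut at several levels of granularity can be sketched in a few lines. This is only a toy illustration in plain Python, not the R workflow (hclust() and ggdendrogram) used for the plots in this post: the topic names, their 2-D coordinates, and the choice of average linkage are all assumptions made up for the example.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D positions for six topics (illustrative only, not model output).
topics = {
    "film reviews": (0.0, 0.0),
    "film festivals": (0.0, 1.0),
    "actors": (3.0, 0.0),
    "film production": (3.0, 1.0),
    "museums": (10.0, 0.0),
    "archaeology": (10.0, 1.0),
}


def average_linkage(a, b):
    """Mean pairwise distance between the members of clusters a and b."""
    return sum(dist(topics[i], topics[j]) for i in a for j in b) / (len(a) * len(b))


def build_levels(names):
    """Merge the two closest clusters repeatedly, recording every partition.

    levels[0] holds one cluster per topic; levels[-1] holds a single cluster.
    """
    clusters = [(name,) for name in names]
    levels = [clusters]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average_linkage(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        levels.append(clusters)
    return levels


def cut(levels, k):
    """Return the partition of the tree that has exactly k clusters."""
    return [set(c) for c in levels[len(levels) - k]]


levels = build_levels(list(topics))
coarse = cut(levels, 2)  # broad overview
fine = cut(levels, 3)    # finer subdivision of the same tree

for k, partition in ((2, coarse), (3, fine)):
    print(k, "clusters:", [sorted(c) for c in partition])
```

Because both cuts come from the same tree, every cluster in the finer partition sits wholly inside one cluster of the coarser one, which is what lets us move between a red-box and a blue-box view without refitting anything.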

Now, of course, the majority of the tree-diagram is missing, which is fine, because we are only interested in a small part of it, but it would be good to contextualise this data somewhat. In my previous post I wrote about using topic modelling to analyse Katyn. Katyn was predominantly debated using a topic apparently about Stalinist repressions, but other memory and history topics appeared too. How do these topics relate to each other?


Let's put the dubious aesthetics of the illustration aside for a minute. The graph illustrates firstly a methodological point: visualising dendrograms is going to be a trade-off between overview and detail, and specifically, it will involve some arbitrary cuts. Using d3 might be an option for interactive graphs (I've uploaded some examples and a howto here), but the question of how to present this type of information in print remains.

Though a crude approximation, the illustration does serve to show how language about cinema is strongly clustered, while language about memory and history is spread across the conceptual space. Consider the green block: this contains memory subjects within the context of culture: sites of memory are closely related to construction projects, while museums, archaeology, tsarist history and the Orthodox Church are closely related too. Also here we find state symbols, the question of war veterans, and sites of memory. The topic of war veterans is used to discuss annual commemorations of the Victory in 1945, while the topic 'sites of memory' was used to express outrage at the removal of the Bronze Soldier from the centre of Tallinn.

Unlike the green cluster, the purple and blue clusters are clearly political or politicised - the memory of the war and Katyn feature strongly within the context of Russians abroad and language learning. This might appear strange, but it relates to the larger question of discrimination against Russians living outside Russia in the SNG (the CIS). Without having looked very closely, I would guess that language learning relates most frequently to the rights of Russian-speaking Ukrainians to live in a Russian language environment. It is telling that the cluster closest to this is about religion, homosexuality, and extremism, while the larger context is one of international disputes.

I have highlighted a pale blue box containing what appears to be oppositional rhetoric: these are questions about the state, democracy, elite interests, and Russia's role with the West. Notice in this box two topics of real interest: 'inheritance of power, history of authority', and 'the narod and power'. The former group is labelled for the function it fulfils more than its content, which is made up of former Soviet elites, terminology such as TsK (central committee), perestroika, vlast', kremlin, etc. This is a historical subject, which is mobilised for expressly political purposes, and the illustration gives a hint at what these might be, namely discussions framed by oppositional rhetoric about the nature of political power in Russia.

The second group, 'narod and power' fulfils much the same function, but without the explicitly historical vocabulary. My lightning analysis of Katyn showed that the biggest difference between discussions about Katyn in state-owned and independent media was that the independent media framed Katyn intellectually and ideologically through the topic 'narod and power', while state-owned media wrote much more about Katyn in culture and current affairs.

The crucial point here is that neither of those linkages is hinted at, let alone made explicit, through the dendrogram visualisation. The reason for this is simple: the dendrogram presents a relatively linear hierarchy which obscures linkages between semantically distant topics. Clearly the category 'oscars and awards' has more in common with 'actors' than with Katyn, and consequently it features in the film category, which in itself is located within the cultural section, while Katyn on average is most closely connected to narratives about WW2. The fact is, though, that a large proportion of Russian articles about Katyn are about Wajda's film of the same name. How can we include this complexity?

The obvious solution is a network, because it allows multi-dimensional interactions to be mapped. But, and it's a big caveat, using a network in practice means cutting many edges. Existing network visualisation methods, to my knowledge, don't map negative edges. Now, in fairness, I can't think of any way of visualising really complex textual data that doesn't involve sacrificing complexity. In my case I first selected a date-range, then individual publications. When making the topics I selected only nouns, and also removed some stop-words (typically media meta-data). As scholars conducting text-analysis we routinely use stemmers to chop words. In this context, is losing some weak relationships really a problem? We should know that texts aren't networks, and that any attempt to model them as such is a simplification and a generalisation, and should only be taken as one of many ways to find structure. So, while I disagree with Scott Weingart's blanket ban on cutting edges in order to create better visualisations, I do agree with him that the process of mapping data creates illusory structures, ones that portray relationships as much more fixed than they are in reality, and that, moreover, do this seductively using cool arcs and shading.
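To make the cost of cutting edges concrete, here is a rough sketch in plain Python. The topic weights are invented, and using Pearson correlation with a cut-off of 0.5 as the edge rule is an assumption for illustration only; the post does not specify which link measure or threshold was used.

```python
from itertools import combinations
from math import sqrt

# Hypothetical topic weights across four documents (illustrative only).
weights = {
    "Katyn": [1, 2, 3, 4],
    "WW2": [2, 4, 6, 8],
    "film": [4, 3, 2, 1],
    "sport": [2, 4, 1, 3],
}


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def split_edges(weights, threshold):
    """Separate topic pairs into edges we would draw and edges we would cut."""
    kept, dropped = [], []
    for a, b in combinations(weights, 2):
        r = pearson(weights[a], weights[b])
        (kept if r >= threshold else dropped).append((a, b, r))
    return kept, dropped


kept, dropped = split_edges(weights, threshold=0.5)
negative = [edge for edge in dropped if edge[2] < 0]

print("drawn:", [(a, b) for a, b, _ in kept])
print("cut:", len(dropped), "of which negative:", len(negative))
```

Even in this tiny example most pairs never make it onto the plot, and the strongly negative relationships vanish along with the merely weak ones, which is the loss described above.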

DendroArcs
(that sounded good in my head; not so sure anymore)
Gaston Sanchez has written some mindblowingly good R code for making arcplots. These plots are essentially networks, but the nodes are fixed in space. This has great possibilities for our application, because we use the network logic to visualise relationships in the data, without having our visualisation defined by a small set of the relationships. Hence DendroArcs: we use a dendrogram (hclust() function in R) to calculate a hierarchy of clusters, and use a network to map the strongest links across the clusters. This could even be used to map negative correlations, though so far that's resulted in pretty messy maps. Eventually I might write up some code for these, but for now they are very much in beta: it's a hack of Gaston's code, plus some photoshopping to align the dendrogram. When I get around to it I'll do this using viewports instead. Anyway, here's the structure of the 15[1] topics most frequently used to speak about Katyn, first in independent, and secondly in state-owned media:



The reader taking the time to look further than the eye-candy will notice some quite striking differences, both in the structure of topics and in the links between them. For instance, in the state-media the second-largest topic is about film-making, which appears not to be strongly linked with any other topic. By contrast, legal language is very strongly linked to the subject of Stalinist repressions. This goes some way to show that pro-Kremlin texts about Wajda's film, state visits, and to a lesser degree the plane crash make virtually no mention of Katyn as a site of Stalinist crimes.

The independent media gives a good example of how the hierarchical clustering and network mapping line up - virtually all the strong links are within clusters. While the structures are not identical for the two sets of data, they align fairly neatly. But unlike the state-owned publications, the texts in independent media see strong linkages between topics - see for instance the strong links between texts about Wajda's film, and those about ideology and civilisation. The same goes for stories about the plane crash, which are closely associated with narratives about international relations. This points to a real difference between texts in the two datasets: state-owned media, when speaking about Katyn in relation to 'sensitive subject matter', are much less likely to make comparisons and references to other subjects.
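The DendroArcs idea (fix the node positions with a dendrogram, then draw arcs for the strongest links) can be sketched as follows. This is a plain-Python stand-in, not the hacked arcplot code or hclust() used for the plots: the distances, the links, and the use of average linkage are assumptions invented for the example.

```python
from itertools import combinations

# Hypothetical topic distances (smaller = more similar); illustrative only.
names = ["oscars and awards", "actors", "Katyn", "WW2"]
close = {
    frozenset(("oscars and awards", "actors")): 0.2,
    frozenset(("Katyn", "WW2")): 0.3,
}


def distance(a, b):
    """All pairs not listed above are treated as far apart (0.9)."""
    return close.get(frozenset((a, b)), 0.9)


def average_linkage(a, b):
    """Mean pairwise distance between the members of clusters a and b."""
    return sum(distance(i, j) for i in a for j in b) / (len(a) * len(b))


def leaf_order(names):
    """Agglomerate the topics; merged clusters keep their leaves adjacent."""
    clusters = [(name,) for name in names]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average_linkage(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return list(clusters[0])


def arcs(order, links):
    """Turn each link into a semicircle over the fixed leaf positions."""
    position = {name: i for i, name in enumerate(order)}
    result = []
    for a, b, weight in links:
        left, right = sorted((position[a], position[b]))
        result.append({
            "pair": (a, b),
            "centre": (left + right) / 2,
            "radius": (right - left) / 2,
            "weight": weight,
        })
    return result


# Hypothetical strong links: one inside a cluster, one across clusters.
links = [("Katyn", "WW2", 0.7), ("Katyn", "oscars and awards", 0.5)]
order = leaf_order(names)
for arc in arcs(order, links):
    print(arc)
```

Links within a cluster come out as small arcs between neighbours, while a link across clusters, such as the one between Katyn and the film topics, shows up as a wide arc spanning the tree.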

One aspect of this that strikes me as unsatisfactory is comparing the two plots: because each set has its own clustering, the topics are not in the same order, which makes comparison hard. I also wonder if somehow making this circular (d3-style, keeping the links on the inside) might be a more efficient use of space and allow it to scale up a bit better. If there's any interest in this I might even write up that code.

Edit 20/11/2013: I have implemented some d3 visualisations, and think a circular layout works quite well. I have found the number of topics modelled in this way can be increased past 200.
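For a sense of what the circular variant involves, here is a minimal sketch of the geometry in plain Python. It is not the d3 implementation mentioned above: the leaf order and the links are hypothetical, and the layout simply spaces topics evenly around a unit circle in dendrogram order, with each link becoming a chord across the inside.

```python
from math import cos, hypot, sin, tau


def circular_layout(order, radius=1.0):
    """Place leaves evenly around a circle, preserving the given order."""
    n = len(order)
    return {
        name: (radius * cos(tau * i / n), radius * sin(tau * i / n))
        for i, name in enumerate(order)
    }


# Hypothetical leaf order and links (illustrative only).
order = ["oscars and awards", "actors", "Katyn", "WW2"]
links = [("Katyn", "WW2", 0.7), ("Katyn", "oscars and awards", 0.5)]

layout = circular_layout(order)
# Each chord runs between two points on the circle, so it stays inside it.
chords = [(layout[a], layout[b], weight) for a, b, weight in links]

for start, end, weight in chords:
    print(f"{weight:.1f}: length {hypot(end[0] - start[0], end[1] - start[1]):.2f}")
```

Neighbouring leaves are joined by short chords and distant ones by long chords through the middle, so the within-cluster versus cross-cluster contrast of the linear arcplot carries over.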

Edit 27/12/2013: Code and sample data here: https://github.com/fredheir/rdendroarcs


[1] Problem number 1 of this method: it is still linear. This makes scaling beyond about 30 topics difficult. Hence the subset of data.



3 comments:

  1. Trees are all right, but here's another idea: calculating the distance matrix and then mapping selected nearest neighbours on a graph of the topic tags... I tried this (with an LSI instead of LDA) on work occupations in French (for instance http://www.researchgate.net/publication/256444652_Navigation_Metiers_FAP_v1.2/file/504635229e96f68263.png?ev=pub_ext_doc_dl&origin=publication_detail&inViewer=true)

  2. Here's another topic model visualization idea, called "eye diagram". It has been used with LDA, but actually with biological data instead of text documents:
    http://bioinformatics.oxfordjournals.org/content/25/12/i145/F1.expansion.html

    The Processing code and an R interface can be found in github: https://github.com/ouzor/eyediagram

    1. That looks great, thanks for the pointer! R
