Topic models are discussed really well elsewhere, and rather superficially by me here. In my topic model for the Russian media over the period of 2003-2013 I found seven or eight topics about history and memory. One of them was clearly about Katyn and about Stalinist repression.
The terms dominating this topic were:
трагедия жертва память поляк семья гибель родственник репрессия расстрел расследование катастрофа власть ответственность реабилитация брат член пострадавший соболезнование годовщина (tragedy, victim, memory, Pole, family, death, relative, repression, execution [rasstrel], investigation, catastrophe, power [vlast'], responsibility, rehabilitation, brother, member, victim [postradavshi], condolences, anniversary)
It was interesting to note how strongly the language of Katyn features in the larger topic about Stalinist repressions (e.g. Pole, brother [the Kaczynskis were identical twins]). But in a way, this is perhaps not surprising. Stalinist crimes against Russians are not frequently written about by the media, while Katyn, not least because of international pressure from Poland, has been newsworthy for a number of reasons.
Very briefly on topic models and methods
The list of terms above is what a topic looks like. To achieve this, topic modelling takes a collection of documents and attempts to assign every word in every text to one of a number of distinct topics. Each document is modelled as being made up of different proportions of topics, and each topic consists of a collection of words, each of which has a probability of featuring in passages about the topic. Yes - it is a bit complicated. To do mine I followed Matthew Jockers's 'secret recipe' for topic modelling: I took the root form of words, attempted to keep only nouns, excluded nouns tagged as being people or place names, and calculated 500 topics. The motivation for this was twofold: firstly, using someone else's schema saves me time, and secondly, it means I can't cherry-pick results.
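To make the two layers a little more concrete, here is a minimal sketch in Python with invented numbers: two toy topics, each a probability distribution over words, and one document described as a mixture of them. The topic labels, words and probabilities are all hypothetical and are not taken from my model; the sketch only shows how the probability of a word in a document follows from the document's topic proportions.

```python
# Toy illustration of the structure behind a topic model.
# All labels, words and numbers are invented for illustration.

# Each topic is a probability distribution over words.
topics = {
    "repression": {"victim": 0.5, "memory": 0.3, "film": 0.1, "festival": 0.1},
    "cinema": {"victim": 0.1, "memory": 0.1, "film": 0.5, "festival": 0.3},
}

# Each document is a mixture of topics (the proportions sum to 1).
doc_topic_proportions = {"repression": 0.75, "cinema": 0.25}


def word_probability(word, doc_mix, topic_word):
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        share * topic_word[topic].get(word, 0.0)
        for topic, share in doc_mix.items()
    )


def top_terms(topic, topic_word, n=2):
    """The n most probable words of a topic, i.e. what gets listed as 'the topic'."""
    dist = topic_word[topic]
    return sorted(dist, key=dist.get, reverse=True)[:n]


if __name__ == "__main__":
    for w in ("victim", "film"):
        print(w, round(word_probability(w, doc_topic_proportions, topics), 3))
    print(top_terms("repression", topics))
```

A real model works in the opposite direction: it only sees the words and has to infer both sets of numbers, which is what a tool such as MALLET estimates.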
Katyn is to my mind the most explosive Eastern European memory conflict in recent years. Katyn - the symbolic site of Polish WW2 suffering at the hands of the Soviets - was the subject of an epic film by Andrzej Wajda, and long a thorn in Polish-Russian relations. If the release of Wajda's film caused an upset, that was nothing compared to President Kaczynski's plane crashing on the way to participate in the annual commemorations at Katyn in 2010. It was a tragedy on a grand scale, and the irony that Kaczynski, the champion of the Polish campaign for greater international recognition of Katyn, should be killed in this way was lost on no one. It was almost inevitable that the incident would give rise to conspiracy theories of Russian involvement - a narrative frequently referred to as Katyn-2.
This is ground well-covered by now - [Shameless plug: consider for instance the Memory at War Project's book about Katyn, or hold on for my article on how Katyn was mobilised during Polish elections - coming to a journal near you in 2014]. The issue here is that this major international incident forced the Russian media to write about Katyn. Previously most texts printed in state-owned media had been reasonably evasive about what exactly happened at Katyn. This did not change - press agency reports featured the line 'President Kaczynski was in Smolensk to participate in a commemorative event (traurnie meropriiatie)'. These passive constructions are still favoured: consider this text from Komsomol'skaia Pravda earlier this year:
'On Wednesday 10 April 2013 Poland and Russia remember the members of the Polish delegation who died in the plane crash three years ago. In April 2010 the Polish plane TU-154 crashed while attempting to land at the airfield 'Smolensk-Severnii'. All the passengers and members of the crew - 96 persons, including the president of the republic, Lech Kaczynskii, as well as Polish politicians, religious and social figures (deiateli), died.'
Curiously there is no mention of why the Polish delegation was flying to Smolensk. But, that's just the intro, right? Further on we are sure to read about the Polish officers, executed by the NKVD?
Not so much. The closest we get is the following:
'The Polish Minister of Culture, Bogdan Zdroevskii, upon arriving in Smolensk to participate in commemorative events in 2012, for the first time spoke about the project to establish a memorial to the victims of the catastrophe near Smolensk.'
I could go on, but for now I encourage the reader to explore this independently, or take it on faith: one of the biggest differences between state media and the Russian independent media was the willingness to print who had done what to whom at Katyn. For now, let's explore what topic modelling can tell us about Russian media coverage of Katyn.
Topic distribution in texts about Katyn
I collected all the texts about Katyn in my database of Russian media sources. There were 140 texts printed in the independent media, compared to 330 in the state-owned sources. Considering that the state-owned publications are much larger, at a ratio of roughly 4:1, Katyn is relatively more frequently written about in independent media, but the difference is not dramatic.
After calculating a 500-topic model using MALLET I manually labelled the topics. Because no text is made up of a single topic, we can identify the topics most commonly used to discuss Katyn. These are:
These topics are calculated based on all the texts. The main topic appears to be about Katyn and Stalinist repressions, so it is no surprise it should feature strongly. The other topics, though, are generalizable to many other subjects, but we can understand why they have been identified in texts about Katyn: President Kaczynski died in a plane crash on the way to commemorations at Katyn; Andrzej Wajda’s film about Katyn was nominated for an Oscar, while more generally Katyn features in debates about the Second World War. Indeed, some Polish politicians have demanded Katyn be legally recognised as genocide. The final topic, labelled ‘narod and power’, reflects intellectual debates questioning the ideological motivation of political elites.
The reader should bear in mind here that these labels were assigned based on the entire database of texts, and not selected with the Katyn example in mind.
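For readers who want to check the arithmetic, here is a back-of-the-envelope sketch in Python. The two counts are the ones given above; the 4:1 figure is my rough estimate, and the sketch assumes it describes the total number of texts in each category, which is a simplification.

```python
# Back-of-the-envelope normalisation of the Katyn text counts.
# The counts come from the post; the 4:1 size ratio is a rough figure and is
# assumed here to describe the total number of texts per category.

katyn_texts = {"independent": 140, "state_owned": 330}
relative_corpus_size = {"independent": 1, "state_owned": 4}


def rate_per_unit(counts, sizes):
    """Katyn texts per unit of corpus size for each category."""
    return {name: counts[name] / sizes[name] for name in counts}


rates = rate_per_unit(katyn_texts, relative_corpus_size)
ratio = rates["independent"] / rates["state_owned"]

if __name__ == "__main__":
    print(rates)
    print(round(ratio, 2))
```

On these assumptions the independent outlets wrote about Katyn roughly 1.7 times as often, relative to their size, as the state-owned ones.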
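The ranking idea can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the script I used: the document ids, topic labels and proportions are invented, and "most frequent compared to proportion in entire dataset" is interpreted here as the mean topic share in the keyword subset divided by the mean share in the whole corpus.

```python
# Rank topics by how over-represented they are in a keyword-selected subset.
# Document ids, topic labels and proportions are invented for illustration.

doc_topics = {
    "doc1": {"repression": 0.6, "plane crash": 0.3, "sport": 0.1},
    "doc2": {"repression": 0.2, "plane crash": 0.7, "sport": 0.1},
    "doc3": {"repression": 0.1, "plane crash": 0.1, "sport": 0.8},
    "doc4": {"repression": 0.1, "plane crash": 0.1, "sport": 0.8},
}
subset_ids = ["doc1", "doc2"]  # hypothetical: the texts that mention the keyword


def mean_proportions(doc_ids, table):
    """Average share of each topic across the given documents."""
    totals = {}
    for doc_id in doc_ids:
        for topic, share in table[doc_id].items():
            totals[topic] = totals.get(topic, 0.0) + share
    return {topic: total / len(doc_ids) for topic, total in totals.items()}


def rank_by_lift(subset, table):
    """Topics sorted by (mean share in subset) / (mean share in whole corpus)."""
    in_subset = mean_proportions(subset, table)
    overall = mean_proportions(list(table), table)
    lift = {t: in_subset.get(t, 0.0) / overall[t] for t in overall if overall[t] > 0}
    return sorted(lift.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for topic, value in rank_by_lift(subset_ids, doc_topics):
        print(topic, round(value, 2))
```

Dividing by the corpus-wide share keeps very common, general-purpose topics from dominating the list simply because they are common everywhere.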
Politically motivated subject selection
Let’s zoom in a bit further by adding a few more topics and a division by political orientation:
As it turns out, the more liberal and the more pro-Kremlin newspapers select topics unevenly: the state-owned newspapers’ coverage of Katyn is overwhelmingly about the plane crash, and about Wajda’s film – as can be seen in the high proportion of texts featuring the language of film making, film festivals, and prize awards. Conversely, the liberal sources write more about Katyn in the context of Stalinist repressions. The starkest difference, though, is in the category ‘the narod and power’ – pointing to the role of Katyn in debates about 'the nation', ideology and politics. Topic selection, then, points to a divide between intellectual, ideological, and to a lesser degree historical subjects in independent media, and cultural and current affairs subjects in state-owned media.
In this way, combining keywords and topic models allows us to identify the type of discourses mobilised in conjunction with a particular topic. This example scratches the surface. We could have looked at example texts from the different categories. The main point here is to show how we can identify topic selection for a given subject, and contrast the proportions of each topic based on different criteria, as well as to hint at how topic modelling can identify memory discourses.
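As a rough illustration of contrasting topic proportions by a criterion such as ownership, here is a small Python sketch. The outlet-to-category mapping follows the list of outlets given at the end of this post; the individual texts and their topic proportions are invented, and the helper name `mean_share_by_category` is hypothetical.

```python
# Compare average topic shares between outlet categories.
# The outlet-to-category mapping follows the post's list of sources; the texts
# and topic proportions are invented for illustration.

CATEGORY = {
    "Izvestiia": "state-owned",
    "Rossiiskaia Gazeta": "state-owned",
    "Gazeta.ru": "independent",
    "Novaia Gazeta": "independent",
}

texts = [
    {"outlet": "Izvestiia", "topics": {"plane crash": 0.7, "repression": 0.3}},
    {"outlet": "Rossiiskaia Gazeta", "topics": {"plane crash": 0.5, "repression": 0.5}},
    {"outlet": "Gazeta.ru", "topics": {"plane crash": 0.4, "repression": 0.6}},
    {"outlet": "Novaia Gazeta", "topics": {"plane crash": 0.2, "repression": 0.8}},
]


def mean_share_by_category(rows, category_of):
    """Average share of each topic over the texts in each outlet category."""
    sums, counts = {}, {}
    for row in rows:
        cat = category_of[row["outlet"]]
        counts[cat] = counts.get(cat, 0) + 1
        bucket = sums.setdefault(cat, {})
        for topic, share in row["topics"].items():
            bucket[topic] = bucket.get(topic, 0.0) + share
    return {
        cat: {topic: total / counts[cat] for topic, total in bucket.items()}
        for cat, bucket in sums.items()
    }


if __name__ == "__main__":
    for cat, shares in mean_share_by_category(texts, CATEGORY).items():
        print(cat, {t: round(s, 2) for t, s in shares.items()})
```

Swapping in a different mapping (by year, by genre, by author) gives the same comparison along another criterion.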
In the next post I wade into a debate about the best way to visualise the topic model as a whole.
 Ranked as most frequent compared to proportion in entire dataset
 State-owned: Izvestiia, Rossiiskaia Gazeta. Independent: Gazeta.ru, Novaia Gazeta.
 I know these binary categories are not perfect. But imperfect comparisons trump no comparison.