big geo-data visualisations



Spotting international conflict is very easy with the GDELT data set, combined with ggplot and R. The simple gif above shows snapshots of Russian/Soviet activity from January 1980 and January 2000. I think it also illustrates how Russia nowadays looks more to the East and the South than it did during the Cold War. The trend, though not very strong above, gets even clearer by the end of the 2000s.


This blog post is really two in one: first I argue the GDELT data is great for the grand overviews it allows, and second I present a few tentative thoughts about the coding scheme that allows us to slice up the data. Specifically, I think a structure has been chosen that maximises the number of entries, and as a result, also the number of false positives. I would be delighted to be proven wrong, though - I think this is a fantastic tool!


I wanted to go one step further than the gif above, so I made an animation of all the events in the GDELT dataset featuring Russia. That's 3.3 million entries, each mapped 12 times (for blur).

 

In the end YouTube's upload process and compression rather destroyed the effect. The video should have looked like this:



The video has 1748 separate frames, all created in ggplot and R. Creating the graphics in R took upwards of 12 hours. For the edges I used minimal alpha values to avoid the image going white. The process used was virtually identical to the one I described in detail here. This produced images like this:

Afterwards, to achieve the effect of the first image, I added some layers using a macro in Photoshop, which took 4 hours to run. The lines bounce around because I added a bit of noise to the Bézier calculation (to prevent all the lines lying exactly on top of each other).
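For readers who want to see what noise in the curve calculation might look like, here is a minimal sketch in Python. It is not the R code used for the video: the function name, the `bend` and `noise` parameters, and the choice of a quadratic curve with a jittered control point are all my assumptions for illustration.

```python
import random

def jittered_bezier(start, end, n_points=20, bend=0.2, noise=0.05, rng=None):
    """Quadratic Bezier from start to end with a randomly perturbed control point.

    All parameter names and defaults are hypothetical, chosen for illustration.
    """
    rng = rng or random.Random()
    (x0, y0), (x2, y2) = start, end
    dx, dy = x2 - x0, y2 - y0
    # Control point: the midpoint pushed sideways (perpendicular to the chord).
    # The size of the push is jittered so repeated edges do not overlap exactly.
    offset = bend + rng.uniform(-noise, noise)
    cx = (x0 + x2) / 2 - dy * offset
    cy = (y0 + y2) / 2 + dx * offset
    points = []
    for i in range(n_points):
        t = i / (n_points - 1)
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x2
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y2
        points.append((x, y))
    return points

# Two draws of the same (longitude, latitude) edge with different random seeds.
moscow, kabul = (37.6, 55.8), (69.2, 34.5)
curve_a = jittered_bezier(moscow, kabul, rng=random.Random(1))
curve_b = jittered_bezier(moscow, kabul, rng=random.Random(2))
```

Both curves share their endpoints but bulge by slightly different amounts, which is what makes repeated edges shimmer from frame to frame.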

If I were to do this again I would aggregate the edges rather than draw them individually (rendering got very slow once I hit 1996, when the number of events in the data set skyrocketed). I would also choose a standard video resolution to reduce compression noise from YouTube's upload process. And I would think of a better way to do the subtitles, because writing them out by hand is not the best approach!


My thoughts about the GDELT coding structure: 
I'm beginning to have a few doubts about the coding system used. Automation works well when splitting items into binary categories, but rather less well when sorting them into multiple groups. From what I understand of the coding system here, there are dozens of types of events. According to the project's authors:
4. The Tabari system is applied to each article in full-story mode to extract all events contained anywhere in the article and the Tabari geocoding post-processing system is enabled to georeference each event back to the specific city or geographic landmark it is associated with. 
5. The final list of events for each newswire is internally deduplicated. Multiple references to the same event across one or more articles from the same newswire are collapsed into a single event record. To allow the study of each newswire individually, events are not deduplicated across newswires (externally deduplicated).
As I understand the system above, errors are multiplied as the number of articles and sources increases; only identical results are deduplicated, so many errors get filed as separate entries:
Each record is then converted into a unique identifier key that concatenates the actor, action, and location fields. This is then checked against the list of all existing events in the database: if the event already exists, the previous event record has its NumMentions and NumSources fields updated accordingly, otherwise a new record is inserted into the database
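My reading of that description can be sketched in a few lines of Python. Only the NumMentions and NumSources field names come from the quotation; the record layout, the `|` separator, and the sample reports are hypothetical, and this is not GDELT's actual implementation.

```python
def merge_records(records):
    """Collapse records that share the same actor/action/location key."""
    db = {}
    for rec in records:
        # Key built by concatenating the three fields, as the quotation describes.
        key = "|".join([rec["actor"], rec["action"], rec["location"]])
        if key in db:
            db[key]["NumMentions"] += 1
            db[key]["sources"].add(rec["source"])
        else:
            db[key] = {"NumMentions": 1, "sources": {rec["source"]}}
    for entry in db.values():
        entry["NumSources"] = len(entry["sources"])
    return db

# Three hypothetical reports of one event; the third has a miscoded action.
reports = [
    {"actor": "RUSGOV", "action": "190", "location": "Grozny", "source": "wire_a"},
    {"actor": "RUSGOV", "action": "190", "location": "Grozny", "source": "wire_b"},
    {"actor": "RUSGOV", "action": "193", "location": "Grozny", "source": "wire_c"},
]
db = merge_records(reports)
```

The two matching reports merge into one record with NumMentions of 2, while the miscoded report survives as a separate event of its own. Under this scheme, agreement shrinks the database and disagreement grows it.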
The main problem as I see it is the multiplication of errors: if coding accuracy is 95% (which I would consider extremely impressive, given the number of agent and event types), and there are 100 different sources, we would be virtually certain to find the event incorrectly coded at least once.
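The arithmetic behind that claim is short enough to check directly. The sketch below assumes that each source is coded independently with the same accuracy, which is a simplification; the function name is mine.

```python
def prob_at_least_one_error(accuracy, n_sources):
    """Chance that at least one of n independent codings is wrong."""
    return 1 - accuracy ** n_sources

# The figures from the paragraph above: 95% accuracy, 100 sources.
p_any_error = prob_at_least_one_error(0.95, 100)
expected_errors = 100 * (1 - 0.95)

print(round(p_any_error, 3))      # about 0.994
print(round(expected_errors, 1))  # about 5 miscoded reports on average
```

So with these numbers a miscoded copy of the event is all but guaranteed, and on average about five of the hundred reports would be coded wrongly.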

Another gripe I have with the GDELT data is that it is Anglo-centric. Yes, national news agencies (which publish in English) are included, but adding in sources such as Google News vastly skews the included events towards the Anglo-American perspective. Granted, possible solutions, such as translating queries or using machine translation, would be profoundly problematic (varying accuracy between languages, to name but one issue).

The real problem, as I see it, may be summed up as follows:
GDELT is massive. Plotting all of it is virtually impossible. However, slicing it up relies on an often imprecise classification scheme, meaning that the false positives, though no doubt relatively rare, are still very significant in a dataset of millions of entries. Take the plot below, which shows all data featuring 'RUSGOV' as the initiator of an action graded as -10 on the Goldstein scale, i.e. 'Military attack; clash; assault', in 2011:



Note the surprising spread of 'acts of war' attributed to the Russian government, in only one year! For sure, the Russian government did all kinds of despicable things during 2011, not least cracking down on the domestic opposition, but I cannot imagine it conducted military attacks in Australia, Mexico, the Gulf, etc.

In fact, these are the figures as tabulated by event code:


CAMEO code   183   186   190   193   194   195   202   203   874   1831   1832
N              4     8   247    31    25    24     6     3    39     10      1

186: Assassinate
190: Use conventional military force, not specified below
193: Fight with small arms and light weapons
194: Fight with artillery and tanks
195: Employ aerial weapons
874: Retreat or surrender militarily

Now, as I said above, I'm willing to believe a lot about the Russian government. But surely it did not use conventional military force 247 times during 2011, a year during which, to my knowledge, Russia was not at war? Or retreat 39 times? I don't mean to say they are all wrong: Russia frequently intervenes internationally, and has an ongoing Chechen problem domestically as well. But my suspicion is that, due to inevitable errors in coding vast numbers of newspaper articles, a number of other military retreats (e.g. Syria) have inadvertently been attributed to Russia. Maybe Putin or Lavrov particularly forcefully condemned Western intervention (for instance like this). In any case, due to the coding structure, where each unique combination of actor1-actor2-event type is retained, such errors are inevitably going to be frequent.

And this is the real crux of my argument: big data analyses work well if our selection is large, because then errors will be relatively small and of virtually no importance (this is the point argued here), but much less well when we investigate the smaller categories. If we take a fringe actor, for instance any Central Asian state, and look at a rare category, e.g. political assassinations, what proportion of those in the data set will be false positives? Probably most entries, given there are millions of instances in which this false positive might be created, and very few that might result in true positives. A real danger, then, is that the move to big data forces us to look only at certain large-scale issues, because many smaller questions are drowned out by all the noise.
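To make the base-rate point concrete, here is a back-of-the-envelope calculation in Python. Every number in it is invented for illustration (I have no measured error rates for GDELT), and the function and parameter names are my own.

```python
def false_positive_share(n_true_events, recall, n_other_reports, misclass_rate):
    """Share of entries in a rare category that are false positives."""
    true_pos = n_true_events * recall            # real events that were caught
    false_pos = n_other_reports * misclass_rate  # unrelated reports miscoded into the category
    return false_pos / (true_pos + false_pos)

# Hypothetical: 20 real events in the category, 90% of which are caught,
# against 1,000,000 unrelated reports, of which 1 in 10,000 is miscoded into it.
share = false_positive_share(
    n_true_events=20,
    recall=0.9,
    n_other_reports=1_000_000,
    misclass_rate=0.0001,
)
print(round(share, 2))  # about 0.85
```

Even with a misclassification rate of one in ten thousand, roughly 85% of the entries in this made-up rare category would be false positives, simply because the pool of unrelated reports is so much larger than the pool of real events.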

I briefly inspected some of the events listed. Here are the ones for military retreat (code 874):
Consider the retreat from Afghanistan: these entries likely refer to the historical parallel of the Soviet withdrawal, rather than a present-day evacuation. Withdrawals from France, London, Bulgaria, and Kamchatka are harder to explain, but are also likely to be errors of some sort. Surprisingly, none of the Russian withdrawals are from the Caucasus.

If I were to suggest a solution, I would advocate stricter filtering and a reduction in the diversity of English-language news sources. Surely there must be a point where adding more possibly duplicated information into the database will yield more false positives than it exposes false negatives? The fact that there appear to be so many false positives makes it hard to draw very clear conclusions from the data when sliced up. So maybe a bit less is a lot more. That's my two cents, anyway.

I'd like to point out that I hope I am wrong in this assessment, and even if I am not, I am confident the coding systems can be continually improved to reduce the errors further. So to end I'll give the word to Kalev Leetaru: 'If you're looking at ten tweets and you're getting a few wrong, you've got problems. If you're looking at ten billion tweets, basically it washes out as noise. The real patterns are the ones that survive the noise.' [Link] This is true, and you can see that in the YouTube video above. The difficulty is when you try to look closer - this is what we're not very good at yet.
