Plugging hierarchical data from R into d3

Here I show how to convert tabulated data into a JSON format that can be used in d3 graphics. The motivation for this was an attempt at getting an overview of topic models (link). Illustrations like the one to the right are very attractive; I wanted to learn how to make them because the radial layout sometimes saves a lot of space, in my case when visualising tree diagrams. This type of layout, however, is hard to do in R. d3 can be used with data in both CSV and JSON format, and has a method 'nest' to convert tabular data into a hierarchical structure. When I started out with d3, though, this was all over my head, so this post shows how to make the conversion from tabular data to JSON in R.

This post has three parts:
1) I map topics about Stalin to illustrate how this approach can be used to visualise topic models
2) I go through a function to shape data for use in d3 illustrations
3) I end with variations on how to show complexity in topic models



d3 visualisations can be very effective: they are interactive, often colourful, and very flexible. d3, though, has a steep learning curve, one which in my case is not yet levelling out, so my approach here was to find a template I liked and see how I could plug in my data. These templates will still need some customisation to present topic models exactly the way I want.

Links between the most common topics in texts about Stalin, adapted from Mike Bostock:



The logic behind the illustration is as follows: the topics are ordered using a hierarchical clustering function, meaning highly correlated topics are grouped together. Then the strongest correlations between the clusters are drawn in the center. In this way we also see associations between thematically distant topics. In the case of the illustration above, there are apparently two main types of text about Stalin: first, Stalin features in debates about ideology and politics, and second, in texts about culture and lifestyle. Curiously, there are few strong links across this divide, the clearest example of a bridge being the 'culture and civilisation' topic, which correlates both with debates about the Russian opposition and with topics labelled as 'poetry' and 'books and reading'.

Moving data between R and d3
My first steps in using d3 involved plugging my own data into an example; the one we use below comes from here. If you explore the source code of that page, you will notice four things are happening:
1) HTML holds all the bits together: <html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>

2) Formatting of the content is defined in CSS. Thus to change the font size of the labels, or the colour of the arcs, you would edit the style sheet. For instance, nodes are displayed using a font size of 10px:
.node{font-size:10px;}

3) The d3 JavaScript libraries are loaded

4) The d3 code shapes the visualisation.

The key part here is how the data is loaded: a separate file, called "flare.json" holds the data.

To me, JSON looks a lot like a combination of Python lists and dictionaries. Take a look at the first few lines here:

{
 "name": "flare",
 "children": [
  {
   "name": "analytics",
   "children": [
    {
     "name": "cluster",
     "children": [
      {"name": "AgglomerativeCluster", "size": 3938},
      {"name": "CommunityStructure", "size": 3812},
      {"name": "HierarchicalCluster", "size": 6714},
      {"name": "MergeEdge", "size": 743}
     ]
    },
    {
     "name": "graph",
     "children": [
      {"name": "BetweennessCentrality", "size": 3534},
...

It's not as complicated as it looks: starting at the top of the tree, 'flare' is the central node. The first node it connects to is 'analytics', which connects to 'cluster', and finally 'AgglomerativeCluster'. This branch of the tree is mapped at 12 o'clock in the visualisation.

It may not be immediately obvious, but this is a structure we can recreate. My code and data can be found here. There are three steps:
1) use hclust() to create a hierarchy
2) create a table of the hierarchy
3) convert this to JSON

In the end we will create a function which takes four inputs:
dt (a data.table)
groupVars (a list of variables containing meta information)
dataVars (the variables containing data)
outfile (the destination file)

First up we calculate a correlation matrix. Next we create a hierarchy using hclust().
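As a minimal sketch of these two steps (not necessarily the exact code behind the illustration): the data frame `dat` is a hypothetical stand-in with one numeric column per variable, and using `1 - correlation` as the distance is my assumption, since hclust() expects dissimilarities rather than correlations.

```r
# Hypothetical stand-in for the real data: 8 numeric columns named V1..V8.
set.seed(1)
dat <- as.data.frame(matrix(rnorm(200 * 8), ncol = 8))

# Correlation matrix between the variables (columns).
corMat <- cor(dat)

# hclust() works on dissimilarities, so convert the correlations
# (assumption: 1 - r, so highly correlated variables end up close together).
distMat <- as.dist(1 - corMat)

# Build the hierarchy; "complete" linkage is the hclust() default.
hc <- hclust(distMat)

hc$labels  # the variable names
hc$order   # which variable sits at each position along the dendrogram
```

Other distance measures (for example `1 - abs(r)`) would group negatively correlated variables together as well; which one fits depends on the data.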


Now we split the data based on membership structure. We will take four levels, which means we calculate which group each variable belongs to at four different depths of the tree structure:
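One way to get these memberships is cutree(), which accepts a vector of group counts and returns one column per cut. The sketch below rebuilds a small hypothetical tree so that it runs on its own; the group counts (2, 3, 4, 6) and the column names `l1` to `l4` are illustrative assumptions, not the values used for the Stalin data.

```r
# Hypothetical data and tree, as in a typical correlation-based clustering.
set.seed(1)
dat <- as.data.frame(matrix(rnorm(200 * 8), ncol = 8))
hc <- hclust(as.dist(1 - cor(dat)))

# Cut the same tree at four depths, from coarse (2 groups) to fine (6 groups).
memb <- cutree(hc, k = c(2, 3, 4, 6))

# One row per variable, one column per level of the hierarchy.
colnames(memb) <- paste0("l", 1:4)
memb
```

Because all four cuts come from the same tree, every fine group sits inside exactly one coarser group, which is what makes the table usable as a hierarchy.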


We combine these variables together with the variable labels and the order from the hierarchical clustering. Here we can add in any other information, such as size or colour. Then we sort the data by the order variable, just so everything is positioned in the right place (not necessary for this example, but if you have a more complex visualisation you will be glad you did this):
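A sketch of how such a table might be assembled in base R follows. The `size` column here is a made-up measure (the standard deviation of each variable), included only to show where extra information would go; the data and column names are hypothetical.

```r
# Hypothetical data, tree and memberships.
set.seed(1)
dat <- as.data.frame(matrix(rnorm(200 * 8), ncol = 8))
hc <- hclust(as.dist(1 - cor(dat)))
memb <- cutree(hc, k = c(2, 3, 4, 6))
colnames(memb) <- paste0("l", 1:4)

hierarchy <- data.frame(
  memb,                                          # group columns l1..l4
  label = hc$labels,                             # variable names
  order = match(seq_along(hc$order), hc$order),  # position along the dendrogram
  size  = round(sapply(dat, sd), 4),             # illustrative size measure
  stringsAsFactors = FALSE
)

# Sort rows into dendrogram order so neighbours in the tree are neighbours here.
hierarchy <- hierarchy[order(hierarchy$order), ]
hierarchy
```

Note that hc$order lists which variable sits at each position, so the position of each variable is the inverse of that permutation, which is what match() computes above.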

This gives us a nice tabulated output:




Here we can see that V80 is in group 1 at the first split, group 2 at the second, group 8 at the third, and group 14 at the fourth.

Here is the tricky bit: we use a recursive function (a function that calls itself as long as a condition is true) to create a list structure. Modifying this function to suit your needs will be the hardest part. The first condition specifies that if there are more than two columns in the data frame, we split the data frame into a series of smaller frames using the values in the first column. The first lapply call creates nested lists, and the second adds the label and size information.
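The description above can be sketched roughly as follows. This is my reconstruction of the idea, not the original function: the name `makeList`, the toy table, and the assumption that the last two columns are called `label` and `size` are all hypothetical. It uses the key `children` at every level, as flare.json does, whereas the sample output in this post uses `imports` for the nested levels, so change the key to suit your template.

```r
# Turn a table of group columns + label + size into nested lists.
# Assumes the last two columns are `label` and `size`, and every column
# before them is a group membership, ordered from coarsest to finest.
makeList <- function(x) {
  if (ncol(x) > 2) {
    # Split on the first remaining group column, keeping the row order
    # of the table rather than sorting the group ids.
    f <- factor(x[[1]], levels = unique(x[[1]]))
    groups <- split(x[-1], f)
    # First lapply: one nested list per group, recursing on the rest.
    lapply(names(groups), function(g) {
      list(name = g, children = makeList(groups[[g]]))
    })
  } else {
    # Second lapply: only label and size are left, so emit one leaf per row.
    lapply(seq_len(nrow(x)), function(i) {
      list(name = x$label[i], size = x$size[i])
    })
  }
}

# Tiny hypothetical table with two levels to exercise the function.
toy <- data.frame(
  l1 = c(1, 1, 2),
  l2 = c(1, 2, 3),
  label = c("V1", "V92", "V7"),
  size = c(0.8938, 1.5306, 1.2),
  stringsAsFactors = FALSE
)

tree <- list(name = "Centre", children = makeList(toy))
str(tree, max.level = 3)
```

Any helper columns, such as an order variable used for sorting, need to be dropped before calling the function, since every column other than the last two is treated as a level of the hierarchy.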

We then wrap the toJSON function (from RJSONIO) around all of this, and write it to a file:
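A minimal sketch of this last step, assuming the RJSONIO package is installed. The small nested list and the file name `topics.json` are hypothetical stand-ins for the real output of the recursive function and the real destination file.

```r
library(RJSONIO)

# Small hypothetical nested list standing in for the recursive function's output.
tree <- list(
  name = "Centre",
  children = list(
    list(name = "1", children = list(
      list(name = "V1", size = 0.8938),
      list(name = "V92", size = 1.5306)
    ))
  )
)

# Hypothetical destination; in the function described above this is `outfile`.
outfile <- file.path(tempdir(), "topics.json")

# Serialise the list and write it out.
writeLines(toJSON(tree), outfile)

# Print the file to check the result.
cat(readLines(outfile), sep = "\n")
```

Named lists become JSON objects and unnamed lists become arrays, which is why the nested list is built as an unnamed list of named lists.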


The output looks as follows:
{
  "name": "Centre",
  "children": [
    {
      "name": "1",
      "imports": [
        {
          "name": "1",
          "imports": [
            {
              "name": "1",
              "imports": [
                {
                  "name": "1",
                  "imports": [
                    {
                      "name": "V1",
                      "size": 0.8938
                    },
                    {
                      "name": "V92",
                      "size": 1.5306
                    }
...


Now all we've got to do is go back into the HTML file and replace 'flare.json' with the path of our data output.
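Editing the file by hand is the simplest route, but the swap can also be scripted from R. In the sketch below the one-line page and the file names `index.html` and `topics.json` are hypothetical; a real template page is much longer and its data-loading call may look different.

```r
# Hypothetical one-line stand-in for the template page.
page <- file.path(tempdir(), "index.html")
writeLines('d3.json("flare.json", function(error, root) { /* ... */ });', page)

# Read the page, swap the data file name, and write it back.
html <- readLines(page)
html <- gsub("flare.json", "topics.json", html, fixed = TRUE)
writeLines(html, page)

readLines(page)
```

The path has to be one the browser can resolve from the page, so keeping the JSON file in the same folder as the HTML file is the easiest option.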

To visualise this, clone the d3 GitHub repository and follow the steps at https://github.com/mbostock/d3/wiki to set up a local server.

Putting all that together will give the illustration below:


Topic models in d3
Using the method above, it should be fairly straightforward to recreate the illustrations below.
Two possible ways of visualising topic models use tree visualisations. The circular version is similar to the bundle graph above, except that the center shows the tree calculated from the correlations rather than the actual links:



Another version of this is the collapsible tree. With better labelling this could be a fruitful way forward.


A third possible representation of this data is through a treemap. The advantage of the treemap is that it visualises not just relative position, but also relative size. It is a bit harder to see the links between clusters, though:



Finally, circle packing allows us to get a bird's-eye view of the data. Personally I think it gives a clearer view of topic clusters than tree diagrams do:

9 comments:

  1. I'm just about to plunge into d3, which looks amazing, so this is very timely. Thanks for posting it!

  2. Nice Post,

    How did you deploy your plot to the web? I have my d3js graph running in a local server, now I want to put it to my blog. How did you do it in blogger?

    Replies
    1. I hosted mine on github.io and used iframes to get them into the body of the post

  3. Very helpful. I like the idea of taking a template and stuffing in our data and getting some experience this way. I'm also going to be taking the plunge into d3 here real soon

  4. I am having trouble using your "sampleData" file to recreate the graphs, would you mind uploading it again? Thanks

    Replies
    1. The real trouble is finding "dt" ... called the first time at "dataVars <- colnames(dt)[!colnames(dt) %in% groupVars]"

    2. Yes, I can't remember whether I made this explicit or not, but the code assumes the input data is a data.table named dt, and that you have a vector of names, called groupVars, which should not be used in the calculations. The line you flagged up selects all columnnames not listed in groupVars. Hope this helps, R
