gabe gaster

data projects to learn on

the grammar of graphics and another kind of heat map

Today I went to the Chicago Visualization Meetup on ggplot2. Two great things happened. First, I finally have an intuitive sense of what the grammar of graphics actually is, which means I can now think about plotting with ggplot2 the right way. Second, I made a different kind of heat map with jitter.

At the meetup, I read the introduction to Hadley Wickham’s book, which is freely available on Amazon. I highly recommend reading pages 3 and 4.

What the Grammar of Graphics Is

From this I understood that there are three pieces to the grammar of graphics: data, aesthetics, and geometries. The idea of the grammar of graphics is to rethink what a graphic is. For Leland Wilkinson, it’s the output of a function which maps data to some visual space: a geometry. Mathematicians have a specific concept in mind with the term geometry, though for most people a sense of spatial relation is a good place to start. The geometry itself is almost always either discrete or Euclidean, though it is occasionally polar. A discrete example is a bar chart comparing, say, the earnings of two rival companies, where the distance along the independent axis doesn’t necessarily denote spatial distance at all.[1] A Euclidean example is maybe the most natural: distance matters, as when plotting against time. The polar example is the pie chart, though there are others, too.[2]

The mapping itself — the function which takes data to the geometry — is an aesthetic. It can have various parameters (color, size, type, transparency, shape, etc.) which affect how the data actually appears in the geometry.

This helps make sense of the actual syntax of ggplot2, in which every plot is built from two kinds of function calls: an aesthetic mapping (aes) and one or more geometries (geoms).
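To make the three pieces concrete, here is a minimal sketch of how they show up in a ggplot2 call. The built-in mtcars data set and the choice of columns are my own illustrative assumptions, not something from the meetup.

```r
library(ggplot2)

# Data: the built-in mtcars data frame (an illustrative choice, not from the meetup).
# Aesthetics: wt -> x position, mpg -> y position, cyl -> color.
# Geometry: points.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# The same data and the same mapping, drawn with a different geometry.
p_line <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_line()

print(p)
```

Only the geom changes between the two plots. The data and the aesthetic mapping stay exactly the same, which is the separation the grammar is meant to give you.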

The other thing I thought through differently today was the use of ggplot2’s jitter geom, geom_jitter. Previously, I’d only used jitter in plots where one variable was discrete and the other continuous, so that the jittering happens along the dimension in which space has little meaning (because it’s discrete). Today, we used jitter in cases with two discrete variables. When the points are small enough, the jittering takes on the effect of pointillism, or of a heat map.

Many people who know more about R than I do (I rely on them all the time as a resource and so should you) have talked about heat maps in R and ggplot2, though those methods fill each tile with a single color. With a pointillist heat map, made with jitter, one can see more subtle patterns in the data, if there’s enough of it.
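To show the contrast, here is a rough sketch that draws the same data both ways: first as a conventional tile heat map, then as a jittered, pointillist one. The data are simulated, and the column names (offense, arrest) and category labels are hypothetical stand-ins, not the real crime data.

```r
library(ggplot2)

set.seed(42)  # make the simulated data reproducible
n <- 20000

# Hypothetical data: two discrete variables, one row per incident.
sim <- data.frame(
  offense = sample(c("A", "B", "C", "D"), n, replace = TRUE,
                   prob = c(0.4, 0.3, 0.2, 0.1)),
  arrest  = sample(c("false", "true"), n, replace = TRUE, prob = c(0.7, 0.3))
)

# Conventional heat map: one tile per cell, filled according to the cell's count.
counts <- as.data.frame(table(offense = sim$offense, arrest = sim$arrest))
p_tile <- ggplot(counts, aes(x = offense, y = arrest, fill = Freq)) +
  geom_tile()

# Pointillist version: one small jittered point per row of data.
p_jitter <- ggplot(sim, aes(x = offense, y = arrest)) +
  geom_jitter(size = 0.3)

print(p_tile)
print(p_jitter)
```

In the tile version each cell collapses to a single number. In the jittered version every row is still on the page, so the density of each cell is something you see rather than read off a legend.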

Today we used crime data from the City of Chicago Data Portal. The inspiration for all of this came from the class, and one of the scripts below is a modification of one by Tom Schenk, who led the session.

The tight tiles and the fact that the density of the points is what communicates the data make this a heat map. But this variation also lets you see the interactions of multiple color variables. The trick is to make the points small and transparent enough that you can see the colors interact.
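Here is a small sketch of that trick on simulated data. The column names (ward, offense, arrest) are hypothetical, and the particular size and alpha values are just a starting point I would expect to tune by eye for a given data set.

```r
library(ggplot2)

set.seed(7)  # make the simulated data reproducible
n <- 20000

# Hypothetical data: two discrete axes plus a third variable mapped to color.
sim <- data.frame(
  ward    = factor(sample(1:10, n, replace = TRUE)),
  offense = sample(c("A", "B", "C"), n, replace = TRUE),
  arrest  = sample(c("false", "true"), n, replace = TRUE, prob = c(0.7, 0.3))
)

p <- ggplot(sim, aes(x = ward, y = offense)) +
  # Small, semi-transparent points let overlapping colors blend within a tile.
  geom_jitter(aes(color = arrest), size = 0.3, alpha = 0.4) +
  # Draw the legend keys large and opaque so they stay readable.
  guides(colour = guide_legend(override.aes = list(size = 4, alpha = 1)))

print(p)
```

The override.aes argument only changes how the legend keys are drawn, so the points in the plot itself stay tiny.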

Once again — the aesthetics are critical to the graphic.

# Gabe Gaster
# Data from the Chicago Data Portal
library(ggplot2)

# stringsAsFactors = TRUE so that levels() below works (R >= 4.0 defaults to FALSE)
crime <- read.csv("Crimes_-_2011.csv", stringsAsFactors = TRUE)

# Clean the data -- take out typos
# (the level indices are specific to this particular export of the file)
levels(crime$Primary.Type)[11] <- "INTERFERENCE WITH PUBLIC OFFICER"
levels(crime$Primary.Type)[22] <- "OTHER OFFENSE"

# Make some graphs
# Offense type vs. arrest, colored by whether the incident was domestic
ggplot(crime, aes(x = Primary.Type, y = factor(Arrest))) +
  geom_jitter(aes(color = Domestic), size = 0.3) +
  coord_flip() + ylab("Arrest Made") + xlab("Type of Offense") +
  ggtitle("Chicago Crime, 2011") +
  guides(colour = guide_legend(override.aes = list(size = 4)))

# Ward vs. offense type, colored by whether an arrest was made
ggplot(crime, aes(x = Ward, y = Primary.Type)) +
  geom_point(aes(color = Arrest), position = "jitter", size = 0.3) +
  guides(colour = guide_legend(override.aes = list(size = 4))) +
  ggtitle("Chicago Crime, 2011 by Ward") +
  theme(axis.ticks = element_blank())

See what I mean?

Full resolution images of the graphs are here and here.

  1. It denotes the discrete distance. 

  2. This got me thinking: what other geometries are used? The most natural first question is: what kind of data visualizations could you make with a hyperbolic geometry? I’m not entirely sure this question is well posed, because mathematicians use the term geometry differently than data scientists do. In this post, I’ve tried to use the term in a way that makes sense in both contexts.