When I was just starting out with R, I played with the preloaded data sets, such as the data on cars (just use ‘mtcars’). These let you play around with the basic commands, summarizing data sets with e.g. summary and plot.
Once you know a bit about this, you quickly notice that you want ways to massage (reshape) the data until it looks just like you want, and that you often want to cut the data into pieces and apply a function to each piece.
Enter packages plyr and reshape, by Hadley Wickham, who is way awesome and has written at least 17 R packages. One of the great things about plyr and reshape is that there is a great set of talks that Hadley gave in 2009, with published notes that you can work through on your own.
One of the workshops analyzes baby names using the census data set of the 1000 most popular baby names for boys and girls since 1880. Hadley’s tutorial takes you through a bunch of initial explorations you can do with the data, and even challenges you to do some explorations of your own, e.g. plotting the changing frequency of names that start with the letter J.
One question you might ask (Hadley did) is about the popularity of biblical names.
Here’s one of my explorations. Goal: understand biblical baby names over time.
The first step is to pose and hone the problem in terms of something calculable: Plot the proportion of baby names that are biblical, since 1880.
To do this, I needed a list of biblical names.
First, I googled around for such a list. Wikipedia’s list was the first hit. It’s great to find out that my surprisingly complicated problem has been crowd-sourced: the Wikipedia community has been generous enough to maintain this list (including citations). The problem is, the list is not in a form that’s easy for me to use.
I kept digging to try to find a pre-compiled list, but alas, it seemed that no one had published one. On Wikipedia, the list was sufficiently long (a separate page for each letter) that it would have been a total pain to copy and paste by hand. Copying by hand is the ‘dumb way’.
The ‘dumb way’ is not only dumb because it would waste my time. It is dumb because it precludes one of the main objectives of any scientist: that their work be replicable. If I wanted to rerun my experiment, as impractical as it is to copy the list once by hand, it’s near impossible to copy the list twice by hand. Copying by hand would also probably have introduced errors into the list. More importantly, taking the time up front (in this case, 5 hours) to teach myself the smart way of doing something is almost always the right approach. Once I’ve invested the time, the method is mine. In fact, one of the main reasons I pursued the project was to learn this method, which I think is essential: ‘scraping’.
Scraping is a data science term which refers to sifting through large amounts of somewhat unstructured data and picking out the parts of it that matter to your problem. In this case, I wanted to load each wikipedia page, isolate the biblical names, and turn them into a list.
To do this, I put down R and picked up Python.
I had to learn about calls to ‘httplib2’, remind myself how regular expressions work, and then look at the source of the wikipedia list. The idea is straightforward: break it into chunks. This is the computational analog of an adaptation of George Polya’s heuristic, which I learned from Japheth Wood:
If there’s a hard problem you can’t solve, there’s an easier problem you can solve.
    import re
    import httplib2  # third-party: pip install httplib2

    def findName(text):
        # Optional opening tag, a capitalized name, optional </a>, then a comma.
        a = re.match(r"(<.+>)?([A-Z][-'a-z]+)(</a>)?,", text)
        if a:
            return a.group(2)

    def main():
        h = httplib2.Http('.cache')
        url = ('http://en.wikipedia.org/wiki/' +
               'List_of_biblical_names_starting_with_')
        letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
        with open('BiblicalNames.txt', 'w') as f:
            # One-word header so R reads the column name as BiblicalNames.
            f.write('BiblicalNames\n')
            for letter in letters:
                print('Querying letter ' + letter, end=' ')
                # request() returns (response headers, body as bytes)
                response, content = h.request(url + letter, "GET")
                page = content.decode('utf-8')
                print('Analyzing letter ' + letter)
                ## from <ul> to </ul>, grab the second piece of the split
                ## (assumed to hold the name list; adjust if the layout differs)
                text = re.split(r"</?ul>", page)[1]
                ## split it up into list items
                lines = re.split(r"<li>", text)
                for line in lines:
                    name = findName(line)
                    if name:
                        f.write(name + '\n')

    if __name__ == "__main__":
        main()
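To see how the two splitting steps and the regular expression fit together without hitting the network, here is a minimal sketch that runs the same logic on a small hand-written HTML fragment. The fragment and the helper name `extract_names` are made up for illustration; the real Wikipedia markup may well differ, and so may which piece of the split holds the list.

```python
import re

def find_name(text):
    # Same pattern as findName above: optional opening tag, a capitalized
    # name, optional closing </a>, then a comma.
    m = re.match(r"(<.+>)?([A-Z][-'a-z]+)(</a>)?,", text)
    return m.group(2) if m else None

def extract_names(html):
    # Hypothetical helper: take the piece between the first <ul> and </ul>,
    # cut it at each <li>, and keep the chunks that start with a name.
    chunks = re.split(r"</?ul>", html)
    items = re.split(r"<li>", chunks[1])
    names = [find_name(item) for item in items]
    return [n for n in names if n]

# Made-up sample in the rough shape of a list page; not real Wikipedia markup.
SAMPLE = (
    "<p>Intro text</p><ul>\n"
    '<li><a href="/wiki/Aaron">Aaron</a>, a teacher</li>\n'
    "<li>Abel, vanity</li>\n"
    "<li>see also other lists</li>\n"
    "</ul><p>Footer</p>"
)

print(extract_names(SAMPLE))
```

Chunks that do not start with a capitalized word followed by a comma (the leading newline, the "see also" item) are dropped, which is why the pattern insists on the trailing comma.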
Finally, in R:
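    library(plyr)
    library(reshape2)
    library(ggplot2)

    # bnames is the baby-names data frame from Hadley's tutorial
    # (it needs the columns year, sex, name and percent), assumed loaded already.
    bibnames <- read.csv('BiblicalNames.txt')[[1]]  # first (only) column

    # Flag each name as biblical or not
    bnames$bibl <- is.element(bnames$name, bibnames)
    # Sum the proportions within each year/sex/biblical group
    bibpop <- ddply(bnames, c('year', 'sex', 'bibl'), summarise, tot=sum(percent))
    bibpopT <- subset(bibpop, bibl==TRUE)
    qplot(year, tot, data=bibpopT, geom='line', colour=sex, ylim=c(0, 1))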
The plot makes the trend easy to see: male biblical baby names are currently at a 70-year low.
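For readers who think more easily in Python, the flag-then-aggregate step in the R code can be sketched with a plain dictionary. The rows below are invented toy values, not real baby-name data, and `biblical_share` is a hypothetical helper name. It is only meant to mirror what `is.element`, `ddply(..., summarise, tot=sum(percent))` and the `bibl==TRUE` subset compute together.

```python
from collections import defaultdict

def biblical_share(rows, biblical_names):
    # rows: iterable of (year, sex, name, percent) tuples.
    # Returns {(year, sex): summed percent of the names in biblical_names}.
    totals = defaultdict(float)
    for year, sex, name, percent in rows:
        if name in biblical_names:          # the is.element() flag
            totals[(year, sex)] += percent  # the per-group sum
    return dict(totals)

# Toy rows with invented percentages, for illustration only.
rows = [
    (1880, "boy", "John", 0.5),
    (1880, "boy", "William", 0.25),
    (1880, "girl", "Mary", 0.25),
    (1881, "boy", "John", 0.25),
    (1881, "boy", "James", 0.125),
]
print(biblical_share(rows, {"John", "James", "Mary"}))
```

One difference from the R version: a year/sex group with no biblical names simply does not appear in the result, rather than showing up with a total of zero.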
Download the code yourself from my GitHub.