gabe gaster

Month

April 2012

3 posts

Biblical Baby Names

When I was just starting out with R, I played with the preloaded data sets, such as the data on cars (just type ‘mtcars’). This lets you play around with the basic commands, summarizing data sets with e.g. summary and plot.

Once you know a bit about this, you quickly notice that you want ways to cut the data and massage it (reshape it) until it looks just like you want, and that often you want to split the data into pieces and apply a function to each piece.
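The "split into pieces, apply a function to each piece" pattern is not specific to R. As a rough sketch of the idea, here it is in plain Python. The rows are made-up toy values in the spirit of mtcars (not the real data), and `split_apply` is a hypothetical helper name, not something from plyr.

```python
from collections import defaultdict
from statistics import mean

# Toy rows (made-up values, not the real mtcars data).
cars = [
    {"cyl": 4, "mpg": 30.0},
    {"cyl": 4, "mpg": 26.0},
    {"cyl": 6, "mpg": 20.0},
    {"cyl": 8, "mpg": 15.0},
    {"cyl": 8, "mpg": 13.0},
]

def split_apply(rows, key, func):
    """Split rows into pieces by rows[key], then apply func to each piece."""
    pieces = defaultdict(list)
    for row in rows:
        pieces[row[key]].append(row)
    return {k: func(piece) for k, piece in pieces.items()}

# Mean mpg within each cylinder group.
mean_mpg = split_apply(cars, "cyl", lambda piece: mean(r["mpg"] for r in piece))
print(mean_mpg)
```

plyr wraps this same loop up in one call and also takes care of gluing the per-piece results back into a data frame.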

Enter packages plyr and reshape, by Hadley Wickham, who is way awesome and has written at least 17 R packages. One of the great things about plyr and reshape is that there is a great set of talks that Hadley gave in 2009, with published notes that you can work through on your own.

One of the workshops analyzes baby names with the census set of the 1000 most popular baby names for boys and girls, since 1880. Hadley’s tutorial will take you through a bunch of initial explorations you can do with the data, and even challenge you to do some explorations of your own, e.g. plot the changing frequency of names that start with the letter j.

One question you might ask (Hadley did) is about the popularity of biblical names.

Here’s one of my explorations.  Goal: understand biblical baby names over time. 

The first step is to pose and hone the problem in terms of something calculable: Plot the proportion of baby names that are biblical, since 1880. 

To do this, I needed a list of biblical names.

First, I googled around for such a list. Wikipedia’s list was the first hit.  It’s great to find out that my surprisingly complicated problem has been crowd-sourced: the Wikipedia community has been generous enough to maintain this list (including citations). The problem is, the list is not in a form that’s easy for me to use…

I kept digging to try to find a pre-compiled list, but alas, it seemed that no one had published one. On Wikipedia, the list was sufficiently long (a separate page for each letter) that it would have been a total pain to copy and paste by hand. This is the ‘dumb way’.

Aside

The “dumb way” is not only dumb because it would waste my time. It is dumb because it precludes one of the main objectives of any scientist: that their work be replicable. If I wanted to rerun my experiment, as impractical as it is to copy the list once by hand, it’s near impossible to copy the list twice by hand.  Also, importantly, copying by hand would probably have introduced errors into the list. More importantly, taking the time up front (in this case, 5 hours) to teach myself the smart way of doing something is almost always the right approach. Once I’ve invested the time, the method is mine.  In fact, one of the main reasons I pursued the project was to learn about this method which I think is essential: ‘scraping’.

Scraping

Scraping is a data science term which refers to sifting through large amounts of somewhat unstructured data and picking out the parts of it that matter to your problem. In this case, I wanted to load each wikipedia page, isolate the biblical names, and turn them into a list.

To do this, I put down R and picked up Python.

I had to learn about calls to ‘httplib2’, remind myself how regular expressions work, and then look at the source of the wikipedia list. The idea is straightforward: break it into chunks. This is the computational analog of an adaptation of George Polya’s heuristic, which I learned from Japheth Wood:

If there’s a hard problem you can’t solve, there’s an easier problem you can solve.

 # Python 3; requires the third-party httplib2 package.
 import re
 import httplib2

 def findName(text):
      # optional leading tag, a capitalized name, optional </a>, then a comma
      a = re.match(r"(<.+>)?([A-Z][-'a-z]+)(</a>)?,", text)
      if a: return a.group(2)

 def main():
      h = httplib2.Http('.cache')
      url = ('http://en.wikipedia.org/wiki/'
             'List_of_biblical_names_starting_with_')
      letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

      with open('BiblicalNames.txt', 'w') as f:
           # header line; R's read.csv uses it as the column name
           f.write('BiblicalNames\n')

           for letter in letters:
                print('Querying letter ' + letter, end=' ')
                url_iter = url + letter
                # request() returns (response, content); content is bytes
                page = h.request(url_iter, "GET")[1].decode('utf-8')
                print('Analyzing letter ' + letter)

                ## split on <ul> and </ul>, grab the second chunk
                ## (the contents of the first list on the page)
                text = re.split(r"</?ul>", page)[1]

                ## split it up into list items.
                lines = re.split(r"<li>", text)
                for line in lines:
                     name = findName(line)
                     if name:
                          f.write(name + '\n')

 if __name__ == "__main__":
      main()
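To see what the chunking and the regular expression are doing without hitting the network, here is a small sketch that runs the same two splits and the same pattern on a tiny hand-written page. The HTML below is a hypothetical stand-in (the real Wikipedia markup may well differ), and `find_name` is just a renamed copy of the function above.

```python
import re

# Same pattern as in the scraper: an optional leading tag, a capitalized
# name, an optional closing </a>, then a comma.
NAME_RE = re.compile(r"(<.+>)?([A-Z][-'a-z]+)(</a>)?,")

def find_name(text):
    m = NAME_RE.match(text)
    return m.group(2) if m else None

# Hypothetical page; made up for illustration only.
page = (
    "<p>intro</p><ul>"
    '<li><a href="/wiki/Aaron" title="Aaron">Aaron</a>, a teacher</li>'
    "<li>Abagtha, father of the wine-press</li>"
    "<li>see also the main list</li>"
    "</ul><p>footer</p>"
)

chunk = re.split(r"</?ul>", page)[1]   # contents of the first <ul>...</ul>
lines = re.split(r"<li>", chunk)       # one piece per list item
names = [n for n in map(find_name, lines) if n]
print(names)
```

Items that do not start with a capitalized word followed by a comma (like the third one here) are silently skipped, which is handy for filtering but also means the pattern is worth checking against the real page source.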

Finally, in R:

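  library(plyr)
  library(reshape2)
  library(ggplot2)

  # bnames is the baby-names data frame from Hadley's tutorial, assumed
  # to be loaded already (columns used here: year, name, sex, percent).
  bibnames <- read.csv('BiblicalNames.txt')[[1]]   # first (only) column
  bnames$bibl <- is.element(bnames$name, bibnames)
  # total share of names per year and sex, biblical vs. not
  bibpop <- ddply(bnames, c('year', 'sex', 'bibl'), summarise, 
      tot=sum(percent))
  bibpopT <- subset(bibpop, bibl==TRUE)
  qplot(year, tot, data=bibpopT, geom='line', colour=sex, 
      ymin=0, ymax=1)   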

This makes the trend easy to see: male biblical baby names are currently at a 70-year low.
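If you don't read R, the aggregation step boils down to: flag each row as biblical or not, sum the percentages within each (year, sex, biblical) group, and keep the biblical groups. Here is a minimal Python sketch of that logic. The rows and the name set are made-up toy values, not the real data or the scraped list.

```python
from collections import defaultdict

# Toy rows in the shape the R code uses: year, name, sex, percent.
# Values are invented for illustration.
bnames = [
    {"year": 1880, "name": "John", "sex": "boy", "percent": 0.25},
    {"year": 1880, "name": "William", "sex": "boy", "percent": 0.125},
    {"year": 1880, "name": "Mary", "sex": "girl", "percent": 0.25},
    {"year": 2008, "name": "Jacob", "sex": "boy", "percent": 0.125},
    {"year": 2008, "name": "Aiden", "sex": "boy", "percent": 0.125},
]
bibnames = {"John", "Mary", "Jacob"}  # stand-in for the scraped list

# Flag each row, then sum percent within each (year, sex, bibl) group.
totals = defaultdict(float)
for row in bnames:
    bibl = row["name"] in bibnames
    totals[(row["year"], row["sex"], bibl)] += row["percent"]

# Keep only the biblical groups, like subset(bibpop, bibl == TRUE).
bibpop_true = {(y, s): t for (y, s, b), t in totals.items() if b}
print(bibpop_true)
```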

Download the code yourself from my github.

Apr 23, 2012
#plyr #reshape #scraping #R #python

hello world.

Apr 22, 2012
How to restore Bookmarks for Chrome on OS X 10.8

Note: This works for OS X 10.8.2 and Chrome 22.0.1229.94

Open up the Terminal (see this tutorial, for example) and then type:

cd ~/Library/Application\ Support/Google/Chrome/Default

and hit enter.

This takes you to the right folder, which is hidden by default in the Mac Finder. Then if you want to restore the Bookmarks file (which is called Bookmarks), all you have to do is restore from the backup, which is called Bookmarks.bak.

Here is one way to do that.

First, make a backup of your current Bookmarks file, in case anything goes wrong. Do that by typing:

cp Bookmarks ~/Desktop/.

and hit enter.

This copies the file called Bookmarks, as a safety copy of your current bookmarks, into a folder that you regularly have access to, such as the Desktop.

Then, to get Chrome to use the backup Bookmarks file, which it will load the next time it starts up, type:

mv Bookmarks.bak Bookmarks

This renames the Bookmarks.bak file to “Bookmarks” and, in so doing, overwrites the old Bookmarks file (which is why we made a copy earlier, just in case).

Now restart Chrome. If everything looks good, great! Otherwise you can always go back to the way it was before by typing, in the same Terminal window that you have open:

mv Bookmarks Bookmarks.bak
cp ~/Desktop/Bookmarks .

Then, if you reopen Chrome, bookmarks will be as they were before.
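If you would rather script the restore than type the commands, here is a rough Python 3 sketch of the same two steps (safety copy, then rename). `restore_bookmarks` is a hypothetical helper name, and the demo below runs on throwaway directories rather than a real Chrome profile.

```python
import shutil
import tempfile
from pathlib import Path

def restore_bookmarks(profile_dir, safety_dir):
    """Copy Bookmarks to safety_dir, then replace it with Bookmarks.bak."""
    profile_dir, safety_dir = Path(profile_dir), Path(safety_dir)
    current = profile_dir / "Bookmarks"
    backup = profile_dir / "Bookmarks.bak"
    if not backup.exists():
        raise FileNotFoundError(backup)
    shutil.copy2(current, safety_dir / "Bookmarks")  # cp Bookmarks ~/Desktop/.
    backup.replace(current)                          # mv Bookmarks.bak Bookmarks

# Demo on temporary directories with placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    profile = Path(tmp, "Default")
    desktop = Path(tmp, "Desktop")
    profile.mkdir()
    desktop.mkdir()
    (profile / "Bookmarks").write_text("current")
    (profile / "Bookmarks.bak").write_text("older")
    restore_bookmarks(profile, desktop)
    restored = (profile / "Bookmarks").read_text()
    saved = (desktop / "Bookmarks").read_text()
    bak_left = (profile / "Bookmarks.bak").exists()
print(restored, saved, bak_left)
```

For the real thing, the two arguments would be the folders used above: `Path.home() / "Library/Application Support/Google/Chrome/Default"` and `Path.home() / "Desktop"`.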

Apr 1, 2012
#Chrome #OS X