R vs Python – Round 1

Document title: R vs Python – Round 1

Date: January 5, 2014

Text by: Simon Garnier (www.theswarmlab.com / @sjmgarnier)

R code by: Simon Garnier (www.theswarmlab.com / @sjmgarnier)

Python code by: Randy Olson (www.randalolson.com / @randal_olson)

Document generated with RStudio (www.rstudio.com), knitr (www.yihui.name/knitr/) and pandoc (www.johnmacfarlane.net/pandoc/). Python figures generated with IPython Notebook (www.ipython.org/notebook.html).


My friend Randy Olson and I got into the habit of arguing about the relative qualities of our favorite languages for data analysis and visualization. I am an enthusiastic R user (www.r-project.org) while Randy is a fan of Python (www.python.org). One thing we agree on, however, is that our discussions are meaningless unless we actually put R and Python to a series of tests to showcase their relative strengths and weaknesses. Essentially, we will set a common goal (e.g., perform a particular type of data analysis or draw a particular type of graph) and write the R and Python code to achieve this goal. And since Randy and I are all about sharing, open source and open access, we decided to make the results of our friendly challenges public so that you can help us decide between R and Python and, hopefully, also learn something along the way.

Today’s challenge: where we learn that Hollywood’s cemetery is full

1 – Introduction

For this first challenge, we will use data collected by Randy for his recent post on the “Top 25 most violence packed films” in the history of the movie industry. For his post, Randy generated a simple horizontal barchart showing the top 25 most violent films ordered by number of on screen deaths per minute. In the rest of this document, we will show you how to reproduce this graph using Python and how to achieve a similar result with R. We will detail the different steps of the process and provide the corresponding code for each step (red boxes for R, green boxes for Python). You will also find the complete code for both languages at the end of this document.

If you think there’s a better way to code this in either language, leave a pull request on our GitHub repository or leave a note with suggestions in the comments below.

And now without further ado, let’s get started!

2 – Step by step process

First things first, let’s set up our working environment by loading some necessary libraries.

# Load libraries
library(lattice)        # Very versatile graphics package
library(latticeExtra)   # Addition to "lattice" that makes layering graphs a 
                        # breeze, and I'm a lazy person, so why not
# This starts the IPython Notebook pylab module, useful for plotting and
# interactive scientific computing
%pylab inline
from pandas import read_csv

Now let’s load the data for today’s job. The raw data were scraped by Randy (using Python) from www.MovieBodyCounts.com and he generously provided the result of his hard work on FigShare at this address: http://dx.doi.org/10.6084/m9.figshare.889719.

# Load data into a data frame
body.count.data <- read.csv("http://files.figshare.com/1332945/film_death_counts.csv")
# Read the data into a pandas DataFrame
body_count_data = read_csv("http://files.figshare.com/1332945/film_death_counts.csv")

For each movie, the data frame contains a column for the total number of on screen deaths (“Body_Count”) and a column for the duration (“Length_Minutes”). We will now create an extra column for the number of on screen deaths per minute of each movie (“Deaths_Per_Minute”)

# Compute on screen deaths per minute for each movie
body.count.data <- within(body.count.data, {
  Deaths_Per_Minute <- Body_Count / Length_Minutes
  ord <- order(Deaths_Per_Minute, decreasing = TRUE)  # Ordering index, useful later
})
# Divide the body counts by the length of the film
body_count_data["Deaths_Per_Minute"] = (body_count_data["Body_Count"].apply(float).values /
                                        body_count_data["Length_Minutes"].apply(float).values)

Now we will reorder the data frame by (descending) number of on screen deaths per minute, and select the top 25 most violent movies according to this criterion.

# Reorder "body.count.data" by (descending) number of on screen deaths per minute
body.count.data <- body.count.data[body.count.data$ord,]

# Select top 25 most violent movies by number of on screen deaths per minute
body.count.data <- body.count.data[1:25,]
# Only keep the top 25 highest kills per minute films
body_count_data = body_count_data.sort("Deaths_Per_Minute", ascending=False)[:25]

# Change the order of the data so highest kills per minute films are on top in the plot
body_count_data = body_count_data.sort("Deaths_Per_Minute", ascending=True)
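(A note for readers on a recent pandas: `DataFrame.sort` was later deprecated and removed in favour of `sort_values`, which takes the same arguments here. A minimal sketch on a toy frame, reusing the column name from the data set above:)

```python
import pandas as pd

# Toy frame standing in for the film data set
df = pd.DataFrame({"Film": ["A", "B", "C"],
                   "Deaths_Per_Minute": [0.5, 2.0, 1.0]})

# Keep the top 2 most violent films...
top = df.sort_values("Deaths_Per_Minute", ascending=False)[:2]

# ...then flip the order so the highest ends up on top of the plot
top = top.sort_values("Deaths_Per_Minute", ascending=True)

print(top["Film"].tolist())  # ['C', 'B']
```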

In Randy’s graph, the “y” axis shows the film title with the release date. We will now generate the full title for each movie following a “Movie name (year)” format, and append it to the data frame.

# Combine film title and release date into a new factor column with levels
# ordered by ascending violence
body.count.data <- within(body.count.data, {
  Full_Title <- paste0(Film, " (", Year, ")")
  ord <- order(Deaths_Per_Minute, decreasing = TRUE)
  Full_Title <- ordered(Full_Title, levels = rev(unique(Full_Title[ord])))
})
# Generate the full titles for the movies: movie name (year)
full_title = []

for film, year in zip(body_count_data["Film"].values, body_count_data["Year"].values):
    full_title.append(film + " (" + str(year) + ")")

body_count_data["Full_Title"] = array(full_title)
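(The same titles can also be built without an explicit loop, using pandas’ vectorised string operations. A sketch on a toy frame, since only the column names are taken from the data set above:)

```python
import pandas as pd

# Toy frame with the same column names as the film data set
df = pd.DataFrame({"Film": ["Rambo", "Commando"], "Year": [2008, 1985]})

# Vectorised equivalent of the title-building loop above
df["Full_Title"] = df["Film"] + " (" + df["Year"].astype(str) + ")"

print(df["Full_Title"].tolist())  # ['Rambo (2008)', 'Commando (1985)']
```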

Now we are ready to generate the barchart. We’re going to start with the default options and then we will make this thing look pretty.

# Generate base graph
graph <- barchart(Full_Title ~ Deaths_Per_Minute, data = body.count.data)

plot of chunk baseGraphR

# Plot the bars
fig = plt.figure(figsize=(8,12))

# Plot the red horizontal bars
rects = plt.barh(range(len(body_count_data["Deaths_Per_Minute"])),
                 body_count_data["Deaths_Per_Minute"],
                 color="#8A0707")

# Add the film labels to left of the bars (y-axis)
yticks(range(len(body_count_data["Full_Title"])),
       body_count_data["Full_Title"].values, fontsize=14)

Ok, now let’s make this pretty.

# Create theme
my.bloody.theme <- within(trellis.par.get(), {   # Initialize theme with default values
  axis.line$col <- NA                            # Remove axes
  plot.polygon <- within(plot.polygon, {
    col <- "#8A0606"                             # Set bar colors to a nice bloody red
    border <- NA                                 # Remove bars' outline
  })
  axis.text$cex <- 1                             # Default axis text size is a bit small; make it bigger
  layout.heights <- within(layout.heights, {
    bottom.padding <- 0                          # Remove bottom padding
    axis.bottom <- 0                             # Remove axis padding at the bottom of the graph
    axis.top <- 0                                # Remove axis padding at the top of the graph
  })
})

# Update figure with new theme + other improvements (like a title, for instance)
graph <- update(
  graph,
  main = "25 most violence packed films by deaths per minute",  # Title of the barchart
  par.settings = my.bloody.theme,      # Use custom theme
  xlab = NULL,                         # Remove label of x axis
  scales = list(x = list(at = NULL)),  # Remove rest of x axis
  xlim = c(0, 6.7),                    # Set graph limits along x axis to accommodate the additional text (requires some trial and error)
  box.width = 0.75)                    # Default bar width is a bit small; make it bigger


plot of chunk prettyR

# Don't have any x tick labels
xticks(arange(0, 5, 1), [""])

# Plot styling

# Remove the plot frame lines
ax = axes()
for spine in ax.spines.values():
    spine.set_visible(False)

# Only use y-axis ticks on the left and x-axis ticks on the bottom
ax.yaxis.tick_left()
ax.xaxis.tick_bottom()

# Color the y-axis ticks the same dark red color, and the x-axis ticks white
ax.tick_params(axis="y", color="#8A0707")
ax.tick_params(axis="x", color="white")

ax.xaxis.grid(color="white", linestyle="-")

Finally, the last thing we want to add to our graph is the number of deaths per minute and the duration of each movie on the right of the graph.

# Combine number of on screen deaths per minute and duration of each movie into a new character string column
body.count.data <- within(body.count.data, {
  Deaths_Per_Minute_With_Length <- paste0(round(Deaths_Per_Minute, digits = 2), " (", Length_Minutes, " mins)")
})

# Add number of on screen deaths per minute and duration of movies at the end of each bar
graph <- graph + layer(with(body.count.data,
  panel.text(
    Deaths_Per_Minute,                  # x position of the text
    25:1,                               # y position of the text
    pos = 4,                            # Position of the text relative to the x and y position (4 = to the right)
    Deaths_Per_Minute_With_Length)))    # Text to display

# Print graph
print(graph)

plot of chunk rightLabelsR

# This function adds the deaths per minute label to the right of the bars
def autolabel(rects):
    for i, rect in enumerate(rects):
        width = rect.get_width()
        label_text = (str(round(float(width), 2)) +
                      " (" + str(body_count_data["Length_Minutes"].values[i]) +
                      " mins)")
        plt.text(width + 0.25,                           # x position of the text
                 rect.get_y() + rect.get_height() / 2.,  # y position of the text
                 label_text,
                 ha="left", va="center", fontsize=14)

autolabel(rects)


3 – R bonus

Just for fun, I decided to add to the R graph a little accessory related to the general theme of this data set.

# Load additional libraries
library(jpeg)  # To read JPG images
library(grid)  # Graphics library with better image plotting capabilities

# Download a pretty background image; mode is set to "wb" because it seems that
# Windows needs it. I don't use Windows, so I can't confirm
download.file(url = "http://www.theswarmlab.com/wp-content/uploads/2014/01/bloody_gun.jpg", 
              destfile = "bloody_gun.jpg", quiet = TRUE, mode = "wb")

# Load gun image using "readJPEG" from the "jpeg" package
img <- readJPEG("bloody_gun.jpg")

# Add image to graph using "grid.raster" from the "grid" package
graph <- graph + layer_(grid.raster(
    as.raster(img),                 # Image as a raster
    x = 1,                          # x location of image in "Normalised Parent Coordinates"
    y = 0,                          # y location of image in "Normalised Parent Coordinates"
    height = 0.7,                   # Height of the image; 1 indicates that the image height is equal to the graph height
    just = c("right", "bottom")))   # Justification of the image relative to its x and y locations

# Print graph
print(graph)

plot of chunk gunR

4 – Source code

R and Python source codes are available here.

For F# fans, Terje Tyldum has written his version of the code in F# here.

Randy and I also recommend that you check out this post by Ramiro Gómez (@yaph) where he does a more in-depth analysis of the data set we used for today’s challenge.

For R users who are lattice-phobic, a Reddit user (needleSharing) kindly provided a solution using only base graphics.

Ruby users can check Juanjo Bazán’s code here.

83 thoughts on “R vs Python – Round 1”

  1. Great stuff guys! An R user here. Just wondering if you’d come across the ggplot2 package – I find it’s quite a bit more intuitive than lattice etc… once you get past the initial learning curve!

    After you’ve done all the steps above, just do:


    g = rasterGrob(img,interpolate=T)

    ggplot(body.count.data) +
      annotation_custom(g, xmax = 17, ymax = 5.5) +
      geom_bar(aes(x = Full_Title, y = Deaths_Per_Minute), stat = "identity", fill = "#8A0606") +
      coord_flip(ylim = c(0, 7)) +
      labs(y = "Deaths per Minute", x = NULL) +
      theme(axis.ticks = element_blank(),
            axis.line = element_blank(),
            axis.text.x = element_blank()) +
      geom_text(aes(x = Full_Title, y = Deaths_Per_Minute + 0.1,
                    label = Deaths_Per_Minute_With_Length),
                hjust = 0, size = 4)

    The gun looks more or less good depending on the shape of the plot...

    • Thanks very much Oscar, for your kind words and for providing this alternative graphing code.

      I do know ggplot2, it’s a great graphics package. I prefer lattice because, personally, I find its formula interface more intuitive. I use linear and nonlinear modeling quite frequently and lattice handles data the same way, so it always made more sense for me to use lattice. Besides the latticeExtra package makes layering very easy, which used to be the main advantage of using ggplot2. This being said, you can achieve nearly similar graphs (identical in many cases) using one or the other package, and I think that a little “competition” is great to push the development of both packages further (though it seems that ggplot2 hasn’t been updated since last March).

      Regarding the gun, the problem comes from trying to use a fixed ratio image in a dynamic ratio graphics environment. It does look good only for some combinations of image-environment ratios. In this case, I used a 10 x 8 ratio (W x H) to achieve this result. You can easily specify the dimensions of the graphics environment when you call a new graphics device (X11, quartz, pdf, png, etc…).


    • Thanks Eric! You should check http://rosettacode.org/. It’s a participatory website that collects and organizes “solutions to the same task in as many different languages as possible, to demonstrate how languages are similar and different, and to aid a person with a grounding in one approach to a problem in learning another”. Not many statistical tasks there, but it’s a really fascinating resource to compare different languages.

  2. This is great – I’ve been wanting to learn more python and – apart from being fun – the side-by-side with R is really helpful.
    I think you have a typo in your R code – the paste command is missing a closing quote for the separator, should be
    Full_Title = paste(body.count.data$Film, " (", body.count.data$Year, ")", sep="");

  3. Very nice. I’m well versed in R and Python and I very much prefer R when it comes to statistical computing and plotting, for the reasons that become apparent in your post. The Python code, which normally is simple and elegant, here tends to look pretty awful. Python is also poorly designed for interactive use compared to R, lacking the lazy evaluation of function arguments that R takes full advantage of. Python also doesn’t preserve the order of keyword arguments, which also has nasty consequences.

    • Hi Ernest,

      To clarify: The purpose of these posts is not to prove that one language is better than the other. Each language has its purpose, and ultimately which syntax you think is better is personal preference. That said, I do not think you can make the statements you’ve made like they’re fact — especially in the light of this relatively tiny comparison. I will admit however that I am a hacky programmer and do not write the “cleanest” Python code, and I’ll try harder to program more cleanly in future posts.

      Regarding Python for interactive use: Please take a look at IPython (www.ipython.org) and especially IPython Notebook (http://ipython.org/notebook.html). The Python community has made huge improvements in interactive programming in the past decade. Python also has an R-like DataFrame (http://pandas.pydata.org/), a vast statistics library (http://scipy.org/), and even an implementation of ggplot2 (http://blog.yhathq.com/posts/ggplot-for-python.html).

      I think if you looked more into Python, you wouldn’t find it as lacking as you think it is.


      • I have to admit that iPython Notebook is pretty amazing. RStudio provides a notebook system based on Markdown, but it’s not reactive like iPython Notebook is. Hopefully a iPython-like notebook will soon come to R (see http://www.youtube.com/embed/3niqZhc_Nbo for an encouraging step in that direction).

      • Hi Randy. I agree, in fact I’m a fan of Python. I just think Python is not very well suited for statistical programming. I’m not saying this just based on your examples, but on my personal experience with both languages. It’s true that there is an element of subjectivity to it, however consider something like creating a named list. In R, is as simple as it gets list(a=1, b=2), in Python it’s cumbersome: you have to pass 2 lists as arguments, one with values and another one with names. This is a consequence of how the language is designed and no library can change that. I have used Scipy in the past but always end up with the feeling that the resulting code is very un-Pythonic and ugly. It’s not your programming style what is at fault. That said, I think that the comparison is quite interesting and I’m looking forward to the next episode :)

      • If I understand your example right, a named list can be handled by a dict. In Python, you can declare dicts with: dict(a=1, b=2) etc. Not much different from the named list in R. If you can provide some example scipy code that you struggled with, I’d be happy to take a look and make suggestions. Cheers.

      • Yes, Python dicts are the closest equivalent to R lists, but the problem is Python dicts are not ordered. Sometimes it doesn’t matter, sometimes it does. And dicts are used internally by Python to handle keyword arguments, so when you call a function and the order of the arguments matters, you can’t use keyword arguments. I don’t have a concrete example of SciPy code to show you right now. The thing I dislike about SciPy is that basically you have to use SciPy data types instead of pure Python data types, and this adds a layer of complexity. Also SciPy types are very low-level, a little out of place in a high-level language. In R the programmer doesn’t have to worry about floats, ints, and arrays; everything is an array. Another thing that bothered me is that when programming with SciPy you always tend to try to avoid loops, for performance reasons, even though loops are one of Python’s strengths compared to R. So it’s a little bit the worst of both worlds. Just my opinion, hehe.

      • When I said “you try to avoid loops” I didn’t mean you personally, Randy, I meant in general.

      • Actually Python provides ordered dictionaries since 2.7

        >>> from collections import OrderedDict
        >>> OrderedDict(a=1, b=2)
        OrderedDict([('a', 1), ('b', 2)])
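        (An editor’s aside for readers on newer Pythons: since CPython 3.6, and as a language guarantee from Python 3.7 onward, plain dicts preserve insertion order, so this particular pain point has since disappeared:)

```python
# On Python 3.7+, insertion order of a plain dict is guaranteed
d = dict(a=1, b=2, c=3)
print(list(d))  # ['a', 'b', 'c']
```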

  4. Nice example, guys. I like the idea a lot. But… why is the R code so complicated-looking? Why not start by stripping off all the unnecessary semicolons at the end of each line (none are necessary). Then with a few uses of tools like with() and within() you can just about get rid of all the ugliness and leave the logic of the code pretty apparent. For this purpose code has to be readable as well as work. (I am not competent to comment on the Python code in this way, but I think I get the hang of how it goes.)

    Again, nice idea. When do you plan Round 2? I’m keen to see it…

    • Thanks for your message Bill!

      Semicolons are just a bad habit (some would argue that it’s a good one, actually) that I have from programming in C and Java, both of which I learned before R. If I find 5 minutes, I’ll try to strip them from the code online. Also, I’m not used to using with and within; I’ll look into those too.

      Next round should be online this weekend if Randy and I manage to find the time (grant application deadlines for me, paper submission deadline for Randy). It will be about scraping data from websites.

      • Here is an example of what I mean

        my.bloody.theme2 <- within(trellis.par.get(), {
        axis.line$col <- NA # Remove axes
        plot.polygon$col <- '#8A0606' # Set bar colors to a nice bloody red
        plot.polygon$border <- NA # Remove bars' outline
        axis.text$cex <- 1 # Default axis text size is a bit small. Make it bigger
        layout.heights$bottom.padding <- 0 # Remove bottom padding
        layout.heights$axis.bottom <- 0 # Remove axis padding at the bottom of the graph
        layout.heights$axis.top <- 0 # Remove axis padding at the top of the graph
        })

        That looks much less off-putting to me…

      • Thanks for the example. This is indeed more elegant. I’m still a young padawan and I have a lot to learn before becoming an R Jedi :-)

      • PS you might want to look at paste0(…), a tiny convenience function equivalent to paste(…, sep = “”), but every little bit helps!

      • Awesome! Thanks! This is what happens when one doesn’t read past the first line of the documentation :-)

      • Amazing! Thank you so much! I was about to start cleaning up the code and re-uploading everything. This will be certainly very helpful.

      • I want to thank you again for your code, I’ve learned some very useful things thanks to it. The “within” function indeed makes the code more readable. Unfortunately it doesn’t play well with “knitr::spin”, especially when I have to include Python blocks or Roxygen style comments within a “within” call in order to keep the comparison between the two languages easy to follow.

      • Dommage! You win some …

        I guess you have to decide between readable code and ‘literate programming’. Now that’s an unusual choice to have to make.

        Maybe this is one to raise with Xie Yihui

      • I’ll update the code in the post tomorrow with a somewhat cleaner version considering the incompatibilities with “knitr::spin”. And I’ll also post on GitHub the Venables Approved code :-)

      • PS (final, I promise!) On Windows 8.x (not that I would ever admit to using such a demented OS) you may need to specify mode = “wb” on the download.file() call to retrieve the image. Otherwise the download can be munged.

        You will also notice that I am not a fan of using “=” for all assignments. There are several kinds of assignment in R and it is useful to keep them separate. Also, there are occasions using within() where you cannot get away with using “=” for assignment. It is the classic ‘gotcha’.

      • No worries. I have a lot of habits (good and bad) from using other languages regularly (mainly MATLAB and Java, but also C when I’m really bored). It’s always good to be reminded that there are standards and that there are very good reasons for using them :-)

      • The code has been updated on the post (R and Python). Not perfect, but it was the best I could do to keep the R/Python comparison easy to follow.

  5. This is a very nice comparison. I am looking forward to the next round. I am an R programmer and don’t know Python. It looks like Python needs more code to create a simple figure.

    • Thanks!

      I think it’s about the same. It mostly depends on how much customization you add to your plot and how you organize it.

    • It appears to be the same for me. You must also consider that Python is not a domain specific language, so it is a feat in itself that it can be compared to one at all. You can also make Python’s matplotlib plotting package look like ggplot, or change its defaults, by editing its matplotlibrc file. I have already come to the conclusion that the “effort” of creating charts or doing data analysis is similar between Python and R. The major differentiator that I see is the availability of specialized statistical packages. This is where R wins. Where Python wins is that it is a general purpose language; there are just so many use cases for Python.
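      (For instance, a handful of lines in a matplotlibrc file are enough to shift the defaults toward a ggplot-like look. A sketch: the keys below are standard matplotlib rc parameters, but the values are just one plausible choice.)

```
# matplotlibrc (location varies by platform; see matplotlib's docs)
axes.facecolor  : E5E5E5    # light grey panel background, ggplot-style
axes.grid       : True      # draw a grid by default
grid.color      : white
grid.linestyle  : -
xtick.color     : 555555
ytick.color     : 555555
```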

  6. Great challenge. Found this blogpost through a tweet by Miguel de Icaza where he asked for a F# solution to this problem and so I figured I should give it a try.

    Here is a small blogpost with my version

    F# has a couple of features that fit this kind of problem nicely but impressed by both R and Python

    Looking forward to the next challenge.

  7. This is a great idea. Coincidently I have been exploring Python plus pandas + other modules as a possible replacement for R. Cannot see it happening yet though… R has too many advantages. The R Code in Round 1 runs without a hitch, but I believe the Python code is not correct and/or incomplete. I produce an error with:
    body_count_ y-axis ticks on the left and x-axis ticks on the bottom

    The above seems to be combined comment/code??
    What am I doing wrong?

  8. Considering lines of code is similar bodes well for Python since it is not a domain specific language. But Python will still lag behind in terms of availability of statistical packages. Nice write-up, can’t wait for more rounds!

  9. This one snippet fails to use one of Python’s best features, list comprehensions.

    full_title = [] # whenever I see a line like this, red flags start going off

    for film, year in zip(body_count_data["Film"].values, body_count_data["Year"].values):
        full_title.append(film + " (" + str(year) + ")")
    should instead be this:
    full_title = [film + " (" + str(year) + ")"
                  for (film, year) in zip(body_count_data["Film"].values,
                                          body_count_data["Year"].values)]
    I’m going to assume my spacing will be ignored when I hit submit, which will make this a lot harder to read, so maybe paste this into a text editor and align everything yourself.

    • While I agree that list comprehensions are awesome, I personally think they’re terrible for teaching. Any newbie programmer can follow along with a simple for loop, but that list comprehension requires a bit more cognitive load that may just be too much for them. Hence the preference to go with the simpler code.

  10. I am having to do some statistics, and being a Python programmer, I have stuck with Python as it seems to have what extras I need.

    Many others might find that a language that covers more domains allows them to get things done without having to learn too many new languages from scratch. If the task is to read and condition the data from a serially attached peripheral, do the stats, then control a robot, then you are more likely to be able to do it all in Python rather than R.

    • I agree with you that a versatile language is always good to have. However I disagree with you when you say R is less versatile than Python. R is also an all purpose language, but because it was born in a statistical context, people believe that it is not good for anything else. It’s a cultural misunderstanding rather than a truth rooted in facts.

      For instance, on Reddit someone claimed that it would be much harder to scrape the data that we used in this post using R (I think the word “nightmare” was used to describe the process). I think we proved him/her wrong in the second installment of our challenge series (see here: http://www.theswarmlab.com/r-vs-python-round-2/). And in the particular case that you describe (reading data from a peripheral and sending orders back to a robot through the same or another peripheral), I can assure you it’s possible in R without too much effort. I know it because I did it several years ago (reading, processing and sending back data from and to a robotic platform via a serial port). And at the time, I was just starting to learn R.

    • Thanks! I’m on my phone now, I’ll post a link to your code in the main text as soon as possible.

      BTW Round 2 is already online :-)

  11. Stata version:

    insheet using http://files.figshare.com/1332945/film_death_counts.csv, clear
    generate title = film + " (" + string(year) + ")"
    generate dpm = body_count / length
    gsort - dpm
    graph hbar (mean) dpm if _n <= 25, over(title, sort(dpm) descending ///
    label(labsize(vsmall))) blabel(bar, format(%3.2f)) ///
    ytitle(Deaths per minute) ylabel(0(1)6) ///
    title(25 most violence packed films by deaths per minute)

  12. This is great! I’m in the awkward position of having learnt R very well without ever having used Python. Now, I feel like it would benefit me to have the experience in Python, but whenever I try to practice it I give up because I can already do it in R.

    Is there any point to investing in Python if you’re already good with R, beyond collaboration with other Python programmers?

    • I believe that knowing how to code in different languages is always a beneficial thing, like being able to speak multiple languages. It always comes in handy when you start interacting with the code of other programmers. Also some languages are better suited than others for certain tasks. Knowing at least the basic of the most popular languages (Python, R, Java, C++, etc.) will give you more flexibility in your workflow. However for your everyday work, stick with what you know best, it’s probably more efficient.

    • Yes, I think there is. When you learn a new language, even one quite similar to one you already know, you get a new perspective in some sense, and you get better at programming and you get better at learning new languages – something I have had to do since the 60s (yes, Fortran 66 was my first language) and that you folks will have to do with increasing flexibility into the future.

      Python has a lot of strengths that R lacks, and conversely. The two overlap a lot, but they also complement. It’s a good step.

      In a sense it’s like asking “how much time should I spend optimising my R code?” The conventional answer is “only as long as you need to make it work acceptably fast”. In general this is a sound principle, but it overlooks the fact that even silly optimisations often lead to a clearer understanding of what is really going on under the bonnet, and in this game any learning experience can be valuable.
