Data Manipulation in R with dplyr – Part 1

dplyr is one of the packages in R that makes R so loved by data scientists. It has three main goals:

  1. Identify the most important data manipulation tools needed for data analysis and make them easy to use in R.
  2. Provide blazing fast performance for in-memory data by writing key pieces of code in C++.
  3. Use the same code interface to work with data no matter where it’s stored, whether in a data frame, a data table or database.

Introduction to the dplyr package and the tbl class
This post is mostly about code. If you’re interested in learning dplyr I recommend you type in the commands line by line on the R console to see first hand what’s happening.

# INTRODUCTION TO dplyr AND tbls
# Load the dplyr package
library(dplyr)
# Load the hflights package
library(hflights)
# Call both head() and summary() on hflights
head(hflights)
summary(hflights)
# Convert the hflights data.frame into a hflights tbl
hflights <- tbl_df(hflights)
# Display the hflights tbl
hflights
# Create the object carriers, containing only the UniqueCarrier variable of hflights
carriers <- hflights$UniqueCarrier
# Use lut to translate the UniqueCarrier column of hflights and before doing so
# glimpse hflights to see the UniqueCarrier variablle
glimpse(hflights)
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
"DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
"WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
"FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
hflights$UniqueCarrier <- lut[hflights$UniqueCarrier]
# Now glimpse hflights to see the change in the UniqueCarrier variable
glimpse(hflights)
# Fill up empty entries of CancellationCode with 'E'
# To do so, first index the empty entries in CancellationCode
cancellationEmpty <- hflights$CancellationCode == ""
# Assign 'E' to the empty entries
hflights$CancellationCode[cancellationEmpty] <- 'E'
# Use a new lookup table to create a vector of code labels. Assign the vector to the CancellationCode column of hflights
lut = c('A' = 'carrier', 'B' = 'weather', 'C' = 'FFA', 'D' = 'security', 'E' = 'not cancelled')
hflights$CancellationCode <- lut[hflights$CancellationCode]
# Inspect the resulting raw values of your variables
glimpse(hflights)
view raw introduction.R hosted with ❤ by GitHub

Select and mutate
dplyr provides grammar for data manipulation apart from providing data structure. The grammar is built around 5 functions (also referred to as verbs) that do the basic tasks of data manipulation.

The 5 verbs of dplyr
select – removes columns from a dataset
filter – removes rows from a dataset
arrange – reorders rows in a dataset
mutate – uses the data to build new columns and values
summarize – calculates summary statistics

dplyr functions do not change the dataset. They return a new copy of the dataset to use.

To answer the simple question whether flight delays tend to shrink or grow during a flight, we can safely discard a lot of the variables of each flight. To select only the ones that matter, we can use select()

hflights[c('ActualElapsedTime','ArrDelay','DepDelay')]
# Equivalently, using dplyr:
select(hflights, ActualElapsedTime, ArrDelay, DepDelay)
# Print out a tbl with the four columns of hflights related to delay
select(hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay)
# Print out hflights, nothing has changed!
hflights
# Print out the columns Origin up to Cancelled of hflights
select(hflights, Origin:Cancelled)
# Find the most concise way to select: columns Year up to and
# including DayOfWeek, columns ArrDelay up to and including Diverted
# Answer to last question: be concise!
# You may want to examine the order of hflight's column names before you
# begin with names()
names(hflights)
select(hflights, -(DepTime:AirTime))
view raw verbs01.R hosted with ❤ by GitHub

dplyr comes with a set of helper functions that can help you select variables. These functions find groups of variables to select, based on their names. Each of these works only when used inside of select()

  • starts_with(“X”): every name that starts with “X”
  • ends_with(“X”): every name that ends with “X”
  • contains(“X”): every name that contains “X”
  • matches(“X”): every name that matches “X”, where “X” can be a regular expression
  • num_range(“x”, 1:5): the variables named x01, x02, x03, x04 and x05
  • one_of(x): every name that appears in x, which should be a character vector
# Helper functions used with dplyr
# Print out a tbl containing just ArrDelay and DepDelay
select(hflights, ArrDelay, DepDelay)
# Use a combination of helper functions and variable names to print out
# only the UniqueCarrier, FlightNum, TailNum, Cancelled, and CancellationCode
# columns of hflights
select(hflights, UniqueCarrier, FlightNum, contains("Tail"), contains("Cancel"))
# Find the most concise way to return the following columns with select and its
# helper functions: DepTime, ArrTime, ActualElapsedTime, AirTime, ArrDelay,
# DepDelay. Use only helper functions
select(hflights, ends_with("Time"), ends_with("Delay"))
view raw verbs02.R hosted with ❤ by GitHub

In order to appreciate the usefulness of dplyr, here are some comparisons between base R and dplyr

# Some comparisons to basic R
# both hflights and dplyr are available
ex1r <- hflights[c("TaxiIn","TaxiOut","Distance")]
ex1d <- select(hflights, TaxiIn, TaxiOut, Distance)
ex2r <- hflights[c("Year","Month","DayOfWeek","DepTime","ArrTime")]
ex2d <- select(hflights, Year:ArrTime, -DayofMonth)
ex3r <- hflights[c("TailNum","TaxiIn","TaxiOut")]
ex3d <- select(hflights, TailNum, contains("Taxi"))
view raw comparisons01.R hosted with ❤ by GitHub

mutate() is the second of the five data manipulation functions. mutate() creates new columns which are added to a copy of the dataset.

# Add the new variable ActualGroundTime to a copy of hflights and save the result as g1.
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)
# Add the new variable GroundTime to a g1. Save the result as g2.
g2 <- mutate(g1, GroundTime = TaxiIn + TaxiOut)
# Add the new variable AverageSpeed to g2. Save the result as g3.
g3 <- mutate(g2, AverageSpeed = Distance / AirTime * 60)
# Print out g3
g3
view raw verbs03.r hosted with ❤ by GitHub

So far we have added variables to hflights one at a time, but we can also use mutate() to add multiple variables at once.

# Add a second variable loss_percent to the dataset: m1
m1 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_percent = ((ArrDelay - DepDelay)/DepDelay)*100)
# mutate() allows you to use a new variable while creating a next variable in the same call
# Copy and adapt the previous command to reduce redendancy: m2
m2 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_percent = (loss/DepDelay) * 100 )
# Add the three variables as described in the third instruction: m3
m3 <- mutate(hflights, TotalTaxi = TaxiIn + TaxiOut, ActualGroundTime = ActualElapsedTime - AirTime, Diff = TotalTaxi - ActualGroundTime)
view raw verbs04.r hosted with ❤ by GitHub
Advertisement

Statistical Learning – 2016

On January 12, 2016, Stanford University professors Trevor Hastie and Rob Tibshirani will offer the 3rd iteration of Statistical Learning, a MOOC which first began in January 2014, and has become quite a popular course among data scientists. It is a great place to learn statistical learning (machine learning) methods using the R programming language. For a quick course on R, check this out – Introduction to R Programming

Slides and videos for Statistical Learning MOOC by Hastie and Tibshirani available separately here. Slides and video tutorials related to this book by Abass Al Sharif can be downloaded here.

The course covers the following book which is available for free as a PDF copy.

Logistics and Effort:

statLearnEffort

Rough Outline of Schedule (based on last year’s course offering):

Week 1: Introduction and Overview of Statistical Learning (Chapters 1-2)
Week 2: Linear Regression (Chapter 3)
Week 3: Classification (Chapter 4)
Week 4: Resampling Methods (Chapter 5)
Week 5: Linear Model Selection and Regularization (Chapter 6)
Week 6: Moving Beyond Linearity (Chapter 7)
Week 7: Tree-based Methods (Chapter 8)
Week 8: Support Vector Machines (Chapter 9)
Week 9: Unsupervised Learning (Chapter 10)

Prerequisites: First courses in statistics, linear algebra, and computing.

 

Troubleshooting ‘Rattle’ (R library) Installation on Ubuntu

This post pertains to Ubuntu / Debian users only.

rattle is a free graphical interface for data mining with R. I wanted to visualize decision trees and had to install this library.
> install.packages('rattle')
got me the following error message:

configure: error: GTK version 2.8.0 required
ERROR: configuration failed for package ‘RGtk2’

rattle_installationNonZeroExit

This error occurs when attempting to install the RGtk2 package. The install is looking for the header files for GTK. Possibly they are not yet. Luckily the problem can be solved quite easily. Open Terminal (Ctrl + Alt + T) and type in the following commands:


sudo apt-get update
wajig install libgtk2.0-dev

Go back and try installing rattle now with the same command as earlier. It should work. It did for me! As you can see below, decision trees are visualized lot better with rattle than if you used just rpart.

rattle

Getting Started with R on MIT’s 14.74x (Foundations of Development Policy)

I noticed that a major grievance of many students enrolled in MIT‘s latest edX course on development policy (Foundations of Development Policy: Advanced Development Economics) was that there wasn’t enough done to get them going with the R assignments. I have posted the R code for the homework (past the deadline, of course) of the first 2 weeks, so that others get a hang of the level of R that might be needed to solve these assignments in the following weeks. I’m willing to help out those needing help getting up to speed with R required for this course. For specific queries, leave your message in the comments section.

A great place to get spend time learning R before taking Foundations of Development Policy (14.74x) would be another edX course that’s been getting great reviews recently: Introduction to R Programming

R Code for Home Work (Week 1)

# set working directory to local directory where the data is kept
setwd("~/IGIDR/Development Economics - MIT/Homework Assignment 01")
# read the data
wb_dev_ind = read.csv("wb_dev_ind.csv")
# summarize data
summary(wb_dev_ind)
# Question 1
# What is the Mean of GDP per capita? What is the standard deviation of GDP per capita?
meanGDPperCapita = mean(wb_dev_ind$gdp_per_capita, na.rm = TRUE)
print(round(meanGDPperCapita))
sdGDPperCapita = sd(wb_dev_ind$gdp_per_capita, na.rm = TRUE)
print(round(sdGDPperCapita))
# Question 2
# What is the mean illiteracy rate across all countries? What is the standard deviation?
illiteracy_all = numeric(nrow(wb_dev_ind))
wb_dev_ind$illiteracy_all = illiteracy_all
wb_dev_ind$illiteracy_all = 100 - wb_dev_ind$literacy_all
meanIlliteracy = mean(wb_dev_ind$illiteracy_all, na.rm = TRUE)
print(round(meanIlliteracy))
sdIlliteracy = sd(wb_dev_ind$illiteracy_all, na.rm = TRUE)
print(round(sdIlliteracy))
# Question 3
# What is the mean infant mortality rate across all countries? What is the standard deviation?
meanInfantMortality = mean(wb_dev_ind$infant_mortality, na.rm = TRUE)
print(round(meanInfantMortality))
sdInfantMortality = sd(wb_dev_ind$infant_mortality, na.rm = TRUE)
print(round(sdInfantMortality))
# Question 4
# What is the mean male illiteracy rate? What is the mean female illiteracy rate?
illiteracy_male = numeric(nrow(wb_dev_ind))
wb_dev_ind$illiteracy_male = illiteracy_male
wb_dev_ind$illiteracy_male = 100 - wb_dev_ind$literacy_male
meanIlliteracyMale = mean(wb_dev_ind$illiteracy_male, na.rm = TRUE)
print(round(meanIlliteracyMale))
sdIlliteracyMale = sd(wb_dev_ind$illiteracy_male, na.rm = TRUE)
print(round(sdIlliteracyMale))
illiteracy_female = numeric(nrow(wb_dev_ind))
wb_dev_ind$illiteracy_female = illiteracy_female
wb_dev_ind$illiteracy_female = 100 - wb_dev_ind$literacy_female
meanIlliteracyFemale = mean(wb_dev_ind$illiteracy_female, na.rm = TRUE)
print(round(meanIlliteracyFemale))
sdIlliteracyFemale = sd(wb_dev_ind$illiteracy_female, na.rm = TRUE)
print(round(sdIlliteracyFemale))
# Question 5
# What are the mean, minimum, and maximum illiteracy rate among the 50 richest countries
richest50 = wb_dev_ind[order(wb_dev_ind$gdp_per_capita, decreasing = TRUE),][1:50,]
summary(richest50)
# Question 6
# What are the mean, minimum, and maximum illiteracy rate among the 50 poorest countries?
poorest50 = wb_dev_ind[order(wb_dev_ind$gdp_per_capita),][1:50,]
summary(poorest50)
# Question 7
# What are the mean, minimum, and maximum infant mortality rate among the 50 richest countries?
summary(richest50)
# Question 8
# What are the mean, minimum, and maximum infant mortality rate among the 50 poorest countries?
summary(poorest50)
# Question 9
# What is the median GDP per capita?
summary(wb_dev_ind)
# Question 10-12
# Regress the infant mortality rate on per capita GDP, and then answer questions 10-12
model1 = lm(infant_mortality ~ gdp_per_capita, data = wb_dev_ind)
summary(model1)
# Question 13
# Regress the illiteracy rate on GDP per capita. Is the coefficient on per capita GDP significantly different from zero at the 5% level?
model2 = lm(illiteracy_all ~ gdp_per_capita, data = wb_dev_ind)
summary(model2)
# Question 14
# Regress the infant mortality rate on the illiteracy rate. Graph a scatter plot of the data as well as the regression line.
model3 = lm(infant_mortality ~ illiteracy_all, data = wb_dev_ind)
summary(model3)
plot(wb_dev_ind$illiteracy_all, wb_dev_ind$infant_mortality)
abline(model3)
view raw HW01.R hosted with ❤ by GitHub

R Code for Home Work (Week 2)

# Set working directory to local directory where the data is kept
setwd("~/IGIDR/Development Economics - MIT/Homework Assignment 02")
# read data
migueldata = read.csv("ted_miguel_worms.csv", header = TRUE)
attach(migueldata)
# Question 6
# How many observations are there per pupil? (Enter a whole number of 0 or higher)?
length(migueldata$pupid)
length(unique(migueldata$pupid))
# Question 7
# What percentage of the pupils are boys? (Answers within 0.50 percentage points of the correct answer will be accepted. For instance, 67 would be accepted if the correct answer is 67.45%)
mean(sex, na.rm = TRUE)
# Question 8
# What percentage of pupils took the deworming pill in 1998? (Answers within 0.50 percentage points of the correct answer will be accepted. For instance, 67 would be accepted if the correct answer is 67.45%)
mean(pill98, na.rm = TRUE)
# Question 9
# Was the percentage of schools assigned to treatment in 1998 greater than or less than the percentage of pupils that actually took the deworming pill in 1998?
mean(treat_sch98, na.rm = TRUE)
mean(treat_sch98, na.rm = TRUE) > mean(pill98, na.rm = TRUE) # Ans = Greater Than
# Question 10
# Which of the following variables from the dataset are dummy variables? (Check all that apply.)
summary(migueldata)
# Question 11
# Using the data, find and enter the difference in outcomes (Y: school participation) between students who took the pill and students who did not in 1998. (Enter your answer as a difference in proportions. For instance, if the proportion in one group is 0.61 and the proportion in the other group is 0.54, enter 0.07. Answers within 0.05 of the correct answer will be accepted. For instance, 0.28 would be accepted if the correct answer is 0.33.)
took_pill_98 = mean(migueldata[migueldata$pill98 == 1,]$totpar98, na.rm = TRUE)
no_pill_98 = mean(migueldata[migueldata$pill98 == 0,]$totpar98, na.rm = TRUE)
diff = took_pill_98 - no_pill_98
diff
# Question 12
# Since schools were randomly assigned to the deworming treatment group, the estimate calculated in the previous answer is an unbiased estimate of taking the pill on school attendance.
# False
# Explanation
# The estimated impact of 13 percentage points calculated in the previous answer might not be a good estimate of the effect of taking the pill. Many students in the randomly assigned treatment schools did not actually take the pills, so those who took the pills would not have been randomly selected at all. For instance, kids who attend school more anyway might have been more likely to be there when the pills were handed out, meaning that omitted variables would be correlated with taking the pill and future school attendance. This would bias the estimate upward i.e. the 13 percentage point difference might overstate the impact of deworming on attendance.
# Question 13
# Using the data, find and enter the difference in outcomes (Y: school participation) between students in treatment schools and students not in treatment schools in 1998, regardless of whether or not they actually took the pill. (Enter your answer as a difference in proportions. For instance, if the proportion in one group is 0.61 and the proportion in the other group is 0.54, enter 0.07. Answers within 0.05 of the correct answer will be accepted. For instance, 0.28 would be accepted if the correct answer is 0.33.)
in_treatment_sch = mean(migueldata[migueldata$treat_sch98 == 1,]$totpar98, na.rm = TRUE)
non_treatment_sch = mean(migueldata[migueldata$treat_sch98 == 0,]$totpar98, na.rm = TRUE)
diff_treatment_sch = in_treatment_sch - non_treatment_sch
diff_treatment_sch
# Question 14
# Using the data, calculate the difference in the probability of taking the pill given that a student was in a treatment school and the probability of taking it if a student was not in a treatment school. (Enter your answer as a difference in proportions. For instance, if the proportion in one group is 0.61 and the proportion in the other group is 0.54, enter 0.07. Answers within 0.05 of the correct answer will be accepted. For instance, 0.28 would be accepted if the correct answer is 0.33.)
pr_pill_treatment_sch = mean(migueldata[migueldata$treat_sch98 == 1,]$pill98, na.rm = TRUE)
pr_pill_no_treatment_sch = mean(migueldata[migueldata$treat_sch98 == 0,]$pill98, na.rm = TRUE)
diff_pr_pill_treatment_sch = pr_pill_treatment_sch - pr_pill_no_treatment_sch
# Question 15
# Using the data, derive the Wald Estimator of taking the pill on school attendance. (Enter your answer as a difference in proportions. For instance, if the proportion in one group is 0.61 and the proportion in the other group is 0.54, enter 0.07. Answers within 0.05 of the correct answer will be accepted. For instance, 0.28 would be accepted if the correct answer is 0.33.)
waldRatio = diff_treatment_sch/diff_pr_pill_treatment_sch
waldRatio
view raw HW02.R hosted with ❤ by GitHub

I hope this helps!

R — The Big Mover in IEEE Spectrum’s 2015 Rankings for Top 10 Programming Languages

The column on the left is the 2015 ranking; the column on the right is the 2014 ranking for comparison:

top-tech-rankings

source: The 2015 Top Ten Programming Languages

The thing to note is that the top 5 languages haven’t budged from their positions. R has pushed past PHP, JavaScirpt and Ruby, which have maintained their relative positions.  So this year’s rankings have been about R moving forward.

Introducing cricketr! : An R package to analyze performances of cricketers

Wicked! Or must I say ‘howzzat!?’

Giga thoughts ...

Yet all experience is an arch wherethro’
Gleams that untravell’d world whose margin fades
For ever and forever when I move.
How dull it is to pause, to make an end,
To rust unburnish’d, not to shine in use!

Ulysses by Alfred Tennyson

Introduction

This is an initial post in which I introduce a cricketing package ‘cricketr’ which I have created. This package was a natural culmination to my earlier posts on cricket and my completing 9 modules of Data Science Specialization, from John Hopkins University at Coursera. The thought of creating this package struck me some time back, and I have finally been able to bring this to fruition.

So here it is. My R package ‘cricketr!!!’

This package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package only uses data from test cricket. I plan to develop functionality for One-day and…

View original post 4,951 more words

Solution to [Viral] Math Puzzle for Vietnamese Eight-Year-Olds (Using R)

There was this math problem that went viral recently –  it was a Singapore Math Olympiad problem meant for 14-year olds which surprisingly many adults couldn’t solve. A friend sent me a link to that problem and I solved it with pen and paper in about 10 minutes, leaving me feeling hardly challenged. I guess really-tough questions aren’t the ones that actually go viral. If anything, a math problem going viral means it caters to an average IQ, something like a 100 – which then left me wondering whatever my IQ was! I had taken many online IQ tests when I was pursuing my engineering degree and my scores ranged between 135 and 145. But they weren’t tests one could really feel great about, mainly because everyone I knew who took them, scored 110+. I never took a Mensa test in the past, so I’m not sure whether I’d have made the cut. Anyway, I decided to check Quora to see if the nerds had found anything remotely mimicking a Mensa type IQ test. So I checked out this one and got a 130. It seems to be the least I’ve ever got on an IQ test, so without giving you any rationale to why I feel this might be closer to accurately measuring your IQ, I say it’s probably worth a shot.

iq_test_score

 

A week back, there was yet another math puzzle that had gone viral, meant for Vietnamese eight-year-olds, a problem that had stumped parents and teachers. You need to fill in the empty boxes below with the digits from 1 to 9 so that the equation makes sense, following the order of operations – multiply first, then division, addition and subtraction last. Apparently, this problem was for third graders in the town of Bao Loc in the Vietnamese Highlands.

I didn’t solve this one with pen and paper, and instead wrote an R program. It’s clearly a problem with a fixed number of variables (nine) associated with standard math operators, meaning there would be 9! (362880) permutations – which makes you think there’s got to be more than one solution. It became obvious I had to write some code.

I wrote a function appropriately named baoloc() to list out the solutions to this problem. The code is self explanatory. I got 128 unique solutions, which means if the students had to solve this problem with pencil on paper, they could get one of 128 possible answers, which makes the job of the examiner arduous and boring!

## The problem could be written as u + 13v/w + x + 12y - z - 11 + pq/r -10 = 66
## which reduces to u + 13v/w + x + 12y - z + pq/r = 87
## This problem was solved on [1] "Wed May 27 17:46:52 2015"
baoloc <- function()
{
packages <- rownames(installed.packages())
if("combinat" %in% packages) library("combinat") else install.packages("combinat")
numbers <- 1:9
permutations <- permn(numbers) ## list of all permutations of vector input
solutions <- numeric(9)
for(i in 1:length(permutations))
{
solution <- permutations[[i]]
if(solution[1] + 13*(solution[2]/solution[3]) + solution[4] + 12*solution[5] - solution[6] + (solution[7]*solution[8] / solution[9]) == 87)
{
solutions <- rbind(solutions, solution)
}
}
print("The number of solutions are:")
print(nrow(solutions)-1)
return(solutions[2:nrow(solutions),])
}
view raw baoloc.R hosted with ❤ by GitHub

The above code produces the following solutions:
[1] "The number of solutions are:"
[1] 128
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
solution 9 1 2 5 6 7 3 4 8
solution 7 9 6 1 5 2 3 4 8
solution 1 9 6 7 5 2 3 4 8
solution 1 5 2 3 4 8 7 9 6
solution 1 5 2 3 4 8 9 7 6
solution 5 1 2 9 6 7 3 4 8
solution 5 1 2 9 6 7 4 3 8
solution 1 5 2 8 4 7 3 9 6
solution 1 5 2 8 4 7 9 3 6
solution 1 9 6 7 5 2 4 3 8
solution 7 9 6 1 5 2 4 3 8
solution 9 1 2 5 6 7 4 3 8
solution 1 2 6 4 7 8 5 3 9
solution 1 2 6 4 7 8 3 5 9
solution 1 4 8 2 7 9 3 5 6
solution 1 4 8 2 7 9 5 3 6
solution 7 1 4 9 6 5 2 3 8
solution 9 1 4 7 6 5 2 3 8
solution 5 4 1 9 2 7 8 3 6
solution 5 4 1 9 2 7 3 8 6
solution 9 4 1 5 2 7 8 3 6
solution 9 4 1 5 2 7 3 8 6
solution 4 9 6 1 5 8 3 7 2
solution 4 9 6 1 5 8 7 3 2
solution 8 6 4 7 5 9 1 3 2
solution 7 6 4 8 5 9 1 3 2
solution 9 4 8 5 6 7 1 3 2
solution 5 4 8 9 6 7 1 3 2
solution 1 9 6 4 5 8 7 3 2
solution 1 9 6 4 5 8 3 7 2
solution 9 1 4 7 6 5 3 2 8
solution 7 1 4 9 6 5 3 2 8
solution 1 3 9 4 7 8 2 5 6
solution 1 3 4 7 6 5 2 9 8
solution 1 3 4 7 6 5 9 2 8
solution 1 3 9 4 7 8 5 2 6
solution 1 5 3 9 4 2 7 8 6
solution 1 5 3 9 4 2 8 7 6
solution 1 3 2 9 5 6 7 4 8
solution 1 3 2 9 5 6 4 7 8
solution 1 3 6 2 7 9 5 4 8
solution 1 3 6 2 7 9 4 5 8
solution 1 3 2 4 5 8 7 9 6
solution 1 3 2 4 5 8 9 7 6
solution 6 3 1 9 2 5 8 7 4
solution 6 3 1 9 2 5 7 8 4
solution 9 3 1 6 2 5 8 7 4
solution 9 3 1 6 2 5 7 8 4
solution 7 3 1 5 2 6 8 9 4
solution 7 3 1 5 2 6 9 8 4
solution 5 3 1 7 2 6 9 8 4
solution 5 3 1 7 2 6 8 9 4
solution 9 5 3 1 4 2 7 8 6
solution 9 5 3 1 4 2 8 7 6
solution 3 1 4 2 7 9 6 5 8
solution 3 1 4 2 7 9 5 6 8
solution 7 3 4 1 6 5 9 2 8
solution 7 3 4 1 6 5 2 9 8
solution 3 6 4 9 5 8 1 7 2
solution 3 6 4 9 5 8 7 1 2
solution 5 4 8 9 6 7 3 1 2
solution 9 4 8 5 6 7 3 1 2
solution 7 6 4 8 5 9 3 1 2
solution 8 6 4 7 5 9 3 1 2
solution 9 6 4 3 5 8 1 7 2
solution 9 6 4 3 5 8 7 1 2
solution 4 3 9 1 7 8 5 2 6
solution 4 3 9 1 7 8 2 5 6
solution 4 3 2 1 5 8 9 7 6
solution 4 3 2 1 5 8 7 9 6
solution 3 2 4 8 5 1 7 9 6
solution 3 2 4 8 5 1 9 7 6
solution 5 9 3 6 2 1 8 7 4
solution 5 9 3 6 2 1 7 8 4
solution 3 5 2 1 4 8 9 7 6
solution 3 5 2 1 4 8 7 9 6
solution 6 9 3 5 2 1 7 8 4
solution 6 9 3 5 2 1 8 7 4
solution 3 9 6 2 5 1 7 4 8
solution 3 9 6 2 5 1 4 7 8
solution 3 2 8 6 5 1 7 9 4
solution 3 2 8 6 5 1 9 7 4
solution 7 3 2 8 5 9 6 1 4
solution 8 3 2 7 5 9 6 1 4
solution 8 3 2 7 5 9 1 6 4
solution 7 3 2 8 5 9 1 6 4
solution 3 2 1 5 4 7 8 9 6
solution 3 2 1 5 4 7 9 8 6
solution 3 9 2 8 1 5 7 6 4
solution 9 3 2 1 5 6 7 4 8
solution 3 9 2 8 1 5 6 7 4
solution 9 3 2 1 5 6 4 7 8
solution 2 3 6 1 7 9 4 5 8
solution 2 3 6 1 7 9 5 4 8
solution 8 9 2 3 1 5 6 7 4
solution 8 9 2 3 1 5 7 6 4
solution 2 9 6 3 5 1 4 7 8
solution 2 9 6 3 5 1 7 4 8
solution 6 2 8 3 5 1 9 7 4
solution 6 2 8 3 5 1 7 9 4
solution 7 2 8 9 6 5 3 1 4
solution 9 2 8 7 6 5 3 1 4
solution 8 7 2 5 3 9 6 1 4
solution 8 7 2 5 3 9 1 6 4
solution 5 7 2 8 3 9 1 6 4
solution 5 7 2 8 3 9 6 1 4
solution 8 2 4 3 5 1 9 7 6
solution 8 2 4 3 5 1 7 9 6
solution 8 5 2 7 4 9 3 1 6
solution 7 5 2 8 4 9 3 1 6
solution 4 2 6 1 7 8 3 5 9
solution 4 2 6 1 7 8 5 3 9
solution 7 5 2 8 4 9 1 3 6
solution 8 5 2 7 4 9 1 3 6
solution 2 4 8 1 7 9 5 3 6
solution 2 8 6 9 4 1 5 7 3
solution 2 8 6 9 4 1 7 5 3
solution 9 8 6 2 4 1 7 5 3
solution 9 8 6 2 4 1 5 7 3
solution 2 4 8 1 7 9 3 5 6
solution 2 1 4 3 7 9 5 6 8
solution 2 1 4 3 7 9 6 5 8
solution 8 5 2 1 4 7 9 3 6
solution 8 5 2 1 4 7 3 9 6
solution 5 2 1 3 4 7 9 8 6
solution 5 2 1 3 4 7 8 9 6
solution 9 2 8 7 6 5 1 3 4
solution 7 2 8 9 6 5 1 3 4

Excel using R

Very often we find ourselves caught up with really mundane tasks involving voluminous excel data — in business and yes, even in research! The story keeps repeating everywhere. I’ll talk about how to automate some mechanical and brain-dead spreadsheet operations using R. This post will cover the basics of reading and writing operations on Excel data. There are 2 packages worth installing, that I’ve taken a liking to for this purpose. xlsx and XLConnect.

install.packages("xlsx","XLConnect")

Let’s say you have huge volumes of spreadsheet data, of a survey conducted in a city several times during the year (like every month or so). What if you were interested in only some of the details from that survey, and you wanted to extract those details for each time the survey was conducted that year, and summarize the whole thing in yet another spreadsheet. You don’t want to be repeatedly copying and pasting and wasting several man-hours doing so, especially if you had hundreds of surveys to scavenge from. Instead, a simple R script could complete the same task in minutes.

I’ll talk about an example scenario and walk you through the R code. I had data coming from a survey conducted every month of the year for the district of Agra, India — home to the Taj Mahal. Since they were the same survey conducted every month, the excel files were identical in format. Each file contained different data, but of same parameters being tested each month – so data in specific cells, say, E11:V11 were instances of the same metric common to each file. I wanted to stack data from these particular cells from each file onto a fresh excel workbook. I could have copy-pasted the whole thing and done the job manually, but I wasn’t just looking at data from one district, viz., Agra, but hundreds of other districts. This was obviously a task that badly needed automating.

So, for Agra, I had 12 spreadsheets corresponding to surveys for each month of the year. The R file (Agra) that you can see along with the excel files, is the script for creating the final spreadsheet that accomplished the earlier-mentioned objective.

Agra Directory

I wanted data taken from the cells E11:V11 from each of the above spreadsheets. For example, here’s how it looks for the April and May spreadsheets (highlighted portion)

A-Agra_April

B-Agra_May

I had to write R code to get the data from each of the spreadsheets to be neatly stacked in a fresh workbook named Outliers-Agra, and give appropriate row names, column names, and worksheet name, as in the screenshot below.

Agra-Outliers.xls

How was this done using R?

First, the necessary libraries had to be loaded into R, the ones talked about at the start of this post.

library("xlsx") 
library("XLConnect")  

Next, I had to change the working directory to where the survey data was kept, in this case, a folder aptly named Agra. The files in the directory were stored in a vector named files. Since the main R-file where I was writing the script Agra.R, was part of that directory, it also was part of the vector files. The vector files was created to later read the spreadsheets iteratively using a for-loop, so Agra.R had to be deleted from it. This was done using the which() function to locate the position of the R-file, craftily using subsetting rules to remove it from the vector.

setwd("C:/Anirudh/Coding/R/Agra") ## CHANGE WORKING DIRECTORY as required
files <- list.files() ## vector containing the file names in that directory
files <- files[-which(files == "Agra.R")] ## Removes R-file (in this case, Agra.R) from the vector of files

An empty matrix, temp, was created to be exactly the size of the data to be read from all the excel files in the directory. The data from each file was then iteratively stored to fill temp. The read.xlsx() function is a function from the xlsx library. The arguments are mostly self-explanatory. files[i] subsets the ith instance in files. header is set FALSE to indicate that the top row of the excel file did not consist of variables.

temp <- matrix(nrow = length(files), ncol = 18) ## matrix, 18 columns as in 5:22, rows = no.of xls files
for(i in seq_along(files[1:length(files)]))
{
  dat <- read.xlsx(files[i], sheetIndex = 1, rowIndex = 11, colIndex = 5:22, header = FALSE)
  mat <- as.matrix(dat) ## reading data from 11th row of xlsx file into a matrix of row size 1
  temp[i,] <- mat ## storing data from above matrix into matrix temp
}

The contents of temp (a matrix object) were then stored in a data frame object before being written to a fresh excel file (this is necessary). For writing operations in excel, I used functions from the XLConnect library. The workbook had to be loaded using loadWorkbook() with the create argument set to TRUE to indicate that a fresh excel file was being created. If I had to write to an already existing file, the argument would have been set FALSE . I created a worksheet and named it ANC3 (that had got something to do with the data). startRow and startCol are set to the location where we want the data to begin from.

df <- as.data.frame(temp) ## storing contents of temp matrix as data frame
wb <- loadWorkbook("C:/Anirudh/Coding/R/Agra/Outliers-Agra.xlsx", create = TRUE) # new workbook
createSheet(wb, name = "ANC3") ## create worksheet / tab with name "ANC3"
writeWorksheet(wb, df, sheet = "ANC3", startRow = 2, startCol = 2, header = FALSE)

Next, the data in the new excel file, Outliers-Agra.xlsx was given appropriate column names (colnames) and row names (months).

colnames <- read.xlsx(files[i],sheetIndex = 1, rowIndex = 6, colIndex = 5:22, header = FALSE)
colnames <- as.data.frame(as.matrix(colnames))
months <- as.data.frame(matrix(c("April", "May", "June", "July", "August", "September","October", "November","December", "January", "February", "March")))
writeWorksheet(wb, colnames, sheet = "ANC3", startRow = 1, startCol = 2, header = FALSE)
writeWorksheet(wb, months, sheet = "ANC3", startRow = 2, startCol = 1, header = FALSE)

Lastly, the workbook needed to be saved.

saveWorkbook(wb)

After executing the R script, the new workbook Outliers-Agra is added to the parent directory.

Agra Directory with Agra.R and Outliers-Agra

So this is how some basic Excel work can be managed using R. Since these were only 12 files, writing all this code to perform simple copy and paste functions might seem like an awful lot to go through. But the data in this example was for one Indian district  (Agra) — and if you had to do the same thing thousands of times covering hundreds of districts (directories), you’d thank this gem of free software known as R, for what it can accomplish.

Featured Image: https://xkcd.com/1445/

One-Month-Old Blog

UPDATE: While I’m already half way through the much recommended book by Zed A. Shaw – Learn Python The Hard Way, I’m still doing my research on other great resources to help me get started with Python. This page listing 10 Python blogs worth following, in particular emphasizes Mouse vs python to be the most useful. Starting the 10th of June, I’ll be engaged on a 9-week-long MOOC on Computer Science using Python, offered by MIT.

It’s been 2 months since I got started with R, and although my progress seems fast to me, it appears so mainly because R comes with insanely helpful packages that reduce large chunks of code into simple functions. Not only that, data visualization and graphics generated in R are beautiful and elegant. For example, the following code generates a Mandelbrot set created through the first 50 iterations of equation z = z2 + c plotted for different complex constants c

library(caTools)         # external package providing write.gif function
jet.colors <- colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan", "#7FFF7F",
                                 "yellow", "#FF7F00", "red", "#7F0000"))
m <- 1000                # define size
C <- complex( real=rep(seq(-1.8,0.6, length.out=m), each=m ),
              imag=rep(seq(-1.2,1.2, length.out=m), m ) )
C <- matrix(C,m,m)       # reshape as square matrix of complex numbers
Z <- 0                   # initialize Z to zero
X <- array(0, c(m,m,50)) # initialize output 3D array
for (k in 1:50) {        # loop with 50 iterations
  Z <- Z^2+C             # the central difference equation
  X[,,k] <- exp(-abs(Z)) # capture results
}
write.gif(X, "Mandelbrot.gif", col=jet.colors, delay=800)

Mandelbrot

This is just an illustration of the power of a dozen or so lines of R code. Just as there are a ridiculous many packages in R, there are countless modules packed into many thousands of packages in Python to make life simpler, so I wasn’t surprised to find a module called antigravity, that can be imported in Python like this:

 import antigravity  

and voila, you are redirected to this telling XKCD tale of Cueball performing gravity-defying stunts with Python.

source: http://www.xkcd.com/353/