Highly Divisible Triangular Number — Project Euler (Problem 12)

All n numbers are Triangle Numbers. They’re called so, because they can be represented in the form of a triangular grid of points where the first row contains a single element and each subsequent row contains one more element than the previous one.

TriangularNumber_900

Problem 12 of Project Euler asks for the first triangle number with more than 500 divisors.

These are the factors of the first seven triangle numbers:

1 = 1: 1
2 = 3: 1,3
3 = 6: 1,2,3,6
4 = 10: 1,2,5,10
∑5 = 15: 1,3,5,15
∑6 = 21: 1,3,7,21
∑7 = 28: 1,2,4,7,14,28

Here’s how I proceeded:

First Step: Find the smallest number with 500 divisors. Seems like a good starting point to begin our search.
Second Step: Starting at the number found in the previous step, search for the next triangle number. Check to see whether this number has 500+ divisors. If yes, this is the number we were looking for, else…
Third Step: Check n for which ∑n = triangle number found in the previous step
Fourth Step: Add (n+1) to the last triangle number found, to find the next triangle number. Check whether this number has 500+ divisors. If yes, this number is the answer. If not, repeat Fourth Step till the process terminates.

Now for the details:

The First Step isn’t exactly a piece of cake, but necessary to reduce computation time. I solved this with a bit of mental math. The main tool for the feat is the prime number decomposition theorem:

Every integer N is the product of powers of prime numbers

N = pαqβ· … · rγ
Where p, q, …, r are prime, while α, β, …, γ are positive integers. Such representation is unique up to the order of the prime factors.
If N is a power of a prime, N = pα, then it has α + 1 factors:
1, p, …, pα-1, pα
The total number of factors of N equals (α + 1)(β + 1) … (γ + 1)

500 = 2 x 2 x 5 x 5 x 5
So, the number in question should be of the form abq4r4s4 where a, b, q, r, s are primes that minimize abq4r4s4. This is satisfied by 7x11x24x34x54 = 62370000. This marks the end of the First Step which is where we start our search for our magic number.

The next 3 steps would need helper functions defined as below:

from math import *
# Function to calculate the number of divisors of integer n
def divisors(n):
limit = int(sqrt(n))
divisors_list = []
for i in range(1, limit+1, 1):
if n % i == 0:
divisors_list.append(i)
if i != n/i:
divisors_list.append(n/i)
return len(divisors_list)
# Function to check for triangle number
def isTriangleNumber(n):
a = int(sqrt(2*n))
return 0.5*a*(a+1) == n
# Function to calculate the last term of the series adding up to the triangle number
def lastTerm(n):
if isTriangleNumber(n):
return int(sqrt(2*n))
else:
return None

As can be seen from the above code, the algorithm to calculate divisors of an integer is as follows:
1. Start by inputting a number n
2. Let an int variable limit = √n
3. Run a loop from i = 1 to  i = limit
    3.1 if n is divisible by i
3.1.1 Add i to the list of divisors
3.1.2 if i and n/i are unequal, add n/i to the list too.
4. End

Finally, executing the 4 steps mentioned earlier can be done like so (the code took less than 2s to arrive at the answer):

# First Step
# First number 'check' to have 500 divisors
check = 2**4 * 3**4 * 5**4 * 7 * 11
# Second Step
# Starting from 'check', iterate sequentially checking for the next 'triangle' number
while not isTriangleNumber(check):
check += 1
# Third and Fourth Steps
# Calculate the last term of the series ('seriesLastTerm') that adds up to the newly calculated triangle number 'check'
seriesLastTerm = lastTerm(check)
# Iterate over triangle numbers checking for divisors > 500
while divisors(check) <= 500:
# add the next term to check to get the next triangle number
check += (seriesLastTerm + 1)
seriesLastTerm += 1
print check

Ans: 76576500

Consecutive Prime Sum — Project Euler (Problem 50)

Many problems in Project Euler relate to working with primes. I use primesieve-python to help solve such problems. It consists of Python bindings for the primesieve C++ library. Generates primes orders of magnitude faster than any pure Python code. Features:

  • Generate a list of primes
  • Count primes and prime k-tuplets
  • Print primes and prime k-tuplets
  • Find the nth prime
  • Iterate over primes using little memory

Anyway, here’s Problem 50 from Project Euler:

ProjectEuler50

Here’s how I did it:

# Question: Which prime, below one-million, can be written as the sum of the most consecutive primes
from primesieve import *
from math import *
# Generate list of primes under a million
primes_under_million = generate_primes(10**6)
# Sum of consecutive primes is of order 0.5(n^2)(logn)
# Calculate 'n' so that sum of consecutive primes is less than a million (and not necessarily prime)
nsum = 1
n = 1
while nsum < 10**6:
nsum = 0.5*(n**2)*(log(n, e))
n += 1
# Calculate index so that sum of first 'index' consecutive primes is under a million and also prime
primes_subset = primes_under_million[:n]
nsum = sum(primes_under_million[:n])
while nsum > 10**6:
n -= 1
nsum = sum(primes_under_million[:n])
primes_sum = 0
index = 0
for i in range(len(primes_subset)):
if i % 2 == 1:
pass
else:
sumprimes = sum(primes_subset[:i])
if sumprimes > primes_sum and sumprimes < 10**6 and sumprimes in primes_under_million:
primes_sum = sumprimes
index = i
# Print out sum of consecutive primes till 'index', index, n
# print primes_sum, index, n
# Check consecutive primes within a range (index to n) such that their number is greater than index and maximum
j = index + 1
start = 0
while j <= n:
while (j-start) >= (n-index):
sumprimes = sum(primes_subset[start:j])
if sumprimes > primes_sum and sumprimes in primes_under_million:
primes_sum = sumprimes
start += 1
j += 1
start = 0
print primes_sum
view raw euler50.py hosted with ❤ by GitHub

Answer: 997651

Largest Product in a Grid — Project Euler (Problem 11)

I started solving Project Euler problems this month. Check out the Project Euler tab of this blog for a list of the problems I’ve solved (with solutions) till date. Here’s a problem you might find interesting:

ProjectEuler11

Here’s my solution using Python (I basically search through the entire matrix which is of O() complexity):

I first copy the maxtrix into a text file euler11.txt so that it can be later read into Python

08 02 22 97 38 15 00 40 00 75 04 05 07 78 52 12 50 77 91 08
49 49 99 40 17 81 18 57 60 87 17 40 98 43 69 48 04 56 62 00
81 49 31 73 55 79 14 29 93 71 40 67 53 88 30 03 49 13 36 65
52 70 95 23 04 60 11 42 69 24 68 56 01 32 56 71 37 02 36 91
22 31 16 71 51 67 63 89 41 92 36 54 22 40 40 28 66 33 13 80
24 47 32 60 99 03 45 02 44 75 33 53 78 36 84 20 35 17 12 50
32 98 81 28 64 23 67 10 26 38 40 67 59 54 70 66 18 38 64 70
67 26 20 68 02 62 12 20 95 63 94 39 63 08 40 91 66 49 94 21
24 55 58 05 66 73 99 26 97 17 78 78 96 83 14 88 34 89 63 72
21 36 23 09 75 00 76 44 20 45 35 14 00 61 33 97 34 31 33 95
78 17 53 28 22 75 31 67 15 94 03 80 04 62 16 14 09 53 56 92
16 39 05 42 96 35 31 47 55 58 88 24 00 17 54 24 36 29 85 57
86 56 00 48 35 71 89 07 05 44 44 37 44 60 21 58 51 54 17 58
19 80 81 68 05 94 47 69 28 73 92 13 86 52 17 77 04 89 55 40
04 52 08 83 97 35 99 16 07 97 57 32 16 26 26 79 33 27 98 66
88 36 68 87 57 62 20 72 03 46 33 67 46 55 12 32 63 93 53 69
04 42 16 73 38 25 39 11 24 94 72 18 08 46 29 32 40 62 76 36
20 69 36 41 72 30 23 88 34 62 99 69 82 67 59 85 74 04 36 16
20 73 35 29 78 31 90 01 74 31 49 71 48 86 81 16 23 57 05 54
01 70 54 71 83 51 54 69 16 92 33 48 61 43 52 01 89 19 67 48
view raw euler11.txt hosted with ❤ by GitHub

I then execute the following code from the same working directory as euler11.txt
# import numpy module for matrix operations
from numpy import *
# read the file with the matrix of numbers
filename = 'euler11.txt'
# store each line of the file into an array
with open(filename, "r") as ins:
array = []
for line in ins:
array.append(line)
print array
# create a new array that converts the number strings into number integers
newArray = []
for i in array:
j = i.split(' ')
k = [int(n) for n in j]
newArray.append(k)
print newArray
# convert the array of integers into a matrix of integers
problemMatrix = matrix(newArray)
print problemMatrix
# set initial maximum product to be a dummy number, say 1
maxProd = 1
# search all combinations for maximum product
for i in range(16):
for j in range(16):
prod1 = problemMatrix[i,j]*problemMatrix[i+1,j]*problemMatrix[i+2,j]*problemMatrix[i+3,j]
if prod1 > maxProd:
maxProd = prod1
prod2 = problemMatrix[i,j]*problemMatrix[i,j+1]*problemMatrix[i,j+2]*problemMatrix[i,j+3]
if prod2 > maxProd:
maxProd = prod2
prod3 = problemMatrix[i,j]*problemMatrix[i+1,j+1]*problemMatrix[i+2,j+2]*problemMatrix[i+3,j+3]
if prod3 > maxProd:
maxProd = prod3
prod4 = problemMatrix[19-i,j]*problemMatrix[18-i,j+1]*problemMatrix[17-i,j+2]*problemMatrix[16-i,j+3]
if prod4 > maxProd:
maxProd = prod4
print maxProd
view raw euler11.py hosted with ❤ by GitHub

Answer: 70600674

Object Oriented Programing with Python – Particle Diffusion Simulation

I’m a newbie to the programming world. I first started programming in Python in May this year, a month after I started this blog, so I still haven’t learnt enough to contribute to economics as is the stated goal of this blog. But I know I’ll get there in a year or less.

This blog was also meant to document my learning. In May, I would have called myself Newb v0.0. Today, 3 months later, I’d like to call myself Newb v0.3 and the goal is to be at least Expert v1.0 by January 2016.

With the help of Rice University’s awesome classes on Python programming I created a cool simulation of particles diffusing into space, using the concept of Classes, which I learnt just yesterday!

Click to check out the code !

Screenshot from 2015-07-23 11:49:00

Screenshot from 2015-07-23 11:49:10

Screenshot from 2015-07-23 11:49:39

Number of Inversions in an Unsorted Array: Python Code

This is my solution to the first programming assignment of Tim Roughgarden’s course on Algorithms  that was due 12:30 PM IST today. Here’s the question quoted as it is:

Programming Question-1
Download the text file here. (Right click and save link as) This file contains all of the 100,000 integers between 1 and 100,000 (inclusive) in some order, with no integer repeated.

Your task is to compute the number of inversions in the file given, where the ith row of the file indicates the ith entry of an array.
Because of the large size of this array, you should implement the fast divide-and-conquer algorithm covered in the video lectures. The numeric answer for the given input file should be typed in the space below.
So if your answer is 1198233847, then just type 1198233847 in the space provided without any space / commas / any other punctuation marks. You can make up to 5 attempts, and we’ll use the best one for grading.
(We do not require you to submit your code, so feel free to use any programming language you want — just type the final numeric answer in the following space.)

My Solution

I modified an earlier code I wrote for merge sort to arrive at the solution. It needed just a couple of modifications, and if you look carefully, it turns out that the number of inversions are unearthed each and every time we merge two sorted sub-arrays. So, intuitively, if the merge sort algorithm was O(nlog2 n), it would take almost as many operations for counting inversions. In python, the code to sort and count inversions in an array of 10,000 integers took less than 3 seconds.

# load contents of text file into a list
# numList
NUMLIST_FILENAME = "IntegerArray.txt"
inFile = open(NUMLIST_FILENAME, 'r')
with inFile as f:
numList = [int(integers.strip()) for integers in f.readlines()]
count = 0
def inversionsCount(x):
global count
midsection = len(x) / 2
leftArray = x[:midsection]
rightArray = x[midsection:]
if len(x) > 1:
# Divid and conquer with recursive calls
# to left and right arrays similar to
# merge sort algorithm
inversionsCount(leftArray)
inversionsCount(rightArray)
# Merge sorted sub-arrays and keep
# count of split inversions
i, j = 0, 0
a = leftArray; b = rightArray
for k in range(len(a) + len(b) + 1):
if a[i] <= b[j]:
x[k] = a[i]
i += 1
if i == len(a) and j != len(b):
while j != len(b):
k +=1
x[k] = b[j]
j += 1
break
elif a[i] > b[j]:
x[k] = b[j]
count += (len(a) - i)
j += 1
if j == len(b) and i != len(a):
while i != len(a):
k+= 1
x[k] = a[i]
i += 1
break
return x
# call function and output number of inversions
inversionsCount(numList)
print count

Test my solution here with your own test cases.

Introducing cricketr! : An R package to analyze performances of cricketers

Wicked! Or must I say ‘howzzat!?’

Giga thoughts ...

Yet all experience is an arch wherethro’
Gleams that untravell’d world whose margin fades
For ever and forever when I move.
How dull it is to pause, to make an end,
To rust unburnish’d, not to shine in use!

Ulysses by Alfred Tennyson

Introduction

This is an initial post in which I introduce a cricketing package ‘cricketr’ which I have created. This package was a natural culmination to my earlier posts on cricket and my completing 9 modules of Data Science Specialization, from John Hopkins University at Coursera. The thought of creating this package struck me some time back, and I have finally been able to bring this to fruition.

So here it is. My R package ‘cricketr!!!’

This package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package only uses data from test cricket. I plan to develop functionality for One-day and…

View original post 4,951 more words

The Merge Sort — Python Code

I have just begun working on a MOOC on algorithms offered by Stanford. Since this course gives us the liberty to choose a programming language, there isn’t any code discussed in those lectures. I plan to convert any algorithm discussed in those lectures into Python code. Since Merge Sort was the first algorithm discussed, I’m starting with that.

Merge Sort is supposedly a good introduction to divide and conquer algorithms, greatly improving upon selection, insertion and bubble sort techniques, especially when input size increases.

Pseudocode:

— Recursively sort the first half of the input array.
— Recursively sort the second half of the input array.
— Merge two sorted sub-lists into one list.

C = output [length = n]
A = 1st sorted array [n/2]
B = 2nd sorted array [n/2]
i = 0 or 1 (depending on the programming language)
j = 0 or 1 (depending on the programming language)

for k = 1 to n

if A(i) < B(j)
C(k) = A(i)
i = i + 1

else if A(i) > B(j)
C(k) = B(j)
j = j + 1

Note: the pseudocode for the merge operation ignores the end cases.

Visualizing the algorithm can be done in 2 stages — first, the recursive splitting of the arrays, 2 each 2 at a time, and second, the merge operation.

Merge-sort-example-300px MergeSort

Here’s the Python code to merge sort an array.

# Code for the merge subroutine
def merge(a,b):
""" Function to merge two arrays """
c = []
while len(a) != 0 and len(b) != 0:
if a[0] < b[0]:
c.append(a[0])
a.remove(a[0])
else:
c.append(b[0])
b.remove(b[0])
if len(a) == 0:
c += b
else:
c += a
return c
# Code for merge sort
def mergesort(x):
""" Function to sort an array using merge sort algorithm """
if len(x) == 0 or len(x) == 1:
return x
else:
middle = len(x)/2
a = mergesort(x[:middle])
b = mergesort(x[middle:])
return merge(a,b)
view raw mergesort.py hosted with ❤ by GitHub

We can divide a list in half log2 n times where n is the length of the list. The second process is the merge. Each item in the list will eventually be processed and placed on the sorted list. So the merge operation which results in a list of size n requires n operations. The result of this analysis is that log2 n splits, each of which costs n for a total of nlog2 n operations.

Other Algorithms:
Karatsuba Integer Multiplication Algorithm
Quick Sort Python Code

Review: An Introduction to Interactive Programming in Python (Part 1)

This class (Part 1 of a 2-part course on interactive programming using Python – and the first course of the Fundamentals of Computing Specialization offered by RICE Unviersity) was an excellent introduction to programming because of its focus on building interactive (and fun) applications with the lessons learned each week. Most introductory coding classes start with text based (boring?) programs, while all through this course you’re required to build a series of projects that get progressively complicated with every passing week. I’m not to be mistaken to be trashing conventional pedagogy, but then again, how many gifted coders do you know who learned to code after completing all the exercises, cover-to-cover of some programming textbook? The best way to learn to enjoy coding would be to build interactive stuff, and this course scores full points on that.

A short introduction to the class in a charmingly nerdy way

The mini-projects / assignments during the course are implemented on a cloud-based environment called CodeSkulptor (built by Scott Rixner, one of the instructors for this course). I found CodeSkulptor unique, in that it allows you to share your code (because it’s browser based) with just about anyone with an Internet connection and makes you work with a graphic user interface (GUI) module similar to Pygame, called Simplegui. It also had a debugging tool, called Viz Mode that helped visualize the process. It eases the task of debugging your code and you’ll realize how cool it is as you start using it more.

Since the course mini-projects were peer-reviewed, evaluating other people’s code also became a more straight-jacket affair, as everyone has their code on the same version of Python. This ensures that the focus is on learning to code, without wasting time on the logistics of programming environment (tuning differences in versions or IDEs). I especially enjoyed peer grading – for each mini project we completed, we had to evaluate and grade the work of 5 others. This was very rewarding – because I got the opportunity to fix bugs in others’ code (which makes you a better coder, I guess) and also got to see better implementations than the ones I had coded, further enriching the learning experience. Indeed, the benefits of peer grading and assessment have been well studied and documented.

Of all the assignments, the one I loved the most was implementing the classic arcade game Pong. You could try playing a version of the game I implemented here. It is a 2-player implementation, but you can play it as a single-player game, only if you imagine yourself to be answering this somewhat cheeky question! Which Pong character are you? Left or Right?

which pong

Which_Pong_Charater_Are_you

Pong v1.1

The principal reason behind my joining this course was the way it is structured and taught. We had to watch two sets of videos (part a and part b) and then complete one quiz for each set. The main task for each week was to complete a mini-project that was due along with the quizzes early Sunday morning, followed by assessment of peers’ mini-projects on the following Sunday-Wednesday. The instructors clearly put in A LOT OF WORK to make the lecture videos interesting, laced with humor, with just enough to get you going on your own with the week’s mini-project. That way you’d spend less time viewing the lecture videos, spending more time on actually getting the code for your mini-project to work. So in a way, one might say this course doesn’t follow standard pedagogy for an introductory programming course, but then, as Scott Rixner assures, “You’d know enough to be dangerous!

The projects that were completed in Part 1 of this course were indeed exciting:

Rock Paper Scissors Lizard Spock: A simple implementation played with the computer. This project covers basics on statements, expressions and variables, functions, logic and conditionals [I’m a huge fan of The Big Bang Theory, so I was obviously eager to complete this game. Instead of a series of if-elif-else clauses, this implementation used modular logic, all of which is taught in a really fun way. A great way to start off the course].
Guess the Number: Computer chooses a random number between 1 and 100 and you guess that number. It covered event-driven programming, local and global variables, buttons and input fields [This game although fun, might have been more interesting to code if the computer had to guess the number that the player chose, using bisection search].
Stopwatch: This was the first project that used a graphic user interface, using some modular arithmetic to get the digits of the ticking seconds in place. A game was also built on it where the player had to stop the watch right at the start of a second to score points. This game tested your reaction-time. It covered static drawing, timers and interactive drawing.
Pong: The last project of Part 1 and the most fun. Creating the game required only a minor step-up from learnings from previous weeks. It covered knowledge of lists, keyboard input, motion, positional/velocity control. Coding the ball physics where you put to use high-school physics knowledge of elasticity and collisions was very enjoyable. In my game, I set elasticity = 1 (for perfectly elastic collisions)

BallBounce

In an interview with the founders of this MOOC, who spent they say that they spent over 1000 hours building it (Part 1 and Part 2 combined, I guess). That’s an awful lot of effort and it all shows in how brilliantly the class is executed. The support system in the class is excellent. You’ll always find help available within minutes of posting your doubts and queries on the forums. I’ve seen Joe Warren (one of the main instructors of the course) replying to forum posts quite regularly. In addition, there was enough supplementary material in the form of pages on concepts and examples, practice exercises, and video content created by students from previous iterations of the class to better explain concepts and aspects of game-building, improving upon the lecture material.

Concepts and Examples

Concepts_and_examples

Practice Exercises

Practice_exercises

Student-created Videos Explaining Concepts

Student_created_videos

Overall, I had a great learning experience. I completed Part 1 with a 100 per cent score even though I had a minor hiccup while building the game Pong, which was the most satisfying of all the projects in Part 1. I would review Part 2 when I’m done with that in August this year. I’d easily recommend this course to anyone wishing to start off with Python. It is a great place to be introduced to Python, but it shouldn’t be your ONLY resource. I have been taking MIT’s 6.01x introductory Python course side-by-side. I shall review that course as soon as I’m through with it. That course is pedagogically more text-bookish, and indeed they do profess the use of their textbook to complement the course. I’m 4 weeks into that course and finding that enjoyable too – albeit in a different way. I still haven’t lost a point on any of the assignments or finger exercises there, and hope the trend continues:

CourseProgressTillQuiz

PS: In one of the forum threads, Joe posted a list of resources that could be referred to in addition to the class.

Additional_Resources

Python Books:

Another List of Books:

  • http://pythonbooks.revolunet.com/  – about 50 books –  Another good list of free python books that is kept up to date, and I believe are all free or open-source: (I won’t repeat all the books on the list here, just go check it out! Some are also on the list above, but not all)


Further Online Learning:

Solution to [Viral] Math Puzzle for Vietnamese Eight-Year-Olds (Using R)

There was this math problem that went viral recently –  it was a Singapore Math Olympiad problem meant for 14-year olds which surprisingly many adults couldn’t solve. A friend sent me a link to that problem and I solved it with pen and paper in about 10 minutes, leaving me feeling hardly challenged. I guess really-tough questions aren’t the ones that actually go viral. If anything, a math problem going viral means it caters to an average IQ, something like a 100 – which then left me wondering whatever my IQ was! I had taken many online IQ tests when I was pursuing my engineering degree and my scores ranged between 135 and 145. But they weren’t tests one could really feel great about, mainly because everyone I knew who took them, scored 110+. I never took a Mensa test in the past, so I’m not sure whether I’d have made the cut. Anyway, I decided to check Quora to see if the nerds had found anything remotely mimicking a Mensa type IQ test. So I checked out this one and got a 130. It seems to be the least I’ve ever got on an IQ test, so without giving you any rationale to why I feel this might be closer to accurately measuring your IQ, I say it’s probably worth a shot.

iq_test_score

 

A week back, there was yet another math puzzle that had gone viral, meant for Vietnamese eight-year-olds, a problem that had stumped parents and teachers. You need to fill in the empty boxes below with the digits from 1 to 9 so that the equation makes sense, following the order of operations – multiply first, then division, addition and subtraction last. Apparently, this problem was for third graders in the town of Bao Loc in the Vietnamese Highlands.

I didn’t solve this one with pen and paper, and instead wrote an R program. It’s clearly a problem with a fixed number of variables (nine) associated with standard math operators, meaning there would be 9! (362880) permutations – which makes you think there’s got to be more than one solution. It became obvious I had to write some code.

I wrote a function appropriately named baoloc() to list out the solutions to this problem. The code is self explanatory. I got 128 unique solutions, which means if the students had to solve this problem with pencil on paper, they could get one of 128 possible answers, which makes the job of the examiner arduous and boring!

## The problem could be written as u + 13v/w + x + 12y - z - 11 + pq/r -10 = 66
## which reduces to u + 13v/w + x + 12y - z + pq/r = 87
## This problem was solved on [1] "Wed May 27 17:46:52 2015"
baoloc <- function()
{
packages <- rownames(installed.packages())
if("combinat" %in% packages) library("combinat") else install.packages("combinat")
numbers <- 1:9
permutations <- permn(numbers) ## list of all permutations of vector input
solutions <- numeric(9)
for(i in 1:length(permutations))
{
solution <- permutations[[i]]
if(solution[1] + 13*(solution[2]/solution[3]) + solution[4] + 12*solution[5] - solution[6] + (solution[7]*solution[8] / solution[9]) == 87)
{
solutions <- rbind(solutions, solution)
}
}
print("The number of solutions are:")
print(nrow(solutions)-1)
return(solutions[2:nrow(solutions),])
}
view raw baoloc.R hosted with ❤ by GitHub

The above code produces the following solutions:
[1] "The number of solutions are:"
[1] 128
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
solution 9 1 2 5 6 7 3 4 8
solution 7 9 6 1 5 2 3 4 8
solution 1 9 6 7 5 2 3 4 8
solution 1 5 2 3 4 8 7 9 6
solution 1 5 2 3 4 8 9 7 6
solution 5 1 2 9 6 7 3 4 8
solution 5 1 2 9 6 7 4 3 8
solution 1 5 2 8 4 7 3 9 6
solution 1 5 2 8 4 7 9 3 6
solution 1 9 6 7 5 2 4 3 8
solution 7 9 6 1 5 2 4 3 8
solution 9 1 2 5 6 7 4 3 8
solution 1 2 6 4 7 8 5 3 9
solution 1 2 6 4 7 8 3 5 9
solution 1 4 8 2 7 9 3 5 6
solution 1 4 8 2 7 9 5 3 6
solution 7 1 4 9 6 5 2 3 8
solution 9 1 4 7 6 5 2 3 8
solution 5 4 1 9 2 7 8 3 6
solution 5 4 1 9 2 7 3 8 6
solution 9 4 1 5 2 7 8 3 6
solution 9 4 1 5 2 7 3 8 6
solution 4 9 6 1 5 8 3 7 2
solution 4 9 6 1 5 8 7 3 2
solution 8 6 4 7 5 9 1 3 2
solution 7 6 4 8 5 9 1 3 2
solution 9 4 8 5 6 7 1 3 2
solution 5 4 8 9 6 7 1 3 2
solution 1 9 6 4 5 8 7 3 2
solution 1 9 6 4 5 8 3 7 2
solution 9 1 4 7 6 5 3 2 8
solution 7 1 4 9 6 5 3 2 8
solution 1 3 9 4 7 8 2 5 6
solution 1 3 4 7 6 5 2 9 8
solution 1 3 4 7 6 5 9 2 8
solution 1 3 9 4 7 8 5 2 6
solution 1 5 3 9 4 2 7 8 6
solution 1 5 3 9 4 2 8 7 6
solution 1 3 2 9 5 6 7 4 8
solution 1 3 2 9 5 6 4 7 8
solution 1 3 6 2 7 9 5 4 8
solution 1 3 6 2 7 9 4 5 8
solution 1 3 2 4 5 8 7 9 6
solution 1 3 2 4 5 8 9 7 6
solution 6 3 1 9 2 5 8 7 4
solution 6 3 1 9 2 5 7 8 4
solution 9 3 1 6 2 5 8 7 4
solution 9 3 1 6 2 5 7 8 4
solution 7 3 1 5 2 6 8 9 4
solution 7 3 1 5 2 6 9 8 4
solution 5 3 1 7 2 6 9 8 4
solution 5 3 1 7 2 6 8 9 4
solution 9 5 3 1 4 2 7 8 6
solution 9 5 3 1 4 2 8 7 6
solution 3 1 4 2 7 9 6 5 8
solution 3 1 4 2 7 9 5 6 8
solution 7 3 4 1 6 5 9 2 8
solution 7 3 4 1 6 5 2 9 8
solution 3 6 4 9 5 8 1 7 2
solution 3 6 4 9 5 8 7 1 2
solution 5 4 8 9 6 7 3 1 2
solution 9 4 8 5 6 7 3 1 2
solution 7 6 4 8 5 9 3 1 2
solution 8 6 4 7 5 9 3 1 2
solution 9 6 4 3 5 8 1 7 2
solution 9 6 4 3 5 8 7 1 2
solution 4 3 9 1 7 8 5 2 6
solution 4 3 9 1 7 8 2 5 6
solution 4 3 2 1 5 8 9 7 6
solution 4 3 2 1 5 8 7 9 6
solution 3 2 4 8 5 1 7 9 6
solution 3 2 4 8 5 1 9 7 6
solution 5 9 3 6 2 1 8 7 4
solution 5 9 3 6 2 1 7 8 4
solution 3 5 2 1 4 8 9 7 6
solution 3 5 2 1 4 8 7 9 6
solution 6 9 3 5 2 1 7 8 4
solution 6 9 3 5 2 1 8 7 4
solution 3 9 6 2 5 1 7 4 8
solution 3 9 6 2 5 1 4 7 8
solution 3 2 8 6 5 1 7 9 4
solution 3 2 8 6 5 1 9 7 4
solution 7 3 2 8 5 9 6 1 4
solution 8 3 2 7 5 9 6 1 4
solution 8 3 2 7 5 9 1 6 4
solution 7 3 2 8 5 9 1 6 4
solution 3 2 1 5 4 7 8 9 6
solution 3 2 1 5 4 7 9 8 6
solution 3 9 2 8 1 5 7 6 4
solution 9 3 2 1 5 6 7 4 8
solution 3 9 2 8 1 5 6 7 4
solution 9 3 2 1 5 6 4 7 8
solution 2 3 6 1 7 9 4 5 8
solution 2 3 6 1 7 9 5 4 8
solution 8 9 2 3 1 5 6 7 4
solution 8 9 2 3 1 5 7 6 4
solution 2 9 6 3 5 1 4 7 8
solution 2 9 6 3 5 1 7 4 8
solution 6 2 8 3 5 1 9 7 4
solution 6 2 8 3 5 1 7 9 4
solution 7 2 8 9 6 5 3 1 4
solution 9 2 8 7 6 5 3 1 4
solution 8 7 2 5 3 9 6 1 4
solution 8 7 2 5 3 9 1 6 4
solution 5 7 2 8 3 9 1 6 4
solution 5 7 2 8 3 9 6 1 4
solution 8 2 4 3 5 1 9 7 6
solution 8 2 4 3 5 1 7 9 6
solution 8 5 2 7 4 9 3 1 6
solution 7 5 2 8 4 9 3 1 6
solution 4 2 6 1 7 8 3 5 9
solution 4 2 6 1 7 8 5 3 9
solution 7 5 2 8 4 9 1 3 6
solution 8 5 2 7 4 9 1 3 6
solution 2 4 8 1 7 9 5 3 6
solution 2 8 6 9 4 1 5 7 3
solution 2 8 6 9 4 1 7 5 3
solution 9 8 6 2 4 1 7 5 3
solution 9 8 6 2 4 1 5 7 3
solution 2 4 8 1 7 9 3 5 6
solution 2 1 4 3 7 9 5 6 8
solution 2 1 4 3 7 9 6 5 8
solution 8 5 2 1 4 7 9 3 6
solution 8 5 2 1 4 7 3 9 6
solution 5 2 1 3 4 7 9 8 6
solution 5 2 1 3 4 7 8 9 6
solution 9 2 8 7 6 5 1 3 4
solution 7 2 8 9 6 5 1 3 4

Excel using R

Very often we find ourselves caught up with really mundane tasks involving voluminous excel data — in business and yes, even in research! The story keeps repeating everywhere. I’ll talk about how to automate some mechanical and brain-dead spreadsheet operations using R. This post will cover the basics of reading and writing operations on Excel data. There are 2 packages worth installing, that I’ve taken a liking to for this purpose. xlsx and XLConnect.

install.packages("xlsx","XLConnect")

Let’s say you have huge volumes of spreadsheet data, of a survey conducted in a city several times during the year (like every month or so). What if you were interested in only some of the details from that survey, and you wanted to extract those details for each time the survey was conducted that year, and summarize the whole thing in yet another spreadsheet. You don’t want to be repeatedly copying and pasting and wasting several man-hours doing so, especially if you had hundreds of surveys to scavenge from. Instead, a simple R script could complete the same task in minutes.

I’ll talk about an example scenario and walk you through the R code. I had data coming from a survey conducted every month of the year for the district of Agra, India — home to the Taj Mahal. Since they were the same survey conducted every month, the excel files were identical in format. Each file contained different data, but of same parameters being tested each month – so data in specific cells, say, E11:V11 were instances of the same metric common to each file. I wanted to stack data from these particular cells from each file onto a fresh excel workbook. I could have copy-pasted the whole thing and done the job manually, but I wasn’t just looking at data from one district, viz., Agra, but hundreds of other districts. This was obviously a task that badly needed automating.

So, for Agra, I had 12 spreadsheets corresponding to surveys for each month of the year. The R file (Agra) that you can see along with the excel files, is the script for creating the final spreadsheet that accomplished the earlier-mentioned objective.

Agra Directory

I wanted data taken from the cells E11:V11 from each of the above spreadsheets. For example, here’s how it looks for the April and May spreadsheets (highlighted portion)

A-Agra_April

B-Agra_May

I had to write R code to get the data from each of the spreadsheets to be neatly stacked in a fresh workbook named Outliers-Agra, and give appropriate row names, column names, and worksheet name, as in the screenshot below.

Agra-Outliers.xls

How was this done using R?

First, the necessary libraries had to be loaded into R, the ones talked about at the start of this post.

library("xlsx") 
library("XLConnect")  

Next, I had to change the working directory to where the survey data was kept, in this case, a folder aptly named Agra. The files in the directory were stored in a vector named files. Since the main R-file where I was writing the script Agra.R, was part of that directory, it also was part of the vector files. The vector files was created to later read the spreadsheets iteratively using a for-loop, so Agra.R had to be deleted from it. This was done using the which() function to locate the position of the R-file, craftily using subsetting rules to remove it from the vector.

setwd("C:/Anirudh/Coding/R/Agra") ## CHANGE WORKING DIRECTORY as required
files <- list.files() ## vector containing the file names in that directory
files <- files[-which(files == "Agra.R")] ## Removes R-file (in this case, Agra.R) from the vector of files

An empty matrix, temp, was created to be exactly the size of the data to be read from all the excel files in the directory. The data from each file was then iteratively stored to fill temp. The read.xlsx() function is a function from the xlsx library. The arguments are mostly self-explanatory. files[i] subsets the ith instance in files. header is set FALSE to indicate that the top row of the excel file did not consist of variables.

temp <- matrix(nrow = length(files), ncol = 18) ## matrix, 18 columns as in 5:22, rows = no.of xls files
for(i in seq_along(files[1:length(files)]))
{
  dat <- read.xlsx(files[i], sheetIndex = 1, rowIndex = 11, colIndex = 5:22, header = FALSE)
  mat <- as.matrix(dat) ## reading data from 11th row of xlsx file into a matrix of row size 1
  temp[i,] <- mat ## storing data from above matrix into matrix temp
}

The contents of temp (a matrix object) were then stored in a data frame object before being written to a fresh excel file (this is necessary). For writing operations in excel, I used functions from the XLConnect library. The workbook had to be loaded using loadWorkbook() with the create argument set to TRUE to indicate that a fresh excel file was being created. If I had to write to an already existing file, the argument would have been set FALSE . I created a worksheet and named it ANC3 (that had got something to do with the data). startRow and startCol are set to the location where we want the data to begin from.

df <- as.data.frame(temp) ## storing contents of temp matrix as data frame
wb <- loadWorkbook("C:/Anirudh/Coding/R/Agra/Outliers-Agra.xlsx", create = TRUE) # new workbook
createSheet(wb, name = "ANC3") ## create worksheet / tab with name "ANC3"
writeWorksheet(wb, df, sheet = "ANC3", startRow = 2, startCol = 2, header = FALSE)

Next, the data in the new excel file, Outliers-Agra.xlsx was given appropriate column names (colnames) and row names (months).

colnames <- read.xlsx(files[i],sheetIndex = 1, rowIndex = 6, colIndex = 5:22, header = FALSE)
colnames <- as.data.frame(as.matrix(colnames))
months <- as.data.frame(matrix(c("April", "May", "June", "July", "August", "September","October", "November","December", "January", "February", "March")))
writeWorksheet(wb, colnames, sheet = "ANC3", startRow = 1, startCol = 2, header = FALSE)
writeWorksheet(wb, months, sheet = "ANC3", startRow = 2, startCol = 1, header = FALSE)

Lastly, the workbook needed to be saved.

saveWorkbook(wb)

After executing the R script, the new workbook Outliers-Agra is added to the parent directory.

Agra Directory with Agra.R and Outliers-Agra

So this is how some basic Excel work can be managed using R. Since these were only 12 files, writing all this code to perform simple copy and paste functions might seem like an awful lot to go through. But the data in this example was for one Indian district  (Agra) — and if you had to do the same thing thousands of times covering hundreds of districts (directories), you’d thank this gem of free software known as R, for what it can accomplish.

Featured Image: https://xkcd.com/1445/