MITx 15.071x (Analytics Edge) – 2016

I am currently auditing this course and have just completed its second assignment. It’s probably one of the best courses out there for learning R in a way that goes beyond syntax, with an objective in mind: doing analytics and running machine learning algorithms to derive insight from data. It differs from machine learning courses such as Andrew Ng’s in that it doesn’t focus on coding the algorithms; instead, it emphasizes diving right into applying them using libraries that the R ecosystem already provides.

Take a look at the course logistics. And hey, they’ve got a Kaggle competition!

[Image: Analytics Edge course logistics]

There’s still time to enroll and grab a certificate (or simply audit). The course is offered once a year. At a data hackathon I attended recently, I met a bunch of people who did well and who had learned the ropes of data science thanks to Analytics Edge.

MITx 6.00.2x Introduction to Computational Thinking and Data Science (Fall 2015)

MIT’s Fall 2015 iteration of 6.00.2x starts today. After an enriching learning experience with 6.00.1x, I have great expectations of this course. As the course website mildly puts it, 6.00.2x is an introduction to using computation to understand real-world phenomena. MIT OpenCourseWare (OCW) material mirroring what is covered in 6.00.1x and 6.00.2x can be found here.

The course follows this book by John Guttag (who happens to be one of the instructors for this course). However, purchasing the book isn’t a necessity for this course.

Introduction to Computation and Programming Using Python

One thing I loved about 6.00.1x was its dedicated Facebook group, which gave a community / classroom-peergroup feel to the course. 6.00.2x also has a Facebook group. Here’s a sneak peek:

[Image: description update]

The syllabus and schedule for this course are shown below. The course is spread out over 2 months, which include 7 weeks of lectures.

MITx 6.00.2x Fall 2015 Course Calendar

The prerequisites for this course are pretty much covered in this set of tutorial videos that have been created by one of the TAs for 6.00.1x. If you’ve not taken 6.00.1x in the past, you can go through these videos (running time < 1hr) to judge whether or not to go ahead with 6.00.2x.

So much for the update. Got work to do! 🙂

Python to the Rescue

Another journal-like entry

Programming as a profession is only moderately interesting. It can be a good job, but you could make about the same money and be happier running a fast food joint. You’re much better off using code as your secret weapon in another profession.

People who can code in the world of technology companies are a dime a dozen and get no respect. People who can code in biology, medicine, government, sociology, physics, history, and mathematics are respected and can do amazing things to advance those disciplines.

Advice from an Old Programmer

I was reading a paper today, written by MIT’s Esther Duflo, as part of a homework assignment for a MOOC on development policy (Foundations of Development Policy: Advanced Development Economics) offered by Duflo and Abhijit Banerjee. So I opened the paper and started copying important lines from the PDF to a text editor to make notes. I could copy the text, but when I pasted it into a text editor, it turned out to be gibberish (you can try it too!).

For instance, instead of pasting

Between 1973 and 1978 the Indonesian Government constructed over 61,000 primary schools throughout the country

I got:

Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw frqv wuxfwhg ryhu 94/333 sulpdu| vfkrrov wkurxjkrxw wkh frxqwu|

It was a good thing the cipher used for this text wasn’t too complicated. After some perusal, I found that ‘B’ became ‘E’, ‘e’ became ‘h’, ‘t’ became ‘w’ and so on: every letter was shifted three places forward in the alphabet, a simple Caesar cipher. So I copied the entire content of the PDF to a text file and named the encrypted file estherDuflo.txt. I noticed that the encryption had been applied only to the first 1475 lines. The rest was plain English.

So I wrote a Python script to decrypt the gibberish, rather than simply typing out my notes. It took 20 minutes to write the code and 8 ms to execute (of course!). I didn’t want to spend a lot of time ensuring a thorough decryption, so the result isn’t perfect, but I’m going to make do. I named the decrypted file estherDufloDeciphered.txt.

Sample from the Encrypted File

5U LL*?} @?_ w@MLh @h!i| L?ti^ i?Uit Lu 5U LL*
L?t|h U|L? ? W?_L?it@G ,_i?Ui uhL4 @? N? t @* L*U)
, Tih4i?|
,t| ih # L
W
Devwudfw
Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw frqvwuxfwhg ryhu 94/333 sulpdu|
vfkrrov wkurxjkrxw wkh frxqwu|1 Wklv lv rqh ri wkh odujhvw vfkrro frqvwuxfwlrq surjudpv rq
uhfrug1 L hydoxdwh wkh hhfw ri wklv surjudp rq hgxfdwlrq dqg zdjhv e| frpelqlqj glhuhqfhv
dfurvv uhjlrqv lq wkh qxpehu ri vfkrrov frqvwuxfwhg zlwk glhuhqfhv dfurvv frkruwv lqgxfhg
e| wkh wlplqj ri wkh surjudp1 Wkh hvwlpdwhv vxjjhvw wkdw wkh frqvwuxfwlrq ri sulpdu| vfkrrov
ohg wr dq lqfuhdvh lq hgxfdwlrq dqg hduqlqjv1 Fkloguhq djhg 5 wr 9 lq 4<:7 uhfhlyhg 3145 wr
314< pruh |hduv ri hgxfdwlrq iru hdfk vfkrro frqvwuxfwhg shu 4/333 fkloguhq lq wkhlu uhjlrq
ri eluwk1 Xvlqj wkh yduldwlrqv lq vfkrrolqj jhqhudwhg e| wklv srolf| dv lqvwuxphqwdo yduldeohv
iru wkh lpsdfw ri hgxfdwlrq rq zdjhv jhqhudwhv hvwlpdwhv ri hfrqrplf uhwxuqv wr hgxfdwlrq
udqjlqj iurp 91; shufhqw wr 4319 shufhqw1 +MHO L5/ M64/ R48/ R55,
Wkh txhvwlrq ri zkhwkhu lqyhvwphqw lq lqiudvwuxfwxuh lqfuhdvhv kxpdq fdslwdo dqg uhgxfhv
sryhuw| kdv orqj ehhq d frqfhuq wr ghyhorsphqw hfrqrplvwv dqg srolf|pdnhuv1 Iru h{dpsoh/
dydlodelolw| ri vfkrrolqj lqiudvwuxfwxuh kdv ehhq vkrzq wr eh srvlwlyho| fruuhodwhg zlwk frpsohwhg
vfkrrolqj ru hquroophqw e| Qdqf| Elugvdoo +4<;8, lq xuedq Eud}lo/ Ghqqlv GhWud| +4<;;, dqg Ohh

My Code
from string import ascii_lowercase
# create decipher dictionary: each cipher letter maps back 3 places (d -> a, e -> b, ...)
l = ascii_lowercase
decipher = "".join([l[(i+3)%26] for i in range(len(l))])
decipher = dict(zip(decipher, l))
# open and read encrypted text, dropping the trailing newline of each line
filename = 'estherDuflo.txt'
with open(filename, 'r') as f:
    lines = [line.rstrip('\n') for line in f]
# use first 1475 lines only
newlines = lines[:1475]
# apply decryption on those 1475 lines (letters only; everything is lowercased)
decipheredLines = []
for line in newlines:
    x = line.lower()
    s = []
    for letter in x:
        if letter in decipher:
            s.append(decipher[letter])
        else:
            # digits, punctuation and spaces are left untouched
            s.append(letter)
    s.append('\n')
    decipheredLines.append(''.join(s))
# write deciphered text to new text file (the file is closed automatically)
decipheredFile = 'estherDufloDeciphered.txt'
with open(decipheredFile, 'w') as df:
    for line in decipheredLines:
        df.write("%s" % line)

Sample from the Decrypted File
5r ii*?} @?_ t@jie @e!f| i?qf^ f?rfq ir 5r ii*
i?q|e r|i? ? t?_i?fq@d ,_f?rf rei4 @? k? q @* i*r)
, qfe4f?|
,q| fe # i
t
abstract
between 4<:6 and 4<:;/ the indonesian government constructed over 94/333 primar|
schools throughout the countr|1 this is one of the largest school construction programs on
record1 i evaluate the eect of this program on education and wages b| combining dierences
across regions in the number of schools constructed with dierences across cohorts induced
b| the timing of the program1 the estimates suggest that the construction of primar| schools
led to an increase in education and earnings1 children aged 5 to 9 in 4<:7 received 3145 to
314< more |ears of education for each school constructed per 4/333 children in their region
of birth1 using the variations in schooling generated b| this polic| as instrumental variables
for the impact of education on wages generates estimates of economic returns to education
ranging from 91; percent to 4319 percent1 +jel i5/ j64/ o48/ o55,
the question of whether investment in infrastructure increases human capital and reduces
povert| has long been a concern to development economists and polic|makers1 for e{ample/
availabilit| of schooling infrastructure has been shown to be positivel| correlated with completed
schooling or enrollment b| nanc| birdsall +4<;8, in urban bra}il/ dennis detra| +4<;;, and lee
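
The sample above still has undeciphered digits and punctuation: "4<:6" should read 1973, "primar|" should read "primary", and a trailing "1" stands for a full stop. Judging only from these samples, the PDF seems to shift every character's code point forward by 3, not just the letters. Here is a minimal sketch under that assumption; the function name shift_decode is made up for illustration, and I have not checked the rule against the whole file.

```python
def shift_decode(text, shift=3):
    """Move every non-whitespace character back by `shift` code points.

    Assumption (inferred from the samples, not verified on the full PDF):
    spaces and line breaks are stored unchanged, everything else is shifted
    forward by 3. Upper and lower case are preserved.
    """
    return "".join(
        # leave whitespace alone; also skip anything too small to shift back
        ch if ch.isspace() or ord(ch) < shift else chr(ord(ch) - shift)
        for ch in text
    )


sample = "Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw"
print(shift_decode(sample))
# Between 1973 and 1978, the Indonesian Government
```

The first few lines of the sample (the title block) do not seem to follow this rule and would stay garbled. Words that lost a ligature in the PDF, such as "eect" for "effect", would also need fixing by hand.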

Getting Started with R on MIT’s 14.74x (Foundations of Development Policy)

I noticed that a major grievance of many students enrolled in MIT’s latest edX course on development policy (Foundations of Development Policy: Advanced Development Economics) was that not enough was done to get them going with the R assignments. I have posted the R code for the homework of the first 2 weeks (past the deadline, of course), so that others get the hang of the level of R that might be needed to solve these assignments in the following weeks. I’m willing to help out those who need help getting up to speed with the R required for this course. For specific queries, leave your message in the comments section.

A great place to spend time learning R before taking Foundations of Development Policy (14.74x) would be another edX course that’s been getting great reviews recently: Introduction to R Programming

R Code for Homework (Week 1)

# set working directory to local directory where the data is kept
setwd("~/IGIDR/Development Economics - MIT/Homework Assignment 01")
# read the data
wb_dev_ind = read.csv("wb_dev_ind.csv")
# summarize data
summary(wb_dev_ind)
# Question 1
# What is the Mean of GDP per capita? What is the standard deviation of GDP per capita?
meanGDPperCapita = mean(wb_dev_ind$gdp_per_capita, na.rm = TRUE)
print(round(meanGDPperCapita))
sdGDPperCapita = sd(wb_dev_ind$gdp_per_capita, na.rm = TRUE)
print(round(sdGDPperCapita))
# Question 2
# What is the mean illiteracy rate across all countries? What is the standard deviation?
illiteracy_all = numeric(nrow(wb_dev_ind))
wb_dev_ind$illiteracy_all = illiteracy_all
wb_dev_ind$illiteracy_all = 100 - wb_dev_ind$literacy_all
meanIlliteracy = mean(wb_dev_ind$illiteracy_all, na.rm = TRUE)
print(round(meanIlliteracy))
sdIlliteracy = sd(wb_dev_ind$illiteracy_all, na.rm = TRUE)
print(round(sdIlliteracy))
# Question 3
# What is the mean infant mortality rate across all countries? What is the standard deviation?
meanInfantMortality = mean(wb_dev_ind$infant_mortality, na.rm = TRUE)
print(round(meanInfantMortality))
sdInfantMortality = sd(wb_dev_ind$infant_mortality, na.rm = TRUE)
print(round(sdInfantMortality))
# Question 4
# What is the mean male illiteracy rate? What is the mean female illiteracy rate?
illiteracy_male = numeric(nrow(wb_dev_ind))
wb_dev_ind$illiteracy_male = illiteracy_male
wb_dev_ind$illiteracy_male = 100 - wb_dev_ind$literacy_male
meanIlliteracyMale = mean(wb_dev_ind$illiteracy_male, na.rm = TRUE)
print(round(meanIlliteracyMale))
sdIlliteracyMale = sd(wb_dev_ind$illiteracy_male, na.rm = TRUE)
print(round(sdIlliteracyMale))
illiteracy_female = numeric(nrow(wb_dev_ind))
wb_dev_ind$illiteracy_female = illiteracy_female
wb_dev_ind$illiteracy_female = 100 - wb_dev_ind$literacy_female
meanIlliteracyFemale = mean(wb_dev_ind$illiteracy_female, na.rm = TRUE)
print(round(meanIlliteracyFemale))
sdIlliteracyFemale = sd(wb_dev_ind$illiteracy_female, na.rm = TRUE)
print(round(sdIlliteracyFemale))
# Question 5
# What are the mean, minimum, and maximum illiteracy rate among the 50 richest countries
richest50 = wb_dev_ind[order(wb_dev_ind$gdp_per_capita, decreasing = TRUE),][1:50,]
summary(richest50)
# Question 6
# What are the mean, minimum, and maximum illiteracy rate among the 50 poorest countries?
poorest50 = wb_dev_ind[order(wb_dev_ind$gdp_per_capita),][1:50,]
summary(poorest50)
# Question 7
# What are the mean, minimum, and maximum infant mortality rate among the 50 richest countries?
summary(richest50)
# Question 8
# What are the mean, minimum, and maximum infant mortality rate among the 50 poorest countries?
summary(poorest50)
# Question 9
# What is the median GDP per capita?
summary(wb_dev_ind)
# Question 10-12
# Regress the infant mortality rate on per capita GDP, and then answer questions 10-12
model1 = lm(infant_mortality ~ gdp_per_capita, data = wb_dev_ind)
summary(model1)
# Question 13
# Regress the illiteracy rate on GDP per capita. Is the coefficient on per capita GDP significantly different from zero at the 5% level?
model2 = lm(illiteracy_all ~ gdp_per_capita, data = wb_dev_ind)
summary(model2)
# Question 14
# Regress the infant mortality rate on the illiteracy rate. Graph a scatter plot of the data as well as the regression line.
model3 = lm(infant_mortality ~ illiteracy_all, data = wb_dev_ind)
summary(model3)
plot(wb_dev_ind$illiteracy_all, wb_dev_ind$infant_mortality)
abline(model3)

R Code for Homework (Week 2)

# Set working directory to local directory where the data is kept
setwd("~/IGIDR/Development Economics - MIT/Homework Assignment 02")
# read data
migueldata = read.csv("ted_miguel_worms.csv", header = TRUE)
attach(migueldata)
# Question 6
# How many observations are there per pupil? (Enter a whole number of 0 or higher)?
length(migueldata$pupid)
length(unique(migueldata$pupid))
# Question 7
# What percentage of the pupils are boys? (Answers within 0.50 percentage points of the correct answer will be accepted. For instance, 67 would be accepted if the correct answer is 67.45%)
mean(sex, na.rm = TRUE)
# Question 8
# What percentage of pupils took the deworming pill in 1998? (Answers within 0.50 percentage points of the correct answer will be accepted. For instance, 67 would be accepted if the correct answer is 67.45%)
mean(pill98, na.rm = TRUE)
# Question 9
# Was the percentage of schools assigned to treatment in 1998 greater than or less than the percentage of pupils that actually took the deworming pill in 1998?
mean(treat_sch98, na.rm = TRUE)
mean(treat_sch98, na.rm = TRUE) > mean(pill98, na.rm = TRUE) # Ans = Greater Than
# Question 10
# Which of the following variables from the dataset are dummy variables? (Check all that apply.)
summary(migueldata)
# Question 11
# Using the data, find and enter the difference in outcomes (Y: school participation) between students who took the pill and students who did not in 1998. (Enter your answer as a difference in proportions. For instance, if the proportion in one group is 0.61 and the proportion in the other group is 0.54, enter 0.07. Answers within 0.05 of the correct answer will be accepted. For instance, 0.28 would be accepted if the correct answer is 0.33.)
took_pill_98 = mean(migueldata[migueldata$pill98 == 1,]$totpar98, na.rm = TRUE)
no_pill_98 = mean(migueldata[migueldata$pill98 == 0,]$totpar98, na.rm = TRUE)
diff = took_pill_98 - no_pill_98
diff
# Question 12
# Since schools were randomly assigned to the deworming treatment group, the estimate calculated in the previous answer is an unbiased estimate of taking the pill on school attendance.
# False
# Explanation
# The estimated impact of 13 percentage points calculated in the previous answer might not be a good estimate of the effect of taking the pill. Many students in the randomly assigned treatment schools did not actually take the pills, so those who took the pills would not have been randomly selected at all. For instance, kids who attend school more anyway might have been more likely to be there when the pills were handed out, meaning that omitted variables would be correlated with taking the pill and future school attendance. This would bias the estimate upward i.e. the 13 percentage point difference might overstate the impact of deworming on attendance.
# Question 13
# Using the data, find and enter the difference in outcomes (Y: school participation) between students in treatment schools and students not in treatment schools in 1998, regardless of whether or not they actually took the pill. (Enter your answer as a difference in proportions. For instance, if the proportion in one group is 0.61 and the proportion in the other group is 0.54, enter 0.07. Answers within 0.05 of the correct answer will be accepted. For instance, 0.28 would be accepted if the correct answer is 0.33.)
in_treatment_sch = mean(migueldata[migueldata$treat_sch98 == 1,]$totpar98, na.rm = TRUE)
non_treatment_sch = mean(migueldata[migueldata$treat_sch98 == 0,]$totpar98, na.rm = TRUE)
diff_treatment_sch = in_treatment_sch - non_treatment_sch
diff_treatment_sch
# Question 14
# Using the data, calculate the difference in the probability of taking the pill given that a student was in a treatment school and the probability of taking it if a student was not in a treatment school. (Enter your answer as a difference in proportions. For instance, if the proportion in one group is 0.61 and the proportion in the other group is 0.54, enter 0.07. Answers within 0.05 of the correct answer will be accepted. For instance, 0.28 would be accepted if the correct answer is 0.33.)
pr_pill_treatment_sch = mean(migueldata[migueldata$treat_sch98 == 1,]$pill98, na.rm = TRUE)
pr_pill_no_treatment_sch = mean(migueldata[migueldata$treat_sch98 == 0,]$pill98, na.rm = TRUE)
diff_pr_pill_treatment_sch = pr_pill_treatment_sch - pr_pill_no_treatment_sch
# Question 15
# Using the data, derive the Wald Estimator of taking the pill on school attendance. (Enter your answer as a difference in proportions. For instance, if the proportion in one group is 0.61 and the proportion in the other group is 0.54, enter 0.07. Answers within 0.05 of the correct answer will be accepted. For instance, 0.28 would be accepted if the correct answer is 0.33.)
waldRatio = diff_treatment_sch/diff_pr_pill_treatment_sch
waldRatio
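
Questions 13 to 15 build up the Wald estimator step by step, so it may help to see the ratio written out. The symbols are my own shorthand, not the course's notation: Y is school participation (totpar98), D is whether the pupil took the pill (pill98), and Z is whether the pupil's school was assigned to treatment (treat_sch98). A bar denotes the sample mean within the indicated group.

```latex
\hat{\beta}_{\text{Wald}}
  = \frac{\bar{Y}_{Z=1} - \bar{Y}_{Z=0}}
         {\bar{D}_{Z=1} - \bar{D}_{Z=0}}
```

The numerator is diff_treatment_sch from Question 13 and the denominator is diff_pr_pill_treatment_sch from Question 14, which is exactly the division stored in waldRatio.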

I hope this helps!

MOOC Review: Introduction to Computer Science and Programming Using Python (6.00.1x)

I enrolled in Introduction to Computer Science and Programming Using Python with the primary objective of learning to code using Python. This course, as the name suggests, is about more than just Python. It uses Python as a tool to teach computational thinking and serves as an introduction to computer science. The fact that it is a course offered by MIT makes it special.

As a matter of fact, this course is aimed at students with little or no prior programming experience who feel the need to understand computational approaches to problem solving. Eric Grimson (also Chancellor of MIT) is an excellent teacher, and he delves into the subject matter in a surprising amount of detail.

The video lectures are based on select chapters from an excellent book by John Guttag. While the book isn’t mandatory for the course (the video lectures do a great job of explaining the material on their own), I benefited greatly from reading the textbook. There are a couple of instances where the code isn’t presented properly in the slides (typos or indentation gone wrong when pasting code to the slides), but the correct code / study material can be found in the textbook. Also, for explanations that are more in-depth, the book comes in handy.

Introduction to Computation and Programming Using Python

MIT offers this course in 2 parts via edX. While 6.00.1x is an introduction to computer science as a tool to solve real-world analytical problems, 6.00.2x is an introduction to computation in data science. For a general look and feel of the course, this OCW link may be a good starting point. It contains material including video lectures and problem sets that are closely related to 6.00.1x and 6.00.2x.

Each week’s material in 6.00.1x consists of 2 topics, followed by a Problem Set. Problem Sets account for 40% of your grade. Video lectures are followed by finger exercises that can be attempted any number of times. Finger exercises account for 10% of your grade. The Quiz (kind of like a mid-term exam) and the Final Exam account for 25% each. The course runs for 8 weeks and covers the following topics (along with corresponding readings from John Guttag’s textbook).

[Image: course structure up to the Quiz]

[Image: course structure up to the Final Exam]

From the questions posted on the forums, it was apparent that the section of this course most people found challenging was efficiency and orders of growth, in particular Big-O asymptotic notation and problems on algorithmic complexity.

Lectures on Classes, Inheritance and Object Oriented Programming (OOP) were covered really well in over 100 minutes of video time. I enjoyed the problem set that followed, requiring the student to build an Internet news filter alerting the user when it noticed a news story that matched that user’s interests.

The final week had lectures on Trees, which felt hurried compared to the depth of detail the instructor had gone into when explaining concepts in previous weeks. However, this material was covered quite well in Guttag’s textbook and the code for tree search algorithms was provided for perusal as part of the courseware.

At the end of the course, there were some interesting add-on videos to tickle the curiosity of the learner on the applications of computation in diverse fields such as medicine, robotics, databases and 3D graphics.

The Wiki tab for this course (in the edX platform) is laden with useful links to complement each week of lectures. I never got around to reading those, but I’m going through them now, and they’re quite interesting. It’s a section that nerds would love to skim through.

I learnt a great deal from this course (and scored well too), putting in close to 6 hours a week of study. It is being offered again starting August 26, 2015. In the meantime, I’m keeping my eyes open for MIT’s data science course (6.00.2x), which is likely to be offered in October as a continuation of 6.00.1x.