**Disclaimer:** I’m not a data scientist yet. That’s still work in progress, but I’d recommend this excellent talk given by Tetiana Ivanova to put an enthusiast’s data science journey in perspective.

# Python

# My First Data Science Hackathon

So after 8 months of playing around with R and Python and blog post after blog post, I finally found myself hacking away at a problem set on the 17th storey of the *Hindustan Times* building at *Connaught Place*. I had entered my first ever data science hackathon, conducted by *Analytics Vidhya*, a pioneer in analytics learning in India. Pizzas and Pepsi were on the house. Like any predictive analysis hackathon, this one accepted unlimited entries till submission time. It ran from 2pm to 4:30pm today: 2.5 hours, of which I wasted 1.5 trying to make my first submission, which hit submission error after submission error until the problem was finally fixed post lunch. That left me 1 hour to try my best. It wasn’t the best performance, but I thought of blogging this experience anyway, as a reminder of the work that awaits me. I want to be the one winning prize money at the end of the day.

**🙂**

# Solutions to Machine Learning Programming Assignments

This post contains links to a bunch of code that I have written to complete Andrew Ng’s famous machine learning course which includes several interesting machine learning problems that needed to be solved using the Octave / Matlab programming language. I’m not sure I’d ever be programming in Octave after this course, but learning Octave just so that I could complete this course seemed worth the time and effort. I would usually work on the programming assignments on Sundays and spend several hours coding in Octave, telling myself that I would later replicate the exercises in **Python**.

If you’ve taken this course and found some of the assignments hard to complete, I think it might not hurt to go check online how a particular function was implemented. If you end up copying the entire code, it’s probably your loss in the long run. But then John Maynard Keynes once said, ‘*In the long run we are all dead*’. Yeah, and we wonder why people call Economics the dismal science!

Most people disregard Coursera’s feeble attempt at reining in plagiarism by creating an **Honor Code**, precisely because this so-called code of conduct can be easily circumvented. I don’t mind posting solutions to a course’s programming assignments because GitHub is full to the brim with such content. Plus, it’s always good to read others’ code even if you implemented a function correctly. It helps you understand the different ways of tackling a given programming problem.

- ex1
- ex2
- ex3
- ex4
- ex5
- ex6
- ex7
- ex8

Enjoy!

# Spot the Difference — It’s NumPy!

My first brush with NumPy happened while writing a block of code to make a plot using **pylab**. ⇣

`pylab` is part of `matplotlib` (in `matplotlib.pylab`) and tries to give you a MATLAB-like environment. `matplotlib` has a number of dependencies, among them `numpy`, which it imports under the common alias `np`. `scipy` is not a dependency of `matplotlib`.

I had a *tuple* of length 2 (the lows and the highs of temperature) with 31 entries in each (the number of days in the month of July), parsed from this text file:

    Boston July Temperatures
    -------------------------
    Day High Low
    ------------
    1 91 70
    2 84 69
    3 86 68
    4 84 68
    5 83 70
    6 80 68
    7 86 73
    8 89 71
    9 84 67
    10 83 65
    11 80 66
    12 86 63
    13 90 69
    14 91 72
    15 91 72
    16 88 72
    17 97 76
    18 89 70
    19 74 66
    20 71 64
    21 74 61
    22 84 61
    23 86 66
    24 91 68
    25 83 65
    26 84 66
    27 79 64
    28 72 63
    29 73 64
    30 81 63
    31 73 63

Given below are 2 sets of code that do the same thing: one **without NumPy** and the other *with* NumPy. They output the following graph using **PyLab**:

**Code without NumPy**

    import pylab

    def loadfile():
        """Read julyTemps.txt and return the lists (low, high)."""
        high = []; low = []
        with open('julyTemps.txt', 'r') as inFile:
            for line in inFile:
                fields = line.split()
                # skip the title, header and separator lines
                if len(fields) < 3 or not fields[0].isdigit():
                    pass
                else:
                    high.append(int(fields[1]))
                    low.append(int(fields[2]))
        return low, high

    def producePlot(lowTemps, highTemps):
        diffTemps = [highTemps[i] - lowTemps[i] for i in range(len(lowTemps))]
        pylab.title('Day by Day Ranges in Temperature in Boston in July 2012')
        pylab.xlabel('Days')
        pylab.ylabel('Temperature Ranges')
        return pylab.plot(range(1,32),diffTemps)

    # loadfile() returns (low, high): read the file once and pass them in that order
    low, high = loadfile()
    producePlot(low, high)
    pylab.show()

**Code with NumPy**

    import pylab
    import numpy as np

    def loadFile():
        high = []; low = []
        with open('julyTemps.txt') as inFile:
            for line in inFile:
                fields = line.split()
                # skip the title, header and separator lines
                if len(fields) != 3 or 'Boston' == fields[0] or 'Day' == fields[0]:
                    continue
                else:
                    high.append(int(fields[1]))
                    low.append(int(fields[2]))
        return (low, high)

    def producePlot(lowTemps, highTemps):
        diffTemps = list(np.array(highTemps) - np.array(lowTemps))
        pylab.plot(range(1,32), diffTemps)
        pylab.title('Day by Day Ranges in Temperature in Boston in July 2012')
        pylab.xlabel('Days')
        pylab.ylabel('Temperature Ranges')
        pylab.show()

    (low, high) = loadFile()
    producePlot(low, high)

The difference in code lies in how the variable `diffTemps` is calculated.

    diffTemps = list(np.array(highTemps) - np.array(lowTemps))

seems more readable than

    diffTemps = [highTemps[i] - lowTemps[i] for i in range(len(lowTemps))]

**Notice how straightforward it is with NumPy.** At the core of the NumPy package is the *ndarray* object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations being performed in compiled code for performance. Element-by-element operations are the “default mode” when an *ndarray* is involved, and they are executed speedily by pre-compiled C code.
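To make the element-by-element behaviour concrete, here is a minimal sketch using only the first five days of the July table. The variable names are purely illustrative and are not the ones used in my scripts.

```python
import numpy as np

# First five days of the Boston July table (highs and lows).
high_temps = [91, 84, 86, 84, 83]
low_temps = [70, 69, 68, 68, 70]

# Pure-Python version: walk the two lists in step.
diff_loop = [h - l for h, l in zip(high_temps, low_temps)]

# NumPy version: subtracting two ndarrays works element by element.
diff_numpy = np.array(high_temps) - np.array(low_temps)

print(diff_loop)            # [21, 15, 18, 16, 13]
print(diff_numpy.tolist())  # [21, 15, 18, 16, 13]
```

Both give the same numbers; the NumPy line simply says "subtract these two arrays" without spelling out the loop.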

# MITx 6.00.2x Introduction to Computational Thinking and Data Science (Fall 2015)

MIT’s Fall 2015 iteration of **6.00.2x** starts today. After an enriching learning experience with 6.00.1x, I have great expectations of this course. As the course website mildly puts it, 6.00.2x is an introduction to using computation to understand real-world phenomena. **MIT OpenCourseWare (OCW)** material mirroring what is covered in 6.00.1x and 6.00.2x can be found here.

The course follows this book by John Guttag (who happens to be one of the instructors for this course). However, purchasing the book isn’t a necessity for this course.

One thing I loved about **6.00.1x** was its dedicated Facebook group, which gave a community / classroom-peergroup feel to the course. **6.00.2x** also has a Facebook group. Here’s a sneak peek:

The **syllabus and schedule** for this course are shown below. The course is spread out over **2 months**, which includes **7 weeks of lectures.**

The **prerequisites** for this course are pretty much covered in this set of tutorial videos that have been created by one of the *TAs* for 6.00.1x. If you’ve not taken 6.00.1x in the past, you can go through these videos (running time < 1hr) to judge whether or not to go ahead with 6.00.2x.

So much for the update. Got work to do! **🙂**

# Funny Python

If a programming language is named after a sketch comedy troupe, one knows what to expect. Python **IS** a funny language with its own bag of surprises.

For instance, if you’ve just moved from a language such as C to Python and you’re missing curly braces (how can one *not want* whitespace!!), try this and watch Python refuse with `SyntaxError: not a chance`:

`>>> from __future__ import braces`

Or, say, try importing *this*, which prints the Zen of Python:

`>>> import this`

Or if you ever wanted to know why XKCD’s *Cueball* left **Perl** for **Python**, you should know that it was for gravity-defying stunts that he couldn’t perform anywhere else. Just import antigravity!

`>>> import antigravity`

You’re led to this webcomic on your browser.
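If you would rather see the braces joke without opening an interactive prompt, here is a small sketch that compiles the same import statement from a string and catches the error that CPython raises. The file name `<demo>` is just a placeholder I made up.

```python
# Compile the statement from a string so the SyntaxError can be caught
# at run time instead of stopping the whole script.
try:
    compile("from __future__ import braces", "<demo>", "exec")
    message = None
except SyntaxError as err:
    message = err.msg

print(message)  # on CPython this prints: not a chance
```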

So the upshot is that you can get tickled and trolled by Python every now and then, keeping in line with its rich tradition of doing so (check out video below).

**Comedians!**

# Python to the Rescue

**Another journal-like entry**

Programming as a profession is only moderately interesting. It can be a good job, but you could make about the same money and be happier running a fast food joint. You’re much better off using code as your secret weapon in another profession.

People who can code in the world of technology companies are a dime a dozen and get no respect. People who can code in biology, medicine, government, sociology, physics, history, and mathematics are respected and can do amazing things to advance those disciplines.

I was reading a paper today, written by MIT’s *Esther Duflo*, as part of a homework assignment on a MOOC on **development policy** (Foundations of Development Policy: Advanced Development Economics) offered by *Duflo* and *Abhijit Banerjee*. So I opened the paper and started copying important lines from the PDF to a text editor to make notes. I could copy the text, but when I pasted it into a text editor, it turned out to be gibberish (you can try it too!).

For instance, instead of pasting

*Between 1973 and 1978 the Indonesian Government constructed over 61,000 primary schools throughout the country*

I got:

*Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw frqvwuxfwhg ryhu 94/333 sulpdu| vfkrrov wkurxjkrxw wkh frxqwu|*

It was a good thing the cipher used for this text wasn’t too complicated. After some perusal, I found that *‘B’* became *‘E’*, *‘e’* became *‘h’*, *‘t’* became *‘w’* and so on: every letter had been shifted forward by three places. So I copied the entire content of the PDF to a text file and named the encrypted file *estherDuflo.txt*. I noticed that the encryption had been applied only to the first 1475 lines. The rest was plain English.

So I wrote a Python script to decrypt the gibberish, rather than simply typing out my notes. It took 20 minutes to write the code and 8 ms to execute (of course!). I didn’t want to spend a lot of time ensuring a thorough decryption, so the result wasn’t perfect, but I’m going to make do. I named the decrypted file *estherDufloDeciphered.txt*.

**Sample from the Encrypted File**

    5U LL*?} @?_ w@MLh @h!i| L?ti^ i?Uit Lu 5U LL*
    L?t|h U|L? ? W?_L?it@G ,_i?Ui uhL4 @? N? t @* L*U)
    , Tih4i?|
    ,t| ih # L
    W
    Devwudfw
    Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw frqvwuxfwhg ryhu 94/333 sulpdu|
    vfkrrov wkurxjkrxw wkh frxqwu|1 Wklv lv rqh ri wkh odujhvw vfkrro frqvwuxfwlrq surjudpv rq
    uhfrug1 L hydoxdwh wkh hhfw ri wklv surjudp rq hgxfdwlrq dqg zdjhv e| frpelqlqj glhuhqfhv
    dfurvv uhjlrqv lq wkh qxpehu ri vfkrrov frqvwuxfwhg zlwk glhuhqfhv dfurvv frkruwv lqgxfhg
    e| wkh wlplqj ri wkh surjudp1 Wkh hvwlpdwhv vxjjhvw wkdw wkh frqvwuxfwlrq ri sulpdu| vfkrrov
    ohg wr dq lqfuhdvh lq hgxfdwlrq dqg hduqlqjv1 Fkloguhq djhg 5 wr 9 lq 4<:7 uhfhlyhg 3145 wr
    314< pruh |hduv ri hgxfdwlrq iru hdfk vfkrro frqvwuxfwhg shu 4/333 fkloguhq lq wkhlu uhjlrq
    ri eluwk1 Xvlqj wkh yduldwlrqv lq vfkrrolqj jhqhudwhg e| wklv srolf| dv lqvwuxphqwdo yduldeohv
    iru wkh lpsdfw ri hgxfdwlrq rq zdjhv jhqhudwhv hvwlpdwhv ri hfrqrplf uhwxuqv wr hgxfdwlrq
    udqjlqj iurp 91; shufhqw wr 4319 shufhqw1 +MHO L5/ M64/ R48/ R55,
    Wkh txhvwlrq ri zkhwkhu lqyhvwphqw lq lqiudvwuxfwxuh lqfuhdvhv kxpdq fdslwdo dqg uhgxfhv
    sryhuw| kdv orqj ehhq d frqfhuq wr ghyhorsphqw hfrqrplvwv dqg srolf|pdnhuv1 Iru h{dpsoh/
    dydlodelolw| ri vfkrrolqj lqiudvwuxfwxuh kdv ehhq vkrzq wr eh srvlwlyho| fruuhodwhg zlwk frpsohwhg
    vfkrrolqj ru hquroophqw e| Qdqf| Elugvdoo +4<;8, lq xuedq Eud}lo/ Ghqqlv GhWud| +4<;;, dqg Ohh

**My Code**

    from string import ascii_lowercase

    # create decipher dictionary: each letter maps back three places ('d' -> 'a')
    alphabet = ascii_lowercase
    shifted = "".join([alphabet[(i+3)%26] for i in range(len(alphabet))])
    decipher = dict(zip(shifted, alphabet))

    # open and read encrypted text
    filename = 'estherDuflo.txt'
    f = open(filename, 'r')
    lines = f.readlines()
    lines = [line.rstrip('\n') for line in lines]

    # use first 1475 lines only
    newlines = lines[:1475]

    # apply decryption on those 1475 lines
    decipheredLines = []
    for line in newlines:
        x = line.lower()
        s = []
        for letter in x:
            if letter in decipher:
                s.append(decipher[letter])
            else:
                # digits and punctuation are left untouched
                s.append(letter)
        s.append('\n')
        decipheredLines.append(''.join(s))

    # write deciphered text to new text file
    decipheredFile = 'estherDufloDeciphered.txt'
    df = open(decipheredFile, 'w')
    for line in decipheredLines:
        df.write("%s" % line)

    # close both text files
    f.close()
    df.close()

**Sample from the Decrypted File**

    5r ii*?} @?_ t@jie @e!f| i?qf^ f?rfq ir 5r ii*
    i?q|e r|i? ? t?_i?fq@d ,_f?rf rei4 @? k? q @* i*r)
    , qfe4f?|
    ,q| fe # i
    t
    abstract
    between 4<:6 and 4<:;/ the indonesian government constructed over 94/333 primar|
    schools throughout the countr|1 this is one of the largest school construction programs on
    record1 i evaluate the eect of this program on education and wages b| combining dierences
    across regions in the number of schools constructed with dierences across cohorts induced
    b| the timing of the program1 the estimates suggest that the construction of primar| schools
    led to an increase in education and earnings1 children aged 5 to 9 in 4<:7 received 3145 to
    314< more |ears of education for each school constructed per 4/333 children in their region
    of birth1 using the variations in schooling generated b| this polic| as instrumental variables
    for the impact of education on wages generates estimates of economic returns to education
    ranging from 91; percent to 4319 percent1 +jel i5/ j64/ o48/ o55,
    the question of whether investment in infrastructure increases human capital and reduces
    povert| has long been a concern to development economists and polic|makers1 for e{ample/
    availabilit| of schooling infrastructure has been shown to be positivel| correlated with completed
    schooling or enrollment b| nanc| birdsall +4<;8, in urban bra}il/ dennis detra| +4<;;, and lee
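Looking at the leftovers in the decrypted sample (`|` where a `y` should be, `1` where a full stop should be, `4<:6` for 1973), the whole passage appears to be shifted by three code points, not just the letters. The sketch below assumes exactly that, and assumes spaces were left alone. `shift_back` is a hypothetical helper name and not part of the script I actually ran. It would not help with the first few title lines, which look like they were garbled in some other way.

```python
def shift_back(text, shift=3):
    """Move every non-whitespace character back by `shift` code points."""
    return "".join(
        ch if ch.isspace() else chr(ord(ch) - shift)
        for ch in text
    )

sample = "Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw"
print(shift_back(sample))
# Between 1973 and 1978, the Indonesian Government
```

Unlike the letter-only dictionary, this keeps the capital letters and also recovers digits and punctuation, at the cost of trusting that the shift really is uniform.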

# Karatsuba Multiplication Algorithm – Python Code

**Motivation for this blog post**

I’ve enrolled in Stanford Professor *Tim Roughgarden’s* Coursera MOOC on the **design and analysis of algorithms**, and while he covers the theory and intuition behind the algorithms in a surprising amount of detail, we’re left to implement them in a programming language of our choice.

**And I’m going to post Python code for all the algorithms covered during the course!**

**The Karatsuba Multiplication Algorithm**

Karatsuba’s algorithm reduces the multiplication of two *n*-digit numbers to at most $n^{\log_2 3} \approx n^{1.585}$ single-digit multiplications in general (and exactly $n^{\log_2 3}$ when *n* is a power of 2). Although the familiar **grade school algorithm** for multiplying numbers is how we work through multiplication in our day-to-day lives, it’s slower ($n^2$ single-digit multiplications) in comparison, but only on a computer, of course!

Here’s how the **grade school algorithm** looks:

*(The following slides have been taken from Tim Roughgarden’s notes. They serve as a good illustration. I hope he doesn’t mind my sharing them.)*

…and this is how **Karatsuba Multiplication** works on the same problem:

**A More General Treatment**

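Let $x$ and $y$ be represented as $n$-digit strings in some base $B$. For any positive integer $m$ less than $n$, one can write the two given numbers as

$$x = x_1 B^m + x_0, \qquad y = y_1 B^m + y_0,$$

where $x_0$ and $y_0$ are less than $B^m$. The product is then

$$xy = (x_1 B^m + x_0)(y_1 B^m + y_0) = z_2 B^{2m} + z_1 B^m + z_0,$$

where

$$z_2 = x_1 y_1, \qquad z_1 = x_1 y_0 + x_0 y_1, \qquad z_0 = x_0 y_0.$$

These formulae require four multiplications, and were known to Charles Babbage. **Karatsuba** observed that $xy$ can be computed in only three multiplications, at the cost of a few extra additions. With $z_0$ and $z_2$ as before we can calculate

$$z_1 = (x_1 + x_0)(y_1 + y_0) - z_2 - z_0,$$

which holds since

$$z_1 = x_1 y_0 + x_0 y_1 = (x_1 + x_0)(y_1 + y_0) - x_1 y_1 - x_0 y_0.$$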

A more efficient implementation of Karatsuba multiplication can be set as $xy = (b^2 + b)\,x_1 y_1 - b\,(x_1 - x_0)(y_1 - y_0) + (b + 1)\,x_0 y_0$, where $b = B^m$.

**Example**

To compute the product of 12345 and 6789, choose $B = 10$ and $m = 3$. Then we decompose the input operands using the resulting base ($B^m = 1000$), as:

- 12345 = **12** · 1000 + **345**
- 6789 = **6** · 1000 + **789**

Only three multiplications, which operate on smaller integers, are used to compute three partial results:

- $z_2$ = **12** × **6** = 72
- $z_0$ = **345** × **789** = 272205
- $z_1$ = (**12** + **345**) × (**6** + **789**) − $z_2$ − $z_0$ = 357 × 795 − 72 − 272205 = 283815 − 72 − 272205 = 11538

We get the result by just adding these three partial results, shifted accordingly (and then taking carries into account by decomposing these three inputs in base 1000 like for the input operands):

- result = $z_2 \cdot B^{2m} + z_1 \cdot B^m + z_0$, i.e.
- result = 72 · $1000^2$ + 11538 · 1000 + 272205 = **83810205**.
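The arithmetic in this example can be replayed in a few lines of Python. This is only a sketch of a single Karatsuba step for this one pair of numbers, with `B`, `m` and the `z` names borrowed from the example; the three products use Python's built-in `*` and are not themselves computed recursively.

```python
B, m = 10, 3
x, y = 12345, 6789

# Split each operand into a high and a low part around B**m.
x1, x0 = divmod(x, B**m)   # 12, 345
y1, y0 = divmod(y, B**m)   # 6, 789

# Only three multiplications are needed.
z2 = x1 * y1
z0 = x0 * y0
z1 = (x1 + x0) * (y1 + y0) - z2 - z0

result = z2 * B**(2 * m) + z1 * B**m + z0
print(z2, z1, z0)   # 72 11538 272205
print(result)       # 83810205
```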

**Pseudocode and Python code**

    procedure karatsuba(num1, num2)
        if (num1 < 10) or (num2 < 10)
            return num1*num2

        /* calculates the size of the numbers */
        m = max(size_base10(num1), size_base10(num2))
        m2 = floor(m/2)

        /* split the digit sequences about the middle */
        high1, low1 = split_at(num1, m2)
        high2, low2 = split_at(num2, m2)

        /* 3 calls made to numbers approximately half the size */
        z0 = karatsuba(low1, low2)
        z1 = karatsuba((low1+high1), (low2+high2))
        z2 = karatsuba(high1, high2)

        return (z2*10^(2*m2)) + ((z1-z2-z0)*10^(m2)) + (z0)

Here’s the Python version:

    def karatsuba(x, y):
        """Function to multiply 2 numbers in a more efficient manner than the grade school algorithm"""
        if len(str(x)) == 1 or len(str(y)) == 1:
            return x * y
        else:
            n = max(len(str(x)), len(str(y)))
            nby2 = n // 2  # integer division, so the split point stays an int

            a = x // 10**(nby2)
            b = x % 10**(nby2)
            c = y // 10**(nby2)
            d = y % 10**(nby2)

            ac = karatsuba(a, c)
            bd = karatsuba(b, d)
            ad_plus_bc = karatsuba(a + b, c + d) - ac - bd

            # this little trick, writing n as 2*nby2 takes care of both even and odd n
            prod = ac * 10**(2 * nby2) + (ad_plus_bc * 10**nby2) + bd

            return prod
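A quick way to gain some confidence in a recursive routine like this is to compare it with Python's built-in multiplication on random inputs. The sketch below repeats the function as a Python 3 port (using `//` via `divmod`) so that it runs on its own. The harness around it, including the seed and the size of the inputs, is my own assumption and not part of the course material.

```python
import random

def karatsuba(x, y):
    """Multiply two non-negative integers with Karatsuba's three-product trick."""
    if x < 10 or y < 10:
        return x * y
    n = max(len(str(x)), len(str(y)))
    nby2 = n // 2
    a, b = divmod(x, 10**nby2)   # high and low halves of x
    c, d = divmod(y, 10**nby2)   # high and low halves of y
    ac = karatsuba(a, c)
    bd = karatsuba(b, d)
    ad_plus_bc = karatsuba(a + b, c + d) - ac - bd
    return ac * 10**(2 * nby2) + ad_plus_bc * 10**nby2 + bd

random.seed(0)  # fixed seed so the run is repeatable
for _ in range(1000):
    x = random.randrange(10**30)
    y = random.randrange(10**30)
    assert karatsuba(x, y) == x * y

print(karatsuba(12345, 6789))  # 83810205
```

If any of the assertions failed, the loop would stop with an `AssertionError` on the first mismatching pair.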

# Teach Yourself Machine Learning the Hard Way!

This formula is kick-ass!

It has been 3 years since I steered my interests towards Machine Learning. I had just graduated from college with a Bachelor of Engineering in Electronics and Communication Engineering. Which is another way of saying that I:

- was a toddler in programming.
- had little or no knowledge of algorithms.
- had studied engineering math, but it was rusty.
- had no knowledge of modern optimization.
- had zero knowledge of statistical inference.

I think most of this is true for many engineering graduates (especially in India!), unless you studied mathematics and computing for undergrad.

Lucky for me, I had a great mentor and a lot of online materials on these topics. This post lists many of the materials I found useful while I was learning it the hard way!

All the courses that I’m listing below have homework assignments. Make sure you work through each one of them.

**1. Learn Python**

If you are new to programming…


# Why Parselmouth Harry Potter is also Parsermouth Harry Potter

If you’re a *Pythonista* or just a coder, you may have come across this web cartoon:

Its creator Ryan Sawyer has been working as a full-time graphic designer and freelance illustrator for the past 10 years. His projects have been featured on websites such as /Film, io9, BoingBoing, Uproxx, MusicRadar, SuperPunch, IGN, and PackagingDigest.

I recently came across an interesting thread on **Reddit** on the origins of this cartoon. Basically, the cartoonist, and ergo Python-speaking Harry, got the code from this Stack Overflow thread of short, useful Python code snippets! Convenient, right?!

What’s funny is that the thread later got closed as it was deemed **not constructive**!

The code is supposed to print a recursive count of lines of Python source code from the current working directory, with an ignore list, and then print the total SLOC. **Don’t blame me though, if the code doesn’t work!**

    # prints recursive count of lines of python source code from current directory
    # includes an ignore_list. also prints total sloc
    import os

    cur_path = os.getcwd()
    ignore_set = set(["__init__.py", "count_sourcelines.py"])

    loclist = []

    for pydir, _, pyfiles in os.walk(cur_path):
        for pyfile in pyfiles:
            if pyfile.endswith(".py") and pyfile not in ignore_set:
                totalpath = os.path.join(pydir, pyfile)
                # close each file once its lines have been counted
                with open(totalpath, "r") as source:
                    linecount = len(source.read().splitlines())
                loclist.append((linecount, totalpath.split(cur_path)[1]))

    for linenumbercount, filename in loclist:
        print("%05d lines in %s" % (linenumbercount, filename))

    print("\nTotal: %s lines (%s)" % (sum([x[0] for x in loclist]), cur_path))