MITx: 6.008.1x Computational Probability and Inference

I got really interested in Computational Probability and Inference (6.008.1x) for the following reasons:

  1. I love probability and have solved countless problems on probability ever since I learned math
  2. …and yet I’ve never coded up probabilistic models!
  3. The assignments and project work for this course are to be implemented in Python!

You don’t need to have prior experience in either probability or inference, but you should be comfortable with basic Python programming and calculus.

WHAT YOU’LL LEARN
– Basic discrete probability theory
– Graphical models as a data structure for representing probability distributions
– Algorithms for prediction and inference
– How to model real-world problems in terms of probabilistic inference

The course started on September 12, is 12 weeks long and is structured in the following manner:

Week 1 (9/12 – 9/16): Introduction to probability and computation
A first look at basic discrete probability, how to interpret it, what probability spaces and random variables are, and how to code these up and do basic simulations and visualizations.

Week 2 (9/19 – 9/23): Incorporating observations
Incorporating observations using jointly distributed random variables and using events. Three classic probability puzzles are presented to help elucidate how to interpret probability: Simpson’s paradox, Monty Hall, boy or girl paradox.

Week 3 (9/26 – 9/30): Introduction to inference, structure in distributions, and information measures
The product rule and inference with Bayes’ theorem. Independence: A structure in distributions. Measures of randomness: entropy and information divergence. Mutual information.

Week 4 (10/3 – 10/7): Expectations, and driving to infinity in modeling uncertainty
Expected values of random variables. Classic puzzle: the two envelope problem. Probability spaces and random variables that take on a countably infinite number of values and inference with these random variables.

Week 5 (10/10 – 10/14): Efficient representations of probability distributions on a computer
Introduction to undirected graphical models as a data structure for representing probability distributions and the benefits/drawbacks of these graphical models. Incorporating observations with graphical models.

Week 6 (10/17 – 10/21): Inference with graphical models, part I
Computing marginal distributions in undirected graphical models, including hidden Markov models.

Week 7 (10/24 – 10/28): Inference with graphical models, part II
Computing most probable configurations with graphical models including hidden Markov models.

Week 8 (10/31 – 11/4): Introduction to learning probability distributions
Learning an underlying unknown probability distribution from observations using maximum likelihood. Three examples: estimating the bias of a coin, the German tank problem, and email spam detection.

Week 9 (11/7 – 11/11): Parameter estimation in graphical models
Given the graph structure of an undirected graphical model, we examine how to estimate all the tables associated with the graphical model.

Week 10 (11/14 – 11/18): Model selection with information theory
Learning both the graph structure and the tables of an undirected graphical model with the help of information theory. Mutual information of random variables.

Week 11 (11/21 – 11/25): Final project
Final project assigned

Week 12 (11/28 – 12/2): Final project
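
To get a feel for the "code these up and do basic simulations" part of the syllabus, here is a minimal sketch of one of the Week 2 puzzles, Monty Hall. It is not taken from the course material; the function name, the trial count and the fixed seed are my own assumptions, chosen only for illustration.

```python
import random


def monty_hall_win_rate(switch, trials=100_000, seed=0):
    """Estimate the chance of winning the car by simulation.

    Hypothetical helper for illustration; not part of the course code.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)   # door hiding the car
        pick = rng.randrange(3)  # contestant's first choice
        # The host opens a door that is neither the pick nor the car.
        opened = rng.choice([d for d in range(3) if d != pick and d != car])
        if switch:
            # Move to the single door that is still closed.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials


stay_rate = monty_hall_win_rate(switch=False)
switch_rate = monty_hall_win_rate(switch=True)
print(f"stay: {stay_rate:.3f}, switch: {switch_rate:.3f}")
```

With enough trials, staying should win roughly a third of the time and switching roughly two thirds, which is the counter-intuitive answer the puzzle is known for.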

 

I’m SO taking this course. Hope this interests you as well!

How to become a Data Scientist in 6 months

Disclaimer: I’m not a data scientist yet. That’s still a work in progress, but I’d recommend this excellent talk given by Tetiana Ivanova to put an enthusiast’s data science journey in perspective.

MITx 15.071x (Analytics Edge) – 2016

I am currently auditing this course and have just completed its 2nd assignment. It’s probably one of the best courses out there for learning R beyond the syntax, with an objective in mind: doing analytics and running machine learning algorithms to derive insight from data. Unlike machine learning courses such as Andrew Ng’s, this course doesn’t focus on coding the algorithms from scratch; instead, it dives right into applying them using libraries that R already provides.

Take a look at the course logistics. And hey, they’ve got a Kaggle competition!

AnalyticsEdgeLogistics

There’s still time to enroll and grab a certificate (or simply audit). The course is offered once a year. I met a bunch of people who did well at a data hackathon I had gone to recently, who had learned the ropes in data science thanks to Analytics Edge.

My First Data Science Hackathon

So after 8 months of playing around with R and Python and blog post after blog post, I found myself finally hacking away at a problem set from the 17th storey of the Hindustan Times building at Connaught Place. I had entered my first ever data science hackathon conducted by Analytics Vidhya, a pioneer in analytics learning in India. Pizzas and Pepsi were on the house. Like any predictive analysis hackathon, this one accepted unlimited entries till submission time. It ran from 2pm to 4:30pm today – 2.5 hours, of which I ended up wasting 1.5 hours trying to make my first submission, which hit submission error after submission error until the problem was finally fixed after lunch. That left me 1 hour to try my best. It wasn’t the best performance, but I thought of blogging this experience anyway, as a reminder of the work that awaits me. I want to be the one winning prize money at the end of the day.

🙂

screenshot-datahack analyticsvidhya com 2015-12-20 18-41-12

 

Statistical Learning – 2016

On January 12, 2016, Stanford University professors Trevor Hastie and Rob Tibshirani will offer the 3rd iteration of Statistical Learning, a MOOC which first began in January 2014, and has become quite a popular course among data scientists. It is a great place to learn statistical learning (machine learning) methods using the R programming language. For a quick course on R, check this out – Introduction to R Programming

Slides and videos for the Statistical Learning MOOC by Hastie and Tibshirani are available separately here. Slides and video tutorials related to this book by Abass Al Sharif can be downloaded here.

The course covers the following book which is available for free as a PDF copy.

Logistics and Effort:

statLearnEffort

Rough Outline of Schedule (based on last year’s course offering):

Week 1: Introduction and Overview of Statistical Learning (Chapters 1-2)
Week 2: Linear Regression (Chapter 3)
Week 3: Classification (Chapter 4)
Week 4: Resampling Methods (Chapter 5)
Week 5: Linear Model Selection and Regularization (Chapter 6)
Week 6: Moving Beyond Linearity (Chapter 7)
Week 7: Tree-based Methods (Chapter 8)
Week 8: Support Vector Machines (Chapter 9)
Week 9: Unsupervised Learning (Chapter 10)

Prerequisites: First courses in statistics, linear algebra, and computing.

 

MITx 6.00.2x Introduction to Computational Thinking and Data Science (Fall 2015)

MIT’s Fall 2015 iteration of 6.00.2x starts today. After an enriching learning experience with 6.00.1x, I have great expectations from this course. As the course website mildly puts it, 6.00.2x is an introduction to using computation to understand real-world phenomena. MIT OpenCourseware (OCW) mirroring the material covered in 6.00.1x and 6.00.2x can be found here.

The course follows this book by John Guttag (who happens to be one of the instructors for this course). However, purchasing the book isn’t a necessity for this course.

Introduction to Computation and Programming Using Python

One thing I loved about 6.00.1x was its dedicated Facebook group, which gave a community / classroom-peergroup feel to the course. 6.00.2x also has a Facebook group. Here’s a sneak peek:

descriptionUpdate

The syllabus and schedule for this course are shown below. The course is spread out over 2 months, which includes 7 weeks of lectures.

MITx 6.00.2x Fall 2015 Course Calendar

The prerequisites for this course are pretty much covered in this set of tutorial videos that have been created by one of the TAs for 6.00.1x. If you’ve not taken 6.00.1x in the past, you can go through these videos (running time < 1hr) to judge whether or not to go ahead with 6.00.2x.

So much for the update. Got work to do! 🙂

Funny Python

If a programming language is named after a sketch comedy troupe, one knows what to expect. Python IS a funny language with its own bag of surprises.

pythonMonty
Monty Python’s Flying Circus

For instance, if you’ve just moved from a language such as C to Python and you’re missing curly braces (how can one not want whitespaces!!), and you try this:

>>> from __future__ import braces

from __future__ import braces

Or say, if you try importing this.

>>> import this

import this
A sense of humour is required for proper interpretation

Or if you ever wanted to know why XKCD’s Cueball left Perl for Python, you should know that it was for gravity-defying stunts that he couldn’t perform anywhere else. Just import antigravity!

>>> import antigravity

You’re led to this webcomic in your browser.

import antigravity
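
If you'd rather poke at these Easter eggs from a script than from the interactive prompt, here is a minimal sketch, assuming a standard CPython 3 interpreter. The variable names are mine, and I've left `import antigravity` out on purpose because it opens a browser window.

```python
import codecs
import contextlib
import io

# `from __future__ import braces` is rejected at compile time, so compile()
# lets us catch the SyntaxError and read its message.
try:
    compile("from __future__ import braces", "<easter-egg>", "exec")
    braces_message = None
except SyntaxError as err:
    braces_message = err.msg

print(braces_message)

# `import this` prints the Zen of Python the first time it is imported.
# Silence that output here and read the text from the module instead:
# the module stores the poem ROT13-encoded in `this.s`.
with contextlib.redirect_stdout(io.StringIO()):
    import this

zen = codecs.decode(this.s, "rot13")
print(zen.splitlines()[0])
```

The first print shows Python's answer to the request for braces, and the second shows the title line of the Zen of Python.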

So the upshot is that you can get tickled and trolled by Python every now and then, keeping in line with its rich tradition of doing so (check out video below).


Comedians!

Python to the Rescue

Another journal-like entry

Programming as a profession is only moderately interesting. It can be a good job, but you could make about the same money and be happier running a fast food joint. You’re much better off using code as your secret weapon in another profession.

People who can code in the world of technology companies are a dime a dozen and get no respect. People who can code in biology, medicine, government, sociology, physics, history, and mathematics are respected and can do amazing things to advance those disciplines.

Advice from an Old Programmer

I was reading a paper today, written by MIT’s Esther Duflo, part of a homework assignment on a MOOC on development policy (Foundations of Development Policy: Advanced Development Economics) offered by Duflo and Abhijit Banerjee. So I opened the paper and started copying important lines from the PDF to a text editor to make notes. I could copy the text, but when I pasted it onto a text editor, it turned out to be gibberish (you can try it too!).

For instance, instead of pasting

Between 1973 and 1978 the Indonesian Government constructed over 61,000 primary schools throughout the country

I got:

Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw frqv wuxfwhg ryhu 94/333 sulpdu| vfkrrov wkurxjkrxw wkh frxqwu|

It was a good thing the cipher used for this text wasn’t too complicated. After some perusal, I found that ‘B’ became ‘E’, ‘e’ became ‘h’, ‘t’ became ‘w’ and so on: every letter had been shifted three places forward in the alphabet, a simple Caesar cipher. So I copied the entire content of the PDF to a text file and named the encrypted file estherDuflo.txt. I noticed that the encryption had been applied only to the first 1475 lines. The rest was plain English.

So I wrote a Python script to decrypt the gibberish, rather than simply typing out my notes. It took 20 minutes to write the code and 8 ms to execute (of course!). I didn’t want to spend a lot of time ensuring a thorough decryption, so the result isn’t perfect, but I’m going to make do. I named the decrypted file estherDufloDeciphered.txt.

Sample from the Encrypted File

5U LL*?} @?_ w@MLh @h!i| L?ti^ i?Uit Lu 5U LL*
L?t|h U|L? ? W?_L?it@G ,_i?Ui uhL4 @? N? t @* L*U)
, Tih4i?|
,t| ih # L
W
Devwudfw
Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw frqvwuxfwhg ryhu 94/333 sulpdu|
vfkrrov wkurxjkrxw wkh frxqwu|1 Wklv lv rqh ri wkh odujhvw vfkrro frqvwuxfwlrq surjudpv rq
uhfrug1 L hydoxdwh wkh hhfw ri wklv surjudp rq hgxfdwlrq dqg zdjhv e| frpelqlqj glhuhqfhv
dfurvv uhjlrqv lq wkh qxpehu ri vfkrrov frqvwuxfwhg zlwk glhuhqfhv dfurvv frkruwv lqgxfhg
e| wkh wlplqj ri wkh surjudp1 Wkh hvwlpdwhv vxjjhvw wkdw wkh frqvwuxfwlrq ri sulpdu| vfkrrov
ohg wr dq lqfuhdvh lq hgxfdwlrq dqg hduqlqjv1 Fkloguhq djhg 5 wr 9 lq 4<:7 uhfhlyhg 3145 wr
314< pruh |hduv ri hgxfdwlrq iru hdfk vfkrro frqvwuxfwhg shu 4/333 fkloguhq lq wkhlu uhjlrq
ri eluwk1 Xvlqj wkh yduldwlrqv lq vfkrrolqj jhqhudwhg e| wklv srolf| dv lqvwuxphqwdo yduldeohv
iru wkh lpsdfw ri hgxfdwlrq rq zdjhv jhqhudwhv hvwlpdwhv ri hfrqrplf uhwxuqv wr hgxfdwlrq
udqjlqj iurp 91; shufhqw wr 4319 shufhqw1 +MHO L5/ M64/ R48/ R55,
Wkh txhvwlrq ri zkhwkhu lqyhvwphqw lq lqiudvwuxfwxuh lqfuhdvhv kxpdq fdslwdo dqg uhgxfhv
sryhuw| kdv orqj ehhq d frqfhuq wr ghyhorsphqw hfrqrplvwv dqg srolf|pdnhuv1 Iru h{dpsoh/
dydlodelolw| ri vfkrrolqj lqiudvwuxfwxuh kdv ehhq vkrzq wr eh srvlwlyho| fruuhodwhg zlwk frpsohwhg
vfkrrolqj ru hquroophqw e| Qdqf| Elugvdoo +4<;8, lq xuedq Eud}lo/ Ghqqlv GhWud| +4<;;, dqg Ohh

My Code
from string import ascii_lowercase

# create decipher dictionary: each ciphertext letter is the plaintext
# letter shifted forward by 3, so map it back to the original
shifted = "".join(ascii_lowercase[(i + 3) % 26] for i in range(26))
decipher = dict(zip(shifted, ascii_lowercase))

# open and read encrypted text, dropping the trailing newline of each line
filename = 'estherDuflo.txt'
with open(filename, 'r') as f:
    lines = [line.rstrip('\n') for line in f]

# use first 1475 lines only
newlines = lines[:1475]

# apply decryption on those 1475 lines
decipheredLines = []
for line in newlines:
    x = line.lower()
    s = []
    for letter in x:
        # digits and punctuation are not in the dictionary; copy them as-is
        s.append(decipher.get(letter, letter))
    s.append('\n')
    decipheredLines.append(''.join(s))

# write deciphered text to new text file (closed automatically by `with`)
decipheredFile = 'estherDufloDeciphered.txt'
with open(decipheredFile, 'w') as df:
    df.writelines(decipheredLines)

Sample from the Decrypted File
5r ii*?} @?_ t@jie @e!f| i?qf^ f?rfq ir 5r ii*
i?q|e r|i? ? t?_i?fq@d ,_f?rf rei4 @? k? q @* i*r)
, qfe4f?|
,q| fe # i
t
abstract
between 4<:6 and 4<:;/ the indonesian government constructed over 94/333 primar|
schools throughout the countr|1 this is one of the largest school construction programs on
record1 i evaluate the eect of this program on education and wages b| combining dierences
across regions in the number of schools constructed with dierences across cohorts induced
b| the timing of the program1 the estimates suggest that the construction of primar| schools
led to an increase in education and earnings1 children aged 5 to 9 in 4<:7 received 3145 to
314< more |ears of education for each school constructed per 4/333 children in their region
of birth1 using the variations in schooling generated b| this polic| as instrumental variables
for the impact of education on wages generates estimates of economic returns to education
ranging from 91; percent to 4319 percent1 +jel i5/ j64/ o48/ o55,
the question of whether investment in infrastructure increases human capital and reduces
povert| has long been a concern to development economists and polic|makers1 for e{ample/
availabilit| of schooling infrastructure has been shown to be positivel| correlated with completed
schooling or enrollment b| nanc| birdsall +4<;8, in urban bra}il/ dennis detra| +4<;;, and lee
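
The leftovers in the decrypted sample above (`primar|` for "primary", `4<:6` where the paper says 1973, `1` where a full stop belongs) suggest that the shift was applied to every character's code point, not just to the letters. Here is a minimal sketch of a more thorough pass under that assumption. The function name `shift_back` is mine, and I'm also assuming that whitespace was left untouched, which is what the sample looks like.

```python
def shift_back(text, shift=3):
    """Shift every non-whitespace character back by `shift` code points.

    Hypothetical helper: assumes the whole text (letters, digits and
    punctuation) was shifted forward by 3 and whitespace was left alone.
    """
    return "".join(
        ch if ch.isspace() else chr(ord(ch) - shift)
        for ch in text
    )


sample = "Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw"
decoded = shift_back(sample)
print(decoded)
```

This keeps the capital letters and recovers the digits and punctuation as well. It won't bring back characters that were already lost when copying from the PDF (the "ff" in "effect", for example), and the first few garbled lines of the file don't seem to follow the same shift.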

Teach Yourself Machine Learning the Hard Way!

This formula is kick-ass!

Darshan Hegde

It has been 3 years since I steered my interests towards Machine Learning. I had just graduated from college with a Bachelor of Engineering in Electronics and Communication Engineering. Which is another way of saying that I:

  • was a toddler in programming.
  • had little or no knowledge of algorithms.
  • had studied engineering math, but it was rusty.
  • had no knowledge of modern optimization.
  • had zero knowledge of statistical inference.

I think most of this is true for many engineering graduates (especially in India!), unless you studied mathematics and computing for undergrad.

Lucky for me, I had a great mentor and a lot of online materials on these topics. This post lists many of the materials I found useful while I was learning it the hard way!

All the courses that I’m listing below have homework assignments. Make sure you work through each one of them.

1. Learn Python

If you are new to programming…

View original post 507 more words

Why Parselmouth Harry Potter is also Parsermouth Harry Potter

If you’re a Pythonista or just a coder, you may have come across this web cartoon:

Its creator Ryan Sawyer has been working as a full-time graphic designer and freelance illustrator for the past 10 years. His projects have been featured on websites such as /Film, io9, BoingBoing, Uproxx, MusicRadar, SuperPunch, IGN, and PackagingDigest.

I recently came across an interesting thread on Reddit on the origins of this cartoon. Basically, the cartoonist, and therefore Python-speaking Harry, got the code from this Stack Overflow thread of short, useful Python code snippets! Convenient, right?!

What’s funny is that the thread later got closed as it was deemed not constructive!

ParsermouthStackOverflow

The code is supposed to walk the current working directory recursively and print the number of lines in each Python source file, skipping the files in an ignore list, and then print the total SLOC. Don’t blame me though, if the code doesn’t work!

# prints recursive count of lines of python source code from current directory
# includes an ignore_list. also prints total sloc
import os

cur_path = os.getcwd()
ignore_set = set(["__init__.py", "count_sourcelines.py"])

loclist = []
for pydir, _, pyfiles in os.walk(cur_path):
    for pyfile in pyfiles:
        if pyfile.endswith(".py") and pyfile not in ignore_set:
            totalpath = os.path.join(pydir, pyfile)
            # record (line count, path relative to the starting directory)
            with open(totalpath, "r") as source:
                linecount = len(source.read().splitlines())
            loclist.append((linecount, totalpath.split(cur_path)[1]))

# print() is a function in Python 3; the original used Python 2 print statements
for linenumbercount, filename in loclist:
    print("%05d lines in %s" % (linenumbercount, filename))

print("\nTotal: %s lines (%s)" % (sum([x[0] for x in loclist]), cur_path))