How to become a Data Scientist in 6 months

Disclaimer: I’m not a data scientist yet. That’s still work in progress, but I’d recommend this excellent talk given by  Tetiana Ivanova to put an enthusiast’s data science journey in perspective.


MITx 15.071x (Analytics Edge) – 2016

I am auditing this course currently and just completed its 2nd assignment. It’s probably one of the best courses out there to learn R in a way that you go beyond the syntax with an objective in mind – to do analytics and run machine learning algorithms to derive insight from data. This course is different from machine learning courses by say, Andrew Ng in that this course won’t focus on coding the algorithm and rather would emphasize on diving right into the implementation of those algorithms using libraries that the R programming language already equips us with.

Take a look at the course logistics. And hey, they’ve got a Kaggle competition!


There’s still time to enroll and grab a certificate (or simply audit). The course is offered once a year. I met a bunch of people who did well at a data hackathon I had gone to recently, who had learned the ropes in data science thanks to Analytics Edge.

Getting Started

I have been searching for good MOOCs to get me started with R and Python programming languages. I’ve already begun the Johns Hopkins University Data Science Specialization on Coursera. It consists of 9 courses (including Data Scientist’s Toolbox, R programming, Getting and Cleaning Data, Exploratory Data Analysis, Reproducible Research, Statistical Inference, Regression Models, Practical Machine Learning and Developing Data Products), ending with a 7-week Capstone Project that I’m MOST excited about. I want to get there fast.

The Capstone would consist of :
  • Building a predictive data model for analyzing large textual data sets
  • Cleaning real-world data and perform complex regressions
  • Creating visualizations to communicate data analyses
  • Building a final data product in collaboration with SwiftKey, award-winning developer of leading keyboard apps for smartphones

I started with the R programming course where I found the programming assignments to be moderately difficult. They were good practice and also time-consuming for me since I haven’t yet gotten used to the R syntax, which is supposedly unintuitive. Anyway, I completed the course with distinction (90+ marks) scoring 95 on 100, losing 5 because I hadn’t familiarized myself with Git / GitHub. I did this course for a verified certificate, which cost me $29, and looks like this:

Coursera rprog 2015

I won’t be paying for any of the remaining courses though, but still will get a certificate of accomplishment for each course I pass. I have alredy begun with Getting and Cleaning Data and Data Scientist’s Toolbox.

I checked today, and it seems Andrew Ng’s Machine Learning course has gone open to all and is self-paced. A lot of people have gone on to participate in Kaggle competitions with what they learnt in his course, so I’d like to experience it — even though it’s taught with Octave / MATLAB. My very short term goal is to start participating in these competitions ASAP.

Kaggle Competitions

I will be learning the basics of Git this week and along with that, about reading from MySQL, HDF5, the web and APIs. I intend to start reading Trevor Hastie’s highly recommended book, Introduction to Statistical Learning.

ISL Cover 2


Meanwhile, I need to get started with Git and GitHub too, and I found a very useful blog by Kevin Markham and his short concise videos are great introductory material.

Incidentally, I was in a dilemma whether to start with Hastie’s material or Andrew Ng’s course first. This is what Kevin had to say

Hastie or Ng

The only reason I have reservations against Andrew Ng’s course is that its instruction isn’t in R or Python. Also, CTO and co-founder of Kaggle, Ben Hamner mentions here how useful R and Python are vis-à-vis Matlab / Octave.

Ben Hamner on Python R Matlab v2