Analytics Vidhya Workshop / Hackathon – Experiments with Data

This was a hackathon + workshop conducted by Analytics Vidhya in which I took part and made it to #1 on the leaderboard. The data set was straightforward and quite clean, with only a minor need for missing value treatment. This post might be useful for people who want a walk-through of the steps involved in data munging and developing machine-learned models.

screenshot-datahack.analyticsvidhya.com 2016-09-01 23-43-54

 

The workshop ended with a basic hackathon with data given on age, education, working class, occupation, marital status and gender of individuals and one had to predict the income bracket of these individuals.

I’ve posted the data and my code and solutions in this GitHub repo. An IPython Notebook has also been shared.

I approached the problem first by attempting some feature engineering (beyond missing value treatment) on the data, and then ran a basic logistic regression classifier and a random forest classifier. However, it turned out that these models performed better without the feature engineering, which suggests the dataset was already quite clean and informative for this competition.

I later attempted gradient boosting with parameter tuning to maximize scores.
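For readers who want to see the shape of that workflow, here is a minimal sketch, not the code from my repo. It assumes scikit-learn is installed and uses a synthetic dataset from make_classification as a stand-in for the hackathon data; the parameter grid and all variable names are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the hackathon data (assumption: NOT the real dataset).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two baseline classifiers, fitted with mostly default settings.
baselines = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data

# Gradient boosting with a small, illustrative parameter grid.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)
scores["gradient boosting (tuned)"] = search.score(X_test, y_test)

for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
print("best parameters:", search.best_params_)
```

On real data you would swap the synthetic X and y for your own features and labels, and probably widen the grid; the small grid here just keeps the run time short.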

scikit-learn Linear Regression Example

Here’s a quick example case for implementing one of the simplest of learning algorithms in any machine learning toolbox – Linear Regression. You can download the IPython / Jupyter notebook here so as to play around with the code and try things out yourself.

I’m doing a series of posts on scikit-learn. Its documentation is vast, so unless you’re willing to search for a needle in a haystack, you’re better off NOT jumping into the documentation right away. Instead, knowing chunks of code that do the job might help.
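As a taste of what such a chunk looks like, here is a minimal sketch of fitting a linear regression with scikit-learn. It is not the notebook from this post: the data is synthetic (generated from y = 3x + 2 plus noise), and the variable names are my own illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption, for illustration): y = 3x + 2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))  # scikit-learn expects a 2-D feature array
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)

model = LinearRegression()
model.fit(X, y)

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("R^2 on the training data:", model.score(X, y))
print("prediction at x = 4:", model.predict(np.array([[4.0]]))[0])
```

The fitted slope and intercept should land close to the 3 and 2 used to generate the data, which is a handy sanity check before moving on to a real dataset.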

Sharing IPython / Jupyter Notebooks via WordPress

In order to share (a static version of) your IPython / Jupyter notebook on your WordPress site, follow three straightforward steps.

Step 1: Let’s say your Jupyter Notebook looks like this:

blog_item_20160718_01

Open this notebook in a text editor and copy the content which may look like so:

blog_item_20160718_02

Step 2: Select all of this content (Ctrl + A) and copy it (Ctrl + C). Then create a new GitHub Gist and paste the content into it (Ctrl + V), like so:

blog_item_20160718_03

Step 3: Now simply click Create public gist and embed the gist the way you usually embed gists on WordPress, i.e., go to the HTML editor and add it like so:

blog_item_20160718_04
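If you are wondering why copying the notebook from a text editor works at all: an .ipynb file is plain JSON text. The sketch below builds a tiny notebook structure by hand and prints the JSON, which is the kind of content you see in Step 1. The single cell is made up for illustration, and the field layout follows the nbformat 4 convention as I understand it.

```python
import json

# A minimal notebook in the nbformat 4 layout; the cell content is made up.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello from a notebook')"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 0,
}

# This JSON text is what a text editor shows when you open an .ipynb file,
# and it is what gets pasted into the gist.
raw_text = json.dumps(notebook, indent=1)
print(raw_text)

# Reading the text back gives the same structure, cell by cell.
loaded = json.loads(raw_text)
for cell in loaded["cells"]:
    print(cell["cell_type"], "->", "".join(cell["source"]))
```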

I followed the exact steps that I’ve mentioned above to get the following result:

 

Google’s New Deep Learning MOOC Using TensorFlow

Deep learning has become a hot topic in machine learning over the last 3-4 years (see the chart below), and Google recently released TensorFlow (a Python-based deep learning toolkit) as an open source project to bring deep learning to everyone.

deep_learning_google_trends
Interest in the Google search term Deep Learning over time

If you have wanted to get your hands dirty with TensorFlow or needed more direction with that, here’s some good news – Google is offering an open MOOC on deep learning methods using TensorFlow here. This course has been developed with Vincent Vanhoucke, Principal Scientist at Google, and technical lead in the Google Brain team. However, this is an intermediate to advanced level course and assumes you have taken a first course in machine learning, or that you are at least familiar with supervised learning methods.

Google’s overall goal in designing this course is to provide the machine learning enthusiast a rapid and direct path to solving real and interesting problems with deep learning techniques.

What is Deep Learning?

Course Overview

Properly Uninstalling Canopy Python Installation from Linux

Motivation for this blog post:

I had downloaded Canopy at the insistence of the instructors of MIT’s introductory course on computer science using Python. That said, I rarely used it; I have mostly worked on Python using just a text editor and the command line. I also downloaded Anaconda and started working in IPython when I began a new machine learning MOOC offered by the University of Washington via Coursera. Anaconda is awesome! It has all the best scientific libraries, and I find IPython far nicer to work with than PyCharm or Canopy, especially if you’re using Python for machine learning.

Anyway, I was working on IPython, trying to import matplotlib, when I got the following ImportError:

ImportError in importing matplotlib in IPython notebook

I noticed that Python was trying to load the matplotlib library from Canopy’s Enthought directory. Since I never used or liked Canopy anyway, I decided to uninstall it.
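If you want to check this on your own machine before uninstalling anything, a quick sketch like the one below can help. It uses only the standard library; the helper names module_origin and paths_mentioning are my own, and the "enthought" keyword is just the directory name from my setup.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def paths_mentioning(keyword):
    """Return the sys.path entries that contain the keyword (case-insensitive)."""
    return [entry for entry in sys.path if keyword.lower() in entry.lower()]


print("matplotlib resolves to:", module_origin("matplotlib"))
print("sys.path entries mentioning Enthought:", paths_mentioning("enthought"))
```

If the printed location for matplotlib sits under an Enthought directory while you are running Anaconda's Python, the two installations are likely stepping on each other.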

Step by step process of uninstalling Canopy from Linux:

1) From the Canopy preferences option in the Edit menu, mark off Canopy as your default Python (this step is not available on very early versions of Canopy).

2) Restart your computer.

3) Remove the “~/Canopy” directory (or the directory where you installed Canopy).
rm -rf ~/Canopy

4) For each Canopy user, delete one or more of the directories below, which contain that user’s “System” and “User” virtual environments, and any user macros.

  • Deleting “System” removes the environment where the Canopy GUI application runs; it will be re-created the next time that you start Canopy.
  • Deleting “User” removes all your installed Python packages; it will be re-created with only the packages bundled into the Canopy installer, the next time that you start Canopy.
  • Deleting the third directory will remove any Canopy macros which you may have written. It is usually empty. I did this from the desktop home directory itself.

(for 32-bit Canopy, replace “64bit” with “32bit”):

~/Enthought/Canopy_64bit/System
~/Enthought/Canopy_64bit/User
~/canopy

For a 64-bit system:
cd ~/Enthought/Canopy_64bit

For a 32-bit system:
cd ~/Enthought/Canopy_32bit

Then, from inside that directory:
rm -rf System
rm -rf User

5) Delete the file “locations.cfg” from each user’s Canopy configuration / preferences directory. For complete Canopy removal, delete this directory entirely; if you do so, the user will lose individual preferences such as fonts, bookmarks, and recent file list.

rm -rf ~/.canopy

6) If you are uninstalling completely, edit the following files to delete any lines which reference Canopy (usually, the Canopy-related lines will have been commented out by step 1, but on some system configurations the lines might remain):

~/.bashrc
~/.bash_profile
~/.profile

For this step, refer to my blog post on opening files in a text editor from the CMD / Terminal (Using Python).

7) Restart your computer.
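To double-check step 6, a small read-only script can list any lines in those startup files that still mention Canopy. This is only a sketch: the function name find_canopy_lines and the "canopy" keyword are my own choices, and it only reports lines, it does not edit anything.

```python
from pathlib import Path

# The shell startup files named in step 6.
STARTUP_FILES = [".bashrc", ".bash_profile", ".profile"]


def find_canopy_lines(home=None, keyword="canopy"):
    """Return (file, line number, text) for every startup-file line mentioning the keyword."""
    home = Path.home() if home is None else Path(home)
    hits = []
    for name in STARTUP_FILES:
        path = home / name
        if not path.is_file():
            continue  # not every system has all three files
        text = path.read_text(errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if keyword.lower() in line.lower():
                hits.append((str(path), number, line.rstrip()))
    return hits


# Report only; remove the offending lines by hand in a text editor.
for path, number, line in find_canopy_lines():
    print(f"{path}:{number}: {line}")
```

If the script prints nothing, none of the three files reference Canopy any more.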

All these steps in one:

Screenshot from 2015-09-26 11:48:43

Once I was done with these steps, I no longer encountered any issues importing matplotlib in IPython.

Screenshot from 2015-09-26 12:06:47

Machine Learning — New Coursera Specialization from the University of Washington

I have finally embarked on my first machine learning MOOC / Specialization. I love Python, and this course uses Python as the language of choice. Also, the instructors assert that Python is widely used in industry, and is becoming the de facto language for data science in industry. They use IPython Notebook in their assignments and videos.

The specialization offered by the University of Washington consists of 5 courses and a capstone project spread across about 8 months (September through April). The specialization’s first iteration kicked off yesterday.

The first course, Machine Learning Foundations: A Case Study Approach, is 6 weeks long, running from September 22 through November 9.

The Instructors:

Emily Fox and Carlos Guestrin

Key Learning Outcomes
– Identify potential applications of machine learning in practice.
– Describe the core differences in analyses enabled by regression, classification, and clustering.
– Select the appropriate machine learning task for a potential application.
– Apply regression, classification, clustering, retrieval, recommender systems, and deep learning.
– Represent your data as features to serve as input to machine learning models.
– Assess the model quality in terms of relevant error metrics for each task.
– Utilize a dataset to fit a model to analyze new data.
– Build an end-to-end application that uses machine learning at its core.
– Implement these techniques in Python.

Week-by-Week
Week 1: Introductory welcome videos and the instructors’ views on the future of intelligent applications
Week 2: Predicting House Prices (Regression)
Week 3: Classification (Sentiment Analysis)
Week 4: Clustering and Similarity: Retrieving Documents
Week 5: Recommending Products
Week 6: Deep Learning: Searching for Images

EDIT

It’s been 3 days since the course began, and here’s what the classmate demographics look like:

Classmates09252015