Key Insights on Sberbank Home Price Predicting Kaggle Competition Coming Soon…

June 29, 2017 | June 30, 2017 | Anirudh | Hackathons | Data Science, Kaggle, Machine Learning, Python

This post is more about data science and Kaggle than about R or Python. I am currently taking part in my 2nd Kaggle competition, Sberbank Russian Housing Market — Can you predict realty price fluctuations in Russia’s volatile economy?

I’ve been stuck for about a week at the 52nd percentile among 3400+ Kagglers taking part in the competition. I’ve been told that Kaggle Kernels and discussion boards are helpful when you’re stuck or if you need to learn some practical data science that can’t be gleaned from books or tutorials.

One such discussion thread looks like this:

This person going by the pseudonym Schoolpal is currently killing it on the leaderboard and I’m eagerly looking forward to this person’s code once the competition ends in less than 24 hours. If you’re interested too, follow this discussion here.

Cheers!

Update:

Schoolpal, mentioned above, finished second and shared their approach here.

Abu Mostafa’s Machine Learning MOOC – Now on EdX

September 24, 2016 | Anirudh | Technical | Abu Mostafa, Algorithms, Andrew Ng, CalTech, Data Science, edX, Machine Learning, MOOC, Statistical Learning

This was in the pipeline for quite some time now. I have been waiting for his lectures on a platform such as EdX or Coursera, and the day has arrived. You can enroll and start with week 1’s lectures as they’re live now.

This course is taught by none other than Dr. Yaser S. Abu-Mostafa, whose textbook on machine learning, Learning from Data, is the #1 bestselling textbook (Amazon) in all categories of Computer Science. His online course has been offered earlier over here.

Teaching

Dr. Abu-Mostafa received the Clauser Prize for the most original doctoral thesis at Caltech. He received the ASCIT Teaching Awards in 1986, 1989 and 1991, the GSC Teaching Awards in 1995 and 2002, and the Richard P. Feynman prize for excellence in teaching in 1996.

Live ‘One-take’ Recordings

The lectures have been recorded from a live broadcast (including Q&A, which will let you gauge the level of Caltech students taking this course). In fact, it almost seems as though Abu-Mostafa takes a direct jab at Andrew Ng’s popular Coursera MOOC by stating the obvious on his course page.

A real Caltech course, not a watered-down version

Again, while enrolling, note what Abu-Mostafa had to say about the online course: “A Caltech course does not cater to short attention spans, and it may not provide instant gratification…[like] many MOOCs out there that are quite simple and have a ‘video game’ feel to them.” Unsurprisingly, many online students have dropped out in the past, but some of those students who “complained early on but decided to stick with the course had very flattering words to say at the end”.

Prerequisites

Basic probability
Basic matrices
Basic calculus
Some programming language/platform (I choose Python!)

If you’re looking for a challenging machine learning course, this is probably one you must take.

How to become a Data Scientist in 6 months

June 15, 2016 | June 17, 2016 | Anirudh | Non Technical | Data Science, Kaggle, Machine Learning, PyData, Python

Disclaimer: I’m not a data scientist yet. That’s still work in progress, but I’d recommend this excellent talk given by Tetiana Ivanova to put an enthusiast’s data science journey in perspective.

Data Manipulation in R with dplyr – Part 3

December 22, 2015 | Anirudh | Technical | Code Snippets, Coding, Dagwood Sandwich, Data Manipulation, Data Science, dplyr, R

This happens to be my 50th blog post – and my blog is 8 months old.

🙂

This post is the third and last in a series of posts (Part 1 – Part 2) on data manipulation with dplyr. Note that the objects in the code may have been defined in earlier posts, and the code in this post continues from the code in those posts.

Although datasets can be manipulated in sophisticated ways by combining the 5 verbs of dplyr, nesting the verbs inside one another quickly becomes verbose.

Creating multiple objects, especially when working on a large dataset can slow you down in your analysis. Chaining functions directly together into one line of code is difficult to read. This is sometimes called the Dagwood sandwich problem: you have too much filling (too many long arguments) between your slices of bread (parentheses). Functions and arguments get further and further apart.

The %>% operator allows you to extract the first argument of a function from the arguments list and put it in front of it, thus solving the Dagwood sandwich problem.
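To make the rewriting rule concrete, here is a minimal sketch on a made-up data frame (toy and its columns taxi_out and taxi_in are illustrative assumptions, not part of hflights). It only shows that x %>% f(y) is evaluated as f(x, y), so the nested and piped forms give the same result.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(taxi_out = c(10, 15, NA, 20),
                  taxi_in  = c(4, 5, 6, NA))

# Nested form: has to be read from the inside out
nested <- summarise(filter(mutate(toy, diff = taxi_out - taxi_in), !is.na(diff)),
                    avg = mean(diff))

# Piped form: each step receives the previous result as its first argument
piped <- toy %>%
  mutate(diff = taxi_out - taxi_in) %>%
  filter(!is.na(diff)) %>%
  summarise(avg = mean(diff))

nested$avg  # 8
piped$avg   # 8
```

Only the two rows without missing values survive the filter (differences 6 and 10), which is why both versions report an average of 8.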

	# %>% OPERATOR ----------------------------------------------------------------------

	# with %>% operator
	hflights %>%
	mutate(diff = TaxiOut - TaxiIn) %>%
	filter(!is.na(diff)) %>%
	summarise(avg = mean(diff))

	# without %>% operator
	# arguments get further and further apart
	summarize(filter(mutate(hflights, diff = TaxiOut - TaxiIn),!is.na(diff)),
	avg = mean(diff))


	# with %>% operator
	d <- hflights %>%
	select(Dest, UniqueCarrier, Distance, ActualElapsedTime) %>%
	mutate(RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60)

	# without %>% operator
	d <- mutate(select(hflights, Dest, UniqueCarrier, Distance, ActualElapsedTime),
	RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60)

	# Filter and summarise d
	d %>%
	filter(!is.na(mph), mph < 70) %>%
	summarise(n_less = n(), n_dest = n_distinct(Dest),
	min_dist = min(Distance), max_dist = max(Distance))

	# Let's define preferable flights as flights that are 150% faster than driving,
	# i.e. that travel 105 mph or greater in real time. Also, assume that cancelled or
	# diverted flights are less preferable than driving.


	# ADVANCED PIPING EXERCISES
	# Use one single piped call to print a summary with the following variables:

	# n_non - the number of non-preferable flights in hflights,
	# p_non - the percentage of non-preferable flights in hflights,
	# n_dest - the number of destinations that non-preferable flights traveled to,
	# min_dist - the minimum distance that non-preferable flights traveled,
	# max_dist - the maximum distance that non-preferable flights traveled

	hflights %>%
	mutate(RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60) %>%
	filter(mph < 105 | Cancelled == 1 | Diverted == 1) %>%
	summarise(n_non = n(), p_non = 100*n_non/nrow(hflights), n_dest = n_distinct(Dest),
	min_dist = min(Distance), max_dist = max(Distance))

	# Use summarise() to create a summary of hflights with a single variable, n,
	# that counts the number of overnight flights. These flights have an arrival
	# time that is earlier than their departure time. Only include flights that have
	# no NA values for both DepTime and ArrTime in your count.

	hflights %>%
	mutate(overnight = (ArrTime < DepTime)) %>%
	filter(overnight == TRUE) %>%
	summarise(n = n())


group_by()

group_by() defines groups within a data set. Its influence becomes clear when calling summarise() on a grouped dataset. Summarizing statistics are calculated for the different groups separately.

	# group_by() -------------------------------------------------------------------------

	# Generate a per-carrier summary of hflights with the following variables: n_flights,
	# the number of flights flown by the carrier; n_canc, the number of cancelled flights;
	# p_canc, the percentage of cancelled flights; avg_delay, the average arrival delay of
	# flights whose delay does not equal NA. Next, order the carriers in the summary from
	# low to high by their average arrival delay. Use percentage of flights cancelled to
	# break any ties. Which airline scores best based on these statistics?

	hflights %>%
	group_by(UniqueCarrier) %>%
	summarise(n_flights = n(), n_canc = sum(Cancelled), p_canc = 100*n_canc/n_flights,
	avg_delay = mean(ArrDelay, na.rm = TRUE)) %>%
	# sort by average delay; p_canc breaks any ties
	arrange(avg_delay, p_canc)

	# Generate a per-day-of-week summary of hflights with the variable avg_taxi,
	# the average total taxiing time. Pipe this summary into an arrange() call such
	# that the day with the highest avg_taxi comes first.

	hflights %>%
	group_by(DayOfWeek) %>%
	summarize(avg_taxi = mean(TaxiIn + TaxiOut, na.rm = TRUE)) %>%
	arrange(desc(avg_taxi))


Combine group_by with mutate

group_by() can also be combined with mutate(). When you mutate grouped data, mutate() will calculate the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, that calculates within group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.

rank(c(21, 22, 24, 23))

has output

[1] 1 2 4 3

As with arrange(), rank() ranks values from the smallest to the largest (the smallest value gets rank 1), and this behaviour can be reversed with the desc() function.
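A small sketch of rank() with and without desc(), and of a within-group ranking, may help here. The delays data frame and its column names are hypothetical; only rank(), desc(), group_by() and mutate() come from base R and dplyr.

```r
library(dplyr)

rank(c(21, 22, 24, 23))        # 1 2 4 3: the smallest value gets rank 1
rank(desc(c(21, 22, 24, 23)))  # 4 3 1 2: the largest value gets rank 1

# Hypothetical delays for two carriers
delays <- data.frame(carrier = c("A", "A", "A", "B", "B"),
                     delay   = c(30, 10, 20, 5, 50))

# Ranking restarts for every carrier because the data is grouped
ranked <- delays %>%
  group_by(carrier) %>%
  mutate(rank_in_carrier = rank(desc(delay))) %>%
  ungroup()

ranked$rank_in_carrier  # 1 3 2 2 1
```

Without the group_by() step, the five delays would be ranked together from 1 to 5 instead of separately per carrier.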

	# Combine group_by with mutate-----

	# First, discard flights whose arrival delay equals NA. Next, create a by-carrier
	# summary with a single variable: p_delay, the proportion of flights which are
	# delayed at arrival. Next, create a new variable rank in the summary which is a
	# rank according to p_delay. Finally, arrange the observations by this new rank
	hflights %>%
	filter(!is.na(ArrDelay)) %>%
	group_by(UniqueCarrier) %>%
	summarise(p_delay = sum(ArrDelay >0)/n()) %>%
	mutate(rank = rank(p_delay)) %>%
	arrange(rank)

	# In a similar fashion, keep flights that are delayed (ArrDelay > 0 and not NA).
	# Next, create a by-carrier summary with a single variable: avg, the average delay
	# of the delayed flights. Again add a new variable rank to the summary according to
	# avg. Finally, arrange by this rank variable.
	hflights %>%
	filter(!is.na(ArrDelay), ArrDelay > 0) %>%
	group_by(UniqueCarrier) %>%
	summarise(avg = mean(ArrDelay)) %>%
	mutate(rank = rank(avg)) %>%
	arrange(rank)

	# Advanced group_by exercises-------------------------------------------------------

	# Which plane (by tail number) flew out of Houston the most times? How many times?
	# Name the column with this frequency n. Assign the result to adv1. To answer this
	# question precisely, you will have to filter() as a final step to end up with only
	# a single observation in adv1.
	adv1 <- hflights %>%
	group_by(TailNum) %>%
	summarise(n = n()) %>%
	filter(n == max(n))

	# How many airplanes only flew to one destination from Houston?
	# Save the resulting dataset in adv2, that contains only a single column,
	# named nplanes and a single row.
	adv2 <- hflights %>%
	group_by(TailNum) %>%
	summarise(n_dest = n_distinct(Dest)) %>%
	filter(n_dest == 1) %>%
	summarise(nplanes = n())

	# Find the most visited destination for each carrier and save your solution to adv3.
	# Your solution should contain four columns:
	# UniqueCarrier and Dest,
	# n, how often a carrier visited a particular destination,
	# rank, how each destination ranks per carrier. rank should be 1 for every row,
	# as you want to find the most visited destination for each carrier.

	adv3 <- hflights %>%
	group_by(UniqueCarrier, Dest) %>%
	summarise(n = n()) %>%
	mutate(rank = rank(desc(n))) %>%
	filter(rank == 1)

	# Find the carrier that travels to each destination the most: adv4
	# For each destination, find the carrier that travels to that destination the most.
	# Store the result in adv4. Again, your solution should contain 4 columns:
	# Dest, UniqueCarrier, n and rank.

	adv4 <- hflights %>%
	group_by(Dest, UniqueCarrier) %>%
	summarise(n = n()) %>%
	mutate(rank = rank(desc(n))) %>%
	filter(rank == 1)


My First Data Science Hackathon

December 20, 2015 | Anirudh | Non Technical | Analytics Vidhya, Data Science, Hackathon, Python, R

I participated in https://t.co/alLuY7JjjT
Finished 24th/54. It was my first ever #datascience #hackathon. Determined to get better at this.

— Anirudh (@anirudhjay) December 20, 2015

So after 8 months of playing around with R and Python and blog post after blog post, I found myself finally hacking away at a problem set from the 17th storey of the Hindustan Times building at Connaught Place. I had entered my first ever data science hackathon conducted by Analytics Vidhya, a pioneer in analytics learning in India. Pizzas and Pepsi were on the house. Like any predictive analysis hackathon, this one accepted unlimited entries till submission time. It was from 2pm to 4:30pm today – 2.5 hours, of which I ended up wasting 1.5 hours trying to make my first submission which encountered submission error after submission error until the problem was fixed finally post lunch. I had 1 hour to try my best. It wasn’t the best performance, but I thought of blogging this experience anyway, as a reminder of the work that awaits me. I want to be the one winning prize money at the end of the day.

🙂

Data Manipulation in R with dplyr – Part 2

December 18, 2015 | December 19, 2015 | Anirudh | Technical | Code Snippets, Data Science, dplyr, R

Note that this post is in continuation with Part 1 of this series of posts on data manipulation with dplyr in R. The code in this post carries forward from the variables / objects defined in Part 1.

In the previous post, I talked about how dplyr provides a grammar of sorts to manipulate data, and consists of 5 verbs to do so:

The 5 verbs of dplyr
select – removes columns from a dataset
filter – removes rows from a dataset
arrange – reorders rows in a dataset
mutate – uses the data to build new columns and values
summarize – calculates summary statistics

I went on to discuss examples using select() and mutate(). Let’s now talk about filter(). R comes with a set of logical operators that you can use inside filter(). These operators are:
x < y, TRUE if x is less than y
x <= y, TRUE if x is less than or equal to y
x == y, TRUE if x equals y
x != y, TRUE if x does not equal y
x >= y, TRUE if x is greater than or equal to y
x > y, TRUE if x is greater than y
x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)

The following call, for example, filters df so that only the observations where the variable a is greater than the variable b are kept:
filter(df, a > b)

	# Print out all flights in hflights that traveled 3000 or more miles
	filter(hflights, Distance >= 3000)

	# All flights flown by one of JetBlue, Southwest, or Delta
	filter(hflights, UniqueCarrier %in% c('JetBlue', 'Southwest', 'Delta'))

	# All flights where taxiing took longer than flying
	filter(hflights, TaxiIn + TaxiOut > AirTime)


Combining tests using boolean operators
R also comes with a set of boolean operators that you can use to combine multiple logical tests into a single test. These include & (and), | (or), and ! (not). Instead of using the & operator, you can also pass several logical tests to filter(), separated by commas. The following calls are equivalent:

filter(df, a > b & c > d)
filter(df, a > b, c > d)

The is.na() function will also come in handy very often. This expression, for example, keeps the observations in df for which the variable x is not NA:

filter(df, !is.na(x))
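The equivalence of & and the comma, and the way filter() treats missing values, can be checked on a tiny made-up data frame. The columns a, b and x mirror the placeholder names used above; the values are invented for illustration.

```r
library(dplyr)

df <- data.frame(a = c(5, 1, 7, NA),
                 b = c(2, 3, 6, 1),
                 x = c(NA, 10, 20, 30))

# Both calls keep rows where a > b AND x is not missing
with_amp   <- filter(df, a > b & !is.na(x))
with_comma <- filter(df, a > b, !is.na(x))

with_amp    # one row: a = 7, b = 6, x = 20
with_comma  # the same single row
```

The last row is dropped even though x is present: a > b evaluates to NA there, and filter() only keeps rows where the condition is TRUE.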

	# Combining tests using boolean operators

	# All flights that departed before 5am or arrived after 10pm
	filter(hflights, DepTime < 500 | ArrTime > 2200)

	# All flights that departed late but arrived ahead of schedule
	filter(hflights, DepDelay > 0 & ArrDelay < 0)

	# All cancelled weekend flights
	filter(hflights, DayOfWeek %in% c(6,7) & Cancelled == 1)

	# All flights that were cancelled after being delayed
	filter(hflights, Cancelled == 1, DepDelay > 0)


A recap on select(), mutate() and filter():

	# Summarizing Exercise
	# Select the flights that had JFK as their destination: c1
	c1 <- filter(hflights, Dest == 'JFK')

	# Combine the Year, Month and DayofMonth variables to create a Date column: c2
	c2 <- mutate(c1, Date = paste(Year, Month, DayofMonth, sep = "-"))

	# Print out a selection of columns of c2
	select(c2, Date, DepTime, ArrTime, TailNum)

	# How many weekend flights flew a distance of more than 1000 miles
	# but had a total taxiing time below 15 minutes?
	nrow(filter(hflights, DayOfWeek %in% c(6,7), Distance > 1000, TaxiIn + TaxiOut < 15))


Arranging Data
arrange() can be used to rearrange rows according to any type of data. If you pass arrange() a character variable, R will rearrange the rows in alphabetical order according to values of the variable. If you pass a factor variable, R will rearrange the rows according to the order of the levels in your factor (running levels() on the variable reveals this order).

By default, arrange() arranges the rows from smallest to largest. Rows with the smallest value of the variable will appear at the top of the data set. You can reverse this behaviour with the desc() function. arrange() will reorder the rows from largest to smallest values of a variable if you wrap the variable name in desc() before passing it to arrange()
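The difference between sorting a character column and a factor column is easy to see on a toy example. The sizes data frame below is an assumption made up for this sketch; it is not part of hflights.

```r
library(dplyr)

sizes <- data.frame(
  label = c("medium", "large", "small"),
  size  = factor(c("medium", "large", "small"),
                 levels = c("small", "medium", "large")),
  stringsAsFactors = FALSE
)

arrange(sizes, label)$label       # alphabetical: "large" "medium" "small"
arrange(sizes, size)$label        # level order:  "small" "medium" "large"
arrange(sizes, desc(size))$label  # reversed:     "large" "medium" "small"
```

The label and size columns hold the same words, so any difference in the row order comes purely from the factor levels.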

	# Definition of dtc
	dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay))

	# Arrange dtc by departure delays
	arrange(dtc, DepDelay)

	# Arrange dtc so that cancellation reasons are grouped
	arrange(dtc, CancellationCode)

	# Arrange dtc according to carrier and departure delays
	arrange(dtc, UniqueCarrier, DepDelay)

	# Arrange according to carrier and decreasing departure delays
	arrange(hflights, UniqueCarrier, desc(DepDelay))

	# Arrange flights by total delay (normal order).
	arrange(hflights, DepDelay + ArrDelay)

	# Keep flights leaving to DFW before 8am and arrange according to decreasing AirTime
	arrange(filter(hflights, Dest == 'DFW', DepTime < 800), desc(AirTime))


Summarizing Data

summarise(), the last of the 5 verbs, follows the same syntax as mutate(), but it collapses the dataset into a single row, whereas mutate() adds an entire new column.

In contrast to the four other data manipulation functions, summarise() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.

Note: summarise() and summarize() both work the same!

You can use any function you like in summarise(), so long as the function can take a vector of data and return a single number. R contains many aggregating functions. Here are some of the most useful:

min(x) – minimum value of vector x.
max(x) – maximum value of vector x.
mean(x) – mean value of vector x.
median(x) – median value of vector x.
quantile(x, p) – pth quantile of vector x.
sd(x) – standard deviation of vector x.
var(x) – variance of vector x.
IQR(x) – Inter Quartile Range (IQR) of vector x.
diff(range(x)) – total range of vector x.
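As a quick sanity check, the functions listed above can be tried on a small made-up vector in plain base R. Each call returns a single number, which is the property summarise() relies on; the vector x is an arbitrary example.

```r
x <- c(2, 4, 4, 10)   # made-up values

min(x)            # 2
max(x)            # 10
mean(x)           # 5
median(x)         # 4
quantile(x, 0.5)  # the 0.5 quantile is the median, 4
sd(x)             # square root of var(x)
var(x)            # 12
IQR(x)            # quantile(x, 0.75) - quantile(x, 0.25), here 2
diff(range(x))    # max(x) - min(x), here 8
```

Note that quantile() returns a named number (for example 50%), which is still a single value and therefore fine inside summarise().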

	# Print out a summary with variables min_dist and max_dist
	summarize(hflights, min_dist = min(Distance), max_dist = max(Distance))

	# Print out a summary with variable max_div
	summarize(filter(hflights, Diverted == 1), max_div = max(Distance))

	# Remove rows that have NA ArrDelay: temp1
	temp1 <- filter(hflights, !is.na(ArrDelay))

	# Generate summary about ArrDelay column of temp1
	summarise(temp1, earliest = min(ArrDelay), average = mean(ArrDelay),
	latest = max(ArrDelay), sd = sd(ArrDelay))

	# Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2
	temp2 <- filter(hflights, !is.na(TaxiIn), !is.na(TaxiOut))

	# Print the maximum taxiing difference of temp2 with summarise()
	summarise(temp2, max_taxi_diff = max(abs(TaxiIn - TaxiOut)))


dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:

first(x) – The first element of vector x.
last(x) – The last element of vector x.
nth(x, n) – The nth element of vector x.
n() – The number of rows in the data.frame or group of observations that summarise() describes.
n_distinct(x) – The number of unique values in vector x
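Here is a minimal sketch of these dplyr aggregates on a made-up column of destinations. The trips data frame and its dest column are assumptions for illustration only.

```r
library(dplyr)

trips <- data.frame(dest = c("JFK", "DFW", "JFK", "LAX", "DFW"),
                    stringsAsFactors = FALSE)

s <- summarise(trips,
               n_obs     = n(),               # number of rows: 5
               n_dest    = n_distinct(dest),  # unique destinations: 3
               first_one = first(dest),       # "JFK"
               third_one = nth(dest, 3),      # "JFK"
               last_one  = last(dest))        # "DFW"
s
```

n() takes no arguments because it counts the rows of the data frame (or of the current group) that summarise() is working on.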

	# Generate summarizing statistics for hflights
	summarise(hflights, n_obs = n(), n_carrier = n_distinct(UniqueCarrier),
	n_dest = n_distinct(Dest), dest100 = nth(Dest, 100))

	# Filter hflights to keep all American Airlines flights: aa
	aa <- filter(hflights, UniqueCarrier == "American")

	# Generate summarizing statistics for aa
	summarise(aa, n_flights = n(), n_canc = sum(Cancelled),
	p_canc = 100*(n_canc/n_flights), avg_delay = mean(ArrDelay, na.rm = TRUE))


That’s it for Part 2 of this series of posts on data manipulation with dplyr. Part 3 will focus on the pipe operator, group_by() and working with databases.

Data Manipulation in R with dplyr – Part 1

December 17, 2015 | Anirudh | Technical | Code Snippets, Data Manipulation, Data Science, dplyr, R

dplyr is one of the packages in R that makes R so loved by data scientists. It has three main goals:

Identify the most important data manipulation tools needed for data analysis and make them easy to use in R.
Provide blazing fast performance for in-memory data by writing key pieces of code in C++.
Use the same code interface to work with data no matter where it’s stored, whether in a data frame, a data table or database.

Introduction to the dplyr package and the tbl class
This post is mostly about code. If you’re interested in learning dplyr I recommend you type in the commands line by line on the R console to see first hand what’s happening.

	# INTRODUCTION TO dplyr AND tbls
	# Load the dplyr package
	library(dplyr)

	# Load the hflights package
	library(hflights)

	# Call both head() and summary() on hflights
	head(hflights)
	summary(hflights)


	# Convert the hflights data.frame into a hflights tbl
	# (tbl_df() is deprecated in current dplyr releases; as_tibble() replaces it)
	hflights <- as_tibble(hflights)

	# Display the hflights tbl
	hflights

	# Create the object carriers, containing only the UniqueCarrier variable of hflights
	carriers <- hflights$UniqueCarrier


	# Use lut to translate the UniqueCarrier column of hflights and before doing so
	# glimpse hflights to see the UniqueCarrier variable
	glimpse(hflights)

	lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
	"DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
	"WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
	"FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
	hflights$UniqueCarrier <- lut[hflights$UniqueCarrier]
	# Now glimpse hflights to see the change in the UniqueCarrier variable
	glimpse(hflights)

	# Fill up empty entries of CancellationCode with 'E'
	# To do so, first index the empty entries in CancellationCode
	cancellationEmpty <- hflights$CancellationCode == ""
	# Assign 'E' to the empty entries
	hflights$CancellationCode[cancellationEmpty] <- 'E'

	# Use a new lookup table to create a vector of code labels. Assign the vector to the CancellationCode column of hflights
	lut = c('A' = 'carrier', 'B' = 'weather', 'C' = 'FFA', 'D' = 'security', 'E' = 'not cancelled')
	hflights$CancellationCode <- lut[hflights$CancellationCode]

	# Inspect the resulting raw values of your variables
	glimpse(hflights)


Select and mutate
In addition to a data structure, dplyr provides a grammar for data manipulation. The grammar is built around 5 functions (also referred to as verbs) that do the basic tasks of data manipulation.

dplyr functions do not change the dataset. They return a new copy of the dataset to use.
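A short sketch of what "returns a new copy" means in practice, using a made-up data frame (flights and its two columns are hypothetical names chosen for this example):

```r
library(dplyr)

flights <- data.frame(dep_delay = c(5, 0), arr_delay = c(3, 10))

# mutate() returns a new data frame; flights itself is left untouched
with_loss <- mutate(flights, loss = arr_delay - dep_delay)

names(flights)    # "dep_delay" "arr_delay"
names(with_loss)  # "dep_delay" "arr_delay" "loss"
```

To keep a change, the result has to be assigned to a name, either a new one as here or the original one.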

To answer the simple question whether flight delays tend to shrink or grow during a flight, we can safely discard a lot of the variables of each flight. To select only the ones that matter, we can use select()

	hflights[c('ActualElapsedTime','ArrDelay','DepDelay')]
	# Equivalently, using dplyr:
	select(hflights, ActualElapsedTime, ArrDelay, DepDelay)

	# Print out a tbl with the four columns of hflights related to delay
	select(hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay)

	# Print out hflights, nothing has changed!
	hflights

	# Print out the columns Origin up to Cancelled of hflights
	select(hflights, Origin:Cancelled)

	# Find the most concise way to select: columns Year up to and
	# including DayOfWeek, columns ArrDelay up to and including Diverted
	# Answer to last question: be concise!
	# You may want to examine the order of hflight's column names before you
	# begin with names()
	names(hflights)
	select(hflights, -(DepTime:AirTime))


dplyr comes with a set of helper functions that can help you select variables. These functions find groups of variables to select, based on their names. Each of these works only when used inside of select()

starts_with("X"): every name that starts with "X"
ends_with("X"): every name that ends with "X"
contains("X"): every name that contains "X"
matches("X"): every name that matches "X", where "X" can be a regular expression
num_range("x", 1:5): the variables named x1, x2, x3, x4 and x5
one_of(x): every name that appears in x, which should be a character vector
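The helpers can be tried out on a small made-up data frame. The data frame d below is an assumption for illustration; its column names are chosen to resemble a few hflights columns plus three numbered ones.

```r
library(dplyr)

d <- data.frame(x1 = 1, x2 = 2, x3 = 3, DepTime = 4, ArrTime = 5, ArrDelay = 6)

names(select(d, starts_with("Arr")))          # "ArrTime" "ArrDelay"
names(select(d, ends_with("Time")))           # "DepTime" "ArrTime"
names(select(d, contains("Del")))             # "ArrDelay"
names(select(d, matches("^x[0-9]$")))         # "x1" "x2" "x3"
names(select(d, num_range("x", 1:2)))         # "x1" "x2"
names(select(d, one_of(c("x3", "DepTime"))))  # "x3" "DepTime"
```

Helpers can be mixed with plain column names in the same select() call, and the columns come back in the order in which the arguments select them.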

	# Helper functions used with dplyr

	# Print out a tbl containing just ArrDelay and DepDelay
	select(hflights, ArrDelay, DepDelay)
	# Use a combination of helper functions and variable names to print out
	# only the UniqueCarrier, FlightNum, TailNum, Cancelled, and CancellationCode
	# columns of hflights
	select(hflights, UniqueCarrier, FlightNum, contains("Tail"), contains("Cancel"))

	# Find the most concise way to return the following columns with select and its
	# helper functions: DepTime, ArrTime, ActualElapsedTime, AirTime, ArrDelay,
	# DepDelay. Use only helper functions
	select(hflights, ends_with("Time"), ends_with("Delay"))


In order to appreciate the usefulness of dplyr, here are some comparisons between base R and dplyr

	# Some comparisons to basic R
	# both hflights and dplyr are available

	ex1r <- hflights[c("TaxiIn","TaxiOut","Distance")]
	ex1d <- select(hflights, TaxiIn, TaxiOut, Distance)

	ex2r <- hflights[c("Year","Month","DayOfWeek","DepTime","ArrTime")]
	ex2d <- select(hflights, Year:ArrTime, -DayofMonth)

	ex3r <- hflights[c("TailNum","TaxiIn","TaxiOut")]
	ex3d <- select(hflights, TailNum, contains("Taxi"))


mutate() is the second of the five data manipulation functions. mutate() creates new columns which are added to a copy of the dataset.

	# Add the new variable ActualGroundTime to a copy of hflights and save the result as g1.
	g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)

	# Add the new variable GroundTime to g1. Save the result as g2.
	g2 <- mutate(g1, GroundTime = TaxiIn + TaxiOut)

	# Add the new variable AverageSpeed to g2. Save the result as g3.
	g3 <- mutate(g2, AverageSpeed = Distance / AirTime * 60)

	# Print out g3
	g3


So far we have added variables to hflights one at a time, but we can also use mutate() to add multiple variables at once.

	# Add a second variable loss_percent to the dataset: m1
	m1 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_percent = ((ArrDelay - DepDelay)/DepDelay)*100)

	# mutate() allows you to use a new variable while creating a next variable in the same call
	# Copy and adapt the previous command to reduce redundancy: m2
	m2 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_percent = (loss/DepDelay) * 100 )

	# Add three variables in one call (TotalTaxi, ActualGroundTime and Diff): m3
	m3 <- mutate(hflights, TotalTaxi = TaxiIn + TaxiOut, ActualGroundTime = ActualElapsedTime - AirTime, Diff = TotalTaxi - ActualGroundTime)


Supplementary Material to Andrew Ng’s Machine Learning MOOC

November 25, 2015 | Anirudh | Technical | Andrew Ng, Data Science, Machine Learning

Although the lecture videos and lecture notes from Andrew Ng’s Coursera MOOC are sufficient for the online version of the course, if you’re interested in more mathematical stuff or want to be challenged further, you can go through the following notes and problem sets from CS 229, a 10-week course that he teaches at Stanford (which also happens to be the most enrolled course on campus). It’s not hard to end up with a 100% score on his MOOC, which is obviously a (much) watered-down version of the course he teaches at Stanford, at least in terms of difficulty. If you don’t believe me, just have a go at the problem sets from the links below.

Lecture Notes

Lecture notes 1 (ps) (pdf) Supervised Learning, Discriminative Algorithms
Lecture notes 2 (ps) (pdf) Generative Algorithms
Lecture notes 3 (ps) (pdf) Support Vector Machines
Lecture notes 4 (ps) (pdf) Learning Theory
Lecture notes 5 (ps) (pdf) Regularization and Model Selection
Lecture notes 6 (ps) (pdf) Online Learning and the Perceptron Algorithm. (optional reading)
Lecture notes 7a (ps) (pdf) Unsupervised Learning, k-means clustering.
Lecture notes 7b (ps) (pdf) Mixture of Gaussians
Lecture notes 8 (ps) (pdf) The EM Algorithm
Lecture notes 9 (ps) (pdf) Factor Analysis
Lecture notes 10 (ps) (pdf) Principal Components Analysis
Lecture notes 11 (ps) (pdf) Independent Components Analysis
Lecture notes 12 (ps) (pdf) Reinforcement Learning and Control

Section Notes

Section notes 1 (pdf) Linear Algebra Review and Reference
Section notes 2 (pdf) Probability Theory Review
Files for the Matlab tutorial: sigmoid.m, logistic_grad_ascent.m, matlab_session.m
Section notes 4 (ps) (pdf) Convex Optimization Overview, Part I
Section notes 5 (ps) (pdf) Convex Optimization Overview, Part II
Section notes 6 (ps) (pdf) Hidden Markov Models
Section notes 7 (pdf) The Multivariate Gaussian Distribution
Section notes 8 (pdf) More on Gaussian Distribution
Section notes 9 (pdf) Gaussian Processes

Handouts and Problem Sets

Handout #1: Course Information (HTML) (pdf)
Handout #2: Course Schedule (HTML) (pdf)
Handout #3: Cover Sheet
Handout #4: Practice Midterm 1 Solution: Solution
Handout #5: Practice Midterm 2 Solution: Solution
Problem Set 1 (pdf) Data: q1x.dat, q1y.dat, q2x.dat, q2y.dat Solution: Solution (pdf)
Problem Set 2 (pdf) Data: ps2.zip Solution: Solution (pdf)
Problem Set 3 (pdf) Solution: Solution (pdf)
Problem Set 4 (pdf)

MITx 6.00.2x Introduction to Computational Thinking and Data Science (Fall 2015)

October 21, 2015 | Anirudh | Non Technical | Coding, Data Science, Data Visualization, edX, MIT, MOOC, Python

MIT’s Fall 2015 iteration of 6.00.2x starts today. After an enriching learning experience with 6.00.1x, I have great expectations from this course. As the course website mildly puts it, 6.00.2x is an introduction to using computation to understand real-world phenomena. MIT OpenCourseware (OCW) mirroring the material covered in 6.00.1x and 6.00.2x can be found here.

The course follows this book by John Guttag (who happens to be one of the instructors for this course). However, purchasing the book isn’t a necessity for this course.

One thing I loved about 6.00.1x was its dedicated Facebook group, which gave a community / classroom-peergroup feel to the course. 6.00.2x also has a Facebook group. Here’s a sneak peek:

The syllabus and schedule for this course are shown below. The course is spread out over 2 months, which includes 7 weeks of lectures.

The prerequisites for this course are pretty much covered in this set of tutorial videos that have been created by one of the TAs for 6.00.1x. If you’ve not taken 6.00.1x in the past, you can go through these videos (running time < 1hr) to judge whether or not to go ahead with 6.00.2x.

So much for the update. Got work to do! 🙂

Teach Yourself Machine Learning the Hard Way!

October 9, 2015 | October 12, 2015 | Anirudh | Non Technical | Algorithms, Data Science, Machine Learning, Python

This formula is kick-ass!

Darshan Hegde

It has been 3 years since I steered my interests towards Machine Learning. I had just graduated from college with a Bachelor of Engineering in Electronics and Communication Engineering, which is another way of saying that I:

was a toddler in programming.
had little / no knowledge of algorithms.
had studied engineering math, but it was rusty.
had no knowledge of modern optimization.
had zero knowledge of statistical inference.

I think most of this is true for many engineering graduates (especially in India!), unless you studied mathematics and computing as an undergrad.

Lucky for me, I had a great mentor and a lot of online material on these topics. This post lists the material I found useful while I was learning it the hard way!

All the courses that I’m listing below have homework assignments. Make sure you work through each one of them.

1. Learn Python

If you are new to programming…


Discovering Python & R

— my journey as a worker bee in quant finance

Data Science

