# Sharing IPython / Jupyter Notebooks via WordPress

In order to share (a static version of) your IPython / Jupyter notebook on your WordPress site, follow three straightforward steps.

Step 1: Open your Jupyter notebook (the .ipynb file) in a text editor and look at its raw content; this is what you will copy.

Step 2: Select all of this content (Ctrl + A) and copy it (Ctrl + C). Then paste it (Ctrl + V) into a new GitHub Gist.

Step 3: Now simply Create public gist and embed the gist like you always embed gists on WordPress, viz., go to the HTML editor and add the gist's embed code.

I followed the exact steps that I’ve mentioned above to get the following result:
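Why does copy-pasting the raw file work? A .ipynb notebook is just JSON text, so its content survives a plain-text paste into a gist. A minimal sketch (the notebook body below is a bare-bones stand-in, not the actual knn.ipynb from this post):

```python
import json

# A .ipynb file is plain JSON; this is a minimal, hypothetical notebook body,
# not the real notebook used in the post.
raw = '{"nbformat": 4, "nbformat_minor": 5, "metadata": {}, "cells": []}'
nb = json.loads(raw)

# The top-level keys carry the format version and the list of cells.
print(nb["nbformat"], len(nb["cells"]))  # -> 4 0
```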

(Embedded gist: knn.ipynb)

# Randomized Selection Algorithm (Quickselect) – Python Code

Find the kth smallest element in an array without sorting.

That’s basically what this algorithm does. It piggybacks on the partition subroutine from Quick Sort. If you don’t know what that is, you can check out more about the Quick Sort algorithm here and here, and understand the usefulness of partitioning an unsorted array around a pivot.

(Figure: animated visualization of the randomized selection algorithm selecting the 22nd smallest value)

Python Implementation

```python
# RSelect.py

from random import randrange

def partition(x, pivot_index=0):
    i = 0
    if pivot_index != 0:
        x[0], x[pivot_index] = x[pivot_index], x[0]
    for j in range(len(x) - 1):
        if x[j+1] < x[0]:
            x[j+1], x[i+1] = x[i+1], x[j+1]
            i += 1
    x[0], x[i] = x[i], x[0]
    return x, i

def RSelect(x, k):
    if len(x) == 1:
        return x[0]
    else:
        xpart = partition(x, randrange(len(x)))
        x = xpart[0]  # partitioned array
        j = xpart[1]  # pivot index
        if j == k:
            return x[j]
        elif j > k:
            return RSelect(x[:j], k)
        else:
            k = k - j - 1
            return RSelect(x[(j+1):], k)

x = [3, 1, 8, 4, 7, 9]
for i in range(len(x)):
    print(RSelect(x, i), end=' ')
```

# Computing Work Done (Total Pivot Comparisons) by Quick Sort

A key aspect of the Quick Sort algorithm is how the pivot element is chosen. In my earlier post on the Python code for Quick Sort, my implementation takes the first element of the unsorted array as the pivot element.

However, with some mathematical analysis it can be seen that such an implementation is O(n²) in the worst case, while if the pivot is chosen at random, Quick Sort runs in O(n log₂ n) expected time.

To witness this in action, one can measure the work done by the algorithm in two cases: one with a randomized pivot choice, and one with a fixed pivot choice, say the first (or last) element of the array.

Implementation

A decent proxy for the amount of work done by the algorithm would be the number of pivot comparisons. These comparisons needn’t be counted one by one; rather, whenever there is a recursive call on a subarray of length m, simply add m − 1 to your running total of comparisons (the pivot is compared with each of the other m − 1 elements exactly once).
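As a quick sanity check of this counting rule, here is a toy Python sketch (my illustration, not the post's code): it partitions around the first element and credits len(x) − 1 comparisons per recursive call. On an already sorted input, the first-element pivot degenerates and the count is exactly n(n − 1)/2:

```python
def count_comparisons(x):
    """Toy sketch: Quick Sort with first-element pivot, crediting
    len(x) - 1 comparisons per recursive call on x."""
    if len(x) <= 1:
        return 0
    total = len(x) - 1          # pivot is compared with every other element
    pivot, i = x[0], 0
    for j in range(len(x) - 1):
        if x[j + 1] < pivot:
            x[j + 1], x[i + 1] = x[i + 1], x[j + 1]
            i += 1
    x[0], x[i] = x[i], x[0]
    return total + count_comparisons(x[:i]) + count_comparisons(x[i + 1:])

# Sorted input is the worst case for a first-element pivot:
# n = 100 gives n*(n-1)/2 = 4950 comparisons.
print(count_comparisons(list(range(100))))  # -> 4950
```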

3 Cases

To put things in perspective, let’s look at 3 cases. (This is basically straight out of a homework assignment from Tim Roughgarden’s course on the Design and Analysis of Algorithms).
Case I with the pivot being the first element.
Case II with the pivot being the last element.
Case III using the “median-of-three” pivot rule. The primary motivation behind this rule is to do a little bit of extra work to get much better performance on input arrays that are nearly sorted or reverse sorted.

Median-of-Three Pivot Rule

Consider the first, middle, and final elements of the given array. (If the array has odd length, it should be clear what the “middle” element is; for an array with even length 2k, use the kth element as the “middle” element. So for the array 4 5 6 7, the “middle” element is the second one, i.e. 5 and not 6!) Identify which of these three elements is the median (i.e., the one whose value is in between the other two), and use this as your pivot.
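The rule above can be sketched in a few lines of Python (the helper name median_of_three is mine, not from the post's code, but it follows the same even-length convention):

```python
def median_of_three(x):
    """Index of the median of the first, middle, and final elements of x.
    For even length 2k, the kth element (index k - 1) is the "middle"."""
    n = len(x)
    mid = n // 2 - 1 if n % 2 == 0 else n // 2
    # order the three candidate indices by their values; the median sits in the middle
    ordered = sorted([0, mid, n - 1], key=lambda i: x[i])
    return ordered[1]

x = [4, 5, 6, 7]
k = median_of_three(x)  # candidates are 4, 5 (the "middle", not 6!) and 7
print(k, x[k])          # -> 1 5
```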

Python Code

This file contains all of the integers between 1 and 10,000 (inclusive, with no repeats) in unsorted order. The integer in the ith row of the file gives you the ith entry of an input array. I downloaded this file and named it QuickSort_List.txt.

You can run the code below and see for yourself that the number of comparisons for Case III is 138,382, compared to 162,085 and 164,123 for Case I and Case II respectively. You can play around with the code in an IPython / Jupyter notebook here.

```python
#!/usr/bin/env python

# Case I
# First element of the unsorted array is chosen as pivot element for sorting using Quick Sort
def countComparisonsWithFirst(x):
    """Counts number of comparisons while using Quick Sort
    with first element of unsorted array as pivot"""
    global count_pivot_first
    if len(x) == 1 or len(x) == 0:
        return x
    else:
        count_pivot_first += len(x) - 1
        i = 0
        for j in range(len(x) - 1):
            if x[j+1] < x[0]:
                x[j+1], x[i+1] = x[i+1], x[j+1]
                i += 1
        x[0], x[i] = x[i], x[0]
        first_part = countComparisonsWithFirst(x[:i])
        second_part = countComparisonsWithFirst(x[i+1:])
        first_part.append(x[i])
        return first_part + second_part

# Case II
# Last element of the unsorted array is chosen as pivot element for sorting using Quick Sort
def countComparisonsWithLast(x):
    """Counts number of comparisons while using Quick Sort
    with last element of unsorted array as pivot"""
    global count_pivot_last
    if len(x) == 1 or len(x) == 0:
        return x
    else:
        count_pivot_last += len(x) - 1
        x[0], x[-1] = x[-1], x[0]
        i = 0
        for j in range(len(x) - 1):
            if x[j+1] < x[0]:
                x[j+1], x[i+1] = x[i+1], x[j+1]
                i += 1
        x[0], x[i] = x[i], x[0]
        first_part = countComparisonsWithLast(x[:i])
        second_part = countComparisonsWithLast(x[i+1:])
        first_part.append(x[i])
        return first_part + second_part

# Case III
# Median-of-three method used to choose pivot element for sorting using Quick Sort
def middle_index(x):
    """Returns the index of the middle element of an array"""
    if len(x) % 2 == 0:
        return len(x)//2 - 1
    else:
        return len(x)//2

def median_index(x, i, j, k):
    """Returns the median index of three when passed an array
    and indices of any 3 elements of that array"""
    if (x[i] - x[j]) * (x[i] - x[k]) < 0:
        return i
    elif (x[j] - x[i]) * (x[j] - x[k]) < 0:
        return j
    else:
        return k

def countComparisonsMedianOfThree(x):
    """Counts number of comparisons while using Quick Sort
    with median-of-three element chosen as pivot"""
    global count_pivot_median
    if len(x) == 1 or len(x) == 0:
        return x
    else:
        count_pivot_median += len(x) - 1
        k = median_index(x, 0, middle_index(x), -1)
        if k != 0:
            x[0], x[k] = x[k], x[0]
        i = 0
        for j in range(len(x) - 1):
            if x[j+1] < x[0]:
                x[j+1], x[i+1] = x[i+1], x[j+1]
                i += 1
        x[0], x[i] = x[i], x[0]
        first_part = countComparisonsMedianOfThree(x[:i])
        second_part = countComparisonsMedianOfThree(x[i+1:])
        first_part.append(x[i])
        return first_part + second_part

#####################################################################
# initializing counts
count_pivot_first = 0; count_pivot_last = 0; count_pivot_median = 0
#####################################################################

# Case I
# Read the contents of the file into a Python list
NUMLIST_FILENAME = "QuickSort_List.txt"
with open(NUMLIST_FILENAME, 'r') as f:
    numList = [int(line.strip()) for line in f.readlines()]
# call function to count comparisons
countComparisonsWithFirst(numList)

#####################################################################
# Case II
# Read the contents of the file into a Python list
with open(NUMLIST_FILENAME, 'r') as f:
    numList = [int(line.strip()) for line in f.readlines()]
# call function to count comparisons
countComparisonsWithLast(numList)

#####################################################################
# Case III
# Read the contents of the file into a Python list
with open(NUMLIST_FILENAME, 'r') as f:
    numList = [int(line.strip()) for line in f.readlines()]
# call function to count comparisons
countComparisonsMedianOfThree(numList)

#####################################################################
print(count_pivot_first, count_pivot_last, count_pivot_median)
```

# Quick Sort Python Code

Yet another post for the crawlers to better index my site for algorithms and as a repository for Python code. The quick sort algorithm is well explained in the topmost Google search result for ‘Quick Sort Python Code’, but the code is unnecessarily convoluted. Instead, go with the code below.

In it, I assume the pivot to be the first element. You can easily add a function to randomize selection of the pivot. Choosing a random pivot minimizes the chance that you will encounter worst-case O(n²) performance; always choosing the first or last element would cause worst-case performance on nearly sorted or nearly reverse-sorted data.
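For instance, a randomized-pivot variant can be sketched by swapping a random element into the first position and then reusing the same partition logic (this sketch is mine, not part of the original post):

```python
from random import randrange

def quicksort_random(x):
    """Quick Sort with a randomly chosen pivot; same partition scheme
    as the first-element version, just with a random swap up front."""
    if len(x) <= 1:
        return x
    # swap a random element into position 0, then partition around it
    p = randrange(len(x))
    x[0], x[p] = x[p], x[0]
    pivot = x[0]
    i = 0
    for j in range(len(x) - 1):
        if x[j + 1] < pivot:
            x[j + 1], x[i + 1] = x[i + 1], x[j + 1]
            i += 1
    x[0], x[i] = x[i], x[0]
    return quicksort_random(x[:i]) + [x[i]] + quicksort_random(x[i + 1:])

print(quicksort_random([54, 26, 93, 17, 77, 31, 44, 55, 20]))
# -> [17, 20, 26, 31, 44, 54, 55, 77, 93]
```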

```python
# quicksort.py

def quicksort(x):
    if len(x) == 1 or len(x) == 0:
        return x
    else:
        pivot = x[0]
        i = 0
        for j in range(len(x) - 1):
            if x[j+1] < pivot:
                x[j+1], x[i+1] = x[i+1], x[j+1]
                i += 1
        x[0], x[i] = x[i], x[0]
        first_part = quicksort(x[:i])
        second_part = quicksort(x[i+1:])
        first_part.append(x[i])
        return first_part + second_part

alist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
print(quicksort(alist))
```

# Detecting Structural Breaks in China’s FX Regime

##### Edit: This post is in its infancy. Work is still ongoing as far as deriving insight from the data is concerned. More content and economic insight will be added to this post as progress is made in that direction.

This is an attempt to detect structural breaks in China’s FX regime using the Frankel-Wei regression methodology (later improved upon by Bai and Perron). I came up with the motivation to check for these structural breaks while attending a guest lecture on FX regimes by Dr. Ajay Shah, delivered at IGIDR. This is work that two classmates and I are doing as a term paper project under the supervision of Dr. Rajeswari Sengupta.

The code below can be replicated and run as is, to get the same results.

```r
## if fxregime or strucchange package is absent from installed packages, download and load it
if(!require('fxregime')){
  install.packages("fxregime")
}
if(!require('strucchange')){
  install.packages("strucchange")
}

## load packages
library("fxregime")
library('strucchange')

# load the necessary data related to exchange rates - 'FXRatesCHF'
# this dataset treats CHF as unit currency
data("FXRatesCHF", package = "fxregime")

## compute returns for CNY (and explanatory currencies)
## since China abolished fixed USD regime
cny <- fxreturns("CNY", frequency = "daily",
                 start = as.Date("2005-07-25"), end = as.Date("2010-02-12"),
                 other = c("USD", "JPY", "EUR", "GBP"))

## compute all segmented regressions with minimal segment size of
## h = 100 and maximal number of breaks = 10
regx <- fxregimes(CNY ~ USD + JPY + EUR + GBP,
                  data = cny, h = 100, breaks = 10, ic = "BIC")

## print summary of regression results
summary(regx)

## minimum BIC is attained for 2-segment (1-break) model
plot(regx)
round(coef(regx), digits = 3)
sqrt(coef(regx)[, "(Variance)"])

## inspect associated confidence intervals
cit <- confint(regx, level = 0.9)
cit
breakdates(cit)

## plot LM statistics along with confidence interval
flm <- fxlm(CNY ~ USD + JPY + EUR + GBP, data = cny)
scus <- gefp(flm, fit = NULL)
plot(scus, functional = supLM(0.1))
## add lines related to breaks to your plot
lines(cit)
```

As can be seen in the figure below, the structural breaks correspond to the vertical bars. We are still working on understanding the motivations of China’s central bank in varying the degree of the managed float of its exchange rate.

EDIT (May 16, 2016):

The code above uses data provided by the package itself. If you wish to replicate this analysis on data after 2010, you will have to use your own data. We used Quandl, which lets you access 10 premium datasets for free; an API key (for only 10 calls on premium datasets) is provided when you register there. Foreign exchange rate data (2000 onward to date), apparently, is premium data. You can find these here.

Here are the (partial) results and code to work the same methodology on the data from 2010 to 2016:

We got breaks in 2010 and in 2015 (when China’s stock markets crashed). We would have hoped for more breaks (we can still get them), but that would depend on the parameters chosen for our regression.

# Google’s New Deep Learning MOOC Using TensorFlow

Deep learning became a hot topic in machine learning in the last 3-4 years (see inset below), and recently Google released TensorFlow (a Python-based deep learning toolkit) as an open source project to bring deep learning to everyone.

If you have wanted to get your hands dirty with TensorFlow, or needed more direction with that, here’s some good news: Google is offering an open MOOC on deep learning methods using TensorFlow here. The course has been developed with Vincent Vanhoucke, Principal Scientist at Google and technical lead on the Google Brain team. Note, however, that this is an intermediate to advanced level course that assumes you have taken a first course in machine learning, or that you are at least familiar with supervised learning methods.

Google’s overall goal in designing this course is to provide the machine learning enthusiast a rapid and direct path to solving real and interesting problems with deep learning techniques.

What is Deep Learning?

Course Overview

# Data Manipulation in R with dplyr – Part 3

This happens to be my 50th blog post – and my blog is 8 months old.

🙂

This post is the third and last in a series of posts (Part 1, Part 2) on data manipulation with dplyr. Note that objects in the code may have been defined in earlier posts, and the code in this post continues from the code in those posts.

Although datasets can be manipulated in sophisticated ways by linking the 5 verbs of dplyr together, chains of verbs can get a bit verbose.

Creating multiple intermediate objects, especially when working on a large dataset, can slow down your analysis. But chaining functions directly together into one line of code is difficult to read. This is sometimes called the Dagwood sandwich problem: you have too much filling (too many long arguments) between your slices of bread (parentheses), and functions and arguments get further and further apart.

The %>% operator allows you to extract the first argument of a function from the arguments list and put it in front of the function call, thus solving the Dagwood sandwich problem.

```r
# %>% OPERATOR ----------------------------------------------------------------

# with %>% operator
hflights %>%
  mutate(diff = TaxiOut - TaxiIn) %>%
  filter(!is.na(diff)) %>%
  summarise(avg = mean(diff))

# without %>% operator
# arguments get further and further apart
summarize(filter(mutate(hflights, diff = TaxiOut - TaxiIn),
                 !is.na(diff)),
          avg = mean(diff))

# with %>% operator
d <- hflights %>%
  select(Dest, UniqueCarrier, Distance, ActualElapsedTime) %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60)

# without %>% operator
d <- mutate(select(hflights, Dest, UniqueCarrier, Distance, ActualElapsedTime),
            RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60)

# Filter and summarise d
d %>%
  filter(!is.na(mph), mph < 70) %>%
  summarise(n_less = n(), n_dest = n_distinct(Dest),
            min_dist = min(Distance), max_dist = max(Distance))

# Let's define preferable flights as flights that are 150% faster than driving,
# i.e. that travel 105 mph or greater in real time. Also, assume that cancelled or
# diverted flights are less preferable than driving.

# ADVANCED PIPING EXERCISES
# Use one single piped call to print a summary with the following variables:
# n_non - the number of non-preferable flights in hflights,
# p_non - the percentage of non-preferable flights in hflights,
# n_dest - the number of destinations that non-preferable flights traveled to,
# min_dist - the minimum distance that non-preferable flights traveled,
# max_dist - the maximum distance that non-preferable flights traveled
hflights %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60) %>%
  filter(mph < 105 | Cancelled == 1 | Diverted == 1) %>%
  summarise(n_non = n(), p_non = 100*n_non/nrow(hflights),
            n_dest = n_distinct(Dest),
            min_dist = min(Distance), max_dist = max(Distance))

# Use summarise() to create a summary of hflights with a single variable, n,
# that counts the number of overnight flights. These flights have an arrival
# time that is earlier than their departure time. Only include flights that have
# no NA values for both DepTime and ArrTime in your count.
hflights %>%
  mutate(overnight = (ArrTime < DepTime)) %>%
  filter(overnight == TRUE) %>%
  summarise(n = n())
```

group_by()

group_by() defines groups within a dataset. Its influence becomes clear when calling summarise() on a grouped dataset: summary statistics are calculated for the different groups separately.

```r
# group_by.R

# Generate a per-carrier summary of hflights with the following variables: n_flights,
# the number of flights flown by the carrier; n_canc, the number of cancelled flights;
# p_canc, the percentage of cancelled flights; avg_delay, the average arrival delay of
# flights whose delay does not equal NA. Next, order the carriers in the summary from
# low to high by their average arrival delay. Use percentage of flights cancelled to
# break any ties. Which airline scores best based on these statistics?
hflights %>%
  group_by(UniqueCarrier) %>%
  summarise(n_flights = n(), n_canc = sum(Cancelled),
            p_canc = 100*n_canc/n_flights,
            avg_delay = mean(ArrDelay, na.rm = TRUE)) %>%
  arrange(avg_delay)

# Generate a per-day-of-week summary of hflights with the variable avg_taxi,
# the average total taxiing time. Pipe this summary into an arrange() call such
# that the day with the highest avg_taxi comes first.
hflights %>%
  group_by(DayOfWeek) %>%
  summarize(avg_taxi = mean(TaxiIn + TaxiOut, na.rm = TRUE)) %>%
  arrange(desc(avg_taxi))
```

Combine group_by with mutate

group_by() can also be combined with mutate(). When you mutate grouped data, mutate() calculates the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, which calculates within-group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.

rank(c(21, 22, 24, 23))

has output

 1 2 4 3

As with arrange(), rank() ranks values from the smallest to the largest by default, and this behaviour can be reversed with the desc() function.
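To make that behaviour concrete, here is a small Python sketch of the same idea (an illustration of R's rank() semantics, not R code; ties are ignored for simplicity):

```python
def rank(values, desc=False):
    """1-based ranks like R's rank(): the smallest value gets rank 1
    by default; desc=True mimics wrapping the variable in desc().
    Ties are not handled in this sketch."""
    order = sorted(values, reverse=desc)
    return [order.index(v) + 1 for v in values]

print(rank([21, 22, 24, 23]))             # -> [1, 2, 4, 3]
print(rank([21, 22, 24, 23], desc=True))  # -> [4, 3, 1, 2]
```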


# Data Manipulation in R with dplyr – Part 2

Note that this post is in continuation with Part 1 of this series of posts on data manipulation with dplyr in R. The code in this post carries forward from the variables / objects defined in Part 1.

In the previous post, I talked about how dplyr provides a grammar of sorts to manipulate data, and consists of 5 verbs to do so:

The 5 verbs of dplyr
select – removes columns from a dataset
filter – removes rows from a dataset
arrange – reorders rows in a dataset
mutate – uses the data to build new columns and values
summarize – calculates summary statistics

I went on to discuss examples using select() and mutate(). Let’s now talk about filter(). R comes with a set of logical operators that you can use inside filter(). These operators are:
x < y, TRUE if x is less than y
x <= y, TRUE if x is less than or equal to y
x == y, TRUE if x equals y
x != y, TRUE if x does not equal y
x >= y, TRUE if x is greater than or equal to y
x > y, TRUE if x is greater than y
x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)

The following call, for example, filters df so that only the observations where the variable a is greater than the variable b are kept:
filter(df, a > b)

```r
# verbs05.r

# Print out all flights in hflights that traveled 3000 or more miles
filter(hflights, Distance >= 3000)

# All flights flown by one of JetBlue, Southwest, or Delta
filter(hflights, UniqueCarrier %in% c('JetBlue', 'Southwest', 'Delta'))

# All flights where taxiing took longer than flying
filter(hflights, TaxiIn + TaxiOut > AirTime)
```

Combining tests using boolean operators
R also comes with a set of boolean operators that you can use to combine multiple logical tests into a single test. These include & (and), | (or), and ! (not). Instead of using the & operator, you can also pass several logical tests to filter(), separated by commas. The following calls are equivalent:

filter(df, a > b & c > d)
filter(df, a > b, c > d)

The is.na() function will also come in handy very often. This expression, for example, keeps the observations in df for which the variable x is not NA:

filter(df, !is.na(x))

```r
# verbs06.r

# Combining tests using boolean operators
# All flights that departed before 5am or arrived after 10pm
filter(hflights, DepTime < 500 | ArrTime > 2200)

# All flights that departed late but arrived ahead of schedule
filter(hflights, DepDelay > 0 & ArrDelay < 0)

# All cancelled weekend flights
filter(hflights, DayOfWeek %in% c(6, 7) & Cancelled == 1)

# All flights that were cancelled after being delayed
filter(hflights, Cancelled == 1, DepDelay > 0)
```

A recap on select(), mutate() and filter():

```r
# verbs07.r

# Summarizing Exercise
# Select the flights that had JFK as their destination: c1
c1 <- filter(hflights, Dest == 'JFK')

# Combine the Year, Month and DayofMonth variables to create a Date column: c2
c2 <- mutate(c1, Date = paste(Year, Month, DayofMonth, sep = "-"))

# Print out a selection of columns of c2
select(c2, Date, DepTime, ArrTime, TailNum)

# How many weekend flights flew a distance of more than 1000 miles
# but had a total taxiing time below 15 minutes?
nrow(filter(hflights, DayOfWeek %in% c(6, 7), Distance > 1000, TaxiIn + TaxiOut < 15))
```

Arranging Data
arrange() can be used to rearrange rows according to any type of data. If you pass arrange() a character variable, R will rearrange the rows in alphabetical order according to values of the variable. If you pass a factor variable, R will rearrange the rows according to the order of the levels in your factor (running levels() on the variable reveals this order).

By default, arrange() arranges the rows from smallest to largest. Rows with the smallest value of the variable will appear at the top of the data set. You can reverse this behaviour with the desc() function: arrange() will reorder the rows from largest to smallest values of a variable if you wrap the variable name in desc() before passing it to arrange().

```r
# verbs08.r

# Definition of dtc
dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay))

# Arrange dtc by departure delays
arrange(dtc, DepDelay)

# Arrange dtc so that cancellation reasons are grouped
arrange(dtc, CancellationCode)

# Arrange dtc according to carrier and departure delays
arrange(dtc, UniqueCarrier, DepDelay)

# Arrange according to carrier and decreasing departure delays
arrange(hflights, UniqueCarrier, desc(DepDelay))

# Arrange flights by total delay (normal order)
arrange(hflights, DepDelay + ArrDelay)

# Keep flights leaving to DFW before 8am and arrange according to decreasing AirTime
arrange(filter(hflights, Dest == 'DFW', DepTime < 800), desc(AirTime))
```

Summarizing Data

summarise(), the last of the 5 verbs, follows the same syntax as mutate(), but the resulting dataset consists of a single row instead of an entire new column as in the case of mutate().

In contrast to the four other data manipulation functions, summarise() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.

Note: summarise() and summarize() both work the same!

You can use any function you like in summarise(), so long as the function can take a vector of data and return a single number. R contains many aggregating functions. Here are some of the most useful:

min(x) – minimum value of vector x.
max(x) – maximum value of vector x.
mean(x) – mean value of vector x.
median(x) – median value of vector x.
quantile(x, p) – pth quantile of vector x.
sd(x) – standard deviation of vector x.
var(x) – variance of vector x.
IQR(x) – interquartile range (IQR) of vector x.
diff(range(x)) – total range of vector x.

```r
# verbs09.r

# Print out a summary with variables min_dist and max_dist
summarize(hflights, min_dist = min(Distance), max_dist = max(Distance))

# Print out a summary with variable max_div
summarize(filter(hflights, Diverted == 1), max_div = max(Distance))

# Remove rows that have NA ArrDelay: temp1
temp1 <- filter(hflights, !is.na(ArrDelay))

# Generate summary about ArrDelay column of temp1
summarise(temp1, earliest = min(ArrDelay), average = mean(ArrDelay),
          latest = max(ArrDelay), sd = sd(ArrDelay))

# Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2
temp2 <- filter(hflights, !is.na(TaxiIn), !is.na(TaxiOut))

# Print the maximum taxiing difference of temp2 with summarise()
summarise(temp2, max_taxi_diff = max(abs(TaxiIn - TaxiOut)))
```

dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:

first(x) – The first element of vector x.
last(x) – The last element of vector x.
nth(x, n) – The nth element of vector x.
n() – The number of rows in the data.frame or group of observations that summarise() describes.
n_distinct(x) – The number of unique values in vector x.

```r
# verbs10.r

# Generate summarizing statistics for hflights
summarise(hflights, n_obs = n(), n_carrier = n_distinct(UniqueCarrier),
          n_dest = n_distinct(Dest), dest100 = nth(Dest, 100))

# Filter hflights to keep all American Airline flights: aa
aa <- filter(hflights, UniqueCarrier == "American")

# Generate summarizing statistics for aa
summarise(aa, n_flights = n(), n_canc = sum(Cancelled),
          p_canc = 100*(n_canc/n_flights),
          avg_delay = mean(ArrDelay, na.rm = TRUE))
```

This would be it for Part 2 of this series of posts on data manipulation with dplyr. Part 3 will focus on the pipe operator, group_by(), and working with databases.

# Data Manipulation in R with dplyr – Part 1

dplyr is one of the packages that make R so loved by data scientists. It has three main goals:

1. Identify the most important data manipulation tools needed for data analysis and make them easy to use in R.
2. Provide blazing fast performance for in-memory data by writing key pieces of code in C++.
3. Use the same code interface to work with data no matter where it’s stored, whether in a data frame, a data table, or a database.

Introduction to the dplyr package and the tbl class
This post is mostly about code. If you’re interested in learning dplyr, I recommend you type in the commands line by line on the R console to see first-hand what’s happening.

```r
# introduction.R

# INTRODUCTION TO dplyr AND tbls

# Load the dplyr package
library(dplyr)

# Load the hflights package
library(hflights)

# Call both head() and summary() on hflights
head(hflights)
summary(hflights)

# Convert the hflights data.frame into a hflights tbl
hflights <- tbl_df(hflights)

# Display the hflights tbl
hflights

# Create the object carriers, containing only the UniqueCarrier variable of hflights
carriers <- hflights$UniqueCarrier

# Use lut to translate the UniqueCarrier column of hflights and before doing so
# glimpse hflights to see the UniqueCarrier variable
glimpse(hflights)
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
         "DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
         "WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
         "FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
hflights$UniqueCarrier <- lut[hflights$UniqueCarrier]

# Now glimpse hflights to see the change in the UniqueCarrier variable
glimpse(hflights)

# Fill up empty entries of CancellationCode with 'E'
# To do so, first index the empty entries in CancellationCode
cancellationEmpty <- hflights$CancellationCode == ""

# Assign 'E' to the empty entries
hflights$CancellationCode[cancellationEmpty] <- 'E'

# Use a new lookup table to create a vector of code labels. Assign the vector
# to the CancellationCode column of hflights
lut <- c('A' = 'carrier', 'B' = 'weather', 'C' = 'FFA',
         'D' = 'security', 'E' = 'not cancelled')
hflights$CancellationCode <- lut[hflights$CancellationCode]

# Inspect the resulting raw values of your variables
glimpse(hflights)
```
view raw introduction.R hosted with ❤ by GitHub

Select and mutate
Apart from the tbl data structure, dplyr provides a grammar for data manipulation. The grammar is built around five functions (also referred to as verbs) that perform the basic tasks of data manipulation.

The 5 verbs of dplyr
• select() – removes columns from a dataset
• filter() – removes rows from a dataset
• arrange() – reorders rows in a dataset
• mutate() – uses the data to build new columns and values
• summarize() – calculates summary statistics

dplyr functions do not change the dataset in place. Instead, they return a new copy of the dataset for you to use.
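A quick way to convince yourself of this copy behaviour is a minimal sketch using R’s built-in mtcars data frame (used here instead of hflights so it runs with only dplyr installed):

```r
library(dplyr)

# select() returns a new data frame; the original is left untouched
smaller <- select(mtcars, mpg, cyl)

ncol(smaller)   # 2: the copy has only the selected columns
ncol(mtcars)    # 11: mtcars still has all of its columns
```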

To answer the simple question of whether flight delays tend to shrink or grow during a flight, we can safely discard a lot of the variables of each flight. To select only the ones that matter, we can use select().

```r
# verbs01.R
# Base R subsetting:
hflights[c('ActualElapsedTime', 'ArrDelay', 'DepDelay')]

# Equivalently, using dplyr:
select(hflights, ActualElapsedTime, ArrDelay, DepDelay)

# Print out a tbl with the four columns of hflights related to delay
select(hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay)

# Print out hflights, nothing has changed!
hflights

# Print out the columns Origin up to Cancelled of hflights
select(hflights, Origin:Cancelled)

# Find the most concise way to select: columns Year up to and
# including DayOfWeek, columns ArrDelay up to and including Diverted.
# Answer to last question: be concise!
# You may want to examine the order of hflights' column names before you
# begin, with names()
names(hflights)
select(hflights, -(DepTime:AirTime))
```

dplyr comes with a set of helper functions that help you select variables. These functions find groups of variables to select based on their names, and each of them works only inside select().

• `starts_with("X")`: every name that starts with "X"
• `ends_with("X")`: every name that ends with "X"
• `contains("X")`: every name that contains "X"
• `matches("X")`: every name that matches "X", where "X" can be a regular expression
• `num_range("x", 1:5)`: the variables named x1, x2, x3, x4 and x5
• `one_of(x)`: every name that appears in x, which should be a character vector
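To see the helpers in action outside of hflights, here is a small self-contained sketch on the built-in mtcars data (the column names disp, drat, gear, carb, etc. belong to mtcars, not to the post’s dataset):

```r
library(dplyr)

starts_d <- select(mtcars, starts_with("d"))    # disp, drat
ends_p   <- select(mtcars, ends_with("p"))      # disp, hp
has_ar   <- select(mtcars, contains("ar"))      # gear, carb
regex_c  <- select(mtcars, matches("^c"))       # cyl, carb
named    <- select(mtcars, one_of(c("mpg", "wt")))
```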
```r
# verbs02.R
# Helper functions used with dplyr

# Print out a tbl containing just ArrDelay and DepDelay
select(hflights, ArrDelay, DepDelay)

# Use a combination of helper functions and variable names to print out
# only the UniqueCarrier, FlightNum, TailNum, Cancelled, and CancellationCode
# columns of hflights
select(hflights, UniqueCarrier, FlightNum, contains("Tail"), contains("Cancel"))

# Find the most concise way to return the following columns with select and its
# helper functions: DepTime, ArrTime, ActualElapsedTime, AirTime, ArrDelay,
# DepDelay. Use only helper functions
select(hflights, ends_with("Time"), ends_with("Delay"))
```

To appreciate the usefulness of dplyr, here are some comparisons between base R and dplyr:

```r
# comparisons01.R
# Some comparisons to base R
# both hflights and dplyr are available

ex1r <- hflights[c("TaxiIn", "TaxiOut", "Distance")]
ex1d <- select(hflights, TaxiIn, TaxiOut, Distance)

ex2r <- hflights[c("Year", "Month", "DayOfWeek", "DepTime", "ArrTime")]
ex2d <- select(hflights, Year:ArrTime, -DayofMonth)

ex3r <- hflights[c("TailNum", "TaxiIn", "TaxiOut")]
ex3d <- select(hflights, TailNum, contains("Taxi"))
```

mutate() is the second of the five data manipulation functions. It creates new columns, which are added to a copy of the dataset.
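As with select(), a minimal sketch with the built-in mtcars data shows that mutate() works on a copy (the kpl conversion column is my own illustrative choice, not from the post):

```r
library(dplyr)

# mutate() appends the new column to a copy; mtcars itself is unchanged
m <- mutate(mtcars, kpl = mpg * 0.425144)   # miles per gallon -> km per litre

ncol(m) - ncol(mtcars)     # 1: exactly one column was added
"kpl" %in% names(mtcars)   # FALSE: the original data frame is untouched
```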

```r
# verbs03.r
# Add the new variable ActualGroundTime to a copy of hflights and save the result as g1
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)

# Add the new variable GroundTime to g1. Save the result as g2
g2 <- mutate(g1, GroundTime = TaxiIn + TaxiOut)

# Add the new variable AverageSpeed to g2. Save the result as g3
g3 <- mutate(g2, AverageSpeed = Distance / AirTime * 60)

# Print out g3
g3
```

So far we have added variables to hflights one at a time, but we can also use mutate() to add multiple variables at once.

```r
# verbs04.r
# Add a second variable loss_percent to the dataset: m1
m1 <- mutate(hflights, loss = ArrDelay - DepDelay,
             loss_percent = ((ArrDelay - DepDelay) / DepDelay) * 100)

# mutate() allows you to use a new variable while creating the next variable in the same call
# Copy and adapt the previous command to reduce redundancy: m2
m2 <- mutate(hflights, loss = ArrDelay - DepDelay,
             loss_percent = (loss / DepDelay) * 100)

# Add the three variables as described in the third instruction: m3
m3 <- mutate(hflights, TotalTaxi = TaxiIn + TaxiOut,
             ActualGroundTime = ActualElapsedTime - AirTime,
             Diff = TotalTaxi - ActualGroundTime)
```

# Generating Permutation Matrices in Octave / Matlab

I have been doing Gilbert Strang’s linear algebra assignments, some of which require you to write short scripts in MATLAB, though I use GNU Octave (a free alternative to MATLAB). I was trying out this problem: To solve it quickly, it would have been nice to have a function that returns the list of all permutation matrices of a given size n, but there was none in Octave, so I wrote a function permMatrices which creates a list of permutation matrices for a square matrix of size n.

```matlab
% permMatrices.m
% Generate all permutation matrices of the desired size n
function x = permMatrices(n)
  x = zeros(n, n, factorial(n));
  permutations = perms(1:n);
  for i = 1:size(x, 3)
    % Octave allows chained indexing: reorder the rows of eye(n)
    x(:,:,i) = eye(n)(permutations(i,:), :);
  end
endfunction
```
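The underlying idea is easy to cross-check in R (the language used in the rest of this post): a permutation matrix is just the identity matrix with its rows reordered, and there are factorial(n) of them for size n. The one-liner perm_mat below is my own hypothetical helper, not part of the Octave code:

```r
# build a permutation matrix from a permutation vector p
perm_mat <- function(p) diag(length(p))[p, ]

P <- perm_mat(c(2, 3, 1))   # one of the factorial(3) = 6 matrices of size 3
x <- c(10, 20, 30)
as.vector(P %*% x)          # 20 30 10: the entries of x, reordered by p
```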

For example: The MATLAB / Octave code to solve this problem is shown below:

```matlab
% Section2_7_13.m
% Solution for part (a)
p = permMatrices(3);
n = size(p, 3);    % number of permutation matrices
v = zeros(n, 1);   % indicator vector, one entry per permutation matrix
% find permutation matrices, other than the identity, whose 3rd power is the identity
for i = 1:n
  if p(:,:,i)^3 == eye(3)
    v(i,1) = 1;
  end
end
v(1,1) = 0;        % exclude the identity matrix
ans1 = p(:,:,v == 1)

% Solution for part (b)
P = permMatrices(4);
m = size(P, 3);    % number of permutation matrices
t = zeros(m, 1);   % indicator vector, one entry per permutation matrix
% flag permutation matrices whose 4th power equals the identity
for i = 1:m
  if P(:,:,i)^4 == eye(4)
    t(i,1) = 1;
  end
end
% print the permutation matrices whose 4th power is NOT the identity
ans2 = P(:,:,t == 0)
```

Output: