Data Manipulation in R with dplyr – Part 2

Note that this post continues Part 1 of this series on data manipulation with dplyr in R. The code in this post carries forward the variables and objects defined in Part 1.

In the previous post, I talked about how dplyr provides a grammar of sorts to manipulate data, and consists of 5 verbs to do so:

The 5 verbs of dplyr
select – removes columns from a dataset
filter – removes rows from a dataset
arrange – reorders rows in a dataset
mutate – uses the data to build new columns and values
summarize – calculates summary statistics

I went on to discuss examples using select() and mutate(). Let’s now talk about filter(). R comes with a set of logical operators that you can use inside filter(). These operators are:
x < y, TRUE if x is less than y
x <= y, TRUE if x is less than or equal to y
x == y, TRUE if x equals y
x != y, TRUE if x does not equal y
x >= y, TRUE if x is greater than or equal to y
x > y, TRUE if x is greater than y
x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)

The following call, for example, filters df so that only the observations where the variable a is greater than the variable b are kept:
filter(df, a > b)
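To see this on actual data, here is a minimal sketch. The data frame df and its columns a and b are made up for illustration and are not the objects from Part 1.

```r
library(dplyr)

# Hypothetical toy data; the names df, a and b only mirror the call above
df <- data.frame(a = c(1, 5, 3, 8), b = c(2, 4, 3, 6))

# Keep only the rows where a is strictly greater than b
bigger <- filter(df, a > b)
print(bigger)

# %in% keeps the rows whose value appears in the given vector
in_set <- filter(df, a %in% c(1, 8))
print(in_set)
```

Note that the row where a equals b is dropped by a > b; use a >= b to keep it.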

Combining tests using boolean operators
R also comes with a set of boolean operators that you can use to combine multiple logical tests into a single test. These include & (and), | (or), and ! (not). Instead of using the & operator, you can also pass several logical tests to filter(), separated by commas. The following calls are equivalent:

filter(df, a > b & c > d)
filter(df, a > b, c > d)
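A quick way to convince yourself of the equivalence is to run both forms on a small data frame. This is only a sketch; df and its columns a, b, c and d are illustrative values I made up.

```r
library(dplyr)

# Hypothetical toy data with the four columns used in the calls above
df <- data.frame(
  a = c(1, 5, 3, 8),
  b = c(2, 4, 3, 6),
  c = c(9, 1, 7, 7),
  d = c(0, 2, 8, 3)
)

# Both forms keep the rows where BOTH tests are TRUE
with_and   <- filter(df, a > b & c > d)
with_comma <- filter(df, a > b, c > d)
print(all.equal(with_and, with_comma))

# | keeps the rows where at least one test is TRUE
either <- filter(df, a > b | c > d)

# ! negates a test: here, the rows where neither test is TRUE
neither <- filter(df, !(a > b | c > d))
```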

The is.na() function will also come in handy very often. This expression, for example, keeps the observations in df for which the variable x is not NA:

filter(df, !is.na(x))
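As a small sketch of how missing values behave inside filter() (the data frame below is made up for illustration):

```r
library(dplyr)

# Hypothetical toy data with some missing values in x
df <- data.frame(x = c(1, NA, 3, NA), y = c("a", "b", "c", "d"))

# Keep only the rows where x is present
complete_x <- filter(df, !is.na(x))
print(complete_x)

# A comparison with NA yields NA, and filter() drops rows whose
# condition is NA, so the missing rows disappear here as well
positive_x <- filter(df, x > 0)
print(positive_x)
```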

A recap on select(), mutate() and filter():

Arranging Data
arrange() can be used to rearrange rows according to any type of data. If you pass arrange() a character variable, R will rearrange the rows in alphabetical order according to values of the variable. If you pass a factor variable, R will rearrange the rows according to the order of the levels in your factor (running levels() on the variable reveals this order).

By default, arrange() arranges the rows from smallest to largest. Rows with the smallest value of the variable will appear at the top of the data set. You can reverse this behaviour with the desc() function: arrange() will reorder the rows from largest to smallest values of a variable if you wrap the variable name in desc() before passing it to arrange().
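The three cases described above (character, numeric with desc(), and factor) can be sketched as follows. The data frame and its column names are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data: a character, a numeric and a factor column
df <- data.frame(
  name  = c("carol", "alice", "bob"),
  score = c(72, 95, 88),
  grade = factor(c("low", "high", "mid"), levels = c("low", "mid", "high")),
  stringsAsFactors = FALSE
)

# Character variable: alphabetical order
by_name <- arrange(df, name)

# Numeric variable wrapped in desc(): largest to smallest
by_score_desc <- arrange(df, desc(score))

# Factor variable: order of levels(df$grade), not alphabetical order
by_grade <- arrange(df, grade)

print(by_name)
print(by_score_desc)
print(by_grade)
```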

Summarizing Data

summarise(), the last of the 5 verbs, follows the same syntax as mutate(). The difference is in the result: mutate() adds an entire new column to the dataset, whereas summarise() collapses the dataset into a single row.

In contrast to the four other data manipulation functions, summarise() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.

Note: summarise() and summarize() both work the same!
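Here is a minimal sketch of that behaviour. The data frame and the column name delay are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: four delays in minutes
df <- data.frame(delay = c(10, -5, 30, 15))

# Each argument becomes one column of the single-row result
stats <- summarise(df,
  shortest = min(delay),
  average  = mean(delay),
  rows     = n()
)
print(stats)

# The American spelling is an alias and gives the same result
same <- summarize(df,
  shortest = min(delay),
  average  = mean(delay),
  rows     = n()
)
```

The original df is untouched; stats is a new one-row data frame containing only the summary columns.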

You can use any function you like in summarise(), so long as the function can take a vector of data and return a single number. R contains many aggregating functions. Here are some of the most useful:

min(x) – minimum value of vector x.
max(x) – maximum value of vector x.
mean(x) – mean value of vector x.
median(x) – median value of vector x.
quantile(x, p) – pth quantile of vector x.
sd(x) – standard deviation of vector x.
var(x) – variance of vector x.
IQR(x) – Inter Quartile Range (IQR) of vector x.
diff(range(x)) – total range of vector x.
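Several of these can be combined in one summarise() call. The following is only a sketch on made-up data; the column name value is an assumption.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(value = c(2, 4, 4, 6, 9))

spread <- summarise(df,
  lowest     = min(value),
  highest    = max(value),
  middle     = median(value),
  p90        = quantile(value, 0.9),  # 90th percentile
  std_dev    = sd(value),
  variance   = var(value),
  iqr        = IQR(value),
  full_range = diff(range(value))     # max minus min
)
print(spread)
```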

dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:

first(x) – The first element of vector x.
last(x) – The last element of vector x.
nth(x, n) – The nth element of vector x.
n() – The number of rows in the data.frame or group of observations that summarise() describes.
n_distinct(x) – The number of unique values in vector x.
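A short sketch of the dplyr aggregate functions listed above, again on a made-up data frame:

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(x = c("b", "a", "c", "a"), stringsAsFactors = FALSE)

picks <- summarise(df,
  first_val  = first(x),      # "b"
  last_val   = last(x),       # "a"
  second_val = nth(x, 2),     # "a"
  rows       = n(),           # 4 rows in total
  distinct   = n_distinct(x)  # 3 unique values: a, b, c
)
print(picks)
```

Note that n() takes no arguments and only works inside dplyr verbs such as summarise().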

That’s it for Part 2 of this series of posts on data manipulation with dplyr. Part 3 will focus on the pipe operator, group_by() and working with databases.

Data Manipulation in R with dplyr – Part 1

dplyr is one of the packages in R that makes R so loved by data scientists. It has three main goals:

  1. Identify the most important data manipulation tools needed for data analysis and make them easy to use in R.
  2. Provide blazing fast performance for in-memory data by writing key pieces of code in C++.
  3. Use the same code interface to work with data no matter where it’s stored, whether in a data frame, a data table or database.

Introduction to the dplyr package and the tbl class
This post is mostly about code. If you’re interested in learning dplyr I recommend you type in the commands line by line on the R console to see first hand what’s happening.

Select and mutate
Apart from providing a data structure, dplyr provides a grammar for data manipulation. The grammar is built around 5 functions (also referred to as verbs) that do the basic tasks of data manipulation.

The 5 verbs of dplyr
select – removes columns from a dataset
filter – removes rows from a dataset
arrange – reorders rows in a dataset
mutate – uses the data to build new columns and values
summarize – calculates summary statistics

dplyr functions do not change the dataset. They return a new copy of the dataset to use.
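A minimal sketch of what this means in practice (the data frame is made up for illustration):

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(a = 1:3, b = 4:6)

# select() returns a new data frame; df itself is left as it was
smaller <- select(df, a)

print(names(df))       # df still has both columns
print(names(smaller))  # only the selected column

# To keep a result, assign it to a name (a new one, or df itself)
df_a_only <- select(df, a)
```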

To answer the simple question of whether flight delays tend to shrink or grow during a flight, we can safely discard a lot of the variables of each flight. To select only the ones that matter, we can use select().

dplyr comes with a set of helper functions that can help you select variables. These functions find groups of variables to select, based on their names. Each of these works only when used inside of select().

  • starts_with("X"): every name that starts with "X"
  • ends_with("X"): every name that ends with "X"
  • contains("X"): every name that contains "X"
  • matches("X"): every name that matches "X", where "X" can be a regular expression
  • num_range("x", 1:5): the variables named x1, x2, x3, x4 and x5 (add width = 2 to match x01, x02, x03, x04 and x05)
  • one_of(x): every name that appears in x, which should be a character vector
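The helpers above can be tried out on a small data frame. This is only a sketch: the column names below are hypothetical and are not the hflights variable names.

```r
library(dplyr)

# Hypothetical one-row data frame; only the column names matter here
df <- data.frame(
  x1 = 1, x2 = 2, x3 = 3,
  dep_delay = 4, arr_delay = 5,
  carrier = "AA",
  stringsAsFactors = FALSE
)

starts  <- names(select(df, starts_with("x")))           # x1, x2, x3
ends    <- names(select(df, ends_with("delay")))         # dep_delay, arr_delay
has_dep <- names(select(df, contains("dep")))            # dep_delay
regex   <- names(select(df, matches("^x[12]$")))         # x1, x2
ranged  <- names(select(df, num_range("x", 2:3)))        # x2, x3
listed  <- names(select(df, one_of(c("carrier", "x1")))) # carrier and x1

print(list(starts, ends, has_dep, regex, ranged, listed))
```

Newer versions of dplyr also offer all_of() and any_of() for selecting from a character vector; one_of() is used here because it is the helper this post describes.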

In order to appreciate the usefulness of dplyr, here are some comparisons between base R and dplyr
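As one possible illustration (a sketch on a made-up data frame, not the comparison from the original post), here is the same column and row selection written both ways:

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(a = c(1, 5, 3), b = c(2, 4, 3), z = c("p", "q", "r"),
                 stringsAsFactors = FALSE)

# Picking columns: base R needs quoted names, dplyr takes bare names
base_cols  <- df[, c("a", "b")]
dplyr_cols <- select(df, a, b)

# Picking rows: base R repeats df$ for every variable, dplyr does not
base_rows  <- df[df$a > df$b, ]
dplyr_rows <- filter(df, a > b)

print(dplyr_cols)
print(dplyr_rows)
```

The results hold the same values. One visible difference is that base R keeps the original row names when subsetting rows.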

mutate() is the second of the five data manipulation functions. mutate() creates new columns which are added to a copy of the dataset.

So far we have added variables to hflights one at a time, but we can also use mutate() to add multiple variables at once.
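A minimal sketch of adding several variables in one call. To keep it self-contained it uses a tiny made-up data frame rather than hflights, and the column names dep_delay and arr_delay are assumptions for illustration.

```r
library(dplyr)

# Hypothetical stand-in for a flights dataset (delays in minutes)
flights <- data.frame(dep_delay = c(5, 20, -3), arr_delay = c(2, 35, -10))

flights2 <- mutate(flights,
  gained      = dep_delay - arr_delay,  # time made up in the air
  total_delay = dep_delay + arr_delay,
  made_up     = gained > 0              # can refer to gained, defined just above
)
print(flights2)
```

Columns created earlier in the same mutate() call can be used by later expressions, which is why made_up can be computed from gained. The original flights data frame still has only its two columns.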

Statistical Learning – 2016

On January 12, 2016, Stanford University professors Trevor Hastie and Rob Tibshirani will offer the third iteration of Statistical Learning, a MOOC which began in January 2014 and has become quite a popular course among data scientists. It is a great place to learn statistical learning (machine learning) methods using the R programming language. For a quick course on R, check this out – Introduction to R Programming

Slides and videos for the Statistical Learning MOOC by Hastie and Tibshirani are available separately here. Slides and video tutorials related to this book by Abass Al Sharif can be downloaded here.

The course covers the following book which is available for free as a PDF copy.

Logistics and Effort:

[Image: course logistics and effort]

Rough Outline of Schedule (based on last year’s course offering):

Week 1: Introduction and Overview of Statistical Learning (Chapters 1-2)
Week 2: Linear Regression (Chapter 3)
Week 3: Classification (Chapter 4)
Week 4: Resampling Methods (Chapter 5)
Week 5: Linear Model Selection and Regularization (Chapter 6)
Week 6: Moving Beyond Linearity (Chapter 7)
Week 7: Tree-based Methods (Chapter 8)
Week 8: Support Vector Machines (Chapter 9)
Week 9: Unsupervised Learning (Chapter 10)

Prerequisites: First courses in statistics, linear algebra, and computing.

 

Troubleshooting ‘Rattle’ (R library) Installation on Ubuntu

This post pertains to Ubuntu / Debian users only.

rattle is a free graphical interface for data mining with R. I wanted to visualize decision trees and had to install this library. Running
> install.packages('rattle')
got me the following error message:

configure: error: GTK version 2.8.0 required
ERROR: configuration failed for package ‘RGtk2’

This error occurs when attempting to install the RGtk2 package. The install is looking for the header files for GTK, which are most likely not installed yet. Luckily the problem can be solved quite easily. Open Terminal (Ctrl + Alt + T) and type in the following commands (if you have wajig installed, wajig install libgtk2.0-dev does the same job as the second command):

sudo apt-get update
sudo apt-get install libgtk2.0-dev

Go back and try installing rattle now with the same command as earlier. It should work. It did for me! As you can see below, decision trees are visualized a lot better with rattle than with rpart alone.

[Image: decision tree visualized with rattle]

Getting Started with R on MIT’s 14.74x (Foundations of Development Policy)

I noticed that a major grievance of many students enrolled in MIT’s latest edX course on development policy (Foundations of Development Policy: Advanced Development Economics) was that not enough was done to get them going with the R assignments. I have posted the R code for the homework of the first 2 weeks (past the deadline, of course), so that others can get the hang of the level of R that might be needed to solve these assignments in the following weeks. I’m willing to help out those who need help getting up to speed with the R required for this course. For specific queries, leave your message in the comments section.

A great place to spend time learning R before taking Foundations of Development Policy (14.74x) would be another edX course that’s been getting great reviews recently: Introduction to R Programming

R Code for Home Work (Week 1)

R Code for Home Work (Week 2)

I hope this helps!

R — The Big Mover in IEEE Spectrum’s 2015 Rankings for Top 10 Programming Languages

The column on the left is the 2015 ranking; the column on the right is the 2014 ranking for comparison:

[Image: top programming language rankings, 2015 vs. 2014]

source: The 2015 Top Ten Programming Languages

The thing to note is that the top 5 languages haven’t budged from their positions. R has pushed past PHP, JavaScript and Ruby, which have maintained their relative positions. So this year’s rankings have been about R moving forward.

Skillset Necessary for Data Science

I came across this truly amazing visualization of what it takes to foray into data science, by @kzawadz (MarketingDistillery.com), via Twitter.

[Image: data science skillset visualization]