Note that this post is in continuation with Part 1 of this series of posts on data manipulation with *dplyr* in R. The code in this post carries forward from the variables / objects defined in Part 1.

In the previous post, I talked about how *dplyr* provides a grammar of sorts to manipulate data, and consists of 5 verbs to do so:

The 5 verbs of *dplyr*

**select –** removes columns from a dataset

**filter –** removes rows from a dataset

**arrange –** reorders rows in a dataset

**mutate –** uses the data to build new columns and values

**summarize –** calculates summary statistics

I went on to discuss examples using *select()* and *mutate(). *Let’s now talk about *filter(). *R comes with a set of logical operators that you can use inside *filter()*. These operators are:

*x < y,* *TRUE* if *x* is less than *y*

*x <= y*, *TRUE* if* x* is less than or equal to *y*

*x == y*, *TRUE* if* x* equals *y*

*x != y*, *TRUE* if *x* does not equal *y*

*x >= y*, *TRUE* if *x* is greater than or equal to *y*

*x > y*, *TRUE* if *x* is greater than* y*

*x %in% c(a, b, c)*, *TRUE* if *x* is in the vector *c(a, b, c)*

The following call, for example, filters *df* such that only the observations where the variable *a* is greater than the variable *b*:

*filter(df, a > b)*

# Print out all flights in hflights that traveled 3000 or more miles | |

filter(hflights, Distance > 3000) | |

# All flights flown by one of JetBlue, Southwest, or Delta | |

filter(hflights, UniqueCarrier %in% c('JetBlue', 'Southwest', 'Delta')) | |

# All flights where taxiing took longer than flying | |

filter(hflights, TaxiIn + TaxiOut > AirTime) |

**Combining tests using boolean operators**

R also comes with a set of boolean operators that you can use to combine multiple logical tests into a single test. These include *&* (and), *|* (or), and *!* (not). Instead of using the & operator, you can also pass several logical tests to *filter()*, separated by commas. The following calls equivalent:

*filter(df, a > b & c > d)*

*filter(df, a > b, c > d)*

The *is.na()* will also come in handy very often. This expression, for example, keeps the observations in *df* for which the variable *x* is not *NA*:

*filter(df, !is.na(x))*

# Combining tests using boolean operators | |

# All flights that departed before 5am or arrived after 10pm | |

filter(hflights, DepTime < 500 | ArrTime > 2200 ) | |

# All flights that departed late but arrived ahead of schedule | |

filter(hflights, DepDelay > 0 & ArrDelay < 0) | |

# All cancelled weekend flights | |

filter(hflights, DayOfWeek %in% c(6,7) & Cancelled == 1) | |

# All flights that were cancelled after being delayed | |

filter(hflights, Cancelled == 1, DepDelay > 0) |

A recap on *select()*, *mutate()* and *filter()*:

# Summarizing Exercise | |

# Select the flights that had JFK as their destination: c1 | |

c1 <- filter(hflights, Dest == 'JFK') | |

# Combine the Year, Month and DayofMonth variables to create a Date column: c2 | |

c2 <- mutate(c1, Date = paste(Year, Month, DayofMonth, sep = "-")) | |

# Print out a selection of columns of c2 | |

select(c2, Date, DepTime, ArrTime, TailNum) | |

# How many weekend flights flew a distance of more than 1000 miles | |

# but had a total taxiing time below 15 minutes? | |

nrow(filter(hflights, DayOfWeek %in% c(6,7), Distance > 1000, TaxiIn + TaxiOut < 15)) |

**Arranging Data**

*arrange()* can be used to rearrange rows according to any type of data. If you pass *arrange()* a character variable, R will rearrange the rows in alphabetical order according to values of the variable. If you pass a factor variable, R will rearrange the rows according to the order of the levels in your factor (running *levels()* on the variable reveals this order).

By default, *arrange()* arranges the rows from smallest to largest. Rows with the smallest value of the variable will appear at the top of the data set. You can reverse this behaviour with the *desc()* function. *arrange()* will reorder the rows from largest to smallest values of a variable if you wrap the variable name in *desc()* before passing it to *arrange()*

# Definition of dtc | |

dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay)) | |

# Arrange dtc by departure delays | |

arrange(dtc, DepDelay) | |

# Arrange dtc so that cancellation reasons are grouped | |

arrange(dtc, CancellationCode) | |

# Arrange dtc according to carrier and departure delays | |

arrange(dtc, UniqueCarrier, DepDelay) | |

# Arrange according to carrier and decreasing departure delays | |

arrange(hflights, UniqueCarrier, desc(DepDelay)) | |

# Arrange flights by total delay (normal order). | |

arrange(hflights, DepDelay + ArrDelay) | |

# Keep flights leaving to DFW before 8am and arrange according to decreasing AirTime | |

arrange(filter(hflights, Dest == 'DFW', DepTime < 800), desc(AirTime)) |

**Summarizing Data**

*summarise()*, the last of the 5 verbs, follows the same syntax as *mutate()*, but the resulting dataset consists of a single row instead of an entire new column in the case of *mutate()*.

In contrast to the four other data manipulation functions, *summarise()* does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.

**Note:** *summarise()* and *summarize()* both work the same!

You can use any function you like in *summarise()*, so long as the function can take a vector of data and return a single number. R contains many aggregating functions. Here are some of the most useful:

* min(x)* – minimum value of vector* x*.

* max(x)* – maximum value of vector *x*.

*mean(x)* – mean value of vector *x*.

*median(x)* – median value of vector* x*.

*quantile(x, p)* – pth quantile of vector *x*.

* sd(x)* – standard deviation of vector *x*.

*var(x)* – variance of vector *x*.

*IQR(x)* – Inter Quartile Range (IQR) of vector *x.*

*diff(range(x))* – total range of vector* x*.

# Print out a summary with variables min_dist and max_dist | |

summarize(hflights, min_dist = min(Distance), max_dist = max(Distance)) | |

# Print out a summary with variable max_div | |

summarize(filter(hflights, Diverted == 1), max_div = max(Distance)) | |

# Remove rows that have NA ArrDelay: temp1 | |

temp1 <- filter(hflights, !is.na(ArrDelay)) | |

# Generate summary about ArrDelay column of temp1 | |

summarise(temp1, earliest = min(ArrDelay), average = mean(ArrDelay), | |

latest = max(ArrDelay), sd = sd(ArrDelay)) | |

# Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2 | |

temp2 <- filter(hflights, !is.na(TaxiIn), !is.na(TaxiOut)) | |

# Print the maximum taxiing difference of temp2 with summarise() | |

summarise(temp2, max_taxi_diff = max(abs(TaxiIn - TaxiOut))) |

*dplyr* provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:

*first(x)* – The first element of vector x.

*last(x)* – The last element of vector x.

*nth(x, n)* – The nth element of vector x.

*n()* – The number of rows in the data.frame or group of observations that *summarise()* describes.

*n_distinct(x)* – The number of unique values in vector x

# Generate summarizing statistics for hflights | |

summarise(hflights, n_obs = n(), n_carrier = n_distinct(UniqueCarrier), | |

n_dest = n_distinct(Dest), dest100 = nth(Dest, 100)) | |

# Filter hflights to keep all American Airline flights: aa | |

aa <- filter(hflights, UniqueCarrier == "American") | |

# Generate summarizing statistics for aa | |

summarise(aa, n_flights = n(), n_canc = sum(Cancelled), | |

p_canc = 100*(n_canc/n_flights), avg_delay = mean(ArrDelay, na.rm = TRUE)) |

This would be it for Part-2 of this series of posts on data manipulation with *dplyr*. **Part 3** would focus on the **pipe operator**, **Group_by** and working with databases.