# Data Manipulation in R with dplyr – Part 3

This happens to be my 50th blog post – and my blog is 8 months old.

🙂

This post is the third and last post in in a series of posts (Part 1Part 2) on data manipulation with dlpyr. Note that the objects in the code may have been defined in earlier posts and the code in this post is in continuation with code from the earlier posts.

Although datasets can be manipulated in sophisticated ways by linking the 5 verbs of dplyr in conjunction, linking verbs together can be a bit verbose.

Creating multiple objects, especially when working on a large dataset can slow you down in your analysis. Chaining functions directly together into one line of code is difficult to read. This is sometimes called the Dagwood sandwich problem: you have too much filling (too many long arguments) between your slices of bread (parentheses). Functions and arguments get further and further apart.

The %>% operator allows you to extract the first argument of a function from the arguments list and put it in front of it, thus solving the Dagwood sandwich problem.

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 # %>% OPERATOR ---------------------------------------------------------------------- # with %>% operator hflights %>% mutate(diff = TaxiOut - TaxiIn) %>% filter(!is.na(diff)) %>% summarise(avg = mean(diff)) # without %>% operator # arguments get further and further apart summarize(filter(mutate(hflights, diff = TaxiOut - TaxiIn),!is.na(diff)), avg = mean(diff)) # with %>% operator d <- hflights %>% select(Dest, UniqueCarrier, Distance, ActualElapsedTime) %>% mutate(RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60) # without %>% operator d <- mutate(select(hflights, Dest, UniqueCarrier, Distance, ActualElapsedTime), RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60) # Filter and summarise d d %>% filter(!is.na(mph), mph < 70) %>% summarise(n_less = n(), n_dest = n_distinct(Dest), min_dist = min(Distance), max_dist = max(Distance)) # Let's define preferable flights as flights that are 150% faster than driving, # i.e. that travel 105 mph or greater in real time. Also, assume that cancelled or # diverted flights are less preferable than driving. # ADVANCED PIPING EXERCISES # Use one single piped call to print a summary with the following variables: # n_non - the number of non-preferable flights in hflights, # p_non - the percentage of non-preferable flights in hflights, # n_dest - the number of destinations that non-preferable flights traveled to, # min_dist - the minimum distance that non-preferable flights traveled, # max_dist - the maximum distance that non-preferable flights traveled hflights %>% mutate(RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60) %>% filter(mph < 105 | Cancelled == 1 | Diverted == 1) %>% summarise(n_non = n(), p_non = 100*n_non/nrow(hflights), n_dest = n_distinct(Dest), min_dist = min(Distance), max_dist = max(Distance)) # Use summarise() to create a summary of hflights with a single variable, n, # that counts the number of overnight flights. These flights have an arrival # time that is earlier than their departure time. Only include flights that have # no NA values for both DepTime and ArrTime in your count. hflights %>% mutate(overnight = (ArrTime < DepTime)) %>% filter(overnight == TRUE) %>% summarise(n = n())

group_by()

group_by() defines groups within a data set. Its influence becomes clear when calling summarise() on a grouped dataset. Summarizing statistics are calculated for the different groups separately.

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 # group_by() ------------------------------------------------------------------------- # Generate a per-carrier summary of hflights with the following variables: n_flights, # the number of flights flown by the carrier; n_canc, the number of cancelled flights; # p_canc, the percentage of cancelled flights; avg_delay, the average arrival delay of # flights whose delay does not equal NA. Next, order the carriers in the summary from # low to high by their average arrival delay. Use percentage of flights cancelled to # break any ties. Which airline scores best based on these statistics? hflights %>% group_by(UniqueCarrier) %>% summarise(n_flights = n(), n_canc = sum(Cancelled), p_canc = 100*n_canc/n_flights, avg_delay = mean(ArrDelay, na.rm = TRUE)) %>% arrange(avg_delay) # Generate a per-day-of-week summary of hflights with the variable avg_taxi, # the average total taxiing time. Pipe this summary into an arrange() call such # that the day with the highest avg_taxi comes first. hflights %>% group_by(DayOfWeek) %>% summarize(avg_taxi = mean(TaxiIn + TaxiOut, na.rm = TRUE)) %>% arrange(desc(avg_taxi))
view raw group_by.R hosted with ❤ by GitHub

Combine group_by with mutate

group_by() can also be combined with mutate(). When you mutate grouped data, mutate() will calculate the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, that calculates within group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.

rank(c(21, 22, 24, 23))

has output

[1] 1 2 4 3

As with arrange(), rank() ranks values from the largest to the smallest and this behaviour can be reversed with the desc() function.

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters